freeCodeCamp scraper 2024: what's next #34

Open
benoit74 opened this issue Dec 5, 2024 · 8 comments

@benoit74
Collaborator

benoit74 commented Dec 5, 2024

We have been sponsored to move freeCodeCamp scraper to the next level.

We now have to define clearly what this next level is going to be, and especially the strategy and the roadmap.

I see two alternative strategies to investigate:

  1. continue to evolve the scraper with the current architecture:
  • pros: since we recode the whole UI, it is bound to work (we can do whatever we want) and we are independent of freeCodeCamp changes
  • cons: we have to recode the whole UI (significant work) and especially the exercise evaluation (significant work as well); any freeCodeCamp UI change will have to be considered for "back-porting" in the scraper; should freeCodeCamp change its data format, our scraper will break
  2. since the freeCodeCamp UI is built with Gatsby, which renders a static website, it is maybe possible to reuse it
  • pros: we adhere to the freeCodeCamp UI, we have less code to write and maintain, all spoken languages are automatically supported, one ZIM for all spoken and programming languages is probably sufficient
  • cons: we might have to patch code significantly to deactivate features which are bound to fail when offline (but it could be an opportunity to contribute them upstream); if freeCodeCamp abandons its current architecture, the scraper will stop working

I will start by investigating strategy 2 a bit to assess its feasibility

@benoit74 benoit74 added the task Something to track tasks that have to be done label Dec 5, 2024
@benoit74
Collaborator Author

benoit74 commented Dec 10, 2024

Assessment of strategy 2 is complete:

  • Gatsby is not easy to put inside a ZIM because it assumes that the path under which the static website is served is known at build time, and this path is pushed "everywhere"
  • We hence need HTML/JS/CSS rewriting to make all this work properly
  • We also need wombat because many paths are computed by JS code
  • Putting this all together and creating a ZIM with this approach makes it "work a bit", but some nasty bugs show up which seem hard to solve (not to mention some problems I had to patch manually)
  • See https://github.com/openzim/freecodecamp/blob/wombat_poc/scraper/src/fcc2zim/create_zim.md for more details on this PoC
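To illustrate the kind of HTML rewriting meant here, below is a minimal sketch and not the PoC's actual code. It turns root-absolute `href`/`src` values into paths relative to the current page, so the site no longer depends on the prefix it is served under. The function name `relativize` and the regular expression are assumptions made for this example. Paths computed at runtime by JS are out of its reach, which is where wombat comes in.

```python
import posixpath
import re

# Hypothetical helper for illustration only; not taken from the PoC.
# Matches href="/..." or src="/..." but skips protocol-relative URLs ("//host/...").
ROOT_ABSOLUTE_ATTR = re.compile(r'(href|src)="(/[^"/][^"]*)"')


def relativize(html: str, page_path: str) -> str:
    """Rewrite root-absolute href/src values relative to the page at page_path."""
    page_dir = posixpath.dirname(page_path) or "."

    def _rewrite(match: re.Match) -> str:
        attr, target = match.group(1), match.group(2).lstrip("/")
        return f'{attr}="{posixpath.relpath(target, start=page_dir)}"'

    return ROOT_ABSOLUTE_ATTR.sub(_rewrite, html)
```

For a page stored at `learn/python/index.html`, a link to `/learn/index.html` becomes `../index.html`, which resolves correctly whatever prefix the ZIM reader serves the content under.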

Strategy 2 is hence promising (we have a mostly working UI), but the last mile seems to have significant roadblocks to overcome.

Strategy 2 is also very hard to industrialize: we need to build the Gatsby website with a Docker container and then extract files from the container to feed them to the rewriting logic; we will also probably need to patch code (before static website generation), and patching is never pleasant to industrialize / maintain in the long run.

I'm now continuing with the assessment of strategy 1, and more precisely defining which tasks can reasonably be undertaken during the 2024/2025 project, and which ZIMs we will get out of it.

@benoit74
Collaborator Author

Currently, there are 21 curricula on https://www.freecodecamp.org/learn

Some (6) are not going to happen inside a ZIM (at least not without massive investments):

  • relational-database, back-end-development-and-apis, quality-assurance, information-security rely on Gitpod for assignments
  • college-algebra-with-python relies on Google Colaboratory
  • foundational-c-sharp-with-microsoft relies on learn.microsoft.com

Some (10) are going to be tough:

  • data-visualization, front-end-development-libraries, responsive-web-design, 2022/responsive-web-design, javascript-algorithms-and-data-structures-v8 need advanced UIs with HTML/JS/CSS code preview in an iframe
  • scientific-computing-with-python needs a Python interpreter (+ new UI)
  • python-for-everybody and machine-learning-with-python need videos (big ZIMs) + multiple choice (and we probably need to exclude machine-learning-with-python projects which are done on Gitpod)
  • data-analysis-with-python is divided in 3 parts:
    • part 1, Data Analysis with Python, assumes you will run assignments on Google Colaboratory or notebooks.ai => not really feasible
    • part 2, Numpy, is videos + multiple choice, so as feasible as the point above
    • part 3 is projects to run on Gitpod.io => not really feasible
  • a2-english-for-developers needs a text-to-speech module + "nice" graphical animation + multiple choice

Some (2) seem feasible:

  • the-odin-project needs a UI based on multiple choice
  • coding-interview-prep is already supposed to work for non-project assignments (~75% of the course) and backend projects; frontend projects probably have to be excluded somehow because they rely on codepen.io to provide a basic structure and submit assignments

Some (3) have to be done:

  • project-euler and rosetta-code are already supposed to "work"
  • javascript-algorithms-and-data-structures is the v1 project, but it is incomplete: it misses the ES6 part (but this part has one challenge which is not working); we should probably add what can be added
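As a quick cross-check of the grouping above, the four groups can be written down as data. This is only a bookkeeping sketch: the group keys are made-up labels, and the curriculum names are copied from the lists above.

```python
# Group keys are illustrative labels, not identifiers from the scraper.
CURRICULA_BY_FEASIBILITY = {
    "not_happening": [
        "relational-database",
        "back-end-development-and-apis",
        "quality-assurance",
        "information-security",
        "college-algebra-with-python",
        "foundational-c-sharp-with-microsoft",
    ],
    "tough": [
        "data-visualization",
        "front-end-development-libraries",
        "responsive-web-design",
        "2022/responsive-web-design",
        "javascript-algorithms-and-data-structures-v8",
        "scientific-computing-with-python",
        "python-for-everybody",
        "machine-learning-with-python",
        "data-analysis-with-python",
        "a2-english-for-developers",
    ],
    "feasible": ["the-odin-project", "coding-interview-prep"],
    "to_do": [
        "project-euler",
        "rosetta-code",
        "javascript-algorithms-and-data-structures",
        "coding-interview-prep-placeholder",
    ][:3],  # only the three curricula listed under "have to be done"
}

# Per-group sizes and the grand total announced at the top of the comment.
group_sizes = {name: len(items) for name, items in CURRICULA_BY_FEASIBILITY.items()}
total_curricula = sum(group_sizes.values())
```

The group sizes come out as 6, 10, 2 and 3, which adds up to the 21 curricula listed on the site.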

I will probably continue to dive in this direction with a list of issues grouped in milestones

@benoit74
Collaborator Author

benoit74 commented Dec 10, 2024

For now I've built up 3 milestones:

  • 1.2.0 will fix all known issues without touching the (ugly) UI issues; this includes being able to generate ZIMs in all spoken languages supported by FCC
    • once done, we should have 4 curricula (project-euler, rosetta-code, javascript-algorithms-and-data-structures, coding-interview-prep) * 11 languages = 44 ZIMs
  • 2.0.0 will update the UI to reach something "acceptable" (i.e. close to what was done on the youtube revamp)
    • no new ZIMs here, but much nicer ones
  • 2.1.0 will add support for multiple-choice challenges
    • once done, we should have one more curriculum, the-odin-project (11 more ZIMs)
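The ZIM counts per milestone follow from simple arithmetic (curricula times spoken languages). A small sketch of that computation, assuming one ZIM per curriculum and per language; the dictionary and function names are made up for this example:

```python
SPOKEN_LANGUAGES = 11  # count taken from the milestones above; the language list itself is not given

# Curricula newly unlocked by each milestone, in release order.
NEW_CURRICULA = {
    "1.2.0": [
        "project-euler",
        "rosetta-code",
        "javascript-algorithms-and-data-structures",
        "coding-interview-prep",
    ],
    "2.0.0": [],  # UI revamp only, no new curriculum
    "2.1.0": ["the-odin-project"],
}


def cumulative_zim_counts(new_curricula: dict, languages: int) -> dict:
    """Return the total number of ZIMs available after each milestone."""
    totals, running = {}, 0
    for milestone, curricula in new_curricula.items():
        running += len(curricula) * languages
        totals[milestone] = running
    return totals


zim_counts = cumulative_zim_counts(NEW_CURRICULA, SPOKEN_LANGUAGES)
```

This gives 44 ZIMs after 1.2.0, still 44 after 2.0.0, and 55 after 2.1.0.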

I'm afraid this (especially 2.0.0) will already consume significant time. I'm not even totally convinced that 2.1.0 will make it (especially since it opens the door to only one more curriculum, which is not a big added value). This is also the reason why I prefer to push the UI work to 2.0.0: rather than starting with a "big rewrite", I prefer to keep getting hands-on with the "limited" fixes of 1.2.0, which also open the door to much added value.

After that, we have the following options (sorted by my personal order of preference):

  • work on challenge types relying on a mix of videos (which are "just Youtube videos to download") + multiple-choice questions (which will already work since 2.1.0)
    • open the door to python-for-everybody, most of machine-learning-with-python and Numpy (single course of data-analysis-with-python)
  • work on advanced UI with HTML/JS/CSS preview
    • open the door to the huge collection of data-visualization, front-end-development-libraries, responsive-web-design, 2022/responsive-web-design, javascript-algorithms-and-data-structures-v8
  • work on text to speech + animation for "spoken language" courses
    • open the door to a2-english-for-developers ZIMs (and I just saw a b1 curriculum in preparation in the source code)

Given that this path seems not "as horrible as expected" and that strategy 2 is not "as feasible as expected", I propose that strategy 1 is the plan, and that for now we commit to the 1.2.0 and 2.0.0 scopes, with an option for 2.1.0 if 2.0.0 is simpler than expected.

@Popolechien @kelson42 @rgaudin WDYT?

@benoit74
Collaborator Author

For the record, here is the analysis spreadsheet I've made about challenge types usage: https://docs.google.com/spreadsheets/d/11Y5aTF934teHOr8DEPXp9U9FOsQ4MHpxNZ71dXYsO_A/edit?usp=sharing (anyone with the link can read, the whole of Kiwix can write)

@Popolechien

@benoit74 Not sure I understand the HasNoSolution column

@benoit74
Collaborator Author

It is just a property of the challenge type in the code base; I don't understand yet what information we can get from it (maybe it is worthless).

@rgaudin
Member

rgaudin commented Dec 11, 2024

I think we should discuss it live

@benoit74
Collaborator Author

We've discussed it live.

Conclusions are (correct me if I'm wrong):

  • we agree on strategy 1 (not totally discussed, but not really a choice from my PoV either)
  • we agree that the milestones designed so far make sense in terms of progress and size
  • we need to discuss again about (both points being linked together):
    • where we stop for this project (2.0, 2.1, 2.x?)
    • who is in charge of doing this development work (can we externalize this to avoid getting me stuck on this for months?)

So far I have enough info to start working on 1.2 and release this. But it is quite important to clarify the scope / WoW anyway so that we do not get stuck with no plan to continue.

@benoit74 benoit74 removed this from the 1.2.0 milestone Dec 17, 2024
4 participants