OTA engine robustness #1174

Closed · Tracked by #1167
michielbdejong opened this issue Sep 13, 2024 · 18 comments

@michielbdejong (Member)

Initial tests with #1164 show a number of problems:

  • even with git commit-graph, the runs are slow. This may be because the repo starts out shallowed to 5 commits but can build up 6,000 commits in a single run, so it's probably necessary to avoid wildcards in the git log pathspecs - I will look into that today (a sketch of keeping the repo shallow between runs follows this list).
  • the script sometimes trips over itself and fails with errors such as 'git in use' or 'expected only one file to be changed'
  • if git is left with uncommitted changes, the next run will fail too.
  • the script builds up a lot of memory usage (several gigabytes)
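
One mitigation, sketched roughly (paths match the crawler setup shown further down in this issue): re-shallow the repo after each run, so the next run's git log calls only have a handful of commits to walk. This is the same .git/shallow trick used in hourly.sh below.

# re-shallow the versions repo to its last 5 commits after a run,
# so the next run's git log lookups don't walk thousands of commits
cd /home/crawler/engine/data/versions
git rev-parse HEAD~5 > .git/shallow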

To limit the impact of a git repo getting into a bad state, it would help to shard the snapshots and versions repos -> #1172
Apart from sharding the repos, I also want to rethink the scheduling by moving it to a cronjob -> #1173

I think these two measures combined will make the crawler a lot more stable, and also allow us to run it on the same server as the rest of ToS;DR, so it will not only be more reliable, but also cheaper for us to host.

@michielbdejong (Member Author)

I'm having some trouble triggering the recorder: whatever I do, even for a service that was not previously tracked, it says 'Recorded 0 new versions' and 'Recorded 0 new snapshots and 0 new versions'. Investigating...

@michielbdejong (Member Author)

Ah, it was because I was rebased on the crawl-api branch. Fixed now. This is something to be aware of for production, though!

@michielbdejong (Member Author) commented Sep 13, 2024

Before starting the crawl, there is a search:

getLatestSnapshot calling findLatest [
  SourceDocument {
    location: 'https://feelthemusi.com/pages/terms_of_use.html',
    executeClientScripts: undefined,
    contentSelectors: 'body',
    insignificantContentSelectors: undefined,
    filters: undefined,
    content: undefined,
    mimeType: undefined,
    id: 'pages-terms_of_use.html'
  }
] undefined

It makes sense that at this point only the location and the content selectors are known and the content and mimeType are not. But we could add the mimeType to the declaration, and we could even let it default to text/html.

This does change the behaviour of the crawler a bit, because it means a crawl would reasonably have to fail if a PDF is encountered where it was not expected. So that's a downside of option 3.

A 4th option, similar to the 2nd one: we could do an ls before the git log.

And I'm now filling in option 3 ('pass the mimeType as an argument') as getting it from the declaration. But we could also imagine prohibiting git log before fetch.

So to recap the options (adding 6-8 here mainly for completeness; these are probably not better than 4 or 1):

  1. shortlist the 7 or so mimeTypes
  2. do the filtering in JS after the git log
  3. put the mimeType in the declaration
  4. do an ls before the git log (see the sketch after this list)
  5. prohibit git log before fetch, so that the mimeType from the fetch can be used
  6. put the mimeType in a file next to the doc
  7. add a symlink, in the same way Unix commands sometimes use a symlink to point to their latest version
  8. optimistically look for .html and only fall back to .* if the .html lookup fails (this would still allow the slow behaviour in rare cases)
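
To make option 4 concrete, roughly (the repo layout and paths here are made up for illustration): git ls-files only reads the index, so it is cheap, and the subsequent git log can then use an exact path instead of a wildcard pathspec.

# the slow pattern this thread is about: a wildcard pathspec that git log
# has to match against every commit it walks
git log -1 -- 'Some Service/Terms of Use.*'

# option 4: resolve the real filename from the index first, then log only that exact path
file=$(git ls-files -- 'Some Service/Terms of Use.*' | head -n 1)
git log -1 -- "$file"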

@michielbdejong (Member Author)

Option 1 cannot be done in a single git log call, so it becomes very similar to option 8.
The downside of option 2 is that it would replace the if statement that checks for existence with something more complicated - it would require inspecting the response to see whether the files found are actually the ones we were looking for.

So options 1, 8 and 2 are not really attractive. Option 3 seems doable, but I'm now thinking maybe option 5 is also not that bad.

@michielbdejong (Member Author)

OK, I think I found a solution, in two parts:

  1. make the pre-run optional
  2. pass the mimeType from the sourceDocument in the call to getLatestSnapshots

@michielbdejong (Member Author)

In fact, we could take an even bigger step and make the whole snapshot recording optional.
We (ToS;DR, carrying on the history of the Tosback project) care much more about the versions repo than about the snapshots repo.

@michielbdejong (Member Author)

And if we only have a versions repo, the whole wildcards problem probably goes away anyway! :)

@michielbdejong (Member Author)

Even if you do write the snapshots, you could still skip the read-back.

@michielbdejong (Member Author)

I'll work on OpenTermsArchive/engine#1103 first and then see if OpenTermsArchive/engine#1101 is still needed.

@michielbdejong (Member Author)

Option 9: use the filename returned by the call to git ls-files to determine the extension.
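
Roughly, with the same made-up layout as in the option 4 sketch above, the extension can be read off the filename that git ls-files returns:

# the extension found in the snapshots repo tells us which mimeType to expect
file=$(git ls-files -- 'Some Service/Terms of Use.*' | head -n 1)
ext="${file##*.}"   # e.g. html or pdf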

@michielbdejong (Member Author)

I'm now running with OpenTermsArchive/engine#1103 and crawling 12,000 documents still fails. The logs show that it gets from A to S in 3 hours and then just stops. This is my cronjob script; maybe I should add a 2>&1 in there?

crawler@ota:~$ cat hourly.sh 
#!/bin/sh
cd /home/crawler/engine
export NODE_OPTIONS=--max_old_space_size=8000
/home/crawler/.nvm/versions/node/v20.17.0/bin/node /home/crawler/engine/bin/ota.js track --skipPreRun --skipSnapshots > ~/"`date`".log
cd data/versions
git add .
git commit -am"cleanup"
git push
git rev-parse HEAD~5 > .git/shallow
cd ../snapshots
git add .
git commit -am"cleanup"
git push
git rev-parse HEAD~5 > .git/shallow
cd ../..
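
Adding 2>&1 would look like this; it would capture stderr (crash output, Node warnings) in the same dated log file:

# same track line as above, with stderr redirected into the log as well
/home/crawler/.nvm/versions/node/v20.17.0/bin/node /home/crawler/engine/bin/ota.js track --skipPreRun --skipSnapshots > ~/"`date`".log 2>&1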

@michielbdejong (Member Author) commented Sep 16, 2024

This is what a run looks like now, in terms of time, bandwidth, CPU and disk. This run started at A and apparently crashed at R. Memory usage is not shown:
Screenshot 2024-09-16 at 14 02 54

@michielbdejong (Member Author)

I installed the DigitalOcean do-agent, so we can now clearly see the memory leak problem:

Screenshot 2024-09-16 at 15 26 55

@michielbdejong (Member Author)

I'll chop the task into 10 smaller ones. We could also split it into 100 tasks, but that would incur the overhead of starting and stopping the headless browser 100 times and give 100 times the risk of something going wrong with the git push, so I think 10 is a better balance.
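
Roughly what I have in mind, where --batch stands in for whatever mechanism ends up selecting each part (it is a placeholder, not an existing engine option):

# run the crawl in 10 parts; restarting node between parts also releases the leaked memory
# (--batch is a placeholder, not an existing ota.js option)
for i in 0 1 2 3 4 5 6 7 8 9; do
  /home/crawler/.nvm/versions/node/v20.17.0/bin/node /home/crawler/engine/bin/ota.js track --skipPreRun --skipSnapshots --batch "$i" >> ~/"`date`".log 2>&1
done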

@michielbdejong (Member Author)

In this graph it's also clear that disk I/O goes up (probably due to swapping!) as soon as memory usage hits 100%, and that shortly after that the process ends:
Screenshot 2024-09-16 at 16 04 56

@michielbdejong (Member Author)

I'll consider this resolved now.

@michielbdejong (Member Author)

This graph shows the effect nicely! On the left, around 15:00, is a run that ramps up and crashes prematurely, followed by 4 successful runs (16:00, 20:00, 2:00, 8:00) that are each chopped into 10 parts.
Screenshot 2024-09-17 at 12 35 43
