OTA engine robustness #1174

Closed · Tracked by #1167
michielbdejong opened this issue Sep 13, 2024 · 18 comments

@michielbdejong (Member)

Initial tests with #1164 show a number of problems:

  • even with git commit-graph, the runs are slow. This may be because the repo starts out shallowed to 5 commits but can build up 6,000 commits in a single run, so it's probably necessary to avoid wildcards in the git log pathspecs - I will look into that today (a sketch of keeping the repo shallow between runs follows this list).
  • the script sometimes trips over itself and fails with errors such as 'git in use' or 'expected only one file to be changed'
  • if git is left with uncommitted changes, the next run will fail too.
  • the script builds up a lot of memory usage (several gigabytes)
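
One mitigation, sketched roughly (paths match the crawler setup shown further down in this issue): re-shallow the repo after each run, so the next run's git log calls only have a handful of commits to walk. This is the same .git/shallow trick used in hourly.sh below.

# re-shallow the versions repo to its last 5 commits after a run,
# so the next run's git log lookups don't walk thousands of commits
cd /home/crawler/engine/data/versions
git rev-parse HEAD~5 > .git/shallow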

To limit the impact of a git repo getting into a bad state, it would help to shard the snapshots and versions repos -> #1172
Apart from sharding the repos, I also want to rethink the scheduling by moving it to a cronjob -> #1173

I think these two measures combined will make the crawler a lot more stable, and also allow us to run it on the same server as the rest of ToS;DR, so it will not only be more reliable, but also cheaper for us to host.

@michielbdejong (Member Author)

I'm having some trouble triggering the recorder: whatever I do, even for a service that was not previously tracked, it says 'Recorded 0 new versions' and 'Recorded 0 new snapshots and 0 new versions'. Investigating...

@michielbdejong (Member Author)

Ah, it was because I was rebased on the crawl-api branch. Fixed now. This is something to be aware of for production, though!

@michielbdejong (Member Author) commented Sep 13, 2024

Before starting the crawl, there is a search:

getLatestSnapshot calling findLatest [
  SourceDocument {
    location: 'https://feelthemusi.com/pages/terms_of_use.html',
    executeClientScripts: undefined,
    contentSelectors: 'body',
    insignificantContentSelectors: undefined,
    filters: undefined,
    content: undefined,
    mimeType: undefined,
    id: 'pages-terms_of_use.html'
  }
] undefined

It makes sense that at this point only the location and the content selectors are known and the content and mimeType are not. But we could add the mimeType to the declaration, and we could even let it default to text/html.

This does change the behaviour of the crawler a bit, because it means a crawl would reasonably have to fail if a PDF is encountered where it was not expected. So that's a downside of option 3.

A 4th option, similar to the 2nd one: we could do an ls before the git log.

And I'm now filling in option 3 ('pass the mimeType as an argument') as getting it from the declaration. But we could also imagine prohibiting git log before fetch.

So to recap the options (adding 6-8 here mainly for completeness; these are probably not better than 4 or 1):

  1. shortlist the 7 or so mimeTypes
  2. do the filtering in JS after the git log
  3. put the mimeType in the declaration
  4. do an ls before the git log (see the sketch after this list)
  5. prohibit git log before fetch, so that the mimeType from the fetch can be used
  6. put the mimeType in a file next to the doc
  7. add a symlink, in the same way Unix commands sometimes use a symlink to point to their latest version
  8. optimistically look for .html and only fall back to .* if the .html lookup fails (this would still allow the slow behaviour in rare cases)
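
To make option 4 concrete, roughly (the repo layout and paths here are made up for illustration): git ls-files only reads the index, so it is cheap, and the subsequent git log can then use an exact path instead of a wildcard pathspec.

# the slow pattern this thread is about: a wildcard pathspec that git log
# has to match against every commit it walks
git log -1 -- 'Some Service/Terms of Use.*'

# option 4: resolve the real filename from the index first, then log only that exact path
file=$(git ls-files -- 'Some Service/Terms of Use.*' | head -n 1)
git log -1 -- "$file"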

@michielbdejong (Member Author)

Option 1 cannot be done in a single git log call, so it becomes very similar to option 8.
The downside of option 2 is that it would replace the if statement that checks for existence with something more complicated - it would require inspecting the response to see whether the files found are actually the ones we were looking for.

So options 1, 8 and 2 are not really attractive. Option 3 seems doable, but I'm now thinking maybe option 5 is also not that bad.

@michielbdejong (Member Author)

OK, I think I found a solution, in two parts:

  1. make the pre-run optional
  2. pass the mimeType from the sourceDocument in the call to getLatestSnapshots

@michielbdejong (Member Author)

In fact, we could take an even bigger step and make the whole snapshot recording optional.
We (ToS;DR, carrying on the history of the Tosback project) care much more about the versions repo than about the snapshots repo.

@michielbdejong (Member Author)

And if we only have a versions repo, the whole wildcards problem probably goes away anyway! :)

@michielbdejong (Member Author)

Even if you do write the snapshots, you could still skip the read-back.

@michielbdejong (Member Author)

I'll work on OpenTermsArchive/engine#1103 first and then see if OpenTermsArchive/engine#1101 is still needed.

@michielbdejong (Member Author)

Option 9: use the filename returned by the call to git ls-files to determine the extension.
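
Roughly, with the same made-up layout as in the option 4 sketch above, the extension can be read off the filename that git ls-files returns:

# the extension found in the snapshots repo tells us which mimeType to expect
file=$(git ls-files -- 'Some Service/Terms of Use.*' | head -n 1)
ext="${file##*.}"   # e.g. html or pdf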

@michielbdejong (Member Author)

I'm now running with OpenTermsArchive/engine#1103 and crawling 12,000 documents still fails. The logs show that it gets from A to S in 3 hours and then just stops. This is my cronjob script; maybe I should add a 2>&1 in there?

crawler@ota:~$ cat hourly.sh 
#!/bin/sh
cd /home/crawler/engine
export NODE_OPTIONS=--max_old_space_size=8000
/home/crawler/.nvm/versions/node/v20.17.0/bin/node /home/crawler/engine/bin/ota.js track --skipPreRun --skipSnapshots > ~/"`date`".log
cd data/versions
git add .
git commit -am"cleanup"
git push
git rev-parse HEAD~5 > .git/shallow
cd ../snapshots
git add .
git commit -am"cleanup"
git push
git rev-parse HEAD~5 > .git/shallow
cd ../..
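
Adding 2>&1 would look like this; it would capture stderr (crash output, Node warnings) in the same dated log file:

# same track line as above, with stderr redirected into the log as well
/home/crawler/.nvm/versions/node/v20.17.0/bin/node /home/crawler/engine/bin/ota.js track --skipPreRun --skipSnapshots > ~/"`date`".log 2>&1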

@michielbdejong (Member Author) commented Sep 16, 2024

This is what a run looks like now, in terms of time, bandwidth, CPU and disk. This run started at A and apparently crashed at R. Memory usage is not shown:
Screenshot 2024-09-16 at 14 02 54

@michielbdejong (Member Author)

I installed the DigitalOcean do-agent, so we can now clearly see the memory leak problem:

Screenshot 2024-09-16 at 15 26 55

@michielbdejong (Member Author)

I'll chop the task into 10 smaller ones. We could also split it into 100 tasks, but that would incur the overhead of starting and stopping the headless browser 100 times and give 100 times the risk of something going wrong with the git push, so I think 10 is a better balance.
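
Roughly what I have in mind, where --batch stands in for whatever mechanism ends up selecting each part (it is a placeholder, not an existing engine option):

# run the crawl in 10 parts; restarting node between parts also releases the leaked memory
# (--batch is a placeholder, not an existing ota.js option)
for i in 0 1 2 3 4 5 6 7 8 9; do
  /home/crawler/.nvm/versions/node/v20.17.0/bin/node /home/crawler/engine/bin/ota.js track --skipPreRun --skipSnapshots --batch "$i" >> ~/"`date`".log 2>&1
done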

@michielbdejong (Member Author)

In this graph it's also clear that disk I/O goes up (probably due to swapping!) as soon as memory usage hits 100%, and that shortly after that the process ends:
Screenshot 2024-09-16 at 16 04 56

@michielbdejong (Member Author)

I'll consider this resolved now.

@michielbdejong (Member Author)

This graph shows the effect nicely! On the left, around 15:00, is a run that ramps up and crashes prematurely, followed by 4 successful runs (16:00, 20:00, 2:00, 8:00) that are each chopped into 10 parts.
Screenshot 2024-09-17 at 12 35 43
