Use temporary SQLite db instead of list[dict] #45

Open
7 of 10 tasks
mikegerber opened this issue Aug 2, 2024 · 7 comments

mikegerber commented Aug 2, 2024

Aug 02 07:00:54 b-pc30533 kernel: Out of memory: Killed process 4030869 (mods4pandas) total-vm:70463740kB, anon-rss:28581044kB, file-rss:1232kB, shmem-rss:4kB, UID:1000 pgtables:136832kB oom_score_adj:0

That's a whopping 28 GB of memory after reading just 22% of the data...

→ Need a more memory-efficient way to handle this.


Progress:

  • page_info
  • Deal with existing files
  • ValueError: No structMap[@type='PHYSICAL'] found (but not a multivolume work) (example on lx0246)
  • column names case insensitive
  • Fix tests
  • Convert to Parquet again
  • Compare old files with new ones
    • Need to use JSON types for sets etc.
    • Need tests for these, it's a recurring issue
  • alto_info

@mikegerber self-assigned this on Aug 2, 2024
@mikegerber added the bug label (Something isn't working) on Aug 2, 2024

@mikegerber

Considering all options, it looks like iteratively building a SQLite database would be the best option (or at least worth trying).

@mikegerber

I've done some experiments, and the temporary SQLite db seems to be the way to go.

@mikegerber

Writing the records out to SQLite works nicely, barely any memory consumption.

I'm not terribly happy with the code, and it now needs work to convert to Parquet, but overall this seems to be an elegant and sufficiently efficient solution.
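
For readers following along, here is a minimal sketch of the general idea (writing each record to SQLite as it is produced instead of collecting a `list[dict]`). It is not the project's actual code: the function name `write_records`, the helper `quote_ident`, the `_id` column and the table name used in the usage note are assumptions made for illustration.

```python
import sqlite3


def quote_ident(name):
    """Quote a table or column name for use in SQL (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def write_records(con, table, records, batch_size=1000):
    """Insert dict records one by one, adding columns as new keys appear.

    Only one record is held in memory at a time. Values must be types
    that sqlite3 can store directly (None, int, float, str, bytes).
    Note: SQLite column names are case-insensitive, so keys that differ
    only in case would collide here.
    """
    t = quote_ident(table)
    con.execute(f'CREATE TABLE IF NOT EXISTS {t} ("_id" INTEGER PRIMARY KEY)')
    # Column names already present (row[1] is the name in table_info output)
    known = {row[1] for row in con.execute(f"PRAGMA table_info({t})")}
    count = 0
    for record in records:
        for key in record:
            if key not in known:
                con.execute(f"ALTER TABLE {t} ADD COLUMN {quote_ident(key)}")
                known.add(key)
        if record:
            cols = ", ".join(quote_ident(k) for k in record)
            marks = ", ".join("?" for _ in record)
            con.execute(
                f"INSERT INTO {t} ({cols}) VALUES ({marks})",
                list(record.values()),
            )
        else:
            con.execute(f"INSERT INTO {t} DEFAULT VALUES")
        count += 1
        if count % batch_size == 0:
            con.commit()  # commit in batches to keep transactions small
    con.commit()
    return count
```

Usage would look like `write_records(sqlite3.connect("mods_info.sqlite3"), "mods_info", records)`, where `records` is any iterator or generator of dicts, so the full data set never has to be in memory at once.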

@mikegerber

Tests are still fine in this state as they don't read the files.

@mikegerber

SQLite DBs are now converted to Parquet after creation.

@mikegerber changed the title from "Can't export page info due to OOM" to "Use temporary SQLite db instead of list[dict]" on Nov 28, 2024
@mikegerber

In the previous version we had array columns:

>>> mods_info_df["classification-ZVDD"].iloc[0]
array(['Tibetica', 'Ostasiatica', 'Historische Drucke'], dtype=object)

:-\
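
SQLite has no array column type, so multi-valued fields like this one need some encoding on the way in. The task list above mentions using JSON types for sets; the sketch below shows one way that could look, with the values stored as JSON text. The function name `encode_value` and the single-column table are assumptions for illustration, not the project's code.

```python
import json
import sqlite3


def encode_value(value):
    """Turn sets/lists/tuples into JSON text; pass other values through.

    Sets are sorted so the stored text is deterministic (this assumes the
    elements are mutually comparable, e.g. all strings).
    """
    if isinstance(value, (set, frozenset)):
        return json.dumps(sorted(value), ensure_ascii=False)
    if isinstance(value, (list, tuple)):
        return json.dumps(list(value), ensure_ascii=False)
    return value


con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE mods_info ("classification-ZVDD" TEXT)')

value = {"Tibetica", "Ostasiatica", "Historische Drucke"}
con.execute("INSERT INTO mods_info VALUES (?)", (encode_value(value),))

# Decoding has to be explicit per column: a plain string column could
# also happen to contain text that looks like JSON.
(stored,) = con.execute('SELECT "classification-ZVDD" FROM mods_info').fetchone()
restored = json.loads(stored)
```

After reading back, `restored` is a Python list again, which could then be written to Parquet as a list column. Note that a set comes back as a sorted list, so any comparison of old and new files has to account for ordering.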

@mikegerber

alto_info/alto4pandas is done now, too.
