Speeding up the Neo4j import #5
Comments
Yeah it took ~10 hours.
I personally didn't spend too much time optimizing because I didn't plan on running this Neo4j import step too often. If you're running it a lot, you may want to look into solutions. The problem with the batch TSV import is that TSVs are bad at representing properties that only exist for a single node or relationship type. However, if you don't care about properties (besides
Ah now I remember another reason I didn't use the import tool. I don't think I was able to fully automate the import, i.e. there were network-specific commands that had to be written, which would decrease the versatility of the code. I wanted
If you want to use the import tool, the Hetionet TSVs could be a good place to start and get benchmarks.
After some testing it turns out that it is actually much faster to import edges individually into neo4j when using py2neo version 3. As described in this link, it seems that py2neo version 3 uses subgraphs in order to make updates to neo4j. Effect of batch size on import speed:
Based on these results, it seems that updating Neo4j with a subgraph containing multiple edges is actually slower than updating the graph one edge at a time. I suspect that this is because the underlying py2neo code converts the subgraph back into individual edges anyway, and therefore spends time on redundant calculations. All speed estimates are approximate, and testing was done on an AWS
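For anyone who wants to repeat a batch-size comparison like the one above, here is a minimal timing sketch. It is not the code used for the benchmarks in this thread. The `Edge` tuple layout and the `write_batch` callable are hypothetical placeholders; in practice `write_batch` would wrap whatever call actually writes a group of edges to the database.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical edge layout: (source_id, edge_type, target_id).
Edge = Tuple[str, str, str]


def batched(items: Iterable[Edge], size: int) -> Iterator[List[Edge]]:
    """Yield consecutive lists holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def time_import(
    edges: Sequence[Edge],
    batch_size: int,
    write_batch: Callable[[List[Edge]], None],
) -> float:
    """Return edges written per second for a single batch size."""
    start = time.perf_counter()
    for batch in batched(edges, batch_size):
        # Assumption: write_batch performs one database update per call.
        write_batch(batch)
    elapsed = time.perf_counter() - start
    return len(edges) / elapsed if elapsed > 0 else float("inf")
```

Calling `time_import` once per candidate batch size (for example 1, 10, 100) with the same edge list gives comparable edges-per-second figures, provided the database is reset between runs.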
@veleritas your benchmarks are awesome. Let's address this after #6 is merged. The easy fix would be changing the default value for
But happy to consider a more dramatic code refactoring if you think it's worth it.
Intended to speed up the hetnet importation into Neo4j. See #5
@veleritas I changed the defaults in d026d13. I made you the commit author, since you did all of the hard work!
I've been experimenting with the batch CSV import provided by Neo4j (version 3), and so far it seems that the batch import can be made to work with Rephetio v2.0. Initial testing shows that a half-scale Rephetio with 1.2 million edges and ~20k nodes can be imported into Neo4j in roughly 10 seconds.
@veleritas awesome, I'd be really interested in getting this feature implemented in
At the moment the batch CSV import is implemented as a tack-on script to the Process:
At the moment things seem to work just fine, and Neo4j has had no complaints so far, but I haven't tested full compatibility with the entire network yet. I'm going to need some more time to figure out if the pipeline will work with the entire network before I'm ready to push anything back into
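To illustrate the general shape of the CSV batch import discussed here, the sketch below writes one node file and one relationship file using the header fields the Neo4j import tool recognizes (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). It is not the tack-on script mentioned above. The dictionary keys, the `identifier` and `name` column names, and the sample records are assumptions made for illustration.

```python
import csv
import io


def write_nodes(handle, nodes):
    """Write nodes as CSV with an ID column, one property, and a label."""
    writer = csv.writer(handle)
    # "identifier" and "name" are hypothetical property names.
    writer.writerow(["identifier:ID", "name", ":LABEL"])
    for node in nodes:
        writer.writerow([node["id"], node["name"], node["label"]])


def write_edges(handle, edges):
    """Write relationships as CSV rows of (start ID, end ID, type)."""
    writer = csv.writer(handle)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for source, target, kind in edges:
        writer.writerow([source, target, kind])


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real network.
    nodes = [
        {"id": "C1", "name": "compound one", "label": "Compound"},
        {"id": "D1", "name": "disease one", "label": "Disease"},
    ]
    edges = [("C1", "D1", "TREATS")]
    buffer = io.StringIO()
    write_nodes(buffer, nodes)
    print(buffer.getvalue())
```

Files like these are then handed to the import tool through its node and relationship options; the exact command depends on the Neo4j version, so check the documentation for the release in use. Note that this layout gives every node the same property columns, which is the limitation raised earlier about properties that exist for only one node or relationship type.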
The main reason I avoided the CSV import is that I didn't see a way to losslessly export a graph (in its entirety) like
Let's revisit this at a later time when we know more. I would say, if you find yourself constantly copying and pasting the CSV import code, then it would make sense to move it upstream.
Hi Daniel,
Just wanted to ask whether we should revisit the Neo4j integration code issue. I've since switched over to the built-in Neo4j CSV loader since it is so much faster. It has been working with the full network, and the loss of license metadata hasn't caused any problems so far. Latest code is here. The CSV import method has also been easy to adapt to the matrix DWPC calculation method by @mmayers12.
We've been discussing on our end that any changes we make to the project should be integrated back upstream if it makes sense, so let us know if you're interested in these changes, or if we need to tweak them further before you're willing to pull upstream.
Toby
Hey @veleritas, I see two options for incorporating CSV Neo4j import functionality into
If you're willing to do the work to submit a PR for option 1, then this is preferable. However, we need to make sure implementations in
Hi Daniel,
What was the reasoning behind importing the nodes and edges of the hetnet using the py2neo interface? I'm finding that the import process is quite slow even for small networks, and was wondering whether I should look into the batch CSV import that Neo4j comes with.
From my experiments, importing 20,000 nodes and 22,000 edges into Neo4j with the current code takes roughly 45 minutes on an AWS instance with 8 cores and 32 GB RAM. At this speed it would basically take forever to load the entire network, so I'm wondering if I'm missing anything here.
Best,
Toby
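As a rough back-of-the-envelope check on the figures quoted in this thread, the arithmetic below converts them into throughput. It assumes, purely for simplicity, that nodes and edges cost the same to import and that the rate stays constant as the network grows; neither assumption is confirmed anywhere in the thread, and "~20k nodes" is treated as exactly 20,000.

```python
def items_per_second(nodes: int, edges: int, seconds: float) -> float:
    """Combined nodes-plus-edges throughput (assumes equal cost per item)."""
    return (nodes + edges) / seconds


# py2neo figure from the original post: 20,000 nodes, 22,000 edges, 45 minutes.
py2neo_rate = items_per_second(20_000, 22_000, 45 * 60)

# CSV batch import figure: ~20k nodes, 1.2 million edges, roughly 10 seconds.
csv_rate = items_per_second(20_000, 1_200_000, 10)

# Hours the half-scale network would need at the py2neo rate.
hours_at_py2neo_rate = (20_000 + 1_200_000) / py2neo_rate / 3600

print(f"py2neo: {py2neo_rate:.1f} items/s")
print(f"CSV import: {csv_rate:.0f} items/s")
print(f"half-scale network at py2neo rate: {hours_at_py2neo_rate:.1f} h")
```

Under these assumptions the py2neo path handles about 16 items per second against roughly 122,000 for the CSV import, and the half-scale network would take on the order of 22 hours at the slower rate.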