RealTimeKernel.from_moscot(tp) Out of Memory #1146
Hi @AlinaKurjan, this is most likely a problem with adaptive thresholding when we go from dense (but not materialized) OT couplings in moscot to sparse (and materialized) transition matrices in CellRank. Do you have any advice here @MUCDK?
Hi @Marius1311, thank you for your reply. I've managed to get it to work in the end by subsampling to 95% of my dataset. It would be great to have a different solution for scaling to larger projects in the future :)
Hi @AlinaKurjan, have you tried sparsifying the matrix using https://github.com/theislab/moscot/blob/ba231ea4f7511caaa1915e2b7e28e88d709d45ad/src/moscot/base/output.py#L170 / https://moscot.readthedocs.io/en/latest/genapi/moscot.backends.ott.OTTOutput.sparsify.html#moscot.backends.ott.OTTOutput.sparsify ? As Marius mentioned, CellRank needs to materialize the transport matrix, hence memory becomes quadratic. If you first sparsify the output, i.e. set very small values to 0, you can pass a sparse matrix.
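For context, a minimal sketch of how that could be called, based on the linked docs and the snippets later in this thread (the mode and batch_size values are placeholders, not recommendations):

```python
# Sketch only: sparsify each OT solution so that the materialized transport
# matrix becomes a scipy.sparse matrix instead of a dense one.
for (t1, t2), solution in tp.solutions.items():
    sparse_sol = solution.sparsify(mode="min_row", batch_size=1024)
    print((t1, t2), type(sparse_sol.transport_matrix))
```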
Hi @MUCDK. I've tried sparsifying on the full dataset with different parameters and batch sizes, but it made no difference and still gave the Resource exhausted error, usually not immediately but further down the line.
Hi @AlinaKurjan, thanks for this, it's super helpful feedback. Would you mind copy-pasting the code you tried? That would make it easier for us to track down the error.
@MUCDK @Marius1311 sorry for the delay. Here's what I've tried:
with batch sizes of 100, 1000, 5000, and 10000.

You might notice that I am not entirely sure how a threshold value or a good batch size should be picked; to be honest, I did not understand what the thresholding does from the description. It would be great if some guidance on a starting point could be added. Apologies if I'm missing something really obvious.

While I have you, can I please ask for your opinion on whether CellRank can be used with a combined sc- and snRNA-seq dataset? I'm working with human developmental data, with embryonic counts coming from sc and fetal counts from sn data. I'm obviously very interested in trying to identify trajectories, but I'm not sure if that is possible given the sources of bias. I've integrated and corrected my sc and sn data with scVI-scANVI, but since it doesn't batch-correct the counts, I'm not sure how "analyseable" this combined data is for trajectory work... Any thoughts would be highly appreciated (and apologies if this is not the right place to ask).
Hi @AlinaKurjan, is this literally the code you used? There might be an error in the way you used it.
Sorry, I was typing this part of the code from memory as I had already deleted it. I am sure I typed it right in the actual code; the output it was producing was different from the one I was getting when running it without sparsification. I am currently running a few processes so I can't rerun it, but I'll post the outputs tomorrow if you'd like.
Replying to the second part of your question @AlinaKurjan: Using the combination of moscot.time + CellRank's RealTimeKernel, your data integration strategy should not be a problem for downstream trajectory analysis, as both of these methods operate in latent space embeddings, and don't need access to corrected counts.
Thus, the final combined Markov chain can be entirely computed without access to corrected counts, so this should work! This will enable you to compute macrostates and fate probabilities. Once you go from there to gene-level analysis, like driver genes and gene trends, things could get a bit tricky, as these analyses do need counts. However, I'm not sure how much of a problem this would cause - I think you will just have to try. Up to fate probabilities, you should be fine as you can work entirely in latent spaces.
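For orientation, a rough sketch of that latent-space-only part of the workflow; this is a sketch under assumptions rather than a definitive recipe - values such as n_states=10 and cluster_key="cell_type" are placeholders, and the exact compute_transition_matrix arguments may differ between CellRank versions:

```python
import cellrank as cr

# Kernel from the solved temporal problem - works purely in latent space.
rtk = cr.kernels.RealTimeKernel.from_moscot(tp)
rtk.compute_transition_matrix(self_transitions="all", conn_weight=0.2, threshold="auto")

# Macrostates and fate probabilities, still without corrected counts.
g = cr.estimators.GPCCA(rtk)
g.compute_macrostates(n_states=10, cluster_key="cell_type")  # placeholder values
g.predict_terminal_states()
g.compute_fate_probabilities()
```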
@AlinaKurjan the issue seems to be that the sparsification happens on GPU. Something along these lines should avoid that:

```python
from anndata import AnnData

problem = ...
sparsify_kwargs = ...

couplings = {}
for (t1, t2), solution in problem.solutions.items():
    adata_src = problem[t1, t2].adata_src
    adata_tgt = problem[t1, t2].adata_tgt
    solution = solution.to(device='cpu')
    solution = solution.sparsify(mode='threshold', **sparsify_kwargs)
    tmat = solution.transport_matrix  # should be a scipy.sparse array
    couplings[t1, t2] = AnnData(tmat, obs=adata_src.obs, var=adata_tgt.obs)
```

This will come at some performance cost, but should still be fast. Lmk if this works - initializing the kernel should be straightforward from this (see here).
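Hedged, since it goes beyond what is shown above: once the couplings dictionary exists, initializing the kernel could look roughly like this, assuming the RealTimeKernel constructor accepts a precomputed couplings mapping (time_key="day" is the key used in the prepare call further down this thread; please check the linked CellRank docs for the exact signature):

```python
import cellrank as cr

# Sketch only: assumes RealTimeKernel can take a {(t1, t2): AnnData} mapping
# of precomputed couplings - double-check the constructor in the CellRank docs.
rtk = cr.kernels.RealTimeKernel(adata, time_key="day", couplings=couplings)
rtk.compute_transition_matrix()
```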
@Marius1311 Thank you for such a detailed reply, that's super useful!! Essentially, when you say that I need to compute the OT couplings in the scANVI space, do you mean that I should be supplying neighbors calculated using the scANVI embeddings when creating the TemporalProblem (because I've supplied the PCA-based ones, as in the tutorial...)? If so, would that logic also apply to RNA velocity moments (which I'm computing separately for sc and sn)?

The reason I was confused about which neighbors to supply is that, prior to this, I went through Dana Pe'er's Harmony and Palantir workflows to get an augmented affinity matrix (X_aug_aff) using scANVI embeddings, then the diffusion space based on this (X_diff), and then the multiscale space (X_msdiff). I was planning to proceed with downstream trajectory analysis using X_diff neighbors, but am now realising that's likely not the right call. Any opinion on this would be extremely useful. I imagine it's not necessary to do the Harmony and Palantir steps if the kernel is built on experimental time (I originally built a kernel based on pseudotime, but the results were not clear)?
@michalk8 Awesome, thank you so much! Will give it a go tomorrow and get back to you with the results.
Same error with batch_size=1000; batch_size=100 crashes the kernel. Any other parameters I can try?
No, you don't need to supply a kNN graph; you can directly supply the scANVI embedding in the temporal problem. You would simply do something like

```python
tp = tp.prepare(time_key="day", joint_attr="X_scanvi")
```

As to RNA velocity, that is a bit trickier to combine with batch correction, as it relies on corrected counts - and not only that, but also on corrected spliced/unspliced counts. Importantly, batch effect correction needs to preserve the ratio between spliced and unspliced counts, so here it gets tricky. Latent space-based methods, like veloVI, are probably your best bet here. See the veloVI docs: https://velovi.readthedocs.io/en/latest/index.html However, I'm not sure whether veloVI currently supports data integration for RNA velocity computation - you could open an issue in their repo.
As a starting point, I would run both moscot and CellRank directly in the scANVI embedding, as I pointed out above, and see how far that gets you.
The Pe'er lab's Harmony is for integration across time points, which, in this context, you might not need, as you're using OT to map across time points.
@AlinaKurjan there are 2 batch sizes: one in solve and one in sparsify.
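To illustrate where the two batch sizes enter (placeholder values; the comments reflect a reading of this thread rather than the official docs):

```python
# batch_size in solve(): how large a chunk of the cost/kernel is evaluated at
# once on the GPU while the OT solution is being computed.
tp = tp.solve(epsilon=1e-3, tau_a=0.99, tau_b=0.999, scale_cost="mean", batch_size=1024)

# batch_size in sparsify(): how many rows of the transport matrix are
# materialized at a time when small values are set to zero.
for (t1, t2), solution in tp.solutions.items():
    solution = solution.to(device="cpu")
    solution = solution.sparsify(mode="min_row", batch_size=1024)
```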
Agree, tracked in theislab/moscot#639 (comment)
@michalk8 @MUCDK Sorry for the delay, and thank you so much for all your help so far, your input is invaluable! I've tried reducing the batch size of tp.solve to 1024. It took quite a bit longer to run (maybe half an hour more), but nothing major. Unfortunately, I'm still getting errors as below:

```python
problem = tp
sparsify_kwargs = {'batch_size': 1024}

couplings = {}
for (t1, t2), solution in problem.solutions.items():
    adata_src = problem[t1, t2].adata_src
    adata_tgt = problem[t1, t2].adata_tgt
    solution = solution.to(device='gpu')
    solution = solution.sparsify(mode='min_row', **sparsify_kwargs)
    tmat = solution.transport_matrix  # should be a scipy.sparse array
    couplings[t1, t2] = AnnData(tmat, obs=adata_src.obs, var=adata_tgt.obs)
```

```text
2023-12-18 10:35:05.321394: W external/xla/xla/service/hlo_rematerialization.cc:2218] Can't reduce memory use below 17.77GiB (19076431872 bytes) by rematerialization; only reduced to 37.94GiB (40734793728 bytes)
2023-12-18 10:35:15.380041: W external/tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 37.94GiB (rounded to 40734793728)requested by op
2023-12-18 10:35:15.380329: W external/tsl/tsl/framework/bfc_allocator.cc:497] ________________________________________________*****___________________*********___*****xx_____****
2023-12-18 10:35:15.380392: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2461] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 40734793728 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 57.92MiB
constant allocation: 0B
maybe_live_out allocation: 37.94GiB
preallocated temp allocation: 0B
total allocation: 37.99GiB
total fragmentation: 0B (0.00%)
Peak buffers:
Buffer 1:
Size: 37.94GiB
Operator: op_name="jit(_where)/jit(main)/select_n" source_file="/tmp/ipykernel_1779629/2320153915.py" source_line=1
XLA Label: fusion
Shape: f32[1024,486,20463]
==========================
Buffer 2:
Size: 37.94MiB
Entry Parameter Subshape: f32[486,20463]
==========================
Buffer 3:
Size: 19.98MiB
Entry Parameter Subshape: pred[1024,20463]
==========================
Buffer 4:
Size: 4B
Entry Parameter Subshape: f32[]
==========================
```

Same issue with threshold specification rather than 'min_row'.
Also, I don't know if this is useful, but here's the output of tp.solve:

```python
tp = tp.solve(epsilon=1e-3, tau_a=0.99, tau_b=0.999, scale_cost="mean",
              batch_size=1024, device='gpu')
```

```text
INFO Solving `8` problems
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(23014, 20463)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(20463, 26196)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(421, 1644)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(468, 421)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(21, 212)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(1644, 23014)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(212, 468)].
INFO Solving problem BirthDeathProblem[stage='prepared', shape=(736, 21)].
```

Could it be because my groups are so unbalanced?
The output is still on GPU; can you please try this?

```python
# solution = solution.to(device='gpu')
solution = solution.to(device='cpu')
```

Another thing is that even with this batch size, you're trying to materialize an array of shape [1024, 486, 20463] (the peak buffer in the log above).
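A quick sanity check on the numbers in the log above:

```python
# 1024 x 486 x 20463 float32 values is exactly the 40734793728 bytes
# (about 37.94 GiB) the allocator reports failing to allocate.
print(1024 * 486 * 20463 * 4)          # 40734793728
print(1024 * 486 * 20463 * 4 / 2**30)  # ~37.94
```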
Hi, I'm also getting the same error. This is what I'm trying to do:

rna: data: shape=(50637, 120444)

problem = tp

output:

Am I missing something in my solution?
Hi @fafa92, thanks for reporting this! There are two options:

Moreover, just to be sure: you get the OOM error from the
Hi @MUCDK and @fafa92, I did get it to work in the end with no extra code modifications (i.e. with the tutorial code provided). There are a few things that could be the reason for it working this time for me:

Those are the changes I could think of. Hope this helps :)
Hi @AlinaKurjan, happy it worked for you in the end! Possibly you worked on a different GPU with more memory? Otherwise I'm not sure. @fafa92 please be aware that you are working with the
Hi @MUCDK and @AlinaKurjan. Thanks for the help. I'm trying to follow these instructions on my data. The first problem was solved by using rank. Thank you @MUCDK for the hint. However, I'm facing a few other issues.

Does not converging mean I need to re-run it? If so, what changes would you recommend? This came out of the visualization part; I'm not sure whether it looks like this because of the tp.solve convergence or not. I had the dimensionality reduction for both datasets set to LSI-l2 (it used to be PCA for RNA and LSI-l2 for ATAC), but it's still not converging. I got a more robust figure, but the datasets are not overlapping like in the example! What might be causing this?
Yes, whenever the solver hasn't converged, it's better to rerun it. In general, doing a pure GW as proposed in the SCOT paper is often not good enough to match cells; hence, as described in our tutorial (the lower part), we recommend using the extension to the fused setting, i.e. you can get gene activity and then integrate the data with scVI. Regarding the convergence, this is hard to tell. You could try different epsilon values, e.g. 0 (which is allowed when you do low-rank) or 1.0. @giovp do you have any recommendations regarding the convergence?
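As a rough illustration of that suggestion (here, problem stands for whatever moscot problem is being solved; rank=500 is an arbitrary placeholder to tune to the data and GPU memory):

```python
# Low-rank solve: epsilon=0 is allowed in the low-rank setting, as mentioned
# above; alternatively, try a larger epsilon such as 1.0.
problem = problem.solve(epsilon=0, rank=500)
```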
I think at this point, it makes sense to move the discussion to the moscot repo.
@MUCDK scVI integration helped a bit, thanks for the suggestion! More RAM came to the rescue for the convergence problem :)
Hi, thank you very much for the moscot and CellRank resources! I was wondering if you could please advise on the best way to get around the "XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2160972208 bytes." error when converting to the RealTimeKernel. My adata object is 73131 obs × 23552 var.
I've managed to run tp.solve by specifying batch=10000, but now cannot convert it to a kernel using either GPU or CPU. Any advice, please?