suggestions for multigpu use case in rfpipe #12
Comments
Yeah, being able to use multiple GPUs from a single host thread may be useful. According to the CUDA docs, …
Yes. I was hoping it could be worked into rfgpu for the sake of the overall design. Do you think it would also be possible to make the … Is there a clever way to transfer concurrently? If not, the …
Yes, we should be able to do concurrent transfers, no problem. Even if we didn't, the way the …
Not checked in yet, but I've done some work on this. Here is a bit of info on the …
I'm reviewing my current rfpipe code and see that the …
Yes, that's right. Everything I described in the previous comment has to do with moving chunks of raw data between CPU and GPU memory. In short, the change allows a single chunk of CPU memory to map onto several GPUs, rather than the strictly one-to-one mapping used before. How I plan … Does that all sound OK?
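The single-chunk-to-several-GPUs idea described above could be sketched like this in Python. All names here are hypothetical stand-ins, not rfgpu's API: one host array is divided along the time axis, and each sub-block is associated with a different GPU device ID.

```python
import numpy as np

def split_across_devices(host_chunk, devices):
    """Return {device_id: view} mapping slices of one host array to GPUs.

    The views share the host chunk's memory, so this models one CPU
    allocation feeding several devices (a real implementation would then
    transfer each view to its device).
    """
    parts = np.array_split(host_chunk, len(devices), axis=0)
    return dict(zip(devices, parts))

# Illustrative shape: (time, baseline, channel) visibilities as complex64.
host_chunk = np.zeros((64, 351, 256), dtype=np.complex64)
mapping = split_across_devices(host_chunk, devices=[0, 1, 2, 3])
for dev, part in mapping.items():
    print(dev, part.shape)
```

Each device gets a contiguous slice of the same host buffer, which is the "one chunk of CPU memory, several GPUs" mapping described in the comment.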
I'm sure I could work with that.
Having a single … It should be completely backwards compatible; I've tried some of my existing simple test cases to make sure they still work, but have not really exercised the multi-GPU capability much. If I come up with an example script for this I will let you know. Also, everything I said above about how …
Ok, I imagined there could be structural issues. I'm excited to play with this!
Do I understand that you've only implemented the GPU memory for the …
Sorry, ignore that last question. I see the syntax in the code.
OK, hope that makes sense. In the "two lists" version, the first list is the array dimensions and the second is the list of devices. I should add some better arg names and docstrings at some point.
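As an illustration of the "two lists" constructor pattern described above, here is a toy stand-in (this is not rfgpu's actual class or signature): the first argument gives the array dimensions, the second lists the device IDs the array should exist on.

```python
class MultiGPUArray:
    """Toy model of an array replicated across several GPU devices."""

    def __init__(self, dims, devices=(0,)):
        self.dims = tuple(dims)
        self.devices = tuple(devices)
        # One (simulated) device-side allocation per listed GPU.
        self._dev_mem = {d: bytearray(self._nbytes()) for d in self.devices}

    def _nbytes(self, itemsize=8):
        # itemsize=8 models complex64 elements.
        n = 1
        for d in self.dims:
            n *= d
        return n * itemsize

# "Two lists": dimensions first, then devices.
arr = MultiGPUArray((4096, 128), devices=[0, 1])
```

The point is only the calling convention: omitting the device list keeps the old single-GPU behavior, which matches the backwards-compatibility claim earlier in the thread.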
Just trying this out a bit more today. While it seems to work (in the sense of not crashing or otherwise producing bad results), the GPUs are not being used in parallel in the way I expected when calling multiple devices from a single thread. So it probably needs a bit more work before being integrated "for real" into the pipeline. Also, I'm adding some docstrings and argument names to the Python interface, which should help with code readability; for example, you can do …
I have set up some concurrent execution code around the grid/image portion of the rfgpu code. It works pretty well for 1 or 2 GPUs, but does not scale linearly beyond that.
Do you have a simple example of this? Or is it in rfpipe somewhere?
Yes, in the development branch. The concurrent part is done at: … I've restructured this a few times to try to get it to scale well. It could be simpler than this version and still scale the same (i.e., sublinear for >2 GPUs).
Reading my code again, I suspect the poor scaling could be due to the use of a Python for loop inside the thread. The individual call to …
That might be the explanation. I'd like to reproduce the rfpipe usage in a simple example and run it through nvprof to really see what is going on, though. I started this but haven't finished yet. I think putting an iteration over multiple images into rfgpu is probably a good idea in any case, and is something I've thought about before. This would also allow things like batched FFTs and should improve efficiency for small image sizes, where the current implementation looks dominated by kernel launch overhead.
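The looped-vs-batched idea can be sketched with NumPy (the GPU analogue would be a cuFFT batched plan, which amortizes one launch over the whole stack instead of paying per-image launch overhead):

```python
import numpy as np

rng = np.random.default_rng(42)
stack = rng.standard_normal((8, 64, 64)) + 1j * rng.standard_normal((8, 64, 64))

# One FFT call per image: analogous to one kernel launch per image, where
# fixed per-call overhead dominates for small image sizes.
looped = np.stack([np.fft.fft2(img) for img in stack])

# One batched call over the whole stack: analogous to a cuFFT batched plan.
batched = np.fft.fft2(stack, axes=(-2, -1))

assert np.allclose(looped, batched)
```

Both paths give identical results; the batched form just replaces N small calls with one larger one, which is the efficiency argument made above.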
OK, I figured out part of the problem: I accidentally committed a version of the Makefile that had the GPU timing code enabled. This can in some circumstances slow down multi-GPU usage, because it makes many of the routines wait for the GPU to finish before returning, preventing some parallel operations from happening. That may not completely explain what you were seeing, but you can pull the latest (with those two lines re-commented) and try again. Good idea to …
I have a test script set up and am able to reproduce your results; run time improves by ~50% going from 1 to 2 GPUs, and not much after that. I'll play around with it and let you know if I find anything useful.
I've pulled the latest on the multi_gpu branch and rebuilt, but importing to Python fails: …
Going back one commit (to 4c94821) fixes it. |
I added …
Just noting here that, as discussed in #13, the multi_gpu branch has been merged into master.
I took another look at this today. Reducing the amount of … I think the bad scaling is, as you say, due to Python. I made a simple change to have the … For additional improvement I will still think about processing batches of N images at a time, as we discussed last week. But maybe hold off on this for now in favor of working on phasing / dynamic spectra?
I built the new version (with the nice new build scripts!), but can't get the scaling to improve beyond what it did before. I see a 2x improvement for 2 GPUs, but none beyond that.
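For context, this scaling shape is informative: a fixed serial fraction (Amdahl's law) predicts a smooth curve that keeps improving past 2 devices, so "2x at 2 GPUs, flat after" points instead at some serialized resource (e.g., Python-level dispatch) once more workers are added. A quick sketch of the ideal curve:

```python
def amdahl_speedup(serial_frac, n):
    """Ideal speedup on n devices when a fixed fraction of work is serial."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# Any serial fraction small enough to allow ~2x at n=2 would still keep
# improving at n=4 and n=8, unlike the observed flat scaling.
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

The 5% serial fraction here is purely illustrative; the point is the shape of the curve, not the specific value.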
Currently, rfpipe uses a single function to set up rfgpu on a single GPU. For realfast, that requires a one-to-one mapping of data read and data searched on a GPU. That will limit GPU utilization if large amounts of memory need to be read at once.
As an example, reading L-band with 10 ms sampling and an FRB search requires reading ~10 GB of memory. For a server with 8 GPUs, we will either need to limit the memory usage per read or share the search work for a single read over multiple GPUs.
I can see one way to use multiple GPUs in rfpipe, but it is pretty invasive. I wanted to ask for suggestions before implementing it. The basic flow is: …
Is there overhead to setting cudaSetDevice? Could we use it within the loops to run two GPUs concurrently?
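For what it's worth, cudaSetDevice is generally cheap after each device's context has been initialized: it just changes the calling thread's notion of the "current device". The loop pattern asked about can be sketched in Python with stand-in functions (these are not real CUDA calls; a per-thread variable models the runtime's per-thread current device):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()

def set_device(dev):
    # Stand-in for cudaSetDevice(dev): records the thread's current device.
    _tls.device = dev

def do_work(dev, x):
    # Select the device, then do "work" on it (a real version would launch
    # an async kernel here rather than compute on the CPU).
    set_device(dev)
    return (_tls.device, x * 2)

# Round-robin work over two devices from a small thread pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(do_work, i % 2, i) for i in range(4)]
    results = [f.result() for f in futures]
print(results)
```

Because the current device is per-thread state in the CUDA runtime, each worker must call the device-selection step itself before issuing work, which is the main thing to get right when driving multiple GPUs from one process.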