suggestions for multigpu use case in rfpipe #12

Open
caseyjlaw opened this issue Sep 29, 2018 · 25 comments

@caseyjlaw
Contributor

Currently, rfpipe uses a single function to set up rfgpu on a single GPU. For realfast, that requires a one-to-one mapping between the data read and the data searched on a GPU, which will limit GPU utilization when large amounts of memory need to be read at once.
As an example, reading L-band data with 10 ms sampling for an FRB search requires reading ~10 GB of memory. For a server with 8 GPUs, we will either need to limit the memory usage per read or share the search work for a single read across multiple GPUs.
I can see one way to use multiple GPUs in rfpipe, but it is pretty invasive, so I wanted to ask for suggestions before implementing it. The basic flow is:

rfgpu.cudaSetDevice(devicenum)
# set up grid, image, arrays, etc.
for i in resampling_list:
    for dm in dm_list:
        grid.operate()
        image.operate()
        # threshold the image
return peaks

Is there overhead to calling cudaSetDevice? Could we use it within the loops to run two GPUs concurrently?
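Something like this rough sketch is what I have in mind (purely illustrative; grid/image would become per-device objects, and the operate() calls are the same placeholders as above):

for i in resampling_list:
    for dm in dm_list:
        for devicenum in (0, 1):
            rfgpu.cudaSetDevice(devicenum)   # switch the active device
            grids[devicenum].operate()       # one grid/image pair per device
            images[devicenum].operate()
            # threshold each image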

@demorest
Collaborator

demorest commented Oct 5, 2018

Yeah, being able to use multiple GPUs from a single host thread may be useful. According to the CUDA docs, cudaSetDevice "should be considered a very low overhead call." It might be cleaner to handle this inside rfgpu, though; for example, each Grid, Image, etc. instance would be associated with a certain device, and rfgpu would automatically switch as appropriate when each is called. That way the calling program doesn't have to manage all this. Sound good?

@caseyjlaw
Contributor Author

Yes. I was hoping it could be worked into rfgpu for the sake of the overall design. Do you think it would also be possible to make the vis_raw.h2d() command transfer to a set of devicenums?

Is there a clever way to transfer concurrently? If not, the h2d transfer time will scale up linearly with the number of devices, while the search time will scale down in the same way. When multiplexing over many GPUs, we'll eventually spend more time moving data than searching.

@demorest
Collaborator

demorest commented Oct 5, 2018

Yes, we should be able to do concurrent transfers, no problem. Even if we didn't, the way the Array class is currently structured there is a one-to-one mapping between GPU and host memory, so it needs to change anyway to address the overall memory usage issue mentioned earlier.

@demorest
Collaborator

demorest commented Nov 6, 2018

Not checked in yet, but I've done some work on this. Here is a bit of info on the Array interface changes; I've tried to keep it all backwards-compatible. Let me know if you have any suggestions:

a = GPUArrayComplex((1024,1024)) allocates a 1024-by-1024 array on whatever device is currently selected (via cudaSetDevice). This is the same as it always was.

a = GPUArrayComplex((1024,1024),(1,2,3)) allocates the array on devices 1, 2 and 3.

a.h2d() sends the data from CPU memory to all devices associated with this array. The transfer is done asynchronously, so it should go to all the devices in parallel. Also, the function returns before the transfer is finished. The GPUs will handle this automatically (i.e., not start other work until it's done), but there could be issues if you modify the CPU array while a transfer is still happening.

a.d2h() will transfer the data from the default device (first one in the list) to the CPU memory. a.d2h(n) will transfer from device n.
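Putting that together, a quick usage sketch (the array shape and device numbers are just examples):

import rfgpu

# Allocate the same 1024-by-1024 array on devices 1, 2 and 3.
a = rfgpu.GPUArrayComplex((1024, 1024), (1, 2, 3))

a.h2d()    # asynchronous copy of the host buffer to all three devices
a.d2h()    # copy back from the default device (first in the list)
a.d2h(2)   # copy back from device 2 specifically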

@caseyjlaw
Contributor Author

I'm reviewing my current rfpipe code and see that the GPUArrayComplex object that is moved to the GPU is actually the raw visibilities (with shape (nbl, nchan, nints)). A Grid object is then defined to use that memory to downsample, dedisperse, etc. Would moving the raw data (rather than the complex grid) still be supported by the new interface?
How do you anticipate the Grid object will use different devices (e.g., when running grid.operate for a given DM, integration, etc.)?

@demorest
Collaborator

demorest commented Nov 7, 2018

Yes, that's right. Everything I described in the previous comment has to do with moving chunks of raw data between CPU and GPU memory. In short, the change allows a single chunk of CPU memory to map onto several GPUs, rather than the strictly one-to-one mapping used before.

How I plan Grid to work is that when an instance is declared, it will have an extra argument specifying which GPU it uses, i.e. grid = rfgpu.Grid(nbl, nchan, ntime, upix, vpix, device). Then any time you call a Grid method such as operate(), it will automatically use that GPU. It will raise an error if you try to pass it an array that is not defined on that device, etc. Since each Grid instance is tied to a specific GPU, you'll need one for each GPU in use.
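For example, roughly (the problem dimensions here are placeholders, not realfast values):

import rfgpu

# Placeholder dimensions, just for illustration.
nbl, nchan, ntime = 351, 256, 512
upix = vpix = 1024

# One Grid instance per GPU; each method call then runs on that instance's device.
grids = {dev: rfgpu.Grid(nbl, nchan, ntime, upix, vpix, dev) for dev in (0, 1)}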

Does that all sound OK?

@caseyjlaw
Contributor Author

I'm sure I could work with that.
I suppose another approach would be to make the methods of Grid (e.g., downsample, operate) take a device as an argument. Perhaps that would mean there'd only need to be a single Grid object? Either way, multiple Array objects would be needed to manage the resampled data and images.

@demorest
Collaborator

demorest commented Nov 9, 2018

Having a single Grid instance able to work with different devices would take more work since there is a bunch of internal data that would then need to be duplicated on all the devices and kept track of. Implementing the scheme I described above was pretty straightforward so I went ahead and did it. This is checked in now on the "multi_gpu" branch. Give it a try and let me know if you see any issues.

It should be completely backwards-compatible; I've tried some of my existing simple test cases to make sure they still work, but I have not really exercised the multi-GPU capability much. If I come up with an example script for this, I will let you know.

Also, everything I said above about how Grid now works also applies to Image.

@caseyjlaw
Contributor Author

Ok, I imagined there could be structural issues. I'm excited to play with this!

@caseyjlaw
Contributor Author

caseyjlaw commented Nov 9, 2018

Do I understand correctly that you've only implemented the multi-GPU memory handling for the GPUArrayComplex class? I tried to set it up for GPUArrayReal and got this error:

TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. rfgpu.GPUArrayReal()
    2. rfgpu.GPUArrayReal(arg0: int)
    3. rfgpu.GPUArrayReal(arg0: List[int])
    4. rfgpu.GPUArrayReal(arg0: List[int], arg1: List[int])

Invoked with: (1944, 2304), 0

@caseyjlaw
Contributor Author

Sorry, ignore that last question. I see the syntax in the code.

@demorest
Collaborator

demorest commented Nov 9, 2018

OK, hope that makes sense. In the "two lists" version, the first list is the array dimensions and the second is the list of devices. I should add some better argument names and docstrings at some point.
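For the example above, that would look something like this (illustrative):

import rfgpu

# Dimensions first, then the devices; a single device still goes in a list.
a = rfgpu.GPUArrayReal([1944, 2304], [0])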

@demorest
Collaborator

Just trying this out a bit more today. While it seems to work (in the sense of not crashing or otherwise producing bad results), the GPUs are not being used in parallel in the way I thought they would be when driving multiple devices from a single thread. So it probably needs a bit more work before it gets integrated "for real" into the pipeline.

Also, I'm adding some docstrings and argument names to the Python interface, which should help with code readability; for example, you can do a = rfgpu.GPUArrayComplex(dims=(1024,512), devices=(1,2,3)). Some of this has been committed already, some is still in progress.

@caseyjlaw
Contributor Author

I have set up some concurrent execution code around the grid/image portion of the rfgpu code. It works pretty well for 1 or 2 GPUs, but does not scale linearly beyond that.

@demorest
Collaborator

Do you have a simple example of this? Or is it in rfpipe somewhere?

@caseyjlaw
Contributor Author

Yes, in the development branch. The concurrent part is done at:
https://github.com/realfastvla/rfpipe/blob/development/rfpipe/search.py#L137

I've restructured this a few times to try to get it to scale well. It could be simpler than this version and still scale the same (i.e., sublinearly for >2 GPUs).

@caseyjlaw
Contributor Author

Reading my code again, I suspect the poor scaling could be due to the use of a Python for loop inside each thread. The individual calls to grid.operate and image.operate are fast, but each time they run, the Python interpreter checks the stats object and then steps to the next integration of the for loop.
The scaling probably depends on the time it takes Python to switch context between threads and on the number of GPU images being created; if the switching time is comparable to the time to form a single image, then it will scale poorly.
If so, then solutions would be to (1) find a way to wrap the for loop in Cython or (2) move the iteration over integrations into rfgpu.
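Schematically, the loop inside each worker thread is doing something like this (the rfgpu call signatures and the threshold check are placeholders, not the real API):

def search_on_device(grid, image, integrations, threshold):
    # grid and image are the rfgpu objects tied to one device.
    peaks = []
    for i in integrations:
        grid.operate()    # fast GPU kernel (placeholder signature)
        image.operate()   # fast GPU kernel (placeholder signature)
        # Pure-Python bookkeeping between kernels: the interpreter holds the
        # GIL here while it checks the image statistics and advances the loop,
        # so the worker threads for different GPUs end up taking turns.
        if max(image.stats()) > threshold:   # placeholder stats check
            peaks.append(i)
    return peaks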

@demorest
Collaborator

That might be the explanation. I'd like to reproduce the rfpipe usage in a simple example and run it through nvprof to really see what is going on though. I started this but haven't finished yet.

I think putting an iteration over multiple images into rfgpu is probably a good idea in any case, and is something I've thought about before. This would also allow things like using batched FFTs and should improve efficiency for small image sizes, where it looks like the current implementation is dominated by kernel launch overhead.

@demorest
Collaborator

OK, I figured out part of the problem: I accidentally committed a version of the Makefile that had the GPU timing code enabled. This will in some circumstances slow down multi-GPU usage, because it makes many of the routines wait for the GPU to finish before returning, preventing some parallel operations from happening. That may not completely explain what you were seeing, but you can pull the latest (with those two lines re-commented) and try again. It's a good idea to run make clean first to make sure everything gets rebuilt.

@demorest
Collaborator

I have a test script set up and am able to reproduce your results; run time improves by ~50% going from 1 to 2 GPUs, and not much after that. I'll play around with it and let you know if I find out anything useful.

@caseyjlaw
Contributor Author

I've pulled the latest on the multi_gpu branch and rebuilt, but importing it in Python fails:

In [1]: import rfgpu                                                                                                                                                          
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-37fd4ba54c94> in <module>
----> 1 import rfgpu

ImportError: /home/cbe-master/realfast/anaconda/envs/development3/lib/python3.6/site-packages/rfgpu.so: undefined symbol: _ZN5rfgpu5Timer4stopEv

Going back one commit (to 4c94821) fixes it.

@demorest
Collaborator

I added the Makefile as a dependency so everything will get rebuilt when it changes; sorry for the confusion.

@demorest
Collaborator

Just noting here that, as discussed in #13, the multi_gpu branch has been merged into master.

@demorest
Collaborator

demorest commented Dec 5, 2018

I took another look at this today. Reducing the number of cudaSetDevice and cudaGetDevice calls did not have any effect on the multi-GPU performance, so I think I will leave that code as-is.

I think the bad scaling is, as you say, due to Python. I made a simple change to have the Image.stats() function release the GIL, and this seems to have helped a lot. I now get linear runtime scaling for up to 4 GPUs with 1k-by-1k images (this probably depends on image size and other parameters). Could you test it out sometime and see if it works in your code? All this GIL stuff is always semi-mysterious to me; I think this change is safe, but if you think otherwise please let me know ;)

For additional improvement I will still think about processing batches of N images at a time, as we discussed last week. But maybe hold off on that for now in favor of working on phasing / dynamic spectra?

@caseyjlaw
Contributor Author

I built the new version (with nice new build scripts!) but can't get the scaling to improve beyond what it did before: I see a 2x improvement for 2 GPUs, but none beyond that.
I have some ideas, so I'll play with it more on my end.
