StochasticMux Process Pool #160
It would be super convenient if you could create a process pool with `StochasticMux` so that multiple files can be processed simultaneously with a fixed number of processes. I know you can wrap each file streamer in a `ZMQStreamer`, or wrap `StochasticMux` in one, but in the first case you'd be spawning something like `len(files) * epoch_size` processes, with the overhead of starting a fresh process each time, and in the second the entire mux lives in a single separate process and files are still loaded sequentially.

I suppose this might(?) get complicated with sharding + sampling of files, so I'm assuming that's why it hasn't been implemented.
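For reference, a rough sketch of the two workarounds described above, assuming pescador's 2.x API (`files`, `data_gen`, and the mux parameters are placeholders, and the exact signatures may differ slightly):

```python
import numpy as np
import pescador

files = ['a.npz', 'b.npz', 'c.npz']   # placeholder file list

def data_gen(fname):
    # Placeholder loader: yield sample dicts of numpy arrays from one file.
    for _ in range(128):
        yield {'X': np.random.randn(8)}

streamers = [pescador.Streamer(data_gen, f) for f in files]

# Option 1: a ZMQStreamer around each file streamer. Parallel, but the
# mux spawns a fresh worker process every time it (re)activates one.
per_file = pescador.StochasticMux(
    [pescador.ZMQStreamer(s) for s in streamers], n_active=4, rate=64)

# Option 2: a single ZMQStreamer around the whole mux. Only one extra
# process, inside which files are still read sequentially.
whole_mux = pescador.ZMQStreamer(
    pescador.StochasticMux(streamers, n_active=4, rate=64))

for sample in whole_mux(max_iter=1000):
    pass  # train on sample
```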
Yeah, we've thought a lot about this kind of thing. There's a slightly more fundamental issue here, which is that streamers are, at the end of the day, generators, and will therefore block until their current output is consumed. This means that even if you have a pool of streamers in parallel, each one will be idle most of the time, even if it lives in its own process.

I think the best way to resolve this is to first go asynchronous (#30), and then, once that works, go parallel with the async generators. Making this work efficiently will probably also require some kind of buffering structure, e.g. so that a generator can spin for up to some fixed number of outputs (to be collected later) until it blocks.
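To make that concrete, here is a minimal stdlib-only sketch (not pescador code) of async generators feeding a bounded `asyncio.Queue`, so each source can run up to a fixed number of outputs ahead before it blocks:

```python
import asyncio

async def stream(name, n=5):
    # Stand-in for an async data loader; sleep() simulates slow I/O.
    for i in range(n):
        await asyncio.sleep(0.01)
        yield (name, i)

async def merged(sources, buffer=8):
    # Drive every source concurrently. Each pump task keeps producing
    # until the shared queue is full, i.e. a source can get up to
    # `buffer` outputs ahead of the consumer before it blocks.
    q = asyncio.Queue(maxsize=buffer)
    DONE = object()

    async def pump(src):
        async for item in src:
            await q.put(item)
        await q.put(DONE)  # signal that this source is exhausted

    tasks = [asyncio.create_task(pump(s)) for s in sources]
    remaining = len(tasks)
    while remaining:
        item = await q.get()
        if item is DONE:
            remaining -= 1
        else:
            yield item

async def main():
    async for item in merged([stream('a'), stream('b'), stream('c')]):
        print(item)

asyncio.run(main())
```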
Hmm yeah, I've never had much luck with asyncio. It just seems like you need to rewrite everything using async modules, which seems quite daunting. (Like, I don't understand how you would use ….)

Maybe this won't work, but from a quick poke at how … (this is under the assumption that … is blocking). And if it is blocking, maybe you could have a thread that collects the arrays in a fixed-length queue that blocks when it's full (the buffering structure you mentioned).

(Also, does keras work with asyncio generators? Because I feel like it's probably not implementing an event loop, but idk, maybe it is.)
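That thread-plus-bounded-queue idea might look roughly like this (`buffered` is a hypothetical helper, not part of pescador; any blocking iterator, e.g. one reading from a ZMQ socket, could stand in for `gen`):

```python
import queue
import threading

_DONE = object()

def buffered(gen, maxsize=8):
    # Consume `gen` in a background thread. The thread keeps pulling
    # (e.g. from a blocking recv) until the queue holds `maxsize`
    # items, then blocks on put() until the consumer catches up.
    q = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            for item in gen:
                q.put(item)      # blocks while the buffer is full
        finally:
            q.put(_DONE)         # signal exhaustion (or an error)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

# Usage: wrap any blocking iterator.
for batch in buffered(iter(range(100)), maxsize=16):
    pass
```

One nice property of this approach is that the result is a plain synchronous generator, so a keras-style training loop could consume it without needing an event loop at all.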