StochasticMux Process Pool #160

Open
beasteers opened this issue Jul 13, 2020 · 2 comments

@beasteers

It would be super convenient if you could create a process pool with StochasticMux so that multiple files can be processed simultaneously with a fixed number of processes. I know you can wrap each file streamer in a ZMQStreamer, or wrap the StochasticMux itself in one, but in the first case you'd be spawning something like len(files) * epoch_size processes, paying the overhead of starting a fresh process each time; in the second, the entire mux lives in a single separate process and files are still loaded sequentially.
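For concreteness, here's a rough stdlib sketch of the desired behavior (a fixed-size pool shared across all files), using a thread pool as a stand-in for processes; `load_file` is a hypothetical per-file loader, not part of the library:

```python
from multiprocessing.pool import ThreadPool  # thread pool as a stand-in for a process pool

def load_file(path):
    # Hypothetical per-file loading work; imagine librosa.load(path) here.
    return (path, len(path))

files = ["a.wav", "bb.wav", "ccc.wav"]

# Fixed-size pool: at most 2 files are loaded at once, and the workers are
# reused instead of spawning a fresh process per (file, epoch) pair.
with ThreadPool(processes=2) as pool:
    results = list(pool.imap_unordered(load_file, files))
```

`imap_unordered` yields results as workers finish, which is the interleaving you'd want from a mux.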

I suppose this might get complicated with sharding and sampling across files, so I'm assuming that's why it hasn't been implemented.

And if I'm missing something and this is already simple to do, let me know!

@bmcfee
Collaborator

bmcfee commented Jul 13, 2020

Yeah, we've thought a lot about this kind of thing.

There's a slightly more fundamental issue here, which is that streamers are, at the end of the day, generators, and will therefore block until their current output is consumed. This means that even if you have a pool of streamers in parallel, each one will be idle most of the time, even if it lives in its own process.
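A minimal illustration of that blocking behavior in plain Python:

```python
events = []

def stream():
    # The body between yields only runs when a consumer asks for the next
    # item; until then the generator is suspended (i.e. it "blocks").
    events.append("work-1")
    yield 1
    events.append("work-2")
    yield 2

g = stream()
print(events)    # [] -- nothing has run yet: generators are lazy
first = next(g)  # resumes the body up to the first yield
print(events)    # ['work-1'] -- "work-2" won't run until we consume again
```

So a pool of streamers sitting in separate processes would still spend most of its time suspended at a `yield`.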

I think the best way to resolve this is to first go asynchronous #30, and then, once that works, go parallel with the async generators. Making this work efficiently will probably also require some kind of buffering structure, e.g. so that a generator can spin ahead for up to some fixed number of outputs (to be collected later) before it blocks.
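One way to sketch that buffering idea with the stdlib: a background thread drains the generator into a bounded queue, so it can run ahead by up to `buffer_size` outputs before blocking (names and sizes here are made up for illustration):

```python
import threading
import queue

_SENTINEL = object()  # marks the end of the wrapped generator

def buffered(gen, buffer_size=4):
    """Run `gen` in a background thread, letting it produce up to
    `buffer_size` outputs ahead of the consumer before it blocks."""
    q = queue.Queue(maxsize=buffer_size)

    def worker():
        for item in gen:
            q.put(item)  # blocks once the buffer is full
        q.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item
```

The consumer still sees an ordinary generator, but up to `buffer_size` items are computed ahead of time.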

@beasteers
Author

beasteers commented Jul 13, 2020

Hmm, yeah, I've never had much luck with asyncio. It seems like you need to rewrite everything using async modules, which is quite daunting. (For instance, I don't understand how you would use librosa.load asynchronously.) But I can see how it could be the better option.

Maybe this won't work, but from a quick poke at how ZMQStreamer is implemented, you might be able to do it if you had a pool equal in size to the number of active streamers; then, when one of the active streamers is exhausted and you need to activate a new one, you pool.submit it as a new ZMQ worker.

(This is under the assumption that zmq.socket.send_multipart is non-blocking, but I'm not sure it is.)

And if it is blocking, maybe you could have a thread that collects the arrays into a fixed-length queue that blocks when it's full (the buffering structure you mentioned).
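Combining those two ideas (a fixed worker pool plus a bounded collection queue), a rough stdlib sketch could look like this, using threads instead of ZMQ worker processes; all names here are hypothetical:

```python
import threading
import queue

_DONE = object()  # each worker pushes this once it runs out of streams

def pooled_mux(stream_factories, n_workers=2, buffer_size=8):
    """Interleave many streams with a fixed-size worker pool: each worker
    drains one stream at a time into a shared bounded queue, picking up a
    fresh stream whenever its current one is exhausted."""
    pending = queue.Queue()
    for factory in stream_factories:
        pending.put(factory)
    out = queue.Queue(maxsize=buffer_size)

    def worker():
        while True:
            try:
                factory = pending.get_nowait()
            except queue.Empty:
                out.put(_DONE)
                return
            for item in factory():
                out.put(item)  # blocks when the shared buffer is full

    for _ in range(n_workers):
        threading.Thread(target=worker, daemon=True).start()

    finished = 0
    while finished < n_workers:
        item = out.get()
        if item is _DONE:
            finished += 1
        else:
            yield item
```

This interleaves outputs in whatever order workers produce them, which loses the weighted sampling that StochasticMux does; it only illustrates the pool-plus-queue plumbing.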

(Also, does keras work with asyncio generators? I feel like it's probably not implementing an event loop, but maybe it is.)
