add pytorch example #133
Comments
First, thanks for implementing this! I haven't spent much time with pytorch yet, but the above all makes sense from what I've read in the api docs. The index thing is a real problem, and a point which I suspect that I'd disagree with the pytorch devs on. (TLDR: I prefer iid sampling with replacement to epoch/index-based training.) I don't see a great solution beyond ignoring the index right now. One potential solution down the road would be to use coroutines for generation, so that the index for the next sample can be passed back into the streamer. This would require a pretty severe rewrite of pesc, but it might be worth considering as part of #30. Maybe @cjacoby has thoughts?
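For illustration, a minimal sketch of the coroutine idea using a plain Python generator (rather than asyncio): the consumer can pass the index of the next sample back into the streamer via `send()`. The `dataset` here is just a placeholder list.

```python
def indexed_streamer(dataset):
    """Yield samples while letting the consumer send back the next index."""
    index = 0
    while True:
        requested = yield dataset[index]
        # honor an index passed in via .send(); otherwise just advance
        index = requested if requested is not None else index + 1


stream = indexed_streamer(list(range(100)))
first = next(stream)      # prime the generator, returns dataset[0]
sample = stream.send(42)  # the training loop asks specifically for index 42
```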
I am at a conference, but I'll try to respond on Monday! I have been looking at pytorch a bit lately, so I've been thinking about it a little.
Okay, several thoughts on this, after looking around pytorch a bit more and thinking about this some.
```python
import pescador

# `streams` is a list of pescador.Streamer objects, one per track
# randomly sample from streamers
mux = pescador.StochasticMux(streams, nb_tracks, rate=None, mode='exhaustive')
buffered_sample_gen = pescador.buffer_stream(mux, batch_size)
buffered_zmq = pescador.ZMQStreamer(buffered_sample_gen)

# iterate over data
for batch in buffered_zmq:
    print(batch['X'].mean())
    # you can still train with this batch
    model(batch['X'])  # not sure exactly what goes here but you get the idea
```
@cjacoby thanks for your input. I thought about this a bit more and tried many things. In the end I agree: you lose a lot of flexibility by using the pytorch `DataLoader`. On the downside, I found that using the pescador buffered_zmq is a lot slower than random sampling using the pytorch samplers.

@bmcfee The remaining question is, do you want to advertise pescador + pytorch = ❤️ now, or just wait until one of the many (1, 2, 3, 4, 5) PyTorch high-level API packages is ready for prime time?
This I suspect is due more to the inherent slowness of buffering (invoking many data copies), though zmq overhead can also hurt a bit. Is it much worse to use the unbuffered zmq stream and let pytorch handle buffering? This is how I typically do it with keras, and it seems to work pretty well.
It seems a bit premature, eh? I haven't looked at any of the other packages you mentioned -- do they all manage data streaming, or are they more keras-like in functionality?
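For illustration, a rough sketch of that unbuffered variant: single samples travel over zmq and the batching is left entirely to pytorch. It assumes `mux` from the snippet above and uses an `IterableDataset` wrapper, which only exists in later pytorch versions.

```python
import pescador
from torch.utils.data import DataLoader, IterableDataset


class StreamDataset(IterableDataset):
    """Expose a pescador stream as a pytorch iterable dataset."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream

    def __iter__(self):
        # one sample at a time; collation into batches happens in the DataLoader
        return iter(self.stream)


zmq_stream = pescador.ZMQStreamer(mux)  # unbuffered: no pescador-side batch copies
loader = DataLoader(StreamDataset(zmq_stream), batch_size=16)
```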
I have been working a bit on #30. Progress is a little slow, because I haven't used asyncio in a long time and I'm having to relearn how to use it, but I'm making progress.
I've now had some more time to dig into pytorch and got nice results together with pescador.

Regarding the high-level packages: the most promising one (also developed under the pytorch umbrella) is ignite, but it doesn't come with any convenient way to consume generators, though it supports pytorch dataloaders. I'd say we wait a few more months till 1.0 appears on the horizon and evaluate again. For now, I essentially followed @cjacoby's advice to stick with a for-loop style training and did not use pytorch's dataloader classes, for obvious reasons. That seems like the best option until some of the high-level packages become more popular. I think the issue can be closed till then.
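For concreteness, a sketch of that for-loop style training, reusing `buffered_zmq` from the earlier snippet; the model, loss, and the `'X'`/`'Y'` batch keys are placeholders.

```python
import torch

model = torch.nn.Linear(100, 1)            # placeholder model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

for batch in buffered_zmq:
    # pescador delivers numpy batches; convert them to tensors by hand
    x = torch.from_numpy(batch['X']).float()
    y = torch.from_numpy(batch['Y']).float()

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```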
actually an …
back with some tests on pytorch 1.2 (it's available as a nightly build). The new `IterableDataset` class makes it straightforward to wrap a pescador mux:

```python
import numpy as np
import pescador
import torch


def excerpt_gen(track, excerpt_length, excerpt_hop):
    for i in range(0, track.shape[0] - excerpt_length, excerpt_hop):
        yield track[i:i + excerpt_length, :], track[i:i + excerpt_length, :]


class TrackChunksDataset(torch.utils.data.IterableDataset):
    def __init__(
        self,
        nb_tracks=100,
        track_length=1000,
        excerpt_length=100,
        excerpt_hop=100
    ):
        tracks = (np.random.random((track_length, 1)) for i in range(nb_tracks))
        streams = [
            pescador.Streamer(
                excerpt_gen, track, excerpt_length, excerpt_hop
            ) for track in tracks
        ]
        self.mux = pescador.StochasticMux(
            streams, nb_tracks, rate=None, mode='exhaustive'
        )

    def __iter__(self):
        return self.mux.iterate()


dataset = TrackChunksDataset()
train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=16,
    num_workers=0
)

for x, y in train_loader:
    x.mean()
    y.mean()
```

Increasing the number of workers does obviously not speed up the sampling, since pytorch copies the dataset to the different workers. They do provide a way to retrieve the worker info inside the dataset, so the streamers could be split across the workers. @bmcfee do you have an idea what would be a good example for the pescador gallery for such a splitting?
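For what it's worth, one possible shape for such a splitting: each DataLoader worker keeps a disjoint slice of the streamers, selected via `torch.utils.data.get_worker_info()`. The class name and slicing scheme are only illustrative.

```python
import pescador
import torch


class ShardedTrackChunksDataset(torch.utils.data.IterableDataset):
    def __init__(self, streams):
        super().__init__()
        self.streams = streams

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        if info is None:
            shard = self.streams                             # single-process loading
        else:
            shard = self.streams[info.id::info.num_workers]  # disjoint per worker
        mux = pescador.StochasticMux(
            shard, len(shard), rate=None, mode='exhaustive'
        )
        return mux.iterate()
```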
What if …
yes, that's what I would do in practice. I can modify the example.

Sorry, I didn't get this part. To parallelize the loading I would just split the streamers across the workers.
I don't think these would necessarily be compatible -- they're doing similar kinds of things. All I was trying to say is that it's possible to share data by reference across processes (on one machine) using current pesc stuff, and that pytorch doesn't seem to support that (yet).
yes, that is true. Despite its name, … Nonetheless, should I prepare a PR for a simple pytorch 1.2 example based on the Keras example?
I have been playing with the following:

```python
import pescador
from torch.utils.data import DataLoader, IterableDataset


class _PyTorchDataset(IterableDataset):
    def __init__(self, stream):
        super().__init__()
        self.stream = stream

    def __iter__(self):
        return self.stream.iterate()


pescador.maps.pytorch_dataset = _PyTorchDataset
```

... that can then be used more or less in the same way as the other `pescador.maps`:

```python
data_loader = DataLoader(pescador.maps.pytorch_dataset(my_pescador_stream),
                         batch_size=32, pin_memory=True)
```

It does the job because all that `DataLoader` needs is something it can iterate over.
@hbredin indeed, that seems to be a good combination. Did you evaluate the performance of this?
@faroit Not yet, no. This is on my TODO list with no ETA :-)
I checked: using … For some reason, I thought …
As more and more people use pytorch now, I wonder if we can have a pytorch example to use with pescador?
In fact, I already tried a few things...
Let's assume the infamous (among audio researchers) scenario of randomly sampling small excerpts from longer audio tracks, where we want to see all the data once per epoch in random order:
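Something along these lines, for instance (the excerpt generator and shapes are placeholders; the same mux/buffer setup appears in the snippets elsewhere in the thread):

```python
import numpy as np
import pescador


def excerpts(track, length=100, hop=100):
    # yield fixed-length excerpts as dicts, which is what buffer_stream expects
    for i in range(0, track.shape[0] - length, hop):
        yield dict(X=track[i:i + length, :], Y=track[i:i + length, :])


tracks = [np.random.random((1000, 1)) for _ in range(100)]
streams = [pescador.Streamer(excerpts, track) for track in tracks]

# exhaustive mux: every excerpt of every track is seen once per "epoch"
mux = pescador.StochasticMux(streams, len(streams), rate=None, mode='exhaustive')

for batch in pescador.buffer_stream(mux, 16):
    print(batch['X'].shape)
```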
While this would obviously work with pytorch directly by feeding in batches of data, I wonder if we could leverage the pytorch dataset and dataloader classes to simplify the code and maybe utilize pytorch's internal parallelisation within the dataloader.
It turns out pytorch does allow overriding the `Sampler` and `BatchSampler` classes (see here) for their dataloader. But since they are all based on indices into your dataset class, using pescador for this wouldn't be exactly elegant (or I just miss the point).

For now, I came up with the following, which works and yields the same batches as the vanilla example above. It works by extending the dataset class so that the dataloader just provides more samples and ignores the index.
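A sketch of what that could look like: a map-style `Dataset` whose `__getitem__` deliberately ignores the requested index and just pulls the next sample from the mux. Names and the nominal epoch length are placeholders, and `mux` is the one from the snippet above.

```python
import torch


class IgnoreIndexDataset(torch.utils.data.Dataset):
    """Map-style dataset that serves pescador samples regardless of the index."""

    def __init__(self, mux, samples_per_epoch):
        self.iterator = mux.iterate()
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        # nominal epoch length, only consulted by the DataLoader's sampler
        return self.samples_per_epoch

    def __getitem__(self, index):
        sample = next(self.iterator)   # the requested index is ignored on purpose
        return sample['X'], sample['Y']


loader = torch.utils.data.DataLoader(IgnoreIndexDataset(mux, 1000), batch_size=16)
```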
I would love to hear your feedback on this, and of course I would be happy to make a PR once we've agreed on an elegant solution.