
Replace interior Reader without resetting decompress data #276

Open
jnbooth opened this issue Jul 9, 2021 · 10 comments

@jnbooth

jnbooth commented Jul 9, 2021

If I'm parsing a continuous encoded stream that is (unfortunately) divided up across several successive inputs, my only option appears to be using the Decompress object directly, which is a huge hassle without access to the zio:: functions.

It seems like it would be easy to add a "replace" method identical to the various "reset" methods, but with the reset_decoder_data(self) call removed. That way, I could swap out one Read object for the next while preserving the stream state.

@Gelbpunkt

I would love this exact same feature, but for ZlibEncoder. Right now I can't reuse a ZlibEncoder object, and using Compress directly (https://gist.github.com/Gelbpunkt/44a6fb4cfe15f15e4c3757416b7021b1) doesn't work when I reuse the object.
It does work when I call reset on the Compress object before compressing again, but that seems not to preserve the encoding parameters: zlib reports invalid stored block lengths when the same decoder (in Python) decodes multiple payloads that were created by encoding and then resetting the Compress object.

@Byron
Member

Byron commented Sep 13, 2023

I am probably not following on some intricacies of the problem, but wonder if it's possible to create a custom Write/Read that is Clone and uses shared ownership internally. That way it would be possible to keep a handle to it after passing it to types in this crate, and use interior mutability to swap out any state as you see fit.

I'd find that, in theory, more suitable as first step towards solving this particular problem as it's much less involved. Of course, it also delegates the hard parts of managing the internal decompressor state correctly which certainly wouldn't be trivial, but should be possible.
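Byron's suggestion can be sketched with std types only: a cloneable Read whose inner reader lives behind an Arc<Mutex<...>>, so the clone handed to a decoder and a handle kept by the caller share state. All names here (SwappableReader, replace) are illustrative and not part of flate2; this is a minimal sketch of the shared-ownership idea, not a complete solution.

```rust
use std::io::{self, Read};
use std::sync::{Arc, Mutex};

// Hypothetical swappable reader (not part of flate2): clones share the
// same inner reader, which any clone can replace at runtime.
#[derive(Clone)]
struct SwappableReader {
    inner: Arc<Mutex<Box<dyn Read + Send>>>,
}

impl SwappableReader {
    fn new(r: impl Read + Send + 'static) -> Self {
        Self { inner: Arc::new(Mutex::new(Box::new(r))) }
    }

    // Swap in a new underlying reader without touching any consumer state.
    fn replace(&self, r: impl Read + Send + 'static) {
        *self.inner.lock().unwrap() = Box::new(r);
    }
}

impl Read for SwappableReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.lock().unwrap().read(buf)
    }
}

fn demo() -> Vec<u8> {
    let handle = SwappableReader::new(io::Cursor::new(b"first ".to_vec()));
    // This clone is what you would pass to GzDecoder::new(...).
    let mut consumer = handle.clone();
    let mut out = Vec::new();
    consumer.read_to_end(&mut out).unwrap(); // drains the first reader
    handle.replace(io::Cursor::new(b"second".to_vec()));
    consumer.read_to_end(&mut out).unwrap(); // continues with the new one
    out
}

fn main() {
    assert_eq!(demo(), b"first second".to_vec());
}
```

Note that this only solves handing off the underlying reader; the decoder's own state at a partial-data boundary is a separate problem, which is what the rest of the thread discusses.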

@nc-x

nc-x commented Jun 5, 2024

I tried swapping out the internal reader of GzDecoder using interior mutability, but it does not work:

  • You have to swap when the first internal reader has been completely read.
  • However, when the first reader is completely read, you get a corrupt deflate stream error (because that reader only held partial data).
  • Ignoring the error and swapping the reader does nothing, because GzDecoder is internally in an error state and goes into an infinite loop.
  • If there were a way to know the decoded size of a reader in advance, we could read_exact it, but knowing that in advance is impossible, so...

I don't think this can be done without additional support from the library itself.

@Byron
Member

Byron commented Jun 5, 2024

Couldn't a multi-member (continuous) stream be handled by a MultiGzDecoder?

I think what would really help here is an example program with real (but minimal) input data.

@nc-x

nc-x commented Jun 5, 2024

AFAICT, MultiGzDecoder is for a gzip stream containing multiple concatenated members, which is not the use case here. The use case here is, for example, downloading a huge gzip file using range requests. Reqwest's Response object implements Read, so it can be used with GzDecoder or MultiGzDecoder, but once a given response is decoded, you make another range request to get the next piece of data and decode it in a streaming fashion, typically because the data is huge and you want to filter out some of it or parse it directly, etc.

But because every range request yields a different response, an ability to swap the internal reader is required while keeping the internal state sane.

I do have some code for this, but I don't have a public gzip URL to try it on. Let me see if I can do something about the example code.

@the8472
Member

the8472 commented Jun 5, 2024

However, when the first reader is completely read, you get an error: corrupt deflate stream (because the stream reader only had partial data).

Perhaps handling this as a WouldBlock error like this test does could work?

flate2-rs/src/gz/read.rs

Lines 332 to 350 in 1a0daec

```rust
impl Read for BlockingCursor {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // use the cursor, except it turns EOF into a blocking error
        let r = self.cursor.read(buf);
        match r {
            Err(ref err) => {
                if err.kind() == ErrorKind::UnexpectedEof {
                    return Err(ErrorKind::WouldBlock.into());
                }
            }
            Ok(0) => {
                // regular EOF turned into blocking error
                return Err(ErrorKind::WouldBlock.into());
            }
            Ok(_n) => {}
        }
        return r;
    }
}
```

Then on error the next chunk can be filled in.
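The driving loop for that approach can be sketched without flate2: a source that reports WouldBlock instead of EOF, and a caller that refills it on WouldBlock and retries. The names (ChunkSource, refill, drain) are illustrative; in real use the loop would read from a GzDecoder wrapping such a source, and refill would push the next downloaded chunk.

```rust
use std::io::{self, Cursor, ErrorKind, Read};

// Source that reports WouldBlock instead of EOF, in the spirit of the
// BlockingCursor test quoted above (names here are illustrative).
struct ChunkSource {
    cursor: Cursor<Vec<u8>>,
}

impl ChunkSource {
    // Append the next downloaded chunk after the data read so far.
    fn refill(&mut self, chunk: &[u8]) {
        self.cursor.get_mut().extend_from_slice(chunk);
    }
}

impl Read for ChunkSource {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.cursor.read(buf) {
            // Out of data: signal "would block" rather than EOF.
            Ok(0) => Err(ErrorKind::WouldBlock.into()),
            other => other,
        }
    }
}

// Driver: read until WouldBlock, feed in the next chunk, retry. A real
// program would read from a GzDecoder<ChunkSource> here instead.
fn drain(chunks: &[&[u8]]) -> Vec<u8> {
    let mut source = ChunkSource { cursor: Cursor::new(Vec::new()) };
    let mut pending = chunks.iter();
    let mut out = Vec::new();
    let mut buf = [0u8; 8];
    loop {
        match source.read(&mut buf) {
            Ok(n) => out.extend_from_slice(&buf[..n]),
            Err(e) if e.kind() == ErrorKind::WouldBlock => match pending.next() {
                Some(chunk) => source.refill(chunk),
                None => break, // no further chunks: treat as real EOF
            },
            Err(e) => panic!("unexpected error: {e}"),
        }
    }
    out
}

fn main() {
    assert_eq!(drain(&[b"ab".as_slice(), b"cdef".as_slice()]), b"abcdef".to_vec());
}
```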

@jongiddy
Contributor

jongiddy commented Jun 5, 2024

Either:

  1. Create your own struct that implements Read. When it has no data it performs the next Range request to fill its buffer. It only returns 0 when there are no more ranges to request. Submit this struct as the reader to read::GzDecoder::new. (I believe this is the solution referred to in Byron's comment above). It is probably more efficient to implement BufRead and use bufread::GzDecoder.
  2. Use a write::GzDecoder. Create a loop to read Range requests and write the returned data to the write::GzDecoder. This is the only solution if the Range requests use async code.
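Option 1 can be sketched without any HTTP or flate2 code by faking the range requests as a sequence of byte chunks; the RangedReader name and structure are illustrative. The key property is that read returns Ok(0) only once every range is exhausted, so a decoder wrapped around it sees one continuous stream.

```rust
use std::io::{self, Read};

// Illustrative reader that stitches successive "range responses" into one
// continuous stream. In real code, fetching the next chunk would issue an
// HTTP Range request instead of pulling from a Vec.
struct RangedReader {
    ranges: std::vec::IntoIter<Vec<u8>>,
    current: io::Cursor<Vec<u8>>,
}

impl RangedReader {
    fn new(ranges: Vec<Vec<u8>>) -> Self {
        Self {
            ranges: ranges.into_iter(),
            current: io::Cursor::new(Vec::new()),
        }
    }
}

impl Read for RangedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        loop {
            let n = self.current.read(buf)?;
            if n > 0 || buf.is_empty() {
                return Ok(n);
            }
            // Current range drained: fetch the next one, or report real EOF.
            match self.ranges.next() {
                Some(chunk) => self.current = io::Cursor::new(chunk),
                None => return Ok(0),
            }
        }
    }
}

fn demo() -> String {
    // A decoder would wrap this reader, e.g. GzDecoder::new(RangedReader::new(...)).
    let mut r = RangedReader::new(vec![b"gzip ".to_vec(), b"data".to_vec()]);
    let mut out = String::new();
    r.read_to_string(&mut out).unwrap();
    out
}

fn main() {
    assert_eq!(demo(), "gzip data");
}
```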

@nc-x

nc-x commented Jun 6, 2024

I have created a repo with an example program: https://github.com/nc-x/flate-276

Create your own struct that implements Read. When it has no data it performs the next Range request to fill its buffer. It only returns 0 when there are no more ranges to request. Submit this struct as the reader to read::GzDecoder::new. (I believe this is the solution referred to in #276 (comment) above). It is probably more efficient to implement BufRead and use bufread::GzDecoder.

The issue is that when you try reading from GzDecoder, it will read past the end of the partial data it was given and fail, which is expected because technically the input is incomplete and hence invalid.

So the very first read fails, and you never get to the point where you can swap underlying readers or return 0. AFAICT, end users of GzDecoder cannot do anything about this; there needs to be a way to tell GzDecoder that it will be given partial data, so that it shouldn't fail.

I have not tried your second solution, and I'm busy for the next couple of days, but presumably it would have the same issue as well?

@jongiddy
Contributor

jongiddy commented Jun 8, 2024

I made a pull request to your repo to demonstrate the correct solution using a Read implementation.

The Read implementation of Download turns the chunked range requests into a single Read stream returning the compressed data. The GzDecoder is outside Download and only has its reader set once in the new method. It never replaces the reader.

@nc-x

nc-x commented Jun 13, 2024

thanks
