-
-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling (serialization) of nested containers #342
Comments
The docs could be better. The answer to this should probably be in the crate docs. But the answer is in other places:
You are indeed correct that there really is no one agreed upon CSV specification that one can adhere to. As the links above elaborate on, the problem with RFC 4180 is that it's too strict. Essentially, the CSV world needs its version of HTML 5 (nothing is invalid). RFC 4180 is like XHTML (lots are invalid). For pretty much exactly the same reason: tons of real world CSV data is messy, and erroring on it just isn't useful (in most cases).
Eh?
All of your proposals appear to already be possible today. All you should need to do is implement My understanding is that folks run into this nested container issue and expect it to be resolved automatically. It can't be. Not in any way that I can see without choosing a new serialization format to layer inside of CSV fields. It can't be resolved automatically because CSV doesn't support nested data. If you need to serialize nested data, then this crate can't really figure that out for you (aside from a few special cases). You either need to figure out how to convert it to a flattened representation or how to encode richer data inside of a single CSV field. I suspect this is why all three of your proposals appear to have a common thread between them: they ask for the user to resolve the nested container issue manually by supplying some kind of transformation function. (Your third proposal does suggest providing some sensible defaults, but presumably the user would still need to make a choice.) As far as I know, resolving the nested container issue manually is not actually a problem because I believe it can already be done via Serde's framework. So there really isn't an issue there. Instead, what I suspect people would like is something like, "hey if I have a nested container I don't want to hear about it, just JSON serialize it and stuff it into a single CSV field." But I'm pretty philosophically opposed to something like personally, and I don't perceive it as a common enough problem to really worry about too much. |
Oh, thanks, I missed that 🙈. I looked in multiple places (but focused on the
Yes, I was focused on the For the
Oops, sorry - that's what I meant but I somehow wrote
Yes, that would be ideal. In my "exotic" use case (sorry that it was quite hidden and too brief at the bottom in the "PS:") this isn't easily possible (AFAIK) as the type gets generated during compile time and depends on the API specification. It should theoretically be possible to either generate the types in advance (but this becomes problematic when the API specification changes) or to use reflection to transform the data at run time (this is what we're currently exploring but makes it more complex). I guess another limitation should be that this approach becomes problematic when serializing to multiple different formats as there can only be a single
Yes, agreed :)
Right
Yes. That could make sense if we find a few good, generic, and "universal" solutions but I guess it'd be better to just support the custom transformation function and put the possible solutions as examples in the documentation.
Ideally :) But I agree that this is just not possible with CSV. Anyway, that custom user function (more of a hack than a proper solution) could make sense if the two potential limitations of the |
You would likely need to generate the
Yeah you probably need to use newtypes or even build up the infrastructure yourself to call different serialization functions. (This might require embedding the functions on the data type? Not sure.)
My suspicion is that there is an isomorphism here where anything the The other hesitance you'll run into here is that the Serde integration in this crate is absolutely hideous. I would encourage you to look at it. And the vast majority of all issues/bugs/feature-requests on this repository are related to Serde integration. It is just overall simultaneously miserable to support (for CSV specifically) but also extremely convenient. What this means is that even if you're stuck, it's unlikely there's any reasonable path forward in this crate itself in any reasonable time span. |
Yes, my initial hope was that I could override the Having to replace all vectors with a custom vector type is unfortunately likely a dealbreaker (at least with my limited Rust knowledge) for my use case since the types are structures with quite a few fields and everything gets generated from the API specification. At that point it would probaboy be much easier to use a different approach/hack to handle the serilization.
That would also be handy for my use case. I'd like to call a custom serialization function for I'll look around a bit more but it doesn't look good for avoiding custom types. In case it helps someone: Here's a PoC/hack how one could serialize vectors into a single CSV field using a custom type: use anyhow::Result;
use csv::WriterBuilder;
use serde::Serialize;
use serde::Serializer;
struct MyVec<T>(Vec<T>);
impl<T> Serialize for MyVec<T>
where
T: Serialize + std::fmt::Debug,
{
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: Serializer,
{
let s = format!("{:?}", self.0);
serializer.serialize_str(&s)
}
}
#[derive(serde::Serialize)]
struct Record {
a: MyVec<i32>,
b: MyVec<i32>,
}
fn main() -> Result<()> {
let record = Record {
a: MyVec(vec![1, 2, 3]),
b: MyVec(vec![4, 5, 6]),
};
let mut wtr = WriterBuilder::new()
.has_headers(false)
.from_path("out.csv")?;
wtr.serialize(record)?;
wtr.flush()?;
Ok(())
} The result: $ cat out.csv
"[1, 2, 3]","[4, 5, 6]" |
I wasn't necessarily thinking about custom impls for vecs, but custom impls for your API types. |
And serde_derive lets you specify custom serialization functions for individual fields without introducing your own newtype wrapper. |
Yes, but again, the problem is that the implementation/source-code of the API types gets auto-generated by a crate and a custom build script (plus I wouldn't really like to write a serializer for a large struct just to modify how a few fields should get serialized). But I'd call that an exotic use case and it's pretty broken tbh (I never liked the idea of it in the first place). That crate is also very limited and should even be unmaintained so we'll better replace that part :)
Right, the Anyway, we can close this issue if you want. I'm not that happy with the current default but, considering your responses, I also don't really see a significantly better solution anymore (and prohibiting nested containers by default for correctness also doesn't sound really good). |
Aye. I'll leave this issue open for now. I will likely (some day) revisit the Serde integration in this crate and see whether it can be improved holistically. At the very least, we'll get better docs. |
I haven't found any prior discussions on this so I though it might be useful to open an issue for it as it's also a feature request and somewhat of a correctness bug.
The current behavior of this crate is to try to flatten nested containers:
https://docs.rs/csv/1.3.0/csv/struct.Writer.html#rules
This design decision is understandable as CSV is pretty limited and simply doesn't support nested containers (vs. JSON, etc.).
However, I do wonder if this is a good default as it interferes with correctness.
Disclaimer: My knowledge of CSV, Rust, and especially this crate is limited but here are my considerations:
The main problem seems to be that there is no official CSV specification. I've used https://en.wikipedia.org/wiki/Comma-separated_values and https://datatracker.ietf.org/doc/html/rfc4180 as references.
(Aside: Perhaps it would also make sense to document to which specification this crate intends to conform to?)
A good rule seems to be the following:
IMO this is a requirement for correctly parsing CSV files (with some exceptions like time series), especially if the first record is a "header" - otherwise there just isn't enough context to understand the structure of the data.
This crate seems to support that rule as well:
https://docs.rs/csv/1.3.0/csv/struct.Reader.html#error-handling
The writer also enforces that rule by default:
https://docs.rs/csv/1.3.0/csv/struct.WriterBuilder.html#method.flexible
So the default is to flatten nested containers (like a
Vec
in aStruct
) and error untilflexible(false)
is used.It certainly makes sense that
flexible
defaults totrue
but I'd prefer if there was an option to support nested containers without changing the number of fields.I propose the following ideas (not sure how feasible they are though):
1. Let the user supply a custom string function to handle the transformation
If possible, this crate would call that function each time it backtracks from such a nested container (optionally with the depth level?). The user could then choose an escaping technique to ensure that the string will represent a single CSV field. This approach is limted as the type information / context is lost but one could at least do simple transformations like quoting and escaping or replacing the delimiters.
Using #254 as an example, the user could use that custom funtion to rewrite
4,5,6,7
(Vec<i64>
) to, e.g.,"4,5,6,7"
,4;5;6;7
, or[4,5,6,7]
(like in the desired example). Additional recursion levels could be supported through nested quoting, more delimiters (in that case the function should be called with the depth level), or custom approaches.Ideally, this would be pretty easy to integrate and flexible enough for most use cases.
2. Let the user supply a custom function that handles the nested containers
Similar to 1. but the function would get the raw data and produce the string. The type information would be preserved but it would require reflection, increase the complexitiy, and could be considered out of scope.
3. Offer generic techniques to ensure nested containers can be put into a single CSV field
This crate would implement one (or multiple) of the (hopefully pretty universal) custom string functions mentioned in 1. (possibly forcing the introduction of additional constraints).
IMO my first proposal could be a decent tradeoff but I might've missed something.
What do you think @BurntSushi?
PS: I'm currently looking into a somewhat "exotic" use case with @ammernico (#254 (comment)) where the types come from an API specification and mainly consist of structures that contain some vectors. This use case makes it difficult to flatten/convert the vectors into a string before passing the data to this CSV crate.
PPS: Huge thanks for this very useful crate and amazing documentation! :)
The text was updated successfully, but these errors were encountered: