
Buffered parsing: add parse() method, and return Tokens from take_while() and take() #13

Merged
jsdw merged 20 commits from buffered-parsing into master on Nov 14, 2023

Conversation

@jsdw (Owner) commented Nov 11, 2023

This takes the [BufferedTokens] stuff from @Easyoakland's PR (#11) and makes a few tweaks with the following rationales:

  • Don't add a separate Buffer associated type for now, because I'm not sure how I feel about it:
    • When you need e.g. a &str buffer, is it better to use StrTokens directly (perhaps we could allow one to easily take slices of it given some locations)? Does exposing the buffer type sort of go against the "iterator centric" style of this library?
    • If you do set the buffer type to e.g. &str, do you limit the possible Tokens impls that would work with it enough that there's no point trying to be generic now anyway?
  • Buffered values return Result and not Option, because FromStr already has a variant for handling empty cases, so we don't need separate handling (I tweaked a function to avoid this too).
  • Tweak StackString (I think I prefer that name to ArrayString; the differentiating point, after all, is that it's on the stack) to limit it to taking chars only, with a push fn which accepts chars and not bytes. My goal here was to limit the ways that it can panic (now it will only panic if it runs out of space, which seems reasonable).
  • Tweak some of the examples and tests.
  • Upgrade a couple of FnMuts to FnOnces because we only need to call them once.
  • Make the buffered functions not consume any tokens if parsing fails (this is consistent with how the other functions work).

@Easyoakland what do you think?

@jsdw changed the title from "Buffered parsing" to "Buffered parsing: buffered() method and associated bits." on Nov 11, 2023
@jsdw (Owner) commented Nov 11, 2023

I'm still feeling somewhat on the fence about this:

  • The functions that you can call in buffered() are (currently) all related to string parsing (unlike the core combinators of this library).
  • Some of them are just very tiny wrappers around parse_while anyway, so maybe they don't carry their weight.
  • signed_digits is sort of useful, but also stands somewhat on its own here; it's even more specific than the rest.

So a part of me is wondering about moving the parse_* methods onto the main Tokens trait, or perhaps putting them behind a parse() function instead of buffered(), so that they are a little more discoverable and it's more obviously just "string parsing functions that happen to require a buffer". And then I would probably ditch the other methods for now (and maybe StackString too; it can panic quite easily and maybe it's best if that is user provided; nice that it's possible though!).

@jsdw (Owner) commented Nov 11, 2023

I went and renamed buffered to parse, and removed all but the 3 "core" methods (now named from_n, from_remaining and from_filtered). I think I'm feeling better about this for now!

@jsdw changed the title from "Buffered parsing: buffered() method and associated bits." to "Buffered parsing: add parse() method, and return Tokens from take_while() and take()" on Nov 12, 2023
@jsdw (Owner) commented Nov 12, 2023

I was still a bit unhappy that things didn't play quite as well together as I wanted, but I think I am now happy with this next reworking. Now:

  • Tokens::parse() is the single method for buffered parsing, replacing parse_n, parse_remaining and parse_while.
  • Tokens::tokens_while(..) is now Tokens::take_while, and the thing it returns impls Tokens too.
  • Tokens::take(n) has been added, and also returns a thing which impls Tokens.

Together, we get back the previous functionality; we can now do:

// Equivalent to tokens.parse_n
tokens.take(3).parse::<u16, String>();

// Equivalent to tokens.parse_while
tokens.take_while(|t| t.is_alphabetic()).parse::<u8, String>();

// Equivalent to tokens.parse_remaining
tokens.parse::<Foo, String>();

I've kept buffer::StackString around for now, as much as anything as a nice example of how to make such a buffer.
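As a rough illustration, a minimal version of such a buffer could look like the following (my own sketch, not yap's actual StackString; the FromIterator<char> impl stands in for whatever bound parse() actually places on its buffer type):

struct MiniStackString<const N: usize> {
    bytes: [u8; N],
    len: usize,
}

impl<const N: usize> MiniStackString<N> {
    // Push a char; the only way this can panic is by running out of space.
    fn push(&mut self, c: char) {
        let mut tmp = [0u8; 4];
        let encoded = c.encode_utf8(&mut tmp).as_bytes();
        let end = self.len + encoded.len();
        assert!(end <= N, "MiniStackString: out of space");
        self.bytes[self.len..end].copy_from_slice(encoded);
        self.len = end;
    }

    fn as_str(&self) -> &str {
        // Valid UTF-8, because only whole encoded chars are ever copied in.
        core::str::from_utf8(&self.bytes[..self.len]).unwrap()
    }
}

impl<const N: usize> FromIterator<char> for MiniStackString<N> {
    fn from_iter<I: IntoIterator<Item = char>>(iter: I) -> Self {
        let mut s = Self { bytes: [0; N], len: 0 };
        for c in iter {
            s.push(c);
        }
        s
    }
}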

Maybe helper functions like signed_digit would be nice to have around, although I think that if I were to start adding such things, I'd add a helpers/utils module which just exposes a bunch of freestanding functions for common parsing tasks.

@Easyoakland (Contributor) commented:

Replying in order:

> When you need e.g. a &str buffer, is it better to use StrTokens directly (perhaps we could allow one to easily take slices of it given some locations)? Does exposing the buffer type sort of go against the "iterator centric" style of this library?

It is not strictly better. The reason I made it this way is so that yap_streaming could be used by swapping out the initial construction of the tokens. It would benefit there too, as it already has to collect things into an internal buffer for backtracking, so you could still get minimal copies while parsing a source that didn't start in memory.

> If you do set the buffer type to e.g. &str, do you limit the possible Tokens impls that would work with it enough that there's no point trying to be generic now anyway?

See above.

> Buffered values return Result and not Option, because FromStr already has a variant for handling empty cases, so we don't need separate handling (I tweaked a function to avoid this too).

Good point. That's better.

> Tweak StackString (I think I prefer that name to ArrayString; the differentiating point, after all, is that it's on the stack) to limit it to taking chars only, with a push fn which accepts chars and not bytes. My goal here was to limit the ways that it can panic (now it will only panic if it runs out of space, which seems reasonable).

I went with ArrayString for two main reasons. First, it matches the existing ArrayString elsewhere. Second, its defining feature isn't really that it's on the stack; it's that its storage is inline (a Box<StackString> wouldn't be on the stack, for example).

I think it's reasonable to expect the library user to import the type they are interested in from another crate, and to only use StackString/ArrayString for testing (or just add a new dev-dependency). The only reason I can think of to keep it here is for convenience, but I question whether it is actually more convenient.

> And then I would probably ditch the other methods for now (and maybe StackString too...

It looks like you agree.

> Make the buffered functions not consume any tokens if parsing fails (this is consistent with how the other functions work).

I thought I did that; if I didn't, it was an oversight. Regardless, that makes sense.

> Some of them are just very tiny wrappers around parse_while anyway, so maybe they don't carry their weight.

Agree.

> signed_digits is sort of useful, but also stands somewhat on its own here; it's even more specific than the rest.

Yeah, this method was basically the only reason I added them at all.

> I was still a bit unhappy that things didn't play quite as well together as I wanted, but I think I am now happy with this next reworking. Now: ...

That was a good idea. I'll have to test it a bit to check, but that solution looks much nicer.

> Maybe helper functions like signed_digit would be nice to have around, although I think that if I were to start adding such things, I'd add a helpers/utils module which just exposes a bunch of freestanding functions for common parsing tasks.

Agree. It totally makes sense to decide what, if anything, to add later, after the core methods are added.


The associated buffer method does allow less copying, though I'd have to benchmark to see if it matters. This proposed method is certainly elegant.

The other idea I have is to implement a parse_slice which takes a start and end TokenLocation. TakeWhile and the related types would forward calls to parse on to the wrapped token implementation's parse_slice, and then StrTokens, StreamTokens and other implementations are free to implement a more efficient parse knowing they have a buffer. Basically, change the docs on this new parse_slice (and on parse) to say "may use the provided buffer type if extra buffering is needed". You'd get the advantage of no copies without much API change.
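Roughly, the shape I have in mind is something like this (a sketch only; the trait is cut down and the exact bounds and names are made up for illustration):

use core::ops::Deref;
use core::str::FromStr;

// A cut-down stand-in for the Tokens trait, just to show the shape:
trait SliceParse {
    type Location;

    // Parse everything between `from` and `to`. The default impl would walk
    // the tokens again and collect them into `Buf`; implementations which
    // already hold a backing buffer (StrTokens, StreamTokens) could instead
    // parse a slice of that buffer directly, with no copying.
    fn parse_slice<Out, Buf>(&mut self, from: Self::Location, to: Self::Location) -> Result<Out, Out::Err>
    where
        Out: FromStr,
        Buf: FromIterator<char> + Deref<Target = str>;
}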
If this turns out to be a good idea, parse_slice could come later, since it should integrate easily with the current method.
@jsdw thoughts on this?

@Easyoakland (Contributor) commented:

The main issue I see with the parse_slice addendum is that a Tokens without an existing buffer will wind up iterating twice (once to get the final position, and once to collect up the elements), which is probably worse than just always copying the elements, and thus not worth the overhead of another method.

@jsdw (Owner) commented Nov 12, 2023

Thank you for your detailed responses!

I liked your parse_slice() idea, and also noted that we can then make parse() (and parse_slice() itself) avoid allocating with StrTokens (and hopefully the yap_streaming version) without adding an associated type Buffer at all, so I've pushed a commit which does that and am curious to see what you think!

This ticks off the main place I'd personally want to avoid allocations: the case where you use yap to parse some stringish tokens, but then need to go and call some external parsing function to turn that string into the desired type (e.g. a number).

I'm still open to thinking about adding a type Buffer, but now that we can tokens.parse() without allocating for suitable token types, I have a lot of what I'd want :)

> I think it's reasonable to expect the library user to import the type they are interested in from another crate, and to only use StackString/ArrayString for testing (or just add a new dev-dependency). The only reason I can think of to keep it here is for convenience, but I question whether it is actually more convenient.

Ok, yeah; I have been quite on the fence about keeping it around anyway, so I'll just ditch it and leave it to the user to bring their own buffer!

@jsdw (Owner) commented Nov 12, 2023

Aside from fixing a couple of bits, I hid parse_slice from the docs (use slice(from,to).parse() instead), and made Take and TakeWhile both make use of it too, so things like StrTokens can be optimal in all of the main cases.

> The main issue I see with the parse_slice addendum is that a Tokens without an existing buffer will wind up iterating twice (once to get the final position, and once to collect up the elements).

The way it's implemented at the moment, doing take_while().parse() or suchlike will do one iteration through the tokens to find out which tokens need parsing, and then:

  • In the default case, go back through them again (albeit without needing any more checks) to collect them up in the Buffer, or
  • In the StrTokens optimised impls, skip the second iteration entirely and just parse the str slice directly (sketched below).
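Concretely, the optimised case can be as simple as this standalone sketch (the field names and byte-offset locations are my own assumptions, not yap's internals):

use core::str::FromStr;

// A str-backed tokens type can treat its locations as byte offsets...
struct MyStrTokens<'a> {
    source: &'a str,
}

impl<'a> MyStrTokens<'a> {
    // ...so parsing a range is just subslicing the original str; no per-char
    // collection into a Buffer is needed at all.
    fn parse_slice<Out: FromStr>(&self, from: usize, to: usize) -> Result<Out, Out::Err> {
        self.source[from..to].parse()
    }
}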

@Easyoakland (Contributor) commented:

Yeah, that's what I was describing.

For accessing the buffer directly: if I still found it necessary, I think I could impl a new trait on the tokens, so it doesn't need to happen here.

You might want to add a benchmark (criterion) comparing .as_iter().collect::<String>().parse() vs .parse::<_, String>(), and another comparing how that double iteration affects IterTokens and StrTokens.
Hopefully the result will be positive; if so, I think this was an overall good addition!
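Such a benchmark might look roughly like this (a sketch assuming criterion and yap's IntoTokens/into_tokens() constructor; the exact calls aren't taken from this PR):

use criterion::{criterion_group, criterion_main, Criterion};
use yap::{IntoTokens, Tokens};

fn parse_benches(c: &mut Criterion) {
    // Collect into an intermediate String, then parse that:
    c.bench_function("collect_then_parse", |b| {
        b.iter(|| {
            let mut tokens = "12345 leftover".into_tokens();
            tokens
                .take_while(|t| t.is_numeric())
                .as_iter()
                .collect::<String>()
                .parse::<u32>()
        })
    });
    // Parse directly, letting StrTokens skip the intermediate buffer:
    c.bench_function("parse_direct", |b| {
        b.iter(|| {
            let mut tokens = "12345 leftover".into_tokens();
            tokens.take_while(|t| t.is_numeric()).parse::<u32, String>()
        })
    });
}

criterion_group!(benches, parse_benches);
criterion_main!(benches);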

@Easyoakland mentioned this pull request on Nov 13, 2023
@jsdw (Owner) commented Nov 13, 2023

Aah, at the time it didn't quite click that having all of the parse functions iterate once to find the start and end locations, and then likely iterate again in the default parse_slice impl, was not optimal. In order to be (I think) as optimal as possible, I've now added parse_take_while and parse_take. Now, each of Slice, Take and TakeWhile delegates parse() to the corresponding function on Tokens, which avoids needing to iterate over the tokens twice.

With this, I do agree that benchmarks would be nice to add anyway, but I won't add them here (probably an issue would be good) since I think nothing wasteful is happening now.

Good point about being able to define a trait which mentions a buffer type, if it's useful to be able to access it for other reasons :)

If there are no objections I'll probably merge this in the next day or so and put a release out! (I'm also running out of the time I want to dedicate to this for now; too many other things to do!)

@Easyoakland (Contributor) commented Nov 13, 2023

  • Doing toks.take(n).take_while(f).parse() can become a single take_while(offset_since < n && f) pass (see the sketch below). The same goes for toks.take_while(f).take(n).parse().
    • Which reminds me: IterTokens::offset() should probably be overridden to return self.cursor instead of the default self.location().offset(), since that does an extra clone.
  • You forgot to add the parse... methods to Context and ContextMut.
  • While you updated the crate example, the json.rs and README examples could also be updated to use this instead of the old .as_iter().collect().parse().
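The fused form could look something like this (a sketch; the offset_since bookkeeping is emulated here with a captured counter):

// One pass that stops either after n tokens or when the predicate fails,
// mimicking toks.take(3).take_while(|t| t.is_numeric()).parse():
let n = 3;
let mut seen = 0;
let parsed = toks
    .take_while(|t| {
        let keep = seen < n && t.is_numeric();
        seen += 1;
        keep
    })
    .parse::<u16, String>();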

@jsdw (Owner) commented Nov 14, 2023

Brill, thanks for spotting these! I added all of the obvious nested parse optimisations, including the ones you suggested (there are more that could be added, but they aren't definitely optimal any more), and addressed your other points.

@Easyoakland (Contributor) commented:

Is there a reason that tok.slice(from, to).take(n) and tok.slice(from, to).take_while(f) can't benefit from the same?

@jsdw (Owner) commented Nov 14, 2023

> Is there a reason that tok.slice(from, to).take(n) and tok.slice(from, to).take_while(f) can't benefit from the same?

I'll have a look later; it was quite late when I did the above, and partly I just didn't want to think about it!

Edit: I had a look, and offhand it's a bit tricky, because for slice(..).take(..) you'd want to turn it into .take_while(n > 0 && self.tokens.is_at_location(to)), and borrowing self again in the take_while wouldn't be allowed. I'm sure it can be done, but I'm inclined to leave that for separate PRs (you'd be welcome to have a shot at it!).

@Easyoakland (Contributor) commented:

> I'm inclined to leave that for separate PRs

Sure.


I think the token.consume() you added (to use your terminology) is a low-value method; you can use token.as_iter().for_each(drop) instead.

@jsdw (Owner) commented Nov 14, 2023

> I think the token.consume() you added (to use your terminology) is a low-value method; you can use token.as_iter().for_each(drop) instead.

Maybe, but I had already used that several times in different places :)

@Easyoakland (Contributor) commented:

I wouldn't keep consume, but other than that I have no comments.

@Easyoakland (Contributor) commented:

See rust-lang/rust#64117 for discussion.

@jsdw (Owner) commented Nov 14, 2023

> See rust-lang/rust#64117 for discussion.

Thanks for the link! It's a little more useful as an alias here because of the additional .as_iter() bit, and it's simple and very generically useful, so I think I'll keep it. I also added collect() to sit next to it as an alias for collecting (and as a sort of complement to parsing), because that is super common, and it'd be nice if you didn't need to reach for the .as_iter() methods much in normal usage.
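For instance (a sketch; this assumes collect() mirrors Iterator::collect, as the alias suggests):

// Collect some tokens up without reaching for .as_iter():
let word: String = tokens.take_while(|t| t.is_alphabetic()).collect();

// ...sitting alongside parsing as a sort of complement:
let num = tokens.take(3).parse::<u16, String>();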

I'd already made Slice, TakeWhile and Take impl Tokens instead of Iterator, and I've now standardised things so that all of the other structs return impl Tokens instead of Iterator too, which I think makes sense overall (if just for consistency).

(I guess this PR has become a little broad but hey ho)

@jsdw (Owner) commented Nov 14, 2023

@Easyoakland, re Tokens::eof and Tokens::line_ending:

  • Is there a reason for eof to care about consuming the None if found? (Otherwise, tokens.peek().is_none() would apply.) It feels like a bit of a niche thing to want (consuming on None vs not).
  • I'm tempted to remove/move line_ending because it's Tokens<Item=char> specific; it's a good candidate for a standalone utility method for working with char tokens.

@Easyoakland (Contributor) commented Nov 14, 2023

> Is there a reason for eof to care about consuming the None if found? (Otherwise, tokens.peek().is_none() would apply.) It feels like a bit of a niche thing to want (consuming on None vs not).

If the iterator is fused it doesn't matter. If the iterator is not fused (e.g. Some(1), None, Some(2)) then it makes sense to consume the None, the same way tok.tokens() consumes instead of just returning a bool. Delete eof if you want.
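To illustrate, a non-fused iterator is just plain std Iterator semantics; nothing yap-specific:

struct NotFused(u8);

impl Iterator for NotFused {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        self.0 += 1;
        match self.0 {
            1 => Some(1),
            2 => None,    // looks like the end here...
            3 => Some(2), // ...but a non-fused iterator may yield again
            _ => None,
        }
    }
}

Whether eof advances past that None or leaves it in place is observable for an iterator like this one, which is the distinction being discussed.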

> I'm tempted to remove/move line_ending because it's Tokens<Item=char> specific; it's a good candidate for a standalone utility method for working with char tokens.

Sure; put it where you want (new crate, new module, etc). Maybe this is how you start filling out a character module, leaving room for bytes and bits modules which apply to Tokens<Item = u8>/Tokens<Item = bool>? You could leave adding new functions to the future (signed_digits would also go there, if you decided to re-add it).

@jsdw (Owner) commented Nov 14, 2023

> If the iterator is fused it doesn't matter. If the iterator is not fused (e.g. Some(1), None, Some(2)) then it makes sense to consume the None, the same way tok.tokens() consumes instead of just returning a bool. Delete eof if you want.

I'll just leave it as-is; it's only a small thing, and if people don't want to consume then toks.peek().is_none() is there, so this is a bit like calling toks.token(EOF), conceptually.

> Sure; put it where you want (new crate, new module, etc). Maybe this is how you start filling out a character module, leaving room for bytes and bits modules which apply to Tokens<Item = u8>/Tokens<Item = bool>? You could leave adding new functions to the future (signed_digits would also go there, if you decided to re-add it).

I moved it to a new chars module, because yup, I think we could add some other helper things there eventually (one that comes to mind is a proper IEEE float parser).

@Easyoakland (Contributor) commented:

Two more things that can come after merging this increasingly large PR:

  1. Instead of manually defining token wrappers you could make a macro. In the macro, you could remember to forward to the parse... functions, which I think you forgot again and probably will in the future as well (see the sketch below).
  2. If Tokens is going to have an into_iterator method, it may as well be IntoIterator, defined for both &mut Tokens and Tokens.
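A self-contained sketch of the macro idea in 1, using a cut-down stand-in trait (yap's real Tokens has more methods; the names here are illustrative only):

trait MiniTokens {
    type Item;
    fn next_token(&mut self) -> Option<Self::Item>;
    fn parse_take(&mut self, n: usize); // stand-in for the parse... methods
}

// Every wrapper impl comes from the macro, so the parse... forwards can't be
// forgotten when a new wrapper type is added:
macro_rules! forward_mini_tokens {
    ($wrapper:ident) => {
        impl<T: MiniTokens> MiniTokens for $wrapper<T> {
            type Item = T::Item;
            fn next_token(&mut self) -> Option<Self::Item> {
                self.inner.next_token()
            }
            fn parse_take(&mut self, n: usize) {
                self.inner.parse_take(n)
            }
        }
    };
}

struct TakeN<T> {
    inner: T,
}
forward_mini_tokens!(TakeN);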

@jsdw (Owner) commented Nov 14, 2023

> If Tokens is going to have an into_iterator method, it may as well be IntoIterator, defined for both &mut Tokens and Tokens.

Do you mean making trait Tokens: IntoIterator?

> Instead of manually defining token wrappers you could make a macro. In the macro, you could remember to forward to the parse... functions, which I think you forgot again and probably will in the future as well.

Heh, yeah, potentially!

@jsdw (Owner) commented Nov 14, 2023

And yup, probably time to merge this; it's turned into a general "work on yap for a bit" PR!

@jsdw merged commit 7e5aa4c into master on Nov 14, 2023 (4 checks passed)
@jsdw deleted the buffered-parsing branch on November 14, 2023 at 22:53
@Easyoakland (Contributor) commented Nov 14, 2023

> Do you mean making trait Tokens: IntoIterator?

That, or a blanket impl<T: Tokens> IntoIterator for T (with type IntoIter = TokensIntoIter) instead of defining an into_iterator method on the Tokens trait itself.

@jsdw (Owner) commented Nov 14, 2023

Ooh; alas, Rust wouldn't allow such a blanket IntoIterator impl (IntoIterator is a foreign trait, so the coherence rules reject a blanket impl for any T: Tokens, and it would also overlap with std's impl<I: Iterator> IntoIterator for I), otherwise I'd love to have that. I'm much less keen on Tokens: IntoIterator, because it just adds a hurdle to implementing the trait that's a bit of a nuisance.

@jsdw (Owner) commented Nov 14, 2023

Anyway, thank you for all of your time and comments on this; they're much appreciated! I think this parse stuff has worked out quite well in the end, and it's lovely to be able to avoid the "allocate an intermediate buffer" step in the common cases.

@Easyoakland (Contributor) commented:

Your collect docs appear messed up.

@jsdw (Owner) commented Nov 14, 2023

> Your collect docs appear messed up.

Argh, it'll be this laptop touch pad; I do that far too often... Fixed on master :)
