
Buffered parsing: add parse() method, and return Tokens from take_while() and take() #13

Merged
jsdw merged 20 commits from buffered-parsing into master on Nov 14, 2023

Conversation

@jsdw (Owner) commented Nov 11, 2023

This takes the [BufferedTokens] stuff from @Easyoakland's PR (#11) and makes a few tweaks with the following rationales:

  • Don't add a separate Buffer associated type for now, because I'm not sure how I feel about it:
    • When you need e.g. a &str buffer, is it better to use StrTokens directly (perhaps we could allow one to easily take slices of it given some locations)? Does exposing the buffer type sort of go against the "iterator centric" style of this library?
    • If you do set the buffer type to e.g. &str, do you limit the possible Tokens impls that would work with it enough that there's no point trying to be generic now anyway?
  • Buffered values return Result and not Option, because FromStr already has a variant for handling empty cases, so we don't need separate handling (I tweaked a function to avoid this too).
  • Tweak StackString (I think I prefer that name to ArrayString; the differentiating point, after all, is that it's on the stack) to limit it to taking chars only, with a push fn which accepts chars and not bytes. My goal here was to limit the ways that it can panic (now it will only panic if it runs out of space, which seems reasonable).
  • Tweak some of the examples and tests.
  • Upgrade a couple of FnMuts to FnOnces because we only need to call them once.
  • Make the buffered functions not consume any tokens if parsing fails (this is consistent with how the other functions work).

@Easyoakland what do you think?

@jsdw changed the title from "Buffered parsing" to "Buffered parsing: buffered() method and associated bits." on Nov 11, 2023
@jsdw (Owner) commented Nov 11, 2023

I'm still feeling somewhat on the fence about this:

  • The functions that you can call in buffered() are (currently) all related to string parsing (unlike the core combinators of this library).
  • Some of them are just very tiny wrappers around parse_while anyway, so maybe they don't carry their weight.
  • signed_digits is sort of useful, but also stands somewhat on its own here; it's even more specific than the rest.

So a part of me is wondering about moving the parse_* methods onto the main Tokens trait, or perhaps putting them behind a parse() function instead of buffered(), so that they are a little more discoverable and it's more obviously just "string parsing functions that happen to require a buffer". And then I would probably ditch the other methods for now (and maybe StackString too; it can panic quite easily and maybe it's best if that is user provided; nice that it's possible though!).

@jsdw (Owner) commented Nov 11, 2023

I went and renamed buffered to parse, and removed all but the 3 "core" methods (now named from_n, from_remaining and from_filtered). I think I'm feeling better about this for now!

@jsdw changed the title from "Buffered parsing: buffered() method and associated bits." to "Buffered parsing: add parse() method, and return Tokens from take_while() and take()" on Nov 12, 2023
@jsdw (Owner) commented Nov 12, 2023

I was still a bit unhappy that things didn't play quite as well together as I wanted, but I think I am now happy with this next reworking. Now:

  • Tokens::parse() is the single method for buffered parsing, replacing parse_n, parse_remaining and parse_while.
  • Tokens::tokens_while(..) is now Tokens::take_while, and the thing it returns impls Tokens too.
  • Tokens::take(n) has been added, and also returns a thing which impls Tokens.

Together, we get back the previous functionality; we can now do:

// Equivalent to tokens.parse_n
tokens.take(3).parse::<u16, String>();

// Equivalent to tokens.parse_while
tokens.take_while(|t| t.is_alphabetic()).parse::<u8, String>();

// Equivalent to tokens.parse_remaining
tokens.parse::<Foo, String>();

I've kept buffer::StackString around for now, as much as anything as a nice example of how to make such a buffer.
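As a rough illustration, a minimal version of such a buffer could look like the following (my own sketch, not yap's actual StackString; the FromIterator<char> impl stands in for whatever bound parse() actually places on its buffer type):

struct MiniStackString<const N: usize> {
    bytes: [u8; N],
    len: usize,
}

impl<const N: usize> MiniStackString<N> {
    // Push a char; the only way this can panic is by running out of space.
    fn push(&mut self, c: char) {
        let mut tmp = [0u8; 4];
        let encoded = c.encode_utf8(&mut tmp).as_bytes();
        let end = self.len + encoded.len();
        assert!(end <= N, "MiniStackString: out of space");
        self.bytes[self.len..end].copy_from_slice(encoded);
        self.len = end;
    }

    fn as_str(&self) -> &str {
        // Valid UTF-8, because only whole encoded chars are ever copied in.
        core::str::from_utf8(&self.bytes[..self.len]).unwrap()
    }
}

impl<const N: usize> FromIterator<char> for MiniStackString<N> {
    fn from_iter<I: IntoIterator<Item = char>>(iter: I) -> Self {
        let mut s = Self { bytes: [0; N], len: 0 };
        for c in iter {
            s.push(c);
        }
        s
    }
}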

Maybe helper functions like signed_digit would be nice to have around, although I think that if I were to start adding such things, I'd add a helpers/utils module which just exposes a bunch of freestanding functions for common parsing tasks.

@Easyoakland (Contributor) commented:

Replying in order:

> When you need e.g. a &str buffer, is it better to use StrTokens directly (perhaps we could allow one to easily take slices of it given some locations)? Does exposing the buffer type sort of go against the "iterator centric" style of this library?

It is not strictly better. The reason I made it this way is so that yap_streaming could be used by swapping out the initial construction of the tokens. It would benefit there too, as it already has to collect things into an internal buffer for backtracking, so you could still get minimal copies while parsing a source that didn't start in memory.

> If you do set the buffer type to e.g. &str, do you limit the possible Tokens impls that would work with it enough that there's no point trying to be generic now anyway?

See above.

> Buffered values return Result and not Option, because FromStr already has a variant for handling empty cases, so we don't need separate handling (I tweaked a function to avoid this too).

Good point. That's better.

> Tweak StackString (I think I prefer that name to ArrayString; the differentiating point, after all, is that it's on the stack) to limit it to taking chars only, with a push fn which accepts chars and not bytes. My goal here was to limit the ways that it can panic (now it will only panic if it runs out of space, which seems reasonable).

I went with ArrayString for two main reasons. First, it matches the existing ArrayString elsewhere. Second, its defining feature isn't really that it's on the stack; it's that its storage is inline (a Box<StackString> wouldn't be on the stack, for example).

I think it's reasonable to expect the library user to import the type they are interested in from another crate, and to only use StackString/ArrayString for testing (or just add a new dev-dependency). The only reason I can think of to keep it here is for convenience, but I question whether it is actually more convenient.

> And then I would probably ditch the other methods for now (and maybe StackString too...

It looks like you agree.

> Make the buffered functions not consume any tokens if parsing fails (this is consistent with how the other functions work).

I thought I did that; if I didn't, it was an oversight. Regardless, that makes sense.

> Some of them are just very tiny wrappers around parse_while anyway, so maybe they don't carry their weight.

Agree.

> signed_digits is sort of useful, but also stands somewhat on its own here; it's even more specific than the rest.

Yeah, this method was basically the only reason I added them at all.

> I was still a bit unhappy that things didn't play quite as well together as I wanted, but I think I am now happy with this next reworking. Now: ...

That was a good idea. I'll have to test it a bit to check, but that solution looks much nicer.

> Maybe helper functions like signed_digit would be nice to have around, although I think that if I were to start adding such things, I'd add a helpers/utils module which just exposes a bunch of freestanding functions for common parsing tasks.

Agree. It totally makes sense to decide what, if anything, to add later, after the core methods are added.


The associated buffer method does allow less copying, though I'd have to benchmark to see if it matters. This proposed method is certainly elegant.

The other idea I have is to implement a parse_slice which takes a start and end TokenLocation. TakeWhile and the related types would forward calls to parse on to the wrapped token implementation's parse_slice, and then StrTokens, StreamTokens and other implementations are free to implement a more efficient parse knowing they have a buffer. Basically, change the docs on this new parse_slice (and on parse) to say "may use the provided buffer type if extra buffering is needed". You'd get the advantage of no copies without much API change.
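Roughly, the shape I have in mind is something like this (a sketch only; the trait is cut down and the exact bounds and names are made up for illustration):

use core::ops::Deref;
use core::str::FromStr;

// A cut-down stand-in for the Tokens trait, just to show the shape:
trait SliceParse {
    type Location;

    // Parse everything between `from` and `to`. The default impl would walk
    // the tokens again and collect them into `Buf`; implementations which
    // already hold a backing buffer (StrTokens, StreamTokens) could instead
    // parse a slice of that buffer directly, with no copying.
    fn parse_slice<Out, Buf>(&mut self, from: Self::Location, to: Self::Location) -> Result<Out, Out::Err>
    where
        Out: FromStr,
        Buf: FromIterator<char> + Deref<Target = str>;
}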
If this turns out to be a good idea, parse_slice could come later, since it should integrate easily with the current method.
@jsdw thoughts on this?

@Easyoakland (Contributor) commented:

The main issue I see with the parse_slice addendum is that a Tokens without an existing buffer will wind up iterating twice (once to get the final position, and once to collect up the elements), which is probably worse than just always copying the elements, and thus not worth the overhead of another method.

@jsdw (Owner) commented Nov 12, 2023

Thank you for your detailed responses!

I liked your parse_slice() idea, and also noted that we can then make parse() (and parse_slice() itself) avoid allocating with StrTokens (and hopefully the yap_streaming version) without adding an associated type Buffer at all, so I've pushed a commit which does that and am curious to see what you think!

This ticks off the main place I'd personally want to avoid allocations: the case where you use yap to parse some stringish tokens, but then need to go and call some external parsing function to turn that string into the desired type (e.g. a number).

I'm still open to thinking about adding a type Buffer, but now that we can tokens.parse() without allocating for suitable token types, I have a lot of what I'd want :)

> I think it's reasonable to expect the library user to import the type they are interested in from another crate, and to only use StackString/ArrayString for testing (or just add a new dev-dependency). The only reason I can think of to keep it here is for convenience, but I question whether it is actually more convenient.

Ok, yeah; I have been quite on the fence about keeping it around anyway, so I'll just ditch it and leave it to the user to bring their own buffer!

@jsdw (Owner) commented Nov 12, 2023

Aside from fixing a couple of bits, I hid parse_slice from the docs (use slice(from,to).parse() instead), and made Take and TakeWhile both make use of it too, so things like StrTokens can be optimal in all of the main cases.

> The main issue I see with the parse_slice addendum is that a Tokens without an existing buffer will wind up iterating twice (once to get the final position, and once to collect up the elements).

The way it's implemented at the moment, doing take_while().parse() or suchlike will do one iteration through the tokens to find out which tokens need parsing, and then:

  • In the default case, go back through them again (albeit without needing any more checks) to collect them up in the Buffer, or
  • In the StrTokens optimised impls, skip the second iteration entirely and just parse the str slice directly (sketched below).
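Concretely, the optimised case can be as simple as this standalone sketch (the field names and byte-offset locations are my own assumptions, not yap's internals):

use core::str::FromStr;

// A str-backed tokens type can treat its locations as byte offsets...
struct MyStrTokens<'a> {
    source: &'a str,
}

impl<'a> MyStrTokens<'a> {
    // ...so parsing a range is just subslicing the original str; no per-char
    // collection into a Buffer is needed at all.
    fn parse_slice<Out: FromStr>(&self, from: usize, to: usize) -> Result<Out, Out::Err> {
        self.source[from..to].parse()
    }
}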

@Easyoakland (Contributor) commented:

Yeah, that's what I was describing.

For accessing the buffer directly: if I still found it necessary, I think I could impl a new trait on the tokens, so it doesn't need to happen here.

You might want to add a benchmark (criterion) comparing .as_iter().collect::<String>().parse() vs .parse::<_, String>(), and another comparing how that double iteration affects IterTokens and StrTokens.
Hopefully the result will be positive; if so, I think this was an overall good addition!
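Such a benchmark might look roughly like this (a sketch assuming criterion and yap's IntoTokens/into_tokens() constructor; the exact calls aren't taken from this PR):

use criterion::{criterion_group, criterion_main, Criterion};
use yap::{IntoTokens, Tokens};

fn parse_benches(c: &mut Criterion) {
    // Collect into an intermediate String, then parse that:
    c.bench_function("collect_then_parse", |b| {
        b.iter(|| {
            let mut tokens = "12345 leftover".into_tokens();
            tokens
                .take_while(|t| t.is_numeric())
                .as_iter()
                .collect::<String>()
                .parse::<u32>()
        })
    });
    // Parse directly, letting StrTokens skip the intermediate buffer:
    c.bench_function("parse_direct", |b| {
        b.iter(|| {
            let mut tokens = "12345 leftover".into_tokens();
            tokens.take_while(|t| t.is_numeric()).parse::<u32, String>()
        })
    });
}

criterion_group!(benches, parse_benches);
criterion_main!(benches);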

@Easyoakland mentioned this pull request on Nov 13, 2023
@jsdw (Owner) commented Nov 13, 2023

Aah, at the time it didn't quite click that having all of the parse functions iterate once to find the start and end locations, and then likely iterate again in the default parse_slice impl, was not optimal. In order to be (I think) as optimal as possible, I've now added parse_take_while and parse_take. Now, each of Slice, Take and TakeWhile delegates parse() to the corresponding function on Tokens, which avoids needing to iterate over the tokens twice.

With this, I do agree that benchmarks would be nice to add anyway, but I won't add them here (probably an issue would be good) since I think nothing wasteful is happening now.

Good point about being able to define a trait which mentions a buffer type, if it's useful to be able to access it for other reasons :)

If there are no objections I'll probably merge this in the next day or so and put a release out! (I'm also running out of the time I want to dedicate to this for now; too many other things to do!)

@Easyoakland (Contributor) commented Nov 13, 2023

  • Doing toks.take(n).take_while(f).parse() can become a single take_while(offset_since < n && f) pass (see the sketch below). The same goes for toks.take_while(f).take(n).parse().
    • Which reminds me: IterTokens::offset() should probably be overridden to return self.cursor instead of the default self.location().offset(), since that does an extra clone.
  • You forgot to add the parse... methods to Context and ContextMut.
  • While you updated the crate example, the json.rs and README examples could also be updated to use this instead of the old .as_iter().collect().parse().
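The fused form could look something like this (a sketch; the offset_since bookkeeping is emulated here with a captured counter):

// One pass that stops either after n tokens or when the predicate fails,
// mimicking toks.take(3).take_while(|t| t.is_numeric()).parse():
let n = 3;
let mut seen = 0;
let parsed = toks
    .take_while(|t| {
        let keep = seen < n && t.is_numeric();
        seen += 1;
        keep
    })
    .parse::<u16, String>();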

@jsdw (Owner) commented Nov 14, 2023

Brill, thanks for spotting these! I added all of the obvious nested parse optimisations, including the ones you suggested (there are more that could be added, but they aren't definitely optimal any more), and addressed your other points.

@Easyoakland (Contributor) commented:

Is there a reason that tok.slice(from, to).take(n) and tok.slice(from, to).take_while(f) can't benefit from the same?

@jsdw (Owner) commented Nov 14, 2023

> Is there a reason that tok.slice(from, to).take(n) and tok.slice(from, to).take_while(f) can't benefit from the same?

I'll have a look later; it was quite late when I did the above, and partly I just didn't want to think about it!

Edit: I had a look, and offhand it's a bit tricky, because for slice(..).take(..) you'd want to turn it into .take_while(n > 0 && self.tokens.is_at_location(to)), and borrowing self again in the take_while wouldn't be allowed. I'm sure it can be done, but I'm inclined to leave that for separate PRs (you'd be welcome to have a shot at it!).

@Easyoakland (Contributor) commented:

> I'm inclined to leave that for separate PRs

Sure.


I think the token.consume() you added (to use your terminology) is a low-value method; you can use token.as_iter().for_each(drop) instead.

@jsdw (Owner) commented Nov 14, 2023

> I think the token.consume() you added (to use your terminology) is a low-value method; you can use token.as_iter().for_each(drop) instead.

Maybe, but I had already used that several times in different places :)

@Easyoakland (Contributor) commented:

I wouldn't keep consume, but other than that I have no comments.

@Easyoakland (Contributor) commented:

See rust-lang/rust#64117 for discussion.

@jsdw (Owner) commented Nov 14, 2023

> See rust-lang/rust#64117 for discussion.

Thanks for the link! It's a little more useful as an alias here because of the additional .as_iter() bit, and it's simple and very generically useful, so I think I'll keep it. I also added collect() to sit next to it as an alias for collecting (and as a sort of complement to parsing), because that is super common, and it'd be nice if you didn't need to reach for the .as_iter() methods much in normal usage.
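For instance (a sketch; this assumes collect() mirrors Iterator::collect, as the alias suggests):

// Collect some tokens up without reaching for .as_iter():
let word: String = tokens.take_while(|t| t.is_alphabetic()).collect();

// ...sitting alongside parsing as a sort of complement:
let num = tokens.take(3).parse::<u16, String>();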

I'd already made Slice, TakeWhile and Take impl Tokens instead of Iterator, and I've now standardised things so that all of the other structs return impl Tokens instead of Iterator too, which I think makes sense overall (if just for consistency).

(I guess this PR has become a little broad but hey ho)

@jsdw (Owner) commented Nov 14, 2023

@Easyoakland, re Tokens::eof and Tokens::line_ending:

  • Is there a reason for eof to care about consuming the None if found? (Otherwise, tokens.peek().is_none() would apply.) It feels like a bit of a niche thing to want (consuming on None vs not).
  • I'm tempted to remove/move line_ending because it's Tokens<Item=char> specific; it's a good candidate for a standalone utility method for working with char tokens.

@Easyoakland (Contributor) commented Nov 14, 2023

> Is there a reason for eof to care about consuming the None if found? (Otherwise, tokens.peek().is_none() would apply.) It feels like a bit of a niche thing to want (consuming on None vs not).

If the iterator is fused it doesn't matter. If the iterator is not fused (e.g. Some(1), None, Some(2)) then it makes sense to consume the None, the same way tok.tokens() consumes instead of just returning a bool. Delete eof if you want.
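To illustrate, a non-fused iterator is just plain std Iterator semantics; nothing yap-specific:

struct NotFused(u8);

impl Iterator for NotFused {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        self.0 += 1;
        match self.0 {
            1 => Some(1),
            2 => None,    // looks like the end here...
            3 => Some(2), // ...but a non-fused iterator may yield again
            _ => None,
        }
    }
}

Whether eof advances past that None or leaves it in place is observable for an iterator like this one, which is the distinction being discussed.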

> I'm tempted to remove/move line_ending because it's Tokens<Item=char> specific; it's a good candidate for a standalone utility method for working with char tokens.

Sure; put it where you want (new crate, new module, etc). Maybe this is how you start filling out a character module, leaving room for bytes and bits modules which apply to Tokens<Item = u8>/Tokens<Item = bool>? You could leave adding new functions to the future (signed_digits would also go there, if you decided to re-add it).

@jsdw (Owner) commented Nov 14, 2023

> If the iterator is fused it doesn't matter. If the iterator is not fused (e.g. Some(1), None, Some(2)) then it makes sense to consume the None, the same way tok.tokens() consumes instead of just returning a bool. Delete eof if you want.

I'll just leave it as-is; it's only a small thing, and if people don't want to consume then toks.peek().is_none() is there, so this is a bit like calling toks.token(EOF), conceptually.

> Sure; put it where you want (new crate, new module, etc). Maybe this is how you start filling out a character module, leaving room for bytes and bits modules which apply to Tokens<Item = u8>/Tokens<Item = bool>? You could leave adding new functions to the future (signed_digits would also go there, if you decided to re-add it).

I moved it to a new chars module, because yup, I think we could add some other helper things there eventually (one that comes to mind is a proper IEEE float parser).

@Easyoakland (Contributor) commented:

Two more things that can come after merging this increasingly large PR:

  1. Instead of manually defining token wrappers you could make a macro. In the macro, you could remember to forward to the parse... functions, which I think you forgot again and probably will in the future as well (see the sketch below).
  2. If Tokens is going to have an into_iterator method, it may as well be IntoIterator, defined for both &mut Tokens and Tokens.
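A self-contained sketch of the macro idea in 1, using a cut-down stand-in trait (yap's real Tokens has more methods; the names here are illustrative only):

trait MiniTokens {
    type Item;
    fn next_token(&mut self) -> Option<Self::Item>;
    fn parse_take(&mut self, n: usize); // stand-in for the parse... methods
}

// Every wrapper impl comes from the macro, so the parse... forwards can't be
// forgotten when a new wrapper type is added:
macro_rules! forward_mini_tokens {
    ($wrapper:ident) => {
        impl<T: MiniTokens> MiniTokens for $wrapper<T> {
            type Item = T::Item;
            fn next_token(&mut self) -> Option<Self::Item> {
                self.inner.next_token()
            }
            fn parse_take(&mut self, n: usize) {
                self.inner.parse_take(n)
            }
        }
    };
}

struct TakeN<T> {
    inner: T,
}
forward_mini_tokens!(TakeN);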

@jsdw (Owner) commented Nov 14, 2023

> If Tokens is going to have an into_iterator method, it may as well be IntoIterator, defined for both &mut Tokens and Tokens.

Do you mean making trait Tokens: IntoIterator?

> Instead of manually defining token wrappers you could make a macro. In the macro, you could remember to forward to the parse... functions, which I think you forgot again and probably will in the future as well.

Heh, yeah, potentially!

@jsdw (Owner) commented Nov 14, 2023

And yup, probably time to merge this; it's turned into a general "work on yap for a bit" PR!

@jsdw merged commit 7e5aa4c into master on Nov 14, 2023 (4 checks passed)
@jsdw deleted the buffered-parsing branch on November 14, 2023 at 22:53
@Easyoakland (Contributor) commented Nov 14, 2023

> Do you mean making trait Tokens: IntoIterator?

That, or a blanket impl<T: Tokens> IntoIterator for T (with type IntoIter = TokensIntoIter) instead of defining an into_iterator method on the Tokens trait itself.

@jsdw (Owner) commented Nov 14, 2023

Ooh; alas, Rust wouldn't allow such a blanket IntoIterator impl (IntoIterator is a foreign trait, so the coherence rules reject a blanket impl for any T: Tokens, and it would also overlap with std's impl<I: Iterator> IntoIterator for I), otherwise I'd love to have that. I'm much less keen on Tokens: IntoIterator, because it just adds a hurdle to implementing the trait that's a bit of a nuisance.

@jsdw (Owner) commented Nov 14, 2023

Anyway, thank you for all of your time and comments on this; they're much appreciated! I think this parse stuff has worked out quite well in the end, and it's lovely to be able to avoid the "allocate an intermediate buffer" step in the common cases.

@Easyoakland (Contributor) commented:

Your collect docs appear messed up.

@jsdw (Owner) commented Nov 14, 2023

> Your collect docs appear messed up.

Argh, it'll be this laptop touch pad; I do that far too often... Fixed on master :)
