JSONReader needs to fail on bad input #107

dustmop · 2018-04-02T16:49:24Z

In order to proceed with fuzz testing (#103) the parser needs to reject invalid input so that the fuzzer library can properly mutate inputs and find the best edge cases. Currently, the JSONReader does not fail on most bad input text, instead returning some harmless entries before eventually returning EOF.

Here are some example invalid inputs (found through fuzzing) that are currently accepted without error:

{"""
{""
{""\n\r

[{"":0.891359201686479766013060971498190079908139321726943530014330540939446345918554318339765605212255964066145455497729631139948085803712198719971664381257402829111505715176981172691289708366:}

{""^M:""}

{"":78767417684105446,"":161921607789870092:"":81524627972118562,"":83456047028550530,

[{}{}

[itZatzVtFXY
fFUtEbc6Iuyt4MQJBAItQ+phy2U
fQtuUE9tcPg/FvytQDU
y2ptGsuSmgUtAqVXVlxt9Oeos0UUothgiDktJBzV+5Otd input"":"

[225940565588577426,97785349712147642s:111079331564477517I:63464490706047245:90488487091652254F"6xBlO8psh1cf4TONZUS1N5gIWD"

Specific things that need to be checked:
Objects have quoted keys, followed by a colon, then a value, followed by either a comma or end of object.
Objects end with closing curly braces and arrays end with closing square braces.
Invalid characters (the in the last example is the hex character 0xff) don't appear.
Braces are properly balanced.
Objects have unique keys.
Invalid alphabetic characters don't appear in number constants.

b5 · 2018-04-02T16:58:43Z

Fantastic. Seems like you're already working out some classifications of errors (invalid characers, unbalanced braces, etc...). In a perfect world these types of errors would be codified & reflected in godoc, but I think we need to have a larger conversation about errors & error types before we get there. In the meantime it'd be great to see different types of errors called out in comments or through consistent error message conventions, it'll make our lives easier down the road when we up the rigor of our error system.

dustmop · 2018-04-02T18:37:19Z

Seems like we can get all of this by replacing the current parser with one based upon Decoder.Token. It tokenizes json, returning delimiters, integers, booleans, floats, or strings, and automatically handles separators (commas, colons) and balanced braces.

See ExampleDecoder_Token: https://golang.org/src/encoding/json/example_test.go#L87

b5 · 2018-04-02T18:39:35Z

so much this. I was wondering if this didn't already exist for json in the standard lib, clearly it does. Let's do it!

I haven't been this pumped pumped for a PR that deletes code and returns more errors in a while.

dustmop · 2018-04-03T23:19:19Z

I've been investigating using Decoder.Token to rewrite the JSON parser code, but unfortunately, running the benchmarks shows that new implementation to be much slower, about twice as slow. Perhaps encoding/json is not performant, or there's still too much memory copying going on. I'm going to investigate other json parsing libraries (https://github.com/json-iterator/go claims to be high-performance) to see if they do better. For now, the new implementation with Decoder.Token does detect more errors; is it worth checking in just for that, or is the performance loss not worth it?

One other thing: there were two failing unit tests I didn't understand: TestEntryBuffer in entry_buffer_test.go and TestEachEntry in entry_test.go. These seem to be something like a pipe that connect together a reader and writer? With the new implementation that I wrote, which verifies that JSON opens correctly with a curly brace and closes with the same, it seems like these classes won't initialize correctly since they start out with empty buffers. What are their exact semantics?

b5 · 2018-04-03T23:33:23Z

Innnnnnnnteresting.

re: slowness, my guess is encoding/json's Decoder is much slower because it makes extensive use of the reflect package. I'd like to keep the current version and instead focus on improving error handling for a few reasons:

In nearly all cases the input data is being passed through rigid parsers like json.Unmarshal or a csv.Reader before the data is recorded as a dataset, so we have a degree of insulation here (for now at least)
Slowness is a real concern given that these codepaths are intended to behave like a database, we don't have the constraint of needing to encode to various target types, b/c case we know ahead of time what we're decoding to
thinking long term, I'd rather stick with our own home-rolled implementation to have the freedom to continue to form-fit the solution to our needs

TestEntryBuffer tests the real-world use case of using a buffer to do data format conversion (read from a csv-formatted EntryReader, and write to a json-formatted EntryBuffer). This is our current trick for getting out of one format and into another. TestEachEntry is a test that needs work, but is supposed to confirm the EachEntry package func works as expected, iterating each entry in a reader. Both need to work as they're currently implemented, which is another reason I'm happy to leave things as is.

dustmop · 2018-04-04T00:14:50Z

Sounds good. One thing about TestEntryBuffer is that the failing tests were due to my changing the semantics of the JSONReader simply because I didn't understand this particular use case. My new version was assuming that the constructor NewJSONReader should begin validating the underlying json text by looking for the opening "{" or "[". However that doesn't need to happen, it can start doing so lazily the first time ReadEntry is called, so that TestEntryBuffer works correctly. That can be fixed in any future implementation, regardless of how we choose to proceed.

b5 · 2018-08-20T13:14:48Z

This has been mostly addressed in #113 & #114. Closing!

dustmop self-assigned this Apr 2, 2018

b5 added the bug label Apr 3, 2018

b5 closed this as completed Aug 20, 2018

b5 mentioned this issue Aug 20, 2018

CBORReader Needs to fail on Bad input #139

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSONReader needs to fail on bad input #107

JSONReader needs to fail on bad input #107

dustmop commented Apr 2, 2018

b5 commented Apr 2, 2018

dustmop commented Apr 2, 2018

b5 commented Apr 2, 2018

dustmop commented Apr 3, 2018

b5 commented Apr 3, 2018

dustmop commented Apr 4, 2018

b5 commented Aug 20, 2018

JSONReader needs to fail on bad input #107

JSONReader needs to fail on bad input #107

Comments

dustmop commented Apr 2, 2018

b5 commented Apr 2, 2018

dustmop commented Apr 2, 2018

b5 commented Apr 2, 2018

dustmop commented Apr 3, 2018

b5 commented Apr 3, 2018

dustmop commented Apr 4, 2018

b5 commented Aug 20, 2018