-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSONReader needs to fail on bad input #107
Comments
Fantastic. Seems like you're already working out some classifications of errors (invalid characers, unbalanced braces, etc...). In a perfect world these types of errors would be codified & reflected in |
Seems like we can get all of this by replacing the current parser with one based upon Decoder.Token. It tokenizes json, returning delimiters, integers, booleans, floats, or strings, and automatically handles separators (commas, colons) and balanced braces. See ExampleDecoder_Token: https://golang.org/src/encoding/json/example_test.go#L87 |
so much this. I was wondering if this didn't already exist for json in the standard lib, clearly it does. Let's do it! I haven't been this pumped pumped for a PR that deletes code and returns more errors in a while. |
I've been investigating using Decoder.Token to rewrite the JSON parser code, but unfortunately, running the benchmarks shows that new implementation to be much slower, about twice as slow. Perhaps encoding/json is not performant, or there's still too much memory copying going on. I'm going to investigate other json parsing libraries (https://github.com/json-iterator/go claims to be high-performance) to see if they do better. For now, the new implementation with Decoder.Token does detect more errors; is it worth checking in just for that, or is the performance loss not worth it? One other thing: there were two failing unit tests I didn't understand: TestEntryBuffer in entry_buffer_test.go and TestEachEntry in entry_test.go. These seem to be something like a pipe that connect together a reader and writer? With the new implementation that I wrote, which verifies that JSON opens correctly with a curly brace and closes with the same, it seems like these classes won't initialize correctly since they start out with empty buffers. What are their exact semantics? |
Innnnnnnnteresting. re: slowness, my guess is encoding/json's Decoder is much slower because it makes extensive use of the reflect package. I'd like to keep the current version and instead focus on improving error handling for a few reasons:
TestEntryBuffer tests the real-world use case of using a buffer to do data format conversion (read from a csv-formatted EntryReader, and write to a json-formatted EntryBuffer). This is our current trick for getting out of one format and into another. TestEachEntry is a test that needs work, but is supposed to confirm the EachEntry package func works as expected, iterating each entry in a reader. Both need to work as they're currently implemented, which is another reason I'm happy to leave things as is. |
Sounds good. One thing about TestEntryBuffer is that the failing tests were due to my changing the semantics of the JSONReader simply because I didn't understand this particular use case. My new version was assuming that the constructor NewJSONReader should begin validating the underlying json text by looking for the opening "{" or "[". However that doesn't need to happen, it can start doing so lazily the first time ReadEntry is called, so that TestEntryBuffer works correctly. That can be fixed in any future implementation, regardless of how we choose to proceed. |
In order to proceed with fuzz testing (#103) the parser needs to reject invalid input so that the fuzzer library can properly mutate inputs and find the best edge cases. Currently, the JSONReader does not fail on most bad input text, instead returning some harmless entries before eventually returning EOF.
Here are some example invalid inputs (found through fuzzing) that are currently accepted without error:
{"""
{""
{""\n\r
[{"":0.891359201686479766013060971498190079908139321726943530014330540939446345918554318339765605212255964066145455497729631139948085803712198719971664381257402829111505715176981172691289708366:}
{""^M:""}
{"":78767417684105446,"":161921607789870092:"":81524627972118562,"":83456047028550530,
[{}{}
[itZatzVtFXY
fFUtEbc6Iuyt4MQJBAItQ+phy2U
fQtuUE9tcPg/FvytQDU
y2ptGsuSmgUtAqVXVlxt9Oeos0UUothgiDktJBzV+5Otd input"":"
[225940565588577426,97785349712147642s:111079331564477517I:63464490706047245:90488487091652254F"6xBlO8psh1cf4TONZUS1N5gIWD"
Specific things that need to be checked:
Objects have quoted keys, followed by a colon, then a value, followed by either a comma or end of object.
Objects end with closing curly braces and arrays end with closing square braces.
Invalid characters (the in the last example is the hex character 0xff) don't appear.
Braces are properly balanced.
Objects have unique keys.
Invalid alphabetic characters don't appear in number constants.
The text was updated successfully, but these errors were encountered: