-
-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: html5 attribute parsing #3393
base: main
Are you sure you want to change the base?
Conversation
4f6df29
to
d666fa1
Compare
Introduce code that is an optimized version of libxml2's xmlNewNsProp to avoid traversing the properties linked list to append an attribute every time we add one.
If there are more than 16 attributes, shift from doing strcmp to using a hashmap for duplicate detection. The number 16 was chosen based on the benchmark in #2568 I've introduced @tidwall's hashmap.c (MIT licensed and the copyright appropriately copied in the LICENSE-DEPENDENCIES file) to have something self-contained within the libgumbo codebase, rather than using libxml2's xmlHash or ruby's st.c.
d666fa1
to
41c601d
Compare
@stevecheckoway I would really love your feedback on this when you get a few minutes. ❤️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad you looked into fixing this. I was hesitant to reach into the internals of libxml2. It's often unclear which fields of the structs are not going to change.
if ((doc != NULL) && (doc->dict != NULL)) { | ||
cur->name = (xmlChar *) xmlDictLookup(doc->dict, name, -1); | ||
} else { | ||
cur->name = xmlStrdup(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is can return NULL. libxml2 handles it by freeing the newly allocated attribute and returning NULL. https://github.com/GNOME/libxml2/blob/260954c5660315dd5775e597c9621e07acab4ff2/tree.c#L1669
} | ||
|
||
if (doc != NULL) { | ||
int res = xmlIsID(doc, node, cur); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also fail when allocating memory.
int res = xmlIsID(doc, node, cur); | ||
|
||
if (res == 1) { | ||
xmlAddIDSafe(cur, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one as well.
#ifndef HAVE_XMLADDIDSAFE | ||
int | ||
xmlAddIDSafe(xmlAttrPtr attr, const xmlChar *value) { | ||
xmlIDPtr id = xmlAddID(NULL, attr->doc, value, attr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also fail and return NULL due to allocation failure. I think that in this case that will cause xmlAddIDSafe
to return that the ID already exists.
GumboStringSet * | ||
gumbo_string_set_new(size_t cap) | ||
{ | ||
return (GumboStringSet*)hashmap_new(sizeof(char *), cap, SEED0, SEED1, string_hash, string_compare, NULL, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cast for the return value shouldn't be necessary.
@stevecheckoway Thanks for the review. Yes, I was sloppy with malloc failures to get something up and running; I'll fix. I was super curious about what you think of introducing the hashmap into libgumbo? Just want to make sure you don't think it's a silly thing to do. |
@flavorjones I think adding the hashmap is better than going with the red-black tree I implemented for this exact task before I discovered that libxml2 also had quadratic behavior. |
What problem is this PR intended to solve?
Addressing quadratic performance of HTML5 attribute parsing as described in #2568 and rubys/nokogumbo#144.
Closes #2568
Some benchmarks
Here is the benchmark script.
before:
after:
In graphical form:
Just the "after", showing linear performance now:
Analysis
Analyzing the parsing of a tag with 16000 attributes (using
samply
), we see two hotspots:strlen
/memcmp
checks in libgumbo'stokenizer.c:finish_attribute_name
which checks for duplicates in the attribute list:nokogiri/gumbo-parser/src/tokenizer.c
Lines 804 to 820 in 729c96c
xmlNewPropInternal
traversing the singly-linked properties list to append each new property (to maintain parse order)This PR addresses (1) by introducing a hashmap and using it if there are 16 or more attributes.
This PR addresses (2) by inlining most of
xmlNewNsProp
but keeping a pointer to the last property to avoid re-traversing the linked list each time.After those changes, the flamegraph looks much better, with no obvious hotspots:
hashmap
Keeping in mind we may still someday want to publish our libgumbo fork as a separate library, I opted to bring in a new self-contained hashmap library from https://github.com/tidwall/hashmap.c
I'm using xxHash3 which seems to be pretty fast, but I'm open to other opinions on what hash function to use.
Have you included adequate test coverage?
Yes.
Does this change affect the behavior of either the C or the Java implementations?
HTML5 is CRuby-only.