perf: html5 attribute parsing #3393

flavorjones · 2024-12-26T20:51:57Z

What problem is this PR intended to solve?

Addressing quadratic performance of HTML5 attribute parsing as described in #2568 and rubys/nokogumbo#144.

Some benchmarks

Here is the benchmark script.

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

def html_with_attributes(count)
  html = "<div"
  count.times do |j|
    html << " fake-attr-#{j}"
  end
  html << ">"
end

Benchmark.ips do |x|
  x.warmup = 0

  [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 ].each do |attribute_count|
    html = html_with_attributes(attribute_count)

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html, max_attributes: 100_000)
    end
  end
end

before:

Calculating -------------------------------------
        1 attributes    242.976k (± 9.3%) i/s    (4.12 μs/i) -    936.599k in   4.838811s
        2 attributes    219.456k (± 9.4%) i/s    (4.56 μs/i) -    837.644k in   4.858317s
        4 attributes    191.173k (± 9.3%) i/s    (5.23 μs/i) -    715.392k in   4.882336s
        8 attributes    144.813k (±12.2%) i/s    (6.91 μs/i) -    516.194k in   4.912577s
       16 attributes     83.396k (±17.2%) i/s   (11.99 μs/i) -    238.564k in   4.958331s
       32 attributes     55.800k (± 9.9%) i/s   (17.92 μs/i) -    190.871k in   4.970063s
       64 attributes     27.381k (±14.7%) i/s   (36.52 μs/i) -     91.867k in   4.981936s
      128 attributes     12.707k (± 9.4%) i/s   (78.70 μs/i) -     49.906k in   4.990254s
      256 attributes      4.723k (± 8.1%) i/s  (211.74 μs/i) -     19.657k in   4.995888s
      512 attributes      1.417k (±10.6%) i/s  (705.61 μs/i) -      6.339k in   4.998084s
     1024 attributes    382.078 (±11.5%) i/s    (2.62 ms/i) -      1.811k in   5.000356s
     2048 attributes    119.825 (±10.8%) i/s    (8.35 ms/i) -    582.000 in   5.003555s
     4096 attributes     29.034 (± 6.9%) i/s   (34.44 ms/i) -    144.000 in   5.003602s
     8192 attributes      6.492 (± 0.0%) i/s  (154.04 ms/i) -     33.000 in   5.111381s
    16384 attributes      1.478 (± 0.0%) i/s  (676.68 ms/i) -      8.000 in   5.417232s

after:

Calculating -------------------------------------
        1 attributes    229.084k (±10.4%) i/s    (4.37 μs/i) -    868.071k in   4.846385s
        2 attributes    213.788k (± 9.9%) i/s    (4.68 μs/i) -    804.683k in   4.859819s
        4 attributes    181.602k (±10.7%) i/s    (5.51 μs/i) -    652.740k in   4.886700s
        8 attributes    137.811k (± 9.9%) i/s    (7.26 μs/i) -    500.690k in   4.911176s
       16 attributes     95.462k (± 9.9%) i/s   (10.48 μs/i) -    338.045k in   4.938699s
       32 attributes     61.388k (± 9.5%) i/s   (16.29 μs/i) -    216.911k in   4.961587s
       64 attributes     34.368k (±10.2%) i/s   (29.10 μs/i) -    121.985k in   4.976219s
      128 attributes     17.315k (±10.9%) i/s   (57.75 μs/i) -     60.716k in   4.987542s
      256 attributes      9.011k (± 9.2%) i/s  (110.98 μs/i) -     32.443k in   4.992878s
      512 attributes      4.552k (±11.8%) i/s  (219.69 μs/i) -     16.522k in   4.995834s
     1024 attributes      2.255k (±13.6%) i/s  (443.38 μs/i) -      8.303k in   4.997861s
     2048 attributes      1.091k (±17.6%) i/s  (916.91 μs/i) -      4.132k in   4.998710s
     4096 attributes    544.840 (±22.0%) i/s    (1.84 ms/i) -      2.101k in   5.006943s
     8192 attributes    282.789 (±22.3%) i/s    (3.54 ms/i) -      1.089k in   5.000604s
    16384 attributes    134.991 (±27.4%) i/s    (7.41 ms/i) -    542.000 in   5.001360s

In graphical form:

Just the "after", showing linear performance now:

Analysis

Analyzing the parsing of a tag with 16000 attributes (using samply), we see two hotspots:

the strlen/memcmp checks in libgumbo's tokenizer.c:finish_attribute_name which checks for duplicates in the attribute list:

nokogiri/gumbo-parser/src/tokenizer.c

Lines 804 to 820 in 729c96c

    
           for (unsigned int i = 0; i < attributes->length; ++i) { 
        
             GumboAttribute* attr = attributes->data[i]; 
        
             if ( 
        
               strlen(attr->name) == tag_state->_buffer.length 
        
               && 0 == memcmp ( 
        
                 attr->name, 
        
                 tag_state->_buffer.data, 
        
                 tag_state->_buffer.length 
        
               ) 
        
             ) { 
        
               // Identical attribute; bail. 
        
               add_duplicate_attr_error(parser); 
        
               reinitialize_tag_buffer(parser); 
        
               tag_state->_drop_next_attr_value = true; 
        
               return; 
        
             } 
        
           }

libxml2's xmlNewPropInternal traversing the singly-linked properties list to append each new property (to maintain parse order)

This PR addresses (1) by introducing a hashmap and using it if there are 16 or more attributes.

This PR addresses (2) by inlining most of xmlNewNsProp but keeping a pointer to the last property to avoid re-traversing the linked list each time.

After those changes, the flamegraph looks much better, with no obvious hotspots:

hashmap

Keeping in mind we may still someday want to publish our libgumbo fork as a separate library, I opted to bring in a new self-contained hashmap library from https://github.com/tidwall/hashmap.c

I'm using xxHash3 which seems to be pretty fast, but I'm open to other opinions on what hash function to use.

Have you included adequate test coverage?

Yes.

Does this change affect the behavior of either the C or the Java implementations?

HTML5 is CRuby-only.

Introduce code that is an optimized version of libxml2's xmlNewNsProp to avoid traversing the properties linked list to append an attribute every time we add one.

@tidwall

If there are more than 16 attributes, shift from doing strcmp to using a hashmap for duplicate detection. The number 16 was chosen based on the benchmark in #2568 I've introduced @tidwall's hashmap.c (MIT licensed and the copyright appropriately copied in the LICENSE-DEPENDENCIES file) to have something self-contained within the libgumbo codebase, rather than using libxml2's xmlHash or ruby's st.c.

flavorjones · 2024-12-27T15:17:06Z

@stevecheckoway I would really love your feedback on this when you get a few minutes. ❤️

stevecheckoway

I'm glad you looked into fixing this. I was hesitant to reach into the internals of libxml2. It's often unclear which fields of the structs are not going to change.

stevecheckoway · 2024-12-27T16:46:14Z

ext/nokogiri/gumbo.c

+  if ((doc != NULL) && (doc->dict != NULL)) {
+    cur->name = (xmlChar *) xmlDictLookup(doc->dict, name, -1);
+  } else {
+    cur->name = xmlStrdup(name);


This is can return NULL. libxml2 handles it by freeing the newly allocated attribute and returning NULL. https://github.com/GNOME/libxml2/blob/260954c5660315dd5775e597c9621e07acab4ff2/tree.c#L1669

stevecheckoway · 2024-12-27T16:59:21Z

ext/nokogiri/gumbo.c

+    }
+
+    if (doc != NULL) {
+      int res = xmlIsID(doc, node, cur);


This can also fail when allocating memory.

stevecheckoway · 2024-12-27T16:59:31Z

ext/nokogiri/gumbo.c

+      int res = xmlIsID(doc, node, cur);
+
+      if (res == 1) {
+        xmlAddIDSafe(cur, value);


And this one as well.

stevecheckoway · 2024-12-27T17:06:10Z

ext/nokogiri/libxml2_polyfill.c

+#ifndef HAVE_XMLADDIDSAFE
+int
+xmlAddIDSafe(xmlAttrPtr attr, const xmlChar *value) {
+  xmlIDPtr id = xmlAddID(NULL, attr->doc, value, attr);


This can also fail and return NULL due to allocation failure. I think that in this case that will cause xmlAddIDSafe to return that the ID already exists.

stevecheckoway · 2024-12-27T17:08:59Z

gumbo-parser/src/string_set.c

+GumboStringSet *
+gumbo_string_set_new(size_t cap)
+{
+  return (GumboStringSet*)hashmap_new(sizeof(char *), cap, SEED0, SEED1, string_hash, string_compare, NULL, NULL);


The cast for the return value shouldn't be necessary.

flavorjones · 2024-12-27T18:44:50Z

@stevecheckoway Thanks for the review. Yes, I was sloppy with malloc failures to get something up and running; I'll fix.

I was super curious about what you think of introducing the hashmap into libgumbo? Just want to make sure you don't think it's a silly thing to do.

stevecheckoway · 2024-12-27T18:53:16Z

@flavorjones I think adding the hashmap is better than going with the red-black tree I implemented for this exact task before I discovered that libxml2 also had quadratic behavior.

flavorjones requested a review from stevecheckoway December 26, 2024 20:51

flavorjones changed the title ~~Flavorjones gumbo attributes perf~~ perf: html5 attribute parsing Dec 26, 2024

flavorjones force-pushed the flavorjones-gumbo-attributes-perf branch 2 times, most recently from 4f6df29 to d666fa1 Compare December 26, 2024 21:27

flavorjones added this to the v1.19.0 milestone Dec 26, 2024

flavorjones added 2 commits December 27, 2024 09:21

perf(html5): optimize gumbo code to add a libxml2 property

2611599

Introduce code that is an optimized version of libxml2's xmlNewNsProp to avoid traversing the properties linked list to append an attribute every time we add one.

flavorjones force-pushed the flavorjones-gumbo-attributes-perf branch from d666fa1 to 41c601d Compare December 27, 2024 14:21

stevecheckoway requested changes Dec 27, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: html5 attribute parsing #3393

perf: html5 attribute parsing #3393

flavorjones commented Dec 26, 2024 •

edited

Loading

flavorjones commented Dec 27, 2024

stevecheckoway left a comment

stevecheckoway Dec 27, 2024

stevecheckoway Dec 27, 2024

stevecheckoway Dec 27, 2024

stevecheckoway Dec 27, 2024

stevecheckoway Dec 27, 2024

flavorjones commented Dec 27, 2024

stevecheckoway commented Dec 27, 2024

	for (unsigned int i = 0; i < attributes->length; ++i) {
	GumboAttribute* attr = attributes->data[i];
	if (
	strlen(attr->name) == tag_state->_buffer.length
	&& 0 == memcmp (
	attr->name,
	tag_state->_buffer.data,
	tag_state->_buffer.length
	)
	) {
	// Identical attribute; bail.
	add_duplicate_attr_error(parser);
	reinitialize_tag_buffer(parser);
	tag_state->_drop_next_attr_value = true;
	return;
	}
	}

perf: html5 attribute parsing #3393

Are you sure you want to change the base?

perf: html5 attribute parsing #3393

Conversation

flavorjones commented Dec 26, 2024 • edited Loading

What problem is this PR intended to solve?

Some benchmarks

Analysis

hashmap

Have you included adequate test coverage?

Does this change affect the behavior of either the C or the Java implementations?

flavorjones commented Dec 27, 2024

stevecheckoway left a comment

Choose a reason for hiding this comment

stevecheckoway Dec 27, 2024

Choose a reason for hiding this comment

stevecheckoway Dec 27, 2024

Choose a reason for hiding this comment

stevecheckoway Dec 27, 2024

Choose a reason for hiding this comment

stevecheckoway Dec 27, 2024

Choose a reason for hiding this comment

stevecheckoway Dec 27, 2024

Choose a reason for hiding this comment

flavorjones commented Dec 27, 2024

stevecheckoway commented Dec 27, 2024

flavorjones commented Dec 26, 2024 •

edited

Loading