Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: html5 attribute parsing #3393

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

flavorjones
Copy link
Member

@flavorjones flavorjones commented Dec 26, 2024

What problem is this PR intended to solve?

Addressing quadratic performance of HTML5 attribute parsing as described in #2568 and rubys/nokogumbo#144.

Closes #2568

Some benchmarks

Here is the benchmark script.
#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "benchmark-ips"
end

def html_with_attributes(count)
  html = "<div"
  count.times do |j|
    html << " fake-attr-#{j}"
  end
  html << ">"
end

Benchmark.ips do |x|
  x.warmup = 0

  [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 ].each do |attribute_count|
    html = html_with_attributes(attribute_count)

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html, max_attributes: 100_000)
    end
  end
end

before:

Calculating -------------------------------------
        1 attributes    242.976k (± 9.3%) i/s    (4.12 μs/i) -    936.599k in   4.838811s
        2 attributes    219.456k (± 9.4%) i/s    (4.56 μs/i) -    837.644k in   4.858317s
        4 attributes    191.173k (± 9.3%) i/s    (5.23 μs/i) -    715.392k in   4.882336s
        8 attributes    144.813k (±12.2%) i/s    (6.91 μs/i) -    516.194k in   4.912577s
       16 attributes     83.396k (±17.2%) i/s   (11.99 μs/i) -    238.564k in   4.958331s
       32 attributes     55.800k (± 9.9%) i/s   (17.92 μs/i) -    190.871k in   4.970063s
       64 attributes     27.381k (±14.7%) i/s   (36.52 μs/i) -     91.867k in   4.981936s
      128 attributes     12.707k (± 9.4%) i/s   (78.70 μs/i) -     49.906k in   4.990254s
      256 attributes      4.723k (± 8.1%) i/s  (211.74 μs/i) -     19.657k in   4.995888s
      512 attributes      1.417k (±10.6%) i/s  (705.61 μs/i) -      6.339k in   4.998084s
     1024 attributes    382.078 (±11.5%) i/s    (2.62 ms/i) -      1.811k in   5.000356s
     2048 attributes    119.825 (±10.8%) i/s    (8.35 ms/i) -    582.000 in   5.003555s
     4096 attributes     29.034 (± 6.9%) i/s   (34.44 ms/i) -    144.000 in   5.003602s
     8192 attributes      6.492 (± 0.0%) i/s  (154.04 ms/i) -     33.000 in   5.111381s
    16384 attributes      1.478 (± 0.0%) i/s  (676.68 ms/i) -      8.000 in   5.417232s

after:

Calculating -------------------------------------
        1 attributes    229.084k (±10.4%) i/s    (4.37 μs/i) -    868.071k in   4.846385s
        2 attributes    213.788k (± 9.9%) i/s    (4.68 μs/i) -    804.683k in   4.859819s
        4 attributes    181.602k (±10.7%) i/s    (5.51 μs/i) -    652.740k in   4.886700s
        8 attributes    137.811k (± 9.9%) i/s    (7.26 μs/i) -    500.690k in   4.911176s
       16 attributes     95.462k (± 9.9%) i/s   (10.48 μs/i) -    338.045k in   4.938699s
       32 attributes     61.388k (± 9.5%) i/s   (16.29 μs/i) -    216.911k in   4.961587s
       64 attributes     34.368k (±10.2%) i/s   (29.10 μs/i) -    121.985k in   4.976219s
      128 attributes     17.315k (±10.9%) i/s   (57.75 μs/i) -     60.716k in   4.987542s
      256 attributes      9.011k (± 9.2%) i/s  (110.98 μs/i) -     32.443k in   4.992878s
      512 attributes      4.552k (±11.8%) i/s  (219.69 μs/i) -     16.522k in   4.995834s
     1024 attributes      2.255k (±13.6%) i/s  (443.38 μs/i) -      8.303k in   4.997861s
     2048 attributes      1.091k (±17.6%) i/s  (916.91 μs/i) -      4.132k in   4.998710s
     4096 attributes    544.840 (±22.0%) i/s    (1.84 ms/i) -      2.101k in   5.006943s
     8192 attributes    282.789 (±22.3%) i/s    (3.54 ms/i) -      1.089k in   5.000604s
    16384 attributes    134.991 (±27.4%) i/s    (7.41 ms/i) -    542.000 in   5.001360s

In graphical form:

image

Just the "after", showing linear performance now:

image

Analysis

Analyzing the parsing of a tag with 16000 attributes (using samply), we see two hotspots:

image

  1. the strlen/memcmp checks in libgumbo's tokenizer.c:finish_attribute_name which checks for duplicates in the attribute list:
    for (unsigned int i = 0; i < attributes->length; ++i) {
    GumboAttribute* attr = attributes->data[i];
    if (
    strlen(attr->name) == tag_state->_buffer.length
    && 0 == memcmp (
    attr->name,
    tag_state->_buffer.data,
    tag_state->_buffer.length
    )
    ) {
    // Identical attribute; bail.
    add_duplicate_attr_error(parser);
    reinitialize_tag_buffer(parser);
    tag_state->_drop_next_attr_value = true;
    return;
    }
    }
  2. libxml2's xmlNewPropInternal traversing the singly-linked properties list to append each new property (to maintain parse order)

This PR addresses (1) by introducing a hashmap and using it if there are 16 or more attributes.

This PR addresses (2) by inlining most of xmlNewNsProp but keeping a pointer to the last property to avoid re-traversing the linked list each time.

After those changes, the flamegraph looks much better, with no obvious hotspots:

image

hashmap

Keeping in mind we may still someday want to publish our libgumbo fork as a separate library, I opted to bring in a new self-contained hashmap library from https://github.com/tidwall/hashmap.c

I'm using xxHash3 which seems to be pretty fast, but I'm open to other opinions on what hash function to use.

Have you included adequate test coverage?

Yes.

Does this change affect the behavior of either the C or the Java implementations?

HTML5 is CRuby-only.

@flavorjones flavorjones changed the title Flavorjones gumbo attributes perf perf: html5 attribute parsing Dec 26, 2024
@flavorjones flavorjones force-pushed the flavorjones-gumbo-attributes-perf branch 2 times, most recently from 4f6df29 to d666fa1 Compare December 26, 2024 21:27
@flavorjones flavorjones added this to the v1.19.0 milestone Dec 26, 2024
Introduce code that is an optimized version of libxml2's xmlNewNsProp
to avoid traversing the properties linked list to append an attribute
every time we add one.
If there are more than 16 attributes, shift from doing strcmp to using
a hashmap for duplicate detection. The number 16 was chosen based on
the benchmark in #2568

I've introduced @tidwall's hashmap.c (MIT licensed and the copyright
appropriately copied in the LICENSE-DEPENDENCIES file) to have
something self-contained within the libgumbo codebase, rather than
using libxml2's xmlHash or ruby's st.c.
@flavorjones flavorjones force-pushed the flavorjones-gumbo-attributes-perf branch from d666fa1 to 41c601d Compare December 27, 2024 14:21
@flavorjones
Copy link
Member Author

@stevecheckoway I would really love your feedback on this when you get a few minutes. ❤️

Copy link
Contributor

@stevecheckoway stevecheckoway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you looked into fixing this. I was hesitant to reach into the internals of libxml2. It's often unclear which fields of the structs are not going to change.

if ((doc != NULL) && (doc->dict != NULL)) {
cur->name = (xmlChar *) xmlDictLookup(doc->dict, name, -1);
} else {
cur->name = xmlStrdup(name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is can return NULL. libxml2 handles it by freeing the newly allocated attribute and returning NULL. https://github.com/GNOME/libxml2/blob/260954c5660315dd5775e597c9621e07acab4ff2/tree.c#L1669

}

if (doc != NULL) {
int res = xmlIsID(doc, node, cur);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also fail when allocating memory.

int res = xmlIsID(doc, node, cur);

if (res == 1) {
xmlAddIDSafe(cur, value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one as well.

#ifndef HAVE_XMLADDIDSAFE
int
xmlAddIDSafe(xmlAttrPtr attr, const xmlChar *value) {
xmlIDPtr id = xmlAddID(NULL, attr->doc, value, attr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also fail and return NULL due to allocation failure. I think that in this case that will cause xmlAddIDSafe to return that the ID already exists.

GumboStringSet *
gumbo_string_set_new(size_t cap)
{
return (GumboStringSet*)hashmap_new(sizeof(char *), cap, SEED0, SEED1, string_hash, string_compare, NULL, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast for the return value shouldn't be necessary.

@flavorjones
Copy link
Member Author

@stevecheckoway Thanks for the review. Yes, I was sloppy with malloc failures to get something up and running; I'll fix.

I was super curious about what you think of introducing the hashmap into libgumbo? Just want to make sure you don't think it's a silly thing to do.

@stevecheckoway
Copy link
Contributor

@flavorjones I think adding the hashmap is better than going with the red-black tree I implemented for this exact task before I discovered that libxml2 also had quadratic behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix O(n^2) behavior when checking for duplicate attributes (libgumbo)
2 participants