A History of HTML Parsing at Cloudflare: Part 1


To coincide with the launch of streaming HTML rewriting functionality for Cloudflare Workers we are open sourcing the Rust HTML rewriter (LOL HTML) used to back the Workers HTMLRewriter API. We also thought it was about time to review the history of HTML rewriting at Cloudflare.

The first blog post will explain the basics of a streaming HTML rewriter and our particular requirements. We start around 8 years ago by describing the group of ‘ad-hoc’ parsers that were created with specific functionality such as to rewrite e-mail addresses or minify HTML. By 2016 the state machine defined in the HTML5 specification could be used to build a single spec-compliant HTML pluggable rewriter, to replace the existing collection of parsers. The source code for this rewriter is now public and available here: https://github.com/cloudflare/lazyhtml.

The second blog post will describe the next iteration of rewriter. With the launch of the edge compute platform Cloudflare Workers we came to realise that developers wanted the same HTML rewriting capabilities with a JavaScript API. The post describes the thoughts behind a low latency streaming HTML rewriter with a CSS-selector based API. We open-sourced the Rust library as it can also be used as a stand-alone HTML rewriting/parsing library.

What is a streaming HTML rewriter ?

A streaming HTML rewriter takes either a HTML string or byte stream input, parses it into tokens or any other structured intermediate representation (IR) – such as an Abstract Syntax Tree (AST). It then performs transformations on the tokens before converting back to HTML. This provides the ability to modify, extract or add to an existing HTML document as the bytes are being processed. Compare this with a standard HTML tree parser which needs to retrieve the entire file to generate a full DOM tree. The tree-based rewriter will both take longer to deliver the first processed bytes and require significantly more memory.

HTML rewriter

For example; consider you own a large site with a lot of historical content that you want to now serve over HTTPS. You will quickly run into the problem of resources (images, scripts, videos) being served over HTTP. This ‘mixed content’ opens a security hole and browsers will warn or block these resources. It can be difficult or even impossible to update every link on every page of a website. With a streaming HTML rewriter you can select the URI attribute of any HTML tag and change any HTTP links to HTTPS. We built this very feature Automatic HTTPS rewrites back in 2016 to solve mixed content issues for our customers.

The reader may already be wondering: “Isn’t this a solved problem, aren’t there many widely used open-source browsers out there with HTML parsers that can be used for this purpose?”. The reality is that writing code to run in 190+ PoPs around the world with a strict low latency requirement turns even seemingly trivial problems into complex engineering challenges.

The following blog posts will detail the journey of how starting with a simple idea of finding email addresses within an HTML page led to building an almost spec compliant HTML parser and then on to a CSS selector matching Virtual Machine. We learned a lot on this journey. I hope you find some of this as interesting as we did.

Rewriting at the edge

When rewriting content through Cloudflare we do not want to impact site performance. The balance in designing a streaming HTML rewriter is to minimise the pause in response byte flow by holding onto as little information as possible whilst retaining the ability to rewrite matching tokens.

The difference in requirements compared to an HTML parser used in a browser include:

Output latency

For browsers, the Document Object Model (DOM) is the end product of the parsing process but in our case we have to parse, rewrite and serialize back to HTML. In the case of Cloudflare’s reverse proxy any content processing on the edge server results in latency between the server and an eyeball. It is desirable to minimize the latency impact of HTML handling, which involves parsing, rewriting and serializing back to HTML. In all of these stages we want to be as fast as possible to minimize latency.

Parser throughput

Let’s assume that usually browsers rarely need to deal with HTML pages bigger than 1Mb in size and an average page load time is somewhere around 3s at best. HTML parsing is not the main bottleneck of the page loading process as the browser will be blocked on running scripts and loading other render-critical resources. We can roughly estimate that ~3Mbps is an acceptable throughput for browser’s HTML parser. At Cloudflare we have hundreds of megabytes of traffic per CPU, so we need a parser that is faster by an order of magnitude.

Memory limitations

As most users must realise, browsers have the luxury of being able to consume memory. For example, this simple HTML markup when opened in a browser will consume a significant chunk of your system memory before eventually killing a browser tab (and all this memory will be consumed by the parser) :

<script>
   document.write('<');
   while(true) {
      document.write('aaaaaaaaaaaaaaaaaaaaaaaa');
   }
</script>

Unfortunately, buffering of some fraction of the input is inevitable even for streaming HTML rewriting. Consider these 2 HTML snippets:

<div foo="bar" qux="qux">

<div foo="bar" qux="qux"

These seemingly similar fragments of HTML will be treated completely differently when encountered at the end of an HTML page. The first fragment will be parsed as a start tag and the second one will be ignored. By just seeing a `<` character followed by a tag name, the parser can’t determine if it has found a start tag or not. It needs to traverse the input in the search of the closing `>` to make a decision, buffering all content in between, so it can later be emitted to the consumer as a start tag token.

This requirement forces browsers to indefinitely buffer content before eventually giving up with the out-of-memory error.

In our case, we can’t afford to spend hundreds of megabytes of memory parsing a single HTML file (actual constraints are even tighter – even using a dozen kilobytes for each request would be unacceptable). We need to be much more sophisticated than other implementations in terms of memory usage and gracefully handle all the situations where provided memory capacity is insufficient to accomplish parsing.

v0 : “Ad-hoc parsers”

As usual with big projects, it all started pretty innocently.

Find and obfuscate an email

In 2010, Cloudflare decided to provide a feature that would stop popular email scrapers. The basic idea of this protection was to find and obfuscate emails on pages and later decode them back in the browser with injected JavaScript code. Sounds easy, right? You search for anything that looks like an email, encode it and then decode it with some JavaScript magic and present the result to the end-user.

However, even such a seemingly simple task already requires solving several issues. First of all, we need to define what an email is, and there is no simple answer. Even the infamous regex supposedly covering the entire RFC is, in fact, outdated and incomplete as the new RFC added lots of valid email constructions, including Unicode support. Let’s not go down that rabbit hole for now and instead focus on a higher-level issue: transforming streaming content.

Content from the network comes in packets, which have to be buffered and parsed as HTTP by our servers. You can’t predict how the content will be split, which means you always need to buffer some of it because content that is going to be replaced can be present in multiple input chunks.

Let’s say we decided to go with a simple regex like `[w.][email protected][w.]+`. If the content that comes through contains the email “[email protected]”, it might be split in the following chunks:

In order to keep good Time To First Byte (TTFB) and consistent speed, we want to ensure that the preceding chunk is emitted as soon as we determine that it’s not interesting for replacement purposes.

The easiest way to do that is to transform our regex into a state machine, or a finite automata. While you could do that by hand, you will end up with hard-to-maintain and error-prone code. Instead, Ragel was chosen to transform regular expressions into efficient native state machine code. Ragel doesn’t try to take care of buffering or anything other than traversing the state machine. It provides a syntax that not only describes patterns, but can also associate custom actions (code in a host language) with any given state.

In our case we can pass through buffers until we match the beginning of an email. If we subsequently find out the pattern is not an email we can bail out from buffering as soon as the pattern stops matching. Otherwise, we can retrieve the matched email and replace it with new content.

To turn our pattern into a streaming parser we can remember the position of the potential start of an email and, unless it was already discarded or replaced by the end of the current input, store the unhandled part in a permanent buffer. Then, when a new chunk comes, we can process it separately, resuming from a state Ragel remembers itself, but then use both the buffered chunk and a new one to either emit or obfuscate.

Now that we have solved the problem of matching email patterns in text, we need to deal with the fact that they need to be obfuscated on pages. This is when the first hints of HTML “parsing” were introduced.

I’ve put “parsing” in quotes because, rather than implementing the whole parser, the email filter (as the module was called) didn’t attempt to replicate the whole HTML grammar, but rather added custom Ragel patterns just for skipping over comments and tags where emails should not be obfuscated.

This was a reasonable approach, especially back in 2010 – four years before the HTML5 specification, when all browsers had their own quirks handling of HTML. However, as you can imagine, this approach did not scale well. If you’re trying to work around quirks in other parsers, you start gaining more and more quirks in your own, and then work around these too. Simultaneously, new features started to be added, which also required modifying HTML on the fly (like automatic insertion of Google Analytics script), and an existing module seemed to be the best place for that. It grew to handle more and more tags, operations and syntactic edge cases.

Now let’s minify..

In 2011, Cloudflare decided to also add minification to allow customers to speed up their websites even if they had not employed minification themselves. For that, we decided to use an existing streaming minifier – jitify. It already had NGINX bindings, which made it a great candidate for integration into the existing pipeline.

Unfortunately, just like most other parsers from that time as well as ours described above, it had its own processing rules for HTML, JavaScript and CSS, which weren’t precise but rather tried to parse content on a best-effort basis. This led to us having two independent streaming parsers that were incompatible and could produce bugs either individually or only in combination.

v1 : “(Almost) HTML5 Spec compliant parser”

Over the years engineers kept adding new features to the ever-growing state machines, while fixing new bugs arising from imprecise syntax implementations, conflicts between various parsers, and problems in features themselves.

By 2016, it was time to get out of the multiple ad hoc parsers business and do things ‘the right way’.

The next section(s) will describe how we built our HTML5 compliant parser starting from the specification state machine. Using only this state machine it should have been straight-forward to build a parser. You may be aware that historically the parsing of HTML had not been entirely strict which meant to not break existing implementations the building of an actual DOM was required for parsing. This is not possible for a streaming rewriter so a simulator of the parser feedback was developed. In terms of performance, it is always better not to do something. We then describe why the rewriter can be ‘lazy’ and not perform the expensive encoding and decoding of text when rewriting HTML. The surprisingly difficult problem of deciding if a response is HTML is then detailed.

HTML5

By 2016, HTML5 had defined precise syntax rules for parsing and compatibility with legacy content and custom browser implementations. It was already implemented by all browsers and many 3rd-party implementations.

The HTML5 parsing specification defines basic HTML syntax in the form of a state machine. We already had experience with Ragel for similar use cases, so there was no question about what to use for the new streaming parser. Despite the complexity of the grammar, the translation of the specification to Ragel syntax was straightforward. The code looks simpler than the formal description of the state machine, thanks to the ability to mix regex syntax with explicit transitions.

A visualisation of a small fraction of the HTML state machine. Source: https://twitter.com/RReverser/status/715937136520916992

HTML5 parsing requires a ‘DOM’

However, HTML has a history. To not break existing implementations HTML5 is specified with recovery procedures for incorrect tag nesting, ordering, unclosed tags, missing attributes and all the other possible quirks that used to work in older browsers. In order to resolve these issues, the specification expects a tree builder to drive the lexer, essentially meaning you can’t correctly tokenize HTML (split into separate tags) without a DOM.

HTML parsing flow as defined by the specification

For this reason, most parsers don’t even try to perform streaming parsing and instead take the input as a whole and produce a document tree as an output. This is not something we could do for streaming transformation without adding significant delays to page loading.

An existing HTML5 JavaScript parser – parse5 – had already implemented spec-compliant tree parsing using a streaming tokenizer and rewriter. To avoid having to create a full DOM the concept of a “parser feedback simulator” was introduced.

Tree builder feedback

As you can guess from the name, this is a module that aims to simulate a full parser’s feedback to the tokenizer, without actually building the whole DOM, but instead preserving only the required information and context necessary for correctly driving the state machine.

After rigorous testing and upstreaming a test runner to parse5, we found this technique to be suitable for the majority of even poorly written pages on the Internet, and employed it in LazyHTML.

LazyHTML architecture

Avoiding decoding – everything is ASCII

Now that we had a streaming tokenizer working, we wanted to make sure that it was fast enough so that users didn’t notice any slowdowns to their pages as they go through the parser and transformations. Otherwise it would completely circumvent any optimisations we’d want to attempt on the fly.

It would not only cause a performance hit due to decoding and re-encoding any modified HTML content, but also significantly complicates our implementation due to multiple sources of potential encoding information required to determine the character encoding, including sniffing of the first 1 KB of the content.

The “living” HTML Standard specification permits only encodings defined in the Encoding Standard. If we look carefully through those encodings, as well as a remark on Character encodings section of the HTML spec, we find that all of them are ASCII-compatible with the exception of UTF-16 and ISO-2022-JP.

This means that any ASCII text will be represented in such encodings exactly as it would be in ASCII, and any non-ASCII text will be represented by bytes outside of the ASCII range. This property allows us to safely tokenize, compare and even modify original HTML without decoding or even knowing which particular encoding it contains. It is possible as all the token boundaries in HTML grammar are represented by an ASCII character.

We need to detect UTF-16 by sniffing and either decode or skip such documents without modification. We chose the latter to avoid potential security-sensitive bugs which are common with UTF-16, and because the character encoding is seen in less than 0.1% of known character encodings luckily.

The only issue left with this approach is that in most places the HTML tokenization specification requires you to replace U+0000 (NUL) characters with U+FFFD (replacement character) during parsing. Presumably, this was added as a security precaution against bugs in C implementations of old engines which could treat NUL character, encoded in ASCII / UTF-8 / … as a 0x00 byte, as the end of the string (yay, null-terminated strings…). It’s problematic for us because U+FFFD is outside of the ASCII range, and will be represented by different sequences of bytes in different encodings. We don’t know the encoding of the document, so this will lead to corruption of the output.

Luckily, we’re not in the same business as browser vendors, and don’t worry about NUL characters in strings as much – we use “fat pointer” string representation, in which the length of the string is determined not by the position of the NUL character, but stored along with the data pointer as an integer field:

typedef struct {
   const char *data;
   size_t length;
} lhtml_string_t;

Instead, we can quietly ignore these parts of the spec (sorry!), and keep U+0000 characters as-is and add them as such to tag, attribute names, and other strings, and later re-emit to the document. This is safe to do, because it doesn’t affect any state machine transitions, but merely preserves original 0x00 bytes and delegates their replacement to the parser in the end user’s browser.

Content type madness

We want to be lazy and minimise false positives. We only want to spend time parsing, decoding and rewriting actual HTML rather than breaking images or JSON. So the question is how do you decide if something is a HTML document. Can you just use the Content-Type for example ? A comment left in the source code best describes the reality.

/*
Dear future generations. I didn't like this hack either and hoped
we could do the right thing instead. Unfortunately, the Internet
was a bad and scary place at the moment of writing. If this
ever changes and websites become more standards compliant,
please do remove it just like I tried.
Many websites use PHP which sets Content-Type: text/html by
default. There is no error or warning if you don't provide own
one, so most websites don't bother to change it and serve
JSON API responses, private keys and binary data like images
with this default Content-Type, which we would happily try to
parse and transforms. This not only hurts performance, but also
easily breaks response data itself whenever some sequence inside
it happens to look like a valid HTML tag that we are interested
in. It gets even worse when JSON contains valid HTML inside of it
and we treat it as such, and append random scripts to the end
breaking APIs critical for popular web apps.
This hack attempts to mitigate the risk by ensuring that the
first significant character (ignoring whitespaces and BOM)
is actually `<` - which increases the chances that it's indeed HTML.
That way we can potentially skip some responses that otherwise
could be rendered by a browser as part of AJAX response, but this
is still better than the opposite situation.
*/

The reader might think that it’s a rare edge case, however, our observations show that almost 25% of the traffic served through Cloudflare with the “text/html” content type is unlikely to be HTML.

The trouble doesn’t end there: it turns out that there is a considerable amount of XML content served with the “text/html” content type which can’t be always processed correctly when treated as HTML.

Over time bailouts for binary data, JSON, AMP and correctly identifying HTML fragments leads to the content sniffing logic which can be described by the following diagram:

This is a good example of divergence between formal specifications and reality.

Tag name comparison optimisation

But just having fast parsing is not enough – we have functionality that consumes the output of the parser, rewrites it and feeds it back for the serialization. And all the memory and time constraints that we have for the parser are applicable for this code as well, as it is a part of the same content processing pipeline.

It’s a common requirement to compare parsed HTML tag names, e.g. to determine if the current tag should be rewritten or not. A naive implementation will use regular per-byte comparison which can require traversing the whole tag name. We were able to narrow this operation to a single integer comparison instruction in the majority of cases by using specially designed hashing algorithm.

The tag names of all standard HTML elements contain only alphabetical ASCII characters and digits from 1 to 6 (in numbered header tags, i.e. <h1> – <h6>). Comparison of tag names is case-insensitive, so we only need 26 characters to represent alphabetical characters. Using the same basic idea as arithmetic coding, we can represent each of the possible 32 characters of a tag name using just 5 bits and, thus, fit up to floor(64 / 5) = 12 characters in a 64-bit integer which is enough for all the standard tag names and any other tag names that satisfy the same requirements! The great part is that we don’t even need to additionally traverse a tag name to hash it – we can do that as we parse the tag name consuming the input byte by byte.

However, there is one problem with this hashing algorithm and the culprit is not so obvious: to fit all 32 characters in 5 bits we need to use all possible bit combinations including 00000. This means that if the leading character of the tag name is represented with 00000 then we will not be able to differentiate between a varying number of consequent repetitions of this character.

For example, considering that ‘a’ is encoded as 00000 and ‘b’ as 00001 :

Tag nameBit representationEncoded value
ab00000 000011
aab00000 00000 000011

Luckily, we know that HTML grammar doesn’t allow the first character of a tag name to be anything except an ASCII alphabetical character, so reserving numbers from 0 to 5 (00000b-00101b) for digits and numbers from 6 to 31 (00110b – 11111b) for ASCII alphabetical characters solves the problem.

LazyHTML

After taking everything mentioned above into consideration the LazyHTML (https://github.com/cloudflare/lazyhtml) library was created. It is a fast streaming HTML parser and serializer with a token based C-API derived from the HTML5 lexer written in Ragel. It provides a pluggable transformation pipeline to allow multiple transformation handlers to be chained together.

An example of a function that transforms `href` property of links:

// define static string to be used for replacements
static const lhtml_string_t REPLACEMENT = {
   .data = "[REPLACED]",
   .length = sizeof("[REPLACED]") - 1
};

static void token_handler(lhtml_token_t *token, void *extra /* this can be your state */) {
  if (token->type == LHTML_TOKEN_START_TAG) { // we're interested only in start tags
    const lhtml_token_starttag_t *tag = &token->start_tag;
    if (tag->type == LHTML_TAG_A) { // check whether tag is of type <a>
      const size_t n_attrs = tag->attributes.count;
      const lhtml_attribute_t *attrs = tag->attributes.items;
      for (size_t i = 0; i < n_attrs; i++) { // iterate over attributes
        const lhtml_attribute_t *attr = &attrs[i];
        if (lhtml_name_equals(attr->name, "href")) { // match the attribute name
          attr->value = REPLACEMENT; // set the attribute value
        }
      }
    }
  }
  lhtml_emit(token, extra); // pass transformed token(s) to next handler(s)
}

So, is it correct and how fast is it?

It is HTML5 compliant as tested against the official test suites. As part of the work several contributions were sent to the specification itself for clarification / simplification of the spec language.

Unlike the previous parser(s), it didn’t bail out on any of the 2,382,625 documents from HTTP Archive, although 0.2% of documents exceeded expected bufferization limits as they were in fact JavaScript or RSS or other types of content incorrectly served with Content-Type: text/html, and since anything is valid HTML5, the parser tried to parse e.g. a<b; x=3; y=4 as incomplete tag with attributes. This is very rare (and goes to even lower amount of 0.03% when two error-prone advertisement networks are excluded from those results), but still needs to be accounted for and is a valid case for bailing out.

As for the benchmarks, In September 2016 using an example which transforms the HTML spec itself (7.9 MB HTML file) by replacing every <a href> (only that property only in those tags) to a static value. It was compared against the few existing and popular HTML parsers (only tokenization mode was used for the fair comparison, so that they don’t need to build AST and so on), and timings in milliseconds for 100 iterations are the following (lazy mode means that we’re using raw strings whenever possible, the other one serializes each token just for comparison):

The results show that LazyHTML parser speeds are around an order of magnitude faster.

That concludes the first post in our series on HTML rewriters at Cloudflare. The next post describes how we built a new streaming rewriter on top of the ideas of LazyHTML. The major update was to provide an easier to use CSS selector API. It provides the back-end for the Cloudflare workers HTMLRewriter JavaScript API.



Source link