How HTML up to 4.01 is out of line with browsers
May 30th, 2008
The HTML specifications, versions 2.0 through 4.01, define a language that differs from the one actually read and interpreted by current HTML browsers. This means that it is entirely possible to write syntactically correct HTML that significantly breaks pages in web browsers, just as it’s possible to write incorrect HTML syntax that works fine in browsers. HTML up to version 4.01 is effectively a lame duck specification, not representing HTML as it is used in practice.
HTML5 aims to change that.
What’s understood by browsers isn’t necessarily HTML
There are a great many things which are understood well by browsers, but which aren’t and never have been valid. For example:
<b><p>This is a paragraph</p></b>
This is invalid HTML, but browsers interpret it as a bold paragraph, forgiving the fact that you can’t put a paragraph inside a <b> element.
<b><i>These are incorrectly nested </b></i>
In the above code, the </b> and </i> tags would make no sense at all to an HTML validator, or to a browser that simply adhered to the HTML specification. This is invalid HTML, yet browsers don’t seem to mind. Their parsers have evolved to cope.
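To make the gap between "tolerated" and "correctly nested" concrete, here is a minimal sketch in Python. It uses the standard library's html.parser module, which tokenises tag soup without complaint, and adds a simple stack check on top. The names NestingChecker and find_misnested are made up for this illustration, and the check is far cruder than a real validator: it knows nothing about the DTD, and it would wrongly flag elements such as <br> that never get an end tag.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that close the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.problems = []    # (end tag seen, element that was actually innermost)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            # The end tag matches the innermost open element: properly nested.
            self.open_tags.pop()
            return
        innermost = self.open_tags[-1] if self.open_tags else None
        self.problems.append((tag, innermost))
        # Recover roughly the way a forgiving parser might: drop the element anyway.
        if tag in self.open_tags:
            self.open_tags.remove(tag)


def find_misnested(markup):
    checker = NestingChecker()
    checker.feed(markup)   # the tokeniser itself raises no error on tag soup
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(find_misnested("<b><i>These are incorrectly nested </b></i>"))
    # prints [('b', 'i')]
```

The tokeniser accepts the markup happily, much as a browser does. It is only the extra bookkeeping that notices </b> arriving while <i> is still the innermost open element.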
Where the misconception about XHTML and HTML differences comes from
It is a common misconception that code such as the above is acceptable in HTML but not in XHTML. This has led many to believe that XHTML is stricter than HTML, but that isn’t the case: such code is equally invalid in both HTML and XHTML. The belief stems partly from a misunderstanding of HTML, one based on knowledge of what works in browsers rather than on knowledge of the HTML specification.
For example, this very popular tutorial lists four important differences between HTML and XHTML, but in reality, all but one of those four are not differences but are exactly the same requirements in valid HTML and XHTML! The only difference that is actually a difference, by the way, is that in XHTML all element names and attributes must be lowercase.
In some respects, XHTML is considerably more lenient than HTML.
- XHTML does not enforce restrictions such as not allowing an <a> element inside another <a> element when there’s another element in between. This is due to XHTML’s relaxed DTD syntax, which does not feature inclusions or exclusions.
- XHTML has other subtle relaxations in its content model. For example, in HTML a <fieldset> must contain one and only one <legend> as its first child element, yet in XHTML you can place the legend anywhere you want within the fieldset, have more than one, or even omit the legend altogether.
What’s valid HTML isn’t necessarily understood correctly by browsers
Just as there are many things that are accepted by browsers when interpreting HTML, but aren’t defined in an HTML specification and aren’t accepted by HTML validators, there are also things which are defined in HTML standards and accepted by validators, but which would be significantly broken in browsers.
Dark Side of the HTML was a quick blog post written in 1995 (well, it wouldn’t have been called a blog post then) that illustrated the differences in syntax between what we on the web have come to accept as HTML, and what the specifications say HTML is. Its author read enough of the HTML specification to be dangerous, and discovered that the browsers of the time didn’t support most HTML syntax.
From version 2.0 up to version 4.01 (the current version), HTML is based on SGML, a generalised markup language from which HTML derives the concepts of elements, tags, attributes, opening brackets (’<’), DTDs, and a hierarchical structure. When an HTML validator validates an HTML page, it does so according to the SGML syntax and the SGML DTD associated with the page.
However, there are a great many things that are valid HTML but which aren’t understood, or are understood incorrectly, by web browsers. The situation hasn’t changed since 1995; browsers of today still parse HTML in largely the same way as they did back then, with a few exceptions.
Here are some examples of valid HTML, as taken from Dark Side of the HTML:
<H1/Header with null-end tag/
The above line shows a minimised tag form. It represents an H1 element (heading level one) with the content “Header with null-end tag”. The slash at the end is the null end tag, which closes the element. No closing bracket is required after that slash.
An HTML validator, including the W3C validator, will happily accept that as valid HTML code and interpret it as a properly closed H1 element containing some valid text. Of course, that syntax isn’t accepted by web browsers, which don’t support tag minimisation (even though it is defined as enabled in the HTML DTD).
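As a rough illustration of what a validator understands that line to mean, the following Python sketch rewrites the minimised form into the long form that browsers expect. The function name expand_net and the regular expression are assumptions made for this example. It is a toy text rewrite that handles only the simplest case (no attributes, no nested markup), not an SGML parser.

```python
import re

# Matches "<NAME/ ... /" where the content holds no further slash or tag.
# This pattern is an illustrative simplification, not the SGML grammar.
NET_PATTERN = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(markup):
    """Hypothetical helper: turn <H1/text/ into <H1>text</H1>."""
    return NET_PATTERN.sub(r"<\1>\2</\1>", markup)


if __name__ == "__main__":
    print(expand_net("<H1/Header with null-end tag/"))
    # prints <H1>Header with null-end tag</H1>
```

Ordinary markup such as <p>plain</p> passes through unchanged, because the pattern only fires when a slash directly follows the element name.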
This text is <b<i> bold and italic at once </i</b>
The above line shows tags without a closing bracket. In valid HTML, it is unnecessary to close a tag using a closing bracket (’>’) if the tag is immediately followed by the opening of another tag. Again, validators will accept this because it is valid HTML, but your browser probably won’t.
<UL>
<LI> this is the first item of the list
<> this is second one
</UL>
A tricky feature of HTML is that the parser needs to understand the content model (what can go where) of the document while it is parsing. For example, to parse the above it needs to understand that only <LI> elements are allowed inside a <UL> element.
The second list item above uses an empty tag! It has no name, but it is valid HTML because the parser cannot be confused about what it represents: it must be the beginning of an <LI> element, because that’s all it could be according to HTML’s content model. Again, most browsers are unlikely to support this type of syntax, though it’ll validate fine. End tags can also be left empty in the same way, as in ‘</>’.
What to do?
Unfortunately, even if we create valid HTML code as verified by an HTML validator, we cannot be sure that we haven’t written something the web browser will fail to parse properly. There is no existing HTML validator that validates pages the way a web browser would interpret them.
Should we fix the validators?
Validators operate on the important principle that they validate based on a defined standard. Even if we were to come up with a validator that parsed HTML in the way that a browser would, there is no such defined standard for this syntax. For a start, there isn’t a consensus among browsers.
Should we move to a different, easier to define syntax?
XHTML is based not on the SGML syntax but on a much simpler one to parse: XML.
Browsers are yet to support XHTML in droves, but if and when they do, we are unlikely to run into the situation where the XHTML that we come to expect based on browser capabilities is vastly different to the XHTML as defined by specifications and tested by validators.
For the moment, though, we can’t yet use XHTML on the general web due to lack of browser support, though we can (and scores of web designers do) dabble in HTML-compatible XHTML, and we can use XHTML internally for our own processing needs.
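A quick way to see why an XML-based syntax leaves less room for drift is to hand the earlier examples to a stock XML parser. The sketch below uses Python's standard xml.etree.ElementTree. The helper name is_well_formed is made up for this illustration, and the check covers XML well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Hypothetical helper: True if an XML parser accepts the markup at all."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Properly nested markup parses.
    print(is_well_formed("<p><b><i>text</i></b></p>"))
    # Misnested end tags are rejected outright.
    print(is_well_formed("<p><b><i>text</b></i></p>"))
    # The SGML shorthand forms are not XML syntax either.
    print(is_well_formed("<H1/Header with null-end tag/"))
```

Only the first call returns True. An XML parser stops at the first well-formedness error instead of guessing, so there is no forgiving dialect for authors to come to rely on.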
Should we rewrite the HTML specification to match the browser’s version of HTML?
This is precisely the approach that HTML5 (currently in draft form) is taking with its syntax. Instead of reusing the SGML-based syntax that has defined HTML since version 2.0, it defines its own newer, simpler syntax, which is intended to be compatible with the way current browsers interpret HTML.
The benefit of this is that as soon as the standard comes out, it will already be parse-able by all of the current browsers, because it is simply defining what they already expect. It is intended as a full replacement of previous HTML versions, with enough backwards-compatibility that existing (HTML 4.01 and lower) documents can be interpreted under the new rules and still work.
HTML5 also goes a long way towards not only dictating what the valid syntax is, but how user agents (such as browsers) should interpret it, going so far as to define what should happen at each step of the way through parsing each tag. This means that we also have a clear definition, for the first time, of what should happen when browsers encounter invalid code. Again, this will all be done with a view to matching the way most current browsers operate.
