How HTML up to 4.01 is out of line with browsers

May 30th, 2008

The HTML specifications, versions 2.0 through 4.01, define a language that differs from the one actually read and interpreted by current HTML browsers.  This means that it is entirely possible to write correct HTML syntax which significantly breaks pages in web browsers, just as it’s possible to write incorrect HTML syntax which works fine in browsers.  HTML up to version 4.01 is effectively a lame duck specification, not representing HTML as it is used in practice.

HTML5 aims to change that.

What’s understood by browsers isn’t necessarily HTML

There are a great many things which are understood well by browsers, but which aren’t and never have been valid.  For example:

<b><p>This is a paragraph</p></b>

This is invalid HTML, but browsers interpret it as a bold paragraph, forgiving the fact that you can’t put a paragraph inside a <b> element.

<b><i>These are incorrectly nested </b></i>

In the above code, the </b> and </i> tags would make no sense at all to an HTML validator, or to a browser that simply adhered to the HTML specification.  This is invalid HTML, yet browsers don’t seem to mind.  Their parsers have evolved to cope.
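To make the contrast concrete, here is a minimal sketch in Python.  It is only an illustration: Python’s html.parser is not a browser parser (it reports tags without repairing the tree), and the XML parser merely stands in for a strict, by-the-book parser rather than an SGML validator.  The names TagLogger, lenient_events and strict_parse_ok are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

MISNESTED = "<b><i>These are incorrectly nested </b></i>"


class TagLogger(HTMLParser):
    """Records every tag event without checking that tags nest properly."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(markup):
    """Tokenise the markup the forgiving way and return the tag events seen."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def strict_parse_ok(markup):
    """Return True only if a strict (XML) parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(lenient_events(MISNESTED))   # the tags are reported in source order
print(strict_parse_ok(MISNESTED))  # the strict parser rejects the overlap
```

The forgiving tokeniser simply carries on past the overlap, while the strict parser refuses the whole input.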

Where the misconception about XHTML and HTML differences comes from

It is a common misconception that code such as the above is acceptable in HTML but not in XHTML, which is what has led many to believe that XHTML is stricter than HTML, but this isn’t the case. Such code is equally invalid in both HTML and XHTML.  Such a belief stems partly from a misunderstanding of HTML - for example, based on knowledge of what HTML works in browsers rather than a knowledge of the HTML specification.

For example, this very popular tutorial lists four important differences between HTML and XHTML, but in reality, all but one of those four are not differences but are exactly the same requirements in valid HTML and XHTML!  The only difference that is actually a difference, by the way, is that in XHTML all element names and attributes must be lowercase.

In some respects, XHTML is considerably more lenient than HTML.

  • XHTML does not enforce restrictions such as not allowing an <a> element inside another <a> element when there’s another element in between.  This is due to XHTML’s relaxed DTD syntax, which does not feature inclusions or exclusions.
  • XHTML has other subtle relaxations in content model - for example, in HTML a <fieldset> must contain one and only one <legend> as its first child element, yet in XHTML you can place the legend anywhere you want within the fieldset, have more than one, or even omit the legend altogether.
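As a rough sketch of the first point: the HTML 4.01 DTD declares the A element with the exclusion -(A), which applies at any depth, so a check for it has to look at every ancestor and not just the parent.  The function below is a hypothetical helper written for this example (it is not part of any validator), and it assumes well-formed, lowercase, un-namespaced markup so that Python’s XML parser can read it.

```python
import xml.etree.ElementTree as ET


def has_nested_anchor(markup):
    """Return True if any <a> element has another <a> as an ancestor."""
    root = ET.fromstring(markup)

    def walk(element, inside_anchor):
        is_anchor = element.tag == "a"
        if is_anchor and inside_anchor:
            return True
        # Carry the "inside an anchor" state down to every descendant.
        return any(walk(child, inside_anchor or is_anchor) for child in element)

    return walk(root, False)


# An <a> inside a <span> inside an <a>: the case the exclusion is there to catch.
print(has_nested_anchor('<p><a href="#x"><span><a href="#y">inner</a></span></a></p>'))
```

A DTD without exclusions can only say which children an element may have directly, which is why the intervening <span> is enough to slip past it.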

What’s valid HTML isn’t necessarily understood correctly by browsers

Just as there are many things that are accepted by browsers when interpreting HTML, but aren’t defined in an HTML specification and aren’t accepted by HTML validators, there are also things which are defined in HTML standards and accepted by validators, but which would be significantly broken in browsers.

Dark Side of the HTML was a quick blog post written in 1995 (well, it wouldn’t have been called a blog post then) and illustrated the differences in syntax between what we on the web have come to accept is HTML, and what the specifications say HTML is.  Its author read enough of the HTML specification to be dangerous, and discovered that the browsers of the time didn’t support most HTML syntax.

HTML, as of version 2.0 (until version 4.01, the current version) is based on SGML, a generalised markup language from which HTML derives the concept of elements, tags, attributes, opening brackets (’<’), DTDs, and a hierarchical structure.  When an HTML validator validates an HTML page, it does so according to the SGML syntax and the SGML DTD associated with the page.

However, there are a great many things that are valid HTML but which aren’t understood, or are understood incorrectly, by web browsers.  The situation hasn’t changed since 1995; browsers of today still parse HTML in largely the same way as they did back then, with a few exceptions.

Here are some examples of valid HTML, as taken from Dark Side of the HTML:

<H1/Header with null-end tag/

The above line shows a minimised tag form.  It represents an H1 element (heading level one) with the content “Header with null-end tag”.  The slash at the end signifies the end of the element.  No closing bracket is required after that slash.

An HTML validator, including the W3C validator, will happily accept that as valid HTML code and interpret it as a properly closed H1 element containing some valid text. Of course, that syntax isn’t accepted by web browsers, who don’t support tag minimisation (even though it is defined as enabled in the HTML DTD).
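To show what the shorthand stands for, here is a small sketch that rewrites the simplest form of a null-end tag into the long form browsers understand.  It is a toy, not an SGML parser: the pattern and the name expand_net are invented for this example, and it deliberately ignores attributes and nested markup.

```python
import re

# A tag name followed directly by "/", then plain text up to the next "/".
NET_PATTERN = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(markup):
    """Rewrite <NAME/content/ as <NAME>content</NAME>."""
    return NET_PATTERN.sub(r"<\1>\2</\1>", markup)


print(expand_net("<H1/Header with null-end tag/"))
```

Ordinary tags pass through untouched, since the pattern only matches when the slash comes straight after the element name.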

This text is <b<i> bold and italic at once </i</b>

The above line shows tags without a closing bracket.  In valid HTML, it is unnecessary to close a tag using a closing bracket (’>’) if the tag is immediately followed by the opening of another tag.  Again, validators will accept this because it is valid HTML, but your browser probably won’t.

<UL>
<LI> this is the first item of the list
<> this is second one
</UL>

A tricky feature of HTML is that the parser needs to understand the content model (what can go where) of the document while it is parsing.  For example, to parse the above it needs to understand that only <LI> elements are allowed inside a <UL> element.

The second list item above uses an empty tag!  It has no name, but it is valid HTML because there is no room for confusion by the parser that it actually represents the beginning of an <LI> element, because that’s all it could be according to HTML’s content model.  Again, most browsers are unlikely to support this type of syntax, though it’ll validate fine.  End tags can also be left empty in the same way, as in ‘</>’.

What to do?

Unfortunately, even if we create valid HTML code as verified by an HTML validator, we cannot know that we haven’t done something which the web browser won’t parse properly.  There is no existing HTML validator that validates pages the way a web browser would interpret them.

Should we fix the validators?

Validators operate on the important principle that they validate based on a defined standard.  Even if we were to come up with a validator that parsed HTML in the way that a browser would, there is no such defined standard for this syntax.  For a start, there isn’t a consensus among browsers.

Should we move to a different, easier to define syntax?

XHTML is based not on the SGML syntax but on a much simpler one to parse: XML.

Browsers are yet to support XHTML in droves, but if and when they do, we are unlikely to run into the situation where the XHTML that we come to expect based on browser capabilities is vastly different to the XHTML as defined by specifications and tested by validators.

For the moment, though, we can’t yet use XHTML on the general web due to lack of browser support, though we can (and scores of web designers do) dabble in HTML-compatible XHTML, and we can use XHTML internally for our own processing needs.

Should we rewrite the HTML specification to match the browser’s version of HTML?

This is precisely the approach that HTML5 is taking (currently in draft form) with its syntax. Instead of using the same syntax of HTML, based on SGML, which has been defined since HTML 2.0, it defines its own newer, simpler syntax which is intended to be compatible with the way current browsers interpret HTML.

The benefit to this is that as soon as the standard comes out, it will already be parse-able by all of the current browsers, because it is simply defining what they already expect.  It is intended as a full replacement of previous HTML versions, with enough backwards-compatibility that existing (HTML 4.01 and lower) documents can be interpreted under the new rules and still work.

HTML5 also goes a long way towards not only dictating what the valid syntax is, but how user agents (such as browsers) should interpret it, going so far as to define what should happen at each step of the way through parsing each tag.  This means that we also have a clear definition, for the first time, of what should happen when browsers encounter invalid code. Again, this will all be done with a view to matching the way most current browsers operate.

XHTML on the web seemed like a good idea at the time

May 23rd, 2008

XHTML for the web seemed like a great idea 8 years ago.

Even though it uses a completely different syntax to HTML, it was designed to be somewhat backwards-compatible, a good idea given that browsers did not support XHTML at the time, and given that content authors would want to use XHTML eventually.  XHTML 1.0, published in 2000, included an Appendix called HTML Compatibility Guidelines which described how to write a restrictive form of XHTML that didn’t choke when it was labelled as HTML and fed to an HTML browser.

These compatibility guidelines worked because:

  • XHTML 1.0 has the same (almost identical) content model as HTML 4.01, which is well supported by browsers.  That is, it has the same elements and attributes.  HTML browsers would not notice any elements or attributes they didn’t understand.
  • XHTML syntax is similar enough to HTML syntax that if you tell browsers it’s HTML, they don’t have too many problems parsing it if you follow the compatibility guidelines carefully.  For example, current browsers don’t seem to mind that their ‘HTML’ contains trailing slashes within their empty elements, as they are just seen as invalid characters and ignored.

Content authors started to use XHTML 1.0, following the compatibility guidelines to ensure they could tell the current round of browsers it was HTML and it would still work.  The point of doing so, I believe, was to prepare themselves for the day when browsers supported XHTML, making it easier to move across to actual XHTML.

However, that day never came.

8 years later we still don’t have support for XHTML - well, at least not from the most popular general web browser, Internet Explorer.  It’s not coming in the next version either.  We’re still using the HTML Compatibility Guidelines to write a limited, HTML-like form of XHTML and we’re still telling browsers that it’s HTML so we can get away with using it on the general web, where it would be exposed to Internet Explorer.

The problem is that we aren’t really using XHTML.  It’s just the best we can do before XHTML actually becomes supported.  But unlike back in 2000, it no longer looks like this is going to eventuate.

But there are several problems with using HTML-compatible XHTML:

  • The most recent official version of XHTML is XHTML 1.1, which is not backwards compatible with HTML and, despite being released in 2001, also not supported by Internet Explorer.  This means that if we are to use XHTML for general web use, we have to limit ourselves to an even older version, XHTML 1.0, which does have provisions for HTML compatibility.
  • In order to follow the compatibility guidelines we’re telling browsers that what we’re sending them is HTML, through the use of the text/html mime type.  This means that whatever we tell the validators (through the use of a DOCTYPE declaration, used by validators but not browsers), we’re still only sending browsers malformed HTML code.
  • We need to avoid the following features, which would otherwise be valid in XHTML.  If you have never heard of these features or thought they were invalid in XHTML, then you are probably currently writing HTML 4 compatible XHTML and serving it as HTML:
    • Namespaces other than the root namespace, which prevents combining XHTML with (for example) SVG or MathML.
    • XML-style CDATA sections like <![CDATA[text]]>
    • Self-closed elements without a space before the trailing slash like <img src="" alt=""/>
    • Self-closed elements which aren’t declared EMPTY in HTML 4 like <p/>
    • Closing tags for elements which are declared EMPTY in HTML 4 like <img src="" alt=""> </img>
    • Character entities or comment elements within script or style elements, which are declared as CDATA in HTML 4
    • XML processing instructions, such as those for using CSS or XSLT stylesheets
    • Javascript code, unless it is HTML-compatible (and if so it probably won’t be XHTML-compatible)

    In summary, we’re wrangling our XHTML so that it looks as close to HTML 4 as possible.  We may as well be using HTML 4.
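Two of the items in the list above are easy to check against a real XML parser.  The sketch below feeds a self-closed <p/> and a CDATA section to Python’s xml.etree.ElementTree and reports what comes out.  The sample markup and the function name inspect_xhtml_only are made up for the example, and the markup is left un-namespaced to keep the lookups short.

```python
import xml.etree.ElementTree as ET

# Markup that an XML parser accepts, but that the compatibility guidelines avoid.
XHTML_ONLY = "<div><p/><script><![CDATA[if (a < b) { run(); }]]></script></div>"


def inspect_xhtml_only(markup):
    """Parse as XML and report how the two XML-only constructs were read."""
    root = ET.fromstring(markup)
    paragraph = root.find("p")
    script = root.find("script")
    return {
        # <p/> is a complete element with no text and no children.
        "p_is_empty": paragraph.text is None and len(paragraph) == 0,
        # The CDATA wrapper is gone; only the raw script text remains.
        "script_text": script.text,
    }


print(inspect_xhtml_only(XHTML_ONLY))
```

To an XML parser neither construct is remarkable, which is exactly why they have to be avoided by hand when the same bytes are going to be read as HTML.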

There are some very good arguments against using HTML-compatible XHTML at all.

The W3C have moved on, and are now working on HTML5, a successor to HTML 4.01.  Rather than define a new, incompatible syntax, HTML5 defines a syntax based on the way existing browsers interpret HTML.  Using an XML syntax with HTML5 is optional (and doing so has become known colloquially as ‘XHTML5’, though that is not an official name).  To support HTML5, there is no requirement that browsers implement an entirely new syntax, which is one factor contributing to IE’s reluctance to add XHTML support.  The browsers only need to augment their existing HTML support, which many are already doing, despite HTML5 still being in working draft status.

XHTML can be useful for limited situations, such as for internal processing within an application or for use on embedded ‘XHTML browsers’ on hand-held devices.  The ease of parsing XHTML and the proliferation of XML parsers for common web frameworks makes it a decent idea to use XHTML internally for processing and storage, even if this is converted to HTML for output.
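As one possible shape for that workflow, the sketch below parses an XHTML-style fragment with an XML parser and writes it back out in HTML syntax using ElementTree’s method="html" serialiser.  It assumes un-namespaced, lowercase element names (a real XHTML document carries a namespace, which would need handling first), and to_html is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def to_html(xhtml_fragment):
    """Parse a well-formed XHTML-style fragment and serialise it as HTML."""
    root = ET.fromstring(xhtml_fragment)
    # method="html" omits end tags for empty elements such as <br>, and
    # writes explicit end tags for everything else, including an empty <p>.
    return ET.tostring(root, encoding="unicode", method="html")


print(to_html("<div><p/>Line one<br/>Line two</div>"))
# <div><p></p>Line one<br>Line two</div>
```

The stored form stays easy to process with any XML tool, and only the final output step has to know about HTML’s quirks.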

XHTML, however, can’t be used on the general web due to lack of support.  As shown by current developments on HTML5, the web’s future will involve HTML.  To prepare for that future, perhaps content authors ought to start coding in HTML 4.01 now.  At least, then, they won’t have to pretend that they are using a standard they aren’t.

I Work on the Web

March 18th, 2008

The uprising of Web 2.0 brought about a number of other equally revolutionary trends. You might know one of them by the name of social networking, better known as Facebook, MySpace, Flickr and the like. Now, not only can people live their lives, but they can share them with people they don’t know - for exhibitionist geeks, the possibilities linked to social networking applications were beyond their wildest dreams.

You may or may not be familiar with the “I Work on the Web” meme - a bandwagon built by “workers” of “the web” whereby they will post a picture of themselves on Flickr, join the aforementioned meme group and write self-important paragraphs of text describing the value of themselves and their work so that other “workers” of “the web” may learn more about their kind.

There’s a whole race of exhibitionist geeks traveling the underground world of social networking of whom you may not be aware - until you find one of them. Then, you only need to follow the trail of accounts interlinked across multiple superfluous applications before you find the hive where they all reside, the haven where the exhibitionist geek can justify his or her job, artistic-oriented hobbies, designer t-shirts, and most importantly, self-professed geekiness in order to achieve the non-existent but very important to publicly portray balance between “geek” and “artist”.

These glorified blogs and profiles are nothing more than a glamor portrait taken in your neighborhood photography studio. The reality is that there’s nothing special or noteworthy about working on the web.

This is me, I work on the web.
I’m a web developer. For 8+ hours a day, I sit in front of my computer. Every two minutes, my email program checks to see if I have any emails. Sometimes, I have to alter a bit of code that moves a button from the right-hand side of the screen over to the left-hand side, because apparently it’s easier for people to use the website if the button is on the left. I spend most of my day writing lines of code to make technologies that weren’t designed to work on all web browsers and platforms to do just that. I partake in the sending of emails about things that are really important to us like web standards, typography, microformats (and whether “microformats” should be written as one word or two) and programming trends and roll my eyes when people say “Ajax” when they mean “Javascript”. Every day at 10am and 3pm I have coffee with my workmates. The coffee machine is very important to us.

These are the markings covered under the layers of foundation that you don’t see on the “workers” of “the web” - no science, no creative freedom, and minimal intelligence, a monotonous routine of email checking, coffee-drinking and repetitive code-writing. Without the make-up, anybody can work on the web.

Why then, does the phenomenon of the exhibitionist geek exist? In essence, it’s about obtaining self justification by gaining the approval of others. The exhibitionist geek is insecure and needs to validate him or herself: not quite talented enough to be an artist and not quite smart enough to be a real geek with an important job, he or she creates the illusion of importance and grandeur through the persona of the exhibitionist geek.

The uprising of 2.0

January 24th, 2008

In the beginning there was the web, and webmasters who made second-rate websites, and it was good. People and companies loved the web, and bought lots of little webs for themselves and their companies, spending more money than they had and causing a great flood in the stock market, which claimed many of their lives and livelihoods.

Less than a fraction of a generation later, a new generation of webmasters turned their faith back to the web. These brave souls loved to play with new technology, and it did make them feel empowered, for it brought them loyal followers.

And these chosen few invented many new names to describe their new ways of doing things, to replace the old, pre-Armageddon names for the same things. Much confusion was had by all, and someone gave a single name to the uprising, to surround the names and bind all the names together. Below is an artist’s impression of the revolution, tastefully rendered by a follower of the movement who we were legally obliged to mention was named Luca Cremonini and who donated it to us under a fancy new license.

Web 2.0

And these words did become modern day mantras, though many did disagree about their meaning or their claims to originality. And the mantras were confusing, and they distracted from the reality of the web. Buzzwords replaced substance, and misinformation was spread by the followers. The self-appointed gurus of the movement wrote authoritatively on Web 2.0, describing it as a business revolution, and a knowledge-oriented environment wrapped up in a service-oriented architecture.

Suddenly, the mantra became more important than the end result, and all knowledge and wisdom that had existed before the Armageddon was eventually forgotten, re-invented under new names and methodologies. Concepts were sold, not content, and the followers scrambled to add these concepts to their products at any cost. Followers hell-bent on creating a folksonomy added tag clouds to their webs. Followers wanting to offer social networking built islands in Second Life. They did not understand why, or think to ask. They acted out of fear of being left behind.