Named entity references and XML well-formedness

Discussion:

Aryeh Gregor

2010-04-26 00:46:14 UTC

In XML, named entity references like &nbsp; and &bull; (with the
special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as
well-formedness errors across the board by conformant XML processors.
(Yes, this means that *any* XML document that uses *any* named entity
reference except the special five is not well-formed, if you ask these
XML processors.) Alternatively, if a DTD is provided, conformant XML
processors can retrieve the DTD, parse it, and treat the reference as
a well-formedness error if it doesn't occur in the DTD, otherwise
parse it as you'd expect. (Yes, processors can really pick whichever
behavior they want, as far as I understand it. As we all know, the
great thing about standards is how many there are to choose from.)
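
To make the first behavior concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree (an expat-based parser chosen only for illustration; it is not one of the UAs discussed in this thread). With no DTD present, it rejects a named reference outside the special five and accepts the numeric equivalent. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if ElementTree's parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        # Undefined entities are reported as parse errors by this parser.
        return False


named = "<p>a&nbsp;b</p>"        # named reference, no DTD declares it
numeric = "<p>a&#160;b</p>"      # numeric reference to the same character
predefined = "<p>a &amp; b</p>"  # one of the five predefined references

print(is_well_formed(named), is_well_formed(numeric), is_well_formed(predefined))
```

Other processors may behave differently when a DTD is supplied, which is exactly the latitude described above.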

In practice, as far as I can tell, XML UAs that our users use do the
latter, retrieving the DTD. (Otherwise they'd instantly break, and
our users wouldn't use them!) Thus we get away with using &nbsp; and
such, and still work in these UAs. But this means we have to provide
a doctype with a DTD, which means not just <!DOCTYPE html>. This is
the default behavior on trunk -- we output an XHTML Strict DTD when
the document is actually HTML5. This has a few disadvantages, in
addition to just being odd:

1) Validators treat the content as XHTML Strict, not HTML5, so it
fails validation unless you specifically ask for HTML5 validation.
I've already seen a couple of complaints about this, and we haven't
even released yet. Lots of people care about validation.

2) XML processors are still within their rights to reject the page,
declining to process the DTD and treating the page as non-well-formed.

3) For XML processors that do process the DTD, we force them to do a
network load as soon as they start parsing the page. Presumably this
slows down parsing (dunno how much in practice), and it also hurts the
W3C's poor servers:
<http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic>

The alternative is to simply not use any named character references --
replace them all by numeric ones. E.g., use &#160; instead of &nbsp;,
and &#8226; instead of &bull;. Then we can use <!DOCTYPE html> by default
and avoid these problems. In fact, we already do this for anything
that passes through the parser, as far as I can tell -- we convert it
to UTF-8.
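
As a rough sketch of what such a replacement could look like (this is not MediaWiki's code, which is PHP; the function name to_numeric_refs and the regular expression are assumptions made for illustration), the following uses Python's html.entities.name2codepoint table to rewrite named references as decimal numeric ones while leaving the five XML-predefined references alone.

```python
import re
from html.entities import name2codepoint

# The five references every XML processor must understand without a DTD.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

# Simplified pattern for a named reference such as &nbsp; (an assumption:
# it does not try to skip comments or CDATA sections).
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(markup: str) -> str:
    """Rewrite known HTML named references as decimal numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names as-is
        return "&#%d;" % name2codepoint[name]

    return NAMED_REF.sub(replace, markup)


print(to_numeric_refs("a&nbsp;b &bull; c &amp; d"))
```

Unknown names are deliberately left untouched here so that they stay visible rather than being silently dropped.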

The problem is that if we do this and then miss a few entities
somewhere in the source code, some pages will mysteriously become
non-well-formed and tools will break. Plus, of course, you have the
usual risks of breakage from mass changes. Overall, though, I'd
prefer that we do this, because the alternative is that I'd have to
pester the standards people and validator people for a means to let us
validate properly with an XHTML Strict doctype.
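
One way to reduce the risk of missed entities would be a check that scans generated output for leftover named references. The sketch below is only an illustration under that assumption; the function name stray_named_refs is hypothetical, and the pattern is deliberately simple (it does not exclude comments, CDATA sections, or script content).

```python
import re
from typing import List

XML_PREDEFINED = frozenset({"lt", "gt", "amp", "quot", "apos"})
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def stray_named_refs(markup: str) -> List[str]:
    """Return the sorted, de-duplicated names of non-predefined references."""
    found = {match.group(1) for match in NAMED_REF.finditer(markup)}
    return sorted(found - XML_PREDEFINED)


page = "<p>&nbsp;one &amp; two &bull; three&nbsp;</p>"
print(stray_named_refs(page))
```

An empty result would suggest the page can be parsed as XML under <!DOCTYPE html> without any DTD lookup, at least as far as entity references go.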

Are there any objections to me removing all named entity references
from MediaWiki output?

Dmitriy Sintsov

2010-04-26 08:02:24 UTC

Post by Aryeh Gregor
In XML, named entity references like &nbsp; and &bull; (with the
special exceptions of &lt; &gt; &amp; &quot; &apos;) can be treated as
well-formedness errors across the board by conformant XML processors.
(Yes, this means that *any* XML document that uses *any* named entity
reference except the special five is not well-formed, if you ask these
XML processors.) Alternatively, if a DTD is provided, conformant XML
processors can retrieve the DTD, parse it, and treat the reference as
a well-formedness error if it doesn't occur in the DTD, otherwise
parse it as you'd expect. (Yes, processors can really pick whichever
behavior they want, as far as I understand it. As we all know, the
great thing about standards is how many there are to choose from.)

Wouldn't it be enough just to define an entity?
http://www.criticism.com/dita/dtd2.html#section-ENTITIES
I used such definition for nbsp once in XSL sheet. Don't know how well
it works alone in XML.
Dmitriy
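
For reference, an inline definition of this kind can be sketched as an internal DTD subset in the doctype. The example below is an assumption-laden illustration: the document string is made up, and Python's expat-based xml.etree.ElementTree is just one parser that happens to expand entities declared this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical page that declares nbsp itself instead of pointing at a DTD.
doc = (
    "<!DOCTYPE html [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    "]>\n"
    "<html><body><p>a&nbsp;b</p></body></html>"
)

root = ET.fromstring(doc)
text = root.find("body/p").text

# The named reference was expanded to U+00A0 (no-break space).
print(repr(text))
```

No network fetch is needed in this form, but every entity used on the page has to be declared in the page itself.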

Aryeh Gregor

2010-04-26 16:05:23 UTC

Post by Dmitriy Sintsov
Wouldn't it be enough just to define an entity?
http://www.criticism.com/dita/dtd2.html#section-ENTITIES
I used such definition for nbsp once in XSL sheet. Don't know how well
it works alone in XML.

I guess that would be possible, yes, but HTML defines an awful lot of
entities, and adding them all inline to every page doesn't sound like
a great idea to me.

Platonides

2010-04-26 21:04:13 UTC

Post by Aryeh Gregor

I guess that would be possible, yes, but HTML defines an awful lot of
entities, and adding them all inline to every page doesn't sound like
a great idea to me.

I suppose that you could link to a local copy of the DTD. That would
keep XML processors happy but would probably break more parsing, since HTML doctypes
are more or less magic words for many programs dealing with it
(beginning with browsers, but some validators also do so).
I would prefer not having to deal with the less developer-friendly
numeric entities in the HTML.

If we are serving HTML5 (not XHTML), why is XML well-formedness
important? I thought that HTML5 means giving up on it.
An HTML5 parser must implement the "HTML entities", so they shouldn't
need a DTD.
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#character-references
http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

Aryeh Gregor

2010-04-26 22:30:18 UTC

Post by Platonides
I suppose that you could link to a local copy of the DTD. That would
keep XML processors happy but would probably break more parsing, since HTML doctypes
are more or less magic words for many programs dealing with it
(beginning with browsers, but some validators also do so).
I would prefer not having to deal with the less developer friendly
numeric entities in the html.

You'd think this would be annoying, but in fact, in article text we've
always converted entities to UTF-8, and I've never actually been
inconvenienced by it. Or even noticed it, without actually testing.
I actually don't think • is less developer-friendly than &bull;, say.
The only common one where the UTF-8 form would be annoying is &nbsp;,
and that's just one code point to remember, &#160;. Not to mention
that 90% of &nbsp; could be replaced by a normal space with no actual
change.
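
To illustrate what converting entities to UTF-8 means for a piece of text content, here is a small sketch using Python's html module (again only an illustration, not how MediaWiki's parser does it). It decodes all references, then re-escapes only the markup-significant characters, so the result carries the literal characters and needs no DTD.

```python
import html

source = "one&nbsp;two &bull; A &amp; B"

# Decode every character reference into the character it stands for.
decoded = html.unescape(source)

# Re-escape only &, < and > so the text is safe to embed in markup again.
utf8_form = html.escape(decoded, quote=False)

print(repr(utf8_form))
```

The no-break space ends up as a literal U+00A0, which is the one case above where the UTF-8 form is hard to spot by eye.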

Post by Platonides
If we are serving HTML5 (not XHTML), why is XML well-formedness
important? I thought that HTML5 means giving up on it.

"""
But HTML5 is tag soup!
HTML5 doesn't require XML well-formedness – e.g., you can omit
attribute quote marks – but it does permit it. MediaWiki currently
still outputs well-formed XML by default. This means that by default,
you can still (modulo bugs) parse MediaWiki pages using XML libraries,
transform them via XSLT, etc. MediaWiki administrators who want to
reduce the size of output HTML can disable $wgWellFormedXml. When
HTML5 has been around for a while and HTML5 parsing libraries are as
prevalent as XML parsing libraries, this benefit might not be so
compelling anymore.
"""
<http://www.mediawiki.org/wiki/HTML5#FAQ_about_MediaWiki_use_of_HTML5>

In practice, tons of bots still do screen-scraping using XML
libraries, and we get a lot of complaints very quickly if we start
serving many non-well-formed pages. They should use the API instead,
of course -- which is why I'm not *too* worried about the occasional
entity creeping through and malforming a page, if we do use just
<!DOCTYPE html>. Screen-scrapers should die anyway. :)