This document records all known errors in the Extensible Markup Language (XML) 1.0 Specification (W3C Recommendation 10 Feb 1998); this specification has been superseded by the Second Edition of the Extensible Markup Language (XML) 1.0 Specification (W3C Recommendation 6 Oct 2000). For updates see the latest version.
The errata are numbered, classified as Substantive, Editorial or Clarification and listed in reverse chronological order of their date of publication. Early errata (1999-02-17 and before) are neither classified nor dated.
Please email error reports to xml-editor@w3.org.
Add a well-formedness constraint applying to production [28] doctypedecl as follows:
Well-formedness constraint: External Subset
The external subset, if any, must match the production for extSubset.
Add a new production [28a] as follows:
DeclSep ::= PEReference | S
In productions [28] doctypedecl and [31] extSubsetDecl, replace "PEReference | S" with "DeclSep".
Add a well-formedness constraint applying to production [28a] DeclSep as follows:
Well-formedness constraint: PE Between Declarations
The replacement text of a parameter entity reference in a DeclSep must match the production extSubsetDecl.
In the first paragraph, replace the sentence "An external parameter entity is well-formed if it matches the production labeled extPE." with "All external parameter entities are well-formed by definition."
Remove production [79] extPE.
Rescind E62 and restore productions [6] and [8] to their First Edition state.
Change the beginning of the first sentence to read: "XML documents should begin with an XML declaration..."
Change the first sentence to read: "External parsed entities should each begin with a text declaration."
In the second paragraph after production [81], replace « "ISO-8859-9" » with « "ISO-8859-n" (where n is the part number) »
Replace the whole of Appendix F with the following:
The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information. We consider the first case first.
Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.
With a Byte Order Mark:
00 00 FE FF | UCS-4, big-endian machine (1234 order) |
FF FE 00 00 | UCS-4, little-endian machine (4321 order) |
00 00 FF FE | UCS-4, unusual octet order (2143) |
FE FF 00 00 | UCS-4, unusual octet order (3412) |
FE FF ## ## | UTF-16, big-endian |
FF FE ## ## | UTF-16, little-endian |
EF BB BF | UTF-8 |
Without a Byte Order Mark:
00 00 00 3C, 3C 00 00 00, 00 00 3C 00, 00 3C 00 00 | UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies. |
00 3C 00 3F | UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which) |
3C 00 3F 00 | UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which) |
3C 3F 78 6D | UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably |
4C 6F A7 94 | EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use) |
Other | UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind |
Note:
In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.
This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).
Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.
Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.
Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.
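The detection tables above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sniff_encoding_family` and the family labels it returns are assumptions made for this example, not part of the specification. It only identifies the family of encodings; the encoding declaration must still be read to pick the specific encoding.

```python
def sniff_encoding_family(head: bytes) -> str:
    """Classify the first octets of an entity by encoding family.

    Hypothetical helper: the labels returned are illustrative only.
    """
    # Order matters: the four-octet UCS-4 marks are tested before the
    # two-octet UTF-16 marks, so FF FE 00 00 is not mistaken for UTF-16.
    signatures = [
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
        (b"\x00\x00\xff\xfe", "UCS-4 order 2143 (BOM)"),
        (b"\xfe\xff\x00\x00", "UCS-4 order 3412 (BOM)"),
        (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
        (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
        (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
        # Without a Byte Order Mark: the octets of '<?xm' or '<?' as they
        # appear in each family of encodings.
        (b"\x00\x00\x00\x3c", "32-bit big-endian"),
        (b"\x3c\x00\x00\x00", "32-bit little-endian"),
        (b"\x00\x00\x3c\x00", "32-bit order 2143"),
        (b"\x00\x3c\x00\x00", "32-bit order 3412"),
        (b"\x00\x3c\x00\x3f", "16-bit big-endian"),
        (b"\x3c\x00\x3f\x00", "16-bit little-endian"),
        (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),
    ]
    for prefix, family in signatures:
        if head.startswith(prefix):
            return family
    # UTF-8 without an encoding declaration, or a mislabeled, corrupt,
    # fragmentary or wrapped data stream.
    return "other"
```

A caller would pass the first four octets of the entity (more are harmless) and then read the encoding declaration using any encoding from the reported family.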
The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 2376] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.
If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.
Replace the second paragraph with the following:
To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
The original description wrongly had the effect of requiring normalization of sequences such as that resulting from
<!ENTITY % e '<!ENTITY f "a
b">'>
and was thus not equivalent to normalization on input.
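The translation described in the replacement paragraph can be sketched as follows. This is a minimal illustration, assuming the external parsed entity has already been decoded to a Python string; the function name `normalize_line_breaks` is hypothetical.

```python
def normalize_line_breaks(entity_text: str) -> str:
    """Translate #xD #xA pairs and lone #xD characters to a single #xA.

    Intended to run on the decoded text of an external parsed entity
    before parsing, so character references are not affected.
    """
    # Handle the two-character sequence first so that it collapses to one
    # #xA rather than two.
    return entity_text.replace("\r\n", "\n").replace("\r", "\n")
```

Because the function runs before parsing, a #xD produced later by expanding a character reference is left alone, which is the distinction the rationale above draws.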
Amend the first sentence of the text added by E80 to read:
If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped;
Prepend the following sentence to the first sentence of the first paragraph:
As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous");
To the sentence "Please report errors in this document to xml-editor@w3.org.", append "archives are available."
In the paragraph beginning "This document specifies a syntax...", add a sentence as follows:
The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/XML/#trans.
Replace the Note with the following:
The "Namespaces in XML" Recommendation [NAMESPACES] assigns a meaning to names containing colon characters. Therefore authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.
Add a non-normative reference:
Replace the first sentence of the paragraph immediately preceding production [44] with the following:
Definition: An element with no content is said to be empty. The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag.
Add the following at the end of the sole paragraph:
This behavior does not apply to parameter entity references within entity values; these are described in 4.4.5 Included in Literal.
The text declaration in the external parsed entity is not considered part of its replacement text.
Similarly, the declaration of a general entity must precede any attribute-list declaration containing a default value with a direct or indirect reference to that general entity.
Add the following to section 3.4 at the end of the third paragraph (i.e., after "are ignored"):
The contents of an ignored conditional section are parsed by ignoring all characters after the "[" following the keyword, except conditional section starts "<![" and ends "]]>", until the matching conditional section end is found. Parameter entity references are not recognised in this process.
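The scanning rule above amounts to tracking nesting depth while looking only for "<![" and "]]>". The following is a sketch under that reading; the function name `end_of_ignored_section` and its calling convention are assumptions made for this example.

```python
def end_of_ignored_section(dtd_text: str, start: int) -> int:
    """Return the index just past the "]]>" that closes an ignored section.

    `start` is the index of the first character after the "[" that follows
    the IGNORE keyword. Everything else, including parameter entity
    references, is skipped without being recognised.
    """
    depth = 1
    i = start
    while i < len(dtd_text):
        if dtd_text.startswith("<![", i):
            depth += 1          # nested conditional section start
            i += 3
        elif dtd_text.startswith("]]>", i):
            depth -= 1          # conditional section end
            i += 3
            if depth == 0:
                return i
        else:
            i += 1              # any other character is ignored
    raise ValueError("unterminated conditional section")
```

For example, given `'<![IGNORE[ a <![INCLUDE[ %pe; ]]> b ]]><!ELEMENT x EMPTY>'` and `start=10`, the scan skips the nested section and the `%pe;` reference and stops just before `<!ELEMENT`.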
Add a validity constraint applying to productions [62] includeSect and [63] ignoreSect:
Validity Constraint: Proper Conditional Section/PE Nesting.
If any of the <![, [, or ]]> of a conditional section is contained in the replacement text for a parameter-entity reference, all of them must be contained in the same replacement text.
as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral, PubidLiteral or the contents of an ignored conditional section (see section 3.4: Conditional Sections).
... document type declarations, processing instructions, XML declarations, text declarations, and whitespace at the top-level of the document entity (that is, outside the document element and not inside any other markup).
To simplify the tasks of applications, processors must normalize line breaks in parsed entities to #xA by either:
At user option, processors may normalize such characters to some canonical form.
<
").
<
et al. are not magic.
Note: it is possible to construct a well-formed document containing a doctypedecl that neither points to an external subset nor contains an internal subset.
When declared, it must be given as an enumerated type whose only possible values are "default" and "preserve".
with:
When declared, it must be given as an enumerated type whose values are one or both of "default" and "preserve".
Add an example after the existing one (in the same table):
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'> |
If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the character being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is unnecessary but harmless). For example:
Delete the sentence after the box.
It is a fatal error if an XML entity is determined (via default, encoding declaration or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
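As a rough illustration of this rule, a processor can decode strictly and treat any decoding failure as fatal. The sketch below uses Python's built-in codecs; the names `FatalXMLError` and `decode_entity` are hypothetical, and an encoding name the runtime does not know is not handled here.

```python
class FatalXMLError(Exception):
    """Hypothetical exception type standing in for an XML fatal error."""


def decode_entity(octets: bytes, encoding: str = "utf-8") -> str:
    """Decode an entity, reporting illegal octet sequences as fatal."""
    try:
        # errors="strict" makes the codec reject octet sequences that are
        # not legal in the determined encoding instead of replacing them.
        return octets.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise FatalXMLError(
            f"octet sequence not legal in {encoding}: {exc}"
        ) from exc
```

Here an entity with no encoding declaration containing the single octet E9 (Latin-1 "é") is rejected, while the two-octet UTF-8 form C3 A9 is accepted.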
The system identifier "hello.dtd" gives the address (a URI reference) of a DTD for the document.
URI references require encoding and escaping of certain characters. The disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC2396], except for the number sign (#) and percent sign (%) characters and the square bracket characters re-allowed in [IETF RFC2732]. Disallowed characters must be escaped as follows:
- Each disallowed character is converted to UTF-8 [IETF RFC2279] as one or more bytes.
- Any octets corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
- The original character is replaced by the resulting character sequence.
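The three escaping steps can be sketched as follows. The set of disallowed ASCII characters is written out here on the assumption that the excluded characters of RFC 2396 section 2.4.3 are the control characters, space, and the characters `<`, `>`, `"`, `{`, `}`, `|`, `\`, `^` and the grave accent, once `#`, `%`, `[` and `]` are exempted as described above; the function name `escape_system_identifier` is hypothetical.

```python
# Assumed reading of RFC 2396 section 2.4.3, minus '#', '%', '[' and ']':
# control characters #x0-#x1F and #x7F, space, delimiters and "unwise"
# characters.
DISALLOWED_ASCII = (
    set(map(chr, range(0x00, 0x21))) | {"\x7f"} | set('<>"{}|\\^`')
)


def escape_system_identifier(uri_ref: str) -> str:
    """Escape disallowed characters in a URI reference as %HH sequences."""
    pieces = []
    for ch in uri_ref:
        if ord(ch) > 0x7F or ch in DISALLOWED_ASCII:
            # Convert the character to UTF-8, then escape every octet.
            pieces.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            pieces.append(ch)
    return "".join(pieces)
```

With this sketch, `"my file.dtd"` becomes `"my%20file.dtd"` and `"café.dtd"` becomes `"caf%C3%A9.dtd"`, while existing `%HH` escapes and fragment markers pass through unchanged.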
The terms 'UTF-8' and 'UTF-16' in this specification do not apply to character encodings labeled with any other labels, even if the encodings or the labels are very similar to UTF-8 or UTF-16.
The SystemLiteral is called the entity's system identifier. It is a URI reference, meant to be dereferenced to obtain input for the XML processor to construct the entity's replacement text. It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier.
Parameter entity references are recognised anywhere in the DTD (internal and external subsets and external parameter entities), except in literals, processing instructions, comments and the contents of ignored conditional sections (see Section 3.4: Conditional Sections). They are also recognised in entity value literals. The use of parameter entities in the internal subset is restricted as described below.
The values of the attribute are language identifiers as defined by [IETF RFC 1766], "Tags for the Identification of Languages" or its successor on the IETF Standards Track.
Replace productions [33] to [38] and all the following text, down to but excluding the sentence "For example" just before the examples, with the following:
Note: RFC 1766 tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166] or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].
Note: although the EntityValue production allows the definition of an entity consisting of a single explicit < in the literal (e.g. <!ENTITY mylt "<">), it is strongly advised to avoid this practice since any reference to that entity will cause a well-formedness error.
[43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
Note that if the unnormalized attribute value contains a character reference to a whitespace character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a whitespace character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a whitespace character; being recursively processed, the whitespace character is replaced with a space character (#x20) in the normalized value.
All attributes for which no declaration has been read should be treated by a non-validating parser as if declared CDATA.
Here are a few examples of attribute normalization. Given the declarations:
<!ENTITY d "&#xD;"> <!ENTITY a "&#xA;"> <!ENTITY da "&#xD;&#xA;"> |
the attribute specifications in the left column would be normalized to the character sequences of the middle column if the attribute a is declared NMTOKENS and to those of the right columns if a is declared CDATA.
Attribute specification | a is NMTOKENS | a is CDATA |
---|---|---|
a=" xyz" | x y z | #x20 #x20 x y z |
a="&d;&d;A&a;&a;B&da;" | A #x20 B | #x20 #x20 A #x20 #x20 B #x20 #x20 |
a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;" | #xD #xD A #xA #xA B #xD #xA | #xD #xD A #xA #xA B #xD #xA |
It is noteworthy that the last example is not valid (but still well-formed) if a is declared to be of type NMTOKENS.
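The behaviour shown in the table can be reproduced with a short sketch. It assumes the usual reading of the normalization algorithm (a character reference appends the referenced character, an entity reference is processed recursively, a literal whitespace character becomes #x20) and assumes that the entities d, a and da have the replacement texts #xD, #xA and #xD #xA. The name `normalize_attribute` and the `ENTITIES` table are illustrative, not part of the specification.

```python
import re

# Assumed replacement texts of the example entities (character references
# in an entity declaration are expanded when the declaration is read).
ENTITIES = {"d": "\r", "a": "\n", "da": "\r\n"}

# One token per match: hex char ref, decimal char ref, entity ref, or any
# single character.
_TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w.:-]*);|(.)",
    re.DOTALL,
)


def normalize_attribute(raw, is_cdata, entities=ENTITIES):
    """Normalize an attribute value after end-of-line normalization."""
    out = []

    def walk(text):
        for match in _TOKEN.finditer(text):
            hex_ref, dec_ref, name, ch = match.groups()
            if hex_ref is not None:
                out.append(chr(int(hex_ref, 16)))   # referenced character
            elif dec_ref is not None:
                out.append(chr(int(dec_ref)))
            elif name is not None:
                walk(entities[name])                # recursive processing
            elif ch in " \t\r\n":
                out.append(" ")                     # literal whitespace
            else:
                out.append(ch)

    walk(raw)
    value = "".join(out)
    if not is_cdata:
        # Discard leading and trailing #x20 and collapse runs of #x20.
        value = re.sub(r" +", " ", value.strip(" "))
    return value
```

Calling `normalize_attribute("&d;&d;A&a;&a;B&da;", is_cdata=False)` yields `"A B"`, while the character-reference form in the last row keeps its #xD and #xA characters under either declared type.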
The versions of Unicode and ISO 10646 cited in the references were current at the time this document was prepared. Since new characters may be added to these standards by amendments, XML processors must accept any character in the range specified for Char. XML processors may at user option check that the data characters in the document are legal characters in a particular version of Unicode or ISO 10646.
New characters may be added to the Unicode and ISO 10646 standards cited in the references by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char.
To take into account Petr Kuzel's comments and really finish E16, also amend the sentence to read: "Legal characters, defined by production Char below, are tab...".
Validity constraint: No Notation on Empty Element
For compatibility, an attribute of type NOTATION may not be declared on an element declared EMPTY.
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646-1:1993], Annex H of [ISO/IEC 10646-1:2000], section 2.4 of [Unicode 2.0] and section 2.7 of [Unicode 3.0] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).
...(as defined in [IETF RFC2396], updated by [IETF RFC2732])...
or in parameter entities. An external markup declaration is defined as a markup declaration occurring in the external subset or in a parameter entity (external or internal, internal parameter entities being included because non-validating processors are not required to read them).
Amend the first sentence of the paragraph following production [32] to read:
In a standalone document declaration, the value "yes" indicates that there are no external markup declarations which affect the information passed from the XML processor to the application.
[6] Names ::= Name (#x20 Name)*
"[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*"
Note that if the unnormalized attribute value contains a character reference to a whitespace character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a whitespace character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a whitespace character; being recursively processed, the whitespace character is replaced with a space character (#x20) in the normalized value.
There may be any number of Subcode segments; if the Langcode is an ISO639Code, and if the first subcode segment exists and consists of two letters, then it must be a country code from [ISO 3166], "Codes for the representation of names of countries."
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are of course not required to support all IANA-registered encodings).
In the column headed "Character", "Not recognized" hyperlinks to "#not recognized" instead of "#not-recognized" (missing dash). In the columns headed " Internal General" and "External Parsed General", "Forbidden" hyperlinks to "#not-recognized" instead of "#forbidden".
choice ::= '(' S? cp ( S? '|' S? cp )* S? ')'
"choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')'"
The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC2376] "XML Media Types" which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.
With a Byte Order Mark:
- 00 00 FE FF: UCS-4, big-endian machine (1234 order)
- FF FE 00 00: UCS-4, little-endian machine (4321 order)
- FE FF 00 ##: UTF-16, big-endian
- FF FE ## 00: UTF-16, little-endian
- EF BB BF: UTF-8
Without a Byte Order Mark:
- 00 00 00 3C: UCS-4, big-endian machine (1234 order)
- 3C 00 00 00: UCS-4, little-endian machine (4321 order)
- 00 00 3C 00: UCS-4, unusual octet order (2143)
- 00 3C 00 00: UCS-4, unusual octet order (3412)
- 00 3C ## ##, 00 25 ## ##, 00 20 ## ##, 00 09 ## ##, 00 0D ## ## or 00 0A ## ##: Big-endian UTF-16 or ISO-10646-UCS-2. Note that, absent an encoding declaration, these cases are strictly speaking in error.
- 3C 00 ## ##, 25 00 ## ##, 20 00 ## ##, 09 00 ## ##, 0D 00 ## ## or 0A 00 ## ##: Little-endian UTF-16 or ISO-10646-UCS-2. Note that, absent an encoding declaration, these cases are strictly speaking in error.
- 3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the ASCII characters, the encoding declaration itself may be read reliably
- 4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
- other: UTF-8 without an encoding declaration, or else the data stream is corrupt, fragmentary, or enclosed in a wrapper of some kind
Add the following to the second paragraph after the list (this also takes care of the previous erratum on UTF-7): "Note: Since external parsed entities in UTF-16 may begin with any character, this autodetection does not always work. Also, because of the overloaded usage it makes of ASCII-valued bytes, the UTF-7 encoding may fail to be reliably detected."
standalone='yes'", they must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations."
standalone='yes'"', there is no guarantee that making a document standalone will cause all XML processors to report the same results to the application.
--->'. The following example is not well-formed." and an example: "<!-- B+, B, or B--->"
Before the value of an attribute is passed to the application or checked for validity, but after the end-of-line normalization described in section 2.11 has been performed, the XML processor must normalize the attribute value as follows:
If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
"Validity Constraint: Unique Notation Name: only one notation declaration can declare a given Name."
"For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text should not be empty, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,)."
to
"For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text should contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,)."