Errata in REC-xml-19980210

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information. We consider the first case first.

F.1 Detection Without External Encoding Information

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml ', any conforming processor can detect, after two to four octets of input, which of the following cases applies. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value, except that two consecutive ##s cannot both be 00.

With a Byte Order Mark:

`00 00 FE FF`	UCS-4, big-endian machine (1234 order)
`FF FE 00 00`	UCS-4, little-endian machine (4321 order)
`00 00 FF FE`	UCS-4, unusual octet order (2143)
`FE FF 00 00`	UCS-4, unusual octet order (3412)
`FE FF ## ##`	UTF-16, big-endian
`FF FE ## ##`	UTF-16, little-endian
`EF BB BF`	UTF-8

Without a Byte Order Mark:

`00 00 00 3C`	UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, big-endian (1234 order). For this and the next three signatures, the encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies.
`3C 00 00 00`	Same family, little-endian (4321 order)
`00 00 3C 00`	Same family, unusual octet order (2143)
`00 3C 00 00`	Same family, unusual octet order (3412)
`00 3C 00 3F`	UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
`3C 00 3F 00`	UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
`3C 3F 78 6D`	UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
`4C 6F A7 94`	EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
Other	UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind
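The two tables above can be read as a prefix lookup. The following is a minimal sketch in Python, not part of the specification; the function name `detect_family` and the family labels are illustrative assumptions. It checks the four-octet UCS-4 signatures before the two-octet UTF-16 Byte Order Marks, because `FF FE 00 00` and `FE FF 00 00` would otherwise be mistaken for UTF-16.

```python
from typing import Tuple

# Signatures with a Byte Order Mark. The 4-octet UCS-4 entries come first so
# that they win over the 2-octet UTF-16 BOMs they begin with.
_BOM_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (1234)"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (4321)"),
    (b"\x00\x00\xff\xfe", "UCS-4 unusual order (2143)"),
    (b"\xfe\xff\x00\x00", "UCS-4 unusual order (3412)"),
    (b"\xfe\xff", "UTF-16 big-endian"),
    (b"\xff\xfe", "UTF-16 little-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
]

# Signatures without a Byte Order Mark: the first octets of '<?xml' (or of
# '<' alone for 32-bit code units) in each family of encodings.
_PLAIN_SIGNATURES = [
    (b"\x00\x00\x00\x3c", "32-bit code units, big-endian (1234)"),
    (b"\x3c\x00\x00\x00", "32-bit code units, little-endian (4321)"),
    (b"\x00\x00\x3c\x00", "32-bit code units, unusual order (2143)"),
    (b"\x00\x3c\x00\x00", "32-bit code units, unusual order (3412)"),
    (b"\x00\x3c\x00\x3f", "16-bit code units, big-endian"),
    (b"\x3c\x00\x3f\x00", "16-bit code units, little-endian"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def detect_family(head: bytes) -> Tuple[str, int]:
    """Return (family label, BOM length in octets) for the start of an entity.

    The labels are illustrative only. For every family without a BOM, the
    encoding declaration still has to be read to find the exact encoding.
    """
    for signature, family in _BOM_SIGNATURES:
        if head.startswith(signature):
            return family, len(signature)
    for signature, family in _PLAIN_SIGNATURES:
        if head.startswith(signature):
            return family, 0
    # The "Other" row: UTF-8 without a declaration, or a mislabeled,
    # corrupt, fragmentary, or wrapped data stream.
    return "other", 0
```

For example, `detect_family(b"<?xml version='1.0'?>")` yields the ASCII-compatible family with no BOM, while an entity starting with `FF FE 3C 00` is reported as little-endian UTF-16 with a two-octet BOM.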

Note:

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).

Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since, in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7, which overload ASCII-valued bytes, may fail to be reliably detected.
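As an illustration of reading the declaration once the family is known, the sketch below decodes the start of the entity with one representative codec per family and extracts the encoding name. It is a hedged example, not a conforming parser: the family keys, the choice of probe codecs (for instance `cp037` standing in for every EBCDIC code page), the 512-octet limit, and the function name `declared_encoding` are all assumptions made here, and any Byte Order Mark is assumed to have been stripped already.

```python
import re
from typing import Optional

# Hypothetical family keys mapped to a Python codec that decodes the
# ASCII-repertoire characters of that family correctly. The probe codec is
# only used to read the declaration, not the rest of the entity.
_PROBE_CODEC = {
    "ascii-compatible": "latin-1",
    "utf-16-be": "utf-16-be",
    "utf-16-le": "utf-16-le",
    "ucs-4-be": "utf-32-be",
    "ucs-4-le": "utf-32-le",
    "ebcdic": "cp037",
}

# Matches encoding="NAME" or encoding='NAME' inside a leading '<?xml ...?>',
# where NAME starts with a Latin letter followed by letters, digits,
# '.', '_' or '-'.
_ENCODING_RE = re.compile(
    r"""^<\?xml[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._\-]*)\1"""
)


def declared_encoding(data: bytes, family: str, limit: int = 512) -> Optional[str]:
    """Return the encoding name from the declaration, or None if absent.

    `data` is the entity without its BOM; `family` is a key of _PROBE_CODEC.
    """
    codec = _PROBE_CODEC[family]
    # Only the first `limit` octets are decoded; undecodable content after
    # the declaration is replaced rather than raising an error.
    text = data[:limit].decode(codec, errors="replace")
    match = _ENCODING_RE.match(text)
    return match.group(2) if match else None
```

A caller would then compare the returned name with the detected family and pick the actual decoder for the rest of the entity, treating a mismatch as an error.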

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.

Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.

Attribute specification	a is NMTOKENS	a is CDATA
a=" xyz"	x y z	#x20 #x20 x y z
a="&d;&d;A&a;&a;B&da;"	A #x20 B	#x20 #x20 A #x20 #x20 B #x20 #x20
a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;"	#xD #xD A #xA #xA B #xD #xA	#xD #xD A #xA #xA B #xD #xA

XML 1.0 Specification Errata

Abstract

Known Errors

Errata as of 2000-09-27.

E109 Substantive

Obsoletes E13

E108 Substantive

Obsoletes E62

E107 Clarification

E106 Editorial

E105 Clarification

Obsoletes E44

F Autodetection of Character Encodings (Non-Normative)

F.1 Detection Without External Encoding Information

F.2 Priorities in the Presence of External Encoding Information

E104 Clarification

Obsoletes E86

E103 Clarification

Further clarifies E80

E102 Clarification

E101 Editorial

E100 Editorial

There is no E99

Errata as of 2000-08-10.

E98 Clarification Source: XML Core WG list [members only]

Errata as of 2000-07-27.

E97 Clarification Source: XML Core WG list [members only]

Obsoletes E40

E96 Editorial Source: XML Core WG list [members only]

Errata as of 2000-07-13.

E95 Editorial Source: xml-editor list

E94 Clarification Source: xml-editor list

E93 Substantive Source: xml-editor list

E92 Clarification Source: xml-editor list

There is no E91

E90 Substantive Source: XML Core WG list [members only]

Obsoletes E41 and part of E63

E89 Substantive Source: XML Core WG list [members only]

Obsoletes part of E39

E88 Substantive Source: I18N issues with the XML Specification [members only]

E87 Clarification Source: XML Core WG list [members only]

E86 Substantive Source: XML Core WG list [members only]

Obsoleted by E104

E85 Substantive Source: I18N issues with the XML Specification [members only]

E84 Substantive Source: I18N issues with the XML Specification [members only]

E83 Substantive Source: xml-editor list

E82 Substantive Source: xml-editor list

E81 Substantive Source: xml-editor list

E80 Clarification Source: XML Core WG list [members only]

Further clarified by E103

E79 Clarification Source: I18N issues with the XML Specification [members only]

E78 Substantive Source: I18N issues with the XML Specification [members only]

Obsoletes E49

E77 Clarification Source: XML Core WG list [members only]

E76 Clarification Source: XML Core WG list [members only]

Obsoletes E26

E75 Clarification Source: XML Core WG list [members only]

E74 Editorial Source: Martin Dürst

E73 Substantive Source:

Obsoletes E31, E60, and part of E38

E72 Clarification Source:

E71 Editorial Source: xml-editor list

E70 Substantive Source: XML Core WG list [members only]

Obsoletes E24 and E61

E69 Substantive Source: XML Core WG list [members only]

Obsoletes E16

E68 Substantive Source: XML Core WG list [members only]

Errata as of 2000-04-19.

E67 Editorial Source: I18N issues with the XML Specification [members only]

Obsoletes E37

E66 Editorial Source: I18N issues with the XML Specification [members only]

Errata as of 2000-04-15.

E65 Substantive Source: XML Core WG list [members only]

E64 Clarification Source: XML Core WG list [members only]

E63 Clarification Source: XML Core WG list [members only]

Partially obsoleted by E90

Errata as of 2000-04-09.

E62 Substantive Source: XML Core WG list [members only]

Obsoleted by E108

Errata as of 2000-02-17.