Unicode® 11.0.0
2018 June 5 (Announcement)
Version 11.0.0 has been superseded by the latest version of the Unicode Standard.

This page summarizes the important changes for the Unicode Standard, Version 11.0.0. This version supersedes all previous versions of the Unicode Standard.

A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration

A. Summary

Unicode 11.0 adds 684 characters, for a total of 137,374 characters. These additions include 7 new scripts, for a total of 146 scripts, as well as 66 new emoji characters.

The new scripts and characters in Version 11.0 add support for lesser-used languages and unique writing requirements worldwide. Funds from the Adopt-a-Character program provided support for some of these additions. The new scripts and characters include:

Dogra, used to write historic Dogra in South Asia

Georgian Mtavruli capital letters, newly added to support modern casing practices

Gunjala Gondi, used to write the Adilabad dialect of the Gondi language in South Asia

Hanifi Rohingya, used to write the modern Rohingya language in Southeast Asia

Makasar, used to write historic Makasar in Indonesia

Medefaidrin, used for modern liturgical purposes in Africa

Old Sogdian, used to write historic Sogdian in the third to fifth centuries in Central Asia

Sogdian, used to write historic languages in the seventh to fourteenth centuries in Central Asia

Five urgently needed CJK unified ideographs: three for newly standardized names of chemical elements, and two for Japan's government administration Moji Joho Kiban Project that includes ideographs for personal and place names

Popular symbol additions:

66 emoji characters, including 4 new emoji components for hair color. For complete statistics regarding all emoji as of Unicode 11.0, see Emoji Counts. For more information about emoji additions for Unicode 11.0, including new emoji ZWJ sequences and emoji modifier sequences, see Emoji Recently Added, v11.0.

Copyleft symbol

Half stars for rating systems

Additional astrological symbols

Xiangqi Chinese chess symbols

Additional support for lesser-used languages and scholarly work was extended worldwide, including:

For the Mazahua language, a Mesoamerican language recognized by law in Mexico

For Mayan numerals used in printed materials in Central America

For Sanskrit manuscripts written in Bengali

For Gurmukhi manuscripts

For historic documents of the Buryats of the Barguzin Steppe

Version 11.0 improved the segmentation algorithms by simplifying the statements of emoji-related rules for grapheme cluster boundaries and for word boundaries.

Synchronization

Several other important Unicode specifications have been updated for Version 11.0. The following four Unicode Technical Standards are versioned in synchrony with the Unicode Standard, because their data files cover the same repertoire. All have been updated to Version 11.0:

UTS #10, Unicode Collation Algorithm — sorting Unicode text

UTS #39, Unicode Security Mechanisms — reducing Unicode spoofing

UTS #46, Unicode IDNA Compatibility Processing — compatible processing of non-ASCII URLs

UTS #51, Unicode Emoji — emoji-related data and behavior

Some of the changes in Version 11.0 and associated Unicode Technical Standards may require modifications to implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTS #51.

This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2017, fifth edition, plus Amendment 1 to the fifth edition, plus the following additions from Amendment 2 to the fifth edition:

46 Mtavruli Georgian capital letters

5 urgently needed CJK unified ideographs

66 emoji characters

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

B. Technical Overview

Version 11.0 of the Unicode Standard consists of:

The core specification

The code charts (delta and archival) for this version

The Unicode Standard Annexes

The Unicode Character Database (UCD)

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Core Specification

The core specification is available as a single PDF for viewing (13.5 MB). Links are also available in the navigation bar on the left of this page to access individual chapters and appendices of the core specification.

Code Charts

Several sets of code charts are available. They serve different purposes:

The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.

For Unicode 11.0.0 in particular two additional sets of code chart pages are provided:

A set of delta code charts showing the new blocks and any blocks in which characters were added for Unicode 11.0.0. The new characters are visually highlighted in the charts.

A set of archival code charts that represents the entire set of characters, names and representative glyphs at the time of publication of Unicode 11.0.0.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Unicode Standard Annexes

Links to the individual Unicode Standard Annexes are available in the navigation bar on the left of this page. The list of significant changes in the content of the Unicode Standard Annexes for Version 11.0 can be found in Section G below.

Unicode Character Database

Data files for Version 11.0 of the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap to the functions of the various subdirectories. Zipped versions of the UCD for bulk download are available, as well.

Version References

Version 11.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 11.0.0, (Mountain View, CA: The Unicode Consortium, 2018. ISBN 978-1-936213-19-1)
http://www.unicode.org/versions/Unicode11.0.0/

The terms “Version 11.0” or “Unicode 11.0” are abbreviations for the full version reference, Version 11.0.0.

The citation and permalink for the latest published version of the Unicode Standard is:

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

A complete specification of the contributory files for Unicode 11.0 is found on the page Components for 11.0.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.

Errata

Errata incorporated into Unicode 11.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 11.0, see the list of current Updates and Errata.

C. Stability Policy Update

There were no significant changes to the Stability Policy of the core specification between Unicode 10.0 and Unicode 11.0.

D. Textual Changes and Character Additions

Seven new scripts were added with accompanying new block descriptions:

Script             Number of Characters
Hanifi Rohingya    50
Old Sogdian        40
Sogdian            42
Dogra              60
Gunjala Gondi      63
Makasar            25
Medefaidrin        91

Changes in the Unicode Standard Annexes are listed in Section G.

Character Assignment Overview

684 characters have been added. Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see delta code charts.

E. Conformance Changes

There are no significant new conformance requirements in Unicode 11.0. However, the informative discussion of the use of U+FFFD to replace ill-formed sequences encountered during conversion (in Section 3.9, Unicode Encoding Forms) has been simplified and clarified, with more explicit examples shown.
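
As an illustration of the U+FFFD replacement practice mentioned above, the following sketch decodes a deliberately ill-formed UTF-8 byte string with Python's built-in decoder and the `errors="replace"` handler. The byte string is an illustrative input chosen for this example, not text taken from the standard, and the result reflects the behavior of CPython's decoder rather than a conformance requirement.

```python
# Illustrative input: 'a', a truncated 4-byte sequence (F1 80 80),
# a truncated 3-byte sequence (E1 80), a lone lead byte (C2), 'b',
# a stray continuation byte (80), then 'c'.
ill_formed = bytes([0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2, 0x62, 0x80, 0x63])

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER for each
# ill-formed subsequence and keeps the well-formed bytes around it.
decoded = ill_formed.decode("utf-8", errors="replace")

# Show the result as a list of code points for easy inspection.
print([f"U+{ord(ch):04X}" for ch in decoded])
```

Other decoders may group the ill-formed bytes differently, so the number of U+FFFD characters produced for the same input can vary between implementations.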

F. Changes in the Unicode Character Database

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 11.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.

G. Changes in the Unicode Standard Annexes

In Version 11.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex: Changes

UAX #9, Unicode Bidirectional Algorithm: Clarified the explanation of how paragraph separators are handled in X8.

UAX #11, East Asian Width: Added a note in Section 2 that the East_Asian_Width property was never intended to be used by modern terminal emulators, especially with Unicode's current repertoire.

UAX #14, Unicode Line Breaking Algorithm: Updated Rule LB8a to handle the same set of pictographic symbols in the line breaking of emoji ZWJ sequences as is used for text segmentation in UAX #29.

UAX #15, Unicode Normalization Forms: Section 5, Composition Exclusion, was rewritten for clarity and correctness.

UAX #24, Unicode Script Property: No significant changes in this version.

UAX #29, Unicode Text Segmentation: Added use of the Extended_Pictographic property from Emoji 11.0, to simplify the statement of emoji-related rules for grapheme cluster boundaries and word boundaries. Added a table of formal regex definitions to rationalize the definition of the classes used for grapheme cluster boundaries.

UAX #31, Unicode Identifier and Pattern Syntax: Refined the use of ZWJ in identifiers (adding some restrictions and relaxing others slightly), added the new scripts for Version 11.0, and broadened the definition of hashtag identifiers.

UAX #34, Unicode Named Character Sequences: No significant changes in this version.

UAX #38, Unicode Han Database (Unihan): Added five fields and improved regular expressions. Documented extension of Unihan properties to non-Unihan characters.

UAX #41, Common References for Unicode Standard Annexes: Updated all references for Unicode 11.0.

UAX #42, Unicode Character Database in XML: Added new code point attributes, values, and patterns.

UAX #44, Unicode Character Database: Added new property Equivalent_Unified_Ideograph to the property table. Added regular expressions for the validation of Bidi_Paired_Bracket and Equivalent_Unified_Ideograph to Table 21. Updated the discussion of emoji variation sequences. Provided further clarification about the range of numeric values allowed for the Age property.

UAX #45, U-Source Ideographs: Improved documentation for identifier prefixes.

UAX #50, Unicode Vertical Text Layout: Section 4, Tailorings, was removed, because its content was no longer useful.

H. Changes in Synchronized Unicode Technical Standards

There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

Unicode Technical Standard: Changes

UTS #10, Unicode Collation Algorithm: A clarification was added regarding search tailoring in scripts that use the visual order model. The DUCET table was updated to cover the Unicode 11.0 repertoire.

UTS #39, Unicode Security Mechanisms: Added further discussion about the use of joining controls, including how advanced implementations may use script-specific information to determine behavior. Also refined the suggestions about checking certain kinds of combining sequences in spoof detection.

UTS #46, Unicode IDNA Compatibility Processing: Changed the format of the test file to permit testing with different, arbitrary combinations of the input settings. The format of the input setting for Transitional_Processing was updated, and the table of IDNA Comparisons was updated to reflect Unicode 11.0 character additions.

UTS #51, Unicode Emoji: The version number of UTS #51 was advanced to 11.0, so that it now matches the Unicode version number associated with the latest emoji delta release. The Extended_Pictographic property for emoji was added and documented, to enable a more compact description of the behavior of emoji in segmentation algorithms. An emoji ZWJ sequence mechanism was added for hinting at glyph facing direction for some emoji. Documentation was added regarding the use of the four new hair emoji components. A discussion was added regarding the use of gender-neutral emoji.

M. Implications for Migration

Unicode 11.0 contains a significant number of changes that may affect implementations upgrading from earlier versions of the standard. The most important of these are listed and explained here, to help focus attention on the issues most likely to cause unexpected trouble during upgrades.

Script-related Changes

Version 11.0 adds seven new scripts, so implementations which process script data should be carefully checked. Some of these scripts have particular attributes which may cause issues for implementations.

The Hanifi Rohingya script is a new RTL script, with numbers written LTR, as in Arabic.

The tatweel (U+0640) has been extended for use in Hanifi Rohingya and Sogdian.

There are two new sets of vigesimal (base 20) numerals, one for the Medefaidrin script, and another for Mayan. The Mayan numerals are added for specialty use, as for page numbers, in advance of the encoding of the full Mayan script.

Indic Siyaq numerals have complex formatting requirements when combined to represent large numbers.
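
To make the vigesimal (base 20) point above concrete, the sketch below decomposes an integer into base-20 digit values and maps each digit to a Mayan numeral character. The function names are hypothetical, and the code assumes that the Mayan numerals for zero through nineteen occupy twenty consecutive code points starting at U+1D2E0, which is not stated on this page. It returns a list of characters and does not attempt any layout, since how multi-digit numbers are arranged is outside the scope of this sketch.

```python
from typing import List

# Assumption: MAYAN NUMERAL ZERO..NINETEEN are contiguous from U+1D2E0.
MAYAN_NUMERAL_ZERO = 0x1D2E0


def to_vigesimal_digits(n: int) -> List[int]:
    """Return the base-20 digit values of n, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, digit = divmod(n, 20)
        digits.append(digit)
        if n == 0:
            break
    return digits[::-1]


def to_mayan_numerals(n: int) -> List[str]:
    """Map each base-20 digit of n to one character (no layout applied)."""
    return [chr(MAYAN_NUMERAL_ZERO + d) for d in to_vigesimal_digits(n)]


# 2018 = 5 * 400 + 0 * 20 + 18
print(to_vigesimal_digits(2018))
```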

Casing Issues

Casing behavior for the Georgian script has changed significantly. There is a new set of Mtavruli capital letters (U+1C90..U+1CBA, U+1CBD..U+1CBF) in Unicode 11.0, with case mappings to the existing Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF). In prior versions of the Unicode Standard, Mkhedruli Georgian was considered a monocameral (non-casing) script, and the Mkhedruli Georgian letters were gc=Lo. Starting with Version 11.0, those Mkhedruli Georgian letters are now gc=Ll, and have uppercase mappings to Mtavruli Georgian capital letters. This change will have major implications for Georgian implementations, including changes for input methods, fonts, casing, and string matching. Existing implementations have treated Mtavruli headlines and other uses for textual emphasis as a text style, so there will also be significant issues for document conversion and upgrade.

Another complication for Georgian is that the primary orthography does not use titlecasing, and the Mkhedruli Georgian letters do not have titlecase mappings to Mtavruli letters. This is unique among bicameral systems in the Unicode Standard, so casing implementations should be prepared for this exception.
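
The case pairs described above can be sketched directly from the code point ranges given in this section. The following is a minimal illustration, not a replacement for a library's casing support: the function names are hypothetical, and the code relies only on the observation that both Mkhedruli subranges sit at the same fixed distance from their Mtavruli counterparts.

```python
# Ranges taken from the text above (inclusive).
MKHEDRULI_RANGES = ((0x10D0, 0x10FA), (0x10FD, 0x10FF))
MTAVRULI_RANGES = ((0x1C90, 0x1CBA), (0x1CBD, 0x1CBF))

# Both subranges are offset by the same amount: 0x1C90 - 0x10D0 == 0x0BC0.
OFFSET = 0x1C90 - 0x10D0


def _in_ranges(cp, ranges):
    return any(first <= cp <= last for first, last in ranges)


def georgian_upper(text: str) -> str:
    """Map cased Mkhedruli letters to Mtavruli; leave everything else alone."""
    return "".join(
        chr(ord(ch) + OFFSET) if _in_ranges(ord(ch), MKHEDRULI_RANGES) else ch
        for ch in text
    )


def georgian_lower(text: str) -> str:
    """Map Mtavruli capital letters back to Mkhedruli."""
    return "".join(
        chr(ord(ch) - OFFSET) if _in_ranges(ord(ch), MTAVRULI_RANGES) else ch
        for ch in text
    )


# Titlecasing is intentionally absent: per the text above, Mkhedruli letters
# have no titlecase mapping to Mtavruli, so a titlecase operation should
# leave them unchanged.
word = "\u10E5\u10D0\u10E0\u10D7\u10E3\u10DA\u10D8"  # sample Mkhedruli word
print(georgian_upper(word))
```

On runtimes whose character data is at Unicode 11.0 or later, the built-in uppercase and lowercase operations are expected to give the same results; the explicit table here is useful mainly for checking older environments during migration.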

Shaping Issues

Unicode 11.0 adds formal recognition of a number of previously encoded mathematical characters as forming mirroring pairs. As a result, there is now a further deviation between the mappings defined in BidiMirroring.txt and those defined in the OpenType mirroring list, which was frozen as of Unicode 5.1. This does not change bidirectional formatting: there is no change to the Bidi_Mirrored binary property value here, but only to the listing of which pairs of encoded characters have nominally mirroring glyphs.

Some property values have been added to the Indic_Syllabic_Category property. These new values may impact implementations which use the Indic_Syllabic_Category property to help define shaping behavior.

In prior versions of the UCD, cursive joining scripts which had any Joining_Group values assigned included distinct values for all characters that participate in cursive joining, including all of the Joining_Group singletons (classes containing only a single character). Starting with Unicode 11.0 and going forward, explicit Joining_Group values are assigned only to characters which do not constitute singleton classes. This new convention is applicable to the two newly encoded cursive joining scripts: Hanifi Rohingya and Sogdian. Implementations may need to take into account this discontinuity in how Joining_Group values are assigned to cursive joining scripts.

Segmentation-related Changes

Four Grapheme_Cluster_Break and Word_Break classes have become obsolete and are no longer used: E_Base, E_Modifier, Glue_After_Zwj, and E_Base_GAZ. Those values are still part of the enumeration of the property values, because stability constraints prevent removal of enumerated property values, even if obsolete; however, these are no longer assigned to any characters, and are no longer referred to explicitly by any rules in the algorithms.
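
One way to check for lingering uses of the obsolete values during migration is to scan an implementation's own property tables for them. The sketch below assumes a hypothetical table that maps code points to Grapheme_Cluster_Break or Word_Break value names; the sample entries are illustrative only and are not copied from any data file.

```python
# The four values named above as obsolete in Version 11.0.
OBSOLETE_BREAK_VALUES = frozenset(
    {"E_Base", "E_Modifier", "Glue_After_Zwj", "E_Base_GAZ"}
)


def find_obsolete_assignments(assignments):
    """Return the entries of a {code point: value name} table that still
    use one of the obsolete property values."""
    return {
        cp: value
        for cp, value in assignments.items()
        if value in OBSOLETE_BREAK_VALUES
    }


# Hypothetical legacy table built from pre-11.0 data.
legacy_table = {
    0x0301: "Extend",
    0x261D: "E_Base",
    0x1F3FB: "E_Modifier",
}

for cp, value in sorted(find_obsolete_assignments(legacy_table).items()):
    print(f"U+{cp:04X} still uses obsolete value {value}")
```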

The algorithms for GCB and WB now make use of the new Extended_Pictographic (ExtPict) property defined in UTS #51, Unicode Emoji, Version 11.0. That is a separate property relevant to emoji, rather than a particular class of the GCB or WB properties.

The classes GCB=Extend and WB=Extend now include emoji skin tone modifiers, to improve segmentation behavior for emoji. One of the implications of this is that GCB=Extend no longer matches Grapheme_Extend=Y. (WB=Extend already did not match Grapheme_Extend=Y.) SB=Extend is unaffected.
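
The effect of adding skin tone modifiers to GCB=Extend can be seen in a toy clustering loop that applies only the "do not break before Extend" idea. This is a simplified sketch, not the grapheme cluster algorithm of UAX #29. It assumes that the emoji modifiers are U+1F3FB..U+1F3FF, and it approximates Grapheme_Extend with the general categories Mn and Me, which is only a rough stand-in for the real property.

```python
import unicodedata

# Assumption: the emoji skin tone modifiers are U+1F3FB..U+1F3FF.
EMOJI_MODIFIERS = range(0x1F3FB, 0x1F3FF + 1)


def approx_grapheme_extend(ch: str) -> bool:
    """Rough approximation of Grapheme_Extend=Y (nonspacing/enclosing marks)."""
    return unicodedata.category(ch) in ("Mn", "Me")


def is_gcb_extend_v11(ch: str) -> bool:
    """GCB=Extend as described above: Grapheme_Extend plus emoji modifiers."""
    return approx_grapheme_extend(ch) or ord(ch) in EMOJI_MODIFIERS


def toy_clusters(text: str):
    """Attach every Extend character to the preceding cluster."""
    clusters = []
    for ch in text:
        if clusters and is_gcb_extend_v11(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Thumbs up + medium skin tone modifier, followed by '!'.
print(toy_clusters("\U0001F44D\U0001F3FD!"))
```

The modifier is attached by the same branch that handles combining marks, which is why no emoji-specific class is needed for this case.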

The WB property has a new property value WSegSpace. A new rule in the WB algorithm makes use of that new property value to prevent word breaks within runs of whitespace characters.

UAX #14, Unicode Line Breaking Algorithm has an adjustment to Rule LB8a, to simplify the specification of line break opportunities near ZWJ. Now there is simply no break opportunity following a ZWJ. This improves line breaking behavior for emoji sequences, in particular.
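
The simplified LB8a behavior described above, no break opportunity following a ZWJ, can be sketched as a filter over candidate break positions. The function name and the representation of break opportunities (indices into the string, each meaning "a break is allowed before this character") are assumptions made for this illustration; a real line breaker applies the rule as part of the full UAX #14 rule sequence.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def drop_breaks_after_zwj(text: str, break_positions):
    """Remove any candidate break that would fall directly after a ZWJ.

    break_positions holds indices i, each meaning a break opportunity
    immediately before text[i].
    """
    return [
        i for i in break_positions
        if not (i > 0 and text[i - 1] == ZWJ)
    ]


# MAN + ZWJ + WOMAN, a space, then "ok".
sample = "\U0001F468\u200D\U0001F469 ok"

# Candidate breaks before index 2 (after the ZWJ) and index 4 (after the space).
print(drop_breaks_after_zwj(sample, [2, 4]))
```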

CJK/Unihan Changes

There are five additional CJK unified ideographs, which push the end of range for assigned characters in the main CJK block to U+9FEF. The same issue applies for Tangut, which also had five new ideographs added at the end of the main Tangut block, pushing the end of that range to U+187F1. Implementations which use hard coded ranges for ideographs will need updates for those values.
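
Implementations with hard-coded ranges may find it easier to keep the end points in a per-version table than to scatter them through the code. The sketch below is one such arrangement. The 11.0 end points come from the paragraph above; the 10.0 end points are derived from it by subtracting the five added characters; the start points U+4E00 and U+17000 and all names are assumptions not stated on this page.

```python
# Assumed start points of the main CJK and Tangut ideograph ranges.
RANGE_STARTS = {"cjk_main": 0x4E00, "tangut_main": 0x17000}

# Last assigned ideograph per version. The 10.0 values are derived as
# (11.0 end point - 5), since five ideographs were appended to each range.
RANGE_ENDS = {
    "10.0": {"cjk_main": 0x9FEF - 5, "tangut_main": 0x187F1 - 5},
    "11.0": {"cjk_main": 0x9FEF, "tangut_main": 0x187F1},
}


def is_assigned_main_ideograph(cp: int, version: str = "11.0") -> bool:
    """True if cp lies in the main CJK or main Tangut range for a version."""
    ends = RANGE_ENDS[version]
    return any(
        RANGE_STARTS[name] <= cp <= ends[name]
        for name in RANGE_STARTS
    )


# U+9FEF is covered only when the 11.0 end points are used.
print(is_assigned_main_ideograph(0x9FEF, "10.0"))
print(is_assigned_main_ideograph(0x9FEF, "11.0"))
```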

Five new provisional Unihan properties have been added.

The kHangul property values underwent a major revision.

Standardized Variation Sequences

One new standardized variation sequence has been added, to represent a short diagonal stroke form of U+FF10 FULLWIDTH DIGIT ZERO. The short diagonal stroke form is included in the Adobe-Japan1-6 glyph set, which is used as the basis for numerous OpenType Japanese fonts.

New Data Files Added to the UCD

A new data file has been added to the UCD: EquivalentUnifiedIdeograph.txt. That data file contains the mapping values for the new property, Equivalent_Unified_Ideograph (EqUIdeo). That property, which provides equivalent ideograph mappings (where possible) for CJK radicals and CJK stroke characters, is intended to support tailorings of sorting and searching, which may need to include radicals and strokes in their scope, for completeness.
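
A reader for the new file might look like the following sketch. It assumes that EquivalentUnifiedIdeograph.txt follows the usual UCD layout of semicolon-separated fields, with `#` comments and `first..last` code point ranges; that layout is an assumption here and should be checked against the file itself. The sample rows are illustrative stand-ins, not lines copied from the data file.

```python
def parse_equivalent_unified_ideograph(lines):
    """Parse 'code point(s) ; ideograph' rows into a {source: target} dict."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        cp_field, value_field = (f.strip() for f in data.split(";")[:2])
        if ".." in cp_field:
            first, last = (int(part, 16) for part in cp_field.split(".."))
        else:
            first = last = int(cp_field, 16)
        target = int(value_field, 16)
        for cp in range(first, last + 1):
            mapping[cp] = target
    return mapping


# Illustrative rows in the assumed format (verify against the real file).
SAMPLE = """\
# radical or stroke ; equivalent unified ideograph
2E81       ; 5382
2E8C..2E8D ; 5C0F
2F00       ; 4E00  # comment text is ignored
"""

table = parse_equivalent_unified_ideograph(SAMPLE.splitlines())
print({f"U+{k:04X}": f"U+{v:04X}" for k, v in sorted(table.items())})
```

A sorting or searching tailoring could consult such a table to treat a radical or stroke character like its equivalent ideograph.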

Code Charts

There are numerous changes in the representative glyphs, some backed by explicit errata. There are also glyph changes in the text presentation of a number of emoji and emoticons. Some of those changes reflect an attempt to make the text presentation glyphs for emoji converge on common practice among vendors for the emoji presentation glyphs. Such glyph changes are highlighted in violet in the delta code charts for Version 11.0.

The use of characters beyond the range of Latin-1 is now allowed in annotations in the names list. (See NamesList.html for details.) Some other adaptations have been made in the use of fonts in the names list part of the code charts.