The Unicode Blog

Tuesday, November 26, 2024

UTC #181 Highlights

Unicode Technical Committee (UTC) meeting #181 was held November 6 – 8 in Cupertino, hosted by Apple. Here are some highlights.

Starting the Unicode 17.0 cycle

UTC approved a plan and timeline for the Unicode 17.0 release. Here’s a summary of the timeline:

November 2024: UTC #181 approved new character repertoire
January 2025: UTC #182 will finalize content for the alpha release
February – March: alpha release for public review
April: UTC #183 will finalize content for the beta release
May – June: beta release for public review
July: UTC #184 will finalize 17.0 content
September: Unicode 17.0 release

Unicode 17.0 character and emoji repertoire

UTC #179 had previously approved 4,301 CJK ideographs for Unicode 17.0, including the addition of the CJK Unified Ideographs Extension J block. At this UTC meeting, a number of additional characters and symbols were approved for Unicode 17.0, including five new scripts:

Beria Erfe is a modern-use script used for the Zaghawa language in eastern Africa.
Chisoi is a modern-use script used for the Kurmali language in eastern India.
Sidetic is an historic script that was used in ancient Anatolia.
Tai Yo is the traditional script for the Tai Yo language, spoken in Vietnam and Laos.
Tolong Siki is a modern-use script used for the Kurukh language in eastern India.

A few changes were made to the approved new CJK ideographs repertoire: two ideographs from the CJK Extension J block were removed, while four ideographs were added. UTC also approved 297 other non-emoji character additions for already encoded scripts or symbol blocks.

UTC #181 also approved 8 new emoji characters for Unicode 17.0, along with a number of emoji ZWJ sequences; see document L2/24-226R for details.

Besides characters approved for Unicode 17.0, code points were provisionally assigned for 365 new characters that are candidates for encoding in a future Unicode version.

See the Pipeline page for all characters currently approved for Unicode 17.0, along with code points provisionally assigned for future encoding.

Algorithm specs

UTC approved some significant changes related to algorithm specifications for Unicode 17.0. Notably, in UAX #14, a new Line_Break property value was approved — Unambiguous_Hyphen — along with related changes to various rules of the line-breaking algorithm. Also, for UTS #10, Unicode Collation Algorithm, information about conformance tests had previously been published in a companion document, but this will be incorporated into UTS #10 for version 17.0. New public review issues will be posted soon to get feedback on the planned changes.

UTC also approved proposed drafts for two new algorithm specifications:

Proposed Draft UTS #58, Unicode Linkification: this proposed standard will specify a mechanism for detecting URLs that contain Unicode characters.
Proposed Draft UTR #59, East Asian Spacing: this proposed technical report will specify an algorithm that implements established East Asian typographic conventions for spacing between runs of text in different scripts.

A public review issue has been posted for review of PD UTS #58 (see PRI #509). A public review issue for PD UTR #59 will be posted soon.

Update on Text Terminal Working Group

At UTC #175, a temporary working group was formed to work on improving support for Unicode text in text terminal environments. After a slow start due to the original chairperson no longer being available, Fraser Gordon was chosen as a new chair for the group, and it has started to function with several interested participants. Fraser Gordon reported on the group’s activity and requested feedback from UTC on some technical questions the working group was facing, including whether it could be in scope to propose requirements for fonts or a text protocol for signaling between applications and terminals — UTC feedback was that either of these could be considered. See L2/24-264 for more details.

UTC coming to Eastern US

Earlier this year, UTC started discussing the possibility of trying new locations to make it easier for people in other regions or time zones to participate. Given interest from many parts of the world and travel constraints on regular participants, there is no perfect answer. However, we received a generous offer from the University of New Hampshire to host a meeting there, and so UTC has decided to switch the location of the July 2025 meeting from Redmond, WA to Manchester, New Hampshire (about an hour's drive north of Boston). Some preliminary logistical information will be provided soon to give plenty of time to consider travel plans.

For complete details on outcomes from UTC #181, see the draft minutes.

Feedback Requested on Proposed Draft UTS #58 Unicode Linkification

Feedback is requested on Proposed Draft UTS #58 Unicode Linkification, especially by technologists working with browsers and any programs that automatically apply links to URLs, such as email programs.

So what is Linkification?

With most email programs, when someone pastes in the plain text:

The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.

and sends it to someone else, the recipient sees the URL as a clickable link:

The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.

URLs are also “linkified” in many other applications, such as when pasting into a word processor (triggered by typing a space afterward, for example).

Problem

However, many products (many text messaging apps, video messaging chats, etc.) completely fail to recognize any non-ASCII characters past the domain name. And even among those that do recognize such non-ASCII characters, there are gratuitous differences in where they stop linkifying.
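
To make the failure mode concrete, here is a minimal Python sketch. It is not the mechanism proposed in UTS #58 and not the code of any particular product; both patterns and the helper `first_link` are hypothetical, chosen only to show how an ASCII-only matcher stops right after the domain while a looser matcher stops somewhere else.

```python
import re

# Hypothetical ASCII-only pattern, in the spirit of a simple linkifier.
# It accepts only ASCII characters that commonly appear in URLs.
ASCII_ONLY_URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

# Hypothetical looser pattern: everything up to the next whitespace.
WHITESPACE_DELIMITED_URL = re.compile(r"https?://\S+")


def first_link(pattern, text):
    """Return the first substring the pattern would linkify, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


TEXT = (
    "The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン "
    "contains information about Albert Einstein."
)

if __name__ == "__main__":
    # The ASCII-only pattern stops at the first katakana character.
    print(first_link(ASCII_ONLY_URL, TEXT))
    # The whitespace-delimited pattern keeps the full path.
    print(first_link(WHITESPACE_DELIMITED_URL, TEXT))
```

The looser pattern has its own problem: given "See https://example.com/a." it would include the sentence-final period in the link. Deciding where a link ends is exactly where implementations tend to diverge.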

The linkification process for URLs is already fragmented, with different implementations producing very different results, and that fragmentation is amplified by the addition of non-ASCII characters, which often behave very differently. That is, developers’ lack of familiarity with the behavior of non-ASCII characters has caused the different implementations of linkification to splinter. Yet non-ASCII characters are very important for readability. People do not want to see the above URL expressed in escaped ASCII:

The page https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%AB%E3%83%99%E3%83%AB%E3%83%88%E3%83%BB%E3%82%A2%E3%82%A4%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3 contains information about Albert Einstein.
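
For readers who want to see how the readable and escaped forms relate, the following is a minimal sketch using Python's standard `urllib.parse` module. It assumes plain UTF-8 percent-encoding of the path component only; the helper names `escape_path` and `unescape_path` are illustrative and are not taken from UTS #58, whose minimal-escaping rules may differ.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

READABLE = "https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン"


def escape_path(url):
    """Percent-encode the path as UTF-8 bytes, leaving '/' unescaped."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


def unescape_path(url):
    """Decode percent-escapes in the path back to readable characters."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=unquote(parts.path)))


if __name__ == "__main__":
    escaped = escape_path(READABLE)
    # Each katakana character becomes three escaped bytes; for example,
    # the middle dot (U+30FB) becomes %E3%83%BB.
    print(escaped)
    # Decoding restores the readable form.
    print(unescape_path(escaped))
```

Each of the 14 katakana characters in the path expands to nine ASCII characters, which is why the escaped form is so much harder to read.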

Proposed Solution

This proposed draft Unicode Technical Standard #58 Unicode Linkification specifies a standard mechanism for detecting URLs embedded in plain text — in particular, detecting URLs containing non-ASCII characters. It also defines the minimally necessary escaping of non-ASCII code points in the Path, Query, and Fragment portions of a URL that aligns with the mechanism for detecting URLs.

How to Provide Feedback

For information about how to discuss this Public Review Issue and how to supply formal feedback, please see the feedback and discussion instructions. The closing date is 2025 January 02 for this draft, but this is only the first step towards approval.

_________________________________________________

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Announcing ICU4X 2.0 Beta 1 (and UTW 2024 recording)

Across the globe, people are coming online with smaller and more varied devices including smartphones, smartwatches, and gadgets. An offshoot of the International Components for Unicode (ICU) Technical Committee, the ICU4X Committee, is responsible for enabling these next-generation devices to communicate with their users in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

The ICU4X-TC is happy to now announce the release of ICU4X 2.0 Beta 1. Learn more about it in our UTW 2024 presentation: 2024 ICU4X 2.0: Next Level i18n

This release includes a rewritten datetime component, type-safe preferences in all constructors, CLDR 46 and Unicode 16 data, new experimental duration and unit formatting components, an all-new WebAssembly demo, and improvements to many other components including locale tailoring in segmenter, algorithmic plural selection, and IXDTF parsing for zoned datetimes.

This release includes breaking changes. The ones you are most likely to encounter are:

All constructors take a preference bag by value instead of a `&DataLocale`.
Many functions had subtle renames, such as `try_from_bytes` becoming `try_from_utf8`.
The datetime component was rewritten, and call sites will need to be migrated.

Refer to the latest documentation for more information. Please also ask questions on GitHub:

https://github.com/unicode-org/icu4x/discussions/5872

This is a beta release, meaning that the team expects this to be mostly compatible with the upcoming 2.0 final release, but there is still room to make changes. Please send feedback by creating an issue or discussion on GitHub.

Tuesday, October 29, 2024

Script Encoding and Cultural Identity: Navigating Digital Exclusion

By Maroua Bezzaoui, SILICON Intern

During the summer of 2024, Unicode’s internship program included interns from Stanford University, Northeastern University, and Google’s Summer of Code. Several of the interns have shared their experiences. The second featured piece is from Maroua Bezzaoui at Stanford University.

Friday, October 25, 2024

ICU 76 Released

Unicode® ICU 76 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

ICU 76 updates to Unicode 16 (blog), including new characters and scripts, emoji, collation & IDNA changes, and corresponding APIs and implementations. It also updates to CLDR 46 (beta blog) locale data with new locales, significant updates to existing locales, and various additions and corrections. For example, the CLDR and Unicode default sort orders are now very nearly the same.

Most of the java.time (Temporal) types can now be formatted directly using the existing ICU4J date/time formatting classes.

There are some new APIs to make ICU easier to use with modern C++ and Java patterns. Most of the C/C++ APIs added for this purpose are implemented as C++ header-only APIs, and usable on top of binary stable C APIs, which is a first for ICU.

The Java and C++ technology preview implementations of the (also in tech preview) CLDR MessageFormat 2.0 specification have been updated to match recent changes.

ICU 76 and CLDR 46 are major releases, including a new version of Unicode and major locale data improvements.

For details, please see
https://unicode-org.github.io/icu/download/76.html.

Unicode CLDR 46 available

Unicode CLDR 46 is now available and has been integrated into version 76 of ICU.

The most significant data changes in this release were:

Updated to Unicode 16.0 (including major changes to collation)
Substantial additions and modifications of Emoji search keyword data
‘Upleveling’ the locale coverage (see below)

The most significant changes in the specification were:

Updates to Message Format in tech preview
Updates to conformance
New tech preview section on semantic skeletons

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

In version 46, the following levels were reached:

New / Upleveled Locales

±	New Level	Locales
📈	Modern	Nigerian Pidgin, Tigrinya
📈	Moderate	Akan, Baluchi (Latin), Kangri, Tajik, Tatar, Wolof
📈	Basic	Ewe, Ga, Kinyarwanda, Konkani (Latin), Northern Sotho, Oromo, Sichuan Yi, Southern Sotho, Tswana
📉	Basic*	Chuvash, Anii

We are currently planning for CLDR 47 to be a closed release with no data submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.

For more information

See the CLDR 46 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, migration issues.

Tuesday, October 22, 2024

Time and Trust

By Samuel Minev-Benzecry, SILICON Intern

During the summer of 2024, Unicode’s internship program included interns from Stanford University, Northeastern University, and Google’s Summer of Code. Several of the interns have shared their experiences. The first featured piece is from Samuel Minev-Benzecry at Stanford University.

Friday, September 27, 2024

Unicode Technology Workshop on October 22-23 – Program Updates!

By the UTW 2024 Program Committee

Join us for two days of community building around the Unicode technology that makes software work for billions of people. With a deeper emphasis on case studies, unconference, and workshop-style sessions, this event will enable participants to collaborate and learn from each other to tackle the latest challenges. Register Now for this in-person-only event hosted at Google in Sunnyvale, CA. The full program, including session details and bios, is available here:

UTW 2024 Event Website!

⭐ Highlights:

Build connections within the internationalization community
Learn best practices from peers and case studies
Network with the developers and users to help shape the future of Unicode technology
Deepen knowledge of how to solve tough problems in the i18n and l10n space and how to engineer products that work better for global users

📢 Confirmed Sessions!

“A User-Centric Approach to a Bidi Text Interface” with Adil Allawi
“Common Locale Data Repository - Using the Survey Tool to Expand Language Coverage” with Conrad Nied
"Talking Emoji 🔥😮‍💨🍄🪦💀🐷🐙😤" with Jennifer Daniel
“Design Deep-Dive” with Mark Davis
“How Would You Like Your Text Today?” with John Hudson
"Indic Script Policy & Planning in the Digital Age" with Karthik Malli
“Language and Direction Metadata on the Web” with Addison Phillips
“MessageFormat 2 Technical Preview: Where Are We Now?” with Addison Phillips
“Tracking Language Digitization in the UNESCO World Atlas” with Jeannette Stewart and Tex Texin
“Why Does Unicode Do That?” with Mark Davis
"Volunteers for Keyboards for Indigenous Language Communities" with Tex Texin
"Optimizing Glyphs for Real-Time Vector Rendering" with Eric Lengyel
"How To Not Run Towards The Bear: Directionality & Emoji" with Kamilé Demir and Ben Joeng (Yang)
"What is a Valid Person Name?" with Michael McKenna
“Case Study - Solving Inflection” with Nebojša Ćirić (Chair of the Unicode ICU-Language Inflection Working Group) and George Rhoten
“Bridging Languages in ICU4X: How Diplomat Brings i18n to the Web and Beyond” with Tyler Knowlton
“We Need a New Message Resource Format” with Eemeli Aro
"New in CLDR/ICU" with Mark Davis
"Could You Give Me an Example? Simplifying the CLDR Survey Tool" with Helena Aytenfisu and Emiyare Ikwut-Ukwa
“ICU4X 2.0: Next Level i18n” with Shane F Carr (Chair of the Unicode ICU4X Technical Committee)
"From Oral to Digital in One Generation - An Exploration of Amazonian Languages and Their Path to Digital Inclusion" with Samuel Minev-Benzecry
“Encoding Expectations: How Long Does It Really Take?” with Anushah Hossain and Ahad Bashir
“Date, Time, and Timezone for Netflix Live Events” with Shawn Xu and Chester Fung
“Behind the Curtains: Unicode Technical Groups” with Mark Davis (Unicode Co-founder and CTO)
“Ask Unicode Anything” with Toral Cowieson, Mark Davis, Cathy Wissink

Please note that sessions are continually being added for the two tracks.

👍 Expect workshops, seminars, free-form discussions, and lightning talks on:

i18n libraries
locale data frameworks
globalization tooling
input methods
text rendering
localization pipelines

❓Who should attend?:

Whether you’re an experienced GILT professional, an internationalization or Unicode enthusiast, just starting out, or a student, the UTW 2024 sessions will enrich your understanding of key issues!

❗Space is limited so be sure to secure your spot today!

Discounts are available for Unicode members and students. Registration fees include continental breakfast, lunch, refreshments, and Mix & Mingle at the end of the first day.

Register Now!

