Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard

Addison-Wesley Professional, 2003 - Computers - 853 pages

Unicode is a critical enabling technology for developers who want to internationalize applications for global environments. Until now, however, developers have had to turn to standards documents for crucial information on using Unicode. In Unicode Demystified, one of IBM's leading software internationalization experts covers every key aspect of Unicode development, offering practical examples and detailed guidance for integrating Unicode 3.0 into virtually any application or environment. Writing from a developer's point of view, Rich Gillam presents a systematic introduction to Unicode's goals, evolution, and key elements. Gillam illuminates the Unicode standards documents with insightful discussions of character properties, the Unicode character database, storage formats, character sequences, Unicode normalization, character encoding conversion, and more. He presents practical techniques for text processing, locating text boundaries, searching, sorting, rendering text, accepting user input, and other key development tasks. Along the way, he offers specific guidance on integrating Unicode with other technologies, including Java, JavaScript, XML, and the Web. This book is for every developer building internationalized applications, internationalizing existing applications, or interfacing with systems that already use Unicode.

 

What people are saying

User Review

Buy/borrow this and read the standard online.

Contents

An Architectural Overview of the Unicode Standard
Language, Computers, and Unicode
WHAT UNICODE IS
WHAT UNICODE ISN'T
THE CHALLENGE OF REPRESENTING TEXT IN COMPUTERS
WHAT THIS BOOK DOES
HOW THIS BOOK IS ORGANIZED
Unicode in Essence
Unicode in Action
A Brief History of Character Encoding
The Telegraph and Morse Code
The Teletypewriter and Baudot Code
Other Teletype and Telegraphy Codes
FIELDATA and ASCII
Hollerith and EBCDIC
SINGLE-BYTE ENCODING SYSTEMS
Eight-Bit Encoding Schemes and the ISO 2022 Model
ISO 8859
Other 8-Bit Encoding Schemes
CHARACTER ENCODING TERMINOLOGY
MULTIPLE-BYTE ENCODING SYSTEMS
Character Encoding Schemes for East Asian Coded Character Sets
Other East Asian Encoding Systems
ISO 10646 AND UNICODE
HOW THE UNICODE STANDARD IS MAINTAINED
Not Just a Pile of Code Charts
THE UNICODE CHARACTER-GLYPH MODEL
CHARACTER POSITIONING
THE PRINCIPLE OF UNIFICATION
Alternate-Glyph Selection
MULTIPLE REPRESENTATIONS
FLAVORS OF UNICODE
CHARACTER SEMANTICS
UNICODE VERSIONS AND UNICODE TECHNICAL REPORTS
Unicode Standard Annexes
Unicode Technical Standards
Draft and Proposed Draft Technical Reports
Unicode Versions
Unicode Stability Policies
ARRANGEMENT OF THE ENCODING SPACE
The Basic Multilingual Plane
The Supplementary Planes
Noncharacter Code Point Values
CONFORMING TO THE STANDARD
Producing Text as Output
Interpreting Text from the Outside World
Passing Text Through
Comparing Character Strings
Sequences and Unicode Normalization
HOW UNICODE NONSPACING MARKS WORK
Dealing Properly with Combining Character Sequences
CANONICAL DECOMPOSITIONS
CANONICAL ACCENT ORDERING
DOUBLE DIACRITICS
COMPATIBILITY DECOMPOSITIONS
SINGLETON DECOMPOSITIONS
HANGUL
UNICODE NORMALIZATION FORMS
GRAPHEME CLUSTERS
and the Unicode Character Database
WHERE TO GET THE UNICODE CHARACTER DATABASE
THE UNIDATA DIRECTORY
UNICODEDATA.TXT
PROPLIST.TXT
GENERAL CHARACTER PROPERTIES
Standard Character Names
Algorithmically Derived Names
Control-Character Names
ISO 10646 Comments
GENERAL CATEGORY
Marks
Numbers
Symbols
Separators
OTHER CATEGORIES
PROPERTIES OF LETTERS
SpecialCasing.txt
CaseFolding.txt
PROPERTIES OF DIGITS, NUMERALS, AND MATHEMATICAL SYMBOLS
LAYOUT-RELATED PROPERTIES
Bidirectional Layout
Mirroring
East Asian Width
Line-Breaking Property
NORMALIZATION-RELATED PROPERTIES
Decomposition
Combining Class
Normalization Test File
Derived Normalization Properties
Grapheme Cluster-Related Properties
Unicode Storage and Serialization Formats
A HISTORICAL NOTE
UTF-32
UTF-16 AND THE SURROGATE MECHANISM
ENDIANNESS AND THE BYTE ORDER MARK
UTF-8
CESU-8
UTF-EBCDIC
UTF-7
STANDARD COMPRESSION SCHEME FOR UNICODE
BOCU
DETECTING UNICODE STORAGE FORMATS
A Guided Tour of the Character Repertoire
Scripts of Europe
THE WESTERN ALPHABETIC SCRIPTS
THE LATIN ALPHABET
The Latin-1 Characters
The Latin Extended A Block
The Latin Extended B Block
The Latin Extended Additional Block
The International Phonetic Alphabet
DIACRITICAL MARKS
Isolated Combining Marks
Spacing Modifier Letters
THE GREEK ALPHABET
The Greek Block
The Greek Extended Block
THE CYRILLIC ALPHABET
The Cyrillic Block
The Cyrillic Supplementary Block
THE GEORGIAN ALPHABET
Scripts of the Middle East
BIDIRECTIONAL TEXT LAYOUT
THE UNICODE BIDIRECTIONAL LAYOUT ALGORITHM
Inherent Directionality
Neutrals
Numbers
The Left-to-Right and Right-to-Left Marks
The Explicit Override Characters
The Explicit Embedding Characters
Mirroring Characters
Line and Paragraph Boundaries
THE HEBREW ALPHABET
The Hebrew Block
THE ARABIC ALPHABET
The Arabic Block
Joiners and Nonjoiners
The Arabic Presentation Forms B Block
The Arabic Presentation Forms A Block
The Syriac Block
THE THAANA SCRIPT
The Thaana Block
Scripts of India and Southeast Asia
DEVANAGARI
The Devanagari Block
The Bengali Block
The Gurmukhi Block
The Gujarati Block
The Oriya Block
TAMIL
The Tamil Block
The Telugu Block
The Kannada Block
The Malayalam Block
The Sinhala Block
THAI
The Thai Block
The Lao Block
The Khmer Block
The Myanmar Block
The Tibetan Block
THE PHILIPPINE SCRIPTS
Scripts of East Asia
THE HAN CHARACTERS
VARIANT FORMS OF HAN CHARACTERS
HAN CHARACTERS IN UNICODE
The CJK Unified Ideographs Area
The CJK Unified Ideographs Extension B Area
The CJK Compatibility Ideographs Supplement Block
The CJK Radicals Supplement Block
BOPOMOFO
The Bopomofo Block
The Bopomofo Extended Block
The Hiragana Block
The Katakana Block
KOREAN
The Hangul Jamo Block
The Hangul Compatibility Jamo Block
HALFWIDTH AND FULLWIDTH CHARACTERS
The Halfwidth and Fullwidth Forms Block
RUBY
The Interlinear Annotation Characters
The Yi Syllables Block
Scripts from Other Parts of the World
MONGOLIAN
The Mongolian Block
ETHIOPIC
The Ethiopic Block
The Cherokee Block
CANADIAN ABORIGINAL SYLLABLES
The Unified Canadian Aboriginal Syllables Block
Runic
Ogham
Old Italic
Gothic
Deseret
Numbers, Punctuation, Symbols, and Specials
NUMBERS
Alphabetic Numerals
Numerals
Han Characters as Numerals
Other Numeration Systems
Numeric Presentation Forms
National and Nominal Digit Shapes
Script-Specific Punctuation
The CJK Symbols and Punctuation Block
Dashes and Hyphens
Quotation Marks, Apostrophes, and Similar-Looking Characters
Paired Punctuation
Dot Leaders
SPECIAL CHARACTERS
Line and Paragraph Separators
Segment and Page Separators
Control Characters
Characters That Control Word Wrapping
Characters That Control Glyph Selection
The Grapheme Joiner
Bidirectional Formatting Characters
Deprecated Characters
Interlinear Annotation
The Object Replacement Character
The General Substitution Character
Tagging Characters
Noncharacters
SYMBOLS USED WITH NUMBERS
Numeric Punctuation
Unit Markers
Math Symbols
Mathematical Alphanumeric Symbols
OTHER SYMBOLS AND MISCELLANEOUS CHARACTERS
Braille
Other Symbols
Presentation Forms
Miscellaneous Characters
Implementing and Using the Unicode Standard
Unicode Text
USEFUL DATA STRUCTURES
TESTING FOR MEMBERSHIP IN A CLASS
The Inversion List
Performing Set Operations on Inversion Lists
MAPPING SINGLE CHARACTERS TO OTHER VALUES
The Compact Array
Two-Level Compact Arrays
MAPPING SINGLE CHARACTERS TO MULTIPLE VALUES
Exception Tables
MAPPING MULTIPLE CHARACTERS TO OTHER VALUES
Tries as Exception Tables
Tries as the Main Lookup Table
SINGLE VERSUS MULTIPLE TABLES
Conversions and Transformations
CONVERTING BETWEEN UNICODE ENCODING FORMS
Converting Between UTF-16 and UTF-32
Converting Between UTF-8 and UTF-32
Converting Between UTF-8 and UTF-16
UNICODE NORMALIZATION
Canonical Decomposition
Compatibility Decomposition
Canonical Composition
Optimizing Unicode Normalization
Testing Unicode Normalization
CONVERTING BETWEEN UNICODE AND OTHER STANDARDS
Getting Conversion Information
Converting Between Unicode and Multibyte Encodings
Handling Exceptional Conditions
Dealing with Differences in Encoding Philosophy
Choosing a Converter
CASE MAPPING AND CASE FOLDING
Case Mapping on a String
Case Folding
TRANSLITERATION
Searching and Sorting
THE BASICS OF LANGUAGE-SENSITIVE STRING COMPARISON
Multilevel Comparisons
Ignorable Characters
French Accent Sorting
Contracting Character Sequences
Expanding Characters
Putting It All Together
Other Processes and Equivalences
LANGUAGE-SENSITIVE COMPARISON ON UNICODE TEXT
Reordering
A General Implementation Strategy
The Unicode Collation Algorithm
The Default UCA Sort Order
Alternate Weighting
Optimizations and Enhancements
LANGUAGE-INSENSITIVE STRING COMPARISON
SORTING
Exposing Sort Keys
Minimizing Sort Key Length
SEARCHING
The Boyer-Moore Algorithm
Using the Boyer-Moore Algorithm with Unicode
Whole Word Searches
USING UNICODE WITH REGULAR EXPRESSIONS
Rendering and Editing
LINE BREAKING
Line-Breaking Properties
Implementing Boundary Analysis with Pair Tables
Implementing Boundary Analysis with State Machines
Performing Boundary Analysis Using a Dictionary
A Few More Thoughts on Boundary Analysis
Performing Line Breaking
LINE LAYOUT
GLYPH SELECTION AND POSITIONING
Font Technologies
Poor Man's Glyph Selection
Glyph Selection and Placement in AAT
Glyph Selection and Placement in OpenType
Special-Purpose Rendering Technology
SPECIAL TEXT-EDITING CONSIDERATIONS
Accepting Text Input
Handling Arrow Keys
Handling Discontiguous Selection
Handling Multiple-Click Selection
Unicode and Other Technologies
The W3C Character Model
XML
HTML and HTTP
URLs and Domain Names
Mail and Usenet
UNICODE AND PROGRAMMING LANGUAGES
Java
C and C++
JavaScript and JScript
ICU
UNICODE AND OPERATING SYSTEMS
MacOS
CONCLUSION
Glossary
Bibliography
Index
Copyright

About the author (2003)

Richard Gillam is a senior development engineer at Trilogy, a leading developer of large-enterprise e-commerce solutions. He is a former member of IBM's Globalization Center of Competency, where he was one of the original designers of the open-source International Components for Unicode and was responsible for several of the international frameworks in the Java Class Libraries. Rich is a former columnist for C++ Report, a regular presenter at the International Unicode Conferences, and a Specialist Member of the Unicode Consortium.



