This presentation outlines innovative approaches for dynamic glyph generation based on variable length encoding for Chinese, Japanese, and Korean (CJK) texts. It focuses on the challenges of encoding morphemes, particularly the missing characters in Chinese text. Solutions proposed include using basic parts for encoding, creating a glyph decomposition database, and implementing a dynamic glyph generator compatible with existing systems. The integration into current operating systems and addressing the issues related to quality and generation speed are also discussed.
Dynamic Glyph Generation Based on a Variable-Length Encoding Schema. Yap Cheah Shen, eForth Technology. Glyph & Typesetting Workshop, Kyoto, 29 Nov 2003
Outline of Presentation • Morpheme: Latin vs. Han • Latin text encoding • Missing characters in Chinese text • Solution • Implementation details • Glyph decomposition database • Topological conversion of strokes • Automatic frame calculation • Integrating into existing OS • Other issues
Morpheme: Latin vs. Han • A morpheme is the smallest meaningful unit in a language. • For Latin text, it is the word. • For Chinese text, it is the Hanzi or Kanji. • Because morphemes represent real-world ideas, they keep changing over time. • Morphemes form an open set.
Latin Text Encoding • The letters of the alphabet form a fixed set of symbols. • All words can be represented as sequences of letters. • Letters are the ideal encoding units for Latin text, e.g. ASCII. • No “missing word” encoding problem.
Missing Characters in Chinese Text • Not all existing Hanzi are encoded. • Hanzi form an open set: theoretically, historically, and practically. • Existing encoding schemas rest on wrong assumptions and designs. • An unending loop: assign code points, update the OS, ship a new font, ship a new input method table. • Industry is happy; users suffer.
Solution-1 • Parts or components as the encoding unit: 日 月 金 木 水 火 土 人 心 手 口 女 艹 疒 犭 • Most characters can be represented by a finite set of basic parts. • Strokes are used to construct rarely used parts (thousands of parts appear only once or twice).
Solution-2 • A closed set of basic parts and strokes as the encoding unit. • 3 joining operators: horizontal, vertical, and enclosing. • 1 shielding operator: hides a stroke. • Prefix notation: allows recursive composition.
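As an illustration of how prefix notation permits recursive composition, here is a minimal Python sketch of a parser for such sequences. It assumes the three joining operators are written with the Unicode symbols ⿰, ⿱ and ⿴ and that each takes exactly two operands; the shielding operator is left out. The names `JOIN_OPS`, `parse` and `parse_glyph` are hypothetical and not taken from the presentation.

```python
# Joining operators (assumed to be binary): horizontal, vertical, enclosing.
JOIN_OPS = {"⿰": "horizontal", "⿱": "vertical", "⿴": "enclosing"}


def parse(seq, pos=0):
    """Parse one prefix expression starting at seq[pos].

    Returns (tree, next_pos). A tree is either a single part (a str)
    or a tuple (operator, first_operand, second_operand).
    """
    if pos >= len(seq):
        raise ValueError("sequence ended before the expression was complete")
    head = seq[pos]
    if head in JOIN_OPS:
        # An operator is followed by two operands, each of which may
        # itself be a nested expression: this is the recursive step.
        first, pos = parse(seq, pos + 1)
        second, pos = parse(seq, pos)
        return (head, first, second), pos
    # Anything that is not an operator is treated as a basic part.
    return head, pos + 1


def parse_glyph(seq):
    """Parse a sequence that must describe exactly one glyph."""
    tree, end = parse(seq)
    if end != len(seq):
        raise ValueError("trailing symbols after a complete expression")
    return tree


# 艹 placed above (日 placed beside 月)
tree = parse_glyph("⿱艹⿰日月")
```

Because an operator always announces how many operands follow, no brackets are needed, and the parser can detect truncated or over-long sequences.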
Solution-3 • Ordinary CJK fixed-length encoding schema: a numeric value as the character code. • Input method table • Converts input keystrokes to character codes. • Static font file • Glyph data is pre-designed. • Glyph data is accessed by character code. • Text file • A sequence of character codes.
Solution-4 • Additional features of a variable-length-encoding CJK environment. • Input • Characters can be sorted and filtered by parts. • Compatible with any existing input method. • Display • The font file stores commonly used characters and parts. • Glyphs are generated on the fly from glyph description sequences. • Storage and data exchange • Compatible with Unicode. • Ideographic description sequences.
Dynamic Glyph Generator • Input: • Various types of variable-length descriptive character code sequences. • 構字式 of Academia Sinica • 組字式 of CBETA • Unicode ideographic description characters • Output: display & print • TrueType-compatible outline • Rasterized bitmap • Macromedia Flash, SVG • The task: a layout problem, fitting a one-dimensional sequence into a two-dimensional square.
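The layout task can be sketched as a recursive subdivision of the unit square. The Python sketch below is not the presenter's algorithm: it assumes the sequence has already been parsed into nested tuples of the form `(operator, first, second)`, splits every box into equal halves, and insets an enclosed part by a fixed 25% margin. The function name `layout`, the equal split, the margin, and the top-left origin are all assumptions made for illustration.

```python
def layout(tree, box=(0.0, 0.0, 1.0, 1.0)):
    """Assign a rectangle (x, y, width, height) to every part of a glyph.

    tree: a part (str) or a tuple (operator, first, second).
    box:  the rectangle available to this subtree; origin at top-left.
    Returns a list of (part, rectangle) pairs in reading order.
    """
    if isinstance(tree, str):
        return [(tree, box)]
    op, first, second = tree
    x, y, w, h = box
    if op == "⿰":      # horizontal: first on the left, second on the right
        a = (x, y, w / 2, h)
        b = (x + w / 2, y, w / 2, h)
    elif op == "⿱":    # vertical: first on top, second below
        a = (x, y, w, h / 2)
        b = (x, y + h / 2, w, h / 2)
    elif op == "⿴":    # enclosing: first keeps the box, second is inset
        a = box
        b = (x + w * 0.25, y + h * 0.25, w * 0.5, h * 0.5)
    else:
        raise ValueError("unknown operator: " + op)
    return layout(first, a) + layout(second, b)


# 艹 above (日 beside 月)
boxes = layout(("⿱", "艹", ("⿰", "日", "月")))
```

Each part ends up with its own rectangle inside the square; a real generator would then scale the part's outline into that rectangle.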
Implementation-1 The system consists of 3 major parts: • Glyph decomposition database • Courtesy of Prof. Hsieh from Academia Sinica, Taiwan http://www.sinica.edu.tw/~cdp/ • Outlines of strokes and components • Beijing ZhongYi Co., a professional outline font vendor. http://www.zhongyicts.com.cn/ • The eForth system: putting everything together, hardware-software co-engineering.
Implementation-2 • Glyph decomposition database • All CJK glyphs defined by Unicode 4.0, 71000+ in total. • 549 basic parts; stroke sequences are preserved. • 3996 total parts • Total part frequency: 165122 • Accumulated frequency: • Top 50: 51389 = 31% • Top 200: 87381 = 53% • Top 1000: 129393 = 78%
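The accumulated-frequency percentages follow directly from the counts on this slide. A short Python check, using only the slide's numbers (the variable names are illustrative):

```python
# Counts taken from the slide.
TOTAL_PART_FREQUENCY = 165122
accumulated = {50: 51389, 200: 87381, 1000: 129393}

# Share of all part occurrences covered by the N most frequent parts,
# rounded to a whole percent.
coverage = {
    top_n: round(100 * count / TOTAL_PART_FREQUENCY)
    for top_n, count in accumulated.items()
}
```

The 1000 most frequent parts, about a quarter of the 3996 parts, cover roughly 78% of all part occurrences.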
Implementation-3 • Strokes are described as an outline with a skeletal line. • Both the outline and the skeletal line are quadratic Bézier curves. • Outline points are recalculated according to the scaled skeletal line. • Result: • Stroke data is highly reusable. • Stroke weights are adjustable.
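One way to picture why a skeleton makes stroke data reusable: scale only the skeletal line, then rebuild the outline around it at the desired weight. The Python sketch below samples a quadratic Bézier skeleton and offsets each sample along the curve normal. This is a simplified stand-in, not the presenter's method, which keeps the outline itself as quadratic Bézier curves; the function names and the sampling approach are assumptions.

```python
import math


def bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return (u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0],
            u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1])


def bezier_tangent(p0, p1, p2, t):
    """First derivative of the quadratic Bezier curve at parameter t."""
    u = 1.0 - t
    return (2 * u * (p1[0] - p0[0]) + 2 * t * (p2[0] - p1[0]),
            2 * u * (p1[1] - p0[1]) + 2 * t * (p2[1] - p1[1]))


def stroke_outline(skeleton, sx, sy, weight, samples=8):
    """Scale a skeleton, then build a polygon outline of constant weight.

    skeleton: three control points (p0, p1, p2) of the skeletal line.
    sx, sy:   scale factors applied to the skeleton only.
    weight:   stroke thickness, independent of the scale factors.
    """
    p0, p1, p2 = [(x * sx, y * sy) for x, y in skeleton]
    left, right = [], []
    for i in range(samples + 1):
        t = i / samples
        x, y = bezier_point(p0, p1, p2, t)
        dx, dy = bezier_tangent(p0, p1, p2, t)
        length = math.hypot(dx, dy)
        if length == 0:
            raise ValueError("degenerate skeleton: zero-length tangent")
        nx, ny = -dy / length, dx / length   # unit normal to the curve
        half = weight / 2
        left.append((x + nx * half, y + ny * half))
        right.append((x - nx * half, y - ny * half))
    # Walk up one side and back down the other to close the polygon.
    return left + right[::-1]


# The same skeleton, stretched to twice its width, at stroke weight 2.
outline = stroke_outline([(0, 0), (5, 0), (10, 0)], sx=2, sy=1, weight=2, samples=2)
```

Because the weight is applied after scaling, stretching a stroke to fit a narrow or wide frame does not thin or thicken it.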
Implementation-4 • Automatic frame calculation • An algorithm estimates the complexity of each part to decide the part's proportion in the resulting glyph. • 漁: 氵 25%, 魚 70%, roughly. • 觀: 雚 55%, 見 40%, roughly. • Result: • Clear glyph descriptive expressions • Search-engine friendly • Human readable
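The slide does not say how complexity is measured, so the following Python sketch only illustrates the general idea of dividing a frame in proportion to a complexity score. It uses stroke count as a stand-in score and reserves 5% of the axis as spacing, a value inferred from the fact that both examples on the slide sum to about 95%. The function name `frame_shares` and both of these choices are assumptions, and the result is not expected to reproduce the slide's figures.

```python
def frame_shares(complexities, gap=0.05):
    """Split one axis of the frame among parts in proportion to complexity.

    complexities: one positive score per part (stroke counts here).
    gap:          fraction of the axis left as spacing (assumed value).
    Returns the fraction of the axis given to each part.
    """
    if not complexities or any(c <= 0 for c in complexities):
        raise ValueError("complexities must be a non-empty list of positive numbers")
    total = sum(complexities)
    usable = 1.0 - gap
    return [usable * c / total for c in complexities]


# 漁 = 氵 beside 魚, scored by stroke count: 氵 has 3 strokes, 魚 has 11.
shares = frame_shares([3, 11])
```

A plain stroke-count ratio gives the simpler part less room than the slide's 25%, which hints that the real estimator dampens the ratio or enforces a minimum width.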
Integrating into existing OS/GUI • String manipulation library • Number of characters • -1 for operators, +1 for characters • Character width • Graphic sub-system • Drawing a text line (e.g. ExtTextOut) • Text handling widgets • Awareness of glyph expressions for caret, selection, and delete/backspace.
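The counting rule (-1 for operators, +1 for characters) works because each joining operator merges two operands into one glyph. The Python sketch below applies that rule and also splits a text line into whole glyph expressions, which is what a caret or backspace handler would need. It assumes only the three binary joining operators ⿰, ⿱ and ⿴ appear in the text; the shielding operator is not handled, and the function names are hypothetical.

```python
JOIN_OPS = set("⿰⿱⿴")   # assumed binary joining operators


def glyph_count(text):
    """Number of displayed glyphs in well-formed text: +1 per part, -1 per operator."""
    return sum(-1 if ch in JOIN_OPS else 1 for ch in text)


def split_glyphs(text):
    """Split text into substrings, one per displayed glyph.

    `pending` counts how many operands are still needed to finish the
    current expression; a glyph is complete when it drops to zero.
    """
    glyphs, start, pending = [], 0, 1
    for i, ch in enumerate(text):
        pending += 1 if ch in JOIN_OPS else -1
        if pending == 0:
            glyphs.append(text[start:i + 1])
            start, pending = i + 1, 1
    if start != len(text):
        raise ValueError("incomplete glyph expression at end of text")
    return glyphs


line = "漁⿱艹⿰日月"
count = glyph_count(line)      # two glyphs on screen
units = split_glyphs(line)     # caret positions fall between these units
```

A text widget built on `split_glyphs` would delete the whole expression `⿱艹⿰日月` on one backspace rather than leaving a dangling operator.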
Other Issues • Quality of the glyph • Trade-off with space: more part outlines give better quality. • Speed of generation • No problem on an IBM PC: glyph generation is rare. • For handheld devices, hardware acceleration is recommended.
Examples • Operators: ⿱ vertical combination, ⿰ horizontal combination, ⿴ enclosing, - hide • 盟 = ⿱明皿 or ⿱⿰日月皿 • 李世民: 民-5 (hide the 5th stroke) • 玄燁: 玄-5 • 丘-4 = U+20009