Unicode Support for Mathematics

##### Presentation Transcript

1. Unicode Support for Mathematics Murray Sargent III Microsoft

2. Overview • Unicode math characters • Semantics of math characters • Unicode and markup • Multiple ways of encoding math characters • Not yet standardized math characters • Inputting math symbols

3. Unicode Math Characters • 340 math characters exist in Unicode 3.0: in ASCII, in U+2200–U+22FF, and among the arrows and combining marks • 996 math alphanumeric characters are in Unicode 3.1’s Plane 1 • 591 new math symbols and operators are in Unicode 3.2’s BMP • One math variant selector • One new combining character (reverse solidus).

4. Basic Set of Alphanumeric Characters • Latin digits (0 - 9) • Upper- & lowercase Latin letters (a - z, A - Z) • Uppercase Greek letters Α - Ω plus the nabla ∇ and the variant of theta Θ given by U+03F4 • Lowercase Greek letters α - ω plus the partial differential sign ∂ and glyph variants of ε, θ, κ, φ, ρ, and π • Only unaccented forms of letters are used

5. Math Alphanumeric Characters • Math needs various Latin and Greek alphabets like normal, bold, italic, script, Fraktur, and open-face • May appear to be font variations, but have distinct semantics • Without these distinctions, you get gibberish, violating Unicode rule: plain text must contain enough info to permit the text to be rendered legibly, and nothing more • Plain-text searches should distinguish between alphabets, e.g., search for script H shouldn’t match H, etc. • Reduces markup verbosity

6. Legibility Loss Without math alphabets, the Hamiltonian formula ℋ = ∫ dτ [εE² + μH²] becomes an integral equation H = ∫ dτ [εE² + μH²]
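The distinction between script H and plain H can be checked directly in plain text. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only and are not from the presentation. It assumes the script H is U+210B (SCRIPT CAPITAL H) and shows that a search for it does not match the field H, while compatibility normalization (NFKC) folds the two together and loses the distinction.

```python
import unicodedata

SCRIPT_H = "\u210B"  # SCRIPT CAPITAL H, used here for the Hamiltonian

# Hamiltonian formula written with a script H on the left-hand side.
formula = SCRIPT_H + " = \u222B d\u03C4 [\u03B5E\u00B2 + \u03BCH\u00B2]"

# A plain-text search for script H finds only the Hamiltonian,
# not the magnetic field H inside the brackets.
script_positions = [i for i, ch in enumerate(formula) if ch == SCRIPT_H]

# NFKC folds script H to plain H (and superscript 2 to the digit 2),
# so the two meanings can no longer be told apart.
folded = unicodedata.normalize("NFKC", formula)
plain_h_count = folded.count("H")
```

After folding, both occurrences are the same letter, which is the legibility loss the slide describes.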

7. Math Alphanumeric Chars (cont) • Plain: a-z, A-Z, 0-9, α-ω, Α-Ω • Bold: a-z, A-Z, 0-9, α-ω, Α-Ω • Italic: a-z, A-Z, α-ω, Α-Ω • Bold italic: a-z, A-Z, α-ω, Α-Ω • Script: a-z, A-Z • Bold script: a-z, A-Z • Fraktur: a-z, A-Z • Bold Fraktur: a-z, A-Z • Double struck: a-z, A-Z, 0-9 • Sans-serif: a-z, A-Z, 0-9 • Sans-serif bold: a-z, A-Z, 0-9, α-ω, Α-Ω • Sans-serif italic: a-z, A-Z • Sans-serif bold italic: a-z, A-Z, α-ω, Α-Ω • Monospace: a-z, A-Z, 0-9

8. How to Display Math Alphabets? • Can use the Unicode surrogate pair mechanisms available on the OS • Alternatively, bind to standard fonts and use the corresponding BMP characters • The second approach is probably faster, and displaying Unicode requires font binding in any event. But most traditional fonts are not suited to math alphabetic characters • A single math font may look more consistent
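For the surrogate-pair route, the Plane 1 math alphanumerics are stored in UTF-16 as two 16-bit code units. The sketch below shows the standard UTF-16 arithmetic; the function name `to_surrogates` is hypothetical, and the example character (U+1D49C, MATHEMATICAL SCRIPT CAPITAL A) was chosen only for illustration.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D49C)

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001D49C".encode("utf-16-be")
```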

9. Math Alphabetics via Glyph Variants • One approach to the math alphanumerics would be to use a set of math glyph variant selectors • Such a tag would follow a base character imparting a math style • Approach was dropped since it seemed likely to be abused • One math variant selector does exist to offer a different line slant for some composite symbols • Other variant selectors are being defined for nonmath purposes, e.g., Han variants

10. Multiple Character Encodings • As with nonmath characters, math symbols can often be encoded in multiple ways, composed and decomposed • E.g., ≠ can be U+003D, U+0338 or U+2260 • Recommendation: use the fully composed symbol, e.g., U+2260 for ≠ • For alphabetic characters, use combining-mark sequences to get consistent typography • Some representations use markup for the alphabetic cases. This allows multicharacter combining marks.
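The two encodings of ≠ are canonically equivalent, so standard Unicode normalization converts between them. A minimal sketch with Python's `unicodedata` module follows; the helper name `compose` is an assumption for illustration. NFC yields the fully composed form recommended above.

```python
import unicodedata


def compose(text):
    """Return the fully composed (NFC) form of the text."""
    return unicodedata.normalize("NFC", text)


decomposed = "=\u0338"   # EQUALS SIGN + COMBINING LONG SOLIDUS OVERLAY
composed = compose(decomposed)

# NFD goes the other way, back to the two-character sequence.
round_trip = unicodedata.normalize("NFD", composed)
```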

11. Compatibility Holes • Compatibility holes (reserved positions) exist in some Unicode sequences to avoid duplicate encodings (ugh!) • E.g., U+2071-U+2073 are holes for ¹²³, which are U+00B9, U+00B2, and U+00B3, respectively • Math alphanumerics have holes corresponding to Letterlike symbols. • Recommendation: you can use the hole codes internally, but must import and export the standard codes.
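As an illustration of a hole in the math alphanumerics, the position where the italic small h would sit (U+1D455) is reserved because the Letterlike Symbols block already has U+210E PLANCK CONSTANT. The sketch below maps a Latin letter to its math italic form and substitutes the standard code for the hole, in the spirit of the recommendation above. The function name and the table `ITALIC_HOLES` are hypothetical, and only the lowercase italic alphabet is handled.

```python
import unicodedata

ITALIC_SMALL_A = 0x1D44E          # MATHEMATICAL ITALIC SMALL A
ITALIC_HOLES = {0x1D455: 0x210E}  # reserved "h" slot -> PLANCK CONSTANT


def math_italic_small(letter):
    """Map an ASCII lowercase letter to its math italic code point."""
    if len(letter) != 1 or not "a" <= letter <= "z":
        raise ValueError("expected one ASCII lowercase letter")
    code_point = ITALIC_SMALL_A + (ord(letter) - ord("a"))
    # Export the standard code instead of the hole position.
    return chr(ITALIC_HOLES.get(code_point, code_point))


italic_h = math_italic_small("h")
italic_x = math_italic_small("x")
hole_category = unicodedata.category(chr(0x1D455))  # unassigned position
```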

12. Nonstandard Characters • People will always invent new math characters that aren’t yet standardized. • Use private use area for these with a higher-level marking that these are for math. • This approach can lead to collisions in the math community (unless a standard is maintained) • Cut/copy in plain text can have collisions with other uses of the private use area

13. Unicode and Markup • Unicode was never intended to represent all aspects of text • Language attribute: sort order, word breaks • Rich (fancy) text formatting: built-up fractions • Content tags: headings, abstract, author, figure • Glyph variants: Poetica font: 58 ampersands; Mantinia font: novel ligatures (TT, TE, etc.) • MathML adds XML tags for math constructs, but seems awfully wordy

14. Unicode Plain Text • Can do a lot with plain text, e.g., BiDi • Grey zone: use of embedded codes • Unicode ascribes semantics to characters, e.g., paragraph mark, right-to-left mark • Lots of interesting punctuation characters in range U+2000 to U+204F • Extensive character semantics/properties tables, including mathematical, numerical

15. Unicode Character Semantics • Math characters have the math property • Math characters are numeric, variable, or operator, but not a combination • Properties are useful in parsing math plain text • MathML doesn’t use these properties: every quantity is explicitly tagged • Properties can still be useful for inputting text for MathML (no one wants to type all those tags!) • Sometimes default properties need to be overruled • It would be useful to have more math properties
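A rough idea of how character properties can drive parsing is sketched below. Python's `unicodedata` module does not expose the Unicode Math property itself, so this sketch uses the general category as a stand-in, which is an assumption: decimal digits count as numeric, letters as variables, and math symbols (category Sm) as operators. The set `ASCII_OPERATORS` and the function name `classify` are hypothetical.

```python
import unicodedata

# ASCII characters treated as operators here (an illustrative choice);
# some of them, such as "/" and "*", are punctuation by general category.
ASCII_OPERATORS = set("+-*/=<>^_")


def classify(ch):
    """Classify one character as numeric, variable, operator, or other."""
    if ch in ASCII_OPERATORS:
        return "operator"
    category = unicodedata.category(ch)
    if category == "Nd":
        return "numeric"
    if category.startswith("L"):
        return "variable"
    if category == "Sm":
        return "operator"
    return "other"


kinds = [classify(ch) for ch in "3x\u2211/ "]
```

A parser that needs to overrule a default, as the slide mentions, could consult its own table before falling back to a function like this.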

16. Plain Text Encoding • In TeX, a fraction’s numerator is what follows a { up to the keyword \over • The denominator is what follows the \over up to the matching } • The { } are not printed • Simple rules give unambiguous “plain text”, but the results don’t look like math • How can we make a plain text that looks like math?

17. Simple plain text encoding • Simple operand is a span of alphanumeric characters • E.g., simple numerator or denominator is terminated by any operator • Operators include arithmetic operators, most whitespace characters, all U+22xx, an argument “break” operator (displayed as small raised dot), sub/superscript operators • Fraction operator is given by the Unicode fraction slash operator U+2044

18. Fractions • abc/d gives a built-up fraction with numerator abc and denominator d • More complicated operands use parentheses ( ), brackets [ ], or braces { } • Outermost parens aren’t displayed in built-up form • E.g., plain text (a + c)/d displays as a built-up fraction with numerator a + c and denominator d • Easier to read than TeX’s, e.g., {a + c \over d} • MathML: <mfrac><mrow><mi>a</mi><mo>+</mo> <mi>c</mi></mrow><mrow><mi>d</mi> </mrow></mfrac> • Neat feature: plain text looks like math
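The operand rules above can be sketched as a tiny converter from the plain-text form to TeX's `\over` form. This is a simplified illustration under stated assumptions, not the presenter's implementation: it handles exactly one fraction, accepts either `/` or the fraction slash U+2044, treats a simple operand as a run of alphanumerics, and drops the outermost bracket pair of a bracketed operand. All names are hypothetical.

```python
FRACTION_OPERATORS = {"/", "\u2044"}
BRACKETS = {"(": ")", "[": "]", "{": "}"}


def read_operand(text, pos):
    """Read one operand starting at pos; return (operand, next position)."""
    if pos < len(text) and text[pos] in BRACKETS:
        opener, closer = text[pos], BRACKETS[text[pos]]
        depth = 0
        for i in range(pos, len(text)):
            if text[i] == opener:
                depth += 1
            elif text[i] == closer:
                depth -= 1
                if depth == 0:
                    # Outermost brackets are not kept in built-up form.
                    return text[pos + 1:i], i + 1
        raise ValueError("unbalanced bracket")
    end = pos
    while end < len(text) and text[end].isalnum():
        end += 1
    if end == pos:
        raise ValueError("expected an operand")
    return text[pos:end], end


def fraction_to_tex(text):
    """Convert 'numerator/denominator' plain text to TeX's \\over form."""
    numerator, pos = read_operand(text, 0)
    if pos >= len(text) or text[pos] not in FRACTION_OPERATORS:
        raise ValueError("expected a fraction operator")
    denominator, pos = read_operand(text, pos + 1)
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return "{" + numerator + " \\over " + denominator + "}"


simple = fraction_to_tex("abc/d")
grouped = fraction_to_tex("(a + c)\u2044d")
```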

19. Subscripts and Superscripts • Unicode has numeric subscripts and superscripts along with some operators (U+2070-U+208E) • Others need some kind of markup like <msup>…</msup> • With special subscript and superscript operators (not yet in Unicode), these scripts can be encoded in a nestable way • Use parentheses as for fractions to overrule built-in precedence order
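For the numeric case that Unicode already covers, a translation table is enough. The sketch below maps ASCII digits to the superscript and subscript digits; note that superscript 1, 2, and 3 live at U+00B9, U+00B2, and U+00B3 rather than in the U+2070 block. The helper names are hypothetical.

```python
# Superscript digits 0-9: 1, 2, 3 are in Latin-1, the rest near U+2070.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079",
)
# Subscript digits 0-9 are contiguous at U+2080..U+2089.
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "".join(chr(0x2080 + i) for i in range(10)),
)


def superscript(digits):
    """Replace ASCII digits with Unicode superscript digits."""
    return digits.translate(SUPERSCRIPT_DIGITS)


def subscript(digits):
    """Replace ASCII digits with Unicode subscript digits."""
    return digits.translate(SUBSCRIPT_DIGITS)


energy = "E = mc" + superscript("2")
water = "H" + subscript("2") + "O"
```

Anything beyond digits and the few encoded operators, such as a superscripted expression, still needs markup or the proposed script operators.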

20. Presentation markup • Presentation markup directs how the math should be rendered. <mrow> <mi>E</mi> <mo>=</mo> <mrow> <mi>m</mi> <mo>&InvisibleTimes;</mo> <msup> <mi>c</mi> <mn>2</mn> </msup> </mrow> </mrow>

21. Content markup • Content markup describes the meaning of the expression, not the format. <reln> <eq/> <ci>E</ci> <apply> <times/> <ci>m</ci> <apply> <power/> <ci>c</ci> <cn>2</cn> </apply> </apply> </reln>

22. Unicode TeX Example

23. Symbol Entry • GUI PCs can display myriad glyphs, mathematics symbols, and international characters • Hard to input special symbols. Menu methods are slow. Hot keys are great but hard to learn • Reexamine and improve symbol-input and storage methods • With left/right Ctrl/Alt keys, PC keyboard gives direct access to 600 symbols. Maximum possible = 2¹⁰⁰ ≈ 10³⁰ • Use on-screen, customizable keyboards and symbol boxes • Drag & drop any symbol into apps or onto keyboards

24. Hex to Unicode Input Method • Type the Unicode character’s hexadecimal code • Make corrections as needed • Type Alt+x to convert to the character • Type Alt+x again to convert back to hex (especially useful for a “missing glyph” character) • Resolve ambiguities by selection • Input higher-plane chars using a 5- or 6-digit code • New MS Word standard
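The conversion at the heart of this input method is a hex-to-code-point round trip. The sketch below shows only that conversion, not the editor behavior (selection handling, ambiguity resolution); the function names are hypothetical.

```python
def hex_to_char(hex_code):
    """Convert a hexadecimal code such as '2260' or '1D49C' to a character."""
    return chr(int(hex_code, 16))


def char_to_hex(ch):
    """Convert a character back to its hex code, at least four digits."""
    return format(ord(ch), "04X")


not_equal = hex_to_char("2260")          # BMP character, 4-digit code
script_a = hex_to_char("1D49C")          # higher-plane character, 5-digit code
back_to_hex = char_to_hex(script_a)
```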

25. Built-Up Formula Heuristics • Math characters identify themselves and neighbors as math • E.g., the fraction slash (U+2044), ASCII operators, U+2200–U+22FF, and U+20D0–U+20FF identify neighbors as mathematical • Math characters include various English and Greek alphabets • When heuristics fail, the user can select math mode: WYSIWYG instead of visible math on/off codes
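A first cut at such a heuristic is a range test on each character. The sketch below uses the ranges named on the slide plus the Plane 1 math alphanumerics block (U+1D400–U+1D7FF); which ASCII operators should count is an assumption, and the small set chosen here deliberately leaves out the hyphen so that ordinary hyphenated words do not trigger it. All names are hypothetical.

```python
MATH_RANGES = [
    (0x2044, 0x2044),    # fraction slash
    (0x2200, 0x22FF),    # mathematical operators
    (0x20D0, 0x20FF),    # combining marks for symbols
    (0x1D400, 0x1D7FF),  # mathematical alphanumeric symbols
]
ASCII_OPERATORS = set("+=<>^")  # illustrative subset, an assumption


def is_math_char(ch):
    """Return True if the character marks its surroundings as math."""
    if ch in ASCII_OPERATORS:
        return True
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in MATH_RANGES)


def looks_like_math(text):
    """Return True if any character in the text identifies it as math."""
    return any(is_math_char(ch) for ch in text)


fraction_detected = looks_like_math("a\u2044b")
prose_detected = looks_like_math("a well-known result")
```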

26. Operator Precedence • Everyone knows that multiply takes precedence over add, e.g., 3+5×3 = 18, not 24 • C-language precedence is too intricate for most programmers to use extensively • TeX doesn’t use precedence; relies on { } to define operator scope • In general, ( ) can be used to clarify or overrule precedence • Precedence reduces clutter, so some precedence is desirable (else things look like LISP!) • But keep it simple enough to remember easily
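The 3+5×3 example can be reproduced with a small precedence-climbing evaluator. This is a generic textbook technique shown for illustration, not the scheme used in the presentation; it has only two precedence levels, works on non-negative integers, and does no error handling. The names are hypothetical.

```python
import re

# Higher number binds tighter: multiply and divide beat add and subtract.
PRECEDENCE = {"+": 1, "-": 1, "\u00D7": 2, "*": 2, "/": 2}


def apply_operator(op, left, right):
    if op == "+":
        return left + right
    if op == "-":
        return left - right
    if op in ("\u00D7", "*"):
        return left * right
    return left / right


def evaluate(text):
    """Evaluate an arithmetic expression using operator precedence."""
    tokens = re.findall(r"\d+|[-+\u00D7*/()]", text)
    pos = 0

    def parse_primary():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expression(1)
            pos += 1  # skip the closing parenthesis
            return value
        return int(token)

    def parse_expression(min_precedence):
        nonlocal pos
        left = parse_primary()
        while (pos < len(tokens)
               and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_precedence):
            op = tokens[pos]
            pos += 1
            # The right operand may only contain tighter-binding operators.
            right = parse_expression(PRECEDENCE[op] + 1)
            left = apply_operator(op, left, right)
        return left

    return parse_expression(1)


with_precedence = evaluate("3+5\u00D73")
with_parentheses = evaluate("(3+5)\u00D73")
```

Parentheses overrule the built-in precedence, as the slide notes.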

27. Layout Operator Precedence

| Operator class | Characters |
| --- | --- |
| Subscript, superscript | ↓ ↑ |
| Integral, sum | ∫ Σ Π |
| Functions | √ |
| Times, divide | / * × · • |
| Other operators | Space " . , = - + Tab |
| Right brackets | ) ] } \| |
| Left brackets | ( [ { |
| End of paragraph | FF EOP |

28. Mathematics as a Programming Language • Fortran made great steps in getting computers to understand mathematics • Java and C# accept Unicode variable names • C++ has preprocessor and operator overloading, but needs extensions to be really powerful • Use Unicode characters including math alphanumerics • Use plain-text encoding of mathematical expressions • Can’t use all mathematical expressions as code, but can go much further than current languages go • When to multiply? In the abstract, multiplication is infinitely fast and precise, but not on a computer

29.

    void IHBMWM(void)
    {
        gammap = gamma*sqrt(1 + I2);
        upsilon = cmplx(gamma+gamma1, Delta);
        alphainc = alpha0*(1-(gamma*gamma*I2/gammap)/(gammap + upsilon));
        if (!gamma1 && fabs(Delta*T1) < 0.01)
            alphacoh = -half*alpha0*I2*pow(gamma/gammap, 3);
        else
        {
            Gamma = 1/T1 + gamma1;
            I2sF = (I2/T1)/cmplx(Gamma, Delta);
            betap2 = upsilon*(upsilon + gamma*I2sF);
            beta = sqrt(betap2);
            alphacoh = 0.5*gamma*alpha0*(I2sF*(gamma + upsilon)
                       /(gammap*gammap - betap2))
                       *((1+gamma/beta)*(beta - upsilon)/(beta + upsilon)
                       - (1+gamma/gammap)*(gammap - upsilon)/(gammap + upsilon));
        }
        alpha1 = alphainc + alphacoh;
    }

30. Conclusions • Unicode provides great support for math in both marked up and plain text • Unicode character properties facilitate plain-text encoding of mathematics but aren’t used in MathML • Heuristics allow plain text to be built up • Need two more Unicode assignments: subscript and superscript operators • On-screen keyboards and symbol boxes aid formula entry • Unicode math characters could be useful for programming languages