Working with digital texts part 2

Francesca Benatti and David King, The Open University

How does a computer process text?

Character encoding
• A character is the smallest component of a script that has a semantic value
• A pre-computer character encoding: the Morse alphabet used in telegraphs
• Computers represent all data as pulses of electricity
• With charge = 1
• Without charge = 0
• One bit of information = one switch (either 1 or 0)
• Eight bits = one byte

ASCII
• American Standard Code for Information Interchange: ASCII
• One ASCII character = seven bits, stored in practice as one byte (eight bits)
• ASCII maps 128 common characters (basic Latin letters, digits, punctuation and control codes) to binary equivalents
• Thus the basic Western alphabet can be transmitted/stored as pulses of electricity
• Hi! = 1001000 1101001 0100001
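The "Hi!" example can be reproduced in a few lines of Python. This is a minimal sketch: the function name `to_ascii_bits` is invented for illustration, and it assumes the 7-bit form of ASCII.

```python
def to_ascii_bits(text):
    """Return the 7-bit ASCII pattern of each character in text."""
    # encode("ascii") raises UnicodeEncodeError for any non-ASCII character
    return [format(byte, "07b") for byte in text.encode("ascii")]


bits = to_ascii_bits("Hi!")
print(" ".join(bits))  # 1001000 1101001 0100001
```

Each pattern is just the character's ASCII number (H = 72, i = 105, ! = 33) written in base 2.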

Pros and cons of ASCII
• Pros: interoperability, ubiquity
• Cons: a limited character set that only works for modern English; no representation of non-Western scripts, diacritic marks etc.

You cannot do this in ASCII!
ART. I. 1. Ιστορία Σουλιου καῖ Παργας, περιεχουσα την χρονολογίαν καῖ τους αὐτῶν πολεμους μετὰ του Ἀλῆ Πασια : viz. The History of Suli and Parga, containing their Chronology as well as their Wars against Ali Pacha. Venice, 1815.
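A short Python sketch shows what happens when you try. It takes the first two Greek words of the title above as sample text; the variable names are invented for illustration.

```python
title = "Ιστορία Σουλιου"

try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    # Greek letters have no ASCII code, so the encoder refuses
    ascii_ok = False
    print("Cannot encode as ASCII:", err.reason)

# UTF-8 can represent the same text without loss
utf8_bytes = title.encode("utf-8")
print(len(title), "characters ->", len(utf8_bytes), "bytes in UTF-8")
```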

Unicode (UTF-8)
• Up to 4 bytes per character (room for about 1.1 million code points)
• 1 byte = ASCII (128 characters)
• 2 bytes = most Latin letters with diacritics, Greek, Cyrillic, Hebrew, Arabic … (1,920 characters)
• 3 bytes = Chinese, Japanese, Korean …
• 4 bytes = historical scripts, emojis
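The byte counts can be checked directly. This is a minimal sketch with five sample characters chosen for illustration (A, é, α, 中 and an emoji), written as escape codes so the example survives copy and paste.

```python
samples = ["A", "\u00e9", "\u03b1", "\u4e2d", "\U0001F600"]

# Number of bytes each character needs once encoded as UTF-8
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

The five characters need 1, 2, 2, 3 and 4 bytes respectively.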

Encoding and research
• 92.4% of websites use UTF-8
• But some systems still rely on ASCII, e.g. email addresses
• Strange characters in the middle of a text are usually a sign of character encoding problems
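Those strange characters are easy to reproduce. The sketch below assumes one common cause: a text saved as UTF-8 but opened as Latin-1. The sample word is invented for illustration.

```python
original = "caf\u00e9"  # the word "café"

# Save as UTF-8, then (wrongly) read the bytes back as Latin-1
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reversing the wrong step recovers the text in this simple case
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The two bytes that UTF-8 uses for é are shown as two separate Latin-1 characters, which is why one accented letter turns into a pair of odd symbols.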

Line endings – What is a line?
The answer is that it depends on your computer's operating system (OS). A line ending is marked by a special, non-displayable sequence:
• Windows: CRLF (carriage return + linefeed), \r\n
• Classic Mac OS (before OS X): CR (carriage return), \r
• Unix, Linux and modern macOS: LF (linefeed), \n
Thus, a Windows file is bigger than a Unix file with the same content whenever the text contains line breaks, because each line ends with two bytes instead of one. This can make file comparison tricky, even though the textual contents are identical!
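The size difference, and one way around it, can be sketched in Python. The two sample lines and the helper name `normalise_newlines` are invented for illustration.

```python
lines = ["Now was the hour", "that wakens fond desire"]

# The same text saved with Unix (LF) and Windows (CRLF) line endings
unix_bytes = "".join(line + "\n" for line in lines).encode("utf-8")
windows_bytes = "".join(line + "\r\n" for line in lines).encode("utf-8")

# Windows needs one extra byte per line
print(len(windows_bytes) - len(unix_bytes))  # 2


def normalise_newlines(data):
    """Convert CRLF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# After normalising, the two versions compare as identical
print(normalise_newlines(windows_bytes) == unix_bytes)  # True
```

CRLF has to be replaced before lone CR, otherwise each Windows line ending would become two line breaks.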

Line endings – When it goes wrong
From Dante’s Purgatory, eighth canto:
Now was the hour that wakens fonddesire In men at sea, and melts their thoughtful hearts, Who in the morn have bid sweet friends farewel ; And pilgrim newly on his road with love Thrills, if he hear the vesper bell from far, That seems to mourn for the expiring day.
Now was the hour that wakens fond desire
In men at sea, and melts their thoughtful hearts,
Who in the morn have bid sweet friends farewel ;
And pilgrim newly on his road with love
Thrills, if he hear the vesper bell from far,
That seems to mourn for the expiring day.
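One way this can happen is when software looks only for one kind of line ending. The sketch below assumes a two-line extract saved with CR line endings; the variable names are invented for illustration.

```python
verse_cr = (
    "Now was the hour that wakens fond desire\r"
    "In men at sea, and melts their thoughtful hearts,"
)

# Looking only for LF finds no line break at all
naive_lines = verse_cr.split("\n")
print(len(naive_lines))  # 1

# splitlines() recognises CR, LF and CRLF
robust_lines = verse_cr.splitlines()
print(len(robust_lines))  # 2
```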

What formats can digital texts take?

World Wide Web Consortium - W3C

What are markup languages?
• Markup languages are a way of annotating electronic documents. Usually markup either:
• Specifies how something should be displayed.
• Specifies what something means.
• Some common markup languages:
• HTML – Hypertext Markup Language
• XHTML – eXtensible Hypertext Markup Language
• SGML – Standard Generalized Markup Language
• XML – eXtensible Markup Language
• We’ll be looking at two today in slightly more depth, both of which you’re likely to discover more about later!
• HTML
• XML

Hypertext Markup Language (HTML)

Tags in Markup
• Tags look like this:
• <p>This tag tells your browser to display the text within it as a paragraph</p>
• Elements start with an opening tag in angle brackets, e.g. <p>, and finish with a closing tag, e.g. </p>.
• Elements that open within an element should close before the first element closes:
• <element1><element2></element2></element1>
• Both the above principles apply to XML as well. In fact, HTML is more forgiving of errors than XML, so in XML following these principles is vital to allow automated processing to work.
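XML's strictness about nesting can be seen with Python's built-in XML parser. This is a minimal sketch; the helper name `is_well_formed` is invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Inner element closes before the outer one: accepted
print(is_well_formed("<element1><element2></element2></element1>"))  # True

# Closing tags in the wrong order: rejected
print(is_well_formed("<element1><element2></element1></element2>"))  # False
```

A web browser would usually still try to display the second example if it were HTML, whereas an XML parser stops with an error.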

eXtensible Markup Language (XML)

Exercise: Looking at an example of XML
• We’re now going to look at Old Bailey Online, which contains the digitised proceedings of the Old Bailey Criminal Court.
• First, go to https://oldbaileyonline.org or google ‘Old Bailey Online’.
• There is a search box on the right hand side of the home page – use this to search for anything that might take your fancy…
• We’re going to have a guided discussion around what you can see, based on the following:
• 1.) What different ways of viewing the digitised Old Bailey Online records are there, and how do they differ? What does each type of material allow you to do, or not?
• 2.) Can you identify any of the elements in the XML file? What categories are there, and how have the resource creators dealt with the unique nature of court records?
• 3.) What do you think about the interface for displaying the proceedings? With regard to the Mitchell Whitelaw article that we set for reading, how far does the Old Bailey Online allow for multiple approaches to the data?

Text Encoding Initiative - TEI
• The best way to add your scholarly knowledge to a digital/digitised text
• Example: Old Bailey Online
• Interoperable community standard based on W3C protocols

Why TEI?
• Computers can only deal with explicit data
• Text encoding makes aspects of texts beyond words explicit to a computer by "marking them up" in a language a computer can understand
• This language is called XML (eXtensible Markup Language)
• TEI markup can describe the structure of a text (chapters, lines, speakers), its material layout (e.g. manuscript corrections) and its interpretation
• The TEI Guidelines set a common grammar of text encoding, agreed upon by a community of humanists, social scientists, linguists and librarians

Example
Compare use of italics:
• I did not review it (italics = emphasis)
• Persuasion (italics = book title) was published in 1818.
These uses of italics seem identical to a computer.
Using TEI instead you can specify why italics are used:
• Emphasis: I did <emph>not</emph> review it
• Book title: <title>Persuasion</title> was published in 1818
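Once the two uses are tagged differently, a program can tell them apart. The sketch below wraps both sentences in a <p> element and reads them with Python's built-in XML parser. It leaves out the TEI namespace and header that a real TEI file would have, and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

snippet = (
    "<p>I did <emph>not</emph> review it. "
    "<title>Persuasion</title> was published in 1818.</p>"
)
paragraph = ET.fromstring(snippet)

# Ask separately for book titles and for emphasised words
titles = [el.text for el in paragraph.iter("title")]
emphases = [el.text for el in paragraph.iter("emph")]
print(titles)    # ['Persuasion']
print(emphases)  # ['not']

# The plain reading text is still there once the tags are ignored
plain_text = "".join(paragraph.itertext())
print(plain_text)
```

With italics alone, a search for all book titles in a text would also return every emphasised word.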

What can you do with (TEI)?
Mark the four main classes of textual phenomena:
• Structural (chapters, verses...)
• Renditional (fonts, alignments, colours...)
• Logical and semantic (titles of books, placenames, languages...)
• Analytic (theme, motive...)
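The four classes can be illustrated with a small invented fragment. The element and attribute names (div, head, p, hi, placeName, ana) are taken from TEI, but the encoding choices, the "#war" theme label and the mapping table are assumptions made for this sketch, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

sample = """
<div type="article" ana="#war">
  <head>ART. I.</head>
  <p>The History of <placeName>Suli</placeName> and
     <placeName>Parga</placeName>. <hi rend="italic">Venice</hi>, 1815.</p>
</div>
""".strip()

# Illustrative mapping from element names to the classes listed above
CLASSES = {
    "div": "structural",
    "head": "structural",
    "p": "structural",
    "hi": "renditional",
    "placeName": "logical and semantic",
}

found = {}
for el in ET.fromstring(sample).iter():
    found.setdefault(CLASSES.get(el.tag, "other"), []).append(el.tag)
    # In this sketch an ana attribute carries the analytic (thematic) label
    if "ana" in el.attrib:
        found.setdefault("analytic", []).append(el.get("ana"))

for name, items in found.items():
    print(name, "->", items)
```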

What can you do with (TEI)? Examples

TEI-encoded editions

Ways of structuring data

Further reading
• Lou Burnard, What is the Text Encoding Initiative?, 2014. Available open access: http://books.openedition.org/oep/426
• With Criminal Intent project: http://criminalintent.org/
• Pechenick et al., 'Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution', PLoS One, October 2015: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0137041
• Ted Underwood, The Stone and the Shell: http://tedunderwood.com/

That’s all folks! Thanks for listening/participating, and enjoy the rest of the workshops!
