Shoebox – Starting out and lexical management

Shoebox / Toolbox • What is it? Shoebox is a data management program for language data. It is not a text editor, but nor is it a database management system in the sense usually understood (it is not an implementation of a relational database system). • Where do I get it? Shoebox: http://www.sil.org/computing/shoebox/index.html (cost: US$19.95). Toolbox: http://www.sil.org/computing/toolbox/ (freeware).

Shoebox and Toolbox • Which program / version should I use? If you use a Windows PC, then you should certainly use Toolbox – it has features which are not in Shoebox, such as better Unicode compliance, better XML export and (best of all) support for scrolling with a mouse. • If you work on a Mac, you don’t have a choice – Shoebox works on Mac, but Toolbox doesn’t (except under VirtualPC). And Shoebox runs under OS9. • The application is not officially available for Linux (or UNIX). But Shoebox and Toolbox will actually run under Linux with an add-on (ask Baden).

Why Shoebox? • Advantages • Good functionality • Choice of output possibilities • Portability – simple file formats • Drawbacks • Data input not always easy (but no other application is better!) • Manuals are not always easy to use • The way interlinear text is stored means you need to revise often • Everything is text, only weak data typing is possible

Installing • When you install Shoe/Toolbox (versions later than 4), the installation creates a folder called “My Shoebox Settings” on your C: drive. • By default, all Shoebox files will be stored here, but you can easily change the setting. • But you have to be careful moving things! • Look at the sample files for help.

Basic concepts • Project: A project is the work unit in Shoebox. A project file (.prj) is a shell file which holds information about what files are included in the project and what their properties are. • Database type: When you set up a project, you have to define the type of the database files which you want to use in that project – that is, you must specify what fields are included in the file and what the properties of the fields are. (Shortly, we will work through setting up a lexicon database.) • Language encoding: A crucial property which you have to set for each field in a database is the language encoding which will be used. A language encoding includes information about: • List of characters • Case pairs • Sort order • Onscreen presentation • Variables

Language Encoding - 1 • The Help files contain lots of useful information on how to set the properties of a Language Encoding, but nothing about how to choose the characters which you want to use! • You can do this in two ways: • Create a new encoding file, then use the Language Encoding dialogues to work through all the bits and pieces which you want to do. (tricky) • Create a new encoding file, then open it in a text editor and manipulate it there. (easier)

Language Encoding - 2 • Case pairs – you have to tell the program which pairs of symbols to treat as alphabetically equivalent, e.g. A = a for sorting and for parsing. • Sort order – you have to tell the program what order you want the alphabet to be in for sorting, e.g. if you use glottal stop, where should it be in the alphabet? • Fonts – you can specify the on-screen characteristics of each language encoding which you use. This is useful to make the screen easier to read. • You can also specify screen characteristics for fields when you define a database type, overriding or modifying the language settings • Neither of these options affects the presentation of data when you export to the Multi Dictionary Formatter (MDF) – MDF uses its own font settings regardless.
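The idea behind case pairs and a user-defined sort order can be sketched outside Shoebox. The Python below is only an illustration, not how Shoebox implements sorting: the alphabet (with glottal stop placed after "a") and the sample words are invented, and Python's `str.lower()` stands in for the case pairs.

```python
# Hypothetical alphabet for illustration: glottal stop sorts after "a".
ALPHABET = ["a", "ʔ", "b", "e", "i", "k", "m", "n", "o", "t", "u"]
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def sort_key(word):
    """Map a word to a list of alphabet positions, ignoring case."""
    # str.lower() plays the role of the case pairs (A = a).
    # Characters missing from the alphabet sort after all listed ones.
    return [RANK.get(ch, len(ALPHABET)) for ch in word.lower()]


def sort_words(words):
    """Sort words by the custom alphabet instead of code-point order."""
    return sorted(words, key=sort_key)


print(sort_words(["bata", "Ama", "ʔa", "ka"]))
```

A plain `sorted()` would place the glottal-stop word last, because its code point is higher than that of any ASCII letter. The explicit alphabet is what moves it to the position you want.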

Language Encoding - 3 • Variables – you have to specify which characters will be included in which sets of variables. The default groupings which are set in the program are: • Everything • Lower case • Upper case • Vowels • Consonants • Nasals • Punctuation • Digits • These variable definitions are used for wildcards in searches, and for specifying some morphological processes in parsing.

Language encoding and data input • Inputting non-ASCII characters is a problem! • One solution is to use a keyboard mapping utility – Tavultesoft Keyman is recommended for Shoebox • One option available in a language encoding is to associate a keyboard mapping with that language • Keyboard definitions are available for Keyman, but if what you want doesn’t exist, you have to make a definition yourself • An alternative, assuming you are using Unicode, is to use UniPad • A Unicode text editor • Keyboards can be made by dragging and dropping • Keyboards are both hard (you type) and soft (click on display on screen)

Database types - 1 • Relational database (e.g. Access) • One field (or a combination of fields) must have data and function as unique identifier • Every field specified in the definition occurs in every record • Every field specified in the definition occurs only once in each record • Non-relational database (Shoebox) • One field specified in the definition must occur in every record as unique identifier – the record marker • Other fields can occur many times in each record

Database types – Markers 1 • Shoebox database files are a special sort of text file: Standard Format Marker files • A new record starts with the occurrence of a record marker field • Each field has the structure: • Marker – ‘\’ character + identifying string • Text content – whatever is stored in the field • Return – indicates end of field • NB – database definitions and language encodings are also SFM files
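Because the files are plain text, the field structure described above can be read with very little code. The following Python is a minimal sketch, assuming one field per line and a marker separated from its content by a space. The function name `parse_sfm` and the sample data are invented for illustration, and lines without a marker are simply skipped.

```python
def parse_sfm(text, record_marker="lx"):
    """Split Standard Format Marker text into records.

    Each record is a list of (marker, content) pairs. A new record
    starts whenever the record marker occurs. Fields that appear
    before the first record marker are ignored in this sketch.
    """
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # blank or marker-less line: skipped here
        marker, _, content = line[1:].partition(" ")
        if marker == record_marker:
            records.append([])
        if records:
            records[-1].append((marker, content.strip()))
    return records


# Invented sample data: two records, each starting with \lx.
SAMPLE = r"""\lx kala
\ps n
\ge fish

\lx ama
\ps n
\ge father
"""

for record in parse_sfm(SAMPLE):
    print(record)
```

Keeping each record as a list of pairs, rather than a dictionary, preserves both the order of fields and repeated markers, which matters because a field may occur many times in one record.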

Database types – markers 2 • When you define a database type, you define a set of markers • First you must define the record marker • For a lexicon, the head word is a good choice, as this will provide the default sort order for the file • For each marker and its associated field, you can specify various properties.

Marker properties - 1 • Marker – from standard list for MDF, or mnemonic • Name – should be unambiguous, relates to marker • Hierarchy – more to follow on this • Following field – useful if one field will always occur with another one • Language encoding – ensures that needed characters are available for that field • Description – important documentation for other users (or you in a few years!) • Font – you can allow the default font settings which go with the language encoding, or you can override them

Marker properties - 2 • Although Shoebox doesn’t permit any strong data typing, you can do a little bit to make things more secure • You can specify that a field cannot be empty (other than the record marker, which must have data anyway) • You can specify that a field will not contain spaces • Range set – you can specify that a field will only contain one of a set of specified values, useful for e.g. part of speech, semantic domains

Database types – dates • Date stamping – you can include a date field (usually \dt) in your database and enable automatic date stamping • Date stamping happens on insertion of a record and then again whenever a record is edited – if you want to preserve the information about when you first entered a record, this has to be done manually • You have to create a date field before you can enable date stamping

Hierarchies • Hierarchies are used to create structure within records • This feature is especially valuable in a lexicon file which has sub-entries with multiple part of speech and gloss information • Hierarchies are defined for each field in the Markers window of the Database Type dialogue • The predefined MDF_4.0 database type has a complex hierarchy included • A properly defined hierarchy ensures that all relevant information is retrieved in sorts and filters i.e. glosses for all sub-entries rather than just the first gloss entry

Other database properties • There are plenty of other features which can be set for a database • Many of these are not so relevant to lexica – interlinear, jump path etc. • We will return to some of these this afternoon

MDF fields • The full definition of the MDF_4 database type has 103 fields specified • It is unlikely that you will want to use all of these! • There are three possible approaches: • Use the preset and just don’t bother about the fields you don’t use • Eliminate fields from the preset until you have what you want • Create a new database definition from scratch • We’ll work through option 1 here

Entries in the dictionary • The record marker for an MDF file is \lx – the lexeme • This can be a morpheme smaller than a word • Other forms can be included: • A citation form \lc • A phonetic form \ph • Alternative forms: to be listed in a dictionary, these are entered under \va; for interlinear use they are typically entered under \a, which is not a defined field in MDF_4

Sub-entries, sense numbers and homonyms • Homonyms should be used where forms are identical but there is no semantic relationship • Homonyms are identified only by a number in the field \hm • Sub-entries should be used where a word or phrase is derived from the root • Sub-entries are identified by numbers in the field \se • Where a form has multiple senses within the same part of speech, the senses are identified by a number in the field \sn

The hierarchy in entries • The hierarchical structure of entries set up by the various markers is:
Head item
  homonym1
    pos1
      sense1
      sense2
    pos2
      sense1
    subentry1
      pos1
        sense1
    subentry2
      pos1
        sense1
  homonym2
    pos1
      sense1
      sense2
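One way to see what the hierarchy does is to rebuild the nesting from the flat list of fields in a record. The Python below is a rough sketch and not Shoebox's own algorithm: it assumes a simplified depth ranking for five markers (the real MDF_4 hierarchy is far larger and is defined marker by marker), it leaves out the homonym level, and the sample record is invented.

```python
# Assumed, simplified depth ranking; not the full MDF_4 hierarchy.
LEVEL = {"lx": 0, "se": 1, "ps": 2, "sn": 3, "ge": 4}
DEEPEST = max(LEVEL.values()) + 1  # used for markers not listed above


def nest(fields):
    """Turn a flat list of (marker, content) pairs into a tree.

    Each field is attached under the nearest preceding field that
    has a lower level number.
    """
    root = {"marker": None, "content": None, "children": []}
    stack = [(-1, root)]
    for marker, content in fields:
        level = LEVEL.get(marker, DEEPEST)
        while stack[-1][0] >= level:
            stack.pop()
        node = {"marker": marker, "content": content, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


def outline(nodes, depth=0):
    """Render the tree as indented lines, one field per line."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + "\\" + node["marker"] + " " + node["content"])
        lines.extend(outline(node["children"], depth + 1))
    return lines


# Invented record: two senses for the head word, then one sub-entry.
FIELDS = [
    ("lx", "kala"), ("ps", "n"),
    ("sn", "1"), ("ge", "fish"),
    ("sn", "2"), ("ge", "seafood"),
    ("se", "kala-kala"), ("ps", "v"),
    ("sn", "1"), ("ge", "to fish"),
]

print("\n".join(outline(nest(FIELDS))))
```

With the nesting made explicit, each gloss belongs unambiguously to one sense, one part of speech and (where relevant) one sub-entry, which is the property that sorts and filters rely on.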

Word classes • As we just saw, word classes are very important in the hierarchical structure • The field used for this information is \ps • A field is also available for word class names in a second language \pn • The MDF format assumes that you will work with three (or four) languages: • A vernacular language (the object language) • A national language • An international language (probably English) • (a regional language can also be used) • The MDF_4 file recommends use of range sets for these word class fields – this is unrealistic at early stages, because you have to know a lot about a language before you are confident about listing word classes exhaustively

Glosses and definitions • Single word glosses for use in interlinears can be entered in English (\ge) and the national language (\gn) (\gr is also available) • More extended definitions can be entered in English (\de), national language (\dn) and the vernacular (\dv) (\dr is also available) • Encyclopedic information can be entered in all three (four) languages: \ee, \en, \ev, (\er)

Semantic information • MDF_4 offers a semantic domain field (\sd - English) and also a thesaurus field (\th - vernacular) • For both, use of a range set is recommended, but again this is unrealistic in the early stages of research • It is better to allow categories to be added freely until a good picture is obtained of the semantic domains needed, then move to restricting the possible entries

Examples • MDF_4 has five fields for including example phrases or sentences • \rf – to provide a reference to the example • \xv – vernacular text (i.e. the actual example) • \xe – English text, a free translation • \xn – national language text, a free translation • \xr – regional language text • As Shoebox is a non-relational database, it is possible to use each of these fields several times in one record – you can include as many examples as you like for each entry • There is a hierarchy here: • \rf is under sense number, and allows you to give a reference for each example • \xv is under \rf and over the other \x.. fields, ensuring that the translations for each example stay together

Notes • The MDF_4 definition specifies many fields for notes • All the fields which are defined will be exported in the MDF process • So maybe more important than the distinctions allowed in the preset is a distinction between information which will appear in the dictionary, and information which is for your use only • I recommend creating a notes field which isn’t part of the MDF presets! • \so – a field for the source of the data

Miscellaneous • \bw – borrowed word, for entering the source language • \cf – cross-reference, plus fields for glosses for the referenced item • \mr – morphology, for showing the internal structure of morphologically complex items (note that this may not be desirable for interlinear glossing!) • Various reversal fields – used in making finder lists, can be used if you don’t want the given gloss to be the reversal of an entry

Housekeeping • Date stamping is very valuable – for example it can be useful to be able to sort or filter entries by date • But as noted before, if you want to keep track of both the date of insertion of a record and the date of last edit, you will need two fields and you will have to manually enter the date in the first one • MDF_4 also allows a status field (\st) which is very useful for tracking whether an entry is complete and fully checked, whether it is in the last printed version of a dictionary etc.

Other stuff • Obviously we have only looked at a few of the fields which are defined in MDF_4 • It is worth looking through the entire list to see what could be relevant to your needs • Reversal fields are certainly worth investigating • But it is also worth remembering that you can achieve a lot with a reasonably small number of fields

Range sets • Range sets, as previously mentioned, are used to limit the values which can appear in a field • Often, it is not possible to specify a set of values when you start work on a language • When you have some data, Shoebox can automatically create a set of values for you from what is already entered • You must remember to check the “Use a Range Set” box in the Marker properties section of the Database Type dialogue
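The idea of deriving a range set from data that is already entered can be sketched in a few lines of Python. This is an illustration of the principle only, not Shoebox's built-in function. Records are assumed to be lists of (marker, content) pairs, and the function names and sample data are invented.

```python
from collections import Counter


def value_counts(records, marker):
    """Count how often each non-empty value occurs for one marker."""
    counts = Counter()
    for record in records:
        for m, content in record:
            if m == marker and content:
                counts[content] += 1
    return counts


def build_range_set(records, marker):
    """Return the distinct values for a marker as a sorted list."""
    return sorted(value_counts(records, marker))


# Invented sample data with one inconsistent value ("N" beside "n").
RECORDS = [
    [("lx", "kala"), ("ps", "n")],
    [("lx", "ama"), ("ps", "n")],
    [("lx", "tiki"), ("ps", "v")],
    [("lx", "oto"), ("ps", "N")],
    [("lx", "uma"), ("ps", "")],
]

print(build_range_set(RECORDS, "ps"))
print(value_counts(RECORDS, "ps"))
```

Looking at the counts before accepting the set is worthwhile: a value that occurs only once or twice is often a typing variant of a more frequent one, and it is easier to clean up before the range set is switched on.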

Consistency checks • Shoebox can perform some checking of data for you automatically • If you choose Consistency Check from the Tools menu • If you specify that data should be checked in an export process • When you move to a new record if you have Check Consistency When Editing enabled on the Tools menu • In any of these cases, the program will check: • That data matches any Data Property settings • That data matches any Range Set settings • That Jump Path destinations are valid links • It is valuable to constrain data as much as possible and to reduce the possibility for entering invalid data
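The first two kinds of check listed above (data properties and range sets) amount to simple tests on each field. The Python below is a hedged sketch of that logic. The rule format, the function name `check_record` and the sample data are all invented for illustration, and Jump Path checking is left out.

```python
def check_record(record, rules):
    """Return a list of problems found in one record.

    `record` is a list of (marker, content) pairs. `rules` maps a
    marker to a dict with optional keys "no_empty" (bool),
    "no_spaces" (bool) and "range_set" (a set of allowed values).
    """
    problems = []
    for marker, content in record:
        rule = rules.get(marker, {})
        if rule.get("no_empty") and not content:
            problems.append("\\%s: field is empty" % marker)
        if rule.get("no_spaces") and " " in content:
            problems.append("\\%s: contains a space: %r" % (marker, content))
        allowed = rule.get("range_set")
        if allowed is not None and content and content not in allowed:
            problems.append("\\%s: %r is not in the range set" % (marker, content))
    return problems


# Invented rules and records for illustration.
RULES = {
    "ps": {"no_empty": True, "range_set": {"n", "v"}},
    "hm": {"no_spaces": True},
}
GOOD = [("lx", "kala"), ("hm", "1"), ("ps", "n")]
BAD = [("lx", "ama"), ("hm", "1 2"), ("ps", "adj")]

print(check_record(GOOD, RULES))
for problem in check_record(BAD, RULES):
    print(problem)
```

The tighter the rules, the more problems surface at entry time rather than at export time, which is the reason for constraining data as far as the current state of the analysis allows.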

Export processes • The most important export process when working with a lexicon is the Multi Dictionary Formatter (MDF) • This powerful facility creates fully formatted dictionaries and finder lists from your lexicon file • The results are Rich Text Format files (.rtf) which can be opened and manipulated in most word processing packages (such as Word)

MDF basics • You can choose to export your data to a bilingual dictionary or a trilingual dictionary • If bilingual, you can choose whether the second language is English or the relevant national language • If trilingual, English and the national language are used • (Regional language apparently vanishes at this point)

Other options • Data can be filtered (i.e. only entries which correspond to some criteria are included) • Fields can be excluded • Some formatting can be controlled: • Header and footer material • Total number of entries is printed • Output file can be .rtf or web pages (HTML)

Other export possibilities • You can export all your data as a document in .rtf format, or a text format which Shoebox describes as ‘standard’ • In these exports, you can export the records in the current window, or all records • You can define other export processes for yourself – if you feel brave!

Lexique Pro • Lexique Pro is a freeware tool distributed by SIL via www.lexiquepro.com • It is intended to produce versions of lexica for distribution to people who are not Shoebox users • The program makes a version which is well-formatted for on-screen viewing • It also makes an executable file (.exe) to distribute the lexicon to other people – this will install a run-time version of Lexique Pro and the database extracted from your lexicon onto another person’s computer • You can also export your lexicon as web pages
