Words (etc.)

This text explores the vague notion of "words" and the challenges they pose in language processing. It discusses differences between words, special terminology, and the arbitrariness of sense differences.


  1. Words (etc.) John Barnden School of Computer Science University of Birmingham Natural Language Processing 1 2015/16 Semester 2

  2. Some Problems • Intro Exercise-Set B suggests that our intuitive notion of a “word” is vague, and inadequate even for informal purposes. • E.g., we think of sentences as formed of words, but probably wouldn’t think of “£12,000” as a word, or of “VAT” as a proper word (except when it means a large container, of course), but would nevertheless agree that “Mary bought the car for £12,000 plus VAT” was a properly formed sentence! • There are many sorts of abbreviation, acronym and special symbol in written language, and the relationship of such special units to spoken language is diverse. • And what about punctuation marks – are they words? They carry meaning, after all. • And what about things like “um”, “ah”, “ow”, “owwww” (in speech or text)? • New words are created every day. A new one for me in 2011: “treggings” at Marks and Spencers. • Example of a “portmanteau” word, formed from melding other words – here “trousers” and “leggings”.
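
To make the problem concrete, here is a minimal sketch of a regular-expression tokenizer in Python. The pattern and the name `tokenize` are illustrative assumptions, not part of the lecture; the sketch simply decides that a currency amount such as “£12,000” is one unit and that each punctuation mark is a unit of its own, which is one of many defensible choices.

```python
import re

# Hypothetical token pattern: alternatives are tried left to right, so
# currency amounts are matched before ordinary words.
TOKEN_PATTERN = re.compile(
    r"""
    [£$€]\d[\d,]*(?:\.\d+)?     # currency amounts such as £12,000
    | \w+(?:['’-]\w+)*          # words, acronyms, plain numbers, "owwww"
    | [^\w\s]                   # any other single symbol or punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into units under the (debatable) choices made above."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Mary bought the car for £12,000 plus VAT."))
# ['Mary', 'bought', 'the', 'car', 'for', '£12,000', 'plus', 'VAT', '.']
```

Whether the final full stop, “VAT” or “£12,000” deserves to be called a word is exactly the question the slide raises; the code only shows that some decision has to be made.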

  3. Differences between Words • In language study, differences of meaning or sound or spelling sometimes are, and sometimes aren’t, taken to indicate different words. • “Present”[noun:=gift] and “present”[verb as in: present a proposal] are typically regarded as different words though spelled the same. (Same spelling, different meaning and sound.) • “Bank”[noun:financial] and “bank”[noun:of a river] may be taken to be different words, but may instead be regarded as one word with a variable meaning. (Same spelling and sound, different meaning.) NB: there’s also one or more verbs spelled “bank,” including one with a financial sense. • “Patent”[noun:legal doc.] can be pronounced in two different ways, but both are typically taken to be versions of just one word. (Same meaning and spelling, different sound.) • “Realize” and “realise”: typically regarded as alternative spellings of the same word. (Same meaning and sound, different spelling.)

  4. Special Terminology • Language studiers have introduced specialized terms to try to make things more precise, but still there’s some looseness and variability in the terminology. • Some terms (see Appendix at end of these slides for more detail): • Homographs, homophones, homonyms • Lemmas • Citation forms • Wordforms, lexical forms • Lexemes • Lexical items

  5. Terminology, contd • A lexical form (also: wordform) is a particular written string or spoken sound that would be regarded as a word of the language. • [that loosely follows J&M p.120] • So all occurrences of the written string “presents” would be occurrences of the same written lexical form, irrespective of sound or meaning. • (All occurrences of spoken items that sound like “to” would be of the same spoken lexical form, irrespective of spelling or meaning.) • A lemma [J&M p.645] is a particular lexical form used as a sort of standard or basic version of some lexical forms that are just close variants of the same meaning (different number, different tense, etc.). • Thus “carpet” and “carpets” are different lexical forms but have the same lemma, “carpet”. • The lemma for “sing”, “sang” and “sung” is “sing”.
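
The relationship between lexical forms and lemmas can be sketched as a lookup table. This is only an illustration under stated assumptions: the table `LEMMA_OF` is hand-built from the slide's own examples, and the fallback of treating an unknown form as its own lemma is a simplification introduced here.

```python
# Hand-built table taken from the examples on the slide.
LEMMA_OF = {
    "carpet": "carpet",
    "carpets": "carpet",
    "sing": "sing",
    "sang": "sing",
    "sung": "sing",
}


def lemma(form):
    """Return the lemma of a written lexical form (itself if unknown)."""
    form = form.lower()
    return LEMMA_OF.get(form, form)


def group_by_lemma(forms):
    """Group lexical forms under their shared lemma, keeping input order."""
    groups = {}
    for form in forms:
        groups.setdefault(lemma(form), []).append(form)
    return groups


print(group_by_lemma(["carpet", "carpets", "sing", "sang", "sung"]))
# {'carpet': ['carpet', 'carpets'], 'sing': ['sing', 'sang', 'sung']}
```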

  6. Different Sorts of Meaning-Difference • Ambiguity (or homonymy according to some definitions): where a particular lexical form has a variety of senses. • A special case is polysemy, where different senses for a lexical form are related in some way. • E.g., “bank” can mean a financial institution or a building serving customers of it. • A “newspaper” can mean an institution or a particular physical object produced by that institution. • Exercise: “window” • Some people restrict “ambiguity” (and/or “homonymy” ) to apply only to those cases of different-sense that aren’t cases of polysemy.

  7. Arbitrariness of Sense Differences • The number of senses a lexical form has, and what they are, is in large part a matter of choice and convenience for particular purposes. • Different dictionaries, NLP systems, etc. divide up senses differently. • Consider the verb “cut”, as applied to physical objects. Cutting proceeds significantly differently according to the type of object (cake, grass, meat, hair, ...). Do these correspond to different senses of “cut”?? • Consider “cut” as applied to government expenditure. Does this involve a different sense of cut – or is there just one very generalized sense that applies to expenditure and hair and grass and ... • To what extent do lexical forms have clearly identifiable senses at all? I.e., perhaps the sense in action at a given point is often at least partly affected by the unique situation being talked about? Two uses of “cut” hardly ever have the same meaning????

  8. Special Word Classifications • Words are classified in a variety of ways. • We’ll look at a few main classifications: • into “parts of speech” (such as nouns and verbs; also often called “lexical categories” or “word classes”, though these terms are more ambiguous). • into “proper nouns” (David) and “common nouns” (car). • into “open class” or “closed class”. • Those classifications are of particular importance to NLP.

  9. Why Important? • Parts of speech (POSs) [NB: about text as well as speech!] • constitute a basic level of grammatical analysis • help with more complex grammatical analysis of utterances • are useful by themselves in specialized practical tasks such as “information extraction” and “named entity recognition” • are useful in more academic practical tasks such as searching corpora (large bodies of recorded language) for examples of desired types: e.g. Can search for all examples of the word “spaceship” preceded by an adjective and followed by a verb. • Proper nouns • Are important in practical tasks, e.g. named entity recognition and document summarization. • Closed class words • are important in grammatical analysis of utterances and signalling some particular phenomena in some tasks • But by contrast may need to be suppressed in some types of task.
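
The corpus-search use of POSs can be sketched in a few lines of Python. Everything here is an assumption for illustration: the three hand-tagged sentences, the tag names (`DET`, `ADJ`, `NOUN`, `VERB`, `PREP`) and the function `find_pattern`; a real corpus would come with its own tagset.

```python
# A tiny hand-tagged "corpus": each sentence is a list of (word, tag) pairs.
TAGGED_SENTENCES = [
    [("the", "DET"), ("huge", "ADJ"), ("spaceship", "NOUN"), ("landed", "VERB")],
    [("a", "DET"), ("spaceship", "NOUN"), ("landed", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("spaceship", "NOUN"), ("in", "PREP"), ("orbit", "NOUN")],
]


def find_pattern(sentences, word, before_tag, after_tag):
    """Find occurrences of `word` preceded and followed by the given tags."""
    hits = []
    for sentence in sentences:
        # Only positions that have both a left and a right neighbour.
        for i in range(1, len(sentence) - 1):
            prev_tag = sentence[i - 1][1]
            token = sentence[i][0]
            next_tag = sentence[i + 1][1]
            if token.lower() == word and prev_tag == before_tag and next_tag == after_tag:
                hits.append(" ".join(w for w, _ in sentence[i - 1:i + 2]))
    return hits


print(find_pattern(TAGGED_SENTENCES, "spaceship", "ADJ", "VERB"))
# ['huge spaceship landed']
```

The search relies entirely on the tags, so it is only as good as the POS tagging that produced them.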

  10. Parts of Speech (POSs) • Lexical forms are traditionally put into categories such as noun, pronoun, determiner, article, verb, adjective, adverb, preposition, particle, conjunction, interjection. E.g.: • Determiner: e.g.: the, a, this, ... and possibly: every, which • Article: the, a, an • Particle: e.g.: up, off, in, at, by when closely tied into a phrasal verb: e.g. “take up”. • Preposition: e.g.: up, off, in, at, by in freer uses • Conjunction: e.g.: and, or, but, since, if, when, because • Interjection: e.g.: hello, wow • Many lexical forms have more than one POS (consider love, three, off, that, kill) • Modern linguistic theories may propose more categories, from 12 upwards. • NLP systems typically use many more categories: from mid-30s (e.g. within the set of 45 “tags” in the Penn Treebank tagset [textbook]) up to 140+ • See various lists of POSs and related “tagsets” in sections 5.1-5.3 of textbook.
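
The point that one lexical form can carry several POSs can be sketched as a small lexicon mapping forms to sets of categories. The particular category assignments below are assumptions made for illustration (different schemes would assign them differently), and `ambiguous_forms` is a hypothetical helper.

```python
# Illustrative lexicon: each written form maps to the POSs it can take.
# The assignments are one plausible analysis, not an authoritative one.
POS_LEXICON = {
    "love": {"noun", "verb"},
    "kill": {"noun", "verb"},
    "three": {"determiner", "noun"},
    "off": {"preposition", "particle", "adverb", "adjective"},
    "that": {"determiner", "pronoun", "conjunction", "adverb"},
    "carrot": {"noun"},
}


def ambiguous_forms(lexicon):
    """Return, alphabetically, the forms that have more than one POS."""
    return sorted(form for form, tags in lexicon.items() if len(tags) > 1)


print(ambiguous_forms(POS_LEXICON))
# ['kill', 'love', 'off', 'that', 'three']
```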

  11. How are POSs Defined? • With great difficulty! • One try: use (partly-)conceptual criteria as in: • Noun: denotes an entity or entity concept (car, snow, love, Tony Blair, Santa Claus, ...) • Verb: means an action, state, relationship, etc. (to push, to sleep, to love, to be, ...) • Adjective: adjusts an entity concept denoted by a nearby noun (red, sad, fake, ...) • Adverb: qualifies some event or state denoted by a nearby verb, adjective, adverb, clause, etc. (boldly, tomorrow, here, loosely, very(?), ...) • Determiner: specifies what specific entities (denoted via other words), if any, are being talked about by a nearby noun • NB: the qualifications above about what POSs are nearby a word are my own. • Such criteria are often mentioned, but are problematic. • “destruction” is a noun but refers to an action • auxiliary verbs, such as “have” in combination, have a special function • “my”: sometimes classed as adjective, sometimes as determiner.

  12. How are POSs Defined?, contd. • Something else that may help: morphological data, as in: • Certain lexical forms clump together like this: • carrot, carrots, carrot’s, carrots’ • man, men, man’s, men’s • Certain other lexical forms pattern like this: • criticize, criticizes, criticized, criticized, criticizing • sing, sings, sang, sung, singing • So we can postulate two different classes and call them nouns and verbs. • The distinction being drawn is: • one class has a singular/plural dimension and a possessive/non-possessive dimension; • the other has a singular/plural dimension, a tense dimension, and ... • In English, this sort of approach doesn’t extend very well to other words, which largely don’t inflect (change shape) etc.
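
The morphological idea can be sketched as follows: look at which suffixed variants of a base form are attested in a vocabulary and guess a class from that pattern. This is a rough sketch under strong assumptions: only regular suffixes are handled (so “men” and “sang” would be missed), the spelling rule in `attach` covers just the final-“e” case, and the names `attach`, `paradigm_signature` and `guess_class` are invented here.

```python
SUFFIXES = ["s", "'s", "ed", "ing"]


def attach(base, suffix):
    """Attach a suffix, dropping a final 'e' before 'ed' or 'ing'."""
    if base.endswith("e") and suffix in ("ed", "ing"):
        return base[:-1] + suffix
    return base + suffix


def paradigm_signature(base, vocabulary):
    """Return the set of suffixes whose suffixed form is attested."""
    return frozenset(s for s in SUFFIXES if attach(base, s) in vocabulary)


def guess_class(base, vocabulary):
    """Guess 'verb-like' or 'noun-like' from the attested suffix pattern."""
    signature = paradigm_signature(base, vocabulary)
    if "ed" in signature or "ing" in signature:
        return "verb-like"      # has a tense dimension
    if "'s" in signature:
        return "noun-like"      # has a possessive dimension
    return "unknown"


VOCABULARY = {
    "carrot", "carrots", "carrot's", "carrots'",
    "criticize", "criticizes", "criticized", "criticizing",
    "big",
}

for base in ["carrot", "criticize", "big"]:
    print(base, "->", guess_class(base, VOCABULARY))
# carrot -> noun-like
# criticize -> verb-like
# big -> unknown
```

The “unknown” result for “big” mirrors the slide's last point: in English, words outside the noun and verb classes largely do not inflect, so morphology alone says little about them.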

  13. How are POSs Defined?, contd. • What may be more helpful is “distributive” data – i.e. data about how words go together in meaningful linguistic expressions, as in: • Only certain words can follow “the” or “a[n]”; of these, some must usually be followed by other words to be considered a meaningful unit; others don’t need to be. • So we can have: the/a car, the/a big car • But not usually: the/a big. • And car and big act differently after forms of “be”: • We can have : the thing is big but not the thing is car. • Certain words such as is and hated need words around them to make sense. • This sort of data (together perhaps with the above conceptual and morphological data) may make it useful to divide words into nouns, adjectives, determiners, etc. • Having done this, we may find it relatively easy to specify in general a grammar, i.e. a description of the strings of lexical forms that are allowed in the language, based on their assigned classes.
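
Once words have classes, a grammar can be stated over the classes rather than over individual words. The sketch below is a toy under stated assumptions: the six-word `LEXICON`, the tag names, and the two patterns (a noun phrase as determiner, optional adjectives, noun; a copular sentence as noun phrase, form of “be”, adjective) are all invented for illustration and cover only the slide's examples.

```python
import re

# Toy lexicon assigning one class to each word (an assumption; real
# words are often ambiguous between classes).
LEXICON = {
    "the": "DET", "a": "DET",
    "big": "ADJ",
    "car": "NOUN", "thing": "NOUN",
    "is": "BE",
}

NOUN_PHRASE = r"DET (?:ADJ )*NOUN"

# The grammar is written over class names, not over the words themselves.
PATTERNS = {
    "noun phrase": re.compile(NOUN_PHRASE),
    "copular sentence": re.compile(NOUN_PHRASE + r" BE ADJ"),
}


def tag_sequence(phrase):
    """Map a phrase to its space-separated class sequence."""
    return " ".join(LEXICON[word] for word in phrase.lower().split())


def allowed(phrase, kind):
    """Check whether the phrase's class sequence fits the named pattern."""
    return PATTERNS[kind].fullmatch(tag_sequence(phrase)) is not None


print(allowed("the big car", "noun phrase"))            # True
print(allowed("the big", "noun phrase"))                # False
print(allowed("the thing is big", "copular sentence"))  # True
print(allowed("the thing is car", "copular sentence"))  # False
```

A word missing from `LEXICON` raises a `KeyError` here; a fuller sketch would need some policy for unknown words.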

  14. [How are POSs Defined?, contd.] • The classification of some words in a given language is contentious, and differs between different schemes (see textbook). • The way words are classified in one language may not work well for another relatively distant one. We shouldn’t expect the notion of noun, verb, adverb, etc. familiar from one language to correspond in a simple way to categories in another, distant language. • However, even languages like English and Japanese can be given at least roughly the same (main) POSs, even though there’s a lot of detailed difference in the morphological and distributive data. • So perhaps one test of a classification scheme is how well it survives across languages.

  15. Proper and Common Nouns • Proper nouns are, roughly speaking, those that refer to specific (though not necessarily real) entities of certain types in specific contexts: • David Cameron, David, Santa Claus, The Guardian, Love Actually [a film title], Edgbaston, University of Birmingham, Department of the Environment, School of Psychology, Prolog, English(?) • NB: there are many Davids; many universities may have a School of Psychology: the “specificity” is only within particular contexts, possibly fleeting. • Common nouns are the remaining nouns: • car, carrot, bandwidth, fifteen [in some uses], relationship, baking [as in the baking of the cake] • Not clear whether the following should count as proper nouns: • Christianity, Islam, Act V, January, Tuesday [when used to refer to a day, not an actress!] • Not clear why nouns like fifteen and love aren’t considered to be proper nouns: they can be considered to refer to specific entities, after all.

  16. Proper and Common Nouns, contd. • In English, proper nouns are usually spelled with an initial capital letter, if a single word, or at least have the main words within them initially-capitalized • But sometimes not, for special effect (advertisements) or choice (poet e.e. cummings) • But (in English) not all words spelled with a capital letter are proper nouns: they may be common nouns or not nouns at all: • Blairite [can be noun or adjective] • Englishwoman [noun], English [can be adjective (a “proper adjective”)] • Islamic [not noun – a proper adjective] • I [pronoun] • Birminghamize [not noun – a “proper verb”?] • Many acronyms and abbreviations: NB, PS, NLP, CS, ABM, VD, DVD [most are nouns] • Anyway, the capitalization semi-criterion breaks down in many other languages (incl. ones close to English such as German) and in English sentence starters, document titles and section headings, headlines, etc. • Many proper nouns are spelled the same as common nouns or other words (apart from capitalization): Peter, Blacksmith.
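
The weakness of the capitalization semi-criterion is easy to demonstrate. The following is a deliberately naive sketch, with the function name `proper_noun_candidates` and the example sentence invented for illustration: it flags every capitalized token except the sentence-initial one.

```python
def proper_noun_candidates(tokens):
    """Naive heuristic: capitalized tokens that are not sentence-initial."""
    return [tok for i, tok in enumerate(tokens) if i > 0 and tok[:1].isupper()]


sentence = "Yesterday I met an Englishwoman who studies NLP in Birmingham"
print(proper_noun_candidates(sentence.split()))
# ['I', 'Englishwoman', 'NLP', 'Birmingham']
```

Of the four candidates only “Birmingham” is a proper noun: “I” is a pronoun, “Englishwoman” a common noun and “NLP” an acronym. The heuristic would also miss a proper noun that happens to start the sentence, which is the sentence-starter problem noted above.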

  17. Closed Classes of Words • These are classes whose membership is largely or completely fixed, and either relatively very small or very rule-prescribed. E.g.: • Prepositions • Particles • Determiners • Conjunctions • Pronouns • Auxiliary verbs: e.g. can, should, may, be, have, do • Basic degree modifiers: e.g.: very, quite, more, too • [when followed by an adjective/adverb] • Such words are usually regarded as “function words”: they have special roles in grammar, as arguably in all the above cases, in varying degrees. (But NB the degree modifier class in general is fairly open, and only the basic members might be regarded as function words.) • Not completely fixed membership: e.g.: • ordinary language evolution • differences across dialects, slangs, etc. (“in back of” in Amer. Eng., not Brit.)

  18. Open Classes of Words • Classes to which new members can be freely added, and often are. • Notably: • Nouns • Non-auxiliary Verbs • Adjectives • Adverbs • Interjections. • Some fairly recent examples: • tweet [verb & noun, related to Twitter communications: NB the lexical form existed before, both as verb and noun], • treggings, globish [a newly arising global form of English], mobile [noun for phone], Blairite • remote [noun, short for remote control] • Newly invented proper nouns (or common nouns conscripted): Johnathan [new to me anyway], the Gherkin [a building in London], Agatha Mabel Barnden

  19. A Difficult Case: Numerals • Written-out numerals: e.g.: one, seven, first, seventh, thirty-nine, thousand, millionth, dozen, score, triple, quarter?, fourfold?, twice? • Are they open or closed class? • Membership seems very fixed and rule-prescribed. • But: • There are lots of numerals – more than in other closed classes • and infinitely many if allow strings such as “1053” and “MCMLVII” as numerals • and perhaps for other reasons: “thousand-and-fifty-three-fold”. • Invention of things such as zillion, squillion and nth, ith, jth
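
The sense in which numerals are “rule-prescribed” yet numerous can be sketched with a small generator: a fixed stock of about thirty words plus a few combination rules yields thousands of distinct written-out numerals. The function `spell`, its 1 to 9999 range and its British-style “and” are choices made for this illustration only.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell(n):
    """Spell out an integer from 1 to 9999 in British style."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only covers 1..9999")
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, below = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if below:
        if below < 20:
            small = ONES[below]
        else:
            tens, ones = divmod(below, 10)
            small = TENS[tens] + ("-" + ONES[ones] if ones else "")
        # British usage inserts "and" after a thousands or hundreds part.
        parts.append("and " + small if parts else small)
    return " ".join(parts)


print(spell(39))    # thirty-nine
print(spell(1053))  # one thousand and fifty-three
print(len({spell(n) for n in range(1, 10000)}))  # 9999
```

Extending the rules to millions and beyond adds only a handful of new base words, which is why the class looks closed in its vocabulary but open-ended in its membership.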

  20. What Now? • Next, we’ll talk about “morphology”, helped by our new knowledge of words and classes of them. • That will involve the question of how to compute the morphology of words. • After that we’ll be in a position to look at “POS tagging”: actually computing the POSs of words in discourse (and then annotating the words with their POSs). • To some extent at least POS tagging also includes, or can include, finding proper nouns, and intrinsically includes making the open/closed-class distinction and finding function words. • Exercise: What’s the written plural of “POS” and how do you pronounce it? • POSs? POSes? PsOS? POS? [the last seems to be preferred currently]

  21. Appendix: More Detail on Some Matters

  22. Word Separation • Words are typically not separated from each other in speech, and can subtly affect each other’s sound. So separation of speech into words is itself a somewhat theoretical (if commonsensically natural) act. • In some languages (e.g., Chinese, Japanese, old Latin) words are not (or not always) separated in writing.

  23. More Terminology • A lexical item or lexical entry is often used to mean the (main) items listed in a dictionary or lexicon (lexicon = database of words in, e.g., an AI system) and to which meanings are given by the dictionary or lexicon. • So lexical items are typically lemmas in effect, giving meanings for lexical forms. • But note that dictionaries often list irregular inflected forms separately, e.g. “sung” (as past participle of “sing”). • An item in a dictionary can be a phrase rather than a single word.

  24. Homographs, etc. • According to J&M (pp.290, 646, 648): • “Homographs”: words with the same spelling but different sound, such as “live”[verb] and “live”[adjective]. [I think J&M also mean that the words have different meanings, excluding the cases like “patent” above.] • “Homophones”: words with the same sound but different spelling, such as “to”, “too” and “two”. [I think J&M also mean that the words have different meanings, excluding the cases like “realise/realize” above.] And note that “to” has more than one meaning. • “Homonyms”: different word senses (meanings) that are of words with the same spelling and sound, as in “bank”. • BUT: Other academics, and dictionaries, may define that terminology somewhat differently. • Webster’s Third New International defines “homonym” to mean various things, none the same as J&M’s definition!! One (!!) of the meanings is: one of two or more words spelled and pronounced alike but different in meaning. And “homonym” can also mean the same as “homograph” or “homophone”!!
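
The three definitions, as read on this slide, amount to a decision over three yes/no questions: same spelling, same sound, same meaning. The sketch below encodes that reading only. The `Word` record, the `relation` function and the informal pronunciation labels (such as "liv" and "lyve") are assumptions made for illustration, and other definitions of these terms would give different answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    spelling: str
    sound: str   # informal pronunciation label, not real phonetics
    sense: str


def relation(a, b):
    """Classify a pair of words under the slide's reading of J&M."""
    if a.sense == b.sense:
        return "same sense"   # e.g. spelling or pronunciation variants
    same_spelling = a.spelling == b.spelling
    same_sound = a.sound == b.sound
    if same_spelling and same_sound:
        return "homonyms"
    if same_spelling:
        return "homographs"
    if same_sound:
        return "homophones"
    return "unrelated"


live_verb = Word("live", "liv", "be alive")
live_adj = Word("live", "lyve", "not recorded")
to_prep = Word("to", "too", "towards")
two_num = Word("two", "too", "the number 2")
bank_money = Word("bank", "bank", "financial institution")
bank_river = Word("bank", "bank", "side of a river")
realise = Word("realise", "realize", "become aware")
realize = Word("realize", "realize", "become aware")

print(relation(live_verb, live_adj))     # homographs
print(relation(to_prep, two_num))        # homophones
print(relation(bank_money, bank_river))  # homonyms
print(relation(realise, realize))        # same sense
```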

  25. More Terminology • A wordform (also: lexical form) [loosely following J&M p.120] is a particular written string or spoken sound that would be regarded as a word of the language. • So all occurrences of the written string “presents” would be occurrences of the same written wordform, irrespective of sound or meaning. All occurrences of the spoken item that sounds like “to” would be of the same spoken wordform, irrespective of spelling or meaning. • I’ll use lexical form in preference to wordform to emphasize inclusion of special units such as an abbreviation, acronym, or numeral. I’ll mainly be concerned with written forms. • A lexeme [J&M p.645] is a lexical form (spoken or written) together with a particular sense (meaning) for it. • A lemma or citation form [J&M p.645] is a particular lexical form used as a sort of standard or basic version of the wordform in a lexeme. Thus “carpet” and “carpets” are in different lexemes and lexical forms but have the same lemma, “carpet”. The lemma for “sing”, “sang” and “sung” is “sing”. • Caution: J&M note on p.646 that “lemma” is sometimes used to mean the sense part of a lexeme. Also, they themselves give a definition significantly different from the above on p.120!!

  26. More Terminology, contd. • A lexical item or lexical entry is often used to mean the (main) items listed in a dictionary or lexicon (lexicon = database of words in, e.g., an AI system) and to which meanings are given by the dictionary or lexicon. • So lexical items are typically citation forms. • But note that dictionaries often list irregular inflected forms separately, e.g. “sung” (as past participle of “sing”). • An item in a dictionary can be a phrase rather than a single word. • “Lexical item” sometimes means the same as my “lexical form”.
