
Interpretation and fault-tolerant identification of relationship data

Holger Wandt. Colloquium Taal en Spraak, KU Nijmegen, Wednesday 3 March 2004.

Presentation Transcript


  1. Interpretation and fault-tolerant identification of relationship data Holger Wandt Colloquium Taal en Spraak KU Nijmegen Wednesday 3 March 2004

  2. Overview • The use of knowledge tables • Relationship data: segmentation, storage • Attributes • Statistics • Rules • A closer look • How do we use the knowledge and the rules in interpretation? • The Rolodex-demo

  3. ANK Engineering Ltd. Appleford

  4. Monsieur e/o Madame Durand

  5. Int. Transp. Ond. Joh. Tilburg Hardinxv./Giessend. e/o

  6. Fysiotherapeutisch Centrum
     Arie en Jolanda Kruizenga
     Intake Unit 1

  7. Dr. John Park jr. BA, MR EconS, MKM

  8. Siemens ElectroCom GmbH & Co.
     Postdienstautomatisierung und Technologieentwicklung

  9. DE POST
     c/o mevrouw A. Vanderwalle-Van Damme
     Industrieel Ingenieur Logistiek

  10. RegTP, Regulierungsbehörde für Telekommunikation und Post

  11. CQCS International Consulting

  12. Chowhounds Delight
      Restaurant & Bar
      Attn: John Peter Arnold

  13. Eerste Roelofarendveense Papierfabriek Anno 1931 NV
      h.o.d.n. “Papier Hier”

  14. NATIONALE SOCIALE VERZEKINGSKAS VOOR MIDDENSTAND EN BEROEPEN SUKKURSALE BRUGGE V.Z.W. / A.S.B.L.

  15. Suomen Posti OY
      Tuotteet/ Mediapalvelut/ Osoitepalvelut

  16. Let’s summarize…. • Surnames • Given names • Forms of address • Titles • Prefixes/infixes and prepositions/articles • Additions • Professions • Geographical items • Legal forms • Company words • Divisions • Company names • Ordinals

  17. Relationship data • LCR manages and maintains 3 knowledge databases for each country: • 1stbase • Fambase • DicMan • LCR manages and maintains country specific synonym tables

  18. Storage of relationship data • Segmentation (define groups of data) • Attributes of groups • Attributes of particular items • Link between items (abbreviation, plural, etc.)

  19. STATISTICS

      Category                                     BE       DE       NL
      Surnames                                 337410  1006097   277312
      Given names                               20618    22425    25569
      FoA (forms of address)                      269      131      136
      Titles                                      284     1739      279
      Prefix/Infix & articles/prepositions        654      664      498
      Additions                                   324      192      143
      Professions                                 968     2792      355
      Geogr. items                              12416    32248    18611
      Legal forms                                 236     1835      138
      Company words                             20467     8121     5920
      Divisions                                   172      160       90
      Company names                              1967     1504      684
      Ordinals                                    421      293       71

  20. General and country specific rules • Capitalization • Punctuation • Word break • Abbreviation

  21. Capitalization • Belgium: • Flemish: Karin Van der Ploeg • Walloon: Henri de La Censerie • Germany: • E.v. Buskirk KG • Verband der Chemischen Industrie e.V. • Netherlands: • Puffelen r.a., Victor van • Puffelen RA, de heer Van

  22. Punctuation • Mr Theodor St.John • mr. Olaf Oudendijk • Martin Klaus Lehmann • Martin, Klaus & Lehmann • HA.DI.WE. Inh: Hans-Dieter Weber • Don Quichotte N.V./S.A. • Don Quichotte NV/SA

  23. Epitaph
      Here lies my beloved wife Christine
      In heaven she is not in hell I know
      It’s written for everyone to be seen

  24. Word break
      J.P.L. den He-
      yer
      Groepsex-
      cursies
      General and country specific rules: • In NL: ma-chi-nes • In GB: ma-chines NEVER: mac-hines

  25. Abbreviation General rule for BE, DE and NL: No word may be abbreviated further than its first vowel-consonant (VC) group or its first consonant-vowel-consonant (CVC) group. Abbreviation – abbrev. – abbr. Consonant – conson. – cons. There are country specific abbreviations: Ges.m. beschränkt. Haft. / Handelsmij./ Stnrs. / R.P. and RR.PP. But beware of the Hotel Association Française
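
The general rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a "group" is read as a run of consecutive vowels or consonants (which is what the examples abbr. and cons. suggest), `y` is counted as a vowel, and the function names `minimal_abbreviation` and `is_valid_abbreviation` are hypothetical, not taken from the talk. Country specific abbreviations such as Stnrs. or R.P. do not follow the rule and would need a separate exception table.

```python
VOWELS = set("aeiouy")  # assumption: 'y' is treated as a vowel


def minimal_abbreviation(word: str) -> str:
    """Return the shortest prefix the VC/CVC rule allows for `word`."""
    lower = word.lower()
    i, n = 0, len(lower)
    # Optional leading consonant group (present for CVC words, empty for VC words).
    while i < n and lower[i] not in VOWELS:
        i += 1
    # The vowel group.
    while i < n and lower[i] in VOWELS:
        i += 1
    # The closing consonant group.
    while i < n and lower[i] not in VOWELS:
        i += 1
    return word[:i]


def is_valid_abbreviation(abbrev: str, word: str) -> bool:
    """True if `abbrev` is a prefix of `word` that is not shorter than the minimum."""
    stem = abbrev.rstrip(".").lower()
    return word.lower().startswith(stem) and len(stem) >= len(minimal_abbreviation(word))


print(minimal_abbreviation("Abbreviation"))            # Abbr
print(minimal_abbreviation("Consonant"))               # Cons
print(is_valid_abbreviation("ab.", "Abbreviation"))    # False
```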

  26. A closer look: Family names • Prefixes • Names consisting of several parts • Names with a foreign language attribute • Diacritic symbols

  27. Prefixes • In NL separation of prefix and family name is necessary for sorting purposes • In the Human Inference databases: • 22,000 family names with prefix in BE • 15,000 family names with prefix in DE • 30,000 family names with prefix in NL • Validation of names: Le Galloudec, but not Galloudec
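
Why the separation matters for sorting can be shown with a small Python sketch. The prefix list below is a tiny hypothetical excerpt, and `split_prefix` and `sort_key` are illustrative names; the real knowledge tables are far larger and, as the Le Galloudec example shows, must also know which names are only valid with their prefix.

```python
# Hypothetical excerpt of a Dutch prefix table (not the real knowledge base).
PREFIXES = ["van der", "van den", "van", "de", "den", "ter"]


def split_prefix(family_name: str) -> tuple[str, str]:
    """Split 'van der Ploeg' into ('van der', 'Ploeg'); no prefix gives ('', name)."""
    lower = family_name.lower()
    # Try the longest prefixes first so 'van der' wins over 'van'.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if lower.startswith(prefix + " "):
            return family_name[: len(prefix)], family_name[len(prefix) + 1 :]
    return "", family_name


def sort_key(family_name: str) -> tuple[str, str]:
    """Sort on the core name first and only then on the prefix."""
    prefix, core = split_prefix(family_name)
    return core.lower(), prefix.lower()


names = ["van Puffelen", "de Boer", "Oudendijk", "van der Ploeg"]
print(sorted(names, key=sort_key))
# ['de Boer', 'Oudendijk', 'van der Ploeg', 'van Puffelen']
```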

  28. Names consisting of several parts • Double-barrelled names with and without hyphen: Adelheid de Boer-van Buiten Dirk Segaert vanden Bussche • Double-barrelled name with infix: Arie Gansneb genaamd Tengnagel tot den Bonckenhave • Double-barrelled name without infix: Martina Galloux Wittevrouw

  29. Names with a foreign language attribute • Three categories: Arabic: el Bahlaoui Husseini al Fharid Chinese/Vietnamese: Cuong Buo Chan Spanish/Portuguese: Fonseca Aranda de Pereira Rodriguez

  30. Diacritic symbols • All diacritics have to be recorded in the database. • Preferences in Capital Conversion • Validation of names • Examples: • Büch • Hällström • Özgüleç • Güçlütürk

  31. Interpretation of relationship data • Different kinds of relationship data • Different attributes • General and country specific rules (capitalization, abbreviation, etc.) • Signification differs due to context • Due to the ambiguity of relationship data, correct interpretation is no picnic

  32. Different kinds of relationship data with different attributes • Betonmortelfabriek BEMOTI Tilburg bv • Tilburgse Betonmortelfabriek BEMOTI bv • RegTP, Regulierungsbehörde für Telekommunikation und Post • CQCS International Consulting • Servicebureau Jansen/ Jansen Elektroservice • De Boer Landbouwmachines/ De Boer Machinebouw

  33. Signification can differ as a consequence of context, rules for abbreviation, capitalization and punctuation • Art Gallery Wandt & Wandt • Wandt Fachhandel für Kunstart. • Art. Wandt Kunsthandel • van Walbeek, M.B.A. • Van Walbeek, MBA

  34. Significations: How can they be determined? • Does the item exist in the particular knowledge universe? • Can the significations be resolved or deduced (acronyms and compounds)? • If the item does not exist in the knowledge universe, what is the most probable signification, considering the context?

  35. Can the item be deduced or resolved? • NeVoBo Nederlandse Volleybalbond • KLM Koninklijke Nederlandse Luchtvaartmaatschappij • AAAA • Maschinenfabrik Mertens • Carburateurbinnenverlichtingsfabriek Mertens
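
For the compound examples on this slide, one plausible way to deduce a signification is to split an unknown word into parts that are known. The sketch below is not the method from the talk: it is a simple recursive splitter over a hypothetical six-word lexicon, with an optional linking "s" between parts, and the name `split_compound` is made up for illustration.

```python
# Hypothetical mini-lexicon of known company words (illustration only).
LEXICON = {"maschinen", "fabrik", "carburateur", "binnen", "verlichting", "fabriek"}
LINKERS = ("", "s")  # assumption: parts may be joined directly or by a linking 's'


def split_compound(word, lexicon=LEXICON):
    """Return the known parts of `word`, or None if it cannot be fully split."""
    word = word.lower()
    if not word:
        return []
    # Prefer the longest known head, then try to split whatever remains.
    for end in range(len(word), 0, -1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            # A linking element must be present and must be followed by more text.
            if linker and (not rest.startswith(linker) or len(rest) == len(linker)):
                continue
            tail = split_compound(rest[len(linker):], lexicon)
            if tail is not None:
                return [head] + tail
    return None


print(split_compound("Maschinenfabrik"))
# ['maschinen', 'fabrik']
print(split_compound("Carburateurbinnenverlichtingsfabriek"))
# ['carburateur', 'binnen', 'verlichting', 'fabriek']
print(split_compound("AAAA"))
# None
```

A word that cannot be split this way, such as AAAA, would fall through to the context-based step described on the previous slide.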

  36. The item is not found in the knowledge universe • Harry Edward Johnson • Harry Edward Ireallygotaweirdsurname • IBM Computing • HAL Computing • Hermans Groente & Fruit, A’dam • Johnson Sarvice & Cnosult, Chelsee
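
Misspelled items such as "Sarvice", "Cnosult" and "Chelsee" are where fault tolerance comes in. The talk does not say which matching technique is used, so the sketch below substitutes Python's standard `difflib.get_close_matches` as a stand-in; the word list, the 0.8 cutoff and the name `closest_known` are all assumptions made for illustration.

```python
import difflib

# Hypothetical excerpt of the knowledge universe (illustration only).
KNOWN_WORDS = ["service", "consult", "chelsea", "computing", "groente", "fruit"]


def closest_known(word, cutoff=0.8):
    """Return the most similar known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), KNOWN_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for item in ["Sarvice", "Cnosult", "Chelsee", "Ireallygotaweirdsurname"]:
    print(item, "->", closest_known(item))
```

An item with no close match, like the invented surname on this slide, still has to be interpreted from its position and context rather than from the tables.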

  37. Context • Metzgerei Theo Frankfurt: given name/surname? • Metzgerei Theo Frankfurt: given name/geographical item? • Karin Jansen – Bloemen: given name/surname/company word? • Karin Jansen – Bloemen: given name/surname – surname (maiden name)?

  38. Patterns • Restaurant Die Vier Jahreszeiten • Café Het Nerveuze Schaap • Jasmijn Bloemen en Planten • Helena Catering & Imbiß • Consultingservice QCS Amsterdam • Aardappelhandel ABC Paterswolde

  39. Patterns? • chr. bond v. ambtenaren • chr. bond van zomers • KARL OTTO GRAF LAMBSDORFF • EVA MARIA BARON POTOCKI • Hi-Fi Johanson & Gruber GmbH • Em-Lo Emmerich und Lohmeier GmbH

  40. Multiple occurrences An item must be stored in all its significations • Beh. → Behandlung, Behälter, Behörde, Behinderte • Ond. → Onderzoek, Onderhoud, Onderneming, Onderwijs, Onderling
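
In data-structure terms, this means an abbreviation maps to a list of significations rather than to a single value. A minimal sketch, assuming a plain in-memory dictionary (the actual storage format of the knowledge tables is not described in the talk, and `store` and `significations` are hypothetical names):

```python
from collections import defaultdict

# One abbreviation maps to every expansion it can stand for.
expansions = defaultdict(list)


def store(abbrev, full_form):
    """Record one more signification for an abbreviation (case-insensitive key)."""
    expansions[abbrev.lower()].append(full_form)


def significations(abbrev):
    """Return all stored significations; an unknown abbreviation gives an empty list."""
    return list(expansions.get(abbrev.lower(), []))


for full in ["Behandlung", "Behälter", "Behörde", "Behinderte"]:
    store("Beh.", full)
for full in ["Onderzoek", "Onderhoud", "Onderneming", "Onderwijs", "Onderling"]:
    store("Ond.", full)

print(significations("Ond."))
```

Choosing among these candidates is then left to the context and pattern rules.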

  41. Interpretation step by step • Read appellation • Divide appellation into relevant sections and ascribe all possible significations to the sections • Apply context and grouping rules and choose the most probable combination of significations • Score the found items, the small context, the large context and the corrections for special cases.
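
The steps above can be sketched as a toy pipeline. Everything concrete here is an assumption: the three-entry knowledge table, the pattern scores (invented numbers, not figures from the talk) and the function name `interpret`. The real system also scores small and large context and special-case corrections, which this sketch collapses into a single pattern score.

```python
from itertools import product

# Hypothetical knowledge: every signification an item can have.
KNOWLEDGE = {
    "metzgerei": ["company word"],
    "theo": ["given name"],
    "frankfurt": ["surname", "geographical item"],
}

# Hypothetical scores for whole patterns of significations.
PATTERN_SCORES = {
    ("company word", "given name", "surname"): 0.7,
    ("company word", "given name", "geographical item"): 0.2,
}


def interpret(appellation):
    """Return the best-scoring (section, signification) pairs and their score."""
    # Step 1 and 2: read the appellation and divide it into sections.
    sections = appellation.split()
    # Ascribe all possible significations to each section.
    candidates = [KNOWLEDGE.get(s.lower(), ["unknown"]) for s in sections]
    # Step 3 and 4: score every combination and keep the most probable one.
    best = max(product(*candidates), key=lambda combo: PATTERN_SCORES.get(combo, 0.0))
    return list(zip(sections, best)), PATTERN_SCORES.get(best, 0.0)


labelled, score = interpret("Metzgerei Theo Frankfurt")
print(labelled, score)
```

With these made-up scores the sketch prefers the surname reading of "Frankfurt"; different context rules or scores would tip it the other way, which is exactly the ambiguity the Context slide raises.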

  42. [Diagram relating a <WORD> to: Knowledge Universe, Appearance, Context, Interpretation, Signification]

  43. The rolodex demo

  44. For more information: h.wandt@humaninference.com
