BiographyNed

BiographyNed. eScience Center, 21 March 2013. Why a good case for eScience? Involves big data with high complexity; rich metadata joining diverse textual sources and selections of data; incomplete and noisy; potential to investigate difficult questions, e.g.:


Presentation Transcript


  1. BiographyNed eScience Center 21 March 2013

  2. Why a good case for eScience? • Involves big data with high complexity • Rich metadata joining diverse textual sources and selections of data • Incomplete and noisy • Potential to investigate difficult questions, e.g.: • How did the current Dutch elite develop from the colonial past? • Biographies may represent different views and realities, and thus different answers to questions: • hero or villain • 2.8 textual sources per person

  3. What will we do? • Develop generic text mining technology that converts textual data to structured data • Taking into account the nature of historical text • Enrich and externally link the data repository of Dutch biographies • Develop visualizations and interactions on the data set to support historical research • Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology

  4. Nature of eHumanities (diagram) • Stages: Patterns in data; Interpretation; Value • Examples: Line composition in paintings; Twitter patterns during elections; Cubism; Democratic participation

  5. (Diagram, extended) • Narratives: The rise of the Japanese middle class; German nobles in the Interbellum • Stages: Patterns in data; Interpretation; Value • Examples: Line composition in paintings; Twitter patterns during elections; Cubism; Democratic participation • Cases (persons/objects/events): 19th-century Japanese prints; Biographical descriptions of Prince Bernhard

  6. Statistics on available information

  7. Textual information per person

  8. Availability of information in the portal

  9. Presence of information for governors of the Dutch Indies (% of 71 individuals)

  10. The Historical Perspective • History and Biography • Where do eScience and History meet? • Use Cases

  11. Historical Research The Art and Science of History: drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.

  12. Building Blocks and Concrete • Building blocks: facts derived mainly from archival findings and existing literature • Concrete: the methods historians use to put them together into a narrative/synthesis • The Narrative: a historical synthesis which cannot be scientifically proven (only made likely), based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative.

  13. Example: Grand Pensionary Johan de Witt (1625-1672) • Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder • Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan • Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning

  14. The House of History

  15. The Importance of Provenance The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically. It is highly important to be able to know exactly what information comes from where.

  16. Our Sources Here • The metadata: building blocks • The entries in biographical dictionaries themselves: short historical narratives

  17. Status of Biography in Academia and Society • Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, finding it too anecdotal and limited. • Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)

  18. Where do eScience and History meet? (I) “And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.” (Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)

  19. Where do eScience and History meet? (II) A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal. B. Finding relations/networks between people which are otherwise hard to detect

  20. Where do eScience and History meet? (III) C. Insight into historiography and historical selectivity. Who was described/included and why? “Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, head of the Biography Portal and main author of 1001 vrouwen) D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?

  21. BiographyNed Use Cases In the initial stages of the research a list of possible historical questions within one of those four themes was drawn up (subject to change). The demonstrator should be able to answer these questions, or at least point to a direction/trend.

  22. Case I: Making life easier: Group portrait of the Governors-General • Highest official in the Dutch Indies 1610-1949 • 71 men (still a relatively small group) • What can we say about these men as a group? • Who was appointed and what qualities did he have to have? • Etc.

  23. Case I: data mining • Family connections (parents/wife/children, other relevant connections <= patronage) • Place of Birth • Education • Religion • Career (patterns) • Age at appointment • Duration of holding the office • Reason for leaving the office • Place of Death
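Once fields such as these are available as structured data, group-level figures like age at appointment and duration in office become simple aggregations. The sketch below is only an illustration: the `OfficeHolder` record, its field names, and the two placeholder persons are assumptions, not data from the Biography Portal.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class OfficeHolder:
    """Hypothetical record holding a few of the fields listed on the slide."""
    name: str
    birth_year: int
    appointed_year: int
    left_office_year: int

    @property
    def age_at_appointment(self) -> int:
        # Year-level approximation; exact dates would be needed for a precise age.
        return self.appointed_year - self.birth_year

    @property
    def years_in_office(self) -> int:
        return self.left_office_year - self.appointed_year


# Placeholder persons, invented purely for illustration.
HOLDERS = [
    OfficeHolder("Person A", 1600, 1640, 1645),
    OfficeHolder("Person B", 1650, 1700, 1704),
]

mean_age = mean(h.age_at_appointment for h in HOLDERS)
mean_duration = mean(h.years_in_office for h in HOLDERS)
print(mean_age, mean_duration)
```

The same pattern extends to categorical fields (place of birth, religion, reason for leaving office) by counting values instead of averaging them.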

  24. Case I: Time and Effort More than 1 full week to manually mine this information from the Biography Portal. Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?

  25. Case II: Making things possible: The Dutch Nation & Identity • Who was selected to be included in national biographical dictionaries and why? (What was their claim to fame?) • Are there different perspectives on the same person over time and how can this be explained? • Who was deemed most important? (based on the length of the entries) • What time periods are most represented? • Is there a difference in claim to fame for people from different periods in history, or between men and women? • Which words are used most often and can we link them to national identities?
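Two of these questions (importance measured by entry length, and the most frequent words) reduce to simple counting once the entry texts are available. A minimal sketch follows, assuming a hypothetical `ENTRIES` mapping from person to entry text; the two short texts are invented placeholders, and a real analysis would need Dutch tokenization, lemmatization, and stop-word handling.

```python
import re
from collections import Counter
from typing import Dict, List

# Placeholder entries, invented for illustration only.
ENTRIES: Dict[str, str] = {
    "Person A": "Statesman and jurist. Statesman of the liberal era.",
    "Person B": "Painter.",
}


def tokens(text: str) -> List[str]:
    """Lowercased word tokens; a crude stand-in for proper lemmatization."""
    return re.findall(r"\w+", text.lower())


# Entry length in tokens, used here as a rough proxy for perceived importance.
entry_lengths = {name: len(tokens(text)) for name, text in ENTRIES.items()}
ranked = sorted(entry_lengths, key=entry_lengths.get, reverse=True)

# Word frequencies over the whole collection.
word_counts = Counter(t for text in ENTRIES.values() for t in tokens(text))

print(ranked)
print(word_counts.most_common(3))
```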

  26. Case II: More Questions … • What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves? • What are the differences in the answers to these questions between several national biographical dictionaries? • Are people and events described or appreciated differently over time? Does the perspective change? • How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?

  27. Conversion to Linked Data

  28. A crash course on Linked Data Online machine-readable data with links • Simple facts called ‘RDF Triples’ • Thorbecke > hasBirthPlace > Zwolle Some technology concepts: • Schemas: to structure LD • RDF Stores: to store LD • SPARQL: to access LD Huge growth in the past years: • More than 300 data sources • More than 30 billion triples
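The triple idea can be illustrated without any RDF tooling. The sketch below is a toy in-memory store with wildcard matching, roughly what a single SPARQL triple pattern does; it is not an RDF store and does not use URIs. Only the first triple comes from the slide; the predicate names `hasBirthYear` and `locatedIn` are assumptions added for illustration.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TRIPLES: List[Triple] = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),  # from the slide
    ("Thorbecke", "hasBirthYear", "1798"),     # illustrative predicate name
    ("Zwolle", "locatedIn", "Netherlands"),    # illustrative predicate name
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard (like a SPARQL variable)."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# "Where was Thorbecke born?"
birthplaces = [obj for _, _, obj in match(TRIPLES, s="Thorbecke", p="hasBirthPlace")]
print(birthplaces)
```

In real Linked Data the three positions are URIs (or literals), which is what allows triples from different sources to link to each other.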

  29. The conversion process Purely syntactic conversion • Preserve the original structure of the data • Prevent loss of information • Allow for reinterpretation of the original data in the future Data Preservation

  30. The conversion process Conversion steps: • Retrieval of XML dump of the Biography Portal • Initial conversion to ‘crude’ RDF • Using ClioPatria and the XMLRDF tool for ClioPatria • RDF restructuring • Linking to other sources • Essential step in the ‘Linked Data’ philosophy
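To give a feel for what a purely syntactic conversion to ‘crude’ RDF looks like, the sketch below walks an XML tree and emits one triple per element, attribute, and text value, so that the original structure is preserved. It is a plain-Python illustration, not the XMLRDF tool or ClioPatria; the sample XML layout, the blank-node style identifiers, and the `type`/`value` predicate names are all assumptions.

```python
import xml.etree.ElementTree as ET
from itertools import count
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical record; the real Biography Portal XML layout is not shown in the slides.
SAMPLE_XML = """
<biography source="NNBW">
  <person>
    <name>Johan Rudolph Thorbecke</name>
    <birth><date>1798-01-14</date><place>Zwolle</place></birth>
  </person>
</biography>
"""


def crude_triples(xml_text: str) -> List[Triple]:
    """Mirror the XML tree as triples without interpreting its meaning."""
    root = ET.fromstring(xml_text)
    ids = count(1)
    triples: List[Triple] = []

    def visit(elem: ET.Element, node_id: str) -> None:
        triples.append((node_id, "type", elem.tag))
        for attr, value in elem.attrib.items():
            triples.append((node_id, attr, value))
        text = (elem.text or "").strip()
        if text:
            triples.append((node_id, "value", text))
        for child in elem:
            child_id = f"_:n{next(ids)}"  # fresh node for every child element
            triples.append((node_id, child.tag, child_id))
            visit(child, child_id)

    visit(root, "_:n0")
    return triples


triples = crude_triples(SAMPLE_XML)
for triple in triples:
    print(triple)
```

Because nothing is interpreted at this stage, the later restructuring and linking steps can be redone without going back to the XML dump.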

  31. The conversion process Data schema: • Based on the structure of the original XML files • Needs to facilitate the coupling of different biographies of the same person, without compromising the original data • Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc. • Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms

  32. BiographyNed schema (diagram) • Source record: Provenance Meta Data: NNBW “Thorbecke”; Biographical Description; Person Meta Data; Birth Event: 1798; Biography Parts: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”) • Enrichment: Provenance Meta Data: Thorbecke Enrichment, NLP Tool; Biographical Description; Person Meta Data; Birth Event: 1798-01-14, Zwolle

  33. Retrieving Information from Text

  34. The texts in the Biography Portal • Collection of biographical dictionaries • Dutch, including texts from the 19th and early 20th century and even older quotes • Sources (different dictionaries/collections) have their own style • Metadata available (though with large differences in completeness)

  35. Challenges and Advantages • Challenges: • Little work on NLP and biographies • Performance of Dutch NLP tools on variations of Dutch • Advantages: • High-quality metadata covering several categories of information (supervised machine learning) • Within sources, clear and similar structure of texts

  36. General Approach • Start by using advantages: • Use metadata to label information • A basic IR system can be built using sentence number and lemmas as features • Enhance performance with NLP tools • Build upon information retrieved in the first steps to tackle more challenging tasks

  37. A Basic System • Supervised Machine Learning • Two-step identification process (Wu and Weld 2007; 2010, Fader et al. 2011) • Identify the sentence that contains information • Sequence tagging to identify information within the sentence
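The first step (finding the sentence that contains a piece of information) can be trained without manual annotation by letting the metadata label the sentences. The sketch below builds such training instances with sentence number and word features; it stops before any classifier or sequence tagger is trained. The function names, the naive sentence splitter, the use of lowercased tokens in place of real lemmas, and the English example text are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

Instance = Tuple[Dict[str, float], bool]


def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation; real biographies need more care."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def features(sentence: str, number: int) -> Dict[str, float]:
    """Sentence number plus bag of lowercased tokens (stand-in for lemmas)."""
    feats: Dict[str, float] = {"sentence_number": float(number)}
    for token in re.findall(r"\w+", sentence.lower()):
        feats[f"lemma={token}"] = 1.0
    return feats


def training_instances(text: str, metadata_value: str) -> List[Instance]:
    """Label a sentence positive when it mentions the value recorded in the metadata."""
    instances: List[Instance] = []
    for number, sentence in enumerate(split_sentences(text), start=1):
        label = metadata_value.lower() in sentence.lower()
        instances.append((features(sentence, number), label))
    return instances


# Invented English example; the portal texts themselves are Dutch.
TEXT = "Thorbecke was born in Zwolle. He studied in Leiden. He died in The Hague."
instances = training_instances(TEXT, "Zwolle")  # metadata: place of birth = Zwolle
labels = [label for _, label in instances]
print(labels)
```

Instances like these could then be fed to any supervised learner; the second step would tag the position of the value inside the sentences predicted as positive.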

  38. Adding NLP • Location & Date recognition (GeoNames) • (Other) Named Entities (VIAF enhanced with names from metadata) • Depending on performance of the system, we’ll work on: • Chunking, multiword recognition • Parsing • Word Sense Disambiguation

  39. Metadata & Project Goals • Duplicate detection (metadata and text) • Events/Network discovery • Education (begin, end, location) • Occupation (begin, end, location) • Relations (parents, partners) • Temporal relations between events

  40. Output first system • Better coverage of categories mentioned above • A timeline for a person’s life (birth, education, occupation, locations, death) • Named Entities in text (dates, locations, persons)

  41. Beyond the first system The information provided by the first system can be used to: • Identify alternative descriptions of events (same time, location and/or participants) • Identify relations between events (same locations & time, consequent events, same participants, etc.) • Build initial networks of people
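Finding alternative descriptions of the same event can start as a simple grouping of extracted events on shared time and location. The sketch below assumes a hypothetical `Event` record and two unnamed source dictionaries; it uses exact matching on event type, date, and place, whereas real data would need fuzzy matching of dates, place names, and participants.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical output record of the first extraction system."""
    source: str  # which dictionary the description comes from
    kind: str
    date: str
    place: str


# Illustrative records; source names are placeholders.
EVENTS = [
    Event("dictionary_A", "birth", "1798-01-14", "Zwolle"),
    Event("dictionary_B", "birth", "1798-01-14", "Zwolle"),
    Event("dictionary_A", "death", "1872-06-04", "The Hague"),
]

Key = Tuple[str, str, str]


def alternative_descriptions(events: List[Event]) -> Dict[Key, List[Event]]:
    """Group events by (kind, date, place); keep groups described by more than one source."""
    groups: Dict[Key, List[Event]] = defaultdict(list)
    for event in events:
        groups[(event.kind, event.date, event.place)].append(event)
    return {
        key: group
        for key, group in groups.items()
        if len({event.source for event in group}) > 1
    }


shared = alternative_descriptions(EVENTS)
print(shared)
```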

  42. Methodological issues and text interpretation • Results should be reproducible • Code release (including scripts, configurations, …) • Documentation • Open source data • The setup should be modular • Combine output of different tools • Flexible choice of methods used

  43. Evaluation Challenges (1/2) • How to evaluate the extraction tools? • Partial evaluation using metadata (10-fold cross-validation), but: • No precise indication of precision or recall (incomplete metadata…) • Biographies with rich metadata are not necessarily representative • Manually annotated data needed!
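The evaluation setup and its main caveat can be made concrete in a few lines. The sketch below shows one way to produce 10-fold splits and to compute precision and recall against metadata used as the gold standard; the function names and the tiny example sets are assumptions, and no results from the project are implied.

```python
from typing import List, Set, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists; every item lands in exactly one test fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of predicted values against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


splits = k_fold_indices(25, k=10)

# Suppose the system extracts two places but the metadata records only one.
# If "Leiden" is actually correct and merely missing from the metadata,
# the measured precision of 0.5 underestimates the true precision.
precision, recall = precision_recall({"Zwolle", "Leiden"}, {"Zwolle"})
print(precision, recall)
```

This is why scores computed against incomplete metadata are only a partial indication, and why manually annotated data is needed for a reliable figure.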

  44. Evaluation Challenges (2/2) • How to compare the performance of NLP tools? • Little work on biographies, little or none on Dutch ones… • How hard are older texts? Can we quantify? • Systematic comparison: • English biographies (Wikipedia) • Dutch biographies (Wikipedia) • Biographies from the portal

  45. Reproducibility/Replication • What do results mean if they cannot be reproduced? • What variation in results can be expected based on details not mentioned in papers? • Which information is needed to replicate results or find the origin of differences? • Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)

  46. Representations (tools) • How to represent and combine output of different tools? • Compatibility (easy to convert output of external NLP tools) • Flexibility (be able to contain alternative representations and interpretations) • Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)

  47. Representation (events) • How to combine knowledge from the NLP community and Linked Data community? • Combination of textual information with external resources • Complete representation of information from text (location, retrieval method) • Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)

  48. Current state of affairs • Basic system using sentence number and lemmas for the main categories of metadata (evaluation ongoing) • Module for labeling locations and dates in text (adaptations to be made for modularity) • Annotation effort started for evaluation (selection of approximately 700 texts)

  49. Demonstrator
