BiographyNed eScience Center 21 March 2013
Why a good case for eScience? • Involves big data with high complexity • Rich meta data joining diverse textual sources and selections of data • Incomplete and noisy • Potential to investigate difficult questions, e.g.: • How did the current Dutch elite develop from the colonial past? • Biographies may represent different views and realities and thus answers to questions: • hero or villain • 2.8 textual sources per person
What will we do? • Develop generic text mining technology that converts textual data to structured data • Taking into account nature of historical text • Enrich and externally link data repository of Dutch biographies • Develop visualizations and interactions on the data set to support historical research • Develop a range of cases that demonstrate the possibilities and impossibilities of the data set and technology
Nature of eHumanities [diagram] • Labels: Patterns in data, Interpretation, Value • Examples: Line composition in paintings; Twitter patterns during elections; Cubism; Democratic participation
Nature of eHumanities [diagram, extended] • Labels: Cases (persons/objects/events), Patterns in data, Interpretation, Narratives, Value • Examples: 19th-century Japanese prints; Biographical descriptions of Prince Bernhard; Line composition in paintings; Twitter patterns during elections; Cubism; Democratic participation; The rise of the Japanese middle class; German nobles in the Interbellum
[chart] Presence of information for governors of the Dutch Indies (% of 71 individuals)
The Historical Perspective • History and Biography • Where do eScience and History meet? • Use Cases
Historical Research The Art and Science of History: Drawing up a narrative from primary and secondary sources which approximates historical reality as well as possible.
Building Blocks and Concrete • Building blocks: facts derived mainly from archival findings and existing literature • Concrete: the methods historians use to put them together into a narrative/synthesis. • The Narrative: a historical synthesis which cannot be scientifically proven (only made likely) based on facts which can be proven or falsified. There is necessarily a creative element in drawing up a narrative.
Example: Grand Pensionary Johan de Witt (1625-1672) • Building blocks: born in 1625; son of Jacob and Anna van den Corput; appointed grand pensionary in 1653; murdered in The Hague in 1672; enemy of William (III) of Orange; William of Orange rewarded one of the instigators of the murder • Concrete: (logic) Based on these last data it is likely that William ordered the death of Johan • Narrative: William probably ordered the death of Johan <= proposition based on facts and reasoning
The Importance of Provenance The only way to falsify presented historical facts is by going back to the original source(s) and looking at those sources critically. It is highly important to be able to know exactly where each piece of information comes from.
Our Sources Here • The Metadata: building blocks • The entries in biographical dictionaries themselves: short historical narratives
Status of Biography in Academia and Society • Despite improved efforts this century to embed biography in academic theories and methods, some (e.g. some social historians) still do not consider it a worthy academic discipline, regarding it as too anecdotal and limited. • Biography is the most popular non-fiction genre in bookstores (from both academic and lay authors)
Where do eScience and History meet? (I) “And when the capsule biography of an individual is combined with 50,000 others, many of them relatively obscure, […] and when they are all powerfully searchable online, the social historian’s grumbles about biography’s limitations as an approach to historical study dissolves into nothingness.” (Brian Harrison, 2004, former editor of the Oxford Dictionary of National Biography)
Where do eScience and History meet? (II) A. Quantitative analyses of a larger group of people (prosopography). Surpassing the anecdotal. B. Finding relations/networks between people which are otherwise hard to detect
Where do eScience and History meet? (III) C. Insight into historiography and historical selectivity. Who was described/included and why? “Undoubtedly I have deprived many interesting women by not including them. The only thing I can say to defend myself is this: history writing is also a process of ruthless selection.” (Els Kloek, Head Biography Portal and main author 1001 vrouwen) D. Thematic research. E.g.: When did the discovery of America start to influence people’s lives?
BiographyNed Use Cases In the initial stages of the research a list of possible historical questions within one of those four themes was drawn up (subject to change), which the demonstrator should be able to answer, or at least to indicate a direction/trend for.
Case I: Making life easier: Group portrait of the Governors-General • Highest official in the Dutch Indies 1610-1949 • 71 men (still a relatively small group) • What can we say about these men as a group? • Who was appointed and what qualities did he have to have? • Etc.
Case I: data mining • Family connections (parents/wife/children, other relevant connections <= patronage) • Place of birth • Education • Religion • Career (patterns) • Age at appointment • Duration of holding the office • Reason for leaving the office • Place of death
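As a rough illustration of how two of the attributes listed above (age at appointment, duration of holding the office) could be summarised for a group, here is a minimal Python sketch. The records, field names, and years are hypothetical placeholders, not data from the Biography Portal; the `coverage` value mirrors the point that metadata is incomplete for some individuals.

```python
from statistics import mean, median

# Hypothetical records: names, years, and field names are illustrative only.
governors = [
    {"name": "Governor A", "born": 1600, "appointed": 1645, "left_office": 1650},
    {"name": "Governor B", "born": 1652, "appointed": 1704, "left_office": 1709},
    # Incomplete metadata: this record is skipped in the statistics.
    {"name": "Governor C", "born": 1790, "appointed": None, "left_office": None},
]

def group_portrait(records):
    """Summarise age at appointment and years in office, skipping incomplete records."""
    complete = [
        r for r in records
        if None not in (r["born"], r["appointed"], r["left_office"])
    ]
    ages = [r["appointed"] - r["born"] for r in complete]
    terms = [r["left_office"] - r["appointed"] for r in complete]
    return {
        "coverage": len(complete) / len(records),
        "mean_age_at_appointment": mean(ages),
        "median_years_in_office": median(terms),
    }

summary = group_portrait(governors)
```

Reporting coverage next to each statistic keeps it visible how many of the individuals a figure is actually based on.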
Case I: Time and Effort More than 1 full week to manually mine this information from the Biography Portal. Can a historian do this with (almost) the same results in under one hour if helped by the demonstrator?
Case II: Making things possible: The Dutch Nation & Identity • Who were selected to be included in National Biographical Dictionaries and why? (what was their claim to fame?) • Are there different perspectives on the same person over time and how can this be explained? • Who was deemed most important? (based on the length of the entries) • What time periods are most represented? • Is there a difference in claim to fame for people from different periods in history, or between men and women? • Which words are used most often and can we link them to national identities?
Case II: More Questions … • What events are mentioned most often and what does that say about the status quaestionis of how the Dutch see/saw themselves? • What are the differences in the answers to these questions between several national biographical dictionaries? • Are people and events described or appreciated differently over time? Does the perspective change? • How does this relate to biographical dictionaries, nations and identities elsewhere in Europe?
A crash course on Linked Data Online machine-readable data with links • Simple facts called ‘RDF Triples’ • Thorbecke > hasBirthPlace > Zwolle Some technology concepts: • Schemas: To structure LD • RDF Stores: To store LD • SPARQL: To access LD Huge growth in the past years: • More than 300 data sources • More than 30 billion triples
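To make the triple idea concrete, the sketch below stores a few subject-predicate-object facts in plain Python and answers a pattern query, which is roughly what a SPARQL query does against an RDF store. It is only an illustration: the `match` helper and the predicates other than `hasBirthPlace` are assumptions, and a real setup would use an RDF store and SPARQL.

```python
# A tiny in-memory stand-in for an RDF store; a triple is (subject, predicate, object).
triples = {
    ("Thorbecke", "hasBirthPlace", "Zwolle"),   # the example from the slide
    ("Thorbecke", "hasBirthYear", "1798"),      # illustrative predicate name
    ("Zwolle", "locatedIn", "Netherlands"),     # illustrative predicate name
}

def match(pattern, store):
    """Return all triples matching a pattern; None acts as a wildcard (like a SPARQL variable)."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )

# Roughly: SELECT ?place WHERE { Thorbecke hasBirthPlace ?place }
birth_places = [o for _, _, o in match(("Thorbecke", "hasBirthPlace", None), triples)]
```

Because the object of one triple can be the subject of another (here `Zwolle`), facts from different sources can be followed like links.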
The conversion process Purely syntactic conversion • Preserve the original structure of the data • Prevent loss of information • Allow for reinterpretation of the original data in the future Data Preservation
The conversion process Conversion steps: • Retrieval of XML dump of the Biography Portal • Initial conversion to ‘crude’ RDF • Using ClioPatria and the XMLRDF tool for ClioPatria • RDF restructuring • Linking to other sources • Essential step in the ‘Linked Data’ philosophy
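The "initial conversion to crude RDF" step can be pictured as a purely syntactic walk over the XML tree that emits one triple per attribute, text node, and child element, so that nothing from the original structure is lost. The Python sketch below shows that idea only; it is not the XMLRDF tool for ClioPatria, and the XML fragment, element names, and identifier scheme are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real Biography Portal dump uses its own element names.
XML = """<biography id="bio1">
  <person>
    <name>Johan Rudolph Thorbecke</name>
    <birth when="1798-01-14" place="Zwolle"/>
  </person>
</biography>"""

def crude_triples(element, subject):
    """Emit one triple per attribute, text node, and child element of the XML tree."""
    triples = []
    for key, value in element.attrib.items():
        triples.append((subject, key, value))
    text = (element.text or "").strip()
    if text:
        triples.append((subject, "text", text))
    for index, child in enumerate(element):
        # Each child gets its own node identifier, preserving the nesting of the XML.
        child_subject = f"{subject}/{child.tag}{index}"
        triples.append((subject, child.tag, child_subject))
        triples.extend(crude_triples(child, child_subject))
    return triples

result = crude_triples(ET.fromstring(XML), "bio1")
```

Interpretation (for example, deciding that `birth` describes an event) is deliberately left to the later restructuring step, which is what keeps the original data open to reinterpretation.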
The conversion process Data schema: • Based on the structure of the original XML files • Needs to facilitate the coupling of different biographies of the same person, without compromising the original data • Needs to facilitate the incorporation of several enrichments, following from NLP, Entity Reconciliation, etc. • Compatible with existing schemas such as the Europeana Data Model, PROV, RDAgr2, FOAF, DC terms
BiographyNed schema [diagram] • Original record: Provenance Meta Data (NNBW, “Thorbecke”); Biographical Description; Person Meta Data; Birth Event (1798) • Biography Parts: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duitse…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-German…”) • Enrichment: Provenance (NLP Tool); Biographical Description; Person Meta Data; Birth Event (1798-01-14, Zwolle)
The texts in the Biography Portal • Collection of biographical dictionaries • Dutch, including from the 19th and early 20th century and even older quotes • Sources (different dictionaries/collections) have their own style • Metadata available (though large differences in completeness)
Challenges and Advantages • Challenges: • Little work on NLP and biographies • Performance of Dutch NLP tools on variations of Dutch • Advantages: • High-quality metadata covering several categories of information (supervised machine learning) • Within sources, clear and similar structure of texts
General Approach • Start by using advantages: • Use metadata to label information • A basic IR system can be built using sentence number and lemmas as features • Enhance performance with NLP tools • Build upon information retrieved in the first steps to tackle more challenging tasks
A Basic System • Supervised Machine Learning • Two-step identification process (Wu and Weld 2007; 2010, Fader et al. 2011) • Identify sentence that contains information • Sequence tagging to identify information within the sentence
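The two-step process above can be sketched with toy stand-ins: step 1 scores sentences using sentence number and lemmas as features, and step 2 locates the value inside the chosen sentence. Everything here is an assumption made for illustration: the training sentences are invented, lower-cased tokens stand in for lemmas, a simple count-based scorer replaces a real classifier, and a regular expression replaces a trained sequence tagger. It is not the method of the cited papers.

```python
import re
from collections import Counter

# Toy training data: (sentence number, tokens, contains birth information?).
TRAIN = [
    (0, "hij werd geboren in 1798 in zwolle".split(), True),
    (1, "hij studeerde rechten in leiden".split(), False),
    (0, "zij werd geboren te amsterdam in 1820".split(), True),
    (2, "hij overleed in 1872".split(), False),
]

def features(position, tokens):
    """Sentence number and (approximate) lemmas as features."""
    return [f"pos={position}"] + [f"lemma={t}" for t in tokens]

def train(examples):
    """Add 1 per feature occurrence in positive sentences, subtract 1 in negative ones."""
    weights = Counter()
    for position, tokens, label in examples:
        for feature in features(position, tokens):
            weights[feature] += 1 if label else -1
    return weights

def find_sentence(sentences, weights):
    """Step 1: pick the sentence most likely to contain the information."""
    scored = [
        (sum(weights[f] for f in features(i, s.lower().split())), i)
        for i, s in enumerate(sentences)
    ]
    return sentences[max(scored)[1]]

def tag_year(sentence):
    """Step 2: regex stand-in for a sequence tagger that marks a four-digit year."""
    found = re.search(r"\b(1[0-9]{3})\b", sentence)
    return found.group(1) if found else None

weights = train(TRAIN)
doc = ["Johan werd geboren in 1798 in Zwolle", "Hij overleed in 1872 in Den Haag"]
birth_sentence = find_sentence(doc, weights)
birth_year = tag_year(birth_sentence)
```

Splitting the task this way means the second step only has to disambiguate within one sentence, which is why the year of death in the other sentence is never considered.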
Adding NLP • Location & Date recognition (GeoNames) • (other) Named Entities (VIAF enhanced with names from metadata) • Depending on performance of the system, we’ll work on: • Chunking, multiword recognition • Parsing • Word Sense Disambiguation
Metadata & Project Goals • Duplicate detection (metadata and text) • Events/Network discovery • Education (begin, end, location) • Occupation (begin, end, location) • Relations (parents, partners) • Temporal relations between events
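For the metadata side of duplicate detection, one simple baseline is to group records that share a normalised name and birth year. The sketch below assumes hypothetical record identifiers and field names; a real system would also need fuzzy matching and evidence from the text, which this does not attempt.

```python
import unicodedata
from collections import defaultdict

def normalise(name):
    """Lower-case, strip accents and punctuation, and sort the name parts."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    cleaned = "".join(c if c.isalnum() else " " for c in ascii_name.lower())
    return " ".join(sorted(cleaned.split()))

def find_duplicates(records):
    """Group record ids that share a normalised name and birth year."""
    groups = defaultdict(list)
    for record in records:
        key = (normalise(record["name"]), record["birth_year"])
        groups[key].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical records from two dictionaries; ids and field names are illustrative.
records = [
    {"id": "dictA-1", "name": "Thorbecke, Johan Rudolph", "birth_year": 1798},
    {"id": "dictB-7", "name": "Johan Rudolph Thorbecke", "birth_year": 1798},
    {"id": "dictB-9", "name": "Johan de Witt", "birth_year": 1625},
]
duplicates = find_duplicates(records)
```

Sorting the name parts makes "Surname, First name" and "First name Surname" compare equal, at the cost of occasionally merging distinct people with the same names and birth year.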
Output first system • Better coverage of categories mentioned above • A timeline for a person’s life (birth, education, occupation, locations, death) • Named Entities in text (dates, locations, persons)
Beyond the first system The information provided by the first system can be used to: • Identify alternative descriptions of events (same time, location and/or participants) • Identify relations between events (same locations & time, consequent events, same participants, etc.) • Build initial networks of people
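A minimal sketch of the first idea, finding alternative descriptions of the same event: two extracted events are paired when year and place match and they share at least one participant. The event records, field names, and source labels are hypothetical, and the matching rule is one possible choice rather than the project's actual criterion.

```python
from itertools import combinations

# Hypothetical extracted events; field names and source labels are illustrative.
events = [
    {"id": "e1", "source": "dict-A", "year": 1672, "place": "The Hague",
     "participants": {"Johan de Witt"}},
    {"id": "e2", "source": "dict-B", "year": 1672, "place": "The Hague",
     "participants": {"Johan de Witt", "Cornelis de Witt"}},
    {"id": "e3", "source": "dict-A", "year": 1653, "place": "The Hague",
     "participants": {"Johan de Witt"}},
]

def same_event(a, b):
    """Alternative descriptions: same year and place, and at least one shared participant."""
    return (
        a["year"] == b["year"]
        and a["place"] == b["place"]
        and bool(a["participants"] & b["participants"])
    )

alternatives = [
    (a["id"], b["id"]) for a, b in combinations(events, 2) if same_event(a, b)
]
```

The same pairwise comparison, with a looser rule (shared participants only), would yield a first approximation of a network of people.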
Methodological issues and text interpretation • Results should be reproducible • Code release (including scripts, configurations, …) • Documentation • Open source data • The setup should be modular • Combine output of different tools • Flexible choice of methods used
Evaluation Challenges (1/2) • How to evaluate the extraction tools? • Partial evaluation using metadata (10-fold cross-validation), but: • No precise indication of precision or recall (incomplete metadata…) • Biographies with rich metadata are not necessarily representative. Manually annotated data needed!
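The sketch below shows, under simplifying assumptions, what 10-fold cross-validation splits and precision/recall look like, and why incomplete metadata distorts them: a correct extraction that is missing from the metadata is counted as an error. The facts and function names are illustrative only.

```python
def k_folds(items, k=10):
    """Yield (train, test) splits; item i is held out in fold i % k."""
    for fold in range(k):
        test = [x for i, x in enumerate(items) if i % k == fold]
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, test

def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted facts against a set of gold facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Illustrative facts for one person. The metadata lacks the death year, so the
# correct extraction ("death", "1672") looks like a false positive against it.
gold = {("birth", "1625"), ("death", "1672")}
metadata_only = {("birth", "1625")}
predicted = {("birth", "1625"), ("death", "1672")}

folds = list(k_folds(list(range(20)), k=10))
p_full, r_full = precision_recall(predicted, gold)
p_meta, r_meta = precision_recall(predicted, metadata_only)
```

Measured against the complete gold facts the extraction is perfect, while against metadata alone precision drops to 0.5, which is the reason manually annotated data is needed.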
Evaluation Challenges (2/2) • How to compare performance of NLP tools? • Little work on biographies, little or none on Dutch ones… • How hard are older texts? Can we quantify? Systematic comparison: • English biographies (Wikipedia) • Dutch biographies (Wikipedia) • Biographies from the portal
Reproducibility/Replication • What do results mean if they cannot be reproduced? • What variation in results can be expected based on details not mentioned in papers? • Which information is needed to replicate results or find the origin of differences? Paper submitted to ACL 2013 (joint work with Marieke van Erp and others)
Representations (tools) • How to represent and combine output of different tools? • Compatibility (easy to convert output of external NLP tools) • Flexibility (be able to contain alternative representations and interpretations) Integrate representations in NIF (joint work with Jesper Hoeksema and Willem van Hage)
Representation (events) • How to combine knowledge from the NLP community and Linked Data community? • Combination of textual information with external resources • Complete representation of information from text (location, retrieval method) Paper submitted to workshop on Events: Definition, detection, coreference and representation (joint work with Marieke van Erp, Willem van Hage, Sara Tonelli, and others)
Current state of affairs • Basic system using sentence number and lemmas for main categories of metadata (evaluation ongoing) • Module for labeling locations and dates in text (adaptations to be made for modularity) • Annotation effort started for evaluation (selection of approximately 700 texts)