Digging Up Data:

Digging Up Data: The Archaeotools project, Faceted Classification and Natural Language Processing in an archaeological context. Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman, Ziqi Zhang, Tony Austin.

Presentation Transcript


  1. Digging Up Data: The Archaeotools project, Faceted Classification and Natural Language Processing in an archaeological context. Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman, Ziqi Zhang, Tony Austin. UK e-Science All Hands Meeting, Edinburgh, 9th September 2008

  2. AHRC-EPSRC-JISC eScience research grants scheme: Joint Information Systems Committee
     PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield
     AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks
     BUILDS UPON: Common Information Environment Enhanced Geospatial browser

  3. Three distinct Workpackages:
     • Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media)
     • Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging
     • Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk

  4. Datasets include:
     • National Monuments Records (Scotland, Wales, England)
     • Excavation Index (EH)
     • Archive Holdings
     • Local Authority Historic Environment Records
     Thesauri include:
     • Thesaurus of Monument Types (TMT)
     • Thesaurus of Object Types
     • MIDAS Period list
     • UK Government list of administrative areas, County, District, Parish (CDP) – Not MIDAS

  5. [Architecture diagram. Component labels: Oracle RDBMS; MIDAS XML Record; RDF Resource; Information Extraction Input; When, Where, What ontologies as entries to faceted index; Knowledge triple store; XML Docs of Thesaurus; Information Extraction Input; Query; User Interface]

  6. “WHAT”
     • Records that have no subject information
     • Records that use terms not found in TMT, so these records cannot be indexed (6,442 unique terms)
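
The two failure cases on this slide can be illustrated with a minimal sketch, assuming records are plain dictionaries with a `subject` field and that matching is a case-insensitive exact lookup. The function name, the record layout and the sample terms are hypothetical and are not taken from the Archaeotools code.

```python
def index_by_thesaurus(records, thesaurus_terms):
    """Split records into indexable, subject-less, and unknown-term groups."""
    known = {term.casefold() for term in thesaurus_terms}
    indexed = []        # ids of records whose subject is a thesaurus term
    no_subject = []     # ids of records with no subject information
    unknown_terms = {}  # unmatched term -> ids of the records that use it
    for rec in records:
        subject = (rec.get("subject") or "").strip()
        if not subject:
            no_subject.append(rec["id"])
        elif subject.casefold() in known:
            indexed.append(rec["id"])
        else:
            unknown_terms.setdefault(subject, []).append(rec["id"])
    return indexed, no_subject, unknown_terms


# Hypothetical sample data, standing in for TMT terms and monument records.
tmt = ["BARROW", "HILLFORT", "VILLA"]
records = [
    {"id": 1, "subject": "Hillfort"},
    {"id": 2, "subject": ""},
    {"id": 3, "subject": "hill fort?"},
    {"id": 4, "subject": "villa"},
]
indexed, no_subject, unknown_terms = index_by_thesaurus(records, tmt)
```

Counting the keys of `unknown_terms` gives the kind of "unique terms" figure quoted on the slide; how the project actually matched terms is not stated in the transcript.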

  7. “WHEN”
     • Records that have no temporal information
     • Records that use period terms not found in MIDAS, so these records cannot be indexed (457 types of irresolvable dates)
     1066, 1001-1100, 11th Centuary, C11, 11C, Eleventh Century
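
As an illustration of why such variants are hard to index, the sketch below maps the six spellings listed on the slide to a single century number. It is an assumption-laden example, not the project's method: the function names and the handful of patterns are hypothetical, and real records would need many more rules.

```python
import re

ORDINALS = [
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth", "twentieth",
]
ORDINAL_WORDS = {word: n + 1 for n, word in enumerate(ORDINALS)}


def year_to_century(year):
    """Year 1001 to 1100 -> 11, following the range shown on the slide."""
    return (year - 1) // 100 + 1


def century_of(text):
    """Return the century number for a date expression, or None if unresolved."""
    s = text.strip().casefold()
    # Year range such as "1001-1100": accept only if both ends share a century.
    m = re.fullmatch(r"(\d{3,4})\s*-\s*(\d{3,4})", s)
    if m:
        first, last = (year_to_century(int(g)) for g in m.groups())
        return first if first == last else None
    # Single year such as "1066".
    if re.fullmatch(r"\d{3,4}", s):
        return year_to_century(int(s))
    # Abbreviations such as "C11" or "11C".
    m = re.fullmatch(r"c(\d{1,2})|(\d{1,2})c", s)
    if m:
        return int(m.group(1) or m.group(2))
    # "11th Century", tolerating misspellings that start with "cent".
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)\s+cent\w*", s)
    if m:
        return int(m.group(1))
    # "Eleventh Century".
    m = re.fullmatch(r"([a-z]+)\s+cent\w*", s)
    if m:
        return ORDINAL_WORDS.get(m.group(1))
    return None


variants = ["1066", "1001-1100", "11th Centuary", "C11", "11C", "Eleventh Century"]
centuries = [century_of(v) for v in variants]
```

Anything the patterns do not cover returns `None`, which corresponds to a record that cannot be placed in the "When" facet.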

  8. “WHERE”
     • Records that have no spatial information
     • Records that use terms not found in CDP, so these records cannot be indexed

  10. Three distinct Workpackages:
     • Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media)
     • Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging
     • Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk

  11. XML tagging of semantic content; CIDOC CRM

  12. Information Extraction in Archaeotools
     • What (subject)
     • Where (place name)
     • When (temporal info)
     • Grid reference (easting and northing)
     • Report title
     • Report creator
     • Report publisher
     • Report publisher contact
     • Report publication date
     • Event date
     • Bibliography & references

  13. Un-annotated texts are negative examples. Example annotations in highlighted colours are positive examples.
     Features of this annotation:
     • first_letter_capitalised: true
     • word_found_in_gazetteer: true
     • followed_by: period
     • preceded_by: the
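
A rough sketch of how the four features named on this slide could be computed for one token is given below. The tokenisation, the gazetteer contents and the function name are assumptions for illustration, and "followed_by: period" is read here as the next word being "period" (the slide could equally mean a full stop).

```python
def token_features(tokens, i, gazetteer):
    """Build the slide's four features for the token at position i."""
    word = tokens[i]
    return {
        "first_letter_capitalised": word[:1].isupper(),
        "word_found_in_gazetteer": word.casefold() in gazetteer,
        # Neighbouring words give the learner its local context.
        "preceded_by": tokens[i - 1].casefold() if i > 0 else None,
        "followed_by": tokens[i + 1].casefold() if i + 1 < len(tokens) else None,
    }


# Hypothetical gazetteer and sentence; an annotated token is a positive example.
gazetteer = {"roman", "medieval"}
tokens = ["Pottery", "from", "the", "Roman", "period"]
features = token_features(tokens, 3, gazetteer)
```

A learner would be trained on such feature dictionaries for annotated tokens (positive examples) and for all remaining tokens (negative examples); the transcript does not say which learning algorithm the project used.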

  14. Rule-based systems are good at extracting information that matches simple patterns and/or occurs in regular contexts, so they are applied to:
     • Grid reference (easting and northing)
     • Report title*
     • Report creator*
     • Report publisher*
     • Report publication date*
     • Report publisher contact
     • Bibliography & references
     Machine learning is good at extracting information that cannot be matched by patterns, occurs in irregular contexts, or occurs in large amounts, so it is applied to:
     • What (subject)
     • Where (place name)
     • When (temporal info)
     • Event date
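
Grid references are a good fit for the rule-based side because they follow a fixed shape. The sketch below assumes references written as two capital letters followed by an even number of digits (for example "SE 6052 5218"); the pattern, function name and sample sentence are hypothetical, and the pattern is deliberately loose, so it would also accept strings such as "AD 1066".

```python
import re

# Two capital letters, then 4 to 10 digits with an optional space in the middle.
GRID_REF = re.compile(r"\b([A-Z]{2})\s?(\d{2,5}\s?\d{2,5})\b")


def find_grid_references(text):
    """Return (square, easting, northing) tuples found in the text."""
    refs = []
    for match in GRID_REF.finditer(text):
        square = match.group(1)
        digits = "".join(match.group(2).split())
        # Easting and northing must have the same number of digits.
        if len(digits) % 2 == 0:
            half = len(digits) // 2
            refs.append((square, digits[:half], digits[half:]))
    return refs


sample = ("The site lies at SE 6052 5218, north of the river; "
          "finds were also recorded at NT2573.")
grid_refs = find_grid_references(sample)
```

A production rule would at least restrict the letter pair to valid grid squares and check the surrounding context; those refinements are left out here.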

  15. Three distinct Workpackages:
     • Workpackage 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media)
     • Workpackage 2 – Natural language processing / Data-mining of Grey Literature; plus tagging
     • Workpackage 3 – Data-mining of Historic Literature; plus geoXwalk

  16. http://ads.ahds.ac.uk/project/archaeotools/