Gazetteers for Temporal and Spatial Reference Extraction CS6604 Class Project
Outline
• Description
• Motivation
• Problem Statement
• Objectives
• Prior Work
• Gazetteer Structure
• Temporal Model
• IR Query Method
• Project Framework
• Experimental Methods
• Summary
Description
• Gazetteers: manual and digital repositories of geographic information, mainly linking names to locations.
• A specific form of spatial database:
  • Point and polygonal coverage areas
  • Demographic information
  • Hierarchical information
• Various gazetteers have been created for different areas and purposes.
• This project focuses on designing a gazetteer for finding spatial and temporal references, based on both spatial database and Information Retrieval (IR) techniques.
Motivation
• Gazetteers are focused on location only.
• Why is that bad?
  • Can’t tell me the location of important events
  • Can’t tell me the locations associated with important people
  • Can’t tell me how locations have changed over time
  • Can’t tell me which locations applied to specific time periods
• Gazetteers are not directly helpful in getting location information within documents
  • Name ambiguity (50 records in the ADL Gazetteer)
  • Document ambiguity
• SQL is not a solution
  • The result set is only a candidate set, which still has to be compared against the correct solution
  • SQL statements require sufficient understanding of the data, and can be complicated
Problem Statement
• Given the following:
  • Name ambiguity
  • Document ambiguity
  • Focus on location reference extraction from documents
    • Explicit (London, England)
    • Implicit (location of the Queen of England)
• How can we:
  • Construct a gazetteer to help in the discovery of such references
  • Best query a gazetteer for that information
Objectives
• Research the addition of temporal information to gazetteers to enhance their capability.
• Look at how other information can be used to reinforce the choice of gazetteer entries that describe references.
• Experiment with how IR techniques can be used to query gazetteer spatial databases, which are normally queried via SQL.
• Overcome ambiguity in SQL answers.
• Show a possible alternative to current location extraction techniques.
Assumptions
• Concepts of autocorrelation and co-location apply to textual document references
  • Autocorrelation: similar information will be found close together.
  • Co-location: different but related items may be close together.
• Text will contain references to other items close to the main location that reinforce the correct choice, and these should appear close to the text references in question.
• Other references, such as events and people, have an inherent location attribute that can reinforce the correct location choice.
• Example: “After the subway explosions, I went to the Eye, and then Buckingham Palace while I was in London.”
Prior Work
• Alexandria Digital Library (ADL) Gazetteer
  • University of California, Santa Barbara effort, within the overall ADL effort.
  • ~6 million entries
  • Has tried to standardize the format, description, and distribution of gazetteer data.
  • Has a published, detailed schema
  • Incorporates some time elements, especially for naming entities, but time data is usually not populated (current snapshot).
Prior Work
• Electronic Cultural Atlas Initiative
  • Effort of the University of California, Berkeley, to store and display time-varying data about different cultures.
Prior Work
• TimeMap
  • University of Sydney display of time-varying map/location data.
  • Originally used in archaeology
Prior Work
• MetaCarta
  • Conducts geoparsing and geocoding of text documents, and sends back possible location references with relative strength values.
  • Uses Natural Language Processing (NLP) techniques to find possible location references, then a straight database lookup
  • Contains a gazetteer of ~14 million entries.
  • All entries need an initial probability, in case there is not enough information in the text.
Gazetteer Structure
• Entities (locations and people/things) have relationships with time periods and events.
• All entries in each table have beginning and ending dates, and values for uncertainty.
[Diagram: schema with nodes Time Period, Events, Location, Areas, and People/things, joined by relationships labeled Occur during, Applies to, Lived during, Occurs at, covers, Part of, and Present at.]
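The structure above can be pictured with a few record types. This is only a minimal sketch: the class and field names (`TemporalExtent`, `GazetteerEntry`, `Relationship`, `begin_uncertainty`, and so on) are hypothetical and not taken from the project's schema, and attaching a temporal extent to each relationship is an assumption. The sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalExtent:
    """Beginning and ending dates kept as text, plus uncertainty values."""
    begin: str
    end: str
    begin_uncertainty: Optional[str] = None  # hypothetical field, e.g. "1 day"
    end_uncertainty: Optional[str] = None    # hypothetical field


@dataclass
class GazetteerEntry:
    """One row in any of the tables: location, person/thing, event, time period."""
    entry_id: int
    kind: str   # "location", "person_thing", "event", or "time_period"
    name: str
    extent: TemporalExtent


@dataclass
class Relationship:
    """A labeled link between two entries, e.g. an event that occurs at a location."""
    source_id: int
    relation: str   # e.g. "occurs_at", "present_at", "part_of"
    target_id: int
    extent: TemporalExtent  # assumption: relationships are also time-bounded


# Illustrative values only.
london = GazetteerEntry(1, "location", "London", TemporalExtent("0043", "NOW"))
explosions = GazetteerEntry(
    2, "event", "subway explosions", TemporalExtent("2005-07-07", "2005-07-07")
)
occurs_at = Relationship(explosions.entry_id, "occurs_at", london.entry_id, explosions.extent)
```

Keeping the dates as text here mirrors the Temporal Model slide, which notes that time values are stored as text and handled at the application level.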
Temporal Model
• Used the Bitemporal Conceptual Data Model (BCDM), developed for the TSQL2 query language.
• Why:
  • General documents will cover all of history: from the Big Bang to NOW.
  • Dealing with different granularities: from seconds to eons
  • Need to worry about NOW
  • SQL date types do not handle dates before 1000 AD (the supported range of MySQL's DATE type starts at 1000-01-01)
• Time values are stored as text and must be handled at the application level.
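Handling text time values at the application level might look like the sketch below. The text format is an assumption made for illustration (a signed year with optional `-MM` and `-DD` parts, or the literal `NOW`); the slides do not specify the stored format. It covers only a single time dimension, not the full bitemporal model, and it collapses coarse granularities to the start of the period.

```python
from datetime import date
from typing import Optional, Tuple

# (year, month, day); the year may be negative or far outside the range
# that SQL and Python date types support.
TimePoint = Tuple[int, int, int]


def parse_time(text: str, today: Optional[date] = None) -> TimePoint:
    """Parse a hypothetical text time value into a sortable tuple."""
    text = text.strip()
    if text.upper() == "NOW":
        today = today or date.today()
        return (today.year, today.month, today.day)
    negative = text.startswith("-")
    parts = text.lstrip("+-").split("-")
    year = -int(parts[0]) if negative else int(parts[0])
    # Missing month or day defaults to 1 (a simplification of granularity).
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    return (year, month, day)


def within(point: str, begin: str, end: str, today: Optional[date] = None) -> bool:
    """True if `point` falls inside the closed interval [begin, end]."""
    return parse_time(begin, today) <= parse_time(point, today) <= parse_time(end, today)
```

Comparing plain tuples sidesteps the date-range limits of the database, at the cost of doing no calendar validation.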
IR Query Method
• Documents with spatial and temporal references are the queries.
  • Broken into chunks to avoid mixing unrelated locations.
• The gazetteer’s location, people, time period, and event entries become the documents
  • Focused on location documents
  • A location document is formed for each location entry by running a SQL spatial operation that finds which other locations overlap the specific location’s minimum bounding rectangle (MBR).
  • A location document thus becomes a listing of contained locations, which in an IR methodology should make textual clues lead to a stronger relevance measure.
  • Location documents record which database entry they came from.
• The entire document collection is indexed.
• Similarity measures are computed during the initial query using tf-idf, and initial results above a certain threshold are kept.
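The document-building and initial-query steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation (which uses MySQL for the spatial operation and Lucene for indexing): the MBR tuples, the toy coordinates, the function names, the tokenizer, and the particular tf-idf weighting are all hypothetical.

```python
import math
import re
from collections import Counter

# Toy MBRs as (min_x, min_y, max_x, max_y); coordinates are made up.
LOCATIONS = {
    "London England": (0, 0, 10, 10),
    "Buckingham Palace": (2, 2, 3, 3),
    "London Eye": (4, 4, 5, 5),
    "London Ontario": (100, 100, 110, 110),
    "Victoria Park": (102, 102, 103, 103),
}
QUERY = ("After the subway explosions, I went to the Eye, and then "
         "Buckingham Palace while I was in London.")


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def mbr_overlaps(a, b):
    """True if two axis-aligned rectangles intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def build_location_docs(locations):
    """One document per location: the names of every location overlapping its MBR."""
    docs = {}
    for name, mbr in locations.items():
        names = [o for o, o_mbr in locations.items() if mbr_overlaps(mbr, o_mbr)]
        docs[name] = [tok for n in names for tok in tokenize(n)]
    return docs


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_text, docs, threshold=0.1):
    """Score every location document against the query; keep those above threshold."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # one common idf variant
    q = {t: c * idf[t] for t, c in Counter(tokenize(query_text)).items() if t in idf}
    scored = []
    for name, tokens in docs.items():
        vec = {t: c * idf[t] for t, c in Counter(tokens).items()}
        score = cosine(q, vec)
        if score >= threshold:
            scored.append((score, name))
    return sorted(scored, reverse=True)
```

Calling `rank(QUERY, build_location_docs(LOCATIONS))` returns the candidate list. In this toy data the contained landmarks pull the England entry ahead of the Ontario one, which is the reinforcement effect the slide describes.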
IR Query Method
• Initial documents from the query are treated as a candidate list
• Results need to be clustered by type of document: location, event, etc.
• For each possible location record reflected by a returned document, assemble the relevant documents from the tables that handle events, people/things, etc.
• For each collection of documents related to a location entry, combine the documents’ term frequencies, and conduct relevance feedback on the original query vector using the Ide regular method
• Requery the system for location documents
• Finally, the location names are searched for in the original query document.
  • If found, a match is made and provided to the user
  • If the location is not found, the system provides that name as an implied location
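The feedback and final matching steps above can be sketched as follows. Ide regular feedback adds the vectors of the documents judged relevant to the query vector and subtracts those judged non-relevant, without normalizing by set size. How the project chooses the two sets, and whether it uses a non-relevant set at all, is not stated in the slides, so the function signatures here are assumptions; dropping non-positive weights is a common convention rather than something taken from the source.

```python
def ide_regular(query, relevant, nonrelevant=()):
    """Ide regular feedback: Q' = Q + sum(relevant) - sum(nonrelevant).

    All vectors are dicts mapping term -> weight.
    """
    new_query = dict(query)
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + weight
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) - weight
    # Keep only positively weighted terms (a common convention, assumed here).
    return {t: w for t, w in new_query.items() if w > 0}


def classify_reference(location_name, query_text):
    """Explicit if the name appears in the query document, otherwise implied."""
    found = location_name.lower() in query_text.lower()
    return "explicit" if found else "implied"
```

For example, feeding the combined term frequencies of the event and people/thing documents tied to a candidate location in as `relevant` expands the query with those terms before the requery.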
Project Framework
• MySQL holds the gazetteer spatial database and does the R-Tree indexing of the locations’ MBRs
• Lucene is used to index the documents and conduct the full-text search
• The Alexandria Digital Library Gazetteer is used as the gazetteer’s source data.
  • Retrieved ~5.9 million locations with names and MBRs.
  • Originally in PostgreSQL and PostGIS.
  • Unsupported, and downloads do not restore on current PostgreSQL/PostGIS versions.
• TREC 2004 was also indexed and searched for test documents
Experimental Methods
• Focused on query documents that mention a specific city, London, to see the effect of IR selection without feedback.
• Search TREC for 100 documents containing the word “London” and identify how the word “London” is used in the text.
• Make up documents that reference London and other events related to London which are in the database.
• Use the manually created documents to find the similarity threshold that gives the best results.
• Use the TREC documents as queries, and identify the true and false positives, and the true and false negatives.
• Still conducting tests.
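Counting true and false positives and negatives, as in the last experimental step above, can be sketched as below. The slides do not say which summary measures are reported, so precision and recall here are an assumption, and the function name and set-based inputs are hypothetical.

```python
def evaluate(predicted, actual, all_docs):
    """Compare predicted matches against hand-labeled ones.

    predicted: ids the system marked as referring to the target location
    actual:    ids that truly refer to it (manual judgment)
    all_docs:  every id that was used as a query
    """
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_docs - predicted - actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}
```

Running this once per candidate similarity threshold on the manually created documents is one way to pick the threshold before applying it to the TREC queries.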
Summary
• Developed the basic design of a gazetteer that can be used not just for locations but also for other reference extraction in text documents.
• Current work in this area has been more focused on humanities studies and direct human-machine interaction, rather than automated IR.
• Automated feedback using location-related data still needs to be done as future work.