Automated Building of OAI Compliant Repository from Legacy Collection

Kurt Maly (Maly@cs.odu.edu), Department of Computer Science, Old Dominion University. May 2006.

Presentation Transcript


  1. Automated Building of OAI Compliant Repository from Legacy Collection
     Kurt Maly (Maly@cs.odu.edu)
     Department of Computer Science, Old Dominion University
     May 2006
     ELPUB 2006, June 14-16, Bansko, Bulgaria

  2. Contents
     • Introduction
     • Background
     • System Architecture
     • Metadata Extraction Approach
     • Experiments
     • Screenshots

  3. Introduction
     • Key problem: extracting metadata from a legacy collection
     • OCR alone is not sufficient to make legacy documents searchable.
     • Manual metadata extraction is costly and time-consuming.
         • Creating metadata for 1 million documents would take about 60 employee-years (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
     • Automatic extraction tools are essential for rapid dissemination at reasonable cost.
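To put the 60 employee-year estimate in per-document terms, the arithmetic can be sketched as follows. The figure of 2,000 working hours per employee-year is an assumption made here for illustration; it does not come from the slides.

```python
# Assumption (not from the slides): one employee-year is about 2,000 working hours.
HOURS_PER_EMPLOYEE_YEAR = 2000


def minutes_per_document(employee_years, documents,
                         hours_per_year=HOURS_PER_EMPLOYEE_YEAR):
    """Average manual effort per document, in minutes."""
    total_minutes = employee_years * hours_per_year * 60
    return total_minutes / documents


# Figures quoted on the slide: 60 employee-years for 1 million documents.
per_doc = minutes_per_document(60, 1_000_000)
print(f"{per_doc:.1f} minutes per document")
```

Under that assumption the estimate works out to a few minutes of manual effort for every document, which is why the cost grows so quickly with collection size.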

  4. Background: Digital Library and OAI-PMH
     • Digital Library (DL)
         • A DL is a network-accessible, searchable collection of digital information.
         • A DL provides a way to store, organize, preserve, and share information.
     • The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
         • Based on metadata harvesting
         • Two roles: Data Providers and Service Providers
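In OAI-PMH, a Service Provider harvests metadata by sending HTTP requests that carry a `verb` argument to a Data Provider. A minimal sketch of how such request URLs are formed is shown here; the base URL is a placeholder and the helper name `oai_request_url` is hypothetical, while `Identify`, `ListRecords`, and the `oai_dc` metadata prefix are defined by the protocol.

```python
from urllib.parse import urlencode

# Placeholder base URL of a data provider; not taken from the slides.
BASE_URL = "http://example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


# Ask the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# Harvest all records in unqualified Dublin Core.
list_records_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_records_url)
```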

  5. Background: Metadata Extraction
     • Rule-based Approach
         • Basic idea
             • Use a set of rules, based on human observation, to define how to extract metadata.
             • For example, a rule may be "The first line is the title".
         • Pros & Cons
             • No need for training from samples
             • Can extract different metadata from different documents
             • Rule writing may require significant technical expertise
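A rule of the kind quoted on the slide can be sketched in a few lines. This is an illustration only, not the authors' rule language: the function names are hypothetical, the title rule is relaxed to "first non-blank line", and the date rule is an extra assumed rule added to show how several rules combine.

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
# Assumed date rule: a month name, an optional comma, and a four-digit year.
DATE_RE = re.compile(r"\b(?:" + "|".join(MONTHS) + r"),? \d{4}\b")


def rule_title(lines):
    """Rule from the slide: the first line (here, first non-blank line) is the title."""
    for line in lines:
        if line.strip():
            return line.strip()
    return None


def rule_date(lines):
    """Assumed rule: the first text matching 'Month YYYY' is the date."""
    for line in lines:
        match = DATE_RE.search(line)
        if match:
            return match.group(0)
    return None


def extract_metadata(lines):
    """Apply each rule independently and collect the results."""
    return {"title": rule_title(lines), "date": rule_date(lines)}


sample = [
    "",
    "Automated Building of OAI Compliant Repository from Legacy Collection",
    "Kurt Maly",
    "May, 2006",
]
metadata = extract_metadata(sample)
print(metadata)
```

No training data is needed, but every new layout needs someone to write or adjust rules like these, which is the trade-off listed above.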

  6. Background: Metadata Extraction (cont.)
     • Machine-Learning Approach
         • Basic idea
             • Learn the relationship between input and output from samples and make predictions for new data
         • Pros & Cons
             • Good adaptability, but it has to be trained from samples, which is time-consuming
             • Performance degrades with increasing heterogeneity
             • Difficult to add new fields to be extracted
             • Difficult to select the right features for training

  7. Background: Document Classification
     • Classify document pages into groups based on their visual similarity:
         • the geometrical arrangement of components
         • typographic features such as font
     • Existing Approaches
         • MXY-Tree
             • Recursively cuts a page into blocks by separators (e.g. lines) as well as white space. A page is converted to a tree.
         • m*n bins
             • Cuts a page into m*n equal-size bins; a bin is either a text bin (if more than half of it is text) or a white-space bin.
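The m*n bins idea can be sketched on a toy page. In this sketch a page is assumed to be a grid of 0/1 cells (1 meaning text), and the similarity measure (fraction of bins with the same label) is an assumption chosen for illustration; the slides do not say how two bin grids are compared.

```python
def to_bins(page, m, n):
    """Cut a 0/1 page grid into m rows by n columns of bins.

    A bin is labeled 1 (text) if more than half of its cells are text,
    otherwise 0 (white space). Bins are equal in size when the page
    dimensions are divisible by m and n.
    """
    height, width = len(page), len(page[0])
    bins = []
    for i in range(m):
        r0, r1 = i * height // m, (i + 1) * height // m
        row = []
        for j in range(n):
            c0, c1 = j * width // n, (j + 1) * width // n
            cells = [page[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(1 if 2 * sum(cells) > len(cells) else 0)
        bins.append(row)
    return bins


def bin_similarity(a, b):
    """Fraction of bins carrying the same label (assumed measure)."""
    total = sum(len(row) for row in a)
    same = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return same / total


page_a = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
page_b = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

bins_a = to_bins(page_a, 2, 2)
bins_b = to_bins(page_b, 2, 2)
print(bins_a, bins_b, bin_similarity(bins_a, bins_b))
```

Note that the bottom bins of both pages are exactly half text, so under the "more than half" rule they count as white space.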

  8. System Architecture

  9. System Architecture (cont.)
     • Main components:
         • Scan and OCR: Commercial OCR software is used to scan the documents.
         • Metadata Extractor: Extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, the extracted metadata may need to be mapped to the Dublin Core format.
         • OAI layer: Makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses.
         • Search Engine
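The mapping and encoding steps described for the Metadata Extractor and OAI layer can be sketched as follows. The local field names, the mapping table, and the sample record are hypothetical; the two namespace URIs are the standard ones for `oai_dc` and the Dublin Core element set. The sketch produces only the metadata payload, not the full OAI-PMH response envelope.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from local database fields to Dublin Core elements.
FIELD_TO_DC = {"title": "title", "author": "creator", "report_date": "date"}


def to_oai_dc(record):
    """Encode a local metadata record as an oai_dc XML fragment.

    Fields without a Dublin Core mapping are skipped.
    """
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, value in record.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name is None:
            continue
        ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative record as it might come out of the local database.
record = {
    "title": "Automated Building of OAI Compliant Repository from Legacy Collection",
    "author": "Kurt Maly",
    "report_date": "2006-05",
    "internal_id": "demo-1",  # no Dublin Core mapping, so it is dropped
}
xml_fragment = to_oai_dc(record)
print(xml_fragment)
```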

  10. Template-Based Metadata Extraction

  11. Template-Based Metadata Extraction: Document Classification
      • Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata):
          • the geometrical arrangement of metadata fields on the metadata page
          • typographic features such as font size, text alignment, and text height
      • Metadata pages are identified by a set of rules.

  12. Template-Based Metadata Extraction: Document Classification
      (Diagram labels: Document Pages, MXY-Tree, m*n bins, Similarity, Integration)

  13. Template sample

  14. Experiments: Document Classification
      • Downloaded 7413 documents from the DTIC collection
      • Randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups

  15. Experiments: Metadata Extraction
      • Selected 100 documents from DTIC
      • Divided them into 7 classes
      • Created a template for each class

  16. Template-based experiment

  17. Screenshots – OAI

  18. Screenshots – Search Engine

  19. Conclusions
      • We describe how to automate the task of converting an existing corpus into an OAI-compliant repository.
      • We propose a metadata extraction approach that addresses the challenge of achieving the desired accuracy for a large, heterogeneous collection of documents.
