ELPUB 2006, June 14-16, Bansko, Bulgaria.
1. Automated Building of OAI Compliant Repository from Legacy Collection
Kurt Maly, Maly@cs.odu.edu, Department of Computer Science, Old Dominion University, May 2006
2. Contents
Introduction; Background; System Architecture; Metadata Extraction Approach; Experiments; Screenshots
3. Introduction
Key problem: extracting metadata from a legacy collection. OCR alone is not sufficient for making legacy documents searchable. Manual metadata extraction is costly and time-consuming: it would take about 60 employee-years to create metadata for 1 million documents (estimate made by Lou Rosenfeld at the DCMI 2003 workshop). Automatic extraction tools are essential for rapid dissemination at reasonable cost.
4. Background: Digital Library and OAI-PMH
Digital Library (DL): a DL is a network-accessible and searchable collection of digital information. A DL provides a way to store, organize, preserve, and share information. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs. It is based on metadata harvesting, with two roles: Data Providers and Service Providers.
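As a rough illustration of metadata harvesting, a service provider issues HTTP requests that carry an OAI-PMH verb such as `ListRecords` and a `metadataPrefix` such as `oai_dc`. The sketch below only builds such a request URL; the base URL and the helper name `list_records_url` are assumptions made for illustration and do not come from the slides.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, not a real repository.
url = list_records_url("http://example.org/oai")
```

A data provider answers such a request with an XML document containing the matching metadata records.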
5. Background: Metadata Extraction
Rule-based approach. Basic idea: use a set of rules, derived from human observation, to define how to extract metadata. For example, a rule may be "the first line is the title". Pros and cons: no need for training from samples; can extract different metadata from different documents; rule writing may require significant technical expertise.
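The example rule can be sketched in a few lines. This is a minimal illustration, not the system's rule language: the function name `extract_title` is hypothetical, and skipping blank lines is an added assumption (OCR output often begins with empty lines).

```python
def extract_title(lines):
    """Rule: the first non-empty line of the page is the title.

    Skipping blank lines is an assumption added for this sketch.
    """
    for line in lines:
        text = line.strip()
        if text:
            return text
    return None  # no text on the page, so the rule does not fire


page = ["", "  Automated Building of OAI Compliant Repository  ", "Kurt Maly"]
title = extract_title(page)
```

A real rule set would contain many such rules, one or more per metadata field, which is where the need for rule-writing expertise comes from.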
6. Background: Metadata Extraction (cont.)
Machine-learning approach. Basic idea: learn the relationship between input and output from samples and make predictions for new data. Pros and cons: good adaptability, but it has to be trained from samples, which is time-consuming; performance degrades with increasing heterogeneity; difficult to add new fields to be extracted; difficult to select the right features for training.
7. Background: Document Classification
Classify document pages into groups based on their visual similarity: the geometrical arrangement of components and the typographic features such as font. Existing approaches: MXY-Tree recursively cuts a page into blocks by separators (e.g. lines) as well as white space, so that a page is converted to a tree. M*N bins cuts a page into m*n equal-size bins; a bin is either a text bin (if more than half of it is text) or a white-space bin.
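The m*n bins idea can be sketched as follows. The page representation (a 2D list of 0/1 values, where 1 marks a text pixel), the function name `page_to_bins`, and the "T"/"W" labels are all assumptions made for this illustration; bin boundaries use integer division, so bins are only approximately equal when the page size is not a multiple of m or n.

```python
def page_to_bins(pixels, m, n):
    """Cut a page into m rows by n columns of bins.

    pixels: 2D list of 0/1 values, 1 = text pixel (assumed representation).
    Returns an m x n grid of "T" (text bin) or "W" (white-space bin).
    """
    height = len(pixels)
    width = len(pixels[0])
    bins = []
    for i in range(m):
        top, bottom = i * height // m, (i + 1) * height // m
        row = []
        for j in range(n):
            left, right = j * width // n, (j + 1) * width // n
            cells = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            text_fraction = sum(cells) / len(cells) if cells else 0.0
            # A bin is a text bin only if more than half of it is text.
            row.append("T" if text_fraction > 0.5 else "W")
        bins.append(row)
    return bins


page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
grid = page_to_bins(page, 2, 2)
```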
8. System Architecture
9. System Architecture (cont.)
Main components: Scan and OCR: commercial OCR software is used to scan the documents. Metadata Extractor: extracts metadata by using rules and machine learning techniques. The extracted metadata are stored in a local database. In order to support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format. OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata into XML format as responses. Search Engine.
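The mapping to Dublin Core and the XML encoding step might look roughly like the sketch below. The extracted field names (`title`, `author`, `report_date`, `organization`), the mapping table, and the function name `to_oai_dc` are hypothetical; only the two namespace URIs are the standard `oai_dc` and Dublin Core element namespaces. A full OAI-PMH response would wrap this fragment in the protocol's response envelope, which is omitted here.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from extracted field names to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "report_date": "date",
    "organization": "publisher",
}


def to_oai_dc(record):
    """Encode one extracted record (a dict) as an oai_dc XML fragment."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        value = record.get(field)
        if value:  # skip fields the extractor did not find
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


xml_fragment = to_oai_dc({"title": "A Report", "author": "Jane Doe"})
```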
10. Template-Based Metadata Extraction
11. Template-Based Metadata Extraction- Document Classification
Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata): the geometrical arrangement of metadata fields on the metadata page and the typographic features such as font size, text alignment, and text height. Metadata pages are identified by a set of rules.
12. Template-Based Metadata Extraction- Document Classification
Document pages; MXY-Tree; m*n bins; similarity; integration. Future: sim = a*sim_tree + b*sim_bin. Current: convert the two pages to two MXY-Trees and compute the edit distance D between the two trees; if D > min(length of tree1, length of tree2)/4, return 0. Convert the two pages into m*n bins and compute the similarity (the percentage of bins with the same type); if sim < 0.7, return 0. Else return 1.
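The current decision rule can be sketched as below, under one reading of the slide: both tests must pass for two pages to count as similar. The tree edit distance is taken as an input rather than computed, since the slides do not specify the algorithm; the function names and the "T"/"W" bin labels are assumptions made for this illustration.

```python
def bin_similarity(bins_a, bins_b):
    """Fraction of bins that have the same type in both pages."""
    flat_a = [b for row in bins_a for b in row]
    flat_b = [b for row in bins_b for b in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("pages must use the same non-empty m*n grid")
    same = sum(1 for a, b in zip(flat_a, flat_b) if a == b)
    return same / len(flat_a)


def pages_similar(tree_distance, tree_len_a, tree_len_b, bins_a, bins_b):
    """Return 1 if two pages are judged similar, else 0.

    tree_distance: edit distance D between the two MXY-Trees,
    computed elsewhere (not shown in this sketch).
    """
    if tree_distance > min(tree_len_a, tree_len_b) / 4:
        return 0
    if bin_similarity(bins_a, bins_b) < 0.7:
        return 0
    return 1


page_a = [["T", "W"], ["W", "W"]]
page_b = [["T", "W"], ["W", "T"]]  # 3 of 4 bins match page_a
page_c = [["W", "T"], ["W", "T"]]  # 1 of 4 bins matches page_a

close = pages_similar(1, 8, 12, page_a, page_b)
far_tree = pages_similar(3, 8, 12, page_a, page_b)
far_bins = pages_similar(1, 8, 12, page_a, page_c)
```

The planned weighted form, sim = a*sim_tree + b*sim_bin, would replace the two hard thresholds with a single score; the weights a and b are not given in the slides.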
13. Template sample
14. Experiments- Document Classification
Downloaded 7413 documents from the DTIC collection; randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups.
15. Experiments- Metadata Extraction
Selected 100 documents from DTIC; divided them into 7 classes; created a template for each class.
16. Template-based experiment
17. Screenshots OAI
18. Screenshots Search Engine
19. Conclusions
We describe how to automate the task of converting an existing corpus into an OAI-compliant repository. We propose a metadata extraction approach to address the challenge of achieving the desired accuracy for a large heterogeneous collection of documents.