Biological Data Extraction and Integration A Research Area Background Study Cui Tao Department of Computer Science Brigham Young University
Research Field Overview • My research • Semantic Web • Data Integration • Schema Matching • Information Extraction • Bioinformatics
Information Extraction • “Information extraction systems process text documents and locate a specific set of relevant items.” [Califf99] • “Because the WWW consists primarily of text, information extraction is central to any effort that would use the web as a resource for knowledge discovery.” [Freitag98]
Information Extraction • Traditional information extraction • Hidden web crawling • Biological data extraction
Traditional Information Extraction • Different groups of IE tools: [Laender02] • Wrapper generation tools • NLP-based and learning-based tools • Ontology-based tools
Traditional Information Extraction • Wrapper generation tools • Lixto [Baumgartner01] • Supervised wrapper generation • Semi-automatic • Not robust; does not work well with unstructured data • ROADRUNNER [Crescenzi01] • Fully automatic wrapper generation • Does not generate robust and general wrappers • Only works for highly regular web pages
Traditional Information Extraction • NLP-based and learning-based tools • SRV [Freitag98] • Top-down learner • Learns based on simple and relational features • Single slot filling • RAPIER [Califf99] • Bottom-up learner • Learns pre-filler, slot filler, and post-filler patterns • Only works for free text • Single slot filling
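The pre-filler, slot-filler, and post-filler idea can be illustrated with a minimal sketch. This is not RAPIER itself: the rule below is hand-written and hypothetical (the names PRE_FILLER, FILLER, POST_FILLER, and fill_slot are assumptions for illustration), whereas a learner such as RAPIER would induce rules of this shape from annotated examples.

```python
import re

# Hypothetical hand-written rule in the spirit of a pre-filler / filler /
# post-filler pattern; a real learner would induce such rules from examples.
PRE_FILLER = r"located in\s+"                                # context before the slot
FILLER = r"(?P<slot>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"          # capitalized word sequence
POST_FILLER = r"\s*[,.]"                                     # context after the slot

RULE = re.compile(PRE_FILLER + FILLER + POST_FILLER)

def fill_slot(text):
    """Return the first filler matched by the rule, or None (single slot filling)."""
    match = RULE.search(text)
    return match.group("slot") if match else None

example = "The company is located in Salt Lake City, and it is hiring."
city = fill_slot(example)
```

Because the rule fills exactly one slot, a record with several fields would need one such rule per field, which is the single-slot limitation noted above.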
Traditional Information Extraction • Ontology-based tools • BYU Ontos [Embley99] • Based on domain-specific extraction ontologies • Robust to changes • Multiple slot filling • Ontologies have to be built manually
Hidden Web Crawling • Traditional IE tools: publicly indexable web pages • Hidden web crawling • Crawl the hidden web according to a user’s query • HiWE (Hidden Web Exposer) [Raghavan01] • Matches the source form representation to task-specific DB concepts • Fills out and submits forms • Retrieves information hidden behind the form
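The form-filling step can be sketched roughly as follows. This is an illustrative toy and not HiWE's actual matching procedure: TASK_DB, normalize, best_concept, and form_submissions are hypothetical names, and plain string similarity from the standard library stands in for whatever label matching a real crawler uses.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical task-specific database: concept name -> candidate values.
TASK_DB = {"organism": ["human", "mouse"], "gene name": ["TP53"]}

def normalize(label):
    """Lowercase a form label and drop surrounding punctuation and spaces."""
    return label.lower().strip(" :*")

def best_concept(label, concepts, threshold=0.5):
    """Return the concept whose name is most similar to the label, if any."""
    scored = [(SequenceMatcher(None, normalize(label), c).ratio(), c) for c in concepts]
    score, concept = max(scored)
    return concept if score >= threshold else None

def form_submissions(form_labels, task_db):
    """Build one value assignment per combination of matched concept values."""
    matched = {}
    for label in form_labels:
        concept = best_concept(label, task_db)
        if concept is not None:          # unmatched fields are left unfilled
            matched[label] = task_db[concept]
    labels = list(matched)
    return [dict(zip(labels, values)) for values in product(*matched.values())]

submissions = form_submissions(["Organism:", "Gene", "Submit"], TASK_DB)
```

Each dictionary in submissions is one candidate way to fill the form; a crawler would submit each one and collect the result pages.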
Biological Data Extraction • Mainly from plain text • Extract biological terms • Dictionary-based • Rule-based • Extract relationships between biological terms/elements • Example systems • BLAST-based name identifier [Krauthammer00] • PASTA (Protein Active Site Template Acquisition) [Gaizauskas03]
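The difference between dictionary-based and rule-based term extraction can be shown with a small sketch. The dictionary and the rule here are hypothetical and far simpler than those of the cited systems; they only illustrate the two styles.

```python
import re

# Hypothetical mini-dictionary; real systems rely on large curated lexicons.
PROTEIN_DICTIONARY = {"p53", "insulin", "hemoglobin"}

# Hypothetical rule: two or more capital letters followed by digits
# (e.g. "BRCA1") is treated as a candidate gene symbol.
GENE_RULE = re.compile(r"\b[A-Z]{2,}[0-9]+\b")

def extract_terms(text):
    """Return (dictionary hits, rule hits) found in the text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    dictionary_hits = [t for t in tokens if t.lower() in PROTEIN_DICTIONARY]
    rule_hits = GENE_RULE.findall(text)
    return dictionary_hits, rule_hits

sentence = "Mutations in BRCA1 alter how p53 and insulin are regulated."
dictionary_hits, rule_hits = extract_terms(sentence)
```

The dictionary lookup only finds terms it already knows, while the rule can pick up unseen symbols at the cost of false positives.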
The Semantic Web • Machine-understandable web • Gives information a well-defined meaning • Allows automation of tasks • Provides biologists • Intelligent information services • Personalized web resources • Semantically empowered search engines
The Semantic Web • Semantic web languages • XOL (XML-based Ontology Exchange Language) • SHOE (Simple HTML Ontology Extension) • OML (Ontology Markup Language) • RDF(S) (Resource Description Framework (Schema)) • OIL (Ontology Interchange Language) • DAML+OIL (DARPA Agent Markup Language + OIL) • OWL (Web Ontology Language) • Semantic annotation • Old: indexing of publications in libraries • New: information extraction
Schema Matching • Previous methods [Raghavan01]: • Individual matchers vs. combining matchers • Schema-based matchers vs. instance-based matchers • Learning-based matchers vs. rule-based matchers • Element-level matchers vs. structure-level matchers
Schema Matching • LSD (Learning Source Descriptions) [Doan01] • Semi-automatic • Learning-based • Both schema-level and instance-level • Only 1-1 mappings • GLUE & CGLUE [DMD+03] • Ontology alignment • CGLUE: Complex (non-1-1) mappings
Schema Matching • Cupid [Madhavan01] • Rule-based matcher • Both element-level and structure-level • Schema-based • Works on hierarchical schemas with schema tree • Linguistic similarity & structure similarity • Matches tree elements by weighted similarities
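Matching by weighted similarities can be sketched as a weighted mean of a linguistic score and a structural score. The code below is a simplified illustration under stated assumptions, not Cupid's implementation: string similarity from difflib stands in for linguistic matching, the structural score is the fraction of leaves with a similar leaf on the other side, and the function names and the default weight w_struct=0.5 are hypothetical.

```python
from difflib import SequenceMatcher

def linguistic_similarity(name_a, name_b):
    """Stand-in for linguistic matching: plain string similarity in [0, 1]."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def structural_similarity(leaves_a, leaves_b, threshold=0.7):
    """Fraction of leaves in both subtrees that have a similar leaf on the other side."""
    def linked(leaf, others):
        return any(linguistic_similarity(leaf, o) >= threshold for o in others)
    linked_count = (sum(linked(a, leaves_b) for a in leaves_a)
                    + sum(linked(b, leaves_a) for b in leaves_b))
    total = len(leaves_a) + len(leaves_b)
    return linked_count / total if total else 0.0

def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1 - w_struct) * lsim

# Two hypothetical schema-tree elements and the leaves beneath them.
lsim = linguistic_similarity("Address", "Addr")
ssim = structural_similarity(["name", "street"], ["name", "city"])
wsim = weighted_similarity(lsim, ssim)
```

Raising w_struct makes the match depend more on the subtrees than on the element names, which helps when names are abbreviated or misleading.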
Schema Matching • COMA (COmbining MAtch) [Do02] • Combines different matchers • Interactive with users • Also an evaluation platform for different matchers
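Combining matchers can be illustrated by running several independent matchers on each element pair and aggregating their scores. This is a minimal sketch, not COMA's matcher library: the two matchers, the aggregation strategies, and the threshold are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

def name_matcher(a, b):
    """String similarity between two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def prefix_matcher(a, b):
    """1.0 when one name is a prefix of the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

def combine(scores, strategy="average"):
    """Aggregate the scores of the individual matchers into one value."""
    if strategy == "max":
        return max(scores)
    if strategy == "min":
        return min(scores)
    return sum(scores) / len(scores)

def match(source, target, matchers, strategy="average", threshold=0.3):
    """Map each source element to its best target if the combined score is high enough."""
    mapping = {}
    for s in source:
        scored = [(combine([m(s, t) for m in matchers], strategy), t) for t in target]
        score, best = max(scored)
        if score >= threshold:
            mapping[s] = best
    return mapping

mapping = match(["custName", "zip"],
                ["customerName", "zipCode", "phone"],
                [name_matcher, prefix_matcher])
```

Swapping the strategy argument shows how the same matcher results can yield different mappings, which is also what makes such a setup usable for comparing matchers.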
Biological Data Integration • Challenges: • Huge amount of data, growing rapidly • Highly diverse in granularity and variety • Different terminologies, ID systems, units • Unstable and unpredictable • Different interfaces and querying capabilities
Biological Data Integration • SRS (Sequence Retrieval System) [Etzold96] • Keyword-based retrieval system • Returns simple aggregation of matched records • Only works for relational databases • BioKleisli [Davidson97] • Integrated digital library in biomedical domain • No global schema or ontology • A mediator works on top of source-specific wrappers • Horizontal integration
Biological Data Integration • DiscoveryLink [Haas01] • Mediator-based, wrapper-oriented • Provides virtual DB access from different sources • Cannot deal with complex source data • Hard to add new sources • Requires knowledge of specific query language • TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [Stevens00] • Mediator-based • Uses global ontology and schema • Maps source and target concepts manually • Not robust to changes • Hard to add new sources
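The mediator-over-wrappers architecture can be sketched in a few lines. This toy does not reproduce DiscoveryLink or TAMBIS: the in-memory sources, the Wrapper and Mediator classes, and the global fields "gene" and "species" are all hypothetical.

```python
# Hypothetical in-memory sources that use different field names.
SOURCE_A = [{"gene_symbol": "TP53", "organism": "human"}]
SOURCE_B = [{"name": "TP53", "species": "mouse"},
            {"name": "BRCA1", "species": "human"}]

class Wrapper:
    """Translates one source's records into the global schema."""
    def __init__(self, records, field_map):
        self.records = records
        self.field_map = field_map  # global field name -> source field name

    def query(self, **criteria):
        results = []
        for record in self.records:
            normalized = {g: record[s] for g, s in self.field_map.items()}
            if all(normalized.get(k) == v for k, v in criteria.items()):
                results.append(normalized)
        return results

class Mediator:
    """Sends a global-schema query to every wrapper and merges the answers."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, **criteria):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(**criteria))
        return results

mediator = Mediator([
    Wrapper(SOURCE_A, {"gene": "gene_symbol", "species": "organism"}),
    Wrapper(SOURCE_B, {"gene": "name", "species": "species"}),
])
tp53_records = mediator.query(gene="TP53")
```

The manually written field_map in each wrapper is where the maintenance cost lies: every new or changed source needs its mapping written or revised by hand.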
Bioinformatics • Biological ontology • Bioinformatics data source discovery • Trustworthiness and provenance
Bioinformatics • Biological ontology • GO (Gene Ontology) [Ashburner00] • Controlled vocabulary • Molecular Function (7278 terms) • Biological Process (8151 terms) • Cellular Component (1379 terms) • Represents knowledge hierarchically
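A hierarchical vocabulary of this kind can be queried by walking is-a links upward. The fragment below is a toy illustration with a handful of term names and no identifiers; it is not the actual GO data, and the IS_A table and ancestors function are hypothetical.

```python
# Toy is-a fragment: term -> list of parent terms. A term may have several
# parents, so the hierarchy is a directed acyclic graph rather than a tree.
IS_A = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
}

def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is-a links upward."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

lineage = ancestors("cellular metabolic process")
```

An annotation made with a specific term therefore also implies all of its ancestors, which is what lets a query on a general term retrieve records annotated with more specific ones.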
Bioinformatics • Biological ontology • LinKBase [Verschelde03] • Originally a biomedical ontology • Over 2,000,000 medical concepts • Over 5,300,000 instantiations • 543 relations • Expanded using GO • Only describes simple binary relationships
Bioinformatics • Bioinformatics data source discovery • First step in integrating or answering queries • Example system [Rocco03]: • Pre-defined classes with class descriptions • Tries to map a source to a class • Trustworthiness and provenance • Trustworthiness: • Consistency • Reliability • Competence • Honesty • Provenance: • Record history • Transformations • Annotations • Updates
Summary and Future Work • My research: Semantic Web, Schema Matching, Information Extraction, Bioinformatics • Overcome drawbacks of existing systems • Develop new algorithms for locating and extracting data from heterogeneous biological sources