Mopping up the Flood of Data with Web Services

Gary Wiggins, Indiana University School of Informatics, wiggins@indiana.edu

Overview of the Talk • Data Mining and Knowledge Discovery • DMKD in Bioinformatics • DMKD in Chemistry • Public Chemistry Databases for DMKD • Overview of Web Services • NIH-funded Projects Underway or Planned at Indiana University • Educational Opportunities at IU

Data Mining and Knowledge Discovery (DMKD) • Techniques began to be used around 1989 • Rapid growth in the mid 1990s, with DMKD field emerging around 1995 • Built on DM tools such as Machine Learning

Data Mining • One of the steps in Knowledge Discovery • Concerned with the actual extraction of knowledge from data • Efficient and scalable methods for mining interesting patterns and knowledge and discovering hidden facts contained in large databases

Data Mining Techniques • Efficient classification methods • Clustering • Outlier analysis • Frequent, sequential, and structured pattern analysis • Visualization and spatial/temporal analysis tools

Knowledge Discovery (KD) • “KD is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from large collections of data.” --Fayyad et al., as quoted by Cios and Kurgan • The KD process involves: • Understanding and preparation of the data • Data Mining (DM) • Verification and application of the discovered knowledge

Framework for KD Process • Steps range from very few, e.g., • Data collection and understanding • Data mining • Implementation • To multi-step models, e.g., Cios and Kurgan’s six-step DMKD process model

Cios and Kurgan’s Six-Step DMKD Process Model • Understanding the problem domain • Understanding the data • Preparation of the data (~50% or more of the effort is spent on this step) • Data mining • Evaluation of the discovered knowledge • Using the discovered knowledge

General Data Mining/Data Analysis Systems • SAS Enterprise Miner • SPSS • Insightful S-Plus • IBM DB2 Intelligent Miner • Microsoft SQLServer 2005 • SGI MLC++ and MineSet Tree Visualizer • Inxight VizServer

Trends: Major Conferences • Knowledge Discovery and Data Mining (KDD) 2005 • http://www.informatik.uni-trier.de/~ley/db/conf/kdd/kdd2005.html • International Conference on Machine Learning (ICML) 2006 • http://www.icml2006.org/icml2006/technical/accepted.html • SIAM Conference on Data Mining 2006 • http://www.siam.org/meetings/sdm06/proceedings.htm

12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, August 20-23, 2006 • Areas of Interest on the Research Track: • Applications of data mining (biomedicine, business, e-commerce, defense) • Data and result visualization • Data warehousing • Data mining for community generation, social network analysis and graph-structured data • Foundations of data mining • Interactive and online data mining • KDD framework and process • Mining data streams • Mining high-dimensional data • Mining sensor data • Mining text and semi-structured data • Mining multi-media data • Novel data mining algorithms • Privacy and data mining • Robust and scalable statistical methods • Pre-processing and post-processing for data mining • Security issues • Spatial and temporal data mining

Trends in DMKD • OLAP (On-Line Analytical Processing) • Data warehousing • Association rules • High Performance DMKD systems • Visualization techniques • Applications of DM • More recently: • Database products that incorporate DM tools • New developments in design and implementation of the DMKD process • Information visualization products as end-user queries • XML

XML: the Key to DM and KD? • Or simply a data exchange protocol? • Allows for the description and storage of structured or semi-structured data and their relationships • Can be used to exchange data in a platform-independent way • BUT: only one paper at the major conferences listed earlier dealt with XML

XML helps: • Standardize communication between diverse DM tools and databases (I/O procedures) • Build standard data repositories sharing data between different DM tools that work on different software platforms • Implement communication protocols between DM tools • Provide a framework for integration of and communication between different DMKD steps
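As a minimal sketch of the exchange idea on this slide, the Python below writes a few records to XML and reads them back, as two separate DM tools might. The `dataset`/`record`/`field` element names and the sample values are assumptions made up for illustration, not a standard schema.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Serialize a list of dicts into a small XML document (hypothetical schema)."""
    root = ET.Element("dataset")
    for rec in records:
        row = ET.SubElement(root, "record")
        for field, value in rec.items():
            # Each value is stored as text, so type information is lost on purpose.
            ET.SubElement(row, "field", name=field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_records(xml_text):
    """Parse the same hypothetical schema back into a list of dicts of strings."""
    root = ET.fromstring(xml_text)
    return [
        {f.get("name"): f.text for f in row.findall("field")}
        for row in root.findall("record")
    ]


# Illustrative values only.
records = [
    {"gene_id": "G1", "expression": 2.5},
    {"gene_id": "G2", "expression": 0.7},
]
xml_text = records_to_xml(records)
restored = xml_to_records(xml_text)
print(xml_text)
print(restored)
```

The round trip returns every value as a string, which is one reason a shared schema or dictionary is usually needed on top of plain XML.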

Predictive Model Markup Language (PMML) and Other Tools • In conjunction with XML, PMML enables the automation of sharing of discovered knowledge between different domains and tools • XML-RPC • SOAP (Simple Object Access Protocol) • UDDI • OLAP • OLE DB-DM
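To make XML-RPC from the list above concrete, this sketch builds and decodes a request payload with Python's standard `xmlrpc.client` module. The method name `mining.classify` and its arguments are hypothetical, and no server is contacted.

```python
import xmlrpc.client

# Hypothetical remote method and arguments; only the XML payload is built here.
payload = xmlrpc.client.dumps(("CCO", 3), methodname="mining.classify")
print(payload)

# A receiving tool would decode the same payload back into Python values.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

SOAP follows the same request/response idea but wraps the call in an envelope with its own namespaces.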

Discovery Informatics: Definition • "Discovery Informatics is the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data." --William W. Agresti in 2003

Discovery Informatics • Discovery and Application of Information • Data Mining and Machine Learning are two aspects of Discovery Informatics.

Trends: Bioinformatics Conferences • International Conference on Intelligent Systems for Molecular Biology (ISMB) 2006 • http://ismb2006.cbi.cnptia.embrapa.br/papers.html • Research in Computational Molecular Biology (RECOMB) 2006 • http://www.informatik.uni-trier.de/~ley/db/conf/recomb/recomb2006.html • Pacific Symposium on Biocomputing (PSB) 2006 • http://helix-web.stanford.edu/psb06/

Main Areas of Research in Bioinformatics • Sequence alignment • Alternative splicing • Microarray analysis • Functional analysis • Analysis of single nucleotide polymorphisms (SNPs) • Natural language text analysis

DMKD Sessions at Major Bioinformatics Conferences • Databases and Data Integration • Text Mining and Information Extraction • Semantic Webs

Data Mining in Bioinformatics (Bajcsy) • Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed biomedical databases • Existing data mining tools for biodata analysis • Development of advanced, effective, and scalable data mining methods in biodata analysis

Preprocessing of Biodata • Integration of multiple microarray gene experiments must resolve inconsistent labels of genes to form a coherent data store. • Focus on quantitative quality metrics based on analytical and statistical data descriptors and on relationships among variables.

Semantic Integration of Heterogeneous Biomedical Databases • Combine multiple sources into a coherent data store • Find semantically equivalent real-world entities from several biomedical sources • Problems • Different labels for the same concept: gene_id vs. g_id • Time asynchronization: same gene analyzed at multiple development stages
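The label problem (gene_id vs. g_id) can be sketched with a small table of mapping rules. The source names, the `expr`/`expression_level` columns, and the values below are hypothetical. Real integration would also have to handle the timing problem, which this sketch ignores.

```python
# Hypothetical mapping rules: each source's column label -> a shared label.
MAPPING_RULES = {
    "source_a": {"gene_id": "gene_id", "expr": "expression"},
    "source_b": {"g_id": "gene_id", "expression_level": "expression"},
}


def normalize(source, record):
    """Rename a record's fields to the shared labels, dropping unmapped fields."""
    rules = MAPPING_RULES[source]
    return {rules[key]: value for key, value in record.items() if key in rules}


def integrate(sources):
    """Merge records from several sources into one store keyed by gene_id."""
    store = {}
    for source, records in sources.items():
        for record in records:
            row = normalize(source, record)
            store.setdefault(row["gene_id"], []).append({"source": source, **row})
    return store


# Illustrative values only.
sources = {
    "source_a": [{"gene_id": "BRCA1", "expr": 1.8}],
    "source_b": [
        {"g_id": "BRCA1", "expression_level": 2.1},
        {"g_id": "TP53", "expression_level": 0.4},
    ],
}
store = integrate(sources)
print(store)
```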

Approaches for Semantic Integration of Biodata • Construction of integrated biodata warehouses or biodatabases • Construction of a federation of heterogeneous distributed biodatabases • Must build up mapping rules or semantic ambiguity resolution rules across multiple databases

Existing Data Mining Tools for Biodata Analysis-I • Sequence Analysis, e.g., • NCBI/BLAST, ClustalW, HMMER, PHYLIP, MEME, TRANSFAC, MDScan, Vector NTI, Sequencher, MacVector • Structure Prediction and Visualization, e.g., • RasMol, Raster3D, Swiss-Model, Scope, MolScript, Cn3D

Existing Data Mining Tools for Biodata Analysis-II • Genome Analysis, e.g., • CAP3, Paracel GenomeAssembler, GenomeScan, GeneMark, GenScan, X-Grail, ORF Finder, GeneBuilder • Pathway Analysis and Visualization, e.g., • KEGG, EcoCyc/MetaCyc, GenMapp • Microarray Analysis, e.g., • ScanAlyze/Cluster/TreeView, Scanalytics MicroArray Suite, Profiler, Silicon Genetics

Biospecific Data Analysis Software Systems • Agilent GeneSpring • Spotfire • Invitrogen VectorNTI

Text Mining in Bioinformatics • Techniques have progressed from simple recognition of terms to extraction of interaction relationships in complex sentences. • Search objectives have broadened to a range of problems, e.g., • Improving homology search • Identifying cellular location • Deriving genetic network technologies

Current Work in Biomedical Text Mining (Cohen and Hersh) • Text mining operates at a finer level of granularity than information retrieval and text summarization. • TM examines relationships between specific kinds of information contained within and between documents. • Areas of active research: • Named entity recognition (genes, proteins, etc.) • Text classification • Synonym and abbreviation extraction • Relationship extraction • Hypothesis generation • Integrated frameworks
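As a rough illustration of abbreviation extraction, the sketch below uses a naive initial-letter heuristic: it finds an uppercase token in parentheses and checks it against the initials of the words just before it. This is an assumption-laden toy, not a method described by Cohen and Hersh. The sample sentence is made up.

```python
import re


def extract_abbreviations(text):
    """Find 'long form (ABBR)' pairs where ABBR matches the initials of the preceding words."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        # Words appearing before the parenthesis; keep as many as the abbreviation has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(abbr):]
        if len(candidate) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(candidate, abbr)
        ):
            pairs[abbr] = " ".join(candidate)
    return pairs


text = (
    "Analysis of single nucleotide polymorphisms (SNP) and "
    "named entity recognition (NER) are active areas."
)
abbreviations = extract_abbreviations(text)
print(abbreviations)
```

The heuristic misses abbreviations that skip words or use inner letters, which is part of why this remains an active research area.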

Systems Biology • Requires a shift in focus from genes and proteins to the system’s structure and dynamics • Four key properties: • System structures • System dynamics • Control method • Design method • Systems Biology Markup Language (SBML) and CellML

iSpecies.org

Data Mining in Chemistry “Modern experimentation (whether “classical” or high-throughput) should be based on the productive interplay of statistical techniques (design-of-experiments), molecular modeling as well as cheminformatics.” --Ulrich S. Schubert

Session on “Integration of Informatics and Knowledge Management Informatics”* • Integration of Informatics at the Systems Level and at the Data Level: Chris L. Waller, Ph.D., Director, World Wide Chemistry Informatics, Pfizer Global Research & Development • Integrated Knowledge Management at Bayer HealthCare: Pharmacophore Informatics: William J. Scott, Ph.D., Team Leader, Department for Chemistry Research, Bayer Pharmaceuticals Corporation • Building a Knowledge Enabled Organization: Cory R. Brouwer, Ph.D., Associate Director, Knowledge Management Informatics, Pfizer Global Research & Development • Knowledge Management: Building a Knowledge Enabled Organization: Victor Lobanov, Ph.D., Principal Scientist, MDI, Johnson & Johnson Pharmaceutical R&D *10th Annual Cheminformatics Conference, May 23-16, 2006, Philadelphia

Impact of HTS and Combinatorial Chemistry Research • Most impact in: • the pharmaceutical industry • medical research • catalyst research • More recently: • polymer and materials research.

Diversity of Data Mining in Chemistry • On 5/7/2006 there were 4072 references to either “datamining” or “data mining” in Chemical Abstracts. • 3416 different index terms were assigned to those records. • 2772 used 1-5 times (81%) • 298 used 6-10 times (9%) • 103 used 11-15 times (3%) • 71 used 16-20 times (2%) • 38 used 21-25 times (1%) • 24 used 26-30 times (1%) • 110 used 31-480 times (3%) • Most frequent co-term: “bioinformatics” with 480 hits or 12% of the occurrences
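The percentages on this slide can be rechecked with a few lines of Python. The sketch assumes the bucket shares are relative to the 3416 index terms and that the 12% for “bioinformatics” is relative to the 4072 references.

```python
total_refs = 4072   # references to "datamining" or "data mining"
total_terms = 3416  # distinct index terms assigned to those records

# Number of index terms by how many times each was used.
buckets = {
    "1-5": 2772,
    "6-10": 298,
    "11-15": 103,
    "16-20": 71,
    "21-25": 38,
    "26-30": 24,
    "31-480": 110,
}

# The buckets should account for every index term.
bucket_sum = sum(buckets.values())

# Share of index terms in each bucket, rounded to whole percent.
shares = {label: round(100 * count / total_terms) for label, count in buckets.items()}

# Assumed basis: 480 hits out of the 4072 references.
top_coterm_share = round(100 * 480 / total_refs)

print(bucket_sum, shares, top_coterm_share)
```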

SFS graph

Components of the Semantic Web for Chemistry • XML – eXtensible Markup Language • RDF – Resource Description Framework • RSS – Rich Site Summary • Dublin Core – allows metadata-based newsfeeds • OWL – for ontologies • BPEL4WS – for workflow and web services • Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.

Chemical Markup Language (CML) • Much of the semantics in a chemical article can be supported by CML • Molecules • Structures • Reactions and reaction schemes • Spectra (including annotations) • Physicochemical data • XML dictionaries and lexicons provide linguistic and semantic support for markup • Will lead to quicker authoring and higher quality of embedded structures and data through machine validation
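The fragment below is a simplified, hand-written CML-style molecule (ethanol, heavy atoms only), read with Python's standard library. It is a sketch: it assumes the usual `molecule`/`atomArray`/`bondArray` layout and the `http://www.xml-cml.org/schema` namespace. The dangling-reference check is only a stand-in for the machine validation mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Simplified, hand-written example; hydrogens are omitted.
CML_TEXT = """
<molecule xmlns="http://www.xml-cml.org/schema" id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
""".strip()

root = ET.fromstring(CML_TEXT)

# Map atom ids to element symbols.
atoms = {
    atom.get("id"): atom.get("elementType")
    for atom in root.findall("cml:atomArray/cml:atom", CML_NS)
}

# Each bond references two atom ids.
bonds = [
    tuple(bond.get("atomRefs2").split())
    for bond in root.findall("cml:bondArray/cml:bond", CML_NS)
]

element_counts = Counter(atoms.values())

# A simple machine check: every bond must point at atoms that exist.
dangling = [ref for bond in bonds for ref in bond if ref not in atoms]

print(atoms, bonds, dict(element_counts), dangling)
```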

Key Factors in the Success of the Chemical Semantic Web • Institutional Repositories: services deployed and supported at an institutional level to offer dissemination management, stewardship, and where appropriate, long-term preservation of both the intellectual work created by an institutional community and the records of the intellectual and cultural life of the institutional community • Open Access Movement

Knowledge-Driven Bioinformatics Enhanced with Chemistry

Text Mining (Banville) • “In the pharmaceutical field, it is ideally the marriage of biological and chemical information that needs to be the ultimate focus of text data mining applications.” • Problems: • Lack of universal publication standards for identifying each unique chemical entity • Selective indexing policies of A&I services • Need to understand how chemical structures link to biological processes

OSCAR3 Service • Open-source Java application under development by Peter Murray-Rust’s group at Cambridge (not yet published) • Extracts chemical information (e.g., melting points, infrared and NMR data, and mass spectral information) from either a paragraph of experimental data or a full paper • Produces an XML instance highlighting the chemical information with an Extensible Stylesheet Language (XSL) file • At IU, we are attaching a SOAP input/output engine for a web service based on OSCAR3.
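To give a feel for this kind of service, the Python sketch below pulls melting points out of an experimental paragraph with a regular expression, marks them up as XML, and wraps the result in a SOAP 1.1 envelope. It is not OSCAR3's algorithm or output format. The `paragraph`/`property` element names, the regular expression, and the sample text are all assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

# Matches e.g. "mp 123-125 °C" or "m.p. 88.5 °C" (hypothetical pattern).
MP_PATTERN = re.compile(
    r"\b(?:mp|m\.p\.)\s*(\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°C"
)


def annotate_melting_points(paragraph):
    """Return an XML element listing each melting point found in the text."""
    root = ET.Element("paragraph")
    for match in MP_PATTERN.finditer(paragraph):
        prop = ET.SubElement(root, "property", type="meltingPoint", units="degC")
        prop.set("low", match.group(1))
        # A single value is treated as a range with equal bounds.
        prop.set("high", match.group(2) or match.group(1))
        prop.text = match.group(0)
    return root


def wrap_in_soap(payload):
    """Place an XML payload inside a SOAP 1.1 Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return ET.tostring(envelope, encoding="unicode")


# Made-up experimental text.
sample = "White solid, mp 123-125 °C; IR (KBr) 1715 cm-1. Second crop, m.p. 88.5 °C."
annotated = annotate_melting_points(sample)
soap_text = wrap_in_soap(annotated)
print(soap_text)
```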

OSCAR at Work in the Future

Semantic Scholars’ Grid I (diagram) • Sources: Science.gov, PubMed, Google Scholar, e-Prints, DSpace, etc. • Components: Gatherer, Indexer, Analyzer, Local MD Store, Local Harvest Store • Labeled steps: query and get list; fetch MD and documents; index all local MD; run filter such as OSCAR2 on harvested MD and documents; store new MD

Semantic Scholars’ Grid II (diagram) • Foreign sources: ACM, IEEE, Google Scholar, CiteULike, Wiley, Connotea, Del.icio.us, etc. • Components: Foreign User Interface, Plug-in, Updater, Community Tools, SSG Viewer, Instant Citation Index, etc., Local MD Store • Labeled steps: synchronize SSG and foreign MD; update local MD; control foreign interactions; view all MD; access Community Tools; update and view foreign MD

Chemical Datamining Software • SureChem • http://surechem.reeltwo.com/ • CLiDE • Recognizes structures, reactions, and text • http://www.simbiosys.ca/clide/ • OSCAR • “OSCAR1” to check experimental data • http://www.ch.cam.ac.uk/magnus/checker.html • http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/ • CSR (Chemical Structure Reconstruction) • http://www.scai.fraunhofer.de/uploads/media/MZ-ERCIM05_04.pdf • MDL DocSearch—combines MDL’s Isentris platform and EMC’s Documentum

ChemDB http://cdb.ics.uci.edu/CHEM/Web/
