
Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration. Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012. Contents. Review of reading assignment Webs of data and semantic web Data on the web, linked data Deep web Data discovery



Presentation Transcript

  1. Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012

  2. Contents • Review of reading assignment • Webs of data and semantic web • Data on the web, linked data • Deep web • Data discovery • Data integration • Summary • Next week

  3. Reading • Data Quality European Union Presentation • ISO Technical Standards - General Reference

  4. Webs of data • Early Web - Web of pages • http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html • Semantic web started as a way to facilitate “machine accessible content” • Initially was available only to those with familiarity with the languages and tools, e.g. your parents could not use it • Webs of data grew out of this • One specific example is W3C’s Linked Open Data

  5. Semantic Web • http://www.w3.org/2001/sw/ • “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF). See also the separate FAQ for further information.”

  6. Terminology • Semantic Web • An extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation, www.semanticweb.org • Primer: http://www.ics.forth.gr/isl/swprimer/ • Semantic Grid • Semantic services to use the resources of many computers connected by a network to solve large scale computational/ data problems • Ontology (n.d.). The Free On-line Dictionary of Computing. http://dictionary.reference.com/browse/ontology • An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.

  7. Semantic Web Layers http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/

  8. Application Areas for SW • Smart search • Annotation (even simple forms), smart tagging • Geospatial • Implementing logic (rules), e.g. in workflows • Data integration • Verification …. and the list goes on • Web services • Web content mining with natural language parsing • User interface development (portals) • Semantic desktop • Wikis - OntoWiki, SemanticMediaWiki • Sensor Web • Software engineering • Explanation

  9. Semantic Web Basics • The triple: {subject-predicate-object}, e.g. Interferometer is-a optical instrument; Optical instrument has focal length • W3C is the primary (but not sole) governing org. • RDF • OWL 1.0 and 2.0 - Web Ontology Language • RDF programming environments exist for 14+ languages, including C, C++, Python, Java, JavaScript, Ruby, PHP, ... (no Cobol or Ada yet ;-( ) • OWL programming for Java • Closed World - where complete knowledge is known (encoded), AI relied on this • Open World - where knowledge is incomplete/ evolving, SW promotes this
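As a minimal sketch of the triple and of the closed-world versus open-world distinction, the slide's two example statements can be held as plain tuples. The names `Triple`, `holds_closed_world` and `holds_open_world`, and the unstated `Detector` statement, are hypothetical and only illustrate the idea; no RDF library is assumed.

```python
from typing import NamedTuple, Optional, Set


class Triple(NamedTuple):
    """One {subject-predicate-object} statement."""
    subject: str
    predicate: str
    obj: str


# The two example statements from the slide, written as triples.
graph: Set[Triple] = {
    Triple("Interferometer", "is-a", "OpticalInstrument"),
    Triple("OpticalInstrument", "has", "FocalLength"),
}


def holds_closed_world(g: Set[Triple], t: Triple) -> bool:
    """Closed world: anything that is not stated is taken to be false."""
    return t in g


def holds_open_world(g: Set[Triple], t: Triple) -> Optional[bool]:
    """Open world: a missing statement is unknown (None), not false."""
    return True if t in g else None


known = Triple("Interferometer", "is-a", "OpticalInstrument")
unstated = Triple("Interferometer", "has", "Detector")  # hypothetical, not in the graph

print(holds_closed_world(graph, unstated))  # False
print(holds_open_world(graph, unstated))    # None
```

The only difference between the two functions is how absence is read, which is the point the slide makes about AI (closed world) versus the Semantic Web (open world).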

  10. Ontology Spectrum [Figure: a spectrum of increasing formality] • Catalog/ ID • Terms/ glossary • Thesauri “narrower term” relation • Informal is-a • Formal is-a • Formal instance • Frames (properties) • Value Restrs. • Selected Logical Constraints (disjointness, inverse, …) • General Logical constraints • Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

  11. SW != ontologies on the web (!) • Ontologies are important, but use them only when necessary as identified by use cases • The Semantic Web is about integrating data on the Web; ontologies (and/or rules) are tools to achieve that when necessary • SW ontologies != some big (central) ontology • The ethos of the Semantic Web is sharing, i.e., sharing possibly many small ontologies • A huge, central ontology could be difficult to manage in terms of maintenance. • Semantic web languages such as OWL contain primitives for equivalence and disjointness of terms and meta primitives for versioning info • The practice: • SW applications using ontologies mix a large number of ontologies and vocabularies (FOAF, DC, and others) • the real advantage comes from this mix: that is also how new relationships may be discovered • One readable background article from the metadata world is available at: http://www.metamodel.com/article.php?story=20030115211223271

  12. Semantic Web Myths • ‘the Semantic Web is a reincarnation of Artificial Intelligence on the Web’ (closed world versus open world) • ‘it relies on giant, centrally controlled ontologies for "meaning" (as opposed to a democratic, bottom-up control of terms)’ • ‘one has to add metadata to all Web pages, convert all relational databases, and XML data to use the Semantic Web’ • ‘one has to learn formal logic, knowledge representation techniques, description logic, etc, to use it’ • ‘it is, essentially, an academic project, of no interest for industry’

  13. Integrating Multiple Data Sources • The Semantic Web lets us merge statements from different sources • The RDF Graph Model allows programs to use data uniformly regardless of the source • Figuring out where to find such data is a motivator for Semantic Web Services [Figure: RDF graph for #Ionosphere with the properties name “Terrestrial Ionosphere”, hasCoordinates #magnetic, hasLowerBoundaryValue “100” and hasLowerBoundaryUnit “km”; different line and text colors represent different data sources]
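Merging statements from different sources can be sketched as a set union over triples. The split of the #Ionosphere statements between `source_a` and `source_b` below is an assumption (the slide's color coding is not recoverable from the transcript), and `merge` and `describe` are hypothetical helper names.

```python
# Each source is a set of (subject, predicate, object) tuples.
# Which statement came from which source is assumed for illustration.
source_a = {
    ("#Ionosphere", "name", "Terrestrial Ionosphere"),
    ("#Ionosphere", "hasCoordinates", "#magnetic"),
}
source_b = {
    ("#Ionosphere", "hasLowerBoundaryValue", "100"),
    ("#Ionosphere", "hasLowerBoundaryUnit", "km"),
}


def merge(*graphs):
    """Merge any number of triple sets; duplicate statements collapse."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged


def describe(graph, subject):
    """Return every (predicate, object) pair stated about a subject, sorted."""
    return sorted((p, o) for s, p, o in graph if s == subject)


merged = merge(source_a, source_b)
for predicate, obj in describe(merged, "#Ionosphere"):
    print(predicate, obj)
```

After the merge, `describe` no longer needs to know which source a statement came from, which is the uniformity the slide attributes to the RDF graph model.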

  14. Drill Down /Focused Perusal • The Semantic Web uses Uniform Resource Identifiers (URIs) to name things • These can typically be resolved to get more information about the resource • This essentially creates a web of data analogous to the web of text created by the World Wide Web • Ontologies are represented using the same structure as content • We can resolve class and property URIs to learn about the ontology [Figure: graph resolved over the Internet, with nodes …#NeutralTemperature, …#Norway, ...#ISR, ...#FPI, …#EISCAT and ...#MillstoneHill linked by the properties locatedIn, measuredby, type and operatedby]

  15. Statements about Statements • The Semantic Web allows us to make statements about statements • Timestamps • Provenance / Lineage • Authoritativeness / Probability / Uncertainty • Security classification • … • This is an unsung virtue of the Semantic Web [Figure: RDF graph with nodes #Danny’s, #Aurora, 20031031 and Red, linked by hasSource, hasDateTime and hascolor] Ontologies Workshop, APL May 26, 2006
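One way to make statements about statements is RDF reification, where a statement gets its own identifier and is described with rdf:subject, rdf:predicate and rdf:object. The sketch below uses plain tuples rather than an RDF library. The reading of the figure (the statement is "#Aurora hasColor Red", and the source and timestamp annotate it) is an assumption, as are the blank node label `_:stmt1`, the source name `#Danny` and the helper name `reify`.

```python
def reify(statement_id, triple):
    """Describe a triple as a resource so other triples can refer to it."""
    subject, predicate, obj = triple
    return {
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    }


# Assumed reading of the slide's figure: an observation about an aurora.
observation = ("#Aurora", "hasColor", "Red")
graph = reify("_:stmt1", observation)

# Statements about the statement: provenance and a timestamp.
graph |= {
    ("_:stmt1", "hasSource", "#Danny"),
    ("_:stmt1", "hasDateTime", "20031031"),
}


def annotations(g, statement_id):
    """Return the non-rdf: properties attached to a reified statement."""
    return {p: o for s, p, o in g if s == statement_id and not p.startswith("rdf:")}


print(annotations(graph, "_:stmt1"))
```

The same pattern would carry the other items on the slide (uncertainty, security classification) as further properties of `_:stmt1`.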

  16. ‘Collecting’ the ‘data’ • Part of the (meta)data information is present in tools ... but thrown away at output • e.g., a business chart can be generated by a tool: it ‘knows’ the structure, the classification, etc. of the chart, but, usually, this information is lost • storing it in web data would be easy! • SW-aware tools are around (even if you do not know it...), though more would be good: • Photoshop CS stores metadata in RDF in, say, jpg files (using XMP) • RSS 1.0 feeds are generated by (almost) all blogging systems (a huge amount of RDF data!)

  17. ‘Collecting’ the ‘data’ • Scraping - different tools, services, etc, come around every day: • get RDF data associated with images, for example: service to get RDF from flickr images • service to get RDF from XMP • XSLT scripts to retrieve microformat data from XHTML files • RSS scraping in use in VO projects in Japan • scripts to convert spreadsheets to RDF – e.g. see the tools, tutorials, demos at http://logd.tw.rpi.edu
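A spreadsheet-to-RDF script of the kind mentioned above can be sketched with the standard library alone, writing one N-Triples line per non-empty cell. This is not the tooling at logd.tw.rpi.edu; the namespace `http://example.org/`, the function name `rows_to_ntriples`, and the sample rows are all invented for illustration, and column names are assumed to be safe to use inside an IRI.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace for subjects and predicates


def rows_to_ntriples(csv_text, key_column):
    """Turn each CSV row into N-Triples lines, one per non-empty cell.

    The value in key_column names the subject; every other column becomes
    a predicate whose object is a plain string literal.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or value == "":
                continue
            # Escape backslashes and double quotes inside the literal.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}{column}> "{literal}" .')
    return lines


sample = "id,name,lowerBoundaryKm\nionosphere,Terrestrial Ionosphere,100\n"
for line in rows_to_ntriples(sample, key_column="id"):
    print(line)
```

Every value is emitted as an untyped literal here; a fuller converter would also map columns to shared vocabularies and typed values.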

  18. ‘Collecting’ the ‘data’ • SQL - A huge amount of data in Relational Databases • Although tools exist, it is not feasible to convert that data into RDF • Instead: SQL ⇋ RDF ‘bridges’ are being developed: a query to RDF data is transformed into SQL on-the-fly • Reading for this week, article by Berners-Lee and Sahoo et al. • RDB2RDF W3 working group - http://www.w3.org/2001/sw/rdb2rdf/ • D2RQ/ D2RServer • Commercial solutions appearing • NoSQL • Other ‘graph’ forms…
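The bridge idea, where a query over RDF data is rewritten into SQL on the fly, can be sketched with an in-memory SQLite table. This is not how D2RQ or any RDB2RDF product is implemented; the table `instrument`, the mapping `PREDICATE_TO_COLUMN`, the function names and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical mapping from RDF predicate names to columns of one table.
PREDICATE_TO_COLUMN = {"name": "name", "operatedBy": "operated_by"}


def pattern_to_sql(predicate, obj=None):
    """Rewrite the triple pattern (?s, predicate, obj-or-?o) as SQL plus parameters."""
    column = PREDICATE_TO_COLUMN[predicate]  # KeyError for unmapped predicates
    sql = f"SELECT id, {column} FROM instrument"
    params = ()
    if obj is not None:
        sql += f" WHERE {column} = ?"
        params = (obj,)
    return sql + " ORDER BY id", params


def query_triples(conn, predicate, obj=None):
    """Answer the pattern from the relational data without materialising any RDF."""
    sql, params = pattern_to_sql(predicate, obj)
    return [
        (f"#instrument/{row_id}", predicate, value)
        for row_id, value in conn.execute(sql, params)
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instrument (id INTEGER PRIMARY KEY, name TEXT, operated_by TEXT)"
)
conn.executemany(
    "INSERT INTO instrument (id, name, operated_by) VALUES (?, ?, ?)",
    [(1, "ISR", "SiteA"), (2, "FPI", "SiteB")],  # invented sample rows
)

print(query_triples(conn, "operatedBy", "SiteA"))
```

The data stays in the relational table; only the answer to each pattern is expressed as triples, which is why a bridge avoids the bulk conversion the slide calls infeasible.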

  19. More Collecting • RDFa (formerly known as RDF/A) extends XHTML by: • extending the link and meta to include child elements • adding metadata to any elements (a bit like the class in microformats, but via dedicated properties) • It is very similar to microformats, but with more rigor: • it is a general framework (instead of an ‘agreement’ on the meaning of, say, a class attribute value) • terminologies can be mixed more easily • GRDDL - Gleaning Resource Descriptions from Dialects of Languages • ATOM (follow on to RSS)

  20. Linked open data • http://linkeddata.org/guides-and-tutorials • http://tomheath.com/slides/2009-02-austin-linkeddata-tutorial.pdf (we will look at some of these slides now, #1-25 and 30-37) • And of course: • http://logd.tw.rpi.edu/ • http://data-gov.tw.rpi.edu/wiki

  21. http://richard.cyganiak.de/2007/10/lod/ • Latest 295 • 2011-09-19 295 • 2010-09-22 203 • 2009-07-14 95 • 2009-03-27 93 • 2009-03-05 89 • 2008-09-18 45 • 2008-03-31 34 • 2008-02-28 32 • 2007-11-10 28 • 2007-11-07 28 • 2007-10-08 25 • 2007-05-01 12

  22. 2009-03-05 (Chris Bizer)

  23. September 2011 “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

  24. (Class 2) Management • Creation of logical collections • Physical data handling • Interoperability support • Security support • Data ownership • Metadata collection, management and access. • Persistence • Knowledge and information discovery • Data dissemination and publication

  25. Data Management and WOD • Is this the grand solution? • How is the data managed? • Found? • Curated? • What about the metadata? • What problems are introduced? • See: Parsons and Fox (2012): http://mp-datamatters.blogspot.com/

  26. Data on the Web, Internet • Data behind web services • Data files on web sites • We have covered data as service approaches • Thinking you have found data when you have really only found information and metadata • The real difference between this topic and the next one is: • Access and dissemination • Level of curation (and often description)

  27. Data on the internet • http://www.dataspaceweb.org/ • Data files on other protocols • FTP • RFTP • GridFTP • SABUL • XMPP/AMQP • Others…

  28. Deep web • Data behind web services • Data behind query interfaces (databases or files) • Introduces a different curation problem

  29. The loose definition • Something that a crawler cannot find and/or index • Creates the other definition of shallow web • Has many implications for discovery, access and use • Curation is more complex to satisfy this definition, i.e. not a matter of just putting files ‘on the web’ • 50, 100, 1000 times the ‘shallow web’?

  30. Managing (in) the deep web • Sometimes, the deep web aspect of a data source can be due to extreme obscurity, language peculiarities, NO metadata, NO documentation • There are no known studies of how effective data management (what you are learning) could change the percentage of deep/ shallow • Semantics are often put forward as a solution http://www.mkbergman.com/458/new-currents-in-the-deep-web/

  31. Internet impacts on management • Management of data that is… on the Internet! • Web – ‘stateless’ • Curation, Preservation – highly stateful (by definition) • You will hear terms such as digital curation and digital preservation (search on these) but what about internet curation and internet preservation (Internet Archive?) • What others??

  32. (Class 2) Management • Creation of logical collections • Physical data handling • Interoperability support • Security support • Data ownership • Metadata collection, management and access. • Persistence • Knowledge and information discovery • Data dissemination and publication

  33. Thus data frameworks are appearing • Many – meaning they go beyond web sites, they incorporate many of the data management functions • Initially syntactic – e.g. OPeNDAP, ADDE, ODATA, OODT • Application oriented – e.g. virtual observatories • Semantic – e.g. Virtual Solar-Terrestrial Observatory • ALL of these are changing the nature of data management and role of data ‘providers’ cf. ?

  34. Some Definitions • DAP = Data Access Protocol • Model used to describe the data; • Request syntax and semantics; and • Response syntax and semantics. • OPeNDAP • The software; • Numerous reference implementations; • Core/libraries and services (servers and clients). • OPeNDAP Inc. • OPeNDAP is a 501(c)(3) non-profit corporation; • Formed to maintain, evolve and promote the discipline neutral DAP that was the DODS core infrastructure. BOM, Melbourne, VIC 20071015 (Fox)

  35. Considerations with regard to the development of DAP and OPeNDAP • Many data formats • Many different client types • Many data providers • Many different semantic representations of the data • Many different security requirements

  36. Broad Vision A world in which a single data access protocol is used for the exchange of data between network based applications regardless of discipline. A layer above TCP/IP providing for syntactic and semantic consistency not available in existing protocols such as FTP.

  37. Practical Considerations The broad vision: • Is syntactically achievable, but • Was not semantically achievable, at least not fully, but perhaps in the near term.

  38. OPeNDAP Inc. Mission Statement To maintain, evolve and promote a data access protocol (DAP) and reference implementation software (OPeNDAP) for the syntactically consistent exchange of data over the network. The DAP should provide syntactic interoperability across disciplines and allow for semantic interoperability within disciplines.

  39. The Data Access Protocol (DAP) • The DAP has been designed to be as general as possible without being constrained to a particular discipline or world view. • The DAP is a discipline neutral data access protocol; it is being used in astronomy, medicine, earth science,… • Provides data format and location, and data organization transparency • Is metadata neutral

  40. DAP comparisons • File-based • GridFTP/FTP • HTTP • SRB • Service-based • Open-Geospatial Consortium, WCS, WMS, WFS, … • Virtual Observatory (Astronomy), SIAP, SSAP, STAP,…

  41. Who is using DAP/ OPeNDAP? • Science examples • PMEL with their Tsunami inundation modeling • Ocean regional modelers to extract open boundary conditions • Visualization of data sets using MATLAB/IDL/… • Service examples • Live Access Server • Mapserver – OGC services and OPeNDAP data access (future) • Digital Library Service - metadata and catalogue info

  42. Data Access Protocol (DAP2) - Current • DAP2 currently a NASA/ESE ‘Standard’ • Current servers implement DAP2 • DAP3 • DAP2 + XML responses (implemented)

  43. DAP4 • DAP4 improvements over DAP3: • Additional datatypes • Swath • Blob - GIF, MPEG,… • Additional functionality • Check sum • Modulo • The additional datatypes will enable the DAP to be used in a wider variety of circumstances and are a direct response to users’ requests.

  44. What DAP means to me • Data access and transport • Response types: DAP objects versus file type • A DAP URL is essentially an HTTP URL with additional restrictions placed on the abs-path component. • DAP2-URL = "http://" host [ ":" port ] [ abs-path ] • abs-path = server-path data-source-id [ "." ext [ "?" query ] ] • server-path = [ "/" token ] • data-source-id = [ "/" token ] • ext = "das" | "dds" | "dods" • The server-path is the pathname to the server, whereas data-source-id is the pathname to the data.
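The URL grammar above can be turned into a small builder that enforces the two restrictions it states: the extension must be one of "das", "dds" or "dods", and a query may only follow an extension. This is a sketch of the grammar as transcribed on the slide, not code from any OPeNDAP library; the function name `build_dap2_url` and the host, paths and query used in the example are hypothetical.

```python
from typing import Optional

DAP2_EXTENSIONS = ("das", "dds", "dods")


def build_dap2_url(
    host: str,
    server_path: str,
    data_source_id: str,
    ext: Optional[str] = None,
    query: Optional[str] = None,
    port: Optional[int] = None,
) -> str:
    """Assemble "http://" host [":" port] server-path data-source-id ["." ext ["?" query]]."""
    if ext is not None and ext not in DAP2_EXTENSIONS:
        raise ValueError(f"ext must be one of {DAP2_EXTENSIONS}, got {ext!r}")
    if query is not None and ext is None:
        # In the grammar the query sits inside the optional ext group.
        raise ValueError("a query is only allowed after an extension")
    authority = host if port is None else f"{host}:{port}"
    url = f"http://{authority}{server_path}{data_source_id}"
    if ext is not None:
        url += f".{ext}"
    if query is not None:
        url += f"?{query}"
    return url


# Hypothetical server and data source, for illustration only.
print(build_dap2_url("example.org", "/opendap", "/sst", ext="dds"))
print(build_dap2_url("example.org", "/opendap", "/sst", ext="dods", query="sst", port=8080))
```

Keeping server-path and data-source-id as separate arguments mirrors the slide's distinction between the pathname to the server and the pathname to the data.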

  45. OPeNDAP V3 Architecture [Figure: Client, CGI-style access, Data] • CGI-style access • Uses web server • HTTP protocol • Several request and response types • Reads data files, databases, etc., returns info • May return DAP2 objects or other data • Client can be application, web browser or specialized server/service

  46. OPeNDAP V4 (Hyrax) Architecture [Figure: Client, OLFS, BES, Data] • OPeNDAP Lightweight Front end Server (OLFS) • Receives requests and asks the BES to fill them • Uses Java Servlets • Does not directly ‘touch’ data • Multi-protocol • Back End Server (BES) • Reads data files, databases, etc., returns info • May return DAP2 objects or other data • Does not require web server
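The split between a front end that receives requests and a back end that alone reads the data can be sketched as two small classes. This is an illustration of the division of responsibilities only, not Hyrax code (the real OLFS uses Java Servlets); the class names, the request form "<data-source-id>.<kind>", the request kinds "describe" and "data", and the sample values are all hypothetical.

```python
class BackEndServer:
    """Stands in for the BES: the only component that reads the data."""

    def __init__(self, datasets):
        self._datasets = datasets  # maps a data-source-id to a list of values

    def fill(self, data_source_id, kind):
        values = self._datasets[data_source_id]
        if kind == "describe":
            return {"id": data_source_id, "length": len(values)}
        if kind == "data":
            return list(values)
        raise ValueError(f"unsupported request kind: {kind}")


class FrontEndServer:
    """Stands in for the OLFS: receives requests and asks the back end to fill them."""

    def __init__(self, back_end):
        self._back_end = back_end

    def handle(self, path):
        # Split a hypothetical request of the form "<data-source-id>.<kind>".
        data_source_id, _, kind = path.rpartition(".")
        # The front end never opens the data itself; it only delegates.
        return self._back_end.fill(data_source_id, kind)


bes = BackEndServer({"/data/sst": [14.2, 14.8, 15.1]})  # invented sample values
olfs = FrontEndServer(bes)

print(olfs.handle("/data/sst.describe"))
print(olfs.handle("/data/sst.data"))
```

Because the front end holds no data access logic, other protocols could be added in front of the same back end, which is one reading of the slide's "Multi-protocol" bullet.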

  47. Binaries Generated There are approximately 80 binaries built on a nightly basis. They are built for the following platforms/operating systems: • Linux • FC4 • FC5 • MacOS-X (universal binaries when possible) • Windows XP, win32 • Java 1.5 (Tomcat 5.5) • IRIX (in four variants), Solaris, AIX, OSF

  48. OPeNDAP System Elements The OPeNDAP data access protocol is used by a variety of system elements. • Clients • Browser Interfaces • Data System Integrators (ODC) • Servers • Processing Servers • Aggregating Servers - OPeNDAP chains • Ancillary Information Services

  49. Clients • Clients make requests and receive responses via the DAP. • Clients convert data from the OPeNDAP data model to the form required in the client application.
