Data Federation and Search

Courtesy: Dean Allemang, Working Ontologist, LLC (dallemang@workingontologist.com)


Presentation Transcript


  1. Data Federation and Search Courtesy: Dean Allemang Working Ontologist, LLC dallemang@workingontologist.com

  2. Problems [Diagram: data sources RDB, RDB, RDB, Spreadsheet, XML, Relational Database, email, and a "?"]

  3. Challenges
     • Syntactic challenges
       • Different formats
       • Character encodings
       • Upper/lower case
     • Structural challenges
       • Grouping
       • References
     • Semantic challenges
       • Identity (when are we talking about the same thing?)
       • Mapping (zip code -> post code)
       • Conversions (e.g., $ -> ₩)
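The syntactic and mapping challenges listed above can be illustrated with a small sketch. It is not part of the tutorial materials: the function names, the normalization rules, and the `FIELD_ALIASES` table are assumptions chosen only to show how case, encoding variants, and differing field names might be reconciled before records are compared.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Return a comparison key for a name (illustrative rules only)."""
    # NFKC folds compatibility characters (e.g. full-width letters) into a common form.
    text = unicodedata.normalize("NFKC", name)
    # casefold() is a locale-independent, more aggressive lower().
    text = text.casefold()
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

# Hypothetical alias table: two sources use different labels for the same field.
FIELD_ALIASES = {"zip code": "postal_code", "post code": "postal_code"}

def canonical_field(field: str) -> str:
    """Map a source-specific field label to one shared label."""
    key = normalize_name(field)
    return FIELD_ALIASES.get(key, key)

if __name__ == "__main__":
    print(normalize_name("  MIZUHO  Financial Group, Inc. "))
    print(canonical_field("Zip Code"), canonical_field("post code"))
```

Normalization of this kind only addresses the syntactic layer. Deciding that two differently spelled names denote the same company (the identity challenge) still needs a shared identifier or an explicit mapping.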

  4. The Approach [Diagram: data sources RDB, RDB, RDB, Spreadsheet, XML, Relational Database, email, and a "?"]

  5. Scenario
     • Data about companies, their sectors, and their last sale is stored in different formats.
     • How can we search for information across the different datasets?
     • Example: for Mizuho Financial Group, Inc., what are its Legal Entity Identifier, the country where it is located, its last sale, its marketCap, and its sector?
     • How about Guangshen Railway Company Limited?

  6. Scenario
     • LEIAsia.xml: Legal Entity Identifiers (LEI) for organizations or individuals that can manage money (the example is a small set of entities registered in Asia)
       • Uses country codes rather than full country names
     • ISO3166.xml: converts country codes to full country names
     • companylistnyse.csv and companylistnasdaq.csv: company listings from NASDAQ, including marketCap, IPOyear, Sector, Industry, Summary Quote, LastSale, and so on

  7. Solutions • Integrate the three datasets above using RDF, then search them using SPARQL

  8. Software
     • All software used in this tutorial is open source, and all data sets are in the public domain. The tutorial materials include:
       • xsltproc, an XSLT processor from XMLSoft
       • xml2rdf3.xsl, an XML-to-RDF translator written in XSLT, from AstroGrid
       • tab2n3.py, a spreadsheet (CSV) to RDF converter from MindSWAP, which runs in Python
       • arq, an RDF/SPARQL processor based on Jena

  9. Components
     • Data Sources: NASDAQ listings, NYSE listings, ISO Country Codes, Legal Entity Identifier (LEI) Asia
     • Software: XML2RDF3, TAB2N3, ARQ

  10. Exercise Architecture [Diagram: NASDAQ listings and NYSE listings go through TAB2N3; ISO Country Codes and Legal Entity Identifier (LEI) Asia go through XML2RDF3; ARQ queries the converted data]

  11. Setup
      • Download and unzip jist2013.allemang.org.zip
      • Set up environment variables
        @echo off
        rem change this path to work dir
        set testdir=C:\Ding@IU\Teaching\Fall2013\Z636\GuestLecture\jist2013\jist2013.allemang.org
        rem do not change below code
        set JENA_HOME=%testdir%\apache-jena-2.11.0
        set PATH=%PATH%;%testdir%\bin;%testdir%\apache-jena-2.11.0\bin
        set PATH=%PATH%;%testdir%\apache-jena-2.11.0\lib;%testdir%\apache-jena-2.11.0\bat
        @echo on

  12. Converting XML to RDF
      <LegalEntity>
        <LegalName>Hyundai Capital Services, Inc.</LegalName>
        <OtherNames>
          <OtherName>Hyundai Auto Finance Co., Ltd.</OtherName>
        </OtherNames>
        <RegisteredAddress>
          <AddressLineOne>10th Floor</AddressLineOne>
          <AddressLineTwo>Hyundai Capital Building</AddressLineTwo>
          <AddressLineThree>15-21, Youido-dong</AddressLineThree>
          <City>Youngdungpo-Ku</City>
          <State>Seoul</State>
          <Country>KR</Country>
          <PostCode>150-706</PostCode>
        </RegisteredAddress>
      </LegalEntity>

  13. Converting XML to RDF [Diagram: the resulting RDF graph]
      S0-0
        LegalName -> S0-0-2, value "Hyundai Capital Services, Inc."
        OtherNames -> S0-0-3
          OtherName -> S0-0-3-0, value "Hyundai Auto Finance Co., Ltd."
        RegisteredAddress -> S0-0-4
          AddressLineOne -> S0-0-4-0, value "10th Floor"
          AddressLineTwo -> S0-0-4-1, value "Hyundai Capital Building"
          AddressLineThree -> S0-0-4-2, value "15-21, Youido-dong"
          City -> S0-0-4-3, value "Youngdungpo-Ku"
          State -> S0-0-4-4, value "Seoul"
          Country -> S0-0-4-5, value "KR"
          PostCode -> S0-0-4-6, value "150-706"

  14. Converting XML to RDF
      • Converting ISO3166.xml to RDF
        xsltproc -stringparam BaseURI "http://jist2013.org/ISO3166" xml2rdf3.xsl ISO3166.xml > iso3166.rdf
        arq --data iso3166.rdf -query queries/properties.rq
        arq --data iso3166.rdf -query queries/iso.rq

  15. Converting XML to RDF
      • Converting LEIAsia.xml to RDF
        xsltproc -stringparam BaseURI "http://jist2013.org/LEIAsia" xml2rdf3.xsl LEIAsia.xml > LEIAsia.rdf
        arq --data LEIAsia.rdf -query queries/properties.rq
        arq --data LEIAsia.rdf -query queries/name.rq
        arq --data LEIAsia.rdf -query queries/name1.rq
        arq --data LEIAsia.rdf -query queries/addresses.rq

  16. Converting XML to RDF

  17. Converting CSV to RDF [Diagram: one row as an RDF graph]
      <http://jist2013.org/nasdaq#FUBC>
        name -> "1st United Bancorp, Inc. (FL)"
        Market Cap -> "260938080.01"
        Sector -> "Finance"
        Industry -> "Major Banks"

  18. Same data in Turtle
      @prefix : <http://jist2013.org/nasdaq#> .
      :nFUBC a :Item ;
        :symbol "FUBC" ;
        :name "1st United Bancorp, Inc. (FL)" ;
        :marketCap "260938080.01" ;
        :sector "Finance" ;
        :industry "Major Banks" .
      :nABMD a :Item ;
        :symbol "ABMD" ;
        :name "ABIOMED, Inc." ;
        :marketCap "976161502.05" ;
        :sector "Health Care" ;
        :industry "Medical/Dental Instruments" .
      :nARAY a :Item ;
        :symbol "ARAY" ;
        :name "Accuray Incorporated" ;
        :marketCap "512861883.8" ;
        :sector "Health Care" ;
        :industry "Medical/Dental Instruments" .
      :nACFN a :Item ;
        :symbol "ACFN" ;
        :name "Acorn Energy, Inc." ;
        :marketCap "86289929.7" ;
        :sector "Consumer Services" ;
        :industry "Military/Government/Technical" .
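The row-to-resource mapping behind the Turtle above can be sketched with Python's standard `csv` module. This is not tab2n3.py and does not reproduce its options. The column headers in `SAMPLE_CSV`, the `n` prefix on subject names, and the helper names are assumptions inferred from the slide's output.

```python
import csv
import io

def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def property_name(header: str) -> str:
    """Turn a non-empty column header such as "MarketCap" into "marketCap"."""
    compact = "".join(header.split())
    return compact[0].lower() + compact[1:]

def row_to_turtle(row: dict, id_field: str = "Symbol") -> str:
    """Render one CSV row as one Turtle resource of type :Item."""
    lines = [f":n{row[id_field]} a :Item"]
    for header, value in row.items():
        lines.append(f"    :{property_name(header)} {turtle_literal(value)}")
    return " ;\n".join(lines) + " ."

def csv_to_turtle(text: str, namespace: str = "http://jist2013.org/nasdaq#") -> str:
    """Convert CSV text with a header row into a Turtle document."""
    reader = csv.DictReader(io.StringIO(text))
    body = "\n".join(row_to_turtle(row) for row in reader)
    return f"@prefix : <{namespace}> .\n{body}\n"

# Assumed headers; the real listing files have more columns.
SAMPLE_CSV = (
    "Symbol,Name,MarketCap,Sector,Industry\n"
    'FUBC,"1st United Bancorp, Inc. (FL)",260938080.01,Finance,Major Banks\n'
)

turtle_text = csv_to_turtle(SAMPLE_CSV)

if __name__ == "__main__":
    print(turtle_text)
```

The sketch assumes ticker symbols are valid Turtle local names. Symbols containing characters such as `^` or `/` would need escaping or a full IRI in angle brackets.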

  19. Converting CSV to RDF
      • Using tab2n3.py to convert CSV to RDF
        py tab2n3.py -comma -schema -type -idfield -namespace http://jist2013.org/nasdaq <companylistnasdaq.csv >companylistnasdaq.ttl
        py tab2n3.py -comma -schema -type -idfield -namespace http://jist2013.org/nyse <companylistnyse.csv >companylistnyse.ttl
        arq --data companylistnasdaq.ttl -query queries/properties.rq
        arq --data companylistnasdaq.ttl -query queries/company.rq

  20. Federated Search 1
      • Find companies that appear in both the NYSE listings and LEIAsia
        arq --data companylistnyse.ttl --data LEIAsia.rdf -query queries/fed2.rq

        prefix owl: <http://www.w3.org/2002/07/owl#>
        prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        prefix lei: <http://www.leiutility.org#>
        prefix iso: <http://iso.org/3166#>
        prefix leia: <http://jist2013.org/LEIAsia#>
        prefix nasdaq: <http://jist2013.org/nasdaq#>
        prefix nyse: <http://jist2013.org/nyse#>
        prefix xsd: <http://www.w3.org/2001/XMLSchema#>

        SELECT ?name
        WHERE {
          ?lei lei:LegalName ?lname .   # from name.rq
          ?lname rdf:value ?name .      # from name.rq
          ?stock nyse:name ?name        # from company.rq
        }
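The three triple patterns in the query above amount to a join on the shared `?name` value. The following sketch mimics that join over a tiny in-memory list of triples. It is only an illustration of the matching logic, not how ARQ evaluates queries. The node labels and company names in `TRIPLES` are made up, while the namespace URIs are taken from the query prefixes.

```python
LEI = "http://www.leiutility.org#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NYSE = "http://jist2013.org/nyse#"

# Made-up sample data standing in for LEIAsia.rdf and companylistnyse.ttl.
TRIPLES = [
    ("leia:S0-0", LEI + "LegalName", "leia:S0-0-2"),
    ("leia:S0-0-2", RDF + "value", "Example Holdings, Inc."),
    ("leia:S0-1", LEI + "LegalName", "leia:S0-1-2"),
    ("leia:S0-1-2", RDF + "value", "Private Firm, Ltd."),
    ("nyse:nEXH", NYSE + "name", "Example Holdings, Inc."),
    ("nyse:nOTH", NYSE + "name", "Other Corp."),
]

def match(triples, predicate):
    """Return (subject, object) pairs for all triples with the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]

def shared_names(triples):
    """Names that are both an LEI legal name and an NYSE listing name."""
    results = []
    for _lei, lname in match(triples, LEI + "LegalName"):   # ?lei lei:LegalName ?lname
        for node, name in match(triples, RDF + "value"):     # ?lname rdf:value ?name
            if node != lname:
                continue
            for _stock, stock_name in match(triples, NYSE + "name"):  # ?stock nyse:name ?name
                if stock_name == name:
                    results.append(name)
    return results

if __name__ == "__main__":
    print(shared_names(TRIPLES))
```

Because the join key is the literal name string, it succeeds only when both sources spell the name identically. This is the identity challenge from slide 3 in practice.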

  21. Federated Search 2
      • Show all companies that are listed on the NYSE or NASDAQ, showing their market cap and the name of the country they are registered in.
        arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed3.rq
      • Show the legal forms of all companies, sorted by country (which legal forms are used in which country?)
        arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed4.rq
      • Show the legal forms of all publicly traded companies.
        arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed5.rq
      • Sum up the market caps of all listed companies in each country.
        arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed6.rq
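The last question (total market cap per country) combines all three kinds of data: listings supply the market cap, LEI records supply the country code, and the ISO table supplies the country name. The contents of fed6.rq are not shown in the slides, so the sketch below is not that query. It only illustrates the join-then-aggregate logic in plain Python, with made-up records and assumed field names.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up records standing in for the converted listing, LEI, and ISO data.
LISTINGS = [
    {"name": "Example Holdings, Inc.", "marketCap": "100.50"},
    {"name": "Example Bank, Ltd.", "marketCap": "50.00"},
    {"name": "Sample Railway Co., Ltd.", "marketCap": "200.25"},
    {"name": "Unregistered Corp.", "marketCap": "999.00"},
]
LEI_RECORDS = [
    {"legalName": "Example Holdings, Inc.", "country": "JP"},
    {"legalName": "Example Bank, Ltd.", "country": "JP"},
    {"legalName": "Sample Railway Co., Ltd.", "country": "CN"},
]
COUNTRY_NAMES = {"JP": "Japan", "CN": "China"}

def market_cap_by_country(listings, lei_records, country_names):
    """Sum market caps per country for listings that have a matching LEI record."""
    code_by_name = {rec["legalName"]: rec["country"] for rec in lei_records}
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for listing in listings:
        code = code_by_name.get(listing["name"])
        if code is None:
            continue  # no LEI record, so the country is unknown
        country = country_names.get(code, code)  # fall back to the raw code
        totals[country] += Decimal(listing["marketCap"])
    return dict(totals)

if __name__ == "__main__":
    for country, total in market_cap_by_country(LISTINGS, LEI_RECORDS, COUNTRY_NAMES).items():
        print(country, total)
```

`Decimal` is used because the market caps arrive as strings and exact sums are easier to check than floating-point ones. In SPARQL the same step would typically rely on numeric casting plus `GROUP BY` and `SUM`.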

  22. Tutorial: Answer Questions
      • For Mizuho Financial Group, Inc.: what are its Legal Entity Identifier, the country where it is located, its last sale, its marketCap, and its sector?
        arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/mizuho.rq
      • How about Guangshen Railway Company Limited?

  23. For more info about Dean’s Tutorial
      The hands-on exercise software is Windows-only.
      • Visit http://workingontologist.com/events
      • Click the link for the hands-on exercise materials.
      • Download jist2013.allemang.org.zip
      • Unzip it to your desktop
      • Open tutorial.html, and follow the directions there.
