
Assessing the APE-INV benchmark

Assessing the APE-INV benchmark. Andrea Maurino, DISCo - Dip. di Informatica, Sistemistica e Comunicazione, Università di Milano Bicocca, viale Sarca 336/14, 20124 Milano (Italy). Index: The Benchmark; The Methodology; Address verification; Deduplication; Schema Aware Deduplication


Presentation Transcript


  1. Assessing the APE-INV benchmark • Andrea Maurino, DISCo - Dip. di Informatica, Sistemistica e Comunicazione, Università di Milano Bicocca, viale Sarca 336/14, 20124 Milano (Italy)

  2. Index • The Benchmark • The Methodology • Address verification • Deduplication • Schema Aware Deduplication • Ongoing work ••• ITIS Lab ••• http://www.itis.disco.unimib.it

  3. Benchmark

  4. Benchmark • 1997 tuples • French inventors only • October 2009 version

  5. The methodology • Postal address verification • Deduplication • Schema aware deduplication

  6. Postal Address Verification • Goal: to assess the addresses in the RAW table; to provide a standardized version of the addresses • Input: Raw table • Output: CorrectedRaw_langLat table • Tools: AST version 1; AST version 2

  7. Postal Address Verification • The problem: are the following addresses correct? “THOMSON-CSF - SCPI 173, bld Hausmann 75360 Paris Cedex 08” “9 avenue Saint Jacques 91600 Savigny Sur Org” • For the postman? • According to the official list of addresses? • According to the clean_address table? • Answer: «AVENUE SAINT JACQUES 9 91600 SAVIGNY SUR ORGE ESSONNE ILE-DE-FRANCE FR» «BOULEVARD HAUSSMANN 173 75379 PARIS PARIS ILE-DE-FRANCE FR»

  8. Postal Address Verification • Syntactic accuracy is the closeness of a value v to the elements of the corresponding definition domain D • Who stores the domain D (all possible French addresses)? • French post office • Google Maps
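
The definition of syntactic accuracy on slide 8 can be illustrated with a minimal sketch. It assumes edit distance as the closeness measure and a two-element toy domain D; both are illustrative assumptions, since the slides rely on an external service (Google) rather than on a local copy of D.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def syntactic_accuracy(v: str, domain: list[str]) -> tuple[str, float]:
    """Return the element of the domain closest to v and a similarity in [0, 1]."""
    best = min(domain, key=lambda d: edit_distance(v.upper(), d.upper()))
    dist = edit_distance(v.upper(), best.upper())
    longest = max(len(v), len(best)) or 1
    return best, 1.0 - dist / longest


# Hypothetical toy domain; a real D would hold every valid French address.
D = ["BOULEVARD HAUSSMANN", "AVENUE SAINT JACQUES"]
closest, score = syntactic_accuracy("bld Hausmann", D)
```

A value that belongs to D gets similarity 1.0; the misspelled "bld Hausmann" maps to its nearest domain element with a lower score.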

  9. AST 1.0 • AST (AddresS Tool) 1.0 exploits the Google API to assess whether an address is correct and to produce a standardized version of the address • Limits: 15,000 requests per IP per day (for free); 5 seconds between two requests • [Figure: architecture diagram with labels AST1.0 clients, Queuing, AST1.0 server, RAW Table, Corrected RAW Table]
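
The two limits on slide 9 suggest a simple client-side throttle. The following is a sketch under assumptions: the function name `throttled` and its parameters are hypothetical, and the actual AST 1.0 queuing component is not described in the slides. Note that one request every 5 seconds would allow 17,280 requests per day, so the 15,000 daily quota is the binding limit.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    items: Iterable[str],
    min_interval: float = 5.0,      # seconds between two requests (slide 9)
    daily_quota: int = 15000,       # free requests per IP per day (slide 9)
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield items no faster than min_interval apart, stopping at the quota."""
    last = None
    for sent, item in enumerate(items):
        if sent >= daily_quota:
            break  # quota exhausted: remaining items wait for the next day
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

The clock and sleep functions are injectable so that the pacing logic can be checked without actually waiting.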

  10. AST 1.0 • An address a is accurate if applying AST 1.0 to it returns exactly one answer with accuracy level 6.
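
The accuracy criterion of slide 10 can be written as a small predicate. This is a sketch: the `Candidate` record and its field names are hypothetical stand-ins for whatever the geocoding service returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    standardized: str   # standardized address text (hypothetical field)
    accuracy: int       # accuracy level reported by the service (hypothetical field)


def accurate_address(candidates: list[Candidate], required_level: int = 6) -> Optional[str]:
    """Return the standardized address if the slide-10 criterion holds, else None."""
    if len(candidates) == 1 and candidates[0].accuracy == required_level:
        return candidates[0].standardized
    return None
```

Both an ambiguous answer (more than one candidate) and a single answer below level 6 count as not accurate.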

  11. AST 1.0

  12. AST 1.0 • According to AST, the raw table includes 78.6% accurate data (1569 addresses) • To evaluate the quality of these results, we manually compared the standardized version of the accurate addresses with the clean address table. • The results are the following:

  13. AST 2.0 • Version 2.0 uses the Web version of the Google API and exploits the “did you mean” feature to improve results • An address a is accurate if applying the AST 2.0 tool to it returns exactly one answer and no “did you mean” sentence is present in the answer page.

  14. AST 2.0 • [Figure: architecture diagram with labels AST2.0 Server, HTTP request, Queuing, AST2.0 Wrapper, RAW Table, HTML page, Corrected RAW Table]

  15. AST 2.0 • No limit on the number of requests per IP or on the interval between two requests • The raw table includes 64.7% accurate data (1292 addresses) • The results are the following:
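
As a quick arithmetic check, the percentages reported for the two tools follow from the 1997 tuples of the benchmark (slide 4) and the accurate-address counts on slides 12 and 15:

```python
TOTAL = 1997  # tuples in the benchmark (slide 4)
accurate = {"AST 1.0": 1569, "AST 2.0": 1292}  # counts from slides 12 and 15

# Share of addresses judged accurate by each tool, in percent.
shares = {tool: round(100 * n / TOTAL, 1) for tool, n in accurate.items()}
```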

  16. Comments • Results produced by AST 1.0 can be further improved by applying AST 2.0 to the non-accurate addresses identified by AST 1.0 • The coverage of addresses shown by Google Maps is good, but for small towns there is no information, and historical data are not available (Stalingrad - Volgograd) • The benchmark could also include latitude and longitude information • How to measure this kind of precision? • AST 1.0 + AST 2.0 is a good way to easily (and freely) assess postal addresses for the majority of countries in the world.
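
The combination suggested on slide 16 (run AST 2.0 only on the addresses that AST 1.0 rejects) could be sketched as follows. The verifier functions are hypothetical placeholders: each is assumed to return a standardized address, or None when the address is not accurate under that tool's criterion.

```python
from typing import Callable, Iterable, Optional

# Hypothetical verifier type: raw address -> standardized address, or None.
Verifier = Callable[[str], Optional[str]]


def verify_cascade(
    addresses: Iterable[str], ast1: Verifier, ast2: Verifier
) -> tuple[dict[str, str], list[str]]:
    """Try ast1 first; fall back to ast2 only for addresses ast1 rejects."""
    corrected: dict[str, str] = {}
    rejected: list[str] = []
    for raw in addresses:
        standardized = ast1(raw)
        if standardized is None:
            standardized = ast2(raw)
        if standardized is None:
            rejected.append(raw)
        else:
            corrected[raw] = standardized
    return corrected, rejected
```

Addresses rejected by both tools remain candidates for manual review.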

  17. Final result for Postal Address Verification

  18. Deduplication

  19. Deduplication • Space reduction: sorted neighborhood method • Comparison functions: edit distance, Jaccard, … • Decision model: Fellegi-Sunter • Tools: Fril; Febrl
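
The space-reduction and comparison steps of slide 19 can be sketched as follows. This is a generic illustration of the sorted neighborhood method and of token-level Jaccard similarity, not the implementation used inside Fril or Febrl; the function names and the choice of whitespace tokens are assumptions.

```python
from typing import Callable, Iterator, Sequence


def sorted_neighborhood_pairs(
    records: Sequence[str], key: Callable[[str], str], window: int = 10
) -> Iterator[tuple[str, str]]:
    """Sort records by a blocking key and pair each one with the next window-1 records."""
    ordered = sorted(records, key=key)
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            yield left, right


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


names = ["maurino andrea", "maurino a.", "dong xin", "dong xin luna", "halevy alon"]
candidate_pairs = list(sorted_neighborhood_pairs(names, key=str.lower, window=2))
```

With a window of 2, only adjacent records in sort order are compared, so 5 records yield 4 candidate pairs instead of the 10 pairs of an exhaustive comparison.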

  20. Febrl • Febrl (Freely Extensible Biomedical Record Linkage) is a freeware, Python-based tool for data standardisation and probabilistic record linkage.

  21. Febrl • Environment: PowerEdge R710 with two Intel Xeon X5550 processors (2.66 GHz, 8 MB cache), 16 GB memory, 4 HDs of 450 GB (SAS, 15,000 rpm) • Database: MySQL • Index: SNM, window size = 10 • Comparison functions: Person_name, edit distance (threshold 0.45); Person_address, longest common substring (threshold 0.45) • Decision model: Fellegi-Sunter; score < 0.45: non-match; 0.45 <= score <= 0.75: possible match; score > 0.75: match
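
The three-way decision on slide 21 amounts to two thresholds on the comparison score. A minimal sketch, assuming that the lowest band denotes non-matches (the usual reading of a Fellegi-Sunter style decision model) and using a hypothetical function name:

```python
def classify(score: float, lower: float = 0.45, upper: float = 0.75) -> str:
    """Map a comparison score to a decision using the slide-21 thresholds."""
    if score < lower:
        return "non-match"
    if score <= upper:
        return "possible match"   # left for clerical review
    return "match"
```

Both boundary values, 0.45 and 0.75, fall in the "possible match" band, as on the slide.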

  22. Febrl

  23. Fril • FRIL (Fine-grained Record Integration and Linkage) • Java-based • Two search methods: nested loop join (NLJ) and the sorted neighborhood method (SNM) • Comparison functions: edit distance, Soundex, Q-gram, and equality
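
Of the comparison functions listed on slide 23, the q-gram one may be the least familiar. The sketch below shows one common variant (padded bigrams with a Dice coefficient); it is an illustration only, and FRIL's own q-gram distance may be defined differently.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> list[str]:
    """Split a string into overlapping q-grams, padding both ends with '#'."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return [padded[i : i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over the multisets of q-grams, in [0, 1]."""
    ca, cb = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    shared = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2 * shared / total if total else 1.0
```

Unlike edit distance, this measure tolerates a single missing letter well: "Hausmann" and "Haussmann" share 9 of their 19 bigrams twice over, giving a similarity of 18/19.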

  24. Fril • Environment: HP Pavilion DV6-2125EL with one Intel Core i3-330 (2.13 GHz), 4 GB RAM, one 500 GB SATA HD (7200 rpm) • DB server hosted on the PowerEdge R710 • Search index: SNM • Comparison functions: Person_name, edit distance (threshold = 0.3); Address, edit distance (threshold = 0.8, a very high threshold!) • Decision model: Fellegi-Sunter

  25. Fril

  26. Evaluation • Febrl is much more time-consuming than Fril (it does not use threads) • Febrl does not accept DB connections • Fril is fast and has good usability, but it is not very precise • Thresholds are sometimes too high (probably the benchmark does not include much noise) • Using a cleaned version of the raw table significantly increases the precision of the results

  27. Schema aware deduplication

  28. Schema aware deduplication • Data do not live alone • The inventor table is one of the tables of PATSTAT • More information can be used for deduplication • In the literature: Group Linkage (a.k.a. Group ER); Inter-relationship Deduplication • We introduce an approach that is domain independent, exploits context information via schema analysis, and covers multiple types of record linkage: scattered information; dirty data

  29. Some results • Some preliminary results • Not so good, but the level of interconnectivity is low • On average, 2 co-inventors for each inventor

  30. Ongoing work

  31. Toward a new record linkage • “Panta rei” (Heraclitus): everything flows, everything is constantly changing. • Databases may keep track of these never-ending changes • Examples • People change names: Xin Dong → Xin Luna Dong • People change jobs: Halevy moves from Univ. of Wa. to Google • Nations change: YUGOSLAVIA → Serbia-Montenegro → Serbia → Kosovo

  32. An example

  33. Another example

  34. Temporal Record Linkage • Temporal record linkage is a new research area that aims to discover whether two records represent the same real-world object described at two different timestamps • Work done in collaboration with AT&T Research Labs

  35. Preliminary results • The results are under publication; thus, sorry, we provide only some preliminary results • PEI ADDS data from early binding results
