Assessing the APE-INV benchmark
Andrea Maurino
DISCo - Dip. di Informatica, Sistemistica e Comunicazione, Università di Milano Bicocca, viale Sarca 336/14, 20124, Milano (Italy)
Index
• The Benchmark
• The Methodology
• Address verification
• Deduplication
• Schema Aware Deduplication
• Ongoing work
••• ITIS Lab ••• http://www.itis.disco.unimib.it
Benchmark
• 1997 tuples
• French inventors only
• October 2009 version
The methodology
• Postal address verification
• Deduplication
• Schema aware deduplication
Postal Address Verification
• Goal
  • To assess the addresses in the RAW table
  • To provide a standardized version of the addresses
• Input
  • Raw Table
• Output
  • CorrectedRaw_langLat Table
• Tool
  • AST version 1
  • AST version 2
Postal Address Verification
• The problem: are the following addresses correct?
  “THOMSON-CSF - SCPI 173, bld Hausmann 75360 Paris Cedex 08”
  “9 avenue Saint Jacques 91600 Savigny Sur Org”
  • For the postman?
  • According to the official list of addresses?
  • According to the clean_address table?
• Answer
  « AVENUE SAINT JACQUES 9 91600 SAVIGNY SUR ORGE ESSONNE ILE-DE-FRANCE FR »
  « BOULEVARD HAUSSMANN 173 75379 PARIS PARIS ILE-DE-FRANCE FR »
Postal Address Verification
• Syntactic accuracy is the closeness of a value v to the elements of the corresponding definition domain D
• Who stores the domain D (all possible French addresses)?
  • The French post office
  • Google Maps
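The definition of syntactic accuracy can be illustrated with a minimal sketch. It assumes a tiny in-memory domain D and uses normalized edit distance as the closeness measure; both are illustrative assumptions, since the slides only name the French post office and Google Maps as holders of D. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def syntactic_accuracy(value, domain):
    """Return the closest domain element and a closeness score in [0, 1]."""
    def closeness(candidate):
        longest = max(len(value), len(candidate)) or 1
        return 1.0 - edit_distance(value.upper(), candidate.upper()) / longest
    best = max(domain, key=closeness)
    return best, closeness(best)


# Hypothetical two-element domain D, for illustration only.
D = ["SAVIGNY SUR ORGE", "PARIS"]
closest, score = syntactic_accuracy("Savigny Sur Org", D)
```

Here the truncated town name from the example address is one edit away from its domain element, so it scores close to 1 without being an exact member of D.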
AST 1.0
• AST (AddresS Tool) 1.0 exploits the Google API to
  • assess whether an address is correct
  • produce a standardized version of the address
• 15,000 requests per IP per day (for free)
• 5 seconds between two requests
[Architecture diagram: RAW Table, several AST1.0 clients, Queuing, AST1.0 server, Corrected RAW Table]
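The two limits on this slide allow a quick back-of-the-envelope estimate of how long a run takes. The sketch below assumes that each client runs on its own IP (and so has its own daily quota), that the 5-second pause applies between consecutive requests of one client, and that network latency is ignored. The function names are hypothetical.

```python
import math

DAILY_QUOTA = 15_000      # free requests per IP per day (from the slide)
MIN_GAP_SECONDS = 5       # pause between two requests (from the slide)


def days_needed(n_addresses: int, n_clients: int = 1) -> int:
    """Days required when each client (one IP each) has its own daily quota."""
    per_client = math.ceil(n_addresses / n_clients)
    return max(1, math.ceil(per_client / DAILY_QUOTA))


def busy_seconds(n_addresses: int, n_clients: int = 1) -> int:
    """Lower bound on time per client, counting only the forced pauses."""
    per_client = math.ceil(n_addresses / n_clients)
    return max(0, per_client - 1) * MIN_GAP_SECONDS


single = busy_seconds(1997)        # one client
triple = busy_seconds(1997, 3)     # three clients working in parallel
```

Under these assumptions the 1997 tuples of the benchmark fit within one day's quota for a single client, and the pauses alone account for a little under three hours.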
AST1.0
An address a is accurate if applying AST 1.0 to it returns exactly one answer with accuracy level 6.
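This definition translates directly into a predicate. In the sketch below, `GeocodeAnswer` is a hypothetical record type standing in for one geocoder answer; it is not the real API payload, whose exact shape the slides do not show.

```python
from dataclasses import dataclass
from typing import List

REQUIRED_ACCURACY = 6  # accuracy level required by the definition above


@dataclass
class GeocodeAnswer:
    """Hypothetical shape of one geocoder answer; not the real API payload."""
    standardized_address: str
    accuracy: int


def is_accurate_ast1(answers: List[GeocodeAnswer]) -> bool:
    """True only for exactly one answer whose accuracy level is 6."""
    return len(answers) == 1 and answers[0].accuracy == REQUIRED_ACCURACY
```

Zero answers, several answers, or a single answer at any other accuracy level all count as not accurate.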
AST1.0
• According to AST 1.0, the raw table contains 78.6% accurate data (1569 addresses)
• To evaluate the quality of these results, we manually compared the standardized version of the accurate addresses with the clean address table.
• The results are the following:
AST 2.0
• Version 2.0 uses the Web version of the Google API and exploits the “did you mean” feature to improve results
• An address a is accurate if applying AST 2.0 to it returns exactly one answer and no “did you mean” sentence is present in the answer page.
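The AST 2.0 criterion can be sketched in the same way. The slides do not describe how the wrapper parses the answer page, so the sketch assumes, for illustration only, that the number of answers has already been extracted and that a suggestion can be detected by searching the page text for the literal phrase.

```python
def is_accurate_ast2(n_answers: int, page_html: str) -> bool:
    """Exactly one answer and no "did you mean" suggestion in the page."""
    has_suggestion = "did you mean" in page_html.lower()
    return n_answers == 1 and not has_suggestion
```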
AST 2.0
[Architecture diagram: RAW Table, Queuing, AST2.0 Server, HTTP request, AST2.0 Wrapper, HTML page, Corrected RAW Table]
AST2.0
• No limit on the number of requests per IP or on the time between two requests
• The raw table contains 64.7% accurate data (1292 addresses)
• The results are the following:
Comments
• Results produced by AST1.0 can be further improved by applying AST2.0 to the non-accurate addresses identified by AST1.0
• The coverage of addresses shown by Gmaps is good, but
  • For small towns there is no information
  • Historical data are not available (Stalingrad - Volgograd)
• The benchmark could also include latitude and longitude information
  • How to measure this kind of precision?
• AST1.0+AST2.0 is a good way to easily (and freely) assess postal addresses for the majority of countries in the world.
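The first comment describes a cascade: run AST1.0, then pass only the addresses it rejects to AST2.0. A minimal sketch follows, assuming each tool is wrapped as a function that returns the standardized address or `None` when the input is not accurate. That interface and the lookup-table stand-ins are hypothetical; no network calls are made.

```python
from typing import Callable, Dict, Iterable, Optional

# A checker returns the standardized address, or None when the input is not accurate.
Checker = Callable[[str], Optional[str]]


def cascade(addresses: Iterable[str], ast1: Checker, ast2: Checker) -> Dict[str, Optional[str]]:
    """Try ast1 first; fall back to ast2 only for addresses ast1 rejects."""
    results: Dict[str, Optional[str]] = {}
    for address in addresses:
        standardized = ast1(address)
        if standardized is None:
            standardized = ast2(address)
        results[address] = standardized
    return results


# Stand-in checkers backed by lookup tables (hypothetical data).
fake_ast1 = {"a1": "A1 STD"}.get
fake_ast2 = {"a2": "A2 STD"}.get
outcome = cascade(["a1", "a2", "a3"], fake_ast1, fake_ast2)
```

Addresses that neither tool accepts stay mapped to `None`, which makes the residual set easy to hand over to manual review.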
Final result for Postal Address Verification
Deduplication
Deduplication
• Space reduction
  • Sorted neighborhood method
• Comparison functions
  • Edit distance, Jaccard …
• Decision model
  • Fellegi-Sunter
• Tool
  • Fril
  • Febrl
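The space-reduction step can be sketched as follows. The sorted neighborhood method sorts records by a key and compares only records that fall within the same sliding window, instead of comparing all pairs. The records and the choice of the full name as sorting key are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def sorted_neighborhood_pairs(records: Sequence[dict],
                              key: Callable[[dict], str],
                              window: int) -> List[Tuple[int, int]]:
    """Return index pairs of records that fall within one sliding window."""
    if window < 2:
        raise ValueError("window must be at least 2")
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Pair each record with the next (window - 1) records in sort order.
        for j in order[pos + 1: pos + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs


# Hypothetical records; the sorting key (full name) is an illustrative choice.
people = [
    {"name": "DUPONT JEAN"},
    {"name": "MARTIN PAUL"},
    {"name": "DUPOND JEAN"},
    {"name": "MARTIN PAULE"},
]
candidates = sorted_neighborhood_pairs(people, key=lambda r: r["name"], window=2)
```

With a window of 2, the four records yield three candidate pairs instead of the six produced by an all-pairs comparison, and both near-duplicate pairs are still among them.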
Febrl
• Febrl (Freely Extensible Biomedical Record Linkage) is a free, Python-based tool for data standardisation and probabilistic record linkage.
Febrl
• Environment: PowerEdge R710 with two Intel Xeon X5550 processors (2.66 GHz, 8 MB cache), 16 GB memory, four 450 GB SAS 15,000 rpm hard disks
• Database: MySQL
• Index: SNM, window size = 10
• Comparison functions
  • Person_name: edit distance (threshold 0.45)
  • Person_address: longest common substring (threshold 0.45)
• Decision model: Fellegi-Sunter
  • Score < 0.45: non-match
  • 0.45 <= score <= 0.75: possible match
  • Score > 0.75: match
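The decision bands on this slide can be written as a three-way classifier. This is a simplified sketch of the thresholding step only, not Febrl's implementation. It also assumes that the lowest band (scores below 0.45) denotes a non-match, which is the only reading consistent with the other two bands.

```python
LOWER = 0.45   # below this: non-match (assumed reading of the slide)
UPPER = 0.75   # above this: match


def classify(score: float) -> str:
    """Three-way decision on a record pair's similarity score."""
    if score < LOWER:
        return "non-match"
    if score <= UPPER:
        return "possible match"
    return "match"
```

Pairs in the middle band are the ones that would typically go to clerical review.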
Fril
• FRIL (fine-grained record integration and linkage)
• Java based
• Two search methods: nested loop join (NLJ) and the sorted neighborhood method (SNM)
• Comparison functions: edit distance, Soundex, Q-gram, and equality
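Of the comparison functions listed, the q-gram one is the least self-explanatory, so here is a minimal sketch. It measures similarity as the Jaccard coefficient of the two strings' q-gram sets. The padding character, the default q = 2 and the use of Jaccard are illustrative choices and do not describe FRIL's own implementation.

```python
def qgrams(text: str, q: int = 2) -> set:
    """Set of overlapping q-character substrings, with padding at both ends."""
    padded = "#" * (q - 1) + text.upper() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard coefficient of the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Unlike edit distance, this measure is insensitive to where in the string the shared fragments occur, which helps with swapped tokens in names and addresses.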
Fril
• Environment: HP Pavilion DV6-2125EL with one Intel Core i3-330 (2.13 GHz), 4 GB RAM, one 500 GB SATA 7200 rpm hard disk
• Database server hosted on the PowerEdge R710
• Search index: SNM
• Comparison functions:
  • Person_name: edit distance (threshold = 0.3)
  • Address: edit distance (threshold = 0.8) (a very high threshold!)
• Decision model
  • Fellegi-Sunter
Evaluation
• Febrl is much more time consuming than Fril (it does not use threads)
• Febrl does not accept database connections
• Fril is fast, with good usability, but not so precise
• Thresholds are sometimes too high (probably because the benchmark does not contain much noise)
• Using a cleaned version of the raw table significantly increases the precision of the results
Schema aware deduplication
Schema aware deduplication
• Data do not live alone
  • The inventor table is one of the tables of PATSTAT
  • More information can be used for deduplication
• In the literature
  • Group Linkage (a.k.a. Group ER)
  • Inter-relationship deduplication
• We introduce an approach
  • Domain independent
  • Exploiting context information via schema analysis
  • Covering multiple types of record linkage:
    • Scattered information
    • Dirty data
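To make the idea of context information concrete, the sketch below shows one simple way related records could contribute to a match score: the overlap between the co-inventor sets of two inventor records is mixed with an attribute-level similarity. This is only an illustration of the general idea and not the approach introduced in the slides; the function names and the weight of 0.3 are hypothetical.

```python
def context_similarity(coinventors_a: set, coinventors_b: set) -> float:
    """Jaccard overlap of the two co-inventor sets (0.0 when both are empty)."""
    union = coinventors_a | coinventors_b
    if not union:
        return 0.0
    return len(coinventors_a & coinventors_b) / len(union)


def combined_score(attribute_score: float, coinventors_a: set,
                   coinventors_b: set, context_weight: float = 0.3) -> float:
    """Weighted mix of attribute similarity and context similarity."""
    context = context_similarity(coinventors_a, coinventors_b)
    return (1 - context_weight) * attribute_score + context_weight * context
```

A signal of this kind can only help when records actually share related entities, so sparse relationships limit what it can add.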
Some results
• Some preliminary results
• Not so good, but the level of interconnectivity is low
  • On average, 2 co-inventors for each inventor
Ongoing work
Toward a new record linkage
• “Panta rei” (Heraclitus): everything flows, everything is constantly changing.
• Databases may keep track of these never-ending changes
• Examples
  • People change names
    • Xin Dong → Xin Luna Dong
  • People change jobs
    • Halevy moves from Univ. of Washington to Google
  • Nations change
    • Yugoslavia → Serbia-Montenegro → Serbia, Kosovo
Temporal Record Linkage
• Temporal record linkage is a new research area concerned with discovering whether two records represent the same real-world object described at two different timestamps
• Work done in collaboration with AT&T Research Labs
Preliminary results
• The results are under publication, so we can provide only some preliminary results
• PEI ADDS data from early binding results