Real-Time Entity Resolution Using Transliteration for Arabic Datasets

Using Transliteration with Entity Resolution for Arabic Datasets • Marwah Alian, Ghazi Al-Naymat, Banda Ramadan

Real-Time Entity Resolution • Entity resolution (ER) is the process of identifying records that represent the same real-world entity. • A real-world entity could be a person, a product, a business, or any other object in the real world.

The number of comparisons required for query-based matching in the real-time ER process

Examples of duplicates • A patient who is represented several times in a hospital DB • A product that is inserted many times in an inventory list.

Dynamic Similarity-Aware Inverted Indexing (DySimII) • DySimII is a real-time ER technique that works with dynamic databases • It aims at providing real-time entity resolution for a stream of query records. • To use this approach, three indexes are needed

DySimII steps • Step 1: Data Preprocessing • Step 2: Indexing • Step 3: Record Pair Comparison • Step 4: Classification • Step 5: Evaluation

Data Preprocessing • Clean and standardize the data before it is used. • Flow (figure): initial data → cleaning → data ready to be used

Indexing • Generates candidate records that potentially correspond to matches. • Values are added to the indexes whenever a query is processed.

Inverted Index • An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents.
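As an illustration of the definition above, here is a minimal sketch of a word-level inverted index over a set of documents. The function name `build_inverted_index` and the sample documents are hypothetical and are not taken from the DySimII implementation.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

# Hypothetical sample documents
documents = {
    "d1": "entity resolution for Arabic data",
    "d2": "real-time entity resolution",
}
index = build_inverted_index(documents)
```

Looking up a word in `index` returns every location where it occurs, without scanning the documents again.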

Inverted Indexes • Three indexes are used in DySimII: • Block Index (BI) • Similarity Index (SI) • Record Index (RI)

Block Index (BI) • An inverted index that stores unique attribute values and their blocking key values. • The blocking key is generated using a phonetic encoding function such as Soundex.
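A minimal sketch of how a block index could be built, assuming standard American Soundex as the encoding function. The helper names `soundex` and `build_block_index` and the sample values are hypothetical; the original DySimII code may differ.

```python
from collections import defaultdict

# Soundex digit for each consonant group
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """Return the 4-character American Soundex code of a Latin-script word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]

def build_block_index(values):
    """Map each blocking key to the unique attribute values that share it."""
    block_index = defaultdict(list)
    for value in values:
        key = soundex(value)
        if value not in block_index[key]:
            block_index[key].append(value)
    return dict(block_index)

# Hypothetical attribute values
bi = build_block_index(["smith", "smyth", "tony", "smith"])
```

Values that sound alike, such as "smith" and "smyth", end up in the same block, so only they need to be compared with each other.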

Similarity Index (SI) • An inverted index that stores precalculated similarities between attribute values that are in the same block. • Keys in the SI are unique attribute values; each key points to a list of precalculated similarities between this value and all other values in the same block.
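The sketch below shows one way the similarity index could be filled from a block index. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity function, which is an assumption; the source does not state which string comparison is used. The name `build_similarity_index` and the sample block index are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Stand-in string similarity in [0, 1], rounded to two decimals."""
    return round(SequenceMatcher(None, a, b).ratio(), 2)

def build_similarity_index(block_index):
    """For every value, store its similarity to each other value in the same block."""
    si = {}
    for values in block_index.values():
        for value in values:
            si.setdefault(value, [])
        # Compare each pair once and store the result under both values
        for a, b in combinations(values, 2):
            sim = similarity(a, b)
            si[a].append((b, sim))
            si[b].append((a, sim))
    return si

# Hypothetical block index: blocking key -> attribute values
block_index = {"S530": ["smith", "smyth"], "T500": ["tony"]}
si = build_similarity_index(block_index)
```

Because similarities are only computed inside a block, a value that is alone in its block, such as "tony" here, has an empty list.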

Record Index (RI) • Stores all unique attribute values and their associated record identifiers. • Keys in the RI are unique attribute values; each key points to a list of all record identifiers that have the same attribute value.

DySimII Indexing Example • RI – record index • BI – block index • SI – similarity index • Source: Ramadan, B., Christen, P., Liang, H., Gayler, R., Hawking, D.: Dynamic similarity-aware inverted indexing for real-time entity resolution. (DMApps’13), Australia (2013)

DySimII with Arabic Data • Several indexing techniques are used for unstructured Arabic data (texts and documents). • But no indexing technique is used for Arabic entity resolution. • In this work we apply DySimII to Arabic data and evaluate its efficiency for Arabic ER.

Block Index (BI) • This index uses the phonetic encoding of names or words as the blocking key value. • For Arabic data, we apply transliteration before passing words to the Soundex function.
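A minimal sketch of the transliteration step, assuming a character-by-character mapping in the style of the Buckwalter scheme. The source does not name the transliteration scheme that was used, so the mapping table, the name `transliterate`, and the sample names are illustrative assumptions only. Only the letters needed for the sample names are included.

```python
# Small subset of a Buckwalter-style mapping (assumption: the scheme used in
# the original work is not stated in the slides).
ARABIC_TO_LATIN = {
    "\u0627": "A",  # alef
    "\u062d": "H",  # hah
    "\u062f": "d",  # dal
    "\u0633": "s",  # seen
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    "\u0648": "w",  # waw
    "\u064a": "y",  # yeh
}

def transliterate(text):
    """Replace each mapped Arabic letter; leave unmapped characters unchanged."""
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)

mohammad = "\u0645\u062d\u0645\u062f"        # محمد
mahmoud = "\u0645\u062d\u0645\u0648\u062f"   # محمود
salem = "\u0633\u0627\u0644\u0645"           # سالم
latin_names = [transliterate(name) for name in (mohammad, mahmoud, salem)]
```

The resulting Latin strings can then be passed to a Soundex function, which only operates on the letters A to Z.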

Similarity Index (SI) • Example entries (figure): محمد and محمود: 0.89 • محمد and أحمد: 0.6 • سالم and سليمان: 0.4 • …..

Record Index (RI) • Example (figure): the keys محمد, أحمد, سالم, ….. each point to a list of record identifiers (r2, r10, r4, r6, r120, r200 in the figure).

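Record Pair Comparison • Example (figure): values shown are محمد, أحمد, سالم, سليمان, محمد, محمود. • The attribute similarities 0.4, 0.89, and 1 sum to a total similarity of 2.29.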
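The comparison step can be sketched as a score accumulator over the record index and the similarity index. This is a simplified illustration under several assumptions: a single shared index for all attributes, query values that are already indexed, and hypothetical Latin-script sample data chosen so that the top score reproduces the 0.4 + 0.89 + 1 = 2.29 sum from the example. The name `match_query` is not from the source.

```python
from collections import defaultdict

def match_query(query_values, record_index, similarity_index):
    """Sum per-record similarities for a query and return records ranked by score."""
    scores = defaultdict(float)
    for value in query_values:
        # Exact matches contribute a similarity of 1.0
        for rec_id in record_index.get(value, []):
            scores[rec_id] += 1.0
        # Similar values in the same block contribute their precalculated similarity
        for other, sim in similarity_index.get(value, []):
            for rec_id in record_index.get(other, []):
                scores[rec_id] += sim
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical indexes
record_index = {
    "mahmoud": ["r1"],
    "ahmad": ["r1", "r2"],
    "sulaiman": ["r1"],
    "mohammad": ["r2"],
}
similarity_index = {
    "mohammad": [("mahmoud", 0.89)],
    "salem": [("sulaiman", 0.4)],
}
ranked = match_query(["mohammad", "ahmad", "salem"], record_index, similarity_index)
```

Here record r1 accumulates 0.89 + 1 + 0.4 and is ranked above r2, which accumulates two exact matches.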

Evaluation • The entity resolution process can be evaluated using a variety of measures. • We use two metrics: • Matching accuracy • Mean reciprocal rank (MRR), averaged over all queries
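For reference, here is a minimal sketch of the mean reciprocal rank, assuming its standard definition: for each query, take 1 divided by the rank of the true match in the returned list (0 if it is missing), then average over all queries. The exact variant used in the experiments is not given in the slides, and the function name and sample data are hypothetical.

```python
def mean_reciprocal_rank(ranked_results, true_matches):
    """Average of 1/rank of the true match per query; a missing match counts as 0."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for ranking, truth in zip(ranked_results, true_matches):
        for rank, rec_id in enumerate(ranking, start=1):
            if rec_id == truth:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Hypothetical results for three queries
ranked_results = [["r1", "r2"], ["r5", "r3", "r4"], ["r9"]]
true_matches = ["r1", "r3", "r7"]
mrr = mean_reciprocal_rank(ranked_results, true_matches)
```

In this example the true matches are found at rank 1, at rank 2, and not at all, so the reciprocal ranks are 1, 0.5, and 0.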

Data Preparation • We use the GeCo corruptor to generate corrupted duplicates for the deduplication experiments. • GeCo was adapted to work with Arabic datasets. • We use four corruptors: • Missing value corruptor • OCR value corruptor: depends on similar-looking characters • Keyboard corruptor • Edit value corruptor: randomly selects an edit operation (insert, delete, substitute, or transpose)
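To make the edit value corruptor concrete, here is a minimal sketch of a single random edit operation. It is not GeCo's API; the function name `edit_corrupt`, its parameters, and the sample alphabet are hypothetical.

```python
import random

def edit_corrupt(value, alphabet, rng):
    """Apply one randomly chosen edit operation to a string."""
    if not value:
        return value
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    if op == "transpose" and len(value) < 2:
        op = "substitute"  # a single character cannot be transposed
    if op == "insert":
        pos = rng.randint(0, len(value))
        return value[:pos] + rng.choice(alphabet) + value[pos:]
    if op == "delete":
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]
    if op == "substitute":
        pos = rng.randrange(len(value))
        return value[:pos] + rng.choice(alphabet) + value[pos + 1:]
    # transpose: swap two adjacent characters
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]

rng = random.Random(0)
corrupted = edit_corrupt("salem", "abcdefghijklmnopqrstuvwxyz", rng)
```

Each call changes the length of the value by at most one character. Note that a substitution or transposition can occasionally reproduce the original value, for example when the substituted letter equals the original one.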

Results • Especially with more corruption in the data

Conclusion • The results showed that, to use DySimII with Arabic script, it is important to apply the transliteration step before building the indexes. These indexes were then used successfully in the ER process. • The results of using transliteration were compared with the use of the stem as the blocking key, and transliteration outperformed the stem.
