Recovering Semantics of Tables on the Web

Presentation Transcript


  1. Recovering Semantics of Tables on the Web Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, Chung Wu • 2011, VLDB • Presented by Xunnan Xu

  2. Problems to Solve • Annotating tables (recovering their semantics) • The table's title could be missing • The subject column could be missing • Relevant context may not appear anywhere near the table • Improving table search • “Bloom period (Property) of shrubs (Class)” <- the query form this paper focuses on • “Color (Property) of Azalea (Instance)”

  3. Classify Items Using Databases • isA database • Berlin is a city. • CSCI572 is a course. • relation database • Microsoft is headquartered in Redmond. • San Francisco is located in California. • Why is this useful? • Tables are structured, so the more “popular” (well-known) values in a column can help identify the rest
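
Both databases are, at heart, large lookup collections of (instance, class) pairs and (argument, relation, argument) facts. As a minimal sketch (not the paper's implementation), they could be held in memory roughly as below; ISA_DB, RELATION_DB, and classes_of are illustrative names, not from the paper:

    # isA database: instance -> set of candidate classes
    ISA_DB = {
        "berlin": {"city"},
        "csci572": {"course"},
    }

    # relation database: (argument1, argument2) -> set of relation phrases
    RELATION_DB = {
        ("microsoft", "redmond"): {"is headquartered in"},
        ("san francisco", "california"): {"is located in"},
    }

    def classes_of(instance):
        """Candidate classes for an instance, per the isA database."""
        return ISA_DB.get(instance.lower(), set())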

  4. Construction of the isA Database • Extract pairs from web pages with patterns like: <[..] Class (C) [such as|including] Instance (I) [and|,|.]> • Easy? Not really… • To find the boundary of a Class: take noun phrases whose last component is a plural-form noun and that are not contained in, and do not contain, another noun phrase • “Michigan counties such as” • “Among the lovely cities” • To check the boundary of an Instance: I must occur as an entire query in the query logs
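
For illustration only, here is a rough regex sketch of the “C such as I” extraction; the real system's boundary checks (plural head noun, whole-query lookup in the logs) are much more careful than this simplified PATTERN, and extract_pairs is a made-up helper name:

    import re

    # Simplified Hearst-style pattern "<Class> such as/including <Instance>".
    PATTERN = re.compile(
        r"(?P<cls>[A-Za-z][A-Za-z ]*s)\s+(?:such as|including)\s+"
        r"(?P<inst>[A-Z][A-Za-z0-9 ]*?)(?:,|\.|\s+and\s)"
    )

    def extract_pairs(sentence):
        """Yield rough (instance, class) candidates from one sentence."""
        for m in PATTERN.finditer(sentence):
            yield m.group("inst").strip(), m.group("cls").strip()

    # extract_pairs("Michigan counties such as Allegan, Barry and Berrien vote.")
    # yields ("Allegan", "Michigan counties")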

  5. Improvements • Mine more instances • “headquartered in I” => I is a city • Handle duplicate sentences: • Sentence fingerprint = the hash of the first 250 characters • Score the pairs: • Score(I, C) = Size({Pattern(I, C)})² × Freq(I, C) • {Pattern(I, C)} – the set of patterns the pair was extracted with • Freq(I, C) – the number of times the pair appears • Similar in spirit to tf-idf
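
The scoring rule above is simple enough to write out directly. A small sketch with hypothetical record/score helpers, assuming each extraction reports which pattern produced it:

    from collections import defaultdict

    # Score(I, C) = Size({Pattern(I, C)})^2 * Freq(I, C)
    patterns_seen = defaultdict(set)  # (instance, class) -> distinct patterns
    frequency = defaultdict(int)      # (instance, class) -> extraction count

    def record(instance, cls, pattern_id):
        """Register one extraction of (instance, cls) produced by pattern_id."""
        patterns_seen[(instance, cls)].add(pattern_id)
        frequency[(instance, cls)] += 1

    def score(instance, cls):
        """Pairs seen via many distinct patterns, many times, score highest."""
        key = (instance, cls)
        return len(patterns_seen[key]) ** 2 * frequency[key]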

  6. Construction of the Relation Database • TextRunner was used to extract the relations • TextRunner is a research project at the University of Washington • It uses a Conditional Random Field (CRF) to detect relations among noun phrases • A CRF is a popular model in machine learning: pre-defined feature functions are applied to the phrases and combined into the probability of a labeling of the sentence (normalized score between 0 and 1) • Example: f(sentence, i, label_i, label_{i-1}) = 1 if word i is “in” and label_{i-1} is an adjective, otherwise 0 => Microsoft is headquartered in beautiful Redmond.
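
The slide's example feature can be written down literally. In the sketch below, sentence is a list of tokens and "ADJ" is an assumed label name; a trained CRF attaches a learned weight to this function (and to many others like it) and normalizes the weighted sums over all positions into a sentence-level probability:

    def f_in_after_adjective(sentence, i, label_i, label_prev):
        """1 iff the current word is "in" and the previous label is an adjective."""
        return 1 if sentence[i].lower() == "in" and label_prev == "ADJ" else 0

    tokens = "Microsoft is headquartered in beautiful Redmond .".split()
    # Fires only at the position of "in", and only for candidate labelings
    # that mark the preceding token as an adjective:
    print(f_in_after_adjective(tokens, 3, "O", "ADJ"))   # 1
    print(f_in_after_adjective(tokens, 3, "O", "VERB"))  # 0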

  7. Assign Labels to Instances • Assumptions • If many values in a column are assigned to a class, then the remaining values very likely belong to it as well • The best label is the one most likely to have “produced” the observed values in the column (maximum likelihood hypothesis) • Definitions • v_i – value i in the column • L_i – a candidate label for the column; L(A) – the best label • U(L_i, V) – the score of label L_i given the set V of values

  8. Assign Labels to Instances • According to the maximum likelihood assumption, the best label maximizes U(L_i, V) = Pr[L_i | v_1, …, v_n] • After applying Bayes' rule (and assuming the values are independent given the label) and normalizing: U(L_i, V) = K_s · Pr[L_i] · ∏_j ( Pr[L_i | v_j] / Pr[L_i] ) • where K_s is the normalization factor • Pr[L_i] -> estimated by the score in the isA database • Pr[L_i | v_j] -> score(v_j, L_i) / ∑_k score(v_j, L_k) • Done?
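
A minimal sketch of this scoring rule, assuming hypothetical prior(label) and cond(label, value) functions that return the isA-database estimates of Pr[L_i] and Pr[L_i | v_j]; K_s is dropped because it is identical for every candidate label, and log space avoids underflow on long columns:

    import math

    def label_log_score(label, values, prior, cond):
        """Log of the unnormalized score U(label, values)."""
        log_u = math.log(prior(label))
        for v in values:
            # A zero cond() would blow up here - exactly the problem
            # that the smoothing on the next slide addresses.
            log_u += math.log(cond(label, v)) - math.log(prior(label))
        return log_u

    def best_label(candidate_labels, values, prior, cond):
        """Maximum-likelihood class label for the column."""
        return max(candidate_labels,
                   key=lambda l: label_log_score(l, values, prior, cond))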

  9. Assign Labels to Instances • What if the correct label does not exist in the database? • What if some popular instance-label pair has a much higher score than the rest? • The final equation for Pr[L_i | v_j] adds smoothing and prevents zero probabilities • For the final results, only labels whose score is above a certain threshold are kept
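
The slide does not reproduce the final smoothed formula, so the following is only an additive-smoothing sketch of the idea (never let Pr[L_i | v_j] be exactly zero, and damp the effect of one dominant pair); alpha and the back-off to the prior are assumptions, not the paper's equation:

    def smoothed_cond(label, value, score, prior, all_labels, alpha=0.01):
        """Smoothed estimate of Pr[label | value] that is never exactly 0."""
        num = score(value, label) + alpha * prior(label)
        den = sum(score(value, l) for l in all_labels) + alpha
        return num / den

    def labels_above_threshold(label_scores, threshold):
        """Keep only the labels whose final score clears the threshold."""
        return {l: s for l, s in label_scores.items() if s >= threshold}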

  10. Results – Label Quality and Quantity • Gold standard • Labels are manually evaluated by annotators • vital > okay > incorrect • Allegan, Barry, Berrien -> Michigan counties (vital) • Allegan, Barry, Berrien -> Illinois counties (incorrect) • Relation quality • 128 binary relations evaluated against the gold standard

  11. Table Search Results Comparison • Results are fetched automatically but compared manually: • 100 queries, taking the top 5 results for each – 500 results in total • Results were shuffled and rated by 3 people in a single-blind test • Scores: • right on – has all information about a large number of instances of the class and values for the property • relevant – has information about only some of the instances, or about properties closely related to the queried property • irrelevant • Candidates • TABLE – the method in this paper • GOOG – results from google.com • GOOGR – top 1,000 results from Google intersected with the table corpus • DOCUMENT – a document-based approach

  12. Table Search Results Comparison (a) right on, (b) right on or relevant, (c) right on or relevant and in a table
