Structured Querying of Web Text: A Technical Challenge

Structured Querying of Web Text: A Technical Challenge. Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko. Presenter: Shahina Ferdous, ID – 1000630375, Date – 03/23/10.


Presentation Transcript


  1. Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko Presenter: Shahina Ferdous ID – 1000630375 Date – 03/23/10

  2. Querying over Unstructured Data
     Web (Text Documents)
     • Contains a vast number of text documents, which are:
       • Unstructured
       • Accessed by keywords
       • Limited in search quality

  3. Querying over Unstructured Data
     [Figure: the Web as a keyword-in, document-out search system.]
     Example request: “Show me some people, what they invented, and the years they died”

  4. Querying over Unstructured Data
     [Figure: the Web as a keyword-in, document-out search system.]
     Example request: “List some scientists with their inventions and the years they died”

  5. Structured Querying of Web Text
     • “Show me some people, what they invented, and the years they died”
     • The paper proposes a structured Web query system called an extraction database, ExDB.
     • ExDB uses an information extraction (IE) system to extract data.
     • Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.

  6. ExDB Work Flow
     [Figure: Web text such as “In 1877, Edison invented the phonograph.” is run through extractors that populate the Facts, Types, and Synonyms tables in an RDBMS; query middleware answers queries such as invented(Edison ?e, ?i).]
     1. Run extractors
     2. Populate data model
     3. Query Processing & Applications

  7. Information Extraction
     ExDB extracts several base-level concepts by combining existing IE techniques:
     • Objects are data values in the system. Examples: Einstein, telephone, Boston, light-bulb.
     • Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A.-Einstein, Switzerland), and sells(Amazon, PlayStation).
     • Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York), and electronics(dvd-player).

  8. Information Extraction
     ExDB should also extract further relationships to make queries easier for the user:
     • Synonyms denote equivalent objects, predicates, or types. Examples: Einstein and A.-Einstein almost certainly refer to the same object; invented and has-invented refer to the same predicate.
     • Inclusion dependencies describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).
     • Functional dependencies are useful for answering queries with negation, or for explaining why an object is not an answer.
       • Example: a probabilistic FD indicating that a person can be born in only one country: born-in(?x, <country> ?y): ?x -> ?y, p = 0.95
       • Suppose a user asks for “All scientists born in Germany that taught at Princeton” and, after receiving the answers, asks the system “Why is Einstein not an answer?”
       • Using the FD above, the system can answer: because born-in(Einstein, Switzerland) holds and the FD says a person can be born in only one country, the probability of born-in(Einstein, Germany) is very low.

  9. Information Extraction

  10. ExDB Work Flow
     [Figure: Web text such as “In 1877, Edison invented the phonograph.” is run through extractors that populate the Facts, Types, and Synonyms tables in an RDBMS; query middleware answers queries such as invented(Edison ?e, ?i).]
     1. Run extractors
     2. Populate data model
     3. Query Processing & Applications

  11. Populate Data Model
     [Figure: sample Web sentences populate the Facts, Types, Synonyms, IDs, and FDs tables via TextRunner, KnowItAll, and DIRT. Sample sentences: “It was big news when Edison invented the phonograph…”, “We all know that Edison did-invent the light bulb.”, “In 1877 Edison created the phonograph.”, “Morgan was born-in 1837 into a prosperous mercantile-banking family…”, “Einstein is one of the best known scientists and intellectuals of all time.”, “He visited cities such as Boston and New York.”]
     • For fact extraction, ExDB uses an unsupervised system called TextRunner.
     • TextRunner generates a large set of extractions while running over the entire text corpus.
     • Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts by using a heavyweight linguistic parser to generate high-quality extraction triples.
     • These high-quality triples are then used as the training set for a lightweight extraction classifier that can run on the entire Web-scale corpus.
     • For type extraction, ExDB uses the KnowItAll system.
     • KnowItAll searches the entire corpus to extract hypernym, or “is-a”, relationships. For example, it extracts city(Boston) from “cities such as Seattle and Boston”.
     • It assigns each extraction a probability based on its frequency (or search engine hit count).
     • ExDB uses the DIRT algorithm to extract predicate synonyms.
     • DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented share many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
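
     The argument-pair idea behind predicate synonyms can be sketched in a few lines. This is only an illustration: it uses plain Jaccard overlap as a stand-in, which is an assumption and not DIRT's own similarity measure, and the helper names and toy facts are hypothetical.

```python
from collections import defaultdict


def argument_pairs(facts):
    """Group (subject, object) argument pairs by predicate."""
    pairs = defaultdict(set)
    for predicate, subj, obj in facts:
        pairs[predicate].add((subj, obj))
    return pairs


def overlap(pairs_a, pairs_b):
    """Jaccard overlap of two argument-pair sets (a stand-in, not DIRT's measure)."""
    if not pairs_a or not pairs_b:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


# Toy facts, made up for illustration only.
facts = [
    ("invented", "Edison", "light-bulb"),
    ("invented", "Einstein", "theory-of-relativity"),
    ("has-invented", "Edison", "light-bulb"),
    ("has-invented", "Einstein", "theory-of-relativity"),
    ("sells", "Amazon", "PlayStation"),
]

pairs = argument_pairs(facts)
synonym_score = overlap(pairs["invented"], pairs["has-invented"])
unrelated_score = overlap(pairs["invented"], pairs["sells"])
print(synonym_score, unrelated_score)
```

     Predicates whose score exceeds some threshold would be proposed as synonym candidates; the threshold itself is a tuning choice the slides do not specify.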

  12. ExDB Work Flow
     [Figure: Web text such as “In 1877, Edison invented the phonograph.” is run through extractors that populate the Facts, Types, and Synonyms tables in an RDBMS; query middleware answers queries such as invented(Edison ?e, ?i).]
     1. Run extractors
     2. Populate data model
     3. Query Processing & Applications

  13. ExDB Queries
     • ExDB lets users query the Web data model using a Datalog-like notation.
     • Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.
     • Example with constraints: q(?x, ?y) :- died-in(<scientist> ?x, 1955 ?y)
     • Example query for locally available, inexpensive electronics: q(?x, ?y, ?z) :- for-sale-in(<electronics> ?x, Seattle ?y), costs(?x, ?z), (?z < 25)
     • Another example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z), (?z < 1900)
     • Example of a projection query: q(?s) :- invented(<scientist> ?s, ?i)
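
     To make the notation concrete, here is a minimal sketch of matching a single query atom against a fact table. The table contents and the helper names (`FACTS`, `is_variable`, `match_atom`) are hypothetical, and type constraints and probabilities are deliberately left out; this is not ExDB's implementation.

```python
# Toy fact table; tuples are illustrative, not ExDB output.
FACTS = [
    ("invented", "Edison", "phonograph"),
    ("invented", "Edison", "light-bulb"),
    ("invented", "Tesla", "fluorescent-lighting"),
    ("died-in", "Einstein", "1955"),
]


def is_variable(term):
    """Variables are written with a leading '?', as in the slide's notation."""
    return term.startswith("?")


def match_atom(predicate, *terms):
    """Return one binding dict per fact that matches predicate(terms...)."""
    results = []
    for fact_predicate, *values in FACTS:
        if fact_predicate != predicate or len(values) != len(terms):
            continue
        binding = {}
        for term, value in zip(terms, values):
            if is_variable(term):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                # A constant must match the stored value exactly.
                break
        else:
            results.append(binding)
    return results


# q(?i) :- invented(Edison, ?i)
edison_inventions = sorted(b["?i"] for b in match_atom("invented", "Edison", "?i"))
print(edison_inventions)
```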

  14. Query Processing
     • Non-projecting queries
       • Involve a series of joins against tables in the Web data model.
       • The probability of a joined tuple is the product of the individual tuples' probabilities.
       • The top-k joined tuples, ranked by probability, are returned as results.
     • Example: q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z).
     [Figure: the example query joined against the Types and Facts tables.]
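
     The join-and-rank step for the example query can be sketched as follows. The tables, probabilities, and the `top_k` helper are assumptions made up for illustration; a real system would run the join inside the RDBMS rather than with nested loops.

```python
import heapq

# Toy tables; every probability here is invented for illustration.
FACTS = {
    "invented": [
        ("Edison", "phonograph", 0.9),
        ("Tesla", "fluorescent-lighting", 0.95),
        ("Morgan", "banking", 0.2),  # noisy extraction; Morgan is not typed as a scientist
    ],
    "died-in": [("Edison", "1931", 0.8), ("Tesla", "1943", 0.7)],
}
TYPES = {
    "scientist": {"Edison": 0.9, "Tesla": 0.8},
    "year": {"1931": 0.99, "1943": 0.99},
}


def top_k(k):
    """q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(?x, <year> ?z)."""
    results = []
    for x, y, p_invented in FACTS["invented"]:
        p_scientist = TYPES["scientist"].get(x)
        if p_scientist is None:
            continue  # ?x fails the <scientist> type constraint
        for x2, z, p_died in FACTS["died-in"]:
            if x2 != x:
                continue  # join condition on ?x
            p_year = TYPES["year"].get(z)
            if p_year is None:
                continue  # ?z fails the <year> type constraint
            # Joined-tuple probability is the product of its parts.
            score = p_invented * p_scientist * p_died * p_year
            results.append((score, x, y, z))
    return heapq.nlargest(k, results)


for score, x, y, z in top_k(2):
    print(f"{score:.3f}  {x}  {y}  {z}")
```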

  15. Projecting queries
     • q(?s) :- invented(<scientist> ?s, ?i) ranks scientists by the probability that the scientist invented something, without regard to the actual invention.
     • A scientist such as Tesla appears in the output of q if a tuple invented(Tesla, I0) is in the database.
     • There can be many inventions I1, …, Im for Tesla such that invented(Tesla, Ii) holds. Any one of these is sufficient to return Tesla as an answer for q.
     • The system therefore needs to compute a disjunction of m probabilistic events.
     • Because m can be very large, a large number of very low-probability extractions can unexpectedly result in a quite large probability.
     • The approach is therefore to model a panel of experts, where an expert is a tuple with a score, such as invented(Tesla, Fluorescent-Lighting), 0.95, which determines the probability of Tesla appearing in q.
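
     The problem with the naive disjunction is easy to see numerically. The sketch below assumes the m events are independent, which is an assumption for illustration; it shows only the problem and does not implement the panel-of-experts scoring.

```python
import math


def disjunction(probabilities):
    """P(at least one event is true), assuming the events are independent."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# One confident extraction versus 500 very weak ones (made-up numbers).
one_strong = disjunction([0.95])
many_weak = disjunction([0.01] * 500)

print(f"one extraction at 0.95:   {one_strong:.4f}")
print(f"500 extractions at 0.01:  {many_weak:.4f}")
```

     Under this naive combination, the pile of weak extractions outranks the single strong one, which is the behavior the slide warns about.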

  16. Result of Projecting Queries
     q(?s) :- invented(<scientist> ?s, x)
     [Table: result for Scientist / invented; rows not preserved in the transcript.]

  17. ExDB Prototype
     • Web crawl: 90M pages
     • Facts: 338M tuples, 102M objects
     • Types: 6.6M instances
     • Synonyms: 17k pairs
     • No IDs or FDs yet

  18. Applications
     • ExDB's extracted data are not meant to be examined directly; rather, they are used to build topic-specific tables that a human user can readily read.
     • Example: a synthetic table about scientists, generated by merging answers from died-in(<scientist> ?x, ?y), invented(<scientist> ?x, ?y), published(<scientist> ?x, ?y), and taught(<scientist> ?x, ?y).
     • If ExDB queries can be generated automatically from keywords, a very powerful query system can be built.
     • A Web data cube can be built over ExDB's large amount of read-only structured data.

  19. Alternative Models
     • The Schema Extraction Model aims to find the single best schema for the entire set of extractions, transforming the Web text into a traditional relational database.
     • Three good criteria for schema extraction are:
       • Simplicity (few tables).
       • Completeness (all extractions appear in the output).
       • Fullness (the output database has no NULLs).

  20. Alternative Models
     • The Text Query Model does not perform any information extraction at all; instead, it offers a descriptive query language that generates answers to a user's query very quickly.
     • Example user query:
       • Extract city/date tuples from a band's website.
       • Indicate the city where the user lives.
       • Compute the dates when the band's city and her own city are within 100 miles of each other.

  21. Questions? Thank You
