Information Extraction Research @ Yahoo! Labs Bangalore


Presentation Transcript


  1. Information Extraction Research @ Yahoo! Labs Bangalore • Rajeev Rastogi, Yahoo! Labs Bangalore

  2. The most visited site on the internet • 600 million+ users per month • Super popular properties • News, finance, sports • Answers, flickr, del.icio.us • Mail, messaging • Search

  3. Unparalleled scale • 25 terabytes of data collected each day • Over 4 billion clicks every day • Over 4 billion emails per day • Over 6 billion instant messages per day • Over 20 billion web documents indexed • Over 4 billion images searchable • No other company on the planet processes as much data as we do!

  4. Yahoo! Labs Bangalore • Focus is on basic and applied research • Search • Advertising • Cloud computing • University relations • Faculty research grants • Summer internships • Sharing data/computing infrastructure • Conference sponsorships • PhD co-op program

  5. What does search look like today?

  6. Search results of the future: Structured abstracts • Example sources shown: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd

  7. Search results of the future: Intelligent ranking • Rank by price

  8. A key technology for enabling search transformation: Information extraction (IE)

  9. Information extraction (IE) • Goal: Extract structured records from Web pages • Example attributes: Name, Category, Address, Map, Phone, Price, Reviews

  10. Multiple verticals • Business, social networking, video, …

  11. One schema per vertical • Attribute labels shown across the example schemas: Name, Title, Posted by, Date, Price, Title, Education, Category, Address, Connections, Phone, Price, Rating, Views

  12. IE on the Web is a hard problem • Web pages are noisy • Pages belonging to different Web sites have different layouts

  13. Web page types • Hand-crafted • Template-based

  14. Template-based pages • Pages within a Web site are generated using scripts and have very similar structure • This shared structure can be leveraged for extraction • ~30% of crawled Web pages • Information rich, frequently appear in the top results of search queries • E.g. search query: “Chinese Mirch New York” • 9 template-based pages in the top 10 results

  15. Wrapper Induction • Enables extraction from template-based pages • Learn: Website pages → Sample → Sample pages → Annotate → Annotations → Learn wrappers → XPath rules • Extract: Website pages → Apply wrappers → Records

  16. Example • Generalize XPath: /html/body/div/div/div/div/div/div/span → /html/body//div//span

  17. Filters • Apply filters to prune the multiple candidates that match an XPath expression • XPath: /html/body//div//span • Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
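The XPath-plus-filter step on slides 16 and 17 can be sketched as below. This is a minimal illustration, not the wrapper system from the talk: the page fragment is hypothetical and well-formed so that Python's standard `xml.etree.ElementTree` can parse it (real pages would need an HTML parser), and the function name `extract_phone_candidates` is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment with spans at different depths.
PAGE = """
<html><body>
  <div><div><span>Chinese Mirch</span></div></div>
  <div><div><span>(212) 532-3663</span></div></div>
  <div><span>Open daily</span></div>
</body></html>
"""

# Phone filter in the shape of the slide's regex, with literal parentheses escaped.
PHONE_RE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def extract_phone_candidates(page_xml: str) -> list[str]:
    """Select spans with the generalized XPath, then keep only phone-like text."""
    root = ET.fromstring(page_xml.strip())
    # ElementTree supports an XPath subset; relative to the <html> root,
    # "./body//div//span" plays the role of /html/body//div//span.
    seen: set[int] = set()
    texts: list[str] = []
    for span in root.findall("./body//div//span"):
        # Nested divs may yield the same span more than once, so deduplicate.
        if id(span) in seen:
            continue
        seen.add(id(span))
        texts.append((span.text or "").strip())
    # The XPath alone matches every span; the regex filter prunes the candidates.
    return [t for t in texts if PHONE_RE.fullmatch(t)]


if __name__ == "__main__":
    print(extract_phone_candidates(PAGE))
```

The generalized path matches all three spans in the fragment, which is why the filter is needed to isolate the phone number.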

  18. Limitations of wrappers • Won’t work across Web sites due to different page layouts • Scaling to thousands of sites can be a challenge • Need to learn a separate wrapper for each site • Annotating example pages from thousands of sites can be time-consuming & expensive

  19. Research challenge • Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site • Only annotate pages from a few sites initially as training data

  20. Conditional Random Fields (CRFs) • Models conditional probability distribution of label sequence y = y1, …, yn given input sequence x = x1, …, xn • fk: features, λk: weights • Choose λk to maximize log-likelihood of training data • Use Viterbi algorithm to compute label sequence y with highest probability
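The formula on this slide did not survive in the transcript. Assuming it was the standard linear-chain CRF, the distribution and training objective would read as follows, where the normalizer Z(x), the position index j, the start label y_0, and the training-example index i are notation introduced here rather than taken from the slide:

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{j=1}^{n} \sum_{k} \lambda_k \, f_k(y_{j-1}, y_j, \mathbf{x}, j) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{j=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{j-1}, y'_j, \mathbf{x}, j) \Big)

\hat{\lambda}
  = \arg\max_{\lambda} \sum_{i} \log p\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\big),
\qquad
\hat{\mathbf{y}}
  = \arg\max_{\mathbf{y}} \; p(\mathbf{y} \mid \mathbf{x})
```

The last expression is the decoding problem that the Viterbi algorithm solves by dynamic programming over positions j.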

  21. CRFs-based IE • Web pages can be viewed as labeled sequences (labels shown: Name, Category, Address, Phone, Noise) • Train CRF using pages from few Web sites • Then use trained CRF to extract from remaining sites
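Extraction with a trained CRF comes down to Viterbi decoding over the sequence of page fields. The sketch below assumes the weighted feature sums have already been collapsed into per-position label scores (`emit`) and label-to-label scores (`trans`); those names, the score values, and the three-field example are hypothetical, not from the talk.

```python
def viterbi(
    emit: list[dict[str, float]],
    trans: dict[tuple[str, str], float],
    labels: list[str],
) -> list[str]:
    """Return the highest-scoring label sequence under additive (log-space) scores."""
    if not emit:
        return []
    # best[j][y]: best score of any labeling of positions 0..j that ends in label y.
    best = [{y: emit[0].get(y, 0.0) for y in labels}]
    back: list[dict[str, str]] = []
    for j in range(1, len(emit)):
        scores: dict[str, float] = {}
        ptr: dict[str, str] = {}
        for y in labels:
            # Pick the best previous label for ending at y in position j.
            prev = max(labels, key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans.get((prev, y), 0.0) + emit[j].get(y, 0.0)
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical scores for a page with three fields.
LABELS = ["Name", "Phone", "Noise"]
EMIT = [
    {"Name": 2.0, "Noise": 1.0},
    {"Name": 1.6, "Noise": 1.5},
    {"Phone": 3.0},
]
# A strong penalty on Name following Name (an assumed, learned preference).
TRANS = {("Name", "Name"): -5.0}

if __name__ == "__main__":
    print(viterbi(EMIT, TRANS, LABELS))
```

With the transition penalty the second field is decoded as Noise; with an empty `trans` table the per-field scores alone would label both of the first two fields as Name.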

  22. Drawbacks of CRFs • Require too many training examples • Have been used previously to segment short strings with similar structure • However, may not work too well across Web sites that • contain long pages with lots of noise • have very different structure

  23. An alternate approach that exploits site knowledge • Build attribute classifiers for each attribute • Use pages from a few initial Web sites • For each page from a new Web site • Segment page into sequence of fields (using static repeating text) • Use attribute classifiers to assign attribute labels to fields • Use constraints to disambiguate labels • Uniqueness: an attribute occurs at most once in a page • Proximity: attribute values appear close together in a page • Structural: relative positions of attributes are identical across pages of a Web site
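One way to read "segment page into sequence of fields (using static repeating text)" is sketched below. It assumes each page has already been reduced to a list of text nodes, and it treats any string that occurs on every page of the site as static template text; both the representation and the sample pages are assumptions for illustration, not details from the talk.

```python
def segment(pages: list[list[str]]) -> list[list[str]]:
    """Split each page into fields, using text shared by all pages as separators."""
    if not pages:
        return []
    # Static text: strings that repeat on every page of the site.
    static = set.intersection(*(set(page) for page in pages))
    segmented: list[list[str]] = []
    for page in pages:
        fields: list[str] = []
        current: list[str] = []
        for node in page:
            if node in static:
                # A static node closes the field collected so far.
                if current:
                    fields.append(" ".join(current))
                    current = []
            else:
                current.append(node)
        if current:
            fields.append(" ".join(current))
        segmented.append(fields)
    return segmented


# Hypothetical text nodes from two pages of one site.
PAGES = [
    ["Home", "Chinese Mirch", "Cuisine:", "Chinese, Indian", "Phone:", "(212) 532 3663"],
    ["Home", "Jewel of India", "Cuisine:", "Indian", "Phone:", "(212) 869 5544"],
]

if __name__ == "__main__":
    for fields in segment(PAGES):
        print(fields)
```

The resulting fields are what the attribute classifiers would then label. A value that happens to be identical on every sampled page would be mistaken for template text under this simple rule.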

  24. Attribute classifiers + constraints example • Page 1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663; labels assigned: Name, Category, Address, Phone • Page 2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544; labels assigned: Name, Category, Address, Phone • Page 3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200; classifier labels are ambiguous (in the transcript's order): Phone; Category, Name; Name, Noise; Address • Uniqueness constraint: Name • Precedence constraint: Name < Category • Page 3 after applying the constraints: labels Name, Category, Address, Phone

  25. Performance evaluation: Datasets • 100 pages from 5 restaurant Web sites with very different structure • www.citysearch.com • www.fromers.com • www.nymag.com • www.superpages.com • www.yelp.com • Extract attributes: Name, Address, Phone num, Hours of operation, Description

  26. Methods considered • CRFs, attribute classifiers + constraints • Features • Lexicon: Words in the training Web pages • Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay,… • Attribute-level: Num of words, Overlap with title,…
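The regex features named on this slide might be computed per token roughly as follows. The feature names come from the slide; their exact definitions are not given there, so the patterns and the day list below are assumptions.

```python
import re

# Assumed day vocabulary (full names and three-letter abbreviations).
DAYS = {
    "mon", "tue", "wed", "thu", "fri", "sat", "sun",
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
}


def regex_features(token: str) -> dict[str, bool]:
    """Boolean surface features for one token, keyed by the slide's feature names."""
    return {
        "isAlpha": re.fullmatch(r"[A-Za-z]+", token) is not None,
        "isAllCaps": re.fullmatch(r"[A-Z]+", token) is not None,
        "isNum": re.fullmatch(r"[0-9]+", token) is not None,
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        # Tolerate a trailing period, as in "Mon.".
        "isDay": token.lower().rstrip(".") in DAYS,
    }


if __name__ == "__main__":
    for tok in ["10016", "NY", "Mon.", "Club"]:
        print(tok, regex_features(tok))
```

A feature such as is5DigitNum is a cheap signal for ZIP codes inside an Address field, and isDay for Hours of operation.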

  27. Evaluation methodology • Metrics • Precision, recall, F1 for attributes • Test on one site, use pages from remaining 4 sites as training data • Average measures over all 5 sites
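The evaluation protocol above can be sketched as follows. The site list is taken from slide 25 as written; representing extractions as sets of (page, attribute, value) tuples and the function names are assumptions made for this example.

```python
from collections.abc import Iterator

SITES = [
    "www.citysearch.com",
    "www.fromers.com",
    "www.nymag.com",
    "www.superpages.com",
    "www.yelp.com",
]


def prf1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall and F1 of predicted extractions against gold ones."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def leave_one_site_out(sites: list[str]) -> Iterator[tuple[list[str], str]]:
    """Yield (training sites, test site) pairs, holding out each site once."""
    for test_site in sites:
        yield [s for s in sites if s != test_site], test_site


def average(rows: list[tuple[float, float, float]]) -> tuple[float, ...]:
    """Average each metric over the per-site results."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


if __name__ == "__main__":
    for train, test in leave_one_site_out(SITES):
        print("test:", test, "| train on", len(train), "sites")
```

Per-attribute scores would come from calling `prf1` once per attribute on each held-out site and then averaging the five results.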

  28. Experimental results • [Charts: Precision and Recall]

  29. Other IE scenarios: Browse page extraction • Similar-structured records

  30. IE big picture/taxonomy • Things to extract from • Template-based, browse, hand-crafted pages, text • Things to extract • Records, tables, lists, named entities • Techniques used • Structure-based (HTML tags, DOM tree paths) – e.g. Wrappers • Content-based (attribute values/models) – e.g. dictionaries • Structure + Content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs • Level of automation • Manual, supervised, unsupervised
