
NLP and Big Data

Shanxi HPC Research Center. Xiaoge LI, lixg@xupt.edu.cn. WBDB2013, Xi’an, China.


Presentation Transcript


  1. Shanxi HPC Research Center • NLP and Big Data • Xiaoge LI, lixg@xupt.edu.cn • WBDB2013, Xi’an, China

  2. Introduction • The Internet is a big knowledge base • Unstructured • NLP & IE • “Understand” human language • Unstructured data • Structured data

  3. Problems • Human language changes • “Let's Google it!” • Net language (LOL, 给力 “awesome”) • Compound words (JFK airport) • Domain knowledge • Domain-specific training sets • Chinese tokenization • 小菊/nr 的/u 生活/vn 很/d 给/v 力/vg (new word 给力 split in two) • 小菊/nr 的/u 生活/vn 很/d 给力/a (new word 给力 kept whole as an adjective)
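The two segmentations on this slide can be reproduced with a toy dictionary-based segmenter. The sketch below uses forward maximum matching, which is an assumption chosen for illustration; the slides do not say which segmentation method the engine uses, and the lexicons are made up for this example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a word list (toy example)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "小菊的生活很给力"

# Hypothetical lexicon that predates the net word 给力.
old_lexicon = {"小菊", "的", "生活", "很", "给", "力"}
# Same lexicon after the new word has been acquired.
new_lexicon = old_lexicon | {"给力"}

before = forward_max_match(sentence, old_lexicon)
after = forward_max_match(sentence, new_lexicon)

print("/".join(before))
print("/".join(after))
```

With the older lexicon the new word is broken into 给 and 力; once 给力 is in the lexicon it is kept as one token, which is why new-word acquisition matters for tokenization.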

  4. NLP needs big data • Unsupervised (weakly supervised) learning • Knowledge acquisition • Relationships • New words • NE gazetteer

  5. System Architecture

  6. Knowledge acquisition • Large-scale corpus from the Web • Weakly supervised learning • Bootstrapping technique • MapReduce, HBase • Location NE and new word: P = 87.28%, 72.1%
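As a rough illustration of the bootstrapping idea named on this slide, the sketch below grows a location list from one seed by alternating between harvesting context patterns and applying them. The corpus, the seed, and the (left word, right word) pattern shape are all assumptions made for this example; the slides do not describe the actual patterns or scoring used.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning context patterns and finding new entities."""
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: harvest (left, right) context patterns around known entities.
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in known:
                    patterns.add((tokens[i - 1], tokens[i + 1]))
        # Step 2: apply every pattern to propose new entities.
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in patterns:
                    known.add(tokens[i])
    return known, patterns


# Toy, pre-tokenized corpus (hypothetical sentences).
corpus = [
    "he lives in Beijing with family".split(),
    "she lives in Xian with friends".split(),
    "flights from Xian to Tokyo".split(),
    "flights from Qingdao to Seoul".split(),
]

locations, patterns = bootstrap(corpus, {"Beijing"})
print(sorted(locations))
print(sorted(patterns))
```

Each step is a pass over independent sentences, so it maps naturally onto a MapReduce job. This sketch accepts every pattern it sees; a practical system would normally score patterns and candidates to limit drift away from the target class.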

  7. Chinese NLP & IE engine • Pipeline • FST & statistical mixture model • Input: plain text • Output: structured XML • MapReduce • Speed: 500 KB/s on 10 nodes
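The plain-text-in, XML-out shape of such a pipeline can be sketched as a chain of small stages. Everything below is hypothetical: the stage names, the toy gazetteer, and the XML element names are invented for illustration and are not the engine's actual schema, and a dictionary lookup stands in for the FST and statistical models.

```python
import xml.etree.ElementTree as ET

# Toy gazetteer standing in for the real FST/statistical components.
GAZETTEER = {"Qingdao": "location", "Alice": "person"}


def tokenize(text):
    """Stage 1: split plain text into tokens (whitespace only, for the toy)."""
    return text.split()


def tag_entities(tokens):
    """Stage 2: attach an entity label, or None, to each token."""
    return [(tok, GAZETTEER.get(tok)) for tok in tokens]


def to_xml(tagged):
    """Stage 3: serialize the tagged tokens as a small XML document."""
    doc = ET.Element("document")
    for tok, label in tagged:
        node = ET.SubElement(doc, "entity" if label else "token")
        if label:
            node.set("type", label)
        node.text = tok
    return ET.tostring(doc, encoding="unicode")


def run_pipeline(text):
    """Run the three stages in order on one input text."""
    return to_xml(tag_entities(tokenize(text)))


xml_out = run_pipeline("Alice visited Qingdao")
print(xml_out)
```

Because each document is processed independently, `run_pipeline` could serve as the map function of a MapReduce job, with one call per input document.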

  8. Information object Profile and Event

  9. Example profile • In a concept-based profile, the attributes are filled by its participant profiles.

  10. Information Network

  11. Cross-document information fusion • Hierarchical clustering • MapReduce • HBase • Half a million profiles • Computing complexity • P = 94.65%, R = 88.24%, F = 91.33%
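A minimal sketch of hierarchical clustering for profile fusion follows, assuming each profile is a set of attribute strings, Jaccard similarity, and single-link merging with a fixed threshold. These choices, the profile ids, and the attribute values are illustrative assumptions; the slides do not state the similarity measure or linkage actually used.

```python
def jaccard(a, b):
    """Overlap between two attribute sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerative(profiles, threshold=0.5):
    """Merge the most similar clusters until none reaches the threshold.

    profiles: dict mapping a profile id to a set of attribute strings.
    Returns a list of sets of profile ids.
    """
    clusters = [{pid} for pid in profiles]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(profiles[a], profiles[b])
                          for a in clusters[i] for b in clusters[j])
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]


# Hypothetical profiles extracted from three different documents.
profiles = {
    "doc1#1": {"name:Li Wei", "org:CAU", "title:professor"},
    "doc2#4": {"name:Li Wei", "org:CAU", "city:Beijing"},
    "doc3#2": {"name:Li Wei", "org:Qingdao Port", "title:engineer"},
}

fused = agglomerative(profiles)
print(fused)
```

The first two profiles share a name and an organization and are merged; the third shares only the name and stays separate. The naive loop compares every pair of clusters on every pass, a cost that grows quickly with the number of profiles and is presumably what the "Computing complexity" bullet refers to at half a million profiles.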

  12. Information graph • Multi-dimensional • Orange: location • Gray: organization • Blue: person • Source: 2012 People’s Daily • Query: China Agricultural University • Expand 1 level

  13. Organization-Organization Network • Query: China Agricultural University • Filter: Organization

  14. Location-Person Network • Query: 青岛港 (Port of Qingdao) • Filter: Location

  15. Person-Location Network • Query: 金日成 (Kim Il-sung)
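The network queries on the preceding slides (a query entity, expansion by a number of levels, and an optional entity-type filter) can be sketched with a small typed graph. The `InfoGraph` class and all edges below are invented for illustration; they are not taken from the People's Daily data or from the system's real query interface.

```python
from collections import defaultdict


class InfoGraph:
    """Undirected graph whose nodes carry an entity type (toy sketch)."""

    def __init__(self):
        self.node_type = {}
        self.adj = defaultdict(set)

    def add_edge(self, a, a_type, b, b_type):
        self.node_type[a] = a_type
        self.node_type[b] = b_type
        self.adj[a].add(b)
        self.adj[b].add(a)

    def expand(self, query, levels=1, type_filter=None):
        """Return nodes within `levels` hops of `query`, optionally by type."""
        frontier, seen = {query}, {query}
        for _ in range(levels):
            # Breadth-first step: neighbors of the frontier not yet visited.
            frontier = {n for f in frontier for n in self.adj.get(f, ())} - seen
            seen |= frontier
        result = seen - {query}
        if type_filter is not None:
            result = {n for n in result if self.node_type[n] == type_filter}
        return result


# Hypothetical edges, made up for this example.
g = InfoGraph()
cau = "China Agricultural University"
g.add_edge(cau, "organization", "Beijing", "location")
g.add_edge(cau, "organization", "Ministry of Education", "organization")
g.add_edge(cau, "organization", "Person A", "person")
g.add_edge("Ministry of Education", "organization", "Person B", "person")

one_level = g.expand(cau, levels=1)
orgs_only = g.expand(cau, levels=1, type_filter="organization")
two_levels = g.expand(cau, levels=2)
print(sorted(one_level), sorted(orgs_only), sorted(two_levels))
```

Expanding one level returns the direct neighbors of every type; adding the filter keeps only organizations, which corresponds to the organization-organization view.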

  16. Future Work • Query language • Graph mining • Enhance NLP engine • Visualization

  17. Questions? Thank you
