
Over9K



Presentation Transcript


  1. Over9K Alex Meng Chunshi Jin Elliott Conant Jonathan Fung

  2. Agenda • What is Over9K? • Architecture • Crawler • Postprocessor • Extractor • Web Service • Summary

  3. What is Over9K about? • Original goal: a system to predict a stock’s future volatility from news and other information gathered from the Internet. • Current goal: a system that crawls news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.

  4. Architecture

  5. Crawler • Web crawler: Nutch • Domains we crawl: • www.cnbc.com • www.reuters.com • www.marketwatch.com • … (6 total) • Nutch’s Successes • Nutch’s Failures
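
Restricting a crawl to a fixed set of domains amounts to a host whitelist check on every candidate URL. Nutch has its own URL-filter configuration for this, so the Python below is not Nutch code, only a minimal sketch of the idea. The function name `in_scope` and the sample URLs are hypothetical, and only the three domains named on the slide are listed because the other three are not given.

```python
from urllib.parse import urlparse

# Only the domains named on the slide; the remaining three are not listed there.
ALLOWED_DOMAINS = {"www.cnbc.com", "www.reuters.com", "www.marketwatch.com"}


def in_scope(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host is one of the allowed crawl domains."""
    host = urlparse(url).hostname or ""
    return host in allowed


# Hypothetical candidate links discovered on a fetched page.
candidates = [
    "http://www.reuters.com/article/example",
    "http://ads.example.net/banner",
]
in_scope_urls = [u for u in candidates if in_scope(u)]
```

An exact host match is the strictest choice; a real crawl might also want to accept subdomains, which would need a suffix check instead.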

  6. Postprocessor • Components: • NBClassifier • Classifies articles using Naive Bayes • DateParser • Parses dates using regular expressions • PageGetter • Retrieves training data from RSS feeds
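
Two of the components above can be sketched in a few lines. This is not the project's NBClassifier or DateParser: the slide does not say which features, smoothing, class labels, or date formats were used, so the bag-of-words features, add-one smoothing, the `finance`/`sports` labels, and the single "Month D, YYYY" pattern below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text, label):
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        # Words never seen in training carry no evidence, so they are skipped.
        words = [w for w in self.tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Month abbreviation -> month number, used to normalize matched dates.
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Matches dates such as "March 5, 2010" or "Sept. 12, 2009".
DATE_RE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),\s+(\d{4})\b")


def parse_date(text):
    """Return (year, month, day) for the first 'Month D, YYYY' date, else None."""
    for m in DATE_RE.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month:
            return int(m.group(3)), month, int(m.group(2))
    return None


# Toy training data; in the project, training data came from RSS feeds.
nb = NaiveBayes()
nb.train("shares fell after weak earnings report", "finance")
nb.train("stock price rises on merger news", "finance")
nb.train("team wins the championship game", "sports")
nb.train("player scores in final game", "sports")
```

With this toy data, `nb.classify("earnings and stock news")` picks the `finance` label, and `parse_date` returns `None` for text that contains no date in the one supported format.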

  7. Information Extraction (IE) • Tried several systems for IE • GATE • OpenCalais • CRF++

  8. Comparison of IE tools • OpenCalais: • Web service. Easy to use. • Not extensible. No machine learning process. • Has usage quotas. • GATE: • ANNIE (A Nearly-New Information Extraction system): • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE • JAPE: GATE’s rule engine. • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Behavior is almost deterministic. • High precision for defined patterns; low recall when sentences do not match any defined pattern.
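
The precision/recall trade-off of hand-written rules can be illustrated with a small analogue. This is not JAPE: JAPE rules match over GATE annotations, whereas the sketch below is a plain Python regular expression over raw text. The rule, the company names, and the approximation of a company as a run of capitalized words are all hypothetical.

```python
import re

# A company is approximated as a run of capitalized words; a real pipeline
# would rely on gazetteer or named-entity annotations instead.
COMPANY = r"[A-Z][A-Za-z&]*(?:\s[A-Z][A-Za-z&]*)*"

# Hypothetical rule: "<Company> acquired|acquires|bought|buys <Company>".
ACQUISITION_RULE = re.compile(
    rf"(?P<buyer>{COMPANY})\s(?:acquired|acquires|bought|buys)\s(?P<target>{COMPANY})"
)


def extract_acquisition(sentence):
    """Return (buyer, target) if the sentence matches the rule, else None."""
    m = ACQUISITION_RULE.search(sentence)
    return (m.group("buyer"), m.group("target")) if m else None


# Matches the defined pattern: the result is precise.
hit = extract_acquisition("Acme Corp acquired Widget Works last week.")

# Same event in the passive voice: no rule covers it, so it is missed.
miss = extract_acquisition("Widget Works was acquired by Acme Corp.")
```

Here `hit` is `("Acme Corp", "Widget Works")` and `miss` is `None`. Covering the passive form requires writing another rule by hand, which is where the low recall comes from.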

  9. Comparison of IE tools (cont.) • CRF++ • Needs tools to preprocess content: • HTML to text • POS tags/NE (Stanford NLP library) • Extract other features when necessary • Convert files to the train/test format CRF++ requires • Template file defines the dependencies between features and labels. • Needs a large training set. • Labeling the training set is laborious. • Fairly good precision/recall. “Intelligence” may emerge.
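
The conversion step above can be sketched as follows. CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences; its template file uses `%x[row,col]` macros, with `U` lines for unigram features and `B` for a bigram over adjacent labels. The helper name `to_crfpp`, the choice of columns (token, POS tag, label), the BIO-style labels, and the particular template lines are assumptions, not the project's actual feature set.

```python
def to_crfpp(sentences):
    """Serialize sentences into CRF++ training format.

    Each sentence is a list of (token, pos_tag, label) tuples. Output has one
    token per line, tab-separated columns, and a blank line after each sentence.
    """
    lines = []
    for sentence in sentences:
        for token, pos, label in sentence:
            lines.append(f"{token}\t{pos}\t{label}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"


# Illustrative template: row is relative to the current token, col picks the
# input column (0 = token, 1 = POS tag).
TEMPLATE = """\
# Unigram
U00:%x[0,0]
U01:%x[-1,0]
U02:%x[1,0]
U03:%x[0,1]

# Bigram
B
"""

# Hypothetical labeled sentence.
sample = [[("Acme", "NNP", "B-ORG"), ("shares", "NNS", "O"), ("fell", "VBD", "O")]]
train_text = to_crfpp(sample)
```

Writing `train_text` and `TEMPLATE` to files gives the two inputs that CRF++ training expects; the labor-intensive part noted on the slide is producing the labels, not the serialization.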

  10. Web Service • Technologies used: • YUI Toolkit • PHP • Apache • CSS • JavaScript • Layout description

  11. Lessons and Thoughts • A realistic goal is critical. • The right tools are important. • Communication is key. • Future improvements • Controlled crawling • Improve feature extraction quality (POS tagger, NE, etc.) • Develop a model to predict volatility

  12. Q&A Thanks!
