Over9K: A System for Predicting Stock Volatility Through News Analysis

Over9K Alex Meng Chunshi Jin Elliott Conant Jonathan Fung

Agenda • What is Over9K? • Architecture • Crawler • Postprocessor • Extractor • Web Service • Summary

What is Over9K about? • Original goal: a system to predict a stock’s future volatility based on news and information gathered from the Internet. • Current goal: a system that crawls different news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.

Architecture

Crawler • Web crawler: Nutch • Domains we crawl: • www.cnbc.com • www.reuters.com • www.marketwatch.com • … (6 total) • Nutch’s Successes • Nutch’s Failures

Postprocessor • Components: • NBClassifier • Classifies articles using Naive Bayes • DateParser • Parses dates using regular expressions • PageGetter • Retrieves training data from RSS feeds
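The slides do not show how NBClassifier or DateParser were implemented. As a minimal sketch of the two techniques they name, the Python below trains a multinomial Naive Bayes classifier with add-one smoothing and parses dates with regular expressions. The class and function names, the "finance"/"sports" labels, and the two date formats are assumptions chosen for illustration, not the project's actual code.

```python
import math
import re
from collections import Counter, defaultdict
from datetime import date


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocab = set()                       # all words seen in training

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        """Return the label with the highest log posterior, or None if untrained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            for word in words:
                # Smoothed log likelihood of each word given the label.
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Assumed formats; a real news crawl would need more patterns than these two.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_DATE = re.compile(
    r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.? (\d{1,2}),? (\d{4})\b",
    re.IGNORECASE)


def parse_date(text):
    """Return the first recognizable date in the text, or None."""
    match = ISO_DATE.search(text)
    if match:
        try:
            return date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
        except ValueError:
            pass  # e.g. month 13; fall through to the next pattern
    match = LONG_DATE.search(text)
    if match:
        try:
            return date(int(match.group(3)),
                        MONTHS[match.group(1).lower()],
                        int(match.group(2)))
        except ValueError:
            pass
    return None
```

Training on a handful of labeled headlines and then calling `classify` on a new headline returns the more probable label; `parse_date` returns a `datetime.date` or `None` when no pattern matches.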

IE (Information Extraction) • Tried several systems for IE • Gate • OpenCalais • CRF++

Comparison of IE tools • OpenCalais: • Web service. Easy to use. • Not extensible. No machine learning process. • Has usage quotas • Gate: • ANNIE (A Nearly-New IE system): • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE • JAPE: Gate’s rule engine. • Extensible with JAPE. Easy to use because of its regex-like syntax. Behavior is almost deterministic. • High precision for defined patterns, low recall for sentences that match no defined pattern.
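The precision/recall trade-off of rule-based extraction can be illustrated with a small sketch. This is plain Python `re`, not JAPE, and the acquisition pattern, the `COMPANY` heuristic, and the function name are hypothetical; it only shows why a hand-written pattern is exact on the phrasings it anticipates and silent on everything else.

```python
import re

# Crude stand-in for a company name: one or more capitalized words.
COMPANY = r"[A-Z][A-Za-z&]*(?: [A-Z][A-Za-z&]*)*"

# One hand-written rule: "<buyer> acquires|acquired|will acquire|to acquire <target>".
ACQUISITION = re.compile(
    rf"(?P<buyer>{COMPANY}) "
    rf"(?:acquires|acquired|will acquire|to acquire) "
    rf"(?P<target>{COMPANY})"
)


def extract_acquisitions(text):
    """Return (buyer, target) pairs for every sentence the rule matches."""
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUISITION.finditer(text)]
```

A sentence such as "Oracle acquired Sun Microsystems in 2010." yields one pair, while the passive phrasing "Sun Microsystems was bought by Oracle." yields nothing until someone writes another rule for it.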

Comparison of IE tools (cont.) • CRF++ • Needs tools to preprocess content: • HTML to text • POS tag/NE (Stanford NLP library) • Extract other features when necessary • Convert the file to the train/test format CRF++ requires • A template file defines the dependencies between features and labels. • Needs a large training set. • Labeling the training set is laborious. • Fairly good precision/recall. “Intelligence” may emerge.
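The conversion step above can be sketched as follows. CRF++ reads one token per line with whitespace-separated feature columns, the label in the last column, and a blank line between sentences. The function name, the example POS tags, and the BIO-style labels are assumptions for illustration; the slides do not say which features or label set the project used.

```python
def to_crfpp(sentences):
    """Serialize sentences of (token, pos_tag, label) triples as CRF++ rows.

    Each token becomes one tab-separated line; a blank line ends each sentence.
    """
    lines = []
    for sentence in sentences:
        for token, pos_tag, label in sentence:
            lines.append("\t".join((token, pos_tag, label)))
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"


# A small example template in CRF++ syntax: %x[row,col] refers to the value
# in column `col` of the token `row` positions away from the current one.
TEMPLATE = "\n".join([
    "U00:%x[-1,0]",  # previous token
    "U01:%x[0,0]",   # current token
    "U02:%x[1,0]",   # next token
    "U03:%x[0,1]",   # current POS tag
    "B",             # bigram of the previous and current output labels
]) + "\n"
```

Writing the result of `to_crfpp` to a training file and `TEMPLATE` to a template file gives the inputs for `crf_learn template train.data model`; `crf_test -m model test.data` then labels a test file in the same column format.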

Web Service • Technologies used: • YUI Toolkit • PHP • Apache • CSS • JavaScript • Layout description

Lessons and Thoughts • A realistic goal is critical. • The right tools are important. • Communication is key. • Future improvements • Controlled crawling • Improve feature extraction quality: POS tagger, NE, etc. • Develop a model to predict volatility

Q&A Thanks!
