Over9K: Predicting Stock Volatility through News and Events

Over9K • Alex Meng • Chunshi Jin • Elliott Conant • Jonathan Fung

Agenda • What is Over9K about • Architecture • Crawler • IE/Classifier • Web Interface • Summary

What is Over9K about? • Original goal: a system to predict a stock's future volatility from news and other information gathered from the Internet. • Overly ambitious, out of ignorance. • What we have done: extract information and events that may affect stock volatility. Users can search and browse them.

Events to Extract • Reorganization • Bankruptcy • Product release • Earnings report
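One way to represent the four event types in code is sketched below. This is only an illustration: the class names `EventType` and `ExtractedEvent`, and the fields `company`, `source_url`, and `snippet`, are assumptions and do not come from the Over9K system itself.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    """The four event categories listed on the slide."""
    REORGANIZATION = "reorganization"
    BANKRUPTCY = "bankruptcy"
    PRODUCT_RELEASE = "product_release"
    EARNINGS_REPORT = "earnings_report"


@dataclass(frozen=True)
class ExtractedEvent:
    """A single extracted event (all field names are hypothetical)."""
    event_type: EventType
    company: str      # company the event refers to
    source_url: str   # page the event was extracted from
    snippet: str      # sentence that triggered the extraction


# Example record with made-up values.
event = ExtractedEvent(
    event_type=EventType.BANKRUPTCY,
    company="Example Corp",
    source_url="http://example.com/news/1",
    snippet="Example Corp filed for bankruptcy protection.",
)
```

A record of this shape could be stored one row per event, which would make the search and browse features straightforward to build on top.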

Architecture (diagram components) • Web Interface • MySQL • IE/Classifier • Internet • Crawler

Crawler • Based on Nutch • Crawled web sites: • …

IE/Classifier • Tried several systems for IE • GATE • OpenCalais • CRF++ • Classifier • MALLET

Comparison of IE tools • OpenCalais: • Web service. Easy to use. No machine learning step required from the user. • Not extensible. • Fairly good precision/recall. • GATE: • ANNIE (A Nearly-New Information Extraction system): • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE • JAPE: GATE's rule engine. • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Deterministic behavior. • High precision/recall for defined patterns, low for undefined patterns.
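The trade-off of rule-based extraction can be illustrated with a small sketch. This is not JAPE and not the rules used in Over9K: it uses Python's `re` module as a stand-in, and the two patterns and the function name `match_events` are hypothetical.

```python
import re

# Hypothetical hand-written rules, one regex per event type.
RULES = {
    "bankruptcy": re.compile(
        r"\bfil(?:es|ed|ing) for (?:chapter \d+ )?bankruptcy\b", re.I
    ),
    "product_release": re.compile(
        r"\b(?:launch(?:es|ed)?|releas(?:es|ed)|unveil(?:s|ed)?)\b", re.I
    ),
}


def match_events(sentence):
    """Return the sorted names of all rules whose pattern occurs in the sentence."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(sentence))


# A phrasing covered by a rule is matched deterministically.
print(match_events("Acme filed for Chapter 11 bankruptcy on Monday."))

# A phrasing no rule anticipates is silently missed.
print(match_events("Acme went bust last week."))
```

The second call returns an empty list even though the sentence describes a bankruptcy, which is the "low for undefined patterns" behavior noted above.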

Comparison of IE tools (cont.) • CRF++ • Needs tools to preprocess content: • HTML to text • POS tags/NE (Stanford NLP library) • Extract other features when necessary • Convert files to the train/test format required by CRF++ • A template file defines the dependencies between features and labels. • Labeling the training set is laborious. • Fairly good precision/recall. "Intelligence" may emerge. • Needs a large training set.
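The conversion step can be sketched as follows. CRF++ expects one token per line with whitespace-separated columns, the label in the last column, and an empty line between sentences. The helper name `to_crfpp`, the sample tokens, and the `B-ORG`/`B-EVENT` labels are assumptions made for illustration, not the project's actual data or label set.

```python
def to_crfpp(sentences):
    """Serialize tagged sentences into CRF++ training format.

    `sentences` is a list of sentences; each sentence is a list of
    (token, pos_tag, label) string tuples. Every sentence is followed
    by an empty line, as CRF++ requires.
    """
    return "".join(
        "\n".join("\t".join(row) for row in sentence) + "\n\n"
        for sentence in sentences
    )


# A template in CRF++ syntax: %x[row,col] selects the value `row` tokens away
# from the current token, in column `col` (0 = token, 1 = POS tag here).
TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram
B
"""

# Hypothetical labeled sentence.
sample = [[("Acme", "NNP", "B-ORG"), ("files", "VBZ", "B-EVENT"), ("today", "NN", "O")]]
print(to_crfpp(sample), end="")
```

With the two outputs saved to files, training is run with `crf_learn template_file train_file model_file`. The file names are placeholders.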

Web Interface

Lessons and Thoughts • A realistic goal is critical. • The right tools are important. • Future improvements • Controlled crawling • Improve feature extraction quality: POS tagger, NE, etc.

Q&A Thanks!
