Over9K is an advanced web service designed to predict future stock volatility by analyzing news and information sourced from major online platforms. Initially aimed at forecasting market behaviors, the current focus involves crawling various news sites, identifying affected companies, and extracting relevant events into a structured database. Key components include a web crawler (Nutch), a postprocessor with Naive-Bayes classification, and multiple tools for information extraction. Future enhancements aim to improve feature extraction and develop predictive models.
Over9K Alex Meng Chunshi Jin Elliott Conant Jonathan Fung
Agenda • What is Over9K? • Architecture • Crawler • Postprocessor • Extractor • Web Service • Summary
What is Over9K about? • Original goal: a system that predicts a stock’s future volatility based on news and information gathered from the Internet. • Current goal: a system that crawls news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.
Crawler • Web crawler: Nutch • Domains we crawl: • www.cnbc.com • www.reuters.com • www.marketwatch.com • … (6 total) • Nutch’s Successes • Nutch’s Failures
Postprocessor • Components: • NBClassifier • Classifies articles using Naive Bayes • DateParser • Parses dates using regular expressions • PageGetter • Retrieves training data from RSS feeds
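To make the NBClassifier idea concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the project's actual code: the class name `NaiveBayesSketch`, the whitespace tokenizer, and the "finance"/"sports" labels and training sentences are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Hypothetical multinomial Naive Bayes classifier (illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data; a real system would train on labeled articles.
train_docs = [
    "shares fall after earnings miss",
    "company announces merger deal",
    "team wins championship game",
    "player scores in final match",
]
train_labels = ["finance", "finance", "sports", "sports"]

clf = NaiveBayesSketch().fit(train_docs, train_labels)
print(clf.predict("merger lifts shares"))
```

In a pipeline like the one described, the training documents would presumably come from the RSS feeds that PageGetter retrieves, but how the labels were assigned is not stated in the slides.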
Information Extraction (IE) • Tried several systems for IE • GATE • OpenCalais • CRF++
Comparison of IE tools • OpenCalais: • Web service. Easy to use. • Not extensible. No machine learning process. • Has usage quotas • GATE: • ANNIE (A Nearly-New Information Extraction system): • Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE • JAPE: GATE’s rule engine. • Extensible with JAPE. Easy to use thanks to its regex-like syntax. Behavior is almost deterministic. • High precision for defined patterns, low recall on sentences that match no defined pattern.
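The precision/recall trade-off of rule-based extraction can be illustrated with a rough analogy in plain Python regular expressions. This is not JAPE syntax and not the project's rules; the pattern, the function name `extract_acquisitions`, and the company names are hypothetical.

```python
import re

# A capitalized phrase such as "Acme Corp" (one or more capitalized words).
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

# One hand-written rule: "<Buyer> acquires|acquired|to acquire|will acquire <Target>".
ACQUISITION = re.compile(
    rf"(?P<buyer>{NAME}) (?:acquires|acquired|to acquire|will acquire) (?P<target>{NAME})"
)


def extract_acquisitions(text):
    """Return (buyer, target) pairs for sentences that match the rule."""
    return [(m.group("buyer"), m.group("target")) for m in ACQUISITION.finditer(text)]


# Matches the defined pattern: the extracted event is precise.
print(extract_acquisitions("Acme Corp acquires Globex Industries for an undisclosed sum."))

# Same event, different wording: the rule finds nothing (low recall).
print(extract_acquisitions("Globex Industries was bought by Acme Corp."))
```

Every new phrasing needs its own rule, which is why coverage grows only as fast as patterns are written by hand.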
Comparison of IE tools (cont.) • CRF++ • Needs tools to preprocess content: • HTML to text • POS tagging/NE (Stanford NLP library) • Extract other features when necessary • Convert files to the train/test format required by CRF++ • A template file defines the dependencies between features and labels. • Needs a large training set. • Labeling the training set is laborious. • Fairly good precision/recall. “Intelligence” may emerge.
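The conversion step can be sketched as follows. CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The function name `to_crfpp`, the choice of a single POS feature column, and the BIO-style labels are assumptions for illustration, not the project's actual feature set.

```python
def to_crfpp(sentences):
    """Serialize tagged sentences into CRF++ training-file text.

    Each sentence is a list of (token, pos_tag, label) tuples.
    """
    lines = []
    for sentence in sentences:
        for token, pos_tag, label in sentence:
            # Columns: token, POS feature, answer label (always last).
            lines.append(f"{token}\t{pos_tag}\t{label}")
        lines.append("")  # blank line marks the end of a sentence
    return "\n".join(lines) + "\n"


# Hypothetical, hand-labeled example sentences.
example = [
    [("Acme", "NNP", "B-ORG"), ("rose", "VBD", "O")],
    [("Shares", "NNS", "O")],
]

crfpp_text = to_crfpp(example)
print(crfpp_text)
```

For a test file the same layout applies; every line must have the same number of columns, which is the main thing this kind of converter has to guarantee.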
Web Service • Technologies used: • YUI Toolkit • PHP • Apache • CSS • JavaScript • Layout description
Lessons and Thoughts • A realistic goal is critical. • The right tools are important. • Communication is key. • Future Improvements • Controlled crawling • Improve feature extraction quality (POS tagger, NE, etc.) • Develop a model to predict volatility
Q&A Thanks!