Innovations in Sequence Mining: Applications, Challenges, and Algorithm Development
This project explores the vast applications of sequence mining across various fields, such as bioinformatics (DNA and proteins), telecommunications (network alarms and packet data), and retail (customer behavior). It highlights the need for consolidated algorithms and software solutions in data mining, addressing challenges in sensor databases and multi-relational data mining. The goal is to innovate and implement effective algorithms that enhance data management and analysis while considering cross-disciplinary applications. This comprehensive study aims to deepen understanding and drive advancements in sequence mining techniques.
Mtech Projects 2002 Sunita Sarawagi
Sequence mining • Several real-life mining applications on sequence data • Classical applications • Speech, language, and handwriting are all complex sequences • Newer applications • Bio-informatics: DNA and proteins • Telecommunication: Network alarms, network packet data • Retail data mining: Customer behavior
Sequence mining: problems • Existing work scattered and application specific • Field in dire need of consolidated algorithms and software solutions • More technical details can be discussed after we finish this topic in class on March 3
Sensor databases and mining • Several distributed sensors that push data to centralized database servers • Example: Automatic Vehicle Location systems consisting of sensors at bus stops, with an entry added in the server each time a bus passes a stop. • Goal: Build a DBMS for managing this data and supporting queries like “when is the next bus to X going to arrive?”
Problems • Cross-disciplinary, covering several areas • A mining sub-problem: predicting arrival time based on • Previous arrival patterns of same bus • Traffic conditions derived from other buses with common routes • A database query problem: • Approximate search based on spoken queries
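The mining sub-problem above can be illustrated with a minimal sketch, not the project's actual method: it blends the historical mean travel time of a bus over a route segment with recent travel times reported by other buses on the same segment. The function name `predict_travel_minutes`, the `weight_recent` parameter, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean


def predict_travel_minutes(history, recent_other_buses, weight_recent=0.5):
    """Estimate travel time (minutes) over one route segment.

    history: past travel times of the same bus on this segment.
    recent_other_buses: recent travel times of other buses sharing the
        segment, used here as a crude proxy for current traffic.
    weight_recent: hypothetical blending weight between the two sources.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    historical = mean(history)
    if not recent_other_buses:
        # No traffic information available: fall back to the historical mean.
        return historical
    recent = mean(recent_other_buses)
    return (1 - weight_recent) * historical + weight_recent * recent


# Illustrative data only (not from the slides).
history = [10.0, 12.0, 11.0]
recent = [15.0, 17.0]
estimate = predict_travel_minutes(history, recent)
print(f"Estimated travel time: {estimate:.1f} minutes")
```

A real system would presumably learn the weighting from data and condition on time of day; the fixed weight here only shows how the two information sources named on the slide could be combined.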
Multi-relational data mining • Existing mining software assumes data in a single relation • Real-life data is spread over multiple relations • Existing tools rely on manual preprocessing before mining commences; this is time-consuming and inaccurate. • Design and implement mining algorithms for multi-relational data
Who should apply • Fascinated by the areas of data mining, databases, machine learning • Want to get a flavor of cutting-edge research • Enjoyed the courses • Have a knack for algorithm design and implementation • Are very software savvy • Want to stretch their learning/knowledge rather than slide through with an “easy” project.
Possible achievements • Understand one topic deeply, learn to innovate • Produce software that several people use • Write papers in really top-quality international conferences • Demo the software in leading international forums
Industries in the area • IBM IRL • Strand Genomics • GE Capital • TCS bio-informatics • PSPL • Startups like Vistaar • Outside India: several
Automatic segmentation of free text records, 2000 Batch • An HMM-based address segmenter • Software licensed by a Data Cleaning company • Paper in one of the two premium database conferences • ACM SIG on Management of Data (SIGMOD) 2001, Santa Barbara USA.
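To give a feel for what an HMM-based address segmenter does, here is a toy sketch of Viterbi decoding over address tokens. It is not the licensed software or the model from the SIGMOD paper: the three states, the token-to-symbol mapping, and every probability below are hand-picked assumptions, whereas a real segmenter would learn them from labeled addresses.

```python
import math

# Hypothetical states and hand-set probabilities, for illustration only.
STATES = ["HouseNo", "Street", "City"]
START = {"HouseNo": 0.8, "Street": 0.1, "City": 0.1}
TRANS = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.6, "City": 0.35},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.9},
}
EMIT = {
    "HouseNo": {"digits": 0.9, "road_word": 0.05, "other": 0.05},
    "Street": {"digits": 0.05, "road_word": 0.45, "other": 0.5},
    "City": {"digits": 0.05, "road_word": 0.05, "other": 0.9},
}


def symbol(token):
    """Map a raw token to one of three coarse observation symbols."""
    if token.isdigit():
        return "digits"
    if token.lower() in {"road", "street", "lane"}:
        return "road_word"
    return "other"


def viterbi(tokens):
    """Return the most probable state sequence for the tokens (log-space)."""
    obs = [symbol(t) for t in tokens]
    if not obs:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one dict of back-pointers per step after the first
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][o]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tokens = "12 MG Road Pune".split()
labels = viterbi(tokens)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Each token receives one label, and consecutive tokens with the same label form a segment such as the street name.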
ICUBE – Intelligent Rollups • MTP work integrated in ICube, demoed at SIGMOD 2000 held in Texas, USA • ICube software adopted by a startup • Paper at the other premium database conference, VLDB 2001 held in Rome, Italy.
Data deduplication using active learning • Software likely to be transferred to National Informatics Corporation, Pune • Practical application of an interesting idea from machine learning • Paper at KDD 2002 conference held in Canada • Demos at VLDB 2002 Hong Kong, ICDE 2003 Bangalore
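The core idea of active learning for deduplication can be sketched as follows: instead of labeling record pairs at random, ask a human to label the pair the current model is least sure about. This is a simplified illustration, not the method from the KDD 2002 paper. The Jaccard token similarity stands in for a learned classifier's confidence, and the function names, the 0.5 decision threshold, and the sample records are all assumptions.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def most_uncertain_pair(pairs, threshold=0.5):
    """Pick the candidate pair whose score lies closest to the threshold.

    Pairs far from the threshold are confidently duplicate or distinct;
    the pair nearest to it is the most informative one to show a labeler.
    """
    return min(pairs, key=lambda p: abs(jaccard(p[0], p[1]) - threshold))


# Made-up records for illustration.
pairs = [
    ("A Kumar Pune", "A Kumar Pune"),    # identical: clearly a duplicate
    ("A Kumar Pune", "A Kumar Mumbai"),  # partial overlap: ambiguous
    ("A Kumar Pune", "B Rao Delhi"),     # no overlap: clearly distinct
]
query = most_uncertain_pair(pairs)
print("Ask the user to label:", query)
```

In a full loop, the label obtained for the queried pair would be added to the training set, the model retrained, and the selection repeated.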