Innovations in Sequence Mining: Applications, Challenges, and Algorithm Development

M.Tech Projects 2002, Sunita Sarawagi

Sequence mining • Several real-life mining applications involve sequence data • Classical applications • Speech, language, and handwriting are all complex sequences • Newer applications • Bio-informatics: DNA and proteins • Telecommunication: Network alarms, network packet data • Retail data mining: Customer behavior

Sequence mining: problems • Existing work is scattered and application-specific • Field in dire need of consolidated algorithms and software solutions • More technical details can be discussed after we finish this topic in class on March 3

Sensor databases and mining • Several distributed sensors push data to centralized database servers • Example: Automatic Vehicle Location systems with sensors at bus stops that add an entry in the server each time a bus passes a stop • Goal: Build a DBMS for managing this data and supporting queries like “when is the next bus to X going to arrive?”

Problems • Cross-disciplinary, covering several areas • A mining sub-problem: predicting arrival time based on • Previous arrival patterns of the same bus • Traffic conditions derived from other buses with common routes • A database query problem: • Approximate search based on spoken queries
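A minimal sketch of the arrival-time sub-problem above, assuming a simple weighted blend of the two signals the slide names. The function name, the `weight_recent` parameter, and the sample numbers are hypothetical, chosen only for illustration; a real predictor would likely learn such weights from data.

```python
from statistics import mean


def predict_travel_minutes(history, recent_other_buses, weight_recent=0.5):
    """Blend a bus's own past travel times with recent times from other buses.

    history: past travel times (minutes) of the same bus over a route segment.
    recent_other_buses: travel times (minutes) just observed for other buses
        sharing that segment, used here as a stand-in for traffic conditions.
    weight_recent: hypothetical mixing weight between 0 and 1.
    """
    if not history:
        raise ValueError("need at least one historical observation")
    baseline = mean(history)
    if not recent_other_buses:
        # No traffic signal available: fall back to the bus's own pattern.
        return baseline
    traffic = mean(recent_other_buses)
    return (1 - weight_recent) * baseline + weight_recent * traffic


# Same bus usually takes about 11 minutes; other buses are currently slower.
estimate = predict_travel_minutes([10, 12, 11], [15, 17], weight_recent=0.5)
print(estimate)
```

The fallback to the historical mean keeps the estimate defined when no other bus has recently covered the segment.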

Multi-relational data mining • Existing mining software assumes data in a single relation • Real-life data is spread over multiple relations • Existing tools rely on manual preprocessing before mining can begin; this is time-consuming and inaccurate • Design and implement mining algorithms for multi-relational data

Who should apply • Fascinated by the areas of data mining, databases, machine learning • Want to get a flavor of cutting-edge research • Enjoyed the courses • Have a knack for algorithm design and implementation • Are very software savvy • Want to stretch your learning/knowledge rather than slide through with an “easy” project.

Possible achievements • Understand one topic deeply, learn to innovate • Produce software that several people use • Write papers in really top-quality international conferences • Demo the software in leading international forums

Industries in the area • IBM IRL • Strand Genomics • GE Capital • TCS bio-informatics • PSPL • Startups like Vistaar • Outside India: several

Sample outcomes from some previous MTPs

Automatic segmentation of free text records, 2000 Batch • An HMM-based address segmenter • Software licensed by a Data Cleaning company • Paper in one of the two premium database conferences • ACM SIG on Management of Data (SIGMOD) 2001, Santa Barbara USA.
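To give a flavor of how an HMM can segment an address, here is a toy sketch using Viterbi decoding. It is not the segmenter from the project above: the three states, the observation classes, the city dictionary, and every probability are hand-set assumptions for illustration, whereas a real system would learn them from labeled records.

```python
import math

STATES = ["house_no", "street", "city"]

# Hand-set toy probabilities (assumptions, not learned values).
START = {"house_no": 0.80, "street": 0.15, "city": 0.05}
TRANS = {
    "house_no": {"house_no": 0.10, "street": 0.80, "city": 0.10},
    "street":   {"house_no": 0.05, "street": 0.55, "city": 0.40},
    "city":     {"house_no": 0.05, "street": 0.05, "city": 0.90},
}
EMIT = {
    "house_no": {"digit": 0.90, "city_word": 0.02, "other": 0.08},
    "street":   {"digit": 0.05, "city_word": 0.10, "other": 0.85},
    "city":     {"digit": 0.05, "city_word": 0.80, "other": 0.15},
}
KNOWN_CITIES = {"mumbai", "pune", "delhi"}  # hypothetical dictionary


def observe(token):
    """Map a raw token to one of three coarse observation classes."""
    if token.isdigit():
        return "digit"
    if token.lower() in KNOWN_CITIES:
        return "city_word"
    return "other"


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log space)."""
    if not tokens:
        return []
    obs = [observe(t) for t in tokens]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one dict of backpointers per token after the first
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][o]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


labels = viterbi("12 MG Road Pune".split())
print(labels)
```

Consecutive tokens with the same label form one segment, so "MG Road" would come out as a single street field in this toy setup.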

ICUBE – Intelligent Rollups • MTP work integrated in ICube, demoed at SIGMOD 2000 held in Texas, USA • ICube software adopted by a startup • Paper at the other premium database conference, VLDB 2001 held in Rome, Italy.

Data deduplication using active learning • Software likely to be transferred to National Informatics Corporation, Pune • Practical application of an interesting idea from machine learning • Paper at KDD 2002 conference held in Canada • Demos at VLDB 2002 Hong Kong, ICDE 2003 Bangalore
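One common form of active learning is uncertainty sampling: ask a human to label the record pairs the current model is least sure about. The sketch below illustrates that idea only and is not the project's actual method. It assumes token-level Jaccard similarity as a stand-in for a classifier's duplicate probability, and the function names and sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def most_uncertain_pairs(records, k=1):
    """Pick the k record pairs whose score is closest to 0.5.

    The similarity score plays the role of a duplicate probability here
    (an assumption for illustration); pairs near 0.5 are the ones a human
    label would help most.
    """
    scored = []
    for a, b in combinations(records, 2):
        scored.append((abs(jaccard(a, b) - 0.5), a, b))
    scored.sort(key=lambda item: item[0])
    return [(a, b) for _, a, b in scored[:k]]


records = [
    "Anil Kumar 12 MG Road Pune",
    "A Kumar MG Road Pune",
    "Ravi Shah 4 Hill Road Mumbai",
]
to_label = most_uncertain_pairs(records, k=1)
print(to_label)
```

In a full loop, the labeled pair would be added to the training set, the model retrained, and the selection repeated until the labeling budget runs out.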
