Shui-Lung Chuang Oct 27, 2004

Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research)
KDD’04

Text Segmentation
• A (short) text string → N attributes
• Conventional approaches
  • Rule-based: a human creates the rules
  • Supervised model-based: a human labels the data
Example: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD" → [ Authors, Title, Conference, Year ], with Year = Null

The Approach
• Utilize the existing (large, clean) reference data
• E.g., DBLP → Papers, US Addresses, …
ARM: Attribute Recognition Model (gives the prob. that a sub-string s is generated)

Segmentation Model
Input: "Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD"
To find: sub-strings s1 s2 s3 s4, each scored by an ARM (figure labels: ARM2, ARM3, ARM3, ARM1)
s: a sub-string; ARM(s): prob. that s is generated
ARM: Attribute Recognition Model
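To make the search on this slide concrete, here is a minimal sketch, assuming a fixed attribute order and one scoring function per attribute. The `lexicon_arm` scorers are hypothetical toy stand-ins for trained ARMs, and the dynamic program is just one straightforward way to maximize the product of per-attribute scores; it is not claimed to be the authors' procedure.

```python
from math import inf, log

def lexicon_arm(vocab, hit=0.9, miss=0.01):
    """Toy ARM (hypothetical): known tokens score `hit`, anything else `miss`."""
    def score(sub):
        if not sub:              # empty sub-string = missing attribute value
            return miss
        p = 1.0
        for tok in sub:
            p *= hit if tok in vocab else miss
        return p
    return score

def segment(tokens, arms):
    """Split tokens into len(arms) consecutive sub-strings (possibly empty),
    maximizing the sum of log ARM scores. arms: list of (name, score_fn)."""
    n, k = len(tokens), len(arms)
    # best[a][i]: best log-score when the first a attributes cover tokens[:i]
    best = [[-inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for a in range(1, k + 1):
        score = arms[a - 1][1]
        for i in range(n + 1):
            for j in range(i + 1):
                if best[a - 1][j] == -inf:
                    continue
                cand = best[a - 1][j] + log(score(tokens[j:i]))
                if cand > best[a][i]:
                    best[a][i], back[a][i] = cand, j
    # Walk the back-pointers from the end to recover each sub-string.
    result, i = {}, n
    for a in range(k, 0, -1):
        j = back[a][i]
        result[arms[a - 1][0]] = tokens[j:i]
        i = j
    return result

TOKENS = "walmart 20205 s. randall ave madison 53715 wi.".split()
ARMS = [
    ("name", lexicon_arm({"walmart"})),
    ("street", lexicon_arm({"20205", "s.", "randall", "ave"})),
    ("city", lexicon_arm({"madison"})),
    ("zip", lexicon_arm({"53715"})),
    ("state", lexicon_arm({"wi."})),
]
print(segment(TOKENS, ARMS))
```

Working in log space avoids underflow when many small probabilities are multiplied; the cubic loop is acceptable here because input strings are short.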

Challenges
• Robust to input errors
  • The ref. data may be clean, but the input may contain various errors: missing values, spelling errors, extraneous or unknown tokens, etc.
  → Engineer features; adjust model topology
• Adaptive to varied attribute orders
  • Reference data don’t contain info about the attribute order in the input
  → Determine attribute order from early input strings
• Efficient in training
  • Reference data is large
  → Fix model topology; don’t use advanced learning (e.g., EM)

Feature Hierarchy
High-level features considered: token classes (words, numbers, mixed, delimiters) + token length
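A small sketch of how a token could be generalized along such a hierarchy, from the token itself up to its class with unbounded length. The class patterns, the `max_len` cutoff, and the function names are assumptions for illustration; the slides only name the token classes and the length feature.

```python
import re

def token_class(token):
    """Map a token to one of the four classes (patterns are assumed)."""
    t = token.lower()
    if re.fullmatch(r"[a-z]+", t):
        return "[a-z]"        # words
    if re.fullmatch(r"[0-9]+", t):
        return "[0-9]"        # numbers
    if re.fullmatch(r"[a-z0-9]+", t):
        return "[a-z0-9]"     # mixed
    return "delim"            # everything else is lumped with delimiters

def generalizations(token, max_len=5):
    """Most specific to most general: the token, then class + length bounds."""
    cls = token_class(token)
    chain = [token]
    for n in range(len(token), max_len + 1):
        chain.append(f"{cls}{{1,{n}}}")
    chain.append(f"{cls}{{1,-}}")   # '-' = no upper bound on length
    return chain

print(generalizations("57th"))
```

A token unseen in the reference table can then be matched at the most specific level of the chain that the model does know.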

Attribute Recognition Model
Example attribute values: 57th n sixth st; 1010 s fifth st; 201 n goodwin ave

Model Training
Training values: 57th n sixth st; 1010 s fifth st; 201 n goodwin ave
Transition:
  B → { M, T, END }
  M → { M, T, END }
  T → { T, END }
Emission: p(x|e) = 1 if x = e, else 0
Feature hierarchy (general to specific): Mixed, [a-z0-9]{1,-}, …, [a-z0-9]{1,5}, [a-z0-9]{1,4}, 57th
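The counting behind this slide can be sketched as follows. The labeling rule (first token is B, last token is T, anything in between is M) is an assumption made for illustration; the slide gives only the allowed transitions and the 0/1 emission rule.

```python
from collections import Counter, defaultdict

def bmt_states(tokens):
    """Assumed labeling: first token -> B, last -> T, the rest -> M."""
    if len(tokens) == 1:
        return ["B"]
    return ["B"] + ["M"] * (len(tokens) - 2) + ["T"]

def train_transitions(values):
    """Estimate transition probabilities by counting over reference values."""
    counts = defaultdict(Counter)
    for value in values:
        tokens = value.split()
        if not tokens:            # skip empty attribute values
            continue
        states = bmt_states(tokens) + ["END"]
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
        for src, dsts in counts.items()
    }

def emission(x, e):
    """p(x|e) from the slide: 1 if the observed token equals e, else 0."""
    return 1.0 if x == e else 0.0

STREETS = ["57th n sixth st", "1010 s fifth st", "201 n goodwin ave"]
print(train_transitions(STREETS))
```

A single pass of counting is all the training needed here, which fits the slide's point about avoiding iterative learning such as EM on a large reference table.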

Sequential Specificity Relaxation
• Token insertion, e.g., 57th 57th n sixth st
• Token deletion, e.g., n sixth
• Missing attribute value, e.g., <null>

Determining Attribute Value Order
• Attribute order is usually preserved within the same batch of input strings

Determining Attribute Value Order
s = walmart 20205 s. randall ave madison 53715 wi.
pos: 1 2 3 4 5 6 7 8
v(s,Ai), city attr.: [ 0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07 ]
v(s,Ai), street attr.: [ 0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1 ]
From the v(s,Ai): (partial order), then (total order)
Search all permutations for the best total order
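One possible way to turn such position vectors into a total order is sketched below. Both the pairwise `precedes` score (the chance that a position drawn from one normalized vector falls before a position drawn from the other) and the product-of-pairs objective are assumptions chosen for illustration; the slide's own formulas for the partial and total order did not survive in the text. Only the brute-force search over permutations comes from the slide.

```python
from itertools import permutations

def precedes(v_a, v_b):
    """Assumed pairwise score: P(position from v_a < position from v_b),
    treating each vector as an unnormalized distribution over positions."""
    total, mass_a_before = 0.0, 0.0
    for p_a, p_b in zip(v_a, v_b):
        total += mass_a_before * p_b   # mass of a strictly before this position
        mass_a_before += p_a
    return total / (sum(v_a) * sum(v_b))

def best_total_order(vectors):
    """Score every permutation by the product of its pairwise scores."""
    def score(order):
        s = 1.0
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                s *= precedes(vectors[a], vectors[b])
        return s
    return max(permutations(vectors), key=score)

# Vectors copied from the slide for s = "walmart 20205 s. randall ave madison 53715 wi."
V = {
    "city":   [0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07],
    "street": [0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1],
}
print(best_total_order(V))
```

With only a handful of attributes the factorial search stays cheap. Over a batch of strings, the per-string vectors could be averaged before scoring, since the order is assumed stable within a batch.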

Experiment Data
• Reference relations
  • Addresses: 1,000,000 tuples
    • Schema: [ Name, Number1, Number2, Address, City, State, Zip ]
  • Media: 280,000 music tracks
    • Schema: [ ArtistName, AlbumName, TrackName ]
  • Bibliography: 100,000 records from DBLP
    • Schema: [ Title, Author, Journal, Volume, Month, Year ]
• Test datasets: naturally concatenated test sets
  • Addresses: from RISE repository
  • Media: from Microsoft
  • Papers: 100 most cited papers from Citeseer

Experiment Data (cont.)
• Test datasets: controlled test data sets
  • Randomly chosen order
  • Error injection
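A hedged sketch of how such a controlled test set could be generated: pick one random attribute order per batch, concatenate each record in that order, and inject token-level errors. The error types mirror those listed under Challenges (missing tokens, spelling errors, extraneous tokens); the rates, the `junk` tokens, and the sample record are hypothetical, and the slides do not describe the actual procedure.

```python
import random

def inject_errors(tokens, rng, p_delete=0.1, p_typo=0.1, p_insert=0.1,
                  junk=("n/a", "xxx")):
    """Randomly delete tokens, swap adjacent characters, and add junk tokens."""
    out = []
    for tok in tokens:
        if rng.random() < p_delete:
            continue                                  # token deletion
        if len(tok) > 1 and rng.random() < p_typo:
            i = rng.randrange(len(tok) - 1)           # spelling error: swap two chars
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
        if rng.random() < p_insert:
            out.append(rng.choice(junk))              # extraneous token
    return out

def make_batch(records, rng, **error_rates):
    """Concatenate every record in one randomly chosen attribute order."""
    order = list(records[0])
    rng.shuffle(order)
    strings = []
    for rec in records:
        tokens = [t for attr in order for t in rec[attr].split()]
        strings.append(" ".join(inject_errors(tokens, rng, **error_rates)))
    return order, strings

# Illustrative record; attribute names follow the Addresses schema.
RECORDS = [{"Name": "walmart", "Address": "20205 s. randall ave",
            "City": "madison", "State": "wi.", "Zip": "53715"}]
order, strings = make_batch(RECORDS, random.Random(0))
print(order, strings)
```

Keeping the true `order` alongside the noisy strings gives the ground truth needed to score both the segmentation and the inferred attribute order.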

Experiment Results

Experiment Results • 1-Pos vs BMT vs BMT-robust

Comments
• The idea of using reference tables is good
• The approach is well engineered to deal with issues of robustness and efficiency
• The experiments are thorough
• The approach is still somewhat ad hoc, and every component seems replaceable
