
Shui-Lung Chuang Oct 27, 2004

Mining Reference Tables for Automatic Text Segmentation. E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research). KDD’04. Shui-Lung Chuang, Oct 27, 2004.


Presentation Transcript


  1. Mining Reference Tables for Automatic Text Segmentation. E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research). KDD’04. Shui-Lung Chuang, Oct 27, 2004

  2. Text Segmentation
     • A (short) text string
     • N attributes
     • Conventional approaches
       • Rule-based: a human creates rules
       • Supervised model-based: a human labels data
     • Example input: Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD
     • Target attributes: [ Authors, Title, Conference, Year ], where a missing value is Null

  3. The Approach
     • Utilize the existing (large, clean) reference data
     • E.g., DBLP → Papers, US Addresses, …
     • [Figure: sub-strings scored by ARM2, ARM3, ARM3, ARM1. s: a sub-string; ARM (Attribute Recognition Model): prob. that s is generated]

  4. Segmentation Model
     • Input: Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD
     • To find: sub-strings s1 s2 s3 s4, each scored by an ARM (ARM2, ARM3, ARM3, ARM1 in the figure)
     • s: a sub-string; ARM (Attribute Recognition Model): prob. that s is generated
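
The search on slide 4 can be illustrated with a brute-force sketch. It assumes the goal is the segmentation whose ARM probabilities have the largest product, that each attribute is used at most once, and that an attribute may receive an empty sub-string. The slide does not spell out these details, and the three `toy_*` scoring functions are hypothetical stand-ins for trained ARMs.

```python
from itertools import combinations_with_replacement, permutations


def best_segmentation(tokens, arms):
    """Exhaustively try every attribute order and every set of cut points.

    arms maps an attribute name to a function(list_of_tokens) -> probability.
    Returns (score, [(attribute, sub_string), ...]) with the largest product.
    """
    names = list(arms)
    n, k = len(tokens), len(names)
    best_score, best_assignment = -1.0, None
    for order in permutations(names):
        # k - 1 non-decreasing cut points split the tokens into k pieces
        for cuts in combinations_with_replacement(range(n + 1), k - 1):
            bounds = (0, *cuts, n)
            score, assignment = 1.0, []
            for attr, lo, hi in zip(order, bounds, bounds[1:]):
                piece = tokens[lo:hi]
                score *= arms[attr](piece)
                assignment.append((attr, " ".join(piece)))
            if score > best_score:
                best_score, best_assignment = score, assignment
    return best_score, best_assignment


# Hypothetical toy ARMs (assumptions, not the paper's models).
def toy_title(piece):
    if not piece:
        return 0.05  # small probability for a missing value
    return 0.5 if all(t.isalpha() for t in piece) else 0.001


def toy_venue(piece):
    if not piece:
        return 0.05
    return 0.9 if all(t in {"kdd", "sigkdd"} for t in piece) else 0.001


def toy_year(piece):
    if not piece:
        return 0.05
    ok = len(piece) == 1 and piece[0].isdigit() and len(piece[0]) == 4
    return 0.9 if ok else 0.001


arms = {"title": toy_title, "venue": toy_venue, "year": toy_year}
tokens = "mining reference tables sigkdd 2004".split()
score, assignment = best_segmentation(tokens, arms)
print(score, assignment)
```

Exhaustive search is only workable for short strings and a handful of attributes; it is used here because it makes the objective easy to read, not because the paper searches this way.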

  5. Challenges
     • Robust to input error
       • The ref. data may be clean, but input may contain various errors: missing values, spelling errors, extraneous or unknown tokens, etc.
       • Response: engineer features; adjust model topology
     • Adaptive to varied attribute orders
       • Reference data don’t contain info on the attribute order in the input
       • Response: determine attribute order from early input strings
     • Efficient in training
       • Reference data is large
       • Response: fix model topology; don’t use advanced learning (e.g., EM)

  6. Feature Hierarchy High-level features considered: Token classes (words, numbers, mixed, delimiters) + Token length
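
A minimal sketch of the high-level features named on slide 6: it maps a token to one of the four token classes plus its length. The regular expressions and the lowercasing step are assumptions made for illustration; the slide only names the classes.

```python
import re


def token_features(token):
    """Return (token_class, length) for one token.

    Classes follow the slide: word, number, mixed, delimiter.
    Anything that fits none of them is labeled "other" (an assumption).
    """
    t = token.lower()
    if re.fullmatch(r"[a-z]+", t):
        cls = "word"
    elif re.fullmatch(r"[0-9]+", t):
        cls = "number"
    elif re.fullmatch(r"[a-z0-9]+", t):
        cls = "mixed"
    elif re.fullmatch(r"[^a-z0-9\s]+", t):
        cls = "delimiter"
    else:
        cls = "other"
    return cls, len(t)


for tok in ["57th", "sixth", "1010", ","]:
    print(tok, token_features(tok))
```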

  7. Attribute Recognition Model
     • Example attribute values: 57th n sixth st; 1010 s fifth st; 201 n goodwin ave

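  8. Model Training
     • Example attribute values: 57th n sixth st; 1010 s fifth st; 201 n goodwin ave
     • Transition: B → { M, T, END }; M → { M, T, END }; T → { T, END }
     • Emission: p(x|e) = (x = e) ? 1 : 0
     • Feature hierarchy for the token 57th, from general to specific: Mixed; [a-z0-9]{1,-}; …; [a-z0-9]{1,5}; [a-z0-9]{1,4}; 57th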
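
The pieces listed on slide 8 can be written down directly. The sketch below encodes the transition sets and the 0/1 emission exactly as given on the slide, and adds a hypothetical helper, `generalizations`, that lists the feature-hierarchy entries for a mixed token up to an assumed length cap `max_len`. Reading B, M, T as beginning, middle, and trailing states is an assumption, as is the requirement that a path starts in B.

```python
# Allowed transitions, copied from the slide.
TRANSITIONS = {
    "B": {"M", "T", "END"},
    "M": {"M", "T", "END"},
    "T": {"T", "END"},
}


def emission(x, e):
    """p(x|e) = 1 if the observed feature x equals the state's element e, else 0."""
    return 1.0 if x == e else 0.0


def is_valid_path(states):
    """Check that a state sequence starts in B and follows TRANSITIONS to END."""
    if not states or states[0] != "B":
        return False
    path = list(states) + ["END"]
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


def generalizations(token, max_len=5):
    """List a mixed token's hierarchy entries, most specific first.

    max_len is an assumed cap on the bounded-length patterns.
    """
    bounded = [f"[a-z0-9]{{1,{n}}}" for n in range(len(token), max_len + 1)]
    return [token] + bounded + ["[a-z0-9]{1,-}"]


print(is_valid_path(["B", "M", "M", "T"]))
print(generalizations("57th"))
```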

  9. Sequential Specificity Relaxation
     • Token insertion, e.g., 57th 57th n sixth st
     • Token deletion, e.g., n sixth
     • Missing attribute value, e.g., <null>

  10. Determining Attribute Value Order • Attribute order is usually preserved in the same batch of input strings

  11. Determining Attribute Value Order
      • s = walmart 20205 s. randall ave madison 53715 wi. (token positions 1 to 8)
      • v(s,Ai) per position:
        • city attr.: [ 0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07 ]
        • street attr.: [ 0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1 ]
      • Steps: partial order, then total order
      • Search all permutations for the best total order
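
A sketch of the permutation search from slide 11, using the two vectors shown there. The slide does not give the scoring formula, so the pairwise precedence estimate below (compare the peak positions of two attributes across the batch, with a small smoothing constant `eps`) is an assumption chosen for readability, not the paper's estimator.

```python
from itertools import permutations


def peak(vector):
    """Index of the largest entry in a position vector v(s, Ai)."""
    return max(range(len(vector)), key=vector.__getitem__)


def precedence(batch, a, b, eps=0.01):
    """Smoothed fraction of strings in which attribute a peaks before b."""
    wins = sum(1 for v in batch if peak(v[a]) < peak(v[b]))
    return (wins + eps) / (len(batch) + 2 * eps)


def best_total_order(batch, attributes):
    """Try every permutation; score = product of pairwise precedences."""
    def score(order):
        s = 1.0
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                s *= precedence(batch, a, b)
        return s
    return max(permutations(attributes), key=score)


# One input string, with the vectors from the slide.
batch = [{
    "city":   [0.05, 0.01, 0.02, 0.1, 0.01, 0.8, 0.01, 0.07],
    "street": [0.1, 0.7, 0.8, 0.7, 0.9, 0.5, 0.4, 0.1],
}]
print(best_total_order(batch, ["city", "street"]))
```

The number of permutations grows factorially with the number of attributes, which stays manageable for schemas as small as those in the experiments.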

  12. Experiment Data
      • Reference relations
        • Addresses: 1,000,000 tuples
          • Schema: [ Name, Number1, Number2, Address, City, State, Zip ]
        • Media: 280,000 music tracks
          • Schema: [ ArtistName, AlbumName, TrackName ]
        • Bibliography: 100,000 records from DBLP
          • Schema: [ Title, Author, Journal, Volume, Month, Year ]
      • Test datasets: naturally concatenated test sets
        • Addresses: from RISE repository
        • Media: from Microsoft
        • Papers: 100 most cited papers from Citeseer

  13. Experiment Data (cont.)
      • Test datasets: controlled test data sets
        • Randomly chosen order
        • Error injection
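
One way a controlled test set of this kind could be generated is sketched below. The slide names only the two ingredients (random attribute order, error injection); the specific error types used here (token deletion and adjacent-character swaps), the probabilities, and the function names are all assumptions.

```python
import random


def concatenate_in_random_order(record, rng):
    """Join a record's attribute values in a randomly chosen attribute order."""
    attrs = list(record)
    rng.shuffle(attrs)
    text = " ".join(record[a] for a in attrs if record[a])
    return attrs, text


def inject_errors(tokens, rng, p_delete=0.1, p_typo=0.1):
    """Drop some tokens and swap two adjacent characters in some others."""
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_delete:
            continue  # simulate a missing token
        if r < p_delete + p_typo and len(tok) > 1:
            i = rng.randrange(len(tok) - 1)
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]  # spelling error
        out.append(tok)
    return out


rng = random.Random(0)  # fixed seed so the test set is reproducible
record = {"Address": "201 n goodwin ave", "City": "urbana", "State": "il"}
order, text = concatenate_in_random_order(record, rng)
noisy = inject_errors(text.split(), rng)
print(order, text, noisy)
```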

  14. Experiment Results

  15. Experiment Results • 1-Pos vs BMT vs BMT-robust

  16. Comments
      • The idea of using reference tables is good
      • The approach is well engineered to deal with issues of robustness and efficiency
      • The experiments are thorough
      • The approach is still somewhat ad hoc, and every component seems replaceable
