This presentation by Matthew Waymost explores a statistical model for domain-independent text segmentation developed by Masao Utiyama and Hitoshi Isahara. The algorithm finds the maximum-probability segmentation of a text without requiring any training data. It takes a probabilistic approach, applying Bayes' rule and Laplace estimation to estimate word probabilities. In evaluation, the method outperformed existing algorithms, making it effective for text summarization and versatile across domains.
A Statistical Model for Domain-Independent Text Segmentation Masao Utiyama and Hitoshi Isahara Presentation by Matthew Waymost
Introduction • The algorithm finds the maximum-probability segmentation using a statistical method. • No training required. • Domain-independent.
Other Methods • Lexical Cohesion • Statistical • Hidden Markov model (Yamron et al., 1998)
Statistical Model • Find the probability of a segmentation S given a text W. • Use Bayes' rule to find the maximum-probability segmentation.
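In symbols, with W the text and S a candidate segmentation, the maximum-probability segmentation follows from Bayes' rule:

```latex
\hat{S} = \operatorname*{argmax}_{S} \Pr(S \mid W)
        = \operatorname*{argmax}_{S} \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W)}
        = \operatorname*{argmax}_{S} \Pr(W \mid S)\,\Pr(S),
```

since Pr(W) is the same for every candidate segmentation and can be dropped from the maximization.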
Definition of Pr(W|S) • Assume statistical independence of topics and of words within the scope of a topic. • Assume different topics have different word distributions. • Pr(W|S) can then be broken down into a double product of probabilities over segments and the words within them. • Uses a Laplace estimator to estimate word probabilities.
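A minimal sketch of the per-segment term, assuming the usual Laplace (add-one) form in which a word with count f in a segment of n words gets probability (f + 1)/(n + k), where k is the vocabulary size of the whole text (the function name and example words are illustrative, not from the paper):

```python
from collections import Counter
from math import log

def segment_log_prob(segment_words, vocab_size):
    """Log-probability of a segment's words under a Laplace
    (add-one) estimator: each word w in a segment of n words
    gets probability (f(w) + 1) / (n + k), where f(w) is w's
    count in the segment and k is the vocabulary size."""
    counts = Counter(segment_words)
    n = len(segment_words)
    return sum(log((counts[w] + 1) / (n + vocab_size))
               for w in segment_words)

# Example: a 3-word segment from a 4-word vocabulary.
lp = segment_log_prob(["cat", "cat", "dog"], vocab_size=4)
```

Because every word's probability is strictly below 1, the log-probability is always negative; homogeneous segments (repeated words) score higher than mixed ones of the same length.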
Definition of Pr(S) • Varies depending on prior information. • In general, assume no prior information. • Prevents the algorithm from generating too many segments, counteracting Pr(W|S).
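One way to make this concrete (a sketch of the generic, no-prior-information choice, read as a description-length penalty of log n per segment): with n words and m segments,

```latex
\Pr(S) \approx n^{-m}, \qquad -\log \Pr(S) \approx m \log n,
```

so the cost grows with the segment count m. This balances Pr(W|S), which on its own favors many short, homogeneous segments.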
Algorithm • Convert the probability function into a cost function by taking the negative logarithm. • Given a text W, define gi to be the gap between words wi and wi+1. • Create a directed graph whose nodes are the gaps between words; each edge covers the segment lying between the two gaps it connects. • Compute all edge weights with the cost function and find the minimum-cost path from the first node to the last.
Algorithm • The calculated path represents the minimum-cost segmentation by correlating the edges to segments.
Algorithm – Features • The algorithm determines the number of segments itself, but the user can also fix it by specifying the number of edges in the shortest path. • The user can restrict where segmentation occurs by using only the subset of edges whose endpoints meet user-specified conditions. • The algorithm is insensitive to text length. • Well suited to summarization.
Algorithm – Evaluation • The algorithm was compared against C99 (Choi, 2000). • An artificial test corpus extracted from the Brown corpus was used. • A probabilistic error metric was used to evaluate performance. • The Utiyama-Isahara algorithm scored significantly better than the Choi algorithm at the 1% level.
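The probabilistic error metric referred to here is presumably Pk (Beeferman et al., 1999), the standard metric in this literature: it slides a window of width k over the text and counts how often the reference and the hypothesis disagree about whether the two window endpoints fall in the same segment. A minimal sketch (the function name and boundary encoding are my own):

```python
def p_k(ref_bounds, hyp_bounds, n, k):
    """Pk error: the fraction of position pairs k apart that
    the reference and hypothesis classify inconsistently
    (same segment vs. different segments). Boundaries are
    sets of gap indices: boundary b separates word b-1 from
    word b in a text of n words. Lower is better."""
    def seg_id(bounds, pos):
        # A word's segment index = boundaries at or before it.
        return sum(1 for b in bounds if b <= pos)

    errors = 0
    for i in range(n - k):
        same_ref = seg_id(ref_bounds, i) == seg_id(ref_bounds, i + k)
        same_hyp = seg_id(hyp_bounds, i) == seg_id(hyp_bounds, i + k)
        errors += same_ref != same_hyp
    return errors / (n - k)
```

A perfect hypothesis scores 0; missing a boundary entirely is penalized only for the window positions that straddle it, which is what makes the metric "probabilistic" rather than an exact-match count.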
Algorithm – Evaluation • Assessment of the algorithm on real texts is still needed. • Advantages over HMM approaches: • No training required (which implies domain independence). • Probabilistic information can be incorporated into the model. • Might be extendable to detect word descriptions in text.