Naive Algorithms for Key-phrase Extraction and Text Summarization from a Single Document Inspired by the Protein Biosynthesis Process
Daniel Gayo Avello (University of Oviedo)
What’s the problem?
• Reading documents is a time-consuming task…
• Many common documents (e.g., e-mail, newsgroup posts, web pages) lack an abstract or keywords…
• But they are “electronic”, so we can process them automatically in some way…
What’s the problem? (cont.)
• Many techniques exist for useful Natural Language Processing (NLP) tasks:
  • Language identification.
  • Document categorization and clustering.
  • Keyword extraction.
  • Text summarization.
• They differ widely:
  • With/without human supervision.
  • With/without training.
  • With/without complex linguistic data.
  • With/without document corpora.
Any suggestions?
• It would be great to use a single technique to carry out several of those tasks.
• Desirable goals:
  • Simple (only free text, no linguistic data).
  • Fully automatic (neither supervision nor ad hoc heuristics).
  • Scalable (from one web page to several web sites).
• Could it be a bio-inspired solution?
Our (bio-inspired) hypothesis
• Living beings are defined by their genome.
• Document from a corpus ≈ individual from a population.
• So…?
• Let’s imagine a “document genome”…
  • Similar documents (similar language/topic) ⇒ similar genomes.
  • More interesting: translation from “document genome” to “significance proteins” (i.e., keyphrases and summaries).
Our biological inspiration
• The protein biosynthesis process…
[Figure: the protein biosynthesis process. Labels: DNA; transcription (copied into a single-stranded mRNA molecule, AUGCCGGGUUACUAA); UAC; amino acids; initiation, elongation, termination; polypeptide chain; folding process (protein folded into a 3D structure).]
• Could we mimic this to distill keyphrases and summaries from a single document?
A “DNA” for Natural Language?
• n-grams (slices of n adjoining characters).
• Frequency is not the most relevant weight for each n-gram.
• There are different measures of the relation between the two elements of a bigram:
  • Mutual information.
  • Dice coefficient.
  • Loglike.
  • …
• They cannot be applied straightforwardly to n-grams…
• …but they can be generalized (Ferreira and Pereira, 1999).
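For reference, the two first bigram measures named above have standard definitions, written here with p for relative frequency (probability estimate) and f for raw count. The second pair of formulas is only a sketch of how such a measure might be generalized to an n-gram by averaging over every point where the n-gram can be split in two; it is an assumption about the cited generalization, and the exact formula behind the weights in these slides may differ.

```latex
% Specific (pointwise) mutual information and Dice coefficient for a bigram (x, y)
\[
  \mathrm{SI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)},
  \qquad
  \mathrm{Dice}(x, y) = \frac{2\, f(x, y)}{f(x) + f(y)}
\]

% Assumed generalization to an n-gram w_1 ... w_n:
% replace p(x) p(y) by its average over the n-1 possible split points
\[
  \mathrm{Avp} = \frac{1}{n - 1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i)\, p(w_{i+1} \ldots w_n),
  \qquad
  \mathrm{SI}_f(w_1 \ldots w_n) = \log \frac{p(w_1 \ldots w_n)}{\mathrm{Avp}}
\]
```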
A “DNA” for Natural Language? (cont.)
Original document: The rain in Spain stays mainly in the plain.
n-grams: < in > < mai> < pla> < rai> < Spa> < sta> < the> <ain > <ainl> <ays > <e pl> <e ra>…
Assigning weights to n-grams:
  n-gram    Relative frequency    Fair Specific Mutual Information
  <Spai>    0.025                 2.013
  <inly>    0.025                 1.975
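The n-gram slicing and the relative-frequency column can be sketched in a few lines of Python. The function names are hypothetical, and dropping the final period is an assumption made here: without it the sentence has 40 four-character n-grams, so a single occurrence has relative frequency 1/40 = 0.025, which agrees with the value shown for <Spai>. The Fair Specific Mutual Information column is not reproduced by this sketch.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Return every slice of n adjoining characters, in document order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def relative_frequencies(text, n=4):
    """Map each distinct n-gram to count / total number of n-grams."""
    grams = char_ngrams(text, n)
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


# Assumption: the trailing period of the example sentence is not counted.
text = "The rain in Spain stays mainly in the plain"
freqs = relative_frequencies(text, n=4)
print(len(char_ngrams(text)), freqs["Spai"], freqs["inly"])
```

Frequency alone gives <Spai> and <inly> the same weight, which is why a second measure is used to tell them apart.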
Document genome translation
• So…
  • “Document genome” spliced into “pseudo-tRNA”.
  • Document used as “pseudo-mRNA”.
  • We “attach” pseudo-tRNA “molecules” (those with maximum weight) to the document while the average significance per character keeps growing.
• Result: document spliced into “chunks” with maximum average significance.
[Figure: pseudo-mRNA “The rain in Spain stays mainly in the plain.” with the pseudo-tRNA n-grams “The-” (20), “he-r” (29) and “e-ra” (24) attached in turn; the growing chunk is labelled “The” 20, “The r” 49, “The ra” 73, etc.]
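The attachment step can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the prototype's code: the function name is hypothetical, n-grams without a known weight count as 0, a new chunk starts right where the previous one ended, and the three weights are the ones shown in the figure (with “-” read as a space).

```python
def split_into_chunks(text, weights, n=4):
    """Greedily split text into chunks of growing average significance.

    weights maps an n-gram to its significance; missing n-grams weigh 0.
    Returns a list of (chunk, average significance per character).
    """
    chunks = []
    start = 0
    while start < len(text):
        # A chunk begins with the n-gram found at its first position.
        end = min(start + n, len(text))
        total = weights.get(text[start:end], 0.0)
        best_avg = total / (end - start)
        # Attach the next overlapping n-gram, one character at a time,
        # for as long as the average per character keeps growing.
        while end < len(text):
            gram = text[end - n + 1:end + 1]
            new_total = total + weights.get(gram, 0.0)
            new_avg = new_total / (end + 1 - start)
            if new_avg <= best_avg:
                break
            total, best_avg, end = new_total, new_avg, end + 1
        chunks.append((text[start:end], best_avg))
        start = end
    return chunks


# Weights taken from the figure; every other n-gram defaults to 0.
weights = {"The ": 20, "he r": 29, "e ra": 24}
text = "The rain in Spain stays mainly in the plain."
chunks = split_into_chunks(text, weights)
print(chunks[0])
```

With these weights the running totals are 20, 49 and 73 for “The ”, “The r” and “The ra”, so the average rises from 5 to 9.8 to about 12.2 and the first chunk stops growing only when the next n-gram would lower it.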
Folding the “protein” / summarization (work at an early stage)
• To obtain keyphrases, the “protein” (text chunks) must be folded…
• At the moment we are studying different alternatives:
  • Mutual reinforcement?
  • Chunks ≈ documents ⇒ apply classical IR techniques?
  • Others?
• Automatic text summarization:
  • Simple but useful approach.
  • Use the shortest paragraphs with the most significant keyphrases.
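One possible reading of the summarization rule is sketched below. How “shortest” and “most significant” are combined is not specified, so dividing the total keyphrase weight by paragraph length is an assumption, as are the function name, the `max_paragraphs` parameter and the toy data.

```python
def summarize(paragraphs, keyphrases, max_paragraphs=2):
    """Pick the paragraphs with the highest keyphrase weight per character.

    keyphrases maps a keyphrase to its significance. Selected paragraphs
    are returned in their original document order.
    """
    def score(paragraph):
        if not paragraph:
            return 0.0
        lowered = paragraph.lower()
        hits = sum(w for phrase, w in keyphrases.items()
                   if phrase.lower() in lowered)
        # Dividing by length favours short paragraphs (assumed rule).
        return hits / len(paragraph)

    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: score(paragraphs[i]), reverse=True)
    chosen = sorted(i for i in ranked[:max_paragraphs]
                    if score(paragraphs[i]) > 0)
    return [paragraphs[i] for i in chosen]


paragraphs = [
    "Rain falls in Spain.",
    "A much longer paragraph that mentions rain only once among plenty "
    "of unrelated filler words.",
    "Nothing relevant here.",
]
keyphrases = {"rain": 2.0, "Spain": 1.5}  # toy weights, for illustration only
summary = summarize(paragraphs, keyphrases, max_paragraphs=1)
print(summary)
```

Paragraphs that contain no keyphrase are never selected, so the summary can be shorter than `max_paragraphs`.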
To test the feasibility of these ideas, a prototype was developed.
• blindLight – http://www.purl.org/NET/blindLight
• It receives a user-provided URL and produces:
  • A “blindlighted” version of the page at that URL.
  • A list of keyphrases.
  • An automatic summary.
Conclusions
• Proof-of-concept tests have been performed.
  • Details in the paper…
  • Results can be improved.
  • Thorough study and analysis are needed.
  • Really promising!
• Summary of the proposal:
  • Free text from just one document.
  • Language independent (currently only Western languages).
  • Bio-inspired.
  • Extremely simple to implement.
Thank you! Merci beaucoup! ¡Muchas gracias!