
Linguistic techniques for Text Mining


Presentation Transcript


  1. Linguistic techniques for Text Mining NaCTeM team www.nactem.ac.uk Sophia Ananiadou Chikashi Nobata Yutaka Sasaki Yoshimasa Tsuruoka

  2. (Diagram) Natural Language Processing, supported by a lexicon and an ontology, turns raw (unstructured) text into annotated (structured) text via part-of-speech tagging, named entity recognition, and deep syntactic parsing. The example sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is shown annotated with part-of-speech tags, a phrase-structure tree (S, VP, NP, PP nodes), the entities TNF (protein_molecule), BHA (organic_compound) and PMA-stimulated U937 cells (cell_line), and the relation "negative regulation".

  3. Basic Steps of Natural Language Processing • Sentence splitting • Tokenization • Part-of-speech tagging • Shallow parsing • Named entity recognition • Syntactic parsing • (Semantic Role Labeling)

  4. Sentence splitting • Splitting one block of text into its sentences:
  Input: "Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection."
  Output:
  1) Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
  2) However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
  3) Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

  5. A heuristic rule for sentence splitting • Sentence boundary = period + space(s) + capital letter • Regular expression in Perl: s/\. +([A-Z])/\.\n\1/g;

  6. Errors • The heuristic wrongly splits the sentence "IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13)." after "e.g.", because "e.g. IL-5" also matches period + space(s) + capital letter (demonstrated in the sketch below). • Two solutions: • Add more rules to handle exceptions • Machine learning
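A minimal Python sketch of the same heuristic makes the failure concrete (the function name and test strings are ours, for illustration only):

```python
import re

def split_sentences(text):
    """Heuristic splitter: a period followed by space(s) and a capital
    letter is a sentence boundary (the Perl rule from slide 5)."""
    return re.sub(r'\. +([A-Z])', '.\n\\1', text).split('\n')

# Works on ordinary prose:
print(split_sentences("Protocols reduce Th1 cytokines. However, "
                      "cytokine production is important."))
# ['Protocols reduce Th1 cytokines.',
#  'However, cytokine production is important.']

# But wrongly splits after "e.g.", which also matches the pattern:
print(split_sentences("IL-33 is known to induce the production of "
                      "Th2-associated cytokines (e.g. IL-5 and IL-13)."))
# ['IL-33 is known to induce the production of Th2-associated cytokines (e.g.',
#  'IL-5 and IL-13).']
```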

  7. Tools for sentence splitting
  • JASMINE (rule-based): http://uvdb3.hgc.jp/ALICE/program_download.html
  • Scott Piao's splitter (rule-based): http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
  • OpenNLP (maximum-entropy learning; needs training data): https://sourceforge.net/projects/opennlp/

  8. Tokenization • Convert a sentence into a sequence of tokens • Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters! Example: "The protein is activated by IL2." → "The protein is activated by IL2 ." (the final period becomes a token of its own)

  9. Tokenization • Tokenizing general English sentences is relatively straightforward • Use spaces as the boundaries • Use some heuristics to handle exceptions, as in the example above

  10. Tokenisation issues
  • Separate possessive endings or abbreviated forms from preceding words:
  Mary's → Mary 's
  Mary's → Mary is
  Mary's → Mary has
  • Separate punctuation marks and quotes from words:
  Mary. → Mary .
  "new" → " new "

  11. Tokenization
  • Tokenizer.sed, a simple script in sed: http://www.cis.upenn.edu/~treebank/tokenization.html
  • Undesirable tokenization: the original "1,25(OH)2D3" is tokenized as "1 , 25 ( OH ) 2D3" (see the sketch below)
  • Tokenization for biomedical text is not straightforward: does it need a dictionary? Machine learning?
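A rough Python approximation of this style of tokenization (a simplification for illustration, not the actual tokenizer.sed script) reproduces the problem:

```python
import re

def naive_tokenize(sentence):
    """Pad punctuation with spaces, detach the sentence-final period,
    then split on whitespace (roughly Penn-Treebank style)."""
    sentence = re.sub(r'([,()\[\];:"])', r' \1 ', sentence)
    sentence = re.sub(r'\.$', ' .', sentence)
    return sentence.split()

print(naive_tokenize("The protein is activated by IL2."))
# ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']

print(naive_tokenize("1,25(OH)2D3 was added."))
# ['1', ',', '25', '(', 'OH', ')', '2D3', 'was', 'added', '.']
# The chemical name is torn apart: exactly the undesirable
# tokenization shown above.
```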

  12. Tokenisation problems in bio-text (K. Cohen, NAACL 2007)
  • Commas: 2,6-diaminohexanoic acid; tricyclo(3.3.1.13,7)decanone
  • Four kinds of hyphens:
  • "Syntactic": calcium-dependent, Hsp-60
  • Knocked-out gene: lush-- flies
  • Negation: -fever
  • Electric charge: Cl-

  13. Tokenisation • Tokenization divides the text into smallest units (usually words), removing punctuation. Challenge: what should be done with punctuation that has linguistic meaning?
  • Negative charge (Cl-)
  • Absence of symptom (-fever)
  • Knocked-out gene (Ski-/-)
  • Gene name (IL-2 –mediated)
  • Plus, "syntactic" uses (insulin-dependent) (K. Cohen, NAACL 2007)

  14. Part-of-speech tagging • Assign a part-of-speech tag to each token in a sentence:
  The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

  15. Part-of-speech tags • The Penn Treebank tagset: http://www.cis.upenn.edu/~treebank/ • 45 tags, including:
  NN Noun, singular or mass
  NNS Noun, plural
  NNP Proper noun, singular
  NNPS Proper noun, plural
  VB Verb, base form
  VBD Verb, past tense
  VBG Verb, gerund or present participle
  VBN Verb, past participle
  VBZ Verb, 3rd person singular present
  JJ Adjective
  JJR Adjective, comparative
  JJS Adjective, superlative
  DT Determiner
  CD Cardinal number
  CC Coordinating conjunction
  IN Preposition or subordinating conjunction
  FW Foreign word
  …

  16. Part-of-speech tagging is not easy • Parts-of-speech are often ambiguous: "go" is a verb in "I have to go to school." but a noun in "I had a go at skiing." • We need to look at the context • But how?

  17. Writing rules for part-of-speech tagging • If the previous word is "to", then it's a verb ("I have to go to school.") • If the previous word is "a", then it's a noun ("I had a go at skiing.") • If the next word is … • Writing such rules manually is impossible

  18. Learning from examples • Training data: manually tagged text such as
  The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN …
  • A machine learning algorithm trained on such examples can then tag unseen text: "We demonstrate that …" → We/PRP demonstrate/VBP that/IN …

  19. Part-of-speech tagging with Hidden Markov Models • An HMM models the joint probability of a word sequence and a tag sequence as P(w1 … wn, t1 … tn) = Π i P(ti | ti-1) · P(wi | ti), where P(ti | ti-1) is the transition probability between tags and P(wi | ti) is the output probability of a word given its tag

  20. First-order Hidden Markov Models • Training: estimate the transition and output probabilities by counting relative frequencies in the training corpus (+ smoothing) • Using the tagger: find the tag sequence that maximizes P(w, t), e.g. with the Viterbi dynamic programming algorithm, as in the sketch below
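The sketch below implements both steps on toy data (our own variable names; no smoothing, and plain probabilities rather than log-probabilities, for brevity):

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate transition P(tag | prev_tag) and output P(word | tag)
    by counting relative frequencies."""
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for sentence in tagged_sentences:
        prev = '<s>'                      # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    for table in (trans, emit):          # normalize counts
        for counts in table.values():
            total = sum(counts.values())
            for key in counts:
                counts[key] /= total
    return trans, emit

def viterbi(words, trans, emit, tags):
    """Find the most probable tag sequence by dynamic programming."""
    V = [{t: trans['<s>'][t] * emit[t].get(words[0], 1e-6) for t in tags}]
    back = []
    for word in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(V[-1], key=lambda p: V[-1][p] * trans[p][t])
            col[t] = V[-1][prev] * trans[prev][t] * emit[t].get(word, 1e-6)
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    tag = max(V[-1], key=V[-1].get)      # best final tag, then trace back
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]

train = [[('He', 'PRP'), ('opened', 'VBD'), ('it', 'PRP')],
         [('He', 'PRP'), ('walked', 'VBD'), ('home', 'NN')]]
trans, emit = train_hmm(train)
print(viterbi(['He', 'opened', 'it'], trans, emit, {'PRP', 'VBD', 'NN'}))
# ['PRP', 'VBD', 'PRP']
```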

  21. Machine learning using diverse features • We want to use diverse types of information when predicting the tag • Example: to tag "opened" in "He opened it" as a verb, many clues help: the word is "opened", the suffix is "ed", the previous word is "He", …

  22. Machine learning with log-linear models • The conditional probability of a tag y in context x is p(y | x) = exp( Σ i λi fi(x, y) ) / Z(x), where each fi is a feature function, each λi is its feature weight, and Z(x) = Σ y' exp( Σ i λi fi(x, y') ) normalizes the distribution

  23. Machine learning with log-linear models • Maximum likelihood estimation: find the parameters that maximize the conditional log-likelihood of the training data, L(λ) = Σ (x, y) log p(y | x) • Gradient with respect to each weight λi: the observed count of feature fi in the training data minus the model expectation of fi, so training converges when the two match (sketched below)
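The toy sketch below (two tags and the word/suffix features of the next slide; the feature-string format and learning rate are ours) trains such a model by plain gradient ascent:

```python
import math

def prob(x, y, w, feats, labels):
    """p(y|x) = exp(sum_i w_i * f_i(x, y)) / Z(x)."""
    score = lambda lab: math.exp(sum(w.get(f, 0.0) for f in feats(x, lab)))
    return score(y) / sum(score(lab) for lab in labels)

def gradient(data, w, feats, labels):
    """Observed feature counts minus model expectations of features."""
    g = {}
    for x, y in data:
        for f in feats(x, y):
            g[f] = g.get(f, 0.0) + 1.0          # observed count
        for lab in labels:
            p = prob(x, lab, w, feats, labels)
            for f in feats(x, lab):
                g[f] = g.get(f, 0.0) - p        # model expectation
    return g

def feats(word, tag):
    return ['word=%s&tag=%s' % (word, tag),
            'suffix=%s&tag=%s' % (word[-2:], tag)]

data = [('opened', 'Verb'), ('book', 'Noun')]
w, labels = {}, {'Noun', 'Verb'}
for step in range(50):                          # gradient ascent
    for f, g in gradient(data, w, feats, labels).items():
        w[f] = w.get(f, 0.0) + 0.5 * g
print(round(prob('opened', 'Verb', w, feats, labels), 3))   # close to 1.0
```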

  24. Computing likelihood and model expectation • Example • Two possible tags, "Noun" and "Verb", and two types of features, "word" and "suffix" • For "opened" in "He opened it", the likelihood and the expectations are computed by scoring tag = noun against tag = verb with the features word="opened" and suffix="ed"

  25. Conditional Random Fields (CRFs) • A single log-linear model on the whole sentence • The number of classes (possible tag sequences) is HUGE, growing exponentially with sentence length, so it is impossible to do the estimation in a naive way.

  26. Conditional Random Fields (CRFs) • Solution • Let’s restrict the types of features • You can then use a dynamic programming algorithm that drastically reduces the amount of computation • Features you can use (in first-order CRFs) • Features defined on the tag • Features defined on the adjacent pair of tags

  27. Features • Feature weights are associated with states and edges of the tag lattice • In the lattice for "He has opened it" with candidate tags Noun/Verb at every position, a state feature such as W0="He" & Tag = Noun fires on a node, and an edge feature such as Tagleft = Noun & Tagright = Noun fires on a transition between adjacent tags

  28. A naive way of calculating Z(x): enumerate and score all 2^4 = 16 tag sequences of the four-word sentence
  Noun Noun Noun Noun = 7.2    Verb Noun Noun Noun = 4.1
  Noun Noun Noun Verb = 1.3    Verb Noun Noun Verb = 0.8
  Noun Noun Verb Noun = 4.5    Verb Noun Verb Noun = 9.7
  Noun Noun Verb Verb = 0.9    Verb Noun Verb Verb = 5.5
  Noun Verb Noun Noun = 2.3    Verb Verb Noun Noun = 5.7
  Noun Verb Noun Verb = 11.2   Verb Verb Noun Verb = 4.3
  Noun Verb Verb Noun = 3.4    Verb Verb Verb Noun = 2.2
  Noun Verb Verb Verb = 2.5    Verb Verb Verb Verb = 1.9
  Sum = 67.5

  29. Dynamic programming • Results of intermediate computation can be reused • Forward pass: scores are accumulated left to right through the Noun/Verb lattice for "He has opened it"

  30. Dynamic programming • Results of intermediate computation can be reused • Backward pass: the same accumulation, run right to left through the lattice

  31. Dynamic programming • Computing the marginal distribution of each tag: combine the forward and backward scores at each node of the lattice, as in the sketch below
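A small sketch makes the saving explicit: the naive enumeration of slide 28 and the forward recursion compute exactly the same Z(x), but in O(2^n) versus O(n · |tags|^2) time. The state and edge potentials here are arbitrary stand-ins for exponentiated feature weights:

```python
import itertools

TAGS = ['Noun', 'Verb']

def state(i, tag):        # hypothetical per-node potential
    return 1.5 if tag == 'Noun' else 1.0

def edge(left, right):    # hypothetical per-edge potential
    return 2.0 if left != right else 0.5

def z_naive(n):
    """Score every one of the 2^n tag sequences and sum (slide 28)."""
    total = 0.0
    for seq in itertools.product(TAGS, repeat=n):
        s = state(0, seq[0])
        for i in range(1, n):
            s *= edge(seq[i - 1], seq[i]) * state(i, seq[i])
        total += s
    return total

def z_forward(n):
    """Forward dynamic programming: reuse intermediate sums."""
    alpha = {t: state(0, t) for t in TAGS}
    for i in range(1, n):
        alpha = {t: state(i, t) * sum(alpha[p] * edge(p, t) for p in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

print(z_naive(4), z_forward(4))   # identical values, very different cost
```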

  32. Maximum entropy learning and Conditional Random Fields
  • Maximum entropy learning: log-linear modeling + MLE; parameter estimation needs the likelihood of each sample and the model expectation of each feature
  • Conditional Random Fields: log-linear modeling on the whole sentence; features are defined on states and edges; dynamic programming keeps the computation tractable

  33. POS tagging algorithms • Performance on the Wall Street Journal corpus

  34. POS taggers
  • Brill's tagger: http://www.cs.jhu.edu/~brill/
  • TnT tagger: http://www.coli.uni-saarland.de/~thorsten/tnt/
  • Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml
  • SVMTool: http://www.lsi.upc.es/~nlp/SVMTool/
  • GENIA tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

  35. Tagging errors made by a WSJ-trained POS tagger
  … and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
  … two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
  … by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
  … to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
  Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN …

  36. Taggers for general text do not work well on biomedical text • Performance of the Brill tagger evaluated on 1,000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004) • Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005)

  37. MedPost (Smith et al., 2004)
  • Hidden Markov Models (HMMs)
  • Training data: 5,700 sentences randomly selected from various thematic subsets
  • Accuracy: 97.43% (native tagset), 96.9% (Penn tagset), evaluated on 1,000 sentences
  • Available from ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz

  38. Training POS taggers with bio-corpora (Tsuruoka and Tsujii, 2005)

  39. Performance on new data • Relative performance evaluated on recent abstracts selected from three journals: Nucleic Acids Research (NAR), Nature Medicine (NMED), and the Journal of Clinical Investigation (JCI)

  40. Chunking (shallow parsing) • A chunker (shallow parser) segments a sentence into non-recursive phrases: [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only #1.8 billion] [PP in] [NP September] .

  41. Extracting noun phrases from MEDLINE (Bennett, 1999) • Rule-based noun phrase extraction: tokenization, part-of-speech tagging, pattern matching • Noun phrase extraction accuracies were evaluated on 40 abstracts

  42. Chunking with Machine learning • Chunking performance on Penn Treebank

  43. Machine learning-based chunking
  • Convert a treebank into sentences that are annotated with chunk information (see the IOB encoding sketch below)
  • CoNLL-2000 data set: http://www.cnts.ua.ac.be/conll2000/chunking/ (the conversion script is available)
  • Apply a sequence tagging algorithm such as HMM, MEMM, CRF, or Semi-CRF
  • YamCha, an SVM-based chunker: http://www.chasen.org/~taku/software/yamcha/
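As a sketch of the conversion step, here is the standard IOB (BIO) encoding applied to the chunked sentence of slide 40 (the chunk list is transcribed from that slide):

```python
# Chunks as (type, tokens); None marks tokens outside any chunk.
chunks = [('NP', ['He']), ('VP', ['reckons']),
          ('NP', ['the', 'current', 'account', 'deficit']),
          ('VP', ['will', 'narrow']), ('PP', ['to']),
          ('NP', ['only', '#1.8', 'billion']),
          ('PP', ['in']), ('NP', ['September']), (None, ['.'])]

def to_iob(chunks):
    """First token of a chunk gets B-<type>, the rest I-<type>, and
    tokens outside any chunk get O. Chunking thereby becomes a
    per-token sequence-tagging problem (HMM, MEMM, CRF, ...)."""
    out = []
    for ctype, tokens in chunks:
        for i, tok in enumerate(tokens):
            tag = 'O' if ctype is None else ('B-' if i == 0 else 'I-') + ctype
            out.append((tok, tag))
    return out

for tok, tag in to_iob(chunks):
    print(tok, tag)
# He B-NP / reckons B-VP / the B-NP / current I-NP / ... / . O
```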

  44. GENIA tagger
  • Algorithm: bidirectional MEMM
  • POS tagging: trained on WSJ, GENIA and Penn BioIE; accuracy 97-98%
  • Shallow parsing: trained on WSJ and GENIA; accuracy 90-94%
  • Can output base forms
  • Available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

  45. Named-Entity Recognition • Recognize named entities in a sentence • Gene/protein names • Entity classes: protein, DNA, RNA, cell_line, cell_type • Example: in "We have shown that interleukin-1 (IL-1) and IL-2 control IL-2 receptor alpha (IL-2R alpha) gene transcription in CD4-CD8- murine T lymphocyte precursors.", "interleukin-1", "IL-1" and "IL-2" are tagged protein, "IL-2 receptor alpha (IL-2R alpha) gene" is tagged DNA, and "CD4-CD8- murine T lymphocyte precursors" is tagged cell_line
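For contrast with the learned models evaluated on the next slides, here is a toy dictionary-based recognizer (greedy longest match against a gazetteer; the gazetteer entries and names are ours). Real systems typically combine such gazetteer lookups, as features, with machine-learned models:

```python
def dictionary_ner(tokens, gazetteer):
    """Greedy longest-match lookup of token spans in a gazetteer."""
    entities, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):   # try the longest span first
            span = ' '.join(tokens[i:j])
            if span in gazetteer:
                entities.append((span, gazetteer[span]))
                i = j
                break
        else:
            i += 1                            # no entity starts here
    return entities

gazetteer = {'interleukin-1': 'protein', 'IL-2': 'protein',
             'IL-2 receptor alpha gene': 'DNA'}
sentence = ('We have shown that interleukin-1 and IL-2 control '
            'IL-2 receptor alpha gene transcription').split()
print(dictionary_ner(sentence, gazetteer))
# [('interleukin-1', 'protein'), ('IL-2', 'protein'),
#  ('IL-2 receptor alpha gene', 'DNA')]
```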

  46. Performance of biomedical NE recognition • Shared task data for the COLING 2004 BioNLP workshop • Entity types: protein, DNA, RNA, cell_type, and cell_line

  47. Features • Classification models and main features used in NLPBA (Kim, 2004)
  • Classification model (CM): S: SVM; H: HMM; M: MEMM; C: CRF
  • Features: lx: lexical features; af: affix information (character n-grams); or: orthographic information; sh: word shapes; gn: gene sequences; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags; sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities; do: global document information; pa: parentheses handling; pre: previously predicted entity tags
  • External resources: B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE

  48. CFG parsing • A context-free grammar parse of "Estimated volume was a light 2.4 million ounces .":
  (S (NP (VBN Estimated) (NN volume)) (VP (VBD was) (NP (DT a) (JJ light) (QP (CD 2.4) (CD million)) (NNS ounces))) (. .))

  49. Phrase structure + head information • The same parse tree with the head of each constituent marked: "volume" heads the subject NP, "ounces" heads the object NP, and "was" heads the VP and the whole sentence

  50. Dependency relations • The same sentence as head-dependent links between words: "Estimated" depends on "volume", "volume" on "was", "a", "light" and "2.4 million" on "ounces", and "ounces" on "was"
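The three views on slides 48-50 are mechanically related: once each constituent has a designated head child, every non-head child's head word becomes a dependent of the head child's head word. The sketch below demonstrates this on the slide's sentence, with deliberately crude head rules that only cover this example (real parsers use much richer head-percolation tables):

```python
def parse_tree(s):
    """Read a bracketed parse such as '(S (NP ...) ...)' into
    (label, children) pairs; a preterminal's only child is its word."""
    toks = s.replace('(', ' ( ').replace(')', ' ) ').split()
    def read(pos):
        label, pos = toks[pos + 1], pos + 2
        children = []
        while toks[pos] != ')':
            if toks[pos] == '(':
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(toks[pos])
                pos += 1
        return (label, children), pos + 1
    return read(0)[0]

HEAD_RULES = {'S': ['VP'], 'VP': ['VBD'], 'NP': ['NNS', 'NN'], 'QP': ['CD']}

def head_of(node):
    """Head word of a constituent: recurse into the head child."""
    label, children = node
    if isinstance(children[0], str):        # preterminal: return the word
        return children[0]
    for wanted in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == wanted:
                return head_of(child)
    return head_of(children[-1])            # fallback: last child

def dependencies(node, deps):
    """Each non-head child's head word depends on the node's head word."""
    label, children = node
    if isinstance(children[0], str):
        return
    head = head_of(node)
    for child in children:
        if head_of(child) != head:
            deps.append((head_of(child), head))   # (dependent, head)
        dependencies(child, deps)

tree = parse_tree(
    '(S (NP (VBN Estimated) (NN volume))'
    ' (VP (VBD was) (NP (DT a) (JJ light)'
    ' (QP (CD 2.4) (CD million)) (NNS ounces))) (. .))')
deps = []
dependencies(tree, deps)
print(deps)
# [('volume', 'was'), ('Estimated', 'volume'), ('ounces', 'was'),
#  ('a', 'ounces'), ('light', 'ounces'), ('2.4', 'ounces'),
#  ('million', '2.4'), ('.', 'was')]
```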
