
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text


Presentation Transcript


  1. Deep Learning for Domain-Specific Entity Extraction from Unstructured Text • Mohamed AbdelHady, Sr. Data Scientist • Zoran Dzunic, Data Scientist • Cloud AI

  2. Goals • What is entity extraction? • When do I need to train a custom entity extraction model? • Why should I use Deep Neural Networks and which one? • What are word embeddings and why should I use them as features? • How do I train a custom word embedding model? • How do I train a custom Deep Neural Network model for entity extraction? • What is the best architecture for these two tasks?

  3. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  4. Entity Extraction • Subtask of information extraction • Also known as Named-entity recognition (NER), entity chunking and entity identification • Find phrases in text that refer to a real-world entity of specific types Mr. Zoran is at Strata in San Jose on March 7.

  5. Entity Extraction • Subtask of information extraction • Also known as Named-entity recognition (NER), entity chunking and entity identification • Find phrases in text that refer to a real-world entity of specific types: Mr. Zoran is at Strata in San Jose on March 7. Zoran : PERSON, Strata : ORG, San Jose : LOC, March 7 : DATE

  6. Why is it useful? • Indexing • e.g., find all people in a document collection • Discovery • e.g., learn about new drugs from recent biomedical articles • Relation extraction • WORKS (PERSON, ORG), e.g., WORKS (Zoran, Microsoft) • Question answering • Where is Zoran? Zoran is in San Jose.

  7. Custom Entity Extraction • Pre-trained models for common entity types (person, org, loc, date, …) • Custom models • new entity types (e.g., drug, disease) • different language properties • foreign names • non-proper sentences (e.g., tweets) • specific domain (e.g., biomedical domain)

  8. Domain-Specific Entity Extraction Biomedical named entity recognition • Entity types drug/chemical, disease, protein, DNA, etc. • Critical step for complex biomedical NLP tasks: • Extraction of diseases, symptoms from electronic medical or health records • Understanding the interactions between different entity types such as drug-drug interaction, drug-disease relationship and gene-protein relationship, e.g., • Drug A cures Disease B. • Drug A causes Disease B. Similar for other domains (e.g., legal, finance)

  9. Biomedical Entity Extraction • Vast Amount of Medical Articles

  10. Approach • Feature Extraction Phase – Domain Specific Features Use a large amount of unlabeled domain-specific data corpus such as Medline PubMed abstracts to train a neural word embedding model. • Model Training Phase – Domain Specific Model The output embeddings are considered as automatically generated features to train a neural entity extractor using a small/reasonable amount of labeled data.

  11. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  12. Why Machine Learning? • Dictionary approach • What if a phrase is not in the dictionary? Zoran is at Strata in San Jose on March 7. Mr. Zoran is at Strata in San Jose on March 7. Zoran Smith is at Strata in San Jose on March 7.

  13. Why Machine Learning? • Statistical (ML) approach • Takes features from surrounding words Zoran is at Strata in San Jose on March 7. Mr. Zoran is at Strata in San Jose on March 7. Zoran Smith is at Strata in San Jose on March 7. Clues: after “Mr.”, before a known last name (“Smith”) • Structured prediction (sequence tagging) – label for a word depends on other labels • Rather than classifying each word independently

  14. Why Deep Learning? • Can find complex relationships between input and output using • Non-linear processing units • Multiple hidden layers • Special-purpose layers/architectures have been developed for different problems • Recurrent Neural Networks (RNNs) are commonly used for sequence-labeling tasks • Long Short-Term Memory layer (LSTM) • Comparison to Conditional Random Fields (CRFs) • CRFs are linear models w/o hidden layers • CRFs have short-term memory

  15. Recurrent Neural Networks • The state h_t denotes the memory of the network and is responsible for capturing information about previous time steps. • An RNN shares the same set of parameters across all time steps, because the same task is performed at each step, just with different inputs.
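
A compact way to state this parameter sharing is the standard RNN recurrence (a textbook formulation, not taken from the slides), in which the same weight matrices are applied at every time step:

```latex
h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h), \qquad
y_t = \operatorname{softmax}(W_{hy}\, h_t + b_y)
```

Here x_t is the feature vector for the word at position t, h_t is the hidden state carried forward, and y_t is the distribution over entity tags; only W_{xh}, W_{hh}, W_{hy} and the biases are learned, regardless of sequence length.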

  16. Each rectangle represents a feature vector

  17. Long Short-Term Memory Networks (LSTMs) • Vanilla RNNs suffer from the drawback that they cannot keep track of long-term dependencies (the vanishing gradient problem). • LSTMs are a special type of RNN designed to solve this problem by using four interacting layers in the repeating module.

  18. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  19. Features: Words • Words: Naloxone reverses the … • Tags: B-Chemical O O …

  20. Features: One-Hot Encoding • Tags: B-Chemical O O … • One-hot: dim = |V| (e.g., 10^5) • Vectors: [1, 0, 0 …], [0, 0, 1 …], [0, 1, 0 …]

  21. Features: Word Embeddings • Tags: B-Chemical O O … • Embedding: small dim (e.g., 50, 200) • Vectors: [0.3, 0.2, 0.9 …], [0.8, 0.8, 0.1 …], [0.5, 0.1, 0.5 …]
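
To make the contrast concrete, here is a toy sketch of one-hot versus embedding features. The three-word vocabulary and the random matrix are made up for illustration; real setups use |V| around 10^5 and d = 50–200, with the matrix learned by Word2Vec rather than sampled.

```python
import numpy as np

vocab = ["Naloxone", "reverses", "the"]   # toy vocabulary
V, d = len(vocab), 3                      # in practice: |V| ~ 1e5, d = 50-200

# One-hot: each word is a sparse |V|-dimensional indicator vector.
one_hot = np.eye(V)                       # one_hot[i] encodes vocab[i]

# Embedding: each word is a dense d-dimensional vector looked up from a matrix
# that is learned (e.g., by Word2Vec) rather than fixed by the vocabulary index.
embedding_matrix = np.random.rand(V, d)   # placeholder for learned vectors
naloxone_vec = embedding_matrix[vocab.index("Naloxone")]
print(one_hot[0], naloxone_vec)
```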

  22. Word2Vec • A simple neural network with a single hidden layer and a linear activation function (Skip-Gram, CBOW) • Unsupervised learning from large corpora; the word vectors (embeddings) are learned with stochastic gradient descent • Publicly available pre-trained models, e.g., Google News • Can we do better on a specific domain? CBOW Neural Network Model

  23. Skip-Gram Neural Network Model
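
As a concrete illustration of this embedding-training step, here is a minimal PySpark sketch. The vector_size/window_size/min_count values are the ones listed on the Experimental Setup slide later in the deck; the input path and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("pubmed-word2vec").getOrCreate()

# Hypothetical input: one row per PubMed abstract, tokenized into a "words" column.
abstracts = spark.read.parquet("pubmed_abstracts_tokenized.parquet")

# Parameter values as listed on the Experimental Setup slide.
word2vec = Word2Vec(vectorSize=50, windowSize=5, minCount=1000,
                    inputCol="words", outputCol="doc_vector")
model = word2vec.fit(abstracts)

# The learned word vectors become the embedding matrix for the LSTM tagger.
word_vectors = model.getVectors()   # DataFrame with columns "word", "vector"
word_vectors.write.parquet("pubmed_word_vectors.parquet")
```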

  24. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  25. Spark on HDInsight • Parallel in-memory computing for Big Data workloads • Scales to thousands of machines • Offers multiple libraries that include machine learning, streaming, and graph analysis • APIs in Scala, Python, R • Cluster diagram: a Spark application (a.k.a. Driver) creates a SparkContext, e.g., val sc = new SparkContext(master="mesos://.."), and submits tasks to a Master (a.k.a. Cluster Manager), which schedules them on Executors running on Slave (a.k.a. Worker) nodes

  26. GPU DSVMs • Physical card = 2 x NVIDIA Tesla GPUs • Use NC-class Azure VMs with GPUs • Great for Deep Learning workloads

  27. AML Workbench • Sample, understand, and prep data rapidly • Support for Spark + Python + R (roadmap) • Execute jobs locally, on remote VMs, Spark clusters, SQL on-premises • Git-backed tracking of code, config, parameters, data, run history

  28. Tutorial: http://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction

  29. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  30. Datasets • Proteins, Cell Line, Cell Type, DNA and RNA Detection: Bio-Entity Recognition Task at BioNLP/NLPBA 2004 • Chemicals and Diseases Detection: BioCreative V CDR task corpus • Drugs Detection: SemEval 2013 Task 9.1 (Drug Recognition)

  31. Experimental Setup • Provision an Azure HDInsight Spark cluster • Train the Word2Vec model: window_size = 5, vector_size = 50, min_count = 1000 • Provision an Azure GPU DSVM (NC6 Standard: 56 GB RAM, NVIDIA Tesla K80) • Keras with TensorFlow as backend: initialize the embedding layer with the trained embeddings, define the LSTM RNN architecture, train the LSTM RNN model • CRFSuite: extract traditional features, train the CRF model

  32. RNN Architecture

  33. RNN Graph Definition
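
The graph definition shown on this slide is not reproduced in the transcript; a minimal Keras (TensorFlow backend) sketch of such a tagger, in the spirit of the pipeline described on the Experimental Setup slide, might look like the following. The bidirectional wrapper, layer sizes, sequence length, and tag count are illustrative assumptions, and the random embedding matrix stands in for the Word2Vec vectors trained on the PubMed abstracts.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, TimeDistributed, Dense

# Illustrative sizes; in the real pipeline the embedding matrix rows are the
# Word2Vec vectors trained on the PubMed abstracts.
vocab_size, embedding_dim, max_seq_len, num_tags = 50000, 50, 120, 11
embedding_matrix = np.random.rand(vocab_size, embedding_dim)  # placeholder

model = Sequential()
# Initialize the embedding layer with the pre-trained domain-specific vectors.
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                    weights=[embedding_matrix], input_length=max_seq_len,
                    trainable=False))
# Assumption: a bidirectional LSTM over the token sequence.
model.add(Bidirectional(LSTM(150, return_sequences=True)))
# One softmax over the tag set (B-Chemical, I-Chemical, ..., O) per token.
model.add(TimeDistributed(Dense(num_tags, activation="softmax")))
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```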

  34. Dataset Description • Chemicals and Diseases Detection • BioCreative V CDR task corpus, 2015 http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/ • Training Data Stats • # words = 214,198 • # disease entities = 7699 • # chemical entities = 9575 • Test Data Stats • # words = 112,285 • # disease entities = 4080 • # chemical entities = 4941

  35. Results (exact match)

  36. Embedding Comparison

  37. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  38. Takeaways • Recipe for building a custom entity extraction pipeline: • Get a large amount of in-domain unlabeled data • Train a word2vec model on the unlabeled data on Spark • Get as much labeled data as possible • Train an LSTM-based neural network on a GPU-enabled machine • Word embeddings are powerful features • Convey word semantics • Perform better than traditional features • No feature engineering • An LSTM NN is a more powerful model than a traditional CRF

  39. Reference • GitHub – http://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction • Documentation – http://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition • Azure ML – https://docs.microsoft.com/en-us/azure/machine-learning/preview/overview-what-is-azure-ml • Azure ML Workbench – https://docs.microsoft.com/en-us/azure/machine-learning/preview/quickstart-installation • DSVM – http://aka.ms/dsvm

  40. Extras

  41. Input Gate Layer • A sigmoid layer that decides which values need to be updated; a tanh layer completes the step by producing new candidate values for the state. • Forget Gate Layer • A sigmoid layer that looks at the previous hidden state (h_(t-1)) and the current input (x_t) and decides which information to forget from the previous cell state. • Update Previous States • Update the cell state based on the previous decisions. • Output Layer • Use a sigmoid layer and a tanh layer to decide what to output.
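
For reference, the gate descriptions above correspond to the standard LSTM equations (a textbook formulation, not taken from the slides), where sigma is the logistic sigmoid and \odot denotes element-wise multiplication:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate}\\
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate}\\
\tilde{C}_t &= \tanh\!\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate cell values}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate}\\
h_t &= o_t \odot \tanh(C_t) && \text{new hidden state / output}
\end{aligned}
```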
