Natural Language Processing Techniques for Managing Legal Resources JURIX 2009 Tutorial Erasmus University, School of Law Rotterdam, The Netherlands December 16, 2009 Adam Wyner University College London adam@wyner.info www.wyner.info/LanguageLogicLawSoftware

Overview • Preliminary comments. • Natural Language Processing elements. • GATE introduction. • Samples of GATE applied to legal resources -- legislation, case based reasoning, and gazettes. • From GATE to ontologies and logical representations.

Main Point Legal text expressed in natural language can be automatically annotated with semantic mark ups using natural language processing systems such as the General Architecture for Text Engineering (GATE). Using the annotations, we can (in principle) extract the information from the texts then use it for answering queries or reasoning.

Outcome from the Tutorial • Overview of key issues and objectives of NLP with respect to legal resources. • Idea of how one NLP tool (GATE) works and can be used. • Idea of what some initial results might be. • Sense of what can be done.

Audience • Law school students, legal professionals, public administrators. Get things done that are relevant to them. • AI and Law researchers. A collaborative framework for research and development.

What the Tutorial is.... • A report of learning and working with this material. Faster way in. • An invitation to collaborate as a research and development community using a common framework. • A presentation of initial elements of a work in progress, not a through and through prototype or fully fledged system.

Open Data Lists, Rules, and Development Environment • Contribute to the research community and build on past developments. • Teaching and learning. • Interchange. Semantic web chokes on different formats. • No publication without replication. Text analytics has an experimental aspect. • On balance, academic research ought to contribute to the common good rather than be proprietary. If you need to own it, work at a company. • Distributive research, stream results.

Sample Texts • Legislation (EU and UK) for structure and rules. • Case decisions (US on intellectual property and crime) for details and CBR factors. • Gazettes (UK). • On paper: • What information do you want to identify? • How can you identify it (e.g. how do you know what you know)? • What do you want to do with it?

Semantic Web • Want to be able to do web based information extraction and reasoning with legal documents, e.g. find the attorneys who get decisions for plaintiffs in a corpus of case law. • Machine only “sees” strings of characters, while we “see” and use meaning. • John Smith, for plaintiff..... The decision favours plaintiff. • How can we do this? Semantic annotation of documents, then extraction of those annotations in meaningful relations. • “Self-describing” documents

What is the Problem? • Natural language supports implicit information, multiple forms with the same meaning, the same form with multiple meanings (context), and dispersed meanings: • Entity ID: Jane Smith, for plaintiff. • Relation ID: Edgar Wilson disclosed the formula to Mary Hays. • Jane Smith, Jane R. Smith, Smith, Attorney Smith.... • Jane Smith in one case decision need not be the same Jane Smith in another case decision. • Jane Smith represented Jones Inc. She works for Dewey, Chetum, and Howe. To contact her, write to j.smith@dch.com • As for names, so too with sentences.

Knowledge Light v. Heavy Approaches • Statistical approaches - compare and contrast large bodies of textual data, identifying regularities and similarities. Sparse data problem. No rules extracted. Useful for ranking documents for relevance. • Machine learning - apply learning algorithms to known material to extend results to unknown material. Needs known, annotated material. Useful for text classification. Black box - cannot really know the rules that are learned and use them further. • Lists, rules, and processes - know what we are looking for. Know the rules and can further use and develop them. Labour and knowledge intensive.

Knowledge Light v. Heavy Approaches • Some mix of the approaches. • The importance of humanly accessible explanation and justification in some domains of the law warrants a knowledge heavy approach.

Overview • Motivations and objectives of NLP in this context. • General Architecture for Text Engineering (GATE). • Processing and marking up text. • Other technologies for parsing and semantic interpretation (C&C/Boxer).

Motivation • Annotate large legacy corpora. • Address growth of corpora. • Reduce number of human annotators and tedious work. • Make annotation systematic, automatic, and consistent. • Annotate fine-grained information: • Names, locations, addresses, web links, organisations, actions, argument structures, relations between entities. • Map from well-drafted documents in NL to RDF/OWL.

Approaches • Top-down vs. Bottom-up approaches: • Both do initial (and iterative) analysis of the texts in the target corpora. • Top-down defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application. • Annotation system is ‘defined’ in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. Apply the annotation system to text, which outputs annotated text. Knowledge intensive in development. • Convergent/complementary/integrated approaches. • Bottom-up reconstructs and implements linguistic knowledge. However, there are limits....

Objectives of NLP • Generation – convert information in a database into natural language. • Understanding – convert natural language into a machine readable form. Support inference? • Information Retrieval – gather documents which contain key words or phrases. Preindex (list of what documents a word appears in) the corpus to speed retrieval (e.g. Google). Rank documents in terms of “relevance”. Documents must be read to identify information.
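The preindexing idea under Information Retrieval can be sketched roughly as follows. This is only a minimal illustration, assuming a tiny in-memory corpus; the document ids (`case1`, `case2`) and the helper names `build_index` and `search` are hypothetical, and no ranking is attempted.

```python
import re
from collections import defaultdict


def build_index(corpus):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that contain every query word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []


# Hypothetical two-document corpus built from the example sentences in the slides.
corpus = {
    "case1": "Jane Smith, for plaintiff. The decision favours plaintiff.",
    "case2": "Edgar Wilson disclosed the formula to Mary Hays.",
}
index = build_index(corpus)
print(search(index, "plaintiff"))
print(search(index, "formula", "disclosed"))
```

The index answers "which documents mention these words" quickly, but, as the slide notes, the documents still have to be read to find the information itself.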

Objectives of NLP • Text Summarization – summarize (in a paragraph) the main meaning of a text or corpus. • Question Answering – queries made and answers given in natural language with respect to some corpus of texts. • Information Extraction – identify and extract information from documents which is then reused or represented. The information should be meaningfully related. • Information extraction can be used to improve information retrieval.

Objectives of NLP • Automatic mark up to overcome bottleneck. • Semantic representation for modelling and inference. • Semantic representation as an ‘interlanguage’ for translation. • Develop ontologies. • Provide gold-standard corpora. • To understand and work with human language capabilities.

Subtasks of NLP • Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,....). • Identify semantic roles (agent, patient,....). • Entity recognition (organisations, people, places,....). • Resolve pronominal anaphora and co-reference. • Address ambiguity. • Focus on entity recognition (parsing happens, anaphora can be shown, others are working on semantic roles, etc).

Computational Linguistic Cascade • Sentence segmentation – divide text into sentences. • Tokenisation - words identified by spaces between them. • Part of speech tagging – noun, verb, adjective.... Determined by lookup and relationships among words. • Morphological analysis - singular/plural, tense, nominalisation, ... • Shallow syntactic parsing/chunking - noun phrase, verb phrase, subordinate clause, .... • Named entity recognition – the entities in the text. • Dependency analysis - subordinate clauses, pronominal anaphora,... • Each step guided by pattern matching and rule application.
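The first two steps of the cascade can be illustrated with a short sketch. It is not how GATE implements them; the regular expressions are deliberately naive assumptions (for instance, an abbreviation such as "Inc." would wrongly end a sentence), and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive segmentation: a sentence runs up to the next '.', '!' or '?'.

    Returns (start, end) character offsets rather than copies of the text.
    """
    return [(m.start(), m.end())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]", text)]


def tokenise(text, start=0, end=None):
    """Split a span into word and punctuation tokens, keeping document offsets."""
    end = len(text) if end is None else end
    tokens = []
    for m in re.finditer(r"(?P<word>\w+)|(?P<punct>[^\w\s])", text[start:end]):
        tokens.append({
            "string": m.group(),
            "start": start + m.start(),
            "end": start + m.end(),
            "kind": m.lastgroup,  # "word" or "punct"
        })
    return tokens


text = "Edgar Wilson disclosed the formula to Mary Hays. The decision favours plaintiff."
for s_start, s_end in split_sentences(text):
    print([t["string"] for t in tokenise(text, s_start, s_end)])
```

Each later step (tagging, chunking, entity recognition) would consume the offsets and token records produced here, which is the sense in which the steps form a cascade.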

Development Cycle • Text -> Linguistic Analysis -> Knowledge Extraction • Cycle back to text and linguistic analysis to improve knowledge extraction.

GATE • General Architecture for Text Engineering (GATE): an open source framework which supports plug-in NLP components to process a corpus of text. Is “open” open? • A GUI to work with the tools. • A Java library to develop further applications. • Where to get it? Lots of tutorial information. • http://gate.ac.uk/ • Components and sequences of processes, each process feeding the next in a “pipeline”. • Annotated text output. • Instructions on how to run. Simple examples. ANNIE-BNA.xgapp

Loading and Running GATE with ANNIE • Start GUI • LC on File > Load ANNIE System > Choose with Defaults. • Adds Processing Resources and an Application. • RC on Language Resources > New > Select GATE document > Browse to document > OK. • When added, RC on the document (BNA sample) > New Corpus with this document. • RC on ANNIE under applications to see the pipeline. • At Corpus, select the Corpus created. • Run.

GATE Example

Inspecting the Result • RC on document (not Corpus) > Show, which shows the text. • LC on Annotation Sets, LC on Annotations List. • On right see coloured list with check boxes for annotations; below see a box with headings. • Selecting an annotation highlights the relevant text in colour. In the List box below, we get detailed information about location, Id, and features

Inspecting the Result • For Location, we have “United Kingdom”, with locType = country, matching Ids, and the rules that have been applied. • Similarly for JobTitle, Lookup (from Gazetteers), Sentence, SpaceToken, Split (for sentences), and Token (every “word”). • Note different information provided by Lookup and Token, which is useful for writing rules. • Will remark on Type, Start/End, Id, and features.

GATE Example

XML -- Inline XML is a flexible, extensible framework for mark up languages. The mark ups have beginnings/endings. Inline XML is strictly structured in a tree (note contains body, body could contain date, no overlap) and is “inline” with the text. Compare to standoff, which allows overlap and sets text off from annotations. Allows reprocessing since text is constant.
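The contrast between inline and standoff mark up can be made concrete with a small sketch. It assumes standoff annotations are simple `(type, start, end)` tuples over an unchanged text, which is a simplification and not GATE's own serialisation; the function name `to_inline` is hypothetical, and the sketch handles only flat, non-overlapping annotations.

```python
from xml.sax.saxutils import escape

text = "John Smith ran."
# Standoff: the text stays constant; annotations only point at character offsets.
annotations = [("Person", 0, 10), ("Verb", 11, 14)]


def to_inline(text, annotations):
    """Render flat standoff annotations as inline XML.

    Raises ValueError when two annotations overlap, since this simple
    renderer cannot place their tags in the text.
    """
    out, pos = [], 0
    for ann_type, start, end in sorted(annotations, key=lambda a: a[1]):
        if start < pos:
            raise ValueError("overlapping annotations cannot be written inline")
        out.append(escape(text[pos:start]))
        out.append(f"<{ann_type}>{escape(text[start:end])}</{ann_type}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


print(to_inline(text, annotations))
# prints: <Person>John Smith</Person> <Verb>ran</Verb>.
```

Because the standoff list never alters `text`, further annotations can be added or removed and the text reprocessed without re-parsing any mark up.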

XML -- Standoff

GATE Output Inline In the GATE Document Editor, the Annotations can be deleted (RC > Delete). We have left just Location and JobTitle. To output text with annotations that are XML compatible, RC on the document in Language Resources, then Save preserving document format. Further processing can be done using XSLT.

GATE Output Offset - part 1a In the GATE Document Editor, the Annotations can be deleted (RC > Delete). We have left just Location and JobTitle. To output text with annotations that are in XML, RC on the document in Language Resources, then Save as XML. This is the top part. The text is serialized, and annotations relate to positions in the text.

GATE Output - part 1b

GATE ANNIE Annotations

GATE ANNIE Annotations Organisations and Quotations. Case references.

GATE • Language Resources: corpora of documents. • Processing Resources: lexicons, ontologies, parsers, taggers. • Visual Resources: visualisation and editing. • The resources are plug-ins, so can be added, removed, or modified. See this later with ANNIC (Annotations in Context) and Onto Root Gazetteer (using ontologies as gazetteers).

GATE • A source document contains all its original mark up and format. John Smith ran. • A GATE document is: Document = text + (annotations + features) • <Person, gender = “male”>John Smith</Person> • <Verb, tense = “past”>ran</Verb> Not really the way it appears in GATE, but the idea using XML.

GATE Annotations • Have types (e.g. Token, Sentence, Person, or whatever is designed for the annotation). • Belong to annotation sets (see later). • Relate to start and end offset nodes (earlier). • Have features and values that store a range of information as in (not GATE, but XML-ish): • <Person, gender = “male”>John Smith</Person> • <Verb, tense = “past”>ran</Verb>
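The "Document = text + (annotations + features)" idea can be modelled in a few lines. This is a sketch of the data model only, not GATE's Java API; the class names, the `annotate` and `covered_text` helpers, and the per-set numbering of ids are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    ann_id: int
    ann_type: str          # e.g. Token, Sentence, Person
    start: int             # start offset into the document text
    end: int               # end offset into the document text
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    # annotation set name -> list of annotations in that set
    annotation_sets: dict = field(default_factory=dict)

    def annotate(self, set_name, ann_type, start, end, **features):
        """Add an annotation to the named set and return it."""
        anns = self.annotation_sets.setdefault(set_name, [])
        ann = Annotation(len(anns), ann_type, start, end, dict(features))
        anns.append(ann)
        return ann

    def covered_text(self, ann):
        """Return the stretch of text an annotation points at."""
        return self.text[ann.start:ann.end]


doc = Document("John Smith ran.")
person = doc.annotate("Default", "Person", 0, 10, gender="male")
verb = doc.annotate("Default", "Verb", 11, 14, tense="past")
print(doc.covered_text(person), person.features)
print(doc.covered_text(verb), verb.features)
```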

GATE Construction: From smaller units, compose larger, derivative units. Gazetteers: Lists of words (or abbreviations) that fit an annotation: first names, street locations, organizations.... JAPE (Java Annotation Patterns Engine): Build other annotations out of previously given/defined annotations. Use this where the mark up is not given by a gazetteer. Rules have a syntax.

GATE – A Linguistic Example • Lists: • List of Verb: like, run, jump, .... • List of Common Noun: dog, cat, hamburger, .... • List of Proper Name: Cyndi, Bill, Lisa, .... • List of Determiner: the, a, two, .... • Rules: • (Determiner + Common Noun) | Proper Name => Noun Phrase • Verb + Noun Phrase => Verb Phrase • Noun Phrase + Verb Phrase => Sentence • Input: • Cyndi likes the dog. • Output: • [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].
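The lists and rules of this example are small enough to run directly. The sketch below hard-codes the four lists and three rules from the slide; the naive handling of the verb ending "-s" (so that "likes" matches the listed "like") is an added assumption, and matching is case-sensitive.

```python
VERBS = {"like", "run", "jump"}
COMMON_NOUNS = {"dog", "cat", "hamburger"}
PROPER_NAMES = {"Cyndi", "Bill", "Lisa"}
DETERMINERS = {"the", "a", "two"}


def is_verb(word):
    # Assumed morphology step: accept a listed verb with a final -s.
    return word in VERBS or (word.endswith("s") and word[:-1] in VERBS)


def parse_np(words, i):
    """(Determiner + Common Noun) | Proper Name => Noun Phrase."""
    if i < len(words) and words[i] in PROPER_NAMES:
        return f"[np {words[i]}]", i + 1
    if (i + 1 < len(words) and words[i] in DETERMINERS
            and words[i + 1] in COMMON_NOUNS):
        return f"[np [det {words[i]}] [cn {words[i + 1]}]]", i + 2
    return None, i


def parse_vp(words, i):
    """Verb + Noun Phrase => Verb Phrase."""
    if i < len(words) and is_verb(words[i]):
        np, j = parse_np(words, i + 1)
        if np:
            return f"[vp [v {words[i]}] {np}]", j
    return None, i


def parse_sentence(text):
    """Noun Phrase + Verb Phrase => Sentence; None if the rules do not apply."""
    words = text.rstrip(".").split()
    np, i = parse_np(words, 0)
    if np:
        vp, j = parse_vp(words, i)
        if vp and j == len(words):
            return f"[s {np} {vp}]"
    return None


print(parse_sentence("Cyndi likes the dog."))
# prints: [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]]
```

Anything outside the lists or rules simply fails to parse, which shows both the transparency and the labour cost of the list-and-rule approach.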

Lists, Lists of Lists, Rules • Coalesce diverse yet related information in a list, e.g. organisation.lst. What is included here depends on.... What is Looked Up from the list is associated with the “master category”. • Make a master list of the lists in lists.def, which contains organisation.lst, date.lst, legalrole.lst..... • The master list indicates the majorType of things looked up in the list, e.g. organisation, and minorType, e.g. private, public (and potentially more features). Two lists may have the same majorType, but different minor types. Useful so rules can apply similarly or differently according to major or minor types.
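A rough sketch of how a master list and its member lists could drive Lookup is given below. It assumes each line of the master list has the form `file:majorType:minorType`, keeps the list contents in memory instead of reading files, and matches plain substrings without checking word boundaries. The file `government.lst`, the list entries, and the function names are hypothetical.

```python
# Assumed master-list layout: file name, majorType, optional minorType.
LISTS_DEF = """\
organisation.lst:organisation:private
government.lst:organisation:public
legalrole.lst:legalrole
"""

# Hypothetical list contents, held in memory for the sketch.
LIST_FILES = {
    "organisation.lst": ["Jones Inc.", "Dewey, Chetum, and Howe"],
    "government.lst": ["House of Lords"],
    "legalrole.lst": ["plaintiff", "defendant", "attorney"],
}


def load_gazetteer(lists_def, list_files):
    """Associate every list entry with the major and minor type of its list."""
    gazetteer = {}
    for line in lists_def.splitlines():
        if not line.strip():
            continue
        parts = line.split(":")
        filename, major = parts[0], parts[1]
        minor = parts[2] if len(parts) > 2 else None
        for entry in list_files[filename]:
            gazetteer[entry] = {"majorType": major, "minorType": minor}
    return gazetteer


def lookup(text, gazetteer):
    """Return (start, end, features) for each exact occurrence of an entry."""
    hits = []
    for entry, features in gazetteer.items():
        start = text.find(entry)
        while start != -1:
            hits.append((start, start + len(entry), features))
            start = text.find(entry, start + 1)
    return sorted(hits, key=lambda h: h[0])


gazetteer = load_gazetteer(LISTS_DEF, LIST_FILES)
text = "Jane Smith, for plaintiff, represented Jones Inc."
for start, end, features in lookup(text, gazetteer):
    print(text[start:end], features)
```

Two lists share the majorType `organisation` here but differ in minorType, so a later rule could treat them alike or apart as needed.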

GATE organisation.lst

GATE Gazetteer – a list of lists

What Goes into a List? • A 'big' question. Depends on what one is looking for, how one wants to find it, and what rules one wants to apply. • Every difference in character is a difference in the string even if the 'meaning' is the same. • B.C. b.c. bc b.c bC. • May01,1950 May 01 1950 01 May 1950 • More examples later. • By list or by rule....
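The "by list or by rule" choice can be illustrated with the variants above. Instead of listing every spelling, a rule can describe the pattern; the sketch below uses regular expressions as a stand-in for such rules. The patterns and the `normalise_date` helper are assumptions made for illustration and cover only the forms shown on the slide.

```python
import re

# One rule in place of a list of five spellings: B.C. b.c. bc b.c bC.
BC_RULE = re.compile(r"[bB]\.?[cC]\.?")

MONTHS = {name.lower(): number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

DATE_RULES = [
    # Month first: "May01,1950", "May 01 1950"
    re.compile(r"(?P<month>[A-Za-z]+)\s?(?P<day>\d{1,2}),?\s?(?P<year>\d{4})"),
    # Day first: "01 May 1950"
    re.compile(r"(?P<day>\d{1,2})\s(?P<month>[A-Za-z]+)\s(?P<year>\d{4})"),
]


def normalise_date(string):
    """Return (year, month, day) for a recognised date string, else None."""
    for rule in DATE_RULES:
        m = rule.fullmatch(string.strip())
        if m and m.group("month").lower() in MONTHS:
            return (int(m.group("year")),
                    MONTHS[m.group("month").lower()],
                    int(m.group("day")))
    return None


for variant in ["B.C.", "b.c.", "bc", "b.c", "bC."]:
    print(variant, bool(BC_RULE.fullmatch(variant)))
for variant in ["May01,1950", "May 01 1950", "01 May 1950"]:
    print(variant, normalise_date(variant))
```

A rule generalises to unseen spellings but can also over-match, whereas a list is exact but must be maintained by hand.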

Token, Lookup, Feature, Annotation • Token - a string of characters delimited by spaces. In The big brown dog chased the lazy cat there are eight word tokens. Token information includes syntactic part of speech (noun, verb,....) and string details (orthography, kind, position, length,....). • Lookup - look up a string in a list and assign it major or minor types. The “bottom semantic” layer of the cascade. • Annotation - subsequent mark ups which depend on Token, Lookup, or prior annotations. • Feature - additional Token, Lookup, or Annotation information.

Rolling your Own • Create lists and a gazetteer. • Add processing resources. • Add documents and make a corpus. • Construct the pipeline - an ordered sequence of processes. • Run the pipeline over the corpus. • Inspect the results.

GATE JAPE JAPE rule idea (not the real thing). <FirstName>aaaa</FirstName><LastName>bbbb</LastName> => <WholeName><FirstName>aaaa</FirstName> <LastName>bbbb</LastName></WholeName> FirstName and LastName we get from the Gazetteer. WholeName we construct using the rule. For complex constructions, must have a range of alternative elements in the rule.
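The rule idea can be mimicked outside JAPE to show what it does. The sketch below is Python, not JAPE syntax; it assumes annotations are `(type, start, end)` tuples already produced by a gazetteer lookup, and the function name is hypothetical.

```python
def apply_whole_name_rule(text, annotations):
    """Build a WholeName annotation from a FirstName followed by a LastName.

    The two annotations must be separated by nothing but whitespace.
    Existing annotations are kept; new ones are appended.
    """
    firsts = [a for a in annotations if a[0] == "FirstName"]
    lasts = [a for a in annotations if a[0] == "LastName"]
    new = []
    for _, f_start, f_end in firsts:
        for _, l_start, l_end in lasts:
            if l_start >= f_end and text[f_end:l_start].strip() == "":
                new.append(("WholeName", f_start, l_end))
    return annotations + new


text = "Jane Smith represented Jones Inc."
# Assumed gazetteer output for this text.
annotations = [("FirstName", 0, 4), ("LastName", 5, 10)]
for ann_type, start, end in apply_whole_name_rule(text, annotations):
    print(ann_type, repr(text[start:end]))
```

The derived annotation spans both inputs, so later rules can refer to WholeName without knowing how it was composed.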

GATE JAPE • Header - rule name, annotations to use, processing features, processing priority.... • Left hand side of rule (LHS) - refers to various mark ups that are found in the text, relies on order, uses expressions of combination or iteration, and identifies what portion is to be annotated as per the rule. • Right hand side of rule (RHS) - annotates as per the rule (plus some information) • Can have Java on RHS, but will not cover this.

GATE JAPE ? means optional
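What an optional element does in a rule's left hand side can be sketched as follows. This is Python standing in for the pattern "FirstName (Initial)? LastName", not real JAPE; the `Initial` type and the function name are assumptions, and the input is reduced to a sequence of annotation types in text order.

```python
def match_names(type_sequence):
    """Match FirstName (Initial)? LastName over a sequence of annotation types.

    Returns ("WholeName", first_index, end_index) for each match, where
    end_index is one past the last matched element.
    """
    results = []
    i = 0
    while i < len(type_sequence):
        j = i
        if type_sequence[j] == "FirstName":
            j += 1
            # Optional element: consume an Initial if one is present.
            if j < len(type_sequence) and type_sequence[j] == "Initial":
                j += 1
            if j < len(type_sequence) and type_sequence[j] == "LastName":
                results.append(("WholeName", i, j + 1))
                i = j + 1
                continue
        i += 1
    return results


# "Jane Smith" and "Jane R. Smith" are both covered by the one rule.
print(match_names(["FirstName", "LastName"]))
print(match_names(["FirstName", "Initial", "LastName", "Other"]))
```

One rule with an optional element replaces two separate rules for the forms with and without a middle initial.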
