
Graduate School of Information Science, Nagoya University, Japan

JURISIN 2008, Second International Workshop on Juris-informatics. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA, Graduate School of Information Science, Nagoya University, Japan.


Presentation Transcript


  1. JURISIN 2008, Second International Workshop on Juris-informatics. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. Graduate School of Information Science, Nagoya University, Japan. Masato HAGIWARA, Yasuhiro OGAWA, Katsuhiko TOYAMA

  2. Background • Growing demand for translation of Japanese statutes • Social and economic globalization • Promotion of international investment in Japan • Technical assistance to developing and/or former socialist countries • Japanese government effort • “Study Council for Promoting Translation of Japanese Laws and Regulations into Foreign Language”

  3. Bilingual Dictionary • Standard Japanese-English bilingual dictionary of legal terms (SBD) • Recommended to translators and lawyers • More than 250 major statutes to be translated, 120 already released based on SBD • High compilation and maintenance cost • Should be technically supported

  4. Dictionary Compilation Support • Natural language processing technique • Automatic extraction of bilingual lexicons by word alignment technique [Toyama et al. 2006] • Japanese entries must be fixed before application • Appropriate terms are still selected by hand → Supported by automatic dictionary term selection from unsegmented legal text

  5. Defined Terms • What kind of terms should be selected? Definition sentences この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。 (The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.) (Act No. 239, 1950) この法律において、次の各号に掲げる用語の意義は、当該各号に定めるところによる。  一 著作物 思想又は感情を創作的に表現したものであつて、文芸、学術、美術又は音楽の範囲に属するものをいう。  二 著作者 著作物を創作する者をいう。 (In this Act, the meanings of the terms listed in the following items shall be as prescribed respectively in those items: (i) “work” means a production in which thoughts or sentiments are expressed in a creative way and which falls within the literary, scientific, artistic or musical domain; (ii) “author” means a person who creates the work;) (Act No. 48, 1970)
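The transcript later notes that defined terms were extracted by regular expression, but it does not give the expression. The following is a minimal sketch, assuming the simplest definition form shown on this slide (a term quoted in 「」 immediately followed by とは). The pattern `DEFINED_TERM` and the function name are hypothetical, and the itemized form (一 著作物 …) is not covered.

```python
import re

# Hypothetical pattern (not from the slides): a term quoted in 「」
# that is immediately followed by とは ("... means").
DEFINED_TERM = re.compile(r"「([^「」]+)」とは")

def extract_defined_terms(text: str) -> list[str]:
    """Return every quoted term that is followed by とは in the text."""
    return DEFINED_TERM.findall(text)

sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
terms = extract_defined_terms(sentence)
print(terms)  # ['商品取引所']
```

A real extractor would likely need further rules for the itemized definitions, where the term is not quoted.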

  6. Pattern-based Term Extraction • “Important terms appear in similar contexts” • Term: Commodity Exchange • Contexts: “… in accordance with the standards and methods specified by a Commodity Exchange …”; “… a market that a Commodity Exchange has opened for each single kind of …”; “… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …”; “… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …”

  7. Pattern-based Term Extraction • “Important terms appear in similar contexts” • Term: Commodity Exchange • Contexts: “… in accordance with the standards and methods specified by a Commodity Exchange …”; “… a market that a Commodity Exchange has opened for each single kind of …”; “… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …”; “… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …” • Patterns: “specified by #”; “# has opened”; “Member, etc. of a #”; “equivalent to a #” • Matching text: “… one-third or more has been specified by articles of incorporation, at least such …”; “… the locations where Old Markets have been opened and Listed Commodities …”; “… person is a member of a commodity futures association (hereinafter referred to…”; “… in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply…”

  8. Pattern-based Term Extraction • “Important terms appear in similar contexts” • Term: Commodity Exchange • Contexts: “… in accordance with the standards and methods specified by a Commodity Exchange …”; “… a market that a Commodity Exchange has opened for each single kind of …”; “… a Member, etc. of a Commodity Exchange or a Member, etc. of a facility equivalent …”; “… a Member, etc. of a facility equivalent to a Commodity Exchange in a foreign state …” • Patterns: “specified by #”; “# has opened”; “Member, etc. of a #”; “equivalent to a #” • Instances: articles of incorporation; Old Markets; commodity futures association; Commodity Market • Matching text: “… one-third or more has been specified by articles of incorporation, at least such …”; “… the locations where Old Markets have been opened and Listed Commodities …”; “… person is a member of a commodity futures association (hereinafter referred to…”; “… in a foreign state equivalent to a Commodity Market; hereinafter the same shall apply…”

  9. Bootstrapping-based Methods • Espresso [Pantel and Pennacchiotti 2006] • Extraction of lexical relations (binary) • English news articles (segmented) • Tchai [Komachi and Suzuki 2008] • Extraction of semantic categories (unary) • Japanese query logs (unsegmented but short) • Long, unsegmented Japanese legal text → Conventional analyzers/parsers are not applicable

  10. Objectives • A new algorithm, Monaka, is proposed • Based on the Tchai algorithm • Character n-gram based instance/pattern induction • Constraint to ensure proper segmentation • Evaluation to confirm its effectiveness for dictionary term extraction

  11. Espresso Algorithm [Pantel and Pennacchiotti 2006] • Instances: wheat :: crop; George Wendt :: star; nitrogen :: element; diborane :: substance; Picasso :: artist; tax :: charge; protein :: biopolymer; HCl :: strong acid • Patterns: “x is a y”; “y such as x”; “x and other y” • Bootstrapping over the corpus: Seed instances → Pattern Induction → Pattern Ranking → Instance Induction → Instance Ranking → Extracted instances
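The slide names the pattern ranking step without giving a formula. The sketch below assumes the reliability definition from the cited Espresso paper, in which a pattern's reliability is the average, over instances, of PMI normalized by the maximum PMI and weighted by each instance's reliability. The PMI values and the function name are made up for illustration.

```python
def pattern_reliability(pattern, instances, pmi, instance_reliability):
    """Espresso-style reliability of one pattern, averaged over the given instances."""
    max_pmi = max(pmi.values())  # global PMI maximum used for normalization
    total = sum(
        pmi.get((inst, pattern), 0.0) / max_pmi * instance_reliability[inst]
        for inst in instances
    )
    return total / len(instances)

# Toy PMI table (illustrative values only, not taken from the slides).
pmi = {
    ("wheat :: crop", "x is a y"): 2.0,
    ("Picasso :: artist", "x is a y"): 1.0,
    ("wheat :: crop", "y such as x"): 1.0,
}
seeds = ["wheat :: crop", "Picasso :: artist"]
seed_reliability = {inst: 1.0 for inst in seeds}  # seeds are fully trusted

scores = {
    p: pattern_reliability(p, seeds, pmi, seed_reliability)
    for p in ("x is a y", "y such as x")
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['x is a y', 'y such as x']
```

Instance reliability is defined symmetrically in that paper, with the roles of patterns and instances swapped, which is what lets the two ranking steps alternate.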

  12. Tchai Algorithm [Komachi and Suzuki 2008] • Applied Espresso to semantic category extraction from Japanese web query logs • Some improvements over Espresso • Query-based pattern induction (seed: JAL; query: JAL_flight; pattern: #_flight) • Local PMI Max • Ambiguous instance/pattern filtering • Ambiguous instance: 1.5x patterns of prev. instances • Ambiguous pattern: 2.0x instances of prev. patterns • Improves the precision of the extracted instances

  13. Monaka Algorithm – Pattern Induction • Character n-gram based induction • Espresso → Segmented English text • Tchai → Short Japanese queries • Example: この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。 (The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.) (Act No. 239, 1950) • Instance: 商品取引所 (Commodity Exchange) • Patterns: て「#; いて「#; おいて「#; …; #」と; #」とは; #」とは、; …
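Character n-gram pattern induction can be sketched as follows: for each occurrence of a known instance, take the n characters before it and the n characters after it as patterns, with `#` marking the instance slot. This is an illustrative sketch, not the authors' implementation; the function name and the `max_n` limit are assumptions.

```python
def induce_patterns(text: str, instance: str, max_n: int = 4) -> set[str]:
    """Collect left and right character n-gram contexts of every occurrence."""
    patterns = set()
    start = text.find(instance)
    while start != -1:
        end = start + len(instance)
        for n in range(1, max_n + 1):
            if start - n >= 0:
                patterns.add(text[start - n:start] + "#")  # left context
            if end + n <= len(text):
                patterns.add("#" + text[end:end + n])      # right context
        start = text.find(instance, start + 1)
    return patterns

sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
patterns = induce_patterns(sentence, "商品取引所")
print(sorted(patterns))
```

Because the instance also occurs inside 会員商品取引所 and 株式会社商品取引所, the result contains contexts from those occurrences too, in addition to the ones listed on the slide.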

  14. Monaka Algorithm – Instance Induction • Example: この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。 (The term “Commodity Exchange” as used in this Act shall mean a Member Commodity Exchange and an Incorporated Commodity Exchange.) (Act No. 239, 1950) • Pattern: 律において「# (# as used in this Act) • Instances: 商品; 商品取; 商品取引; 商品取引所; 商品取引所」; … • Incorrectly segmented instances are extracted as well
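Instance induction runs in the opposite direction: wherever a pattern's context matches, every character n-gram adjacent to it becomes a candidate instance. The sketch below is an assumption-laden illustration (the function name and the length limits `min_len` and `max_len` are hypothetical); it reproduces the candidate list on this slide, including the wrongly segmented ones.

```python
def induce_instances(text: str, pattern: str,
                     min_len: int = 2, max_len: int = 6) -> list[str]:
    """Return all candidate n-grams filling the '#' slot of a one-sided pattern."""
    context = pattern.replace("#", "")
    candidates = []
    pos = text.find(context)
    while pos != -1:
        for n in range(min_len, max_len + 1):
            if pattern.endswith("#"):   # left-context pattern: candidate follows
                start = pos + len(context)
                end = start + n
            else:                       # right-context pattern: candidate precedes
                start, end = pos - n, pos
            if start >= 0 and end <= len(text):
                candidates.append(text[start:end])
        pos = text.find(context, pos + 1)
    return candidates

sentence = "この法律において「商品取引所」とは、会員商品取引所及び株式会社商品取引所をいう。"
candidates = induce_instances(sentence, "律において「#")
print(candidates)
# ['商品', '商品取', '商品取引', '商品取引所', '商品取引所」']
```

Only one of the five candidates is a correctly segmented term, which is the problem the segmentation constraint on the following slides addresses.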

  15. Bidirectional Adjacency Constraint (BAC) • Constraint to ensure proper segmentation • Instance i in context: … この法律において「商品取引所」とは、会員商品取引所 … • Instance reliability

  16. Bidirectional Adjacency Constraint (BAC) • Constraint to ensure proper segmentation • Instance i in context: … この法律において「商品取引所」とは、会員商品取引所 … • Preceding instance reliability and succeeding instance reliability • Combine as the generalized average • Example context: … 律において「商品取引所」とは、会員 … • Both reliabilities high → combined reliability high • One high and one low → combined reliability low
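The transcript does not preserve the formula, so the following is only a sketch of one way to combine the two reliabilities as a generalized (power) mean. The exponent `p = -1` (harmonic mean) is an assumption chosen so that one low side pulls the combined score down, matching the high/low → low behavior on the slide; the function names are hypothetical.

```python
import math

def generalized_mean(a: float, b: float, p: float) -> float:
    """Power mean of two non-negative values with exponent p."""
    if p == 0:
        return math.sqrt(a * b)        # limit case: geometric mean
    if p < 0 and min(a, b) == 0:
        return 0.0                     # limit case: a zero side dominates
    return ((a ** p + b ** p) / 2) ** (1 / p)

def bac_reliability(preceding: float, succeeding: float, p: float = -1.0) -> float:
    """Combine preceding-side and succeeding-side reliability (assumed p)."""
    return generalized_mean(preceding, succeeding, p)

print(bac_reliability(0.9, 0.9))  # both sides high: stays high
print(bac_reliability(0.9, 0.1))  # one side low: well below the arithmetic mean of 0.5
```

A smaller exponent makes the constraint stricter, since the combined value moves closer to the minimum of the two sides.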

  17. Monaka Algorithm – Ambiguous Patterns and Instances • Character n-gram based pattern/instance induction • Negative effect of generic instances/patterns is more serious, e.g. “て「#”, “#」と” • The number of extracted instances is unpredictable • Ambiguous pattern filtering • Ambiguity = # of co-occurring instance types • Discard the 10 most ambiguous patterns after each induction • Ambiguous instance filtering • Ambiguity = # of statutes in which the instance appears (DF) • Discard ones which appear in more than 70% of the statutes
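The two filters above can be sketched as follows, using the thresholds stated on the slide (10 patterns, 70% of statutes). The data structures and function names are assumptions for illustration, and substring matching stands in for whatever occurrence counting the authors used.

```python
def filter_ambiguous_patterns(pattern_to_instances: dict[str, set[str]],
                              n_discard: int = 10) -> set[str]:
    """Drop the n_discard patterns that co-occur with the most instance types."""
    ranked = sorted(pattern_to_instances,
                    key=lambda p: len(pattern_to_instances[p]),
                    reverse=True)
    return set(ranked[n_discard:])

def filter_ambiguous_instances(instances: list[str], statutes: list[str],
                               max_df_ratio: float = 0.7) -> list[str]:
    """Keep instances whose document frequency is at most max_df_ratio."""
    kept = []
    for inst in instances:
        df = sum(1 for statute in statutes if inst in statute)
        if df / len(statutes) <= max_df_ratio:
            kept.append(inst)
    return kept

# Toy data (not from the slides).
patterns = {"て「#": {"商品", "著作物", "著作者"}, "律において「#": {"商品取引所"}}
statutes = ["この法律において商品取引所", "この法律において著作物", "この法律において著作者"]

kept_patterns = filter_ambiguous_patterns(patterns, n_discard=1)
kept_instances = filter_ambiguous_instances(["法律", "商品取引所"], statutes)
print(kept_patterns, kept_instances)
```

In the toy data, 法律 appears in every statute and is discarded, while 商品取引所 appears in one of three and is kept.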

  18. Experimental Settings • Corpus • 228 Japanese acts included in the translation project • Article, paragraph, and item numbers → head markers • Seed instances • Randomly chosen 100 defined terms out of 1,225 defined terms extracted by regular expression • Bootstrapping • # of patterns: initially 100, incremented by 10 • # of instances: start with 100 seeds, 100 new instances cumulatively learned in each iteration • A total of 10 iterations

  19. Evaluation 1. Defined term reproducibility test • How well the rest of the defined terms are reproduced, without depending on the definition sentences • Gold standard: 1,225 defined terms • Closed test 2. SBD coverage test • How many of the SBD entries are covered • Gold standard: all 3,510 SBD entries that appear at least once in the corpus • Open test
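Both tests compare the extracted instances against a gold-standard term list. A minimal sketch of that comparison is below; excluding the seed terms from both sides is an assumption based on the phrase "the rest of the defined terms", and the function name and toy data are hypothetical.

```python
def precision_recall(extracted, gold, seeds=()):
    """Precision and recall of extracted terms against a gold list, ignoring seeds."""
    seeds = set(seeds)
    extracted = set(extracted) - seeds
    gold = set(gold) - seeds
    hits = extracted & gold
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Toy data (not the experimental data from the slides).
extracted = ["商品取引所", "著作物", "において", "著作者"]
gold = ["商品取引所", "著作物", "著作者", "会員"]
precision, recall = precision_recall(extracted, gold, seeds=["著作者"])
print(precision, recall)
```

For the SBD coverage test the same function would apply with the SBD entries as `gold` and no seeds to exclude.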

  20. Results – Defined Term Reproducibility • Extracted a quarter of the defined terms with a precision of 29.2%

  21. Results – SBD Coverage 5% or more improvement → Supports the effectiveness of the constraint

  22. Results – Extracted Instances

  23. Result – Extracted Patterns • Mostly substrings of other patterns • Most of the patterns are quite generic • A single pattern may induce too many incorrect instances • Reliability measures are effective to rank patterns/instances • BAC is essential for extraction from unsegmented text

  24. Conclusion • Monaka algorithm was proposed • Bootstrapping-based lexical knowledge acquisition • Simple character n-gram based instance/pattern induction • Constraint (BAC) to ensure proper segmentation • Ambiguous pattern/instance filtering • Evaluation results • Improved precision/recall in both defined term reproducibility and SBD coverage • BAC helped to extract many correctly segmented instances • Future work: Application of Monaka to other domains • Highly “fixed” format of Japanese statutes • Investigation of the effect of “topic drift”: [Komachi et al. 2008] showed that bootstrapping tends to converge to generic instances
