
Tapta4IPC: helping translation of IPC definitions

Translation assistant for patent titles and abstracts in PATENTSCOPE - potential use in translating IPC definitions collaboration. Bruno Pouliquen (Bruno.Pouliquen@wipo.int). 25 Feb 2013, IPC workshop.


Presentation Transcript


  1. Tapta4IPC: helping translation of IPC definitions
     Translation assistant for patent titles and abstracts in PATENTSCOPE - potential use in translating IPC definitions collaboration
     Bruno Pouliquen (Bruno.Pouliquen@wipo.int)
     25 Feb 2013, IPC workshop

  2. Introduction
     • Statistical Machine Translation: bottom-up approach
       • No rules, no grammar, no dictionary, no terminology: only the parallel texts (bitexts)
     • We use an open-source system: Moses
     • Tapta: Translation of Patent Titles and Abstracts
       • Originally built to translate patent applications
       • Adapted to various applications

  3. Tapta framework
     [Diagram labels: source language, target language, gather/convert data, bitexts, clean, post-filter, re-clean, prune, binarize, optimize, publish, train-model]
     Our system prepares the data for Moses, applies some post-processing (filtering, pruning, binarization, optimization…) and offers a Web interface for translation.

  4. Introduction: Tapta
     • At WIPO, as part of PATENTSCOPE (English, French, German, Chinese, Japanese)
       • e.g. http://patentscope.wipo.int/translate/simpleTranslate.jsf?id=JP75694586&langpair=jaen
       • Automatic translation of a patent application only available in Japanese…
     • At the United Nations (English from/into Arabic, French, Spanish, Russian and Chinese)

  5. Technical workflow
     [Diagram labels, training side: source language and target language bitexts, "Filter wrong language" (one per language), sentence-split, tokenization, sentence-align, score alignment, "Filter align.", bitexts aligned at sentence level, Moses' training, producing a phrase table, a reordering model and a language model]
     [Diagram labels, translation side: translation client, translation server, Moses decoder (three instances)]
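
The slide does not say which criteria the "Filter align." step applies. As an illustration only, the sketch below uses a common heuristic from statistical MT corpus cleaning: drop sentence pairs that are empty, overly long, or whose token-length ratio is implausible. The function name `filter_pairs` and the thresholds `max_ratio` and `max_tokens` are assumptions, not taken from Tapta.

```python
def filter_pairs(pairs, max_ratio=3.0, max_tokens=100):
    """Keep only sentence pairs that look like plausible translations.

    Hypothetical helper: the thresholds are illustrative defaults,
    not the values used in the Tapta pipeline.
    """
    kept = []
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        # Drop pairs where one side is empty.
        if not src_tokens or not tgt_tokens:
            continue
        # Drop pairs where either side is too long to train on.
        if len(src_tokens) > max_tokens or len(tgt_tokens) > max_tokens:
            continue
        # Drop pairs whose lengths differ too much to be a translation.
        longer = max(len(src_tokens), len(tgt_tokens))
        shorter = min(len(src_tokens), len(tgt_tokens))
        if longer / shorter > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


pairs = [
    ("Wheel guards", "Couvre-roues"),
    ("Wheel guards", ""),
    ("Tyre", "PNEUMATIQUE POUR ROUES DE VÉHICULE"),
]
clean = filter_pairs(pairs)
```

A whitespace token count is a crude measure, especially for Chinese or Japanese, so a real pipeline would apply such a filter after language-specific tokenization.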

  6. IPC context
     • Gather data:
       • Get existing definitions
       • Add IPC schema (XML on WIPO website)
       • Add a “few” texts from patents
     • “Learn” translation model
     • Translate new texts

  7. Get existing data, build parallel texts
     Existing definitions…
     Bitext: training material…
     IPC schema…
       <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
         <textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>
       </ipcEntry>
       <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
         <textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>
       </ipcEntry>
     Patent texts…
       WO/2013/014517 (EN) TYRE FOR VEHICLE WHEELS (FR) PNEUMATIQUE POUR ROUES DE VÉHICULE
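
One way to turn the IPC schema entries shown on this slide into a bitext is to group `ipcEntry` elements by their `symbol` attribute and pair the titles of two languages. The sketch below is a minimal illustration, not the Tapta implementation: the wrapper element `ipcEntries`, the function name `build_bitext`, and the choice to join multiple title parts with "; " are assumptions, and the real schema files may use XML namespaces or deeper nesting that this code does not handle.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample built from the two entries on the slide; the <ipcEntries>
# root element is a hypothetical wrapper added so the text parses.
SAMPLE = """<ipcEntries>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
<textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>
</ipcEntry>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
<textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>
</ipcEntry>
</ipcEntries>"""


def build_bitext(xml_text, src="EN", tgt="FR"):
    """Return (symbol, source title, target title) for symbols present in both languages."""
    root = ET.fromstring(xml_text)
    titles = defaultdict(dict)  # symbol -> {lang: title}
    for entry in root.iter("ipcEntry"):
        # Collect the non-empty <text> nodes of this entry.
        parts = [t.text.strip() for t in entry.iter("text") if t.text and t.text.strip()]
        if parts:
            titles[entry.get("symbol")][entry.get("lang")] = "; ".join(parts)
    # Keep only symbols that have a title in both languages.
    return [
        (symbol, by_lang[src], by_lang[tgt])
        for symbol, by_lang in sorted(titles.items())
        if src in by_lang and tgt in by_lang
    ]


bitext = build_bitext(SAMPLE)
```

Each resulting pair, here "Wheel guards" and "Couvre-roues", can then be written to line-aligned source and target files of the kind used as training material.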

  8. How well does it work?
     Automatic evaluation: BLEU score
     • Principle: similarity of n-grams between evaluated and reference sentences
     On IPC definitions, English-French: BLEU = 48% (without patent data: 44%)
     Good quality needs human post-editing
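
The n-gram similarity principle behind BLEU can be sketched in a few lines. The code below is a simplified sentence-level variant with a single reference, whitespace tokenization and no smoothing. It is not the tool used to obtain the figures on this slide: published BLEU scores are normally computed over a whole test corpus. The function names `ngram_counts` and `sentence_bleu` are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions += math.log(overlap / total)
    # Penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precisions / max_n)


exact = sentence_bleu("tyre for vehicle wheels", "tyre for vehicle wheels")
partial = sentence_bleu("the tyre for vehicle wheels", "tyre for vehicle wheels")
```

An exact match scores 1.0, and the extra word in the second candidate lowers every n-gram precision and therefore the score. Because the unsmoothed score drops to zero as soon as one n-gram order has no match, short IPC titles are better evaluated at corpus level.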

  9. Tapta4IPC prototype (1) Live demo using: http://patentscope.wipo.int/translateUN/translateIPC.jsf

  10. Tapta4IPC prototype (2) http://fulty3.wipo.int:8080/Wtapta/translateIPC.jsf

  11. Conclusion / future work
      • This is a prototype, but the quality already looks acceptable
      • Human evaluation?
      • Better integrate the tool
        • In PCA6TRANSDEF?
      • Other languages?

  12. Tapta4IPC in various languages
      • Tapta4IPC should work reasonably well for the following languages (we have built some language-specific tools and we have patent corpora):
        • German • Japanese • Korean • Spanish • Dutch • Portuguese • Chinese • Russian
      • More challenging:
        • Czech, Slovak, Polish (many word forms; training corpus?)
        • Estonian (even more word forms; would in theory require a larger training corpus)
      • Other languages: Arabic, Italian, Danish, Swedish, etc.

  13. Thank you for your attention • شكرا لكم على اهتمامكم • Merci pour votre attention! • 感谢您的关注 • Grazie per la vostra attenzione! • ¡ Gracias por su atención ! • Vielen Dank für Ihre Aufmerksamkeit! • Obrigado pela vossa atenção! • Dziękuję bardzo za Państwa uwagę! • Děkujeme za Vaši pozornost! • Ďakujem ti veľmi pekne za tvoju pozornosť • Tänan tähelepanu eest! • Благодарим за Вашето внимание! • Tak for Jeres opmærksomhed! • Thank you for your attention!
