Talk overview


Presentation Transcript

  1. Computer Aided Document Indexing System for Accessing Legislation
     A Joint Venture of Flanders and Croatia
     Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo@fer.hr
     Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic@ffzg.hr
     Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven, marie-francine.moens@law.kuleuven.ac.be
     Leuven, 2007-05-22

  2. Talk overview • document indexing and computer aided document indexing • project AIDE • CADIS workstation: features • project CADIAL • eCADIS workstation: additional features • machine learning techniques • future developments • conclusions

  3. Computer Aided Document Indexing • document indexing • attachment of descriptors from a controlled thesaurus to a document • descriptors = labels representing the content of a document • necessary for document retrieval in many document collections • parliamentary documentation • legislation • technical documentation • … • usually done manually • tedious, error-prone, slow (max. 30-40 documents/day) • could computers be of any help in this process? • yes, if we build a Computer Aided Document Indexing System (CADIS)

  4. Project AIDE in Croatia • idea for a project • September 2004 • interdisciplinary collaboration of 3 institutions • Croatian Information Documentation Referral Agency (HIDRA) • Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb • Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

  5. AIDE – collaborating institutions • HIDRA • collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia • coordinator Maja Cvitaš, M.A. • ZEMRIS • research in the field of artificial intelligence, neural networks, machine learning, data and text mining • coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc. • ZZL • computational linguistic research and building language technologies for Croatian • coordinator prof. Marko Tadić

  6. AIDE – project objective • Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

  7. AIDE – how? • AIDE = Automatic Indexing of Documents with Eurovoc • automatic indexing, how? • program which “learns to index” documents • conference at the Joint Research Centre of the EC (JRC), Ispra, Italy, 2004-09 • at least 10,000 manually indexed documents • 3-5 descriptors per document • 10-15 documents per descriptor • indexed documents stored in XML format • Steinberger (2003) • compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors • situation with Croatian documentation in 2004-09 • only a few hundred documents had been indexed • manual indexing: painfully slow • how could we speed up the manual indexing?

  8. AIDE – activities • investigate and develop algorithms in the field of computational linguistics/language technologies • include that knowledge into the Computer Aided Document Indexing System (CADIS) • demonstration of CADIS in the European Parliament (2006-03-10)

  9. CADIS: two parallel windows • Eurovoc browser window • Document window

  10. Document Window

  12. CADIS features • Enhanced user interface • list of descriptors literally appearing in the document

  13. CADIS features • Descriptors and non-descriptors marked in the document
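
The literal matching described on slides 12 and 13 can be illustrated with a minimal sketch. The mini-thesaurus below and the function name `find_terms` are hypothetical, made up for illustration; they are not Eurovoc entries or CADIS code. The sketch also assumes plain case-insensitive string matching, which would miss inflected forms in a language such as Croatian.

```python
import re

# Hypothetical mini-thesaurus: descriptor -> its non-descriptors (synonyms
# that point to the descriptor). Illustrative entries, not real Eurovoc data.
THESAURUS = {
    "fishing industry": ["fisheries sector"],
    "customs duties": ["customs tariff duty"],
    "environmental protection": [],
}


def find_terms(text, thesaurus):
    """Return (descriptors, non_descriptors) that appear literally in text.

    descriptors is a list of matched descriptors; non_descriptors maps each
    matched synonym to the descriptor it stands for.
    """
    lowered = text.lower()
    found_desc = []
    found_nondesc = {}
    for descriptor, synonyms in thesaurus.items():
        # Whole-word, case-insensitive match of the descriptor itself.
        if re.search(r"\b" + re.escape(descriptor) + r"\b", lowered):
            found_desc.append(descriptor)
        for synonym in synonyms:
            if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
                found_nondesc[synonym] = descriptor
    return found_desc, found_nondesc


text = "The act regulates customs duties for the fisheries sector."
descriptors, non_descriptors = find_terms(text, THESAURUS)
```

Here `descriptors` holds the terms an indexer could accept directly, while `non_descriptors` points the indexer from a synonym found in the text to the descriptor that should be attached instead.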

  14. CADIS features • Lists of n-grams

  15. CADIS features • Integration of corpus analysis • greyed n-grams are statistically relevant in the corpus, i.e. collocations
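
How n-gram lists and corpus-based collocation scores might be computed can be sketched as follows. The slides do not say which association measure CADIS uses; pointwise mutual information (PMI) with a minimum-frequency cutoff is an assumption made here, and the toy corpus and function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_pmi(corpus, min_count=2):
    """PMI for each bigram seen at least min_count times in the corpus.

    corpus is a list of token lists. PMI is an assumed stand-in for
    whatever statistic the real system applies.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(ngrams(tokens, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration.
corpus = [
    "the european union adopted the regulation".split(),
    "the european union amended the directive".split(),
    "the council adopted the directive".split(),
]
scores = bigram_pmi(corpus)
best = max(scores, key=scores.get)
```

On this toy corpus the bigram with the highest score is `("european", "union")`, which outranks frequent but uninformative pairs such as `("the", "european")`.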

  16. CADIS features • Manual marking of significant n-grams • important step towards further refinement of automatic indexing

  17. Eurovoc browser window

  18. AIDE – activities • investigate and develop algorithms in the field of computational linguistics/language technologies • include that knowledge into the Computer Aided Document Indexing System (CADIS) • demonstration of CADIS in the European Parliament (2006-03-10) • ca. 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006 • joint project proposal with Katholieke Universiteit Leuven for the CADIAL project

  19. CADIAL project • Computer Aided Document Indexing for Accessing Legislation • a joint Flemish-Croatian project • Department International Flanders, grant no. KRO/009/06 • partners: • Katholieke Universiteit Leuven (prof. Marie-Francine Moens) • University of Zagreb, Hidra (prof. Bojana Dalbelo Bašić) • started: 2007-03 • duration: 2 years • web: www.cadial.org • the goal: a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia • the new version of CADIS (eCADIS) is one of the modules in this project • planned as a web-based service

  20. CADIAL project 2 • used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian • used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English • included that training data into the next version: eCADIS (-version)

  21. eCADIS () features • Automatic suggestion of relevant descriptorsi.e. automatic indexing • application of machine learning techniques Leuven, 2007-05-22

  22. eCADIS () features • Compare it to manually attached indexes… Leuven, 2007-05-22

  23. eCADIS () features • Manual marking of inappropriate suggestions • another step in further refinment of automatic indexing Leuven, 2007-05-22

  24. eCADIS () on document in English Leuven, 2007-05-22

  25. eCADIS () on document in English • Automatic suggestion of relevant descriptorsi.e. automatic indexing Leuven, 2007-05-22

  26. eCADIS () on document in English • Compare it to manually attached indexes… Leuven, 2007-05-22

  27. Training the classifiers • already existing classifiers • profile classifier (Steinberger 2003) • K-nearest neighbours • binary classifiers • SVM, Logistic Regression, Rocchio, Bayes, … • classifiers used for the preliminary training • ca. 3500 independent binary classifiers • need to be further evaluated • Logistic Regression used for 10,000 documents in Croatian • SVM used for 20,000 documents in English • features • tokens, lemmas, stems, character n-grams • various feature selection methods and their combinations: χ² (chi-square), information gain (ig), mutual information (mi)…
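
Because each descriptor gets its own independent binary classifier, feature selection can be run per descriptor: documents carrying the descriptor are positives, all others are negatives. The sketch below ranks token features with the chi-square statistic, used here as an assumed example of a feature selection method. The toy documents, labels and function names are hypothetical and do not come from the CADIAL training data.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 feature/class contingency table.

    n11: positive docs containing the feature
    n10: negative docs containing the feature
    n01: positive docs without the feature
    n00: negative docs without the feature
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k token features most associated with one descriptor.

    docs is a list of token lists; labels[i] is 1 if document i carries
    the descriptor, else 0. Ties are broken alphabetically.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Document frequency: count each token once per document.
        (pos_df if label else neg_df).update(set(tokens))
    scores = {}
    for feat in set(pos_df) | set(neg_df):
        n11, n10 = pos_df[feat], neg_df[feat]
        scores[feat] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Toy training set for a single hypothetical descriptor.
docs = [
    "fishing quota vessel".split(),
    "fishing vessel licence".split(),
    "customs tariff quota".split(),
    "customs import licence".split(),
]
labels = [1, 1, 0, 0]
top = select_features(docs, labels, 3)
```

Note that chi-square rewards features that separate the classes in either direction, so a token that appears only in negative documents can rank as high as one that appears only in positive ones. Tokens spread evenly across both classes score zero and are dropped first.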

  28. Further development of eCADIS • training with new features and feature selection methods • collocations, word n-grams, chunks • new measures for evaluation of results • sensitive to thesaurus hierarchy • web interface for eCADIS for inclusion into the CADIAL system • eCADIS for other languages • now only Croatian and English (-version) covered • usable for other languages as it is, but less efficient without the linguistic module • no list of lemmas, but types • poor statistics for n-grams • cooperation with language technology experts in different languages for development of linguistic modules

  29. Further development of eCADIS • … eCADIS for other languages • training the automatic indexing system for other languages • enables automatic suggestions of relevant descriptors in new, unseen documents • analysis of manual markings • descriptors, word n-grams, suggestions • promote the use of eCADIS in other countries beyond the scope of the CADIAL project • e.g. Belgium (Flanders) • linguistic module for Dutch and French needed • computational linguistics expertise • training data from Acquis can be used to make an automatic indexing system for Dutch and French • machine learning expertise

  30. Conclusion • CADIAL • a joint Flemish-Croatian project sponsored by the Flemish government • better public access to Croatian official documentation • faster and improved document indexing • automatic content metadata generation (Semantic Web) • easier document retrieval and exploration of legislation • multilingual access via the standardized EU thesaurus Eurovoc • a test case for the usage of such a system in Flanders • Web information on the CADIAL project and eCADIS • www.cadial.org • contact: • bojana.dalbelo@fer.hr • marie-france.moens@law.kuleuven.ac.be
