
SANSKRIT ANALYZING SYSTEM



Presentation Transcript


  1. SANSKRIT ANALYZING SYSTEM
     Manji Bhadra, Surjit Kumar Singh, Sachin Kumar, Subash, Diwakar Mishra, Muktanand Agrawal, R. Chandrashekar, Sudhir K Mishra, Girish Nath Jha
     3rd ISCLS, Hyderabad

  2. Introduction
     • It is an attempt towards the analysis of laukika Sanskrit.
     • The major goal is to build a machine translation system from Sanskrit to other Indian languages.
     • The modules have been developed separately.
     • We need to integrate these modules.
     • We need to evaluate these modules.

  3. Introduction
     • The system accepts full-text input in Devanagari Unicode (UTF-8).
     • It supports two IMEs: Baraha and J-IME.
     • It has two major components:
       • the shallow parser
       • the kāraka analyzer

  4. Shallow parser
     The modules are as follows:
     • sandhi analyzer
     • samāsa analyzer*
     • subanta analyzer
     • gender analyzer
     • kṛdanta analyzer
     • taddhita analyzer*
     • tiṅanta analyzer
     • POS tagger
     * Modules under development

  5. How does it work?
     Show example

  6. Our platform
     • Java servlet based web application and services
     • Java and JSP for the front end
     • Unicode input/output with flat files and an RDBMS (MS SQL Server 2005)
     • MS JDBC driver for connectivity
     • Apache Tomcat as the web server
     • JavaScript IME producing Unicode output from ITRANS input

  7. Sandhi analyzer
     • Sandhi processing is critical for any further processing of Sanskrit. Without sandhi-viccheda it is not possible to get the word constituents for analysis.
     • At present, our sandhi analyzer handles only vowel sandhi splitting. Consonant sandhi splitting is under development.
     • Our goal is to be able to parse a very complex string containing potentially all kinds of sandhi.

  8. Sandhi analyzer
     input Sanskrit text
     ↓
     viccheda eligibility tests (pre-processing)
     ↓
     subanta processing
     ↓
     search of sandhi marker and sandhi patterns (sandhi rule base)
     ↓
     generate possible solutions (result generator)
     ↓
     search the lexicon
     ↓
     subanta processing (to parse the vibhakti of the first segment, if any)
     ↓
     output (segmented text)

  9. Sandhi analyzer
     1. tokenize by space (words)
     2. preprocess (exclude punctuation) -> punctuation marked
     3. check example base - if found, stop
     4. check subanta (it checks avyayas and verbs as well)
        -> pratipadikas
        -> avyayas marked
        -> verbs marked

  10. Sandhi analyzer
     5. check pratipadika list
        -> if found, do not process for sandhi
        -> if not found, start sandhi processing
     6. search of sandhi marker and sandhi patterns
     7. generate possible solutions
     8. search the lexicon
     9. subanta processing (to parse the vibhakti of the first segment, if any)
     10. output (segmented text)
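
Steps 5 to 8 can be sketched roughly as follows. This is a minimal illustration in Python for brevity (the system itself is described as Java-based), and it assumes an ITRANS-like ASCII transliteration with capital letters for long vowels rather than Devanagari. The names `REVERSE_RULES`, `LEXICON` and `split_vowel_sandhi` are hypothetical, and the rule table covers only a few vowel sandhi patterns, not the project's rule base.

```python
from itertools import product

# Hypothetical reverse rule base: a surface vowel (the "sandhi marker") maps to
# the (left-final, right-initial) vowel pairs that could have produced it.
REVERSE_RULES = {
    "A": list(product("aA", "aA")),  # a/A + a/A -> A
    "I": list(product("iI", "iI")),  # i/I + i/I -> I
    "U": list(product("uU", "uU")),  # u/U + u/U -> U
    "e": list(product("aA", "iI")),  # a/A + i/I -> e
    "o": list(product("aA", "uU")),  # a/A + u/U -> o
}

# Hypothetical toy lexicon standing in for the pratipadika list and lexicon.
LEXICON = {"deva", "Alaya", "mahA", "indra", "sUrya", "udaya"}


def split_vowel_sandhi(word, lexicon=LEXICON):
    """Return candidate segmentations whose parts are all found in the lexicon."""
    # Step 5: a word already in the list is not processed for sandhi.
    if word in lexicon:
        return [(word,)]
    solutions = []
    # Step 6: look for sandhi markers at every position.
    for pos, char in enumerate(word):
        for left_final, right_initial in REVERSE_RULES.get(char, []):
            # Step 7: generate a possible solution for this marker.
            left = word[:pos] + left_final
            right = right_initial + word[pos + 1:]
            # Step 8: keep it only if both segments are in the lexicon.
            if left in lexicon and right in lexicon:
                solutions.append((left, right))
    return solutions


print(split_vowel_sandhi("devAlaya"))
print(split_vowel_sandhi("mahendra"))
```

Filtering every generated candidate against the lexicon is what keeps the over-generation of the reverse rules manageable; a real rule base would also need multi-character markers and consonant sandhi.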

  11. Demo
     • Live demo from JNU server
     • Demo from localhost

  12. Subanta analyzer
     • Isolating the inflections and obtaining the nominal bases and their case terminations is essential for morphological analysis.
     • The system has a Unicode Devanagari input/output mechanism and also accepts complete texts.

  13. Subanta analyzer
     INPUT TEXT
     ↓
     PRE-PROCESSOR
     ↓
     LIGHT POS TAGGING (resources: VERB DATABASE, AVYAYA DATABASE)
     ↓
     SUBANTA RECOGNIZER (resource: VIBHAKTI DATABASE)
     ↓
     SUBANTA ANALYZER (resources: SUBANTA RULES, SANDHI RULES)
     ↓
     SUBANTA ANALYSIS

  14. Subanta analyzer
     • Works on a subanta rule base and an example base
     • Subanta eligibility
       • Check fixed lists (punctuation, avyayas, verbs)
       • If found → tag
       • Else mark them SUBANTA
     • Check it in the dictionary
       • If found, store separately
       • Else start subanta processing

  15. Subanta analyzer
     • Check the example base
       • If found → tag, else continue
     • Template search
       • Evaluate the string as per set templates
       • Split it into parts and match the viccheda patterns
       • If found → obtain the corresponding analysis
       • Else tag the input SUBANTA
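
The fixed-list check followed by viccheda pattern matching might look like the sketch below. It is written in Python for brevity and assumes ITRANS-like ASCII input; the lists, the pattern table (masculine a-stems only) and the name `analyze_token` are hypothetical placeholders, not the project's rule base or example base.

```python
# Hypothetical fixed lists; the system described here consults its own databases.
AVYAYAS = {"ca", "eva", "api"}
VERBS = {"gacChati", "paThati"}

# Hypothetical viccheda patterns for masculine a-stems only:
# surface ending -> (ending restored on the base, vibhakti number, vacana).
VICCHEDA_PATTERNS = {
    "asya": ("a", 6, "singular"),
    "eNa": ("a", 3, "singular"),
    "Aya": ("a", 4, "singular"),
    "aH": ("a", 1, "singular"),
    "am": ("a", 2, "singular"),
}


def analyze_token(token):
    """Tag one token, or split it into nominal base and case information."""
    # Subanta eligibility: tokens on the fixed lists are tagged and skipped.
    if token in AVYAYAS:
        return {"form": token, "tag": "AVYAYA"}
    if token in VERBS:
        return {"form": token, "tag": "VERB"}
    # Pattern match: try longer endings first so they win over shorter ones.
    for ending in sorted(VICCHEDA_PATTERNS, key=len, reverse=True):
        if token.endswith(ending):
            restored, vibhakti, vacana = VICCHEDA_PATTERNS[ending]
            return {
                "form": token,
                "tag": "SUBANTA",
                "base": token[: -len(ending)] + restored,
                "vibhakti": vibhakti,
                "vacana": vacana,
            }
    # No pattern matched: the token is only tagged SUBANTA.
    return {"form": token, "tag": "SUBANTA"}


for word in ["rAmeNa", "ca", "rAmasya", "nadI"]:
    print(analyze_token(word))
```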

  16. Demo
     • Live demo from JNU server
     • Demo from localhost

  17. Kdanta Analyzer 3rd ISCLS, Hyderabad

  18. Kdanta Analysis • The process of kdanta analysis mechanism is divided into two sections - recognition and analysis. • The kdanta recognition starts by an exclusion process. The verb forms, avyayas and punctuations are excluded by running POS tagger • The nominal bases are obtained by the subanta analyzer These nominal bases are then checked in fixed lists. This may result in some of the subantas being marked for kdanta. • The remaining subantas are sent to the kdanta recognizer and analyzer system for recognition and analysis using following steps –  3rd ISCLS, Hyderabad

  19. Kdanta Analysis • check the kdanta database, annotated corpus and kdanta-tagged Monier Williams Sanskrit Digital Dictionary (MWSDD). • the subantas still untagged for kdanta are sent to the rule base for kdanta checking. • there may still remain an untagged kdanta subanta. This will count as failure of the system. • अधिकारी["अधिकारिन्","अधि+डुकृञ्","इनि","noun_m"]_KR 3rd ISCLS, Hyderabad

  20. Tianta Analysis • The methodology is a mix of using verb database and reverse Paninian processing • pre-processing • take token by token • confirm the verb (dict, check suffixes), ignore others • Check database • If not found  start analysis • analyze suffixes • evaluate remaining string for base (dict check for bases) • result 3rd ISCLS, Hyderabad

  21. POS tagger
     • A rule-based tagger has been developed for Sanskrit.
     • The tagset has three kinds of tags: word-class main tags, feature sub-tags and punctuation tags.
     • A complete tag combines a word-class main tag with feature sub-tags, separated by underscores.
     • All tags bear Sanskrit names, abbreviated as letter-digit acronyms in Roman script.
     • Tagset (JNU server)
     • Tagset (localhost)

  22. POS tagger
     Input text
     ↓
     Pre-processing
     ↓
     Fixed list tagger
     ↓
     Morph analyzer
     ↓
     Disambiguator*
     ↓
     Result normalizer
     ↓
     Display tagged text
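
The stage order can be sketched as below, in Python for brevity and without the disambiguation stage. All tag names here (`N`, `V`, `AV`, `PUN`, `UNK`, the sub-tag `1.1`) are placeholders invented for the illustration and are not the project's tagset; the function names are likewise hypothetical. The only property taken from the slides is that a main tag and its feature sub-tags are joined by underscores.

```python
# Hypothetical fixed list: closed-class words and punctuation with their tags.
FIXED_LIST = {"ca": ("AV", []), "eva": ("AV", []), "|": ("PUN", [])}


def preprocess(text):
    """Separate the sentence-final bar from words, then split on whitespace."""
    return text.replace("|", " | ").split()


def fixed_list_tag(token):
    """Return (main tag, sub-tags) for listed tokens, else None."""
    return FIXED_LIST.get(token)


def morph_tag(token):
    """Placeholder morphological analysis for tokens not on the fixed list."""
    if token.endswith("aH"):
        return ("N", ["1.1"])  # placeholder: noun, first vibhakti, singular
    if token.endswith("ti"):
        return ("V", [])       # placeholder: finite verb
    return ("UNK", [])


def normalize(main_tag, sub_tags):
    """Join the main tag and its feature sub-tags with underscores."""
    return "_".join([main_tag, *sub_tags])


def tag_text(text):
    tagged = []
    for token in preprocess(text):
        # Fixed list first; fall back to the morph analyzer only if needed.
        main_tag, sub_tags = fixed_list_tag(token) or morph_tag(token)
        tagged.append((token, normalize(main_tag, sub_tags)))
    return tagged


print(tag_text("rAmaH ca gacChati |"))
```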

  23. Gender analyzer
     • Along with vibhakti and number, information about gender is also necessary. In Sanskrit, the words within a noun phrase agree in vibhakti, number and gender. When translating a Sanskrit sentence into Hindi, it is necessary to know the collocational gender of the sentence; otherwise the whole translation may be wrong.

  24. Gender analyzer
     Input Sanskrit text
     ↓
     Application of subanta analyzer (subanta-analyzed text, un-analyzed text)
     ↓
     Lexical lookup
     ↓
     Application of rule base (un-analyzed text)
     ↓
     Check gender agreement within a noun phrase
     ↓
     Suggest the gender of the noun phrase for Hindi translation
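
One simple way to check agreement is to intersect the genders that each word of the phrase could carry. The sketch below shows that idea in Python for brevity; it is not necessarily how the system does it. `GENDER_CANDIDATES` stands in for the combined output of the subanta analyzer and lexical lookup, and both it and `noun_phrase_gender` are hypothetical.

```python
ALL_GENDERS = frozenset({"m", "f", "n"})

# Hypothetical per-word gender candidates (ITRANS-like ASCII forms).
# "sundaram" is ambiguous: masculine accusative or neuter nominative/accusative.
GENDER_CANDIDATES = {
    "sundaram": {"m", "n"},
    "phalam": {"n"},
    "sundaraH": {"m"},
    "bAlakaH": {"m"},
    "latA": {"f"},
}


def noun_phrase_gender(words, candidates=GENDER_CANDIDATES):
    """Return the set of genders compatible with every word in the phrase."""
    possible = set(ALL_GENDERS)
    for word in words:
        # Unknown words place no constraint on the phrase.
        possible &= candidates.get(word, ALL_GENDERS)
    return possible


print(noun_phrase_gender(["sundaram", "phalam"]))   # agreement resolves the ambiguity
print(noun_phrase_gender(["sundaraH", "latA"]))     # empty set: no agreement
```

An empty result signals that the words do not agree, while a result with more than one gender means the phrase is still ambiguous and needs further evidence.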

  25. Demo
     • Live demo from JNU server
     • Demo from localhost

  26. Kraka Analyzer VERB ID VERB ANALYSIS NON—VERB ID SUBANTA ANALYSIS KK CHECK* KRAKA RULES* SPECIAL CONDITIONS KRAKA ASSIGNMENT 3rd ISCLS, Hyderabad

  27. Conclusion
     • This paper has presented ongoing work on developing a complete Sanskrit Analyzing System (SAS). Currently, some SAS modules are partially developed and others are under development.
     • Significant future additions will be:
       • taddhita and samāsa modules
       • ambiguity resolution modules
       • system integration module
       • evaluation module

  28. Thank You!
     http://sanskrit.jnu.ac.in
