XGTagger, a generic interface dealing with XML contents. Xavier Tannier, Jean-Jacques Girardot, Mihaela Mathieu Ecole des Mines de Saint-Etienne September 19th, 2005
XGTagger sits between an initial XML document and a final XML document. The system S is a black box whose input and output are text only.
1. Initial XML document:
<book>
  <title>Gone with the wind</title>
  <author>Margaret Mitchell</author>
</book>
2. Input (text only), extracted by XGTagger:
Gone with the wind . Margaret Mitchell
3. System S (POS tagger)
4. Output (text only):
Gone VPP with IN the DT wind NN . Margaret NN Mitchell NN
5. Final XML document, rebuilt by XGTagger from the initial XML document (1) and the output of System S (4):
<book>
  <title>
    <w pos="VPP">Gone</w>
    <w pos="IN">with</w>
    <w pos="DT">the</w>
    <w pos="NN">wind</w>
  </title>
  <author>
    <w pos="PN">Margaret</w>
    <w pos="PN">Mitchell</w>
  </author>
</book>
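The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XGTagger's actual implementation: `fake_pos_tagger` is a hypothetical stand-in for System S, tokens are split on whitespace, the black box is assumed to return exactly one "word TAG" pair per input token, and the " . " boundary marker between elements is assumed to come back as a pair of its own.

```python
import xml.etree.ElementTree as ET

def fake_pos_tagger(text):
    """Hypothetical stand-in for System S: plain text in, 'word TAG' pairs out."""
    lexicon = {"Gone": "VPP", "with": "IN", "the": "DT", "wind": "NN", ".": "."}
    return " ".join(f"{w} {lexicon.get(w, 'NN')}" for w in text.split())

def tag_document(xml_source, system):
    root = ET.fromstring(xml_source)
    # Steps 1-2: collect the text of every leaf element; " . " marks element boundaries.
    leaves = [el for el in root.iter() if len(el) == 0 and (el.text or "").strip()]
    plain = " . ".join(el.text.strip() for el in leaves)
    # Steps 3-4: call the black box and read its output as (word, tag) pairs.
    output = system(plain).split()
    pairs = iter(zip(output[0::2], output[1::2]))
    # Step 5: put each word back into its original element, wrapped in <w pos="...">.
    for el in leaves:
        n_words = len(el.text.split())
        el.text = None
        for _ in range(n_words):
            word, tag = next(pairs)
            w = ET.SubElement(el, "w", pos=tag)
            w.text = word
        next(pairs, None)  # skip the pair produced for the " . " boundary marker
    return ET.tostring(root, encoding="unicode")

source = "<book><title>Gone with the wind</title><author>Margaret Mitchell</author></book>"
print(tag_document(source, fake_pos_tagger))
```

A real tagger may retokenize its input, and text that already contains a standalone "." would collide with the boundary marker, so a production version would need a more robust alignment between input and output tokens.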
Tag classification [Colazzo et al., 2001]
• "hard" tags: break the linearity of the text.
  ex: titles, chapters, paragraphs
  <tag>text A</tag><tag>text B</tag>
• "soft" tags: identify significant parts of the text, but remain "transparent" when reading it.
  ex: bold, italics, underlined
  text A <bold>text B</bold>text C
• "jump" tags: particular elements, such as margin notes, citations, glosses.
  text A<note>text B</note>text C
Soft tags, reading contexts and XGTagger
<par>United States <bold>elections</bold> are administered at the state and local level</par>
Reading context:
United States elections are administered at the state and local level
Jump tags, reading contexts and XGTagger
<paragraph>The 2004 United States <footnote>See an article p.163 about the United States of America.</footnote> elections caused less controversy than in 2000.</paragraph>
Reading contexts:
The 2004 United States elections caused less controversy than in 2000.
See an article p.163 about the United States of America.
Schematically: <paragraph> … <footnote> … </footnote> … </paragraph>
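The way hard, soft and jump tags shape reading contexts can be illustrated with a short Python sketch. It is an assumption-laden illustration, not XGTagger's code: the tag names in `SOFT` and `JUMP` are hypothetical defaults, every other tag is treated as hard, and whitespace is normalized when the fragments of a context are joined.

```python
import xml.etree.ElementTree as ET

SOFT = {"bold", "italic", "underline"}  # assumed soft tag names
JUMP = {"footnote", "note"}             # assumed jump tag names

def reading_contexts(xml_source, soft=SOFT, jump=JUMP):
    """Return one string per reading context, in order of first appearance."""
    root = ET.fromstring(xml_source)
    contexts = []

    def new_context():
        contexts.append([])
        return contexts[-1]

    def walk(el, current):
        current.append(el.text or "")
        for child in el:
            if child.tag in soft:
                current = walk(child, current)  # transparent: same context
            elif child.tag in jump:
                walk(child, new_context())      # detached context, parent continues
            else:
                walk(child, new_context())      # hard tag: its own context...
                current = new_context()         # ...and the following text starts afresh
            current.append(child.tail or "")
        return current

    walk(root, new_context())
    joined = (" ".join("".join(parts).split()) for parts in contexts)
    return [text for text in joined if text]

print(reading_contexts(
    "<paragraph>The 2004 United States <footnote>See an article p.163 about "
    "the United States of America.</footnote> elections caused less "
    "controversy than in 2000.</paragraph>"
))
```

With the footnote example, the paragraph text comes back as one uninterrupted context and the footnote as a second one; with the soft `<bold>` example, a single context is returned.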
Example: Phrases
1. Initial XML document:
<book>
  <title>Advances in Information Retrieval</title>
</book>
2. Input (text only):
Advances in Information Retrieval
3. System S (parser)
4. Output (text only):
Advances NNS in IN Information///Retrieval NP
5. Final XML document:
<book>
  <title>
    <w id="1" pos="NNS">Advances</w>
    <w id="2" pos="IN">in</w>
    <w id="3" pos="NP">Information</w>
    <w id="3" pos="NP">Retrieval</w>
  </title>
</book>
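Going from step 4 to step 5 in the phrase example can be sketched as follows. This is a hedged illustration only: the function name `phrase_output_to_words` is hypothetical, the output is assumed to alternate strictly between a unit and its tag, and "///" is assumed to be the only phrase separator. Words that belong to the same phrase receive the same `id`.

```python
import xml.etree.ElementTree as ET

def phrase_output_to_words(output, sep="///"):
    """Turn 'unit TAG unit TAG ...' into <w> elements; phrase members share an id."""
    tokens = output.split()
    title = ET.Element("title")
    units_and_tags = zip(tokens[0::2], tokens[1::2])
    for ident, (unit, tag) in enumerate(units_and_tags, start=1):
        # A unit such as "Information///Retrieval" is one phrase made of several words.
        for word in unit.split(sep):
            w = ET.SubElement(title, "w", id=str(ident), pos=tag)
            w.text = word
    return title

title = phrase_output_to_words("Advances NNS in IN Information///Retrieval NP")
print(ET.tostring(title, encoding="unicode"))
```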
Example: Translation
1. Initial XML document:
<element>I had a conversation with my brother</element>
2. Input (text only):
I had a conversation with my brother
3. System S (translator)
4. Output (text only):
I had a conversation/entretien/Gespräch with my brother/frère/Bruder
5. Final XML document:
<element>
  <w>I</w>
  <w>had</w>
  <w>a</w>
  <w french="entretien" german="Gespräch">conversation</w>
  <w>with</w>
  <w>my</w>
  <w french="frère" german="Bruder">brother</w>
</element>
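The translation example follows the same pattern, with the extra fields of each output token mapped to named attributes. The sketch below is an assumption-based illustration: the function name `translation_output_to_words`, the "/" field separator and the attribute order ("french", then "german") are taken from the example above, not from a documented XGTagger option.

```python
import xml.etree.ElementTree as ET

def translation_output_to_words(output, attr_names=("french", "german"), sep="/"):
    """Turn 'word/extra1/extra2 ...' tokens into <w> elements with named attributes."""
    element = ET.Element("element")
    for token in output.split():
        word, *extras = token.split(sep)
        # Tokens without extra fields simply get no attributes.
        w = ET.SubElement(element, "w", dict(zip(attr_names, extras)))
        w.text = word
    return element

element = translation_output_to_words(
    "I had a conversation/entretien/Gespräch with my brother/frère/Bruder"
)
print(ET.tostring(element, encoding="unicode"))
```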