Corpus Annotation with Linked Open Data John P. McCrae and Thierry Declerck
Summary • Inline and Stand-off annotation • Web Annotation/Open Annotation • NLP Interchange Format • CoNLL-RDF
Why annotate? • Ontologies capture facts about concepts, not the usage of words • Lexicons capture facts about patterns and systems of usage • Sometimes we wish to capture data about specific usage
Inline annotation
Typically with XML

<div type="essay">
  <head>An Essay on Summer</head>
  <p>Summer school in <date when="1990">MCMXC</date> was never easy; it went by too quickly and left us wanting more.</p>
  <p>But, as my friend <name type="person">Peter</name> said with his inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, <said>It never pays to think too hard</said>. Or, as I would rather put it, <quote xml:lang="es">Que sera, sera</quote>.</p>
</div>

Pros: • Easy and quick to do
Cons: • Limited expressivity • Complicates source document • Annotations cannot be added later
Stand-off Annotation [Diagram: a Source Document next to a separate Annotation File that holds Annotation 1, Annotation 2, Annotation 3 and Annotation 4]
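The relationship between the two styles can be sketched in a few lines of Python. This is only an illustration, not a method from the slides: it takes a shortened sentence from the inline example, strips the markup, and records each element as a stand-off record with character offsets into the plain text. The function name `to_standoff` and the record layout are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# A shortened sentence from the inline example (illustrative input only).
INLINE = (
    '<p>But, as my friend <name type="person">Peter</name> said with his '
    'inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>.</p>'
)


def to_standoff(xml_string):
    """Return (plain_text, annotations) for an inline-annotated XML string.

    Each annotation is a dict with the element tag, its attributes and the
    character span it covers in the plain text. The record layout is a
    hypothetical one chosen for this sketch.
    """
    root = ET.fromstring(xml_string)
    chunks = []        # pieces of plain text, in document order
    annotations = []   # stand-off records pointing into the plain text
    position = 0       # number of characters emitted so far

    def visit(element, is_root):
        nonlocal position
        start = position
        if element.text:
            chunks.append(element.text)
            position += len(element.text)
        for child in element:
            visit(child, False)
            if child.tail:  # text that follows the child's closing tag
                chunks.append(child.tail)
                position += len(child.tail)
        if not is_root:
            annotations.append({
                "tag": element.tag,
                "attrs": dict(element.attrib),
                "start": start,
                "end": position,
            })

    visit(root, True)
    return "".join(chunks), annotations


text, annotations = to_standoff(INLINE)
for annotation in annotations:
    print(annotation["tag"], repr(text[annotation["start"]:annotation["end"]]))
```

Once the annotations live outside the text, new layers can be added later without touching the source document, which is the limitation listed for inline annotation.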
Web Annotation Annotation recommendation from W3C https://www.w3.org/TR/annotation-model/
Web Annotation: Target and Body
• body: the element containing the annotation
  • object property: oa:hasBody (any RDF object)
  • datatype property: oa:bodyValue (strings)
• target: the element being annotated
  • any RDF object, including oa:Selector (more in a second)
Selector Types
• oa:FragmentSelector: uses the IRI fragment specification defined by the representation's media type.
• oa:TextQuoteSelector: describes a range of text by copying it, and including some of the text immediately before (a prefix) and after (a suffix) it to distinguish it.
• oa:TextPositionSelector: describes a range of text by recording the start and end positions.
• oa:DataPositionSelector: describes a range of data by recording the start and end positions of the selection.
• oa:SvgSelector
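How the two text selectors behave can be sketched with plain string operations. The functions below are hypothetical helpers, not part of any Web Annotation library; the position selector is read as a 0-based start offset with an exclusive end, and the sample sentence is invented for illustration.

```python
SAMPLE = "Secretary James Baker spoke. Later, James Baker left."


def resolve_text_position(text, start, end):
    """Resolve a position-style selector: 0-based start, exclusive end."""
    return text[start:end]


def resolve_text_quote(text, exact, prefix="", suffix=""):
    """Resolve a quote-style selector.

    Returns the (start, end) span of every occurrence of `exact` that is
    immediately preceded by `prefix` and followed by `suffix`.
    """
    needle = prefix + exact + suffix
    matches = []
    index = text.find(needle)
    while index != -1:
        start = index + len(prefix)
        matches.append((start, start + len(exact)))
        index = text.find(needle, index + 1)
    return matches


# Without a prefix or suffix the quote matches every occurrence.
print(resolve_text_quote(SAMPLE, "James Baker"))
# A prefix narrows the selection to a single occurrence.
print(resolve_text_quote(SAMPLE, "James Baker", prefix="Later, "))
```

A quote selector keeps working if text is inserted earlier in the document, whereas a position selector is cheaper to resolve but breaks as soon as the offsets shift.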
Web Annotation Example

<http://example.org/name_example> a oa:Annotation ;
    oa:hasBody [
        a oa:TextualBody ;
        dc11:format "text/plain"^^xsd:string ;
        rdf:value "PERSON"^^xsd:string
    ] ;
    oa:hasTarget [
        oa:hasSelector [
            a oa:TextQuoteSelector ;
            oa:exact "James Baker"^^xsd:string
        ] ;
        oa:hasSource <https://catalog.ldc.upenn.edu/.../06/wsj_0655.name>
    ] .
Web Annotation Example [Graph diagram of the annotation above: name_example (oa:Annotation) has a body (oa:TextualBody; format text/plain; value PERSON) and a target with source https://catalog.ldc.upenn.edu/.../06/wsj_0655.name and a selector (oa:TextQuoteSelector; exact James Baker)]
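The same annotation can also be written in the JSON-LD serialization of the Web Annotation model. The sketch below builds it as a Python dictionary; the helper name `build_annotation` is an assumption for illustration, and the source URL is copied with its elision exactly as it appears on the slide.

```python
import json


def build_annotation(annotation_id, source, exact, label):
    """Build a Web Annotation as a JSON-LD style dictionary (sketch only)."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "format": "text/plain",
            "value": label,
        },
        "target": {
            "source": source,
            "selector": {
                "type": "TextQuoteSelector",
                "exact": exact,
            },
        },
    }


annotation = build_annotation(
    "http://example.org/name_example",
    # The path is elided ("...") on the slide and is kept that way here.
    "https://catalog.ldc.upenn.edu/.../06/wsj_0655.name",
    "James Baker",
    "PERSON",
)
print(json.dumps(annotation, indent=2))
```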
Web Annotation Summary
• relatively good uptake
• reification
  • annotation as n:m relation between bodies & targets
  • with metadata
• powerful
  • annotate all instances of a string at once using a
• very verbose
  • previous example uses 10 triples
NLP Interchange Format
• String URIs
  • e.g., in a web document
  • can be directly used as object of oa:hasTarget
• simple ontology of linguistic data structures
  • for selected, typical NLP annotations
  • not covering all you ever need for linguistic annotations ;)
RFC 5147
Allows URIs to refer to fragments in text
• Character Offsets: https://catalog.ldc.upenn.edu/docs/LDC95T7/raw/06/wsj_0655.txt#char=19,30
• Line Offsets: https://catalog.ldc.upenn.edu/docs/LDC95T7/raw/06/wsj_0655.txt#line=0
• Integrity Checks: https://.../wsj_0655.txt#char=19,30;md5=67f60186fe687bb898ab7faed17dd96a
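Resolving such fragments against a text can be sketched as follows. This is a simplified reading of RFC 5147, not a complete implementation: it assumes the document is already decoded with line endings normalised to "\n", computes the MD5 over the UTF-8 bytes, and ignores any charset given after an integrity check. The function names and the sample text are assumptions for illustration.

```python
import hashlib


def parse_fragment(fragment):
    """Split a fragment such as 'char=19,30;md5=...' into its parts."""
    scheme_part, *check_parts = fragment.split(";")
    scheme, _, value = scheme_part.partition("=")
    if scheme not in ("char", "line"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    if "," in value:
        start_text, end_text = value.split(",", 1)
        start = int(start_text) if start_text else 0
        end = int(end_text) if end_text else None  # None: to the end
    else:
        start = end = int(value)  # a single position selects an empty span
    checks = {}
    for part in check_parts:
        name, _, check_value = part.partition("=")
        checks[name] = check_value.split(",")[0]  # drop an optional charset
    return scheme, start, end, checks


def select(text, fragment):
    """Return the part of `text` that the fragment identifies."""
    scheme, start, end, checks = parse_fragment(fragment)
    if "length" in checks and int(checks["length"]) != len(text):
        raise ValueError("length check failed: the document has changed")
    if "md5" in checks:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest != checks["md5"].lower():
            raise ValueError("md5 check failed: the document has changed")
    if scheme == "char":
        return text[start:end]
    lines = text.splitlines(keepends=True)
    return "".join(lines[start:end])


DOCUMENT = "first line\nsecond line\n"
print(repr(select(DOCUMENT, "char=0,5")))
print(repr(select(DOCUMENT, "line=1,2")))
```

The integrity check is what makes offset-based references safe to store: if the document is edited, resolution fails loudly instead of silently pointing at the wrong characters.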
Web Annotation + NIF [Graph diagram: name_example (oa:Annotation) has a body (oa:TextualBody; format text/plain; value PERSON) and a target https://.../wsj_0655.name#char=2,22 of type nif:String]
NLP Interchange Format • Slightly simpler method of reference • Saves some triples • but still very verbose • Less standardised and supported than just Web Annotation
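The saving can be made concrete by listing the triples of both variants. The sketch below writes them as plain Python tuples with prefixed names; the blank-node labels and function names are assumptions for illustration, and the target URIs are copied from the slides with their elisions.

```python
ANNOTATION = "<http://example.org/name_example>"


def body_triples():
    """Triples shared by both variants: the annotation and its body."""
    return [
        (ANNOTATION, "rdf:type", "oa:Annotation"),
        (ANNOTATION, "oa:hasBody", "_:body"),
        ("_:body", "rdf:type", "oa:TextualBody"),
        ("_:body", "dc11:format", '"text/plain"'),
        ("_:body", "rdf:value", '"PERSON"'),
    ]


def selector_target_triples(source, exact):
    """Target expressed with a source document plus a text quote selector."""
    return [
        (ANNOTATION, "oa:hasTarget", "_:target"),
        ("_:target", "oa:hasSource", f"<{source}>"),
        ("_:target", "oa:hasSelector", "_:selector"),
        ("_:selector", "rdf:type", "oa:TextQuoteSelector"),
        ("_:selector", "oa:exact", f'"{exact}"'),
    ]


def string_uri_target_triples(string_uri):
    """Target expressed directly as a string URI typed as nif:String."""
    return [
        (ANNOTATION, "oa:hasTarget", f"<{string_uri}>"),
        (f"<{string_uri}>", "rdf:type", "nif:String"),
    ]


plain = body_triples() + selector_target_triples(
    "https://catalog.ldc.upenn.edu/.../06/wsj_0655.name", "James Baker")
with_nif = body_triples() + string_uri_target_triples(
    "https://.../wsj_0655.name#char=2,22")
print(len(plain), len(with_nif))
```

The body still costs five triples either way, which is why the combined form remains verbose even though the target shrinks.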
CoNLL-RDF CoNLL is a format family widely used in NLP • tab-separated values • one word per line • one column per annotation type • sentences separated by empty lines • conventions for most types of word-based linguistic annotation
CoNLL Example

ID   WORD      LEMMA     POS_COARSE  POS    FEATS          HEAD  EDGE
1_1  Sie       sie       P           PPER   nom|pl|*|3     2     SB
1_2  dürfen    dürfen    V           VMFIN  pl|3|pres|ind  0     --
1_3  eine      ein       A           ART    acc|sg|fem     4     NK
1_4  Kopie     Kopie     N           NN     acc|sg|fem     12    OA
1_5  der       der       A           ART    gen|sg|fem     6     NK
1_6  Software  Software  N           NN     gen|sg|fem     4     AG
1_7  auf       auf       A           APPR   _              4     MNR
1_8  dem       der       A           ART    dat|sg|masc    9     NK

(FEATS holds the inflection; HEAD and EDGE together encode the dependency structure.)
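Reading such a file takes very little code, which is part of why the format is popular. The sketch below parses tab-separated rows into dictionaries and splits sentences on empty lines; the column names follow the labels used in the RDF slides, and the function name `parse_conll` is an assumption for illustration.

```python
COLUMNS = ["ID", "WORD", "LEMMA", "POS_COARSE", "POS", "FEATS", "HEAD", "EDGE"]

# The first two rows of the example, joined with real tab characters.
SAMPLE = "\n".join([
    "\t".join(["1_1", "Sie", "sie", "P", "PPER", "nom|pl|*|3", "2", "SB"]),
    "\t".join(["1_2", "dürfen", "dürfen", "V", "VMFIN", "pl|3|pres|ind", "0", "--"]),
    "",
])


def parse_conll(text, columns=COLUMNS):
    """Parse CoNLL-style text into a list of sentences.

    Each sentence is a list of rows, and each row is a dict that maps a
    column name to the cell value.
    """
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


for row in parse_conll(SAMPLE)[0]:
    print(row["ID"], row["WORD"], row["POS"], row["HEAD"], row["EDGE"])
```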
CoNLL as RDF (simple) [Graph diagram: node 1_1 with WORD "Sie", LEMMA "sie", POS_COARSE "P", POS "PPER", FEATS "nom|pl|*|3", HEAD "2" and EDGE "SB", linked by nif:nextWord to node 1_2]
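CoNLL as RDF (better) [Graph diagram: node 1_1 with WORD "Sie", LEMMA "sie", POS_COARSE "P", POS "PPER" and FEATS "nom|pl|*|3"; node 1_2 with WORD "dürfen" and LEMMA "dürfen"; 1_1 is linked to 1_2 by nif:nextWord, and the dependency (SB) appears as a link between the two nodes]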
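The step from rows to a graph can be sketched as a small converter that emits Turtle-like lines (prefix declarations omitted). This is one possible reading of the "better" variant rather than the CoNLL-RDF tool itself: each column becomes a literal-valued property, while HEAD becomes a link to the head word. The base IRI, the `conll:` prefix usage, and the rule that HEAD "2" in sentence 1 refers to node 1_2 are assumptions for illustration; root words (HEAD "0") simply get no HEAD link here.

```python
import json

BASE = "http://example.org/corpus#"  # hypothetical base IRI

SENTENCE = [
    {"ID": "1_1", "WORD": "Sie", "LEMMA": "sie", "POS_COARSE": "P",
     "POS": "PPER", "FEATS": "nom|pl|*|3", "HEAD": "2", "EDGE": "SB"},
    {"ID": "1_2", "WORD": "dürfen", "LEMMA": "dürfen", "POS_COARSE": "V",
     "POS": "VMFIN", "FEATS": "pl|3|pres|ind", "HEAD": "0", "EDGE": "--"},
]

LITERAL_COLUMNS = ["WORD", "LEMMA", "POS_COARSE", "POS", "FEATS", "EDGE"]


def conll_to_turtle_lines(sentence, base=BASE):
    """Convert one parsed sentence into Turtle-like triple lines."""
    lines = []
    for index, row in enumerate(sentence):
        subject = f"<{base}{row['ID']}>"
        for column in LITERAL_COLUMNS:
            literal = json.dumps(row[column], ensure_ascii=False)
            lines.append(f"{subject} conll:{column} {literal} .")
        if row["HEAD"] != "0":
            # Assumed convention: the head lives in the same sentence.
            sentence_id = row["ID"].rsplit("_", 1)[0]
            head = f"<{base}{sentence_id}_{row['HEAD']}>"
            lines.append(f"{subject} conll:HEAD {head} .")
        if index + 1 < len(sentence):
            following = f"<{base}{sentence[index + 1]['ID']}>"
            lines.append(f"{subject} nif:nextWord {following} .")
    return lines


for line in conll_to_turtle_lines(SENTENCE):
    print(line)
```

Turning the head into a link rather than a string is what lets a query engine follow dependencies from node to node.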
CoNLL as an RDF Tree [Tree diagram over the nodes 1_2 dürfen, 1_12 installieren, 1_1 Sie, 1_4 Kopie, 1_3 eine, 1_6 Software and 1_5 der, connected by their dependency links] Easy to query with SPARQL
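The kind of question a tree makes easy is "give me everything below this node". As a rough Python analogue of such a query (not SPARQL itself), the sketch below follows the head links of the first six rows of the CoNLL example and collects the phrase headed by 1_4. The dictionaries and function names are assumptions for illustration; node 1_12 is referenced as a head but is not among the rows shown, so it has no entry of its own.

```python
# Head links for tokens 1_1 to 1_6, taken from the CoNLL example rows.
HEADS = {
    "1_1": "1_2",
    "1_2": None,    # root of the sentence (HEAD 0)
    "1_3": "1_4",
    "1_4": "1_12",  # the head row itself is not shown in the example
    "1_5": "1_6",
    "1_6": "1_4",
}

WORDS = {
    "1_1": "Sie", "1_2": "dürfen", "1_3": "eine",
    "1_4": "Kopie", "1_5": "der", "1_6": "Software",
}


def token_number(node_id):
    """Return the position of a token inside its sentence, e.g. 4 for 1_4."""
    return int(node_id.rsplit("_", 1)[1])


def subtree(heads, root):
    """Return the root and all of its descendants in sentence order."""
    children = {}
    for node, head in heads.items():
        children.setdefault(head, []).append(node)
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found, key=token_number)


phrase = subtree(HEADS, "1_4")
print(" ".join(WORDS[node] for node in phrase))
```

In a triple store the same traversal would typically be a single pattern that follows the head property repeatedly, which is the convenience the slide points to.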
Conclusion
RDF is a powerful method of representing corpus annotations. But:
• Not well adopted by many major projects
• Can be verbose and hard to read
• Limited tool support
This should change over the next few years.