This document outlines how to prepare a corpus for the Sketch Engine (SkE). It covers the vertical format, from plain text files with one word per line to files that add a lemma and a POS (part-of-speech) tag for each word, and shows how XML structure markup such as <doc> and <s> records document metadata and sentence boundaries. It also explains the corpus configuration file, which tells the system where the data lives, which attributes (word, lemma, tag) and structures the corpus contains, and how to display it.
Kilgarriff: Preparing a corpus for SkE
Preparing a corpus for the Sketch Engine

Vertical format
• One word per line in a plain text file
Suddenly
,
their
luck
changed
.
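As a rough illustration of the one-word-per-line layout, the sketch below turns running text into vertical format. The `to_vertical` helper and its regular expression are assumptions made for this example: a deliberately naive tokenizer that splits off punctuation, not the tokenizer the Sketch Engine itself uses.

```python
import re

# Hypothetical, naive tokenizer: a run of word characters is one token,
# and every other non-space character (punctuation) is a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_vertical(text: str) -> str:
    """Return the text with one token (word or punctuation mark) per line."""
    return "\n".join(TOKEN_RE.findall(text))


print(to_vertical("Suddenly, their luck changed."))
```

A real corpus would need a language-aware tokenizer (for clitics, abbreviations, numbers and so on); the point here is only the shape of the output file.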

With lemmas and POS-tags
Suddenly	suddenly	RR
,	-	PUN
their	their	PRP
luck	luck	NN1
changed	change	VVD
.	-	PUN
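To show how the extra columns can be consumed, here is a minimal sketch that reads word/lemma/tag lines into named tuples. It assumes the three columns are separated by tabs and appear in the order word, lemma, tag; the `Token` type and `parse_vertical` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


def parse_vertical(text: str) -> list[Token]:
    """Parse tab-separated word/lemma/tag lines (assumed column order)."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, lemma, tag = line.split("\t")
        tokens.append(Token(word, lemma, tag))
    return tokens


SAMPLE = (
    "Suddenly\tsuddenly\tRR\n"
    ",\t-\tPUN\n"
    "their\ttheir\tPRP\n"
    "luck\tluck\tNN1\n"
    "changed\tchange\tVVD\n"
    ".\t-\tPUN\n"
)

tokens = parse_vertical(SAMPLE)
# Lemmas of everything that is not tagged as punctuation.
print([t.lemma for t in tokens if t.tag != "PUN"])
```

Keeping the lemma and tag next to each word is what later allows searches such as "all forms of change used as a verb".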

With XML structure markup
<doc id="ABC" region="UK" genre="fiction">
<s>
Suddenly	suddenly	RR
<g/>
,	-	PUN
their	their	PRP
luck	luck	NN1
changed	change	VVD
<g/>
.	-	PUN
</s>
</doc>
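The role of the structure tags, and of <g/> in particular, is easiest to see when the original text is rebuilt from the vertical file. The sketch below is an illustration under stated assumptions: columns are tab-separated, each sentence is opened by <s> and closed by </s>, and <g/> means "no space before the next token". The `detokenize` function is a hypothetical helper, not part of the Sketch Engine.

```python
def detokenize(vertical: str) -> list[str]:
    """Rebuild one string per <s>...</s> sentence, honouring <g/> glue tags."""
    sentences = []
    pieces = []
    glue = False  # True when the next token attaches to the previous one
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue = True
        elif line == "<s>":
            pieces, glue = [], False
        elif line == "</s>":
            sentences.append("".join(pieces))
        elif line.startswith("<"):
            # Other structure tags such as <doc ...> or </doc>.
            # Simplification: a token that itself starts with "<" would be skipped.
            continue
        else:
            word = line.split("\t")[0]  # first column is the word form
            if pieces and not glue:
                pieces.append(" ")
            pieces.append(word)
            glue = False
    return sentences


SAMPLE = "\n".join([
    '<doc id="ABC" region="UK" genre="fiction">',
    "<s>",
    "Suddenly\tsuddenly\tRR",
    "<g/>",
    ",\t-\tPUN",
    "their\ttheir\tPRP",
    "luck\tluck\tNN1",
    "changed\tchange\tVVD",
    "<g/>",
    ".\t-\tPUN",
    "</s>",
    "</doc>",
])

print(detokenize(SAMPLE))
```

Without the <g/> lines the same input would come back as "Suddenly , their luck changed ." with a space before each punctuation mark.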

Corpus configuration file
• Tells the system
  • Where data and other files are
  • What attributes (word, tag, lemma) and structures (<doc> <p> <s> <g/>) it contains
  • How to display

Simple example
PATH /corpora/test2
ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
STRUCTURE doc {
    ATTRIBUTE region
    ATTRIBUTE genre
}
STRUCTURE s
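To make the layout of the simple example concrete, the sketch below reads it into a Python dictionary. It is a toy reader written for this example only: `parse_config` is a hypothetical function that understands just the PATH, ATTRIBUTE and STRUCTURE keywords and the braces shown on the slide, and it is not the parser the Sketch Engine uses.

```python
def parse_config(text: str) -> dict:
    """Parse the PATH / ATTRIBUTE / STRUCTURE subset shown in the simple example."""
    tokens = text.split()  # whitespace-separated, so line breaks do not matter
    config = {"PATH": None, "ATTRIBUTE": [], "STRUCTURE": {}}
    current = None  # structure whose { ... } block is currently open
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "PATH":
            config["PATH"] = tokens[i + 1]
            i += 2
        elif tok == "ATTRIBUTE":
            # Top-level attributes describe tokens; nested ones describe a structure.
            target = config["ATTRIBUTE"] if current is None else config["STRUCTURE"][current]
            target.append(tokens[i + 1])
            i += 2
        elif tok == "STRUCTURE":
            name = tokens[i + 1]
            config["STRUCTURE"][name] = []
            i += 2
            if i < len(tokens) and tokens[i] == "{":
                current = name
                i += 1
        elif tok == "}":
            current = None
            i += 1
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    return config


EXAMPLE = """
PATH /corpora/test2
ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag
STRUCTURE doc {
    ATTRIBUTE region
    ATTRIBUTE genre
}
STRUCTURE s
"""

config = parse_config(EXAMPLE)
print(config["PATH"])
print(config["ATTRIBUTE"])
print(config["STRUCTURE"])
```

The result separates the per-token attributes (word, lemma, tag) from the attributes that belong to a structure (region and genre on doc), which mirrors the distinction the configuration file draws.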