Outline • General overview of the Scientext project • End product & applications • Goals of the linguistic study • Details of the corpus & tagging • Presentation of the beta version
General Overview • Project financed by the French ANR CORPUS ET OUTILS DE LA RECHERCHE EN SCIENCES HUMAINES ET SOCIALES (2007-2010). • Goals: • Create a freely-available corpus of scientific & academic writing in French & English • Devise tools for studying linguistic markers of stance/positioning AND reasoning • Intended Users: Linguists, epistemologists, information retrieval specialists, scientists, language teachers. • Long-Term Applications: • L1 & FL/L2 teaching • Lexicography & writing aids • Information retrieval in scientific & technical fields
General Overview • Draws on several branches of linguistics: • Corpus linguistics: creation & study of a large corpus of scientific & academic texts • Natural Language Processing: processing & study of a corpus using a syntactic dependency parser (Bourigault’s Syntex). • Traditional branches of linguistics: discourse analysis, lexicology, enunciation, syntax and semantics • Project coordinated by the LIDILEM research group (F. Grossmann, A. Tutin), 3 teams = multidisciplinary • LIDILEM (Grenoble): F. Grossmann, A. Tutin, F. Boch, C. Cavalla, O. Kraif, M. Florez, I. Novakova, M.L. Nguyen, F. Rinck. • LLS (Chambéry): J. Osborne, A. Henderson, R. Barr. • LiCorN (Lorient): G. Williams, H. Maury, C. Ropers.
End Product & Applications • Web site with several ways of selecting sub-parts of texts. • Query search (complex & simple) and text view • Search for traces of stance/positioning and reasoning using local pre-established grammars • Downloading of XML corpus (for authors who gave permission, Creative Commons) • Downloading of search results (zip format, CSV format for statistics)
End Product & Applications • Website allowing selection of sub-parts of texts • Teaching applications for both L1 and L2 learners: research into university writing, second language production, etc. • Lexicographical applications including assistance with encoding strategies using reference corpora. • Targeted information retrieval in scientific and technical fields.
The Linguistic Study • Focus on 2 essential features of the texts: • Authors use stance to situate themselves in relation to previous and contemporary research whilst demonstrating what is specific to their work and the choices made. • The intellectual process upon which findings and deductions are based can be revealed via the analysis of authorial reasoning. • Test two hypotheses: • Stance is expressed by a phraseology that is shared (partly? largely?) across fields • This phraseology is more characteristic of genres than of fields
The Linguistic Study • Distinguish between 3 main parameters: Field, Text genre (and sub-genres), Text section • Scientific sub-genres: • Scientific articles • Conference proceedings • PhD theses, HDR • Academic sub-genres (Learner corpus): • 2nd year English majors, Long Essays • 3rd year English majors, Language Policy analyses
Details: French scientific corpus 234 texts (1997-2008), 5 million words
Details: English corpora • Academic (learner) corpus (Chambéry, 1997-2007) • 1.1 million words, 300 texts, 4000-5000 words long • Scientific corpus (Lorient, geoffrey.williams@wanadoo.fr) • 33 million words “hoovered” from BMC Corpus of Biology and Medical Texts • POS-tagged & lemmatised • Theoretical analysis of meaning transfers, to study diachronic & synchronic meaning changes in context through collocational resonance • Creation of a bottom-up dictionary of verb patterns with corpus-driven thematic and conceptual groupings for NNS scientists
Corpus Tagging (French sci. + Eng academic/learner) • XML format (Text Encoding Initiative) • Tagged elements • Header: • Type of tagging, information about the text, availability of the text • Text Structure (semi-automatic tagging): • Identification of text sections: abstract, introduction, body of the text, conclusion, notes, references. • Layout (when available): bold, italics, structure of lists • Linguistic Tagging (automatic): • Morpho-syntactic tagging & identification of syntactic dependencies (Bourigault’s Syntex – 2007 version)
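As an illustration of the structural tagging described on this slide, here is a minimal sketch of reading text sections back from a TEI-style file. The fragment, the use of `div` elements and the `type` attribute values are assumptions made for the example; the actual Scientext schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-style fragment (not taken from the corpus).
SAMPLE = """<TEI>
  <teiHeader><fileDesc><titleStmt><title>Sample article</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <div type="abstract"><p>We test two hypotheses.</p></div>
      <div type="introduction"><p>Previous work has shown little.</p></div>
      <div type="conclusion"><p>The hypothesis seems valid.</p></div>
    </body>
  </text>
</TEI>"""


def sections(xml_string):
    """Map each section type to the concatenated text of its paragraphs."""
    root = ET.fromstring(xml_string)
    result = {}
    for div in root.iter("div"):
        paragraphs = ["".join(p.itertext()).strip() for p in div.iter("p")]
        result[div.get("type")] = " ".join(paragraphs)
    return result


if __name__ == "__main__":
    for name, text in sections(SAMPLE).items():
        print(f"{name}: {text}")
```

Keeping sections as separately addressable elements is what would allow a search to be restricted to, say, introductions only.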
Presentation of the beta version • Web site available on-line: http://scientext.dynalias.net • Interface created by Achille Falaise, using the query language Concquest developed by Olivier Kraif (Université Grenoble 3)
Step 1: Choosing the field, genre, & text section (French scientific corpus)
Step 2: Searching in the texts • 3 search modes • Simple interface, with drop-down menus and predefined values • Complex query language, allowing users to write their own grammars • Local grammars, targeting stance/positioning or reasoning • Example: grammar of scientific affiliation
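The selection step can be pictured as a filter over metadata attached to each text section. The sketch below is only an illustration: the `Section` record, the field and genre labels, and the sample sentences are all invented, and the real interface may organise its metadata differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    field: str    # hypothetical label, e.g. "linguistics"
    genre: str    # hypothetical label, e.g. "article", "thesis"
    section: str  # e.g. "introduction", "conclusion"
    text: str


# Invented records for illustration only; not real corpus content.
CORPUS = [
    Section("linguistics", "article", "introduction", "Nous proposons une hypothèse."),
    Section("linguistics", "thesis", "conclusion", "Cette notion reste utile."),
    Section("medicine", "article", "introduction", "Les résultats sont présentés."),
]


def select(corpus, fields=None, genres=None, sections=None):
    """Keep records matching every criterion supplied; None means no restriction."""
    def ok(value, allowed):
        return allowed is None or value in allowed

    return [s for s in corpus
            if ok(s.field, fields) and ok(s.genre, genres) and ok(s.section, sections)]


if __name__ == "__main__":
    for s in select(CORPUS, genres={"article"}, sections={"introduction"}):
        print(s.field, "|", s.text)
```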
Example of a simple query • Selection of predicate adjectives used with the noun policy.
Examples of predefined searches • Verbs of feeling: hate, love, feel, like, … • Verbs of opinion: consider, think, find, … • Evaluative adjectives: true, great, important, best, new, right, …
Example of a complex query (advanced search) • Search for syntactic dependency + co-occurrence: <hypothèse,#1><>*<cat=V,#2> :: (SUJ,#2,#1); Verbs which come after the lemma hypothèse, where hypothèse is the subject of the verb. • Search for a disjunction of lemmas + syntactic dependency: <lemma=/(hypothèse|notion|concept)/,#1> && <cat=V,#2> && <cat=A,#3> :: (SUJ,#2,#1) AND (ADJ,#1,#3); The lemmas hypothèse, notion or concept functioning as subjects & accompanied by an adjective
Example of a local grammar (to write an advanced search) • Using variables • Re-defining a relation • Ex: (ATTSUJ,#2,#1) = (ATTS,#3,#1) AND (SUJ,#3,#2)
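A predefined search of this kind amounts to looking up each token's lemma in a fixed list. The following sketch assumes a lemmatised text represented as (form, lemma) pairs; the category names and the sample sentence are invented, and the lists contain only the lemmas shown on the slide.

```python
# Lemma lists copied from the slide; the "…" in the original lists is left unfilled.
PREDEFINED = {
    "verbs_of_feeling": {"hate", "love", "feel", "like"},
    "verbs_of_opinion": {"consider", "think", "find"},
    "evaluative_adjectives": {"true", "great", "important", "best", "new", "right"},
}


def find_matches(tokens, category):
    """Return (position, form) for tokens whose lemma is in the chosen list."""
    lemmas = PREDEFINED[category]
    return [(i, form) for i, (form, lemma) in enumerate(tokens) if lemma in lemmas]


# Toy lemmatised sentence as (form, lemma) pairs; invented for illustration.
SENTENCE = [("I", "I"), ("considered", "consider"), ("this", "this"),
            ("policy", "policy"), ("important", "important")]

if __name__ == "__main__":
    print(find_matches(SENTENCE, "verbs_of_opinion"))
    print(find_matches(SENTENCE, "evaluative_adjectives"))
```

Matching on lemmas rather than surface forms is what lets "considered" count as a hit for consider. A fuller version would presumably also check the part of speech, so that the preposition like is not returned as a verb of feeling.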
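To make the first query concrete, the sketch below reimplements its logic on a toy parsed sentence. It is not the ConcQuest engine: the `Token` type, the (relation, governor, dependent) layout of dependencies, and the reading of `<>*` as "any number of intervening tokens" are assumptions made for the example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    cat: str  # coarse part of speech, e.g. "N", "V", "A", "D"


# Toy parsed sentence "Cette hypothèse semble plausible"; invented for illustration.
TOKENS = [
    Token("Cette", "ce", "D"),
    Token("hypothèse", "hypothèse", "N"),
    Token("semble", "sembler", "V"),
    Token("plausible", "plausible", "A"),
]
# Dependencies as (relation, governor index, dependent index); this layout is an assumption.
DEPS = {("SUJ", 2, 1), ("ATTS", 2, 3)}


def verbs_with_subject(tokens, deps, lemma):
    """Find (noun, verb) index pairs where the verb follows the noun and takes it as subject."""
    hits = []
    for i, noun in enumerate(tokens):
        if noun.lemma != lemma:
            continue
        for j in range(i + 1, len(tokens)):  # any number of tokens may intervene
            if tokens[j].cat == "V" and ("SUJ", j, i) in deps:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    for i, j in verbs_with_subject(TOKENS, DEPS, "hypothèse"):
        print(TOKENS[i].form, "->", TOKENS[j].lemma)
```

The point of combining word order with a dependency constraint is that the verb need not be adjacent to the noun: any material between them is skipped as long as the subject relation holds.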
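The re-defined relation can be read as a join over two existing relations that share the same verb. The sketch below assumes dependencies are stored as (relation, governor, dependent) triples and reads the rule as: a subject and an attribute are linked by ATTSUJ whenever one verb governs both. That argument order is an interpretation of the slide, not documented behaviour.

```python
def derive_attsuj(deps):
    """Return (ATTSUJ, subject, attribute) triples for verbs having both SUJ and ATTS dependents."""
    derived = set()
    for rel_a, verb_a, attribute in deps:
        if rel_a != "ATTS":
            continue
        for rel_s, verb_s, subject in deps:
            # Both relations must hang off the same verb (variable #3 in the rule).
            if rel_s == "SUJ" and verb_s == verb_a:
                derived.add(("ATTSUJ", subject, attribute))
    return derived


# Toy dependencies for "Cette hypothèse semble plausible" (token indices 0-3); invented.
DEPS = {("SUJ", 2, 1), ("ATTS", 2, 3)}

if __name__ == "__main__":
    print(derive_attsuj(DEPS))
```

Once derived, such a relation links hypothèse directly to plausible, so a grammar can ask for "adjectives predicated of a noun" without mentioning the intervening verb each time.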
Step 3: Display • KWIC display, can be customised
Displaying a wider context
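For readers unfamiliar with KWIC (keyword in context) displays, here is a minimal generic sketch: each hit is shown with a fixed number of tokens on either side, and left contexts are right-aligned so the keywords form a column. The `width` parameter stands in for the kind of customisation mentioned on the slide; it is not the interface's actual option.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, key, right) context triples for every occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def render(lines):
    """Right-align left contexts so that the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return "\n".join(f"{left.rjust(pad)}  [{key}]  {right}" for left, key, right in lines)


if __name__ == "__main__":
    text = "the language policy is clear and the policy was new".split()
    print(render(kwic(text, "policy", width=2)))
```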
Conclusion • Project still running (through early 2010) • Compilation & tagging of the corpus: LONG … & tedious • Interface still being developed • Linguistic model still needs finalising • More grammars need to be developed • Teaching materials need developing & piloting • Issues: interface between lexis & rhetorical functions • Future Research • Linguistic study of markers: • “positioned” citations • markers of scientific affiliation • Teaching materials need piloting & evaluating
