‘Capturing the Zoo’: A system for downloading, preparing, and managing corpus data from online forums
John Williams | john.x.williams@port.ac.uk | @lexicoj0hn
Claudia Viggiano | claudia.viggiano@port.ac.uk | @thisiswater_
Who are we and what are we doing?
• Language of Citizen Science (LOCS) research group, University of Portsmouth
• ‘Citizen science’ (CS) = online collaboration between scientists and members of the public who volunteer to take part in research; the ‘crowdsourcing’ of scientific research
• 7 researchers in different areas of linguistics, linked to CS researchers in other departments (Economics, Cosmology)
• Overarching research questions: what factors motivate volunteers to take part in CS projects, and what factors demotivate them? What part does language play in this?
• Our task is to capture and interrogate linguistic data from online CS forums (Zooniverse)
Zooniverse
• Umbrella site
• Over 40 projects in different scientific domains
• Each project includes a ‘Talk’ section (online discussion forum)
Corpus building
• Our team was tasked with compiling corpora from the Talk sections of each of the Zooniverse projects
• UNIX-based approach to downloading and cleaning the data (Linux)
• Separate corpora for each forum thread
  • Some threads are short (micro corpora): 1 page, 1 reply
  • Some threads are very long: 7014 pages, 105200 replies
• From these we can compile larger corpora:
  • A corpus for each project (macro corpus)
  • User-specific corpora
  • Themed corpora across projects (e.g. introductions, general chat threads)
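As an illustration of the last step, here is a minimal Python sketch of how thread-level files might be merged into a project-level (macro) corpus. It assumes one .txt file per thread in a single project folder; the function name `build_macro_corpus` and the `<thread>` boundary markers are hypothetical and not part of the system described in the slides.

```python
from pathlib import Path


def build_macro_corpus(project_dir: Path, out_file: Path) -> int:
    """Concatenate every thread-level .txt file in project_dir into out_file.

    Each thread is wrapped in a (hypothetical) <thread> marker so that
    thread boundaries survive the merge. Returns the number of files merged.
    """
    # Collect the inputs first, skipping the output file if it lives
    # in the same folder as the thread files.
    thread_files = sorted(
        p for p in project_dir.glob("*.txt")
        if p.resolve() != out_file.resolve()
    )
    with out_file.open("w", encoding="utf-8") as out:
        for thread_file in thread_files:
            body = thread_file.read_text(encoding="utf-8").rstrip("\n")
            out.write(f'<thread file="{thread_file.name}">\n')
            out.write(body + "\n")
            out.write("</thread>\n")
    return len(thread_files)
```

User-specific or themed corpora could be built the same way by changing which files are selected, provided the poster and theme information is recoverable from the file names or tags.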
Challenges for methodology
• Different researchers with converging yet distinct research interests
• Constantly increasing number of Zooniverse projects, threads and posts
• No predictable system for numerically identifying threads in URLs
  • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ10066h4
  • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ0001lf1
• Three different forum formats (type_0, type_1, type_2)
• type_1 and type_2 are ‘enhanced’ by JavaScript: the content cannot be downloaded with simple UNIX commands, e.g. wget, lynx
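Because the thread identifiers follow no predictable sequence, a script has to treat them as opaque strings taken from the URL. The Python sketch below extracts the board and discussion identifiers from URLs shaped like the two examples above. The function names and the output file-naming scheme are assumptions for illustration only. Note that in these URLs the thread path sits after the `#`, and a URL fragment is not sent to the server in an HTTP request, which fits the observation that wget and lynx cannot retrieve the thread content.

```python
import re
from urllib.parse import urlsplit

# Pattern for the fragment part of a Talk URL, based only on the two
# example URLs in the slides; other forum formats may differ.
_TALK_PATH = re.compile(
    r"^/boards/(?P<board>[^/]+)/discussions/(?P<discussion>[^/]+)/?$"
)


def parse_talk_url(url):
    """Return (board_id, discussion_id) from a Talk thread URL."""
    fragment = urlsplit(url).fragment  # everything after the '#'
    match = _TALK_PATH.match(fragment)
    if match is None:
        raise ValueError(f"not a recognised Talk thread URL: {url}")
    return match.group("board"), match.group("discussion")


def corpus_filename(url):
    """Build a (hypothetical) corpus file name from a Talk thread URL."""
    host = urlsplit(url).hostname
    board, discussion = parse_talk_url(url)
    return f"{host}_{board}_{discussion}.txt"
```

For example, `parse_talk_url("http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ10066h4")` yields the pair `("BGZ0000001", "DGZ10066h4")`.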
Forum formats
[Screenshots of the three forum formats: type_0, type_1, type_2]
Problem: JavaScript blocks content
[Screenshots: the type_1 thread to download; the output of the lynx command]
• However, manual copying and pasting works…
Solution: xdotool
• http://www.semicomplete.com/projects/xdotool/
• Open-source package for Linux or Mac that simulates keystrokes and mouse movements
• Therefore xdotool can automate copying and pasting
  • Ctrl+A, Ctrl+C, Ctrl+V (select all, copy, paste)
• Can be incorporated into UNIX shell scripts
  • Download data → clean up data → save to corpus .txt files
  • Clean up = remove boilerplate text and superfluous metadata, and tag poster and timestamp information
• Can open and close browser pages (indeed any windows)
• Possible Windows equivalent: autohotkey
  • https://autohotkey.com/
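The select-all and copy steps can be sketched as follows. This is a minimal Python illustration, not the authors' shell script: it drives `xdotool key` through `subprocess`, and instead of pasting with Ctrl+V it reads the clipboard with `xclip`, which is a swapped-in technique. The browser command (`firefox`), the fixed waiting time, and both function names are assumptions.

```python
import subprocess
import time


def copy_page_commands():
    """Return the xdotool calls for 'select all' and 'copy'.

    `xdotool key` sends the keystroke to the currently focused window.
    """
    return [
        ["xdotool", "key", "ctrl+a"],  # select the whole rendered page
        ["xdotool", "key", "ctrl+c"],  # copy the selection to the clipboard
    ]


def capture_page(url, out_path, browser="firefox", load_seconds=10.0):
    """Open url, copy the rendered text, and save it to out_path.

    Assumptions: the browser window takes focus when it opens, the page
    finishes loading within load_seconds, and xclip is installed.
    Returns the number of characters saved.
    """
    subprocess.Popen([browser, url])  # do not wait for the browser to exit
    time.sleep(load_seconds)          # crude wait for JavaScript to render
    for command in copy_page_commands():
        subprocess.run(command, check=True)
        time.sleep(0.5)               # give the browser time to react
    # Assumption: read the clipboard with xclip rather than pasting.
    text = subprocess.run(
        ["xclip", "-selection", "clipboard", "-o"],
        check=True, capture_output=True, text=True,
    ).stdout
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    subprocess.run(["xdotool", "key", "ctrl+w"], check=True)  # close the tab
    return len(text)
```

A fixed sleep is the simplest way to wait for the page, but very long threads would probably need a longer or adaptive delay.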
Where does the script get its input from?
• Management system on Google Drive spreadsheet
• Shared with all team members, who can request threads for download
• The selected cells can be used directly as input to the download program
How the whole system works
• A team member requests one or more threads for download by entering details in the spreadsheet
• The same or any other team member can select multiple rows and use them as input (arguments) to the download program by pasting them into a .txt file
• The spreadsheet also serves as a record of progress in downloading corpora
• The team member who downloads the threads uploads the resulting corpora to a central repository (Google Drive shared folder accessible to all team members)
• Corpora can then be analysed using any corpus software (AntConc, WordSmith, SketchEngine etc.)
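To illustrate how pasted spreadsheet rows could become program input, here is a minimal Python sketch. It assumes the rows arrive as tab-separated text, which is how cells copied from a spreadsheet usually paste into a plain-text file. The column layout in `COLUMNS` and the function name `read_requests` are hypothetical; the slides do not specify the actual spreadsheet columns.

```python
import csv
import io

# Hypothetical column layout; the real spreadsheet may differ.
COLUMNS = ("project", "thread_title", "url", "requested_by")


def read_requests(text):
    """Parse pasted, tab-separated spreadsheet rows into request dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    requests = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines left over from pasting
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}: {row}")
        requests.append(dict(zip(COLUMNS, (cell.strip() for cell in row))))
    return requests
```

Each returned dictionary would then supply the arguments for one download, for example the `url` value for the thread to capture.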
Why this methodology?
• Quick, consistent and flexible method for download
• It is free, and uses only open-source tools
• The scripts are adaptable: a good starting point for future projects
• The management system (Google Drive) is orderly, easily accessible, and a useful record of progress
• Compatible with both large and small corpora
• Encourages corpus linguists to gently acquire some coding skills (cf. BAAL Symposium @ Aston, May 6th)