John Williams

‘Capturing the Zoo’
A system for downloading, preparing, and managing corpus data from online forums
John Williams | john.x.williams@port.ac.uk | @lexicoj0hn
Claudia Viggiano | claudia.viggiano@port.ac.uk | @thisiswater_

Who are we and what are we doing?
• Language of Citizen Science (LOCS) research group, University of Portsmouth
• ‘Citizen science’ (CS) = online collaboration between scientists and members of the public who volunteer to take part in research; the ‘crowdsourcing’ of scientific research
• 7 researchers in different areas of linguistics → linked to CS researchers in other departments (Economics, Cosmology)
• Overarching research questions: what factors motivate volunteers to take part in CS projects, and what factors demotivate them? What part does language play in this?
• Our task is to capture and interrogate linguistic data from online CS forums → Zooniverse

Zooniverse
• Umbrella site
• Over 40 projects in different scientific domains
• Each project includes a ‘Talk’ section (online discussion forum)

Corpus building
• Our team was tasked with compiling corpora from the Talk sections for each of the Zooniverse projects
• UNIX-based approach to downloading and cleaning the data (Linux)
• Separate corpora for each forum thread
  • Some threads are short (micro corpora) → 1 page, 1 reply
  • Some threads are very long → 7014 pages, 105200 replies
• From this we can compile larger corpora:
  • A corpus for each project (macro corpus)
  • User-specific corpora
  • Themed corpora across projects (e.g. introductions, general chat threads)
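The step from thread-level corpora to a project-level macro corpus can be sketched as below. This is only an illustration: the naming scheme `<project>_<thread>` for thread corpora and the function name `build_macro_corpora` are assumptions made for the example, not the naming used in the actual system.

```python
from collections import defaultdict


def build_macro_corpora(threads):
    """Merge thread-level corpora into one text per project.

    `threads` maps a thread corpus name to its text. Names are assumed
    (hypothetically) to look like '<project>_<thread>'.
    """
    grouped = defaultdict(list)
    # Sort the names so the threads are always merged in the same order.
    for name in sorted(threads):
        project = name.split("_", 1)[0]
        grouped[project].append(threads[name].strip())
    # Separate the threads inside each macro corpus with a blank line.
    return {project: "\n\n".join(texts) for project, texts in grouped.items()}
```

The same grouping idea would apply to user-specific or themed corpora, with a different key in place of the project name.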

Challenges for methodology
• Different researchers with converging yet distinct research interests
• Constantly increasing number of Zooniverse projects, threads and posts
• No predictable system for numerically identifying threads in URLs
  • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ10066h4
  • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ0001lf1
• Three different forum formats (type_0, type_1, type_2)
  • type_1 and type_2 are ‘enhanced’ by JavaScript: the content cannot be downloaded with simple UNIX commands, e.g. wget, lynx
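Because thread identifiers cannot be predicted, each thread has to be addressed by its full URL. In the two example URLs above, the board and discussion identifiers sit in the fragment (the part after `#`), which a browser does not send to the server. A minimal sketch of pulling those identifiers out of a URL follows; the function name `parse_talk_url` is hypothetical, and the `boards/.../discussions/...` shape is taken only from those two examples.

```python
from urllib.parse import urlparse


def parse_talk_url(url):
    """Split a Talk thread URL into host, board ID and discussion ID."""
    parsed = urlparse(url)
    # The thread address is in the fragment (after '#'), not in the path.
    parts = [p for p in parsed.fragment.split("/") if p]
    # Expected shape: ['boards', <board_id>, 'discussions', <discussion_id>]
    if len(parts) != 4 or parts[0] != "boards" or parts[2] != "discussions":
        raise ValueError(f"Unrecognised thread URL: {url}")
    return {"host": parsed.netloc, "board": parts[1], "discussion": parts[3]}
```

The discussion ID could then serve as a stable file name for the corresponding thread corpus.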

Forum formats
[Screenshots of the three forum formats: type_0, type_2, type_1]

Problem: JavaScript blocks content
[Screenshots: the type_1 thread to download; the output of the lynx command]
However, manual copying and pasting works…

Solution: xdotool
• http://www.semicomplete.com/projects/xdotool/
• Open-source package for Linux or Mac that simulates keystrokes and mouse movements
• Therefore xdotool can automate copying and pasting
  • Ctrl+A Ctrl+C Ctrl+V (select all, copy, paste)
• Can be incorporated into UNIX shell scripts
  • Download data → clean up data → save to corpus .txt files
  • Clean up = remove boilerplate text and superfluous metadata + tagging poster and timestamp information
• Can open and close browser pages (indeed any windows)
• Possible Windows equivalent: autohotkey
  • https://autohotkey.com/
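The select-all-and-copy step might look roughly like the sketch below. The original scripts are UNIX shell scripts; this version drives the same `xdotool` commands from Python purely for illustration. Two things here are assumptions rather than part of the described system: the function names, and the use of the separate `xclip` utility to read the clipboard directly instead of pasting with Ctrl+V.

```python
import subprocess
import time


def copy_page_commands(window_name):
    """Return the xdotool invocations for one select-all-and-copy pass."""
    return [
        # Find the browser window by (part of) its title and give it focus.
        ["xdotool", "search", "--name", window_name, "windowactivate", "--sync"],
        ["xdotool", "key", "ctrl+a"],  # select all
        ["xdotool", "key", "ctrl+c"],  # copy
    ]


def grab_page_text(window_name, pause=1.0):
    """Copy the visible text of a browser page and return it as a string."""
    for cmd in copy_page_commands(window_name):
        subprocess.run(cmd, check=True)
        time.sleep(pause)  # give the browser time to react
    # Assumption: xclip is installed; it prints the clipboard contents.
    result = subprocess.run(
        ["xclip", "-selection", "clipboard", "-o"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Running `grab_page_text` needs a graphical session with the thread already open in a browser window; the returned text would still have to go through the clean-up stage before being saved as a corpus file.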

Where does the script get its input from?
• Management system: a spreadsheet on Google Drive
• Shared with all team members, who can request threads for download
[Screenshot of the spreadsheet] The selected cells can be used directly as input to the download program

How the whole system works
• A team member requests one or more threads for download by entering details in the spreadsheet
• The same or any other team member can select multiple rows and use them as input (arguments) to the download program by pasting them into a .txt file
• The spreadsheet also serves as a record of progress in downloading corpora
• The team member who downloads the threads uploads the resulting corpora to a central repository (Google Drive shared folder accessible to all team members)
• Corpora can then be analysed using any corpus software (AntConc, WordSmith, SketchEngine etc.)
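Cells copied from a spreadsheet and pasted into a text file arrive as tab-separated lines, so reading the requests back in is straightforward. The sketch below assumes a hypothetical three-column layout (project, thread name, URL); the actual columns of the team's spreadsheet are not given here, and `read_requests` is an illustrative name.

```python
import csv
import io

# Hypothetical column order; adjust to match the real spreadsheet.
FIELDS = ("project", "thread_name", "url")


def read_requests(text):
    """Parse pasted spreadsheet rows (tab-separated) into dictionaries."""
    rows = []
    for cells in csv.reader(io.StringIO(text), delimiter="\t"):
        cells = [c.strip() for c in cells]
        if not any(cells):
            continue  # skip blank lines
        if len(cells) != len(FIELDS):
            raise ValueError(f"Expected {len(FIELDS)} columns, got: {cells}")
        rows.append(dict(zip(FIELDS, cells)))
    return rows
```

Each resulting dictionary could then be handed to the download step as one thread request.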

Why this methodology?
• Quick, consistent and flexible method for download
• It is free, and uses only open-source tools
• The scripts are adaptable → good starting point for future projects
• The management system (Google Drive) is orderly, easily accessible, and a useful record of progress
• Compatible with both large and small corpora
• Encourages corpus linguists to gently acquire some coding skills (cf. BAAL Symposium @ Aston, May 6th)

John Williams | john.x.williams@port.ac.uk | @lexicoj0hn
