
John Williams | john.x.williams@port.ac.uk | @lexicoj0hn

‘Capturing the Zoo’: A system for downloading, preparing, and managing corpus data from online forums. John Williams | john.x.williams@port.ac.uk | @lexicoj0hn; Claudia Viggiano | claudia.viggiano@port.ac.uk | @thisiswater_


Presentation Transcript


  1. ‘Capturing the Zoo’: A system for downloading, preparing, and managing corpus data from online forums
     John Williams | john.x.williams@port.ac.uk | @lexicoj0hn
     Claudia Viggiano | claudia.viggiano@port.ac.uk | @thisiswater_

  2. Who are we and what are we doing?
     • Language of Citizen Science (LOCS) research group, University of Portsmouth
     • ‘Citizen science’ (CS) = online collaboration between scientists and members of the public who volunteer to take part in research; the ‘crowdsourcing’ of scientific research
     • 7 researchers in different areas of linguistics → linked to CS researchers in other departments (Economics, Cosmology)
     • Overarching research questions: what factors motivate volunteers to take part in CS projects, and what factors demotivate them? What part does language play in this?
     • Our task is to capture and interrogate linguistic data from online CS forums → Zooniverse

  3. Zooniverse
     • Umbrella site
     • Over 40 projects in different scientific domains
     • Each project includes a ‘Talk’ section (online discussion forum)

  4. Corpus building
     • Our team was tasked with compiling corpora from the Talk sections for each of the Zooniverse projects
     • UNIX-based approach to downloading and cleaning the data (Linux)
     • A separate corpus for each forum thread
       • Some threads are short (micro corpora) → 1 page, 1 reply
       • Some threads are very long → 7,014 pages, 105,200 replies
     • From this we can compile larger corpora:
       • A corpus for each project (macro corpus)
       • User-specific corpora
       • Themed corpora across projects (e.g. introductions, general chat threads)
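The step from per-thread corpora to larger corpora can be sketched as a simple grouping operation. This is only an illustration, not the team's actual scripts: it assumes each post is already available as a record with hypothetical `project`, `thread`, `user` and `text` fields, none of which are named in the slides.

```python
from collections import defaultdict


def compile_corpora(posts):
    """Group post texts into per-thread, per-project and per-user corpora.

    `posts` is an iterable of dicts with the hypothetical keys
    "project", "thread", "user" and "text".
    """
    by_thread = defaultdict(list)   # micro corpora, one per forum thread
    by_project = defaultdict(list)  # macro corpora, one per project
    by_user = defaultdict(list)     # user-specific corpora
    for post in posts:
        text = post["text"]
        by_thread[(post["project"], post["thread"])].append(text)
        by_project[post["project"]].append(text)
        by_user[post["user"]].append(text)
    return by_thread, by_project, by_user


def themed_corpus(by_thread, thread_title):
    """Collect every thread with the given title across all projects."""
    texts = []
    for (project, thread), thread_texts in sorted(by_thread.items()):
        if thread == thread_title:
            texts.extend(thread_texts)
    return texts
```

Keying the thread corpora on the (project, thread) pair keeps same-named threads in different projects apart, while still allowing a themed corpus such as all ‘Introductions’ threads to be pulled together afterwards.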

  5. Challenges for methodology
     • Different researchers with converging yet distinct research interests
     • Constantly increasing number of Zooniverse projects, threads and posts
     • No predictable system for numerically identifying threads in URLs
       • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ10066h4
       • http://talk.galaxyzoo.org/#/boards/BGZ0000001/discussions/DGZ0001lf1
     • Three different forum formats (type_0, type_1, type_2)
       • type_1 and type_2 are ‘enhanced’ by JavaScript: the content cannot be downloaded with simple UNIX commands, e.g. wget, lynx
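Because the thread identifiers follow no predictable sequence, they cannot be generated in a loop and have to be read from each URL as opaque strings. A minimal sketch of doing so, assuming every thread URL has the same shape as the two examples on this slide (the function name `thread_ids` is illustrative, not from the original scripts):

```python
from urllib.parse import urlsplit


def thread_ids(url):
    """Return (board_id, discussion_id) from a Talk thread URL.

    The identifiers live in the URL fragment (the part after '#'),
    e.g. '/boards/BGZ0000001/discussions/DGZ10066h4'.
    """
    parts = urlsplit(url).fragment.strip("/").split("/")
    # Expected shape: boards/<board_id>/discussions/<discussion_id>
    if len(parts) != 4 or parts[0] != "boards" or parts[2] != "discussions":
        raise ValueError(f"unrecognised thread URL: {url}")
    return parts[1], parts[3]
```

The identifiers sit after the `#`, which a browser does not send to the server; this fits the observation that the thread content is assembled by JavaScript rather than delivered in the initial page.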

  6. Forum formats
     [Screenshots of the three forum formats: type_0, type_1, type_2]

  7. Problem: JavaScript blocks content
     • [Screenshot: type_1 thread to download]
     • [Screenshot: output of lynx command]
     • However, manual copying and pasting works…

  8. Solution: xdotool
     • http://www.semicomplete.com/projects/xdotool/
     • Open-source package for Linux or Mac that simulates keystrokes and mouse movements
     • Therefore xdotool can automate copying and pasting
       • Ctrl+A Ctrl+C Ctrl+V (select all, copy, paste)
     • Can be incorporated into UNIX shell scripts
       • Download data → clean up data → save to corpus .txt files
       • Clean up = removing boilerplate text and superfluous metadata, and tagging poster and timestamp information
     • Can open and close browser pages (indeed any windows)
     • Possible Windows equivalent: autohotkey
       • https://autohotkey.com/
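The copy-and-clean steps on this slide can be sketched as follows. This is a hedged illustration in Python rather than the team's shell scripts. The xdotool invocations (`search --name`, `windowactivate`, `key`) are real xdotool commands, but the window title, the boilerplate strings, the post-header pattern and the output tag format are all assumptions made up for the example.

```python
import re

# Hypothetical boilerplate lines and post-header format; the real forum
# pages would need their own patterns.
BOILERPLATE = {"Sign in", "Reply", "Report"}
POST_HEADER = re.compile(
    r"^(?P<user>\S+) wrote on (?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2})$"
)


def copy_page_commands(window_name):
    """Return xdotool argument lists for 'focus window, select all, copy'.

    Each list could be passed to subprocess.run(); nothing is executed here.
    """
    return [
        ["xdotool", "search", "--name", window_name, "windowactivate"],
        ["xdotool", "key", "ctrl+a"],
        ["xdotool", "key", "ctrl+c"],
    ]


def clean_page(raw):
    """Drop boilerplate and blank lines; tag poster and timestamp lines."""
    cleaned = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line in BOILERPLATE:
            continue
        match = POST_HEADER.match(line)
        if match:
            # Turn 'alice wrote on 2016-03-01 10:15' into a tagged header.
            cleaned.append(
                '<post user="{}" time="{}">'.format(match["user"], match["time"])
            )
        else:
            cleaned.append(line)
    return "\n".join(cleaned)
```

In a full script, the copied text would be read back from the clipboard (for instance with a tool such as xclip, which the slides do not mention) or pasted into an editor with Ctrl+V, passed through `clean_page`, and written to the thread's .txt corpus file.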

  9. Where does the script get its input from?
     • Management system on a Google Drive spreadsheet
     • Shared with all team members, who can request threads for download
     • [Screenshot caption] The selected cells can be used directly as input to the download program

  10. How the whole system works
     • A team member requests one or more threads for download by entering details in the spreadsheet
     • The same or any other team member can select multiple rows and use them as input (arguments) to the download program by pasting them into a .txt file
     • The spreadsheet also serves as a record of progress in downloading corpora
     • The team member who downloads the threads uploads the resulting corpora to the central repository (a Google Drive shared folder accessible to all team members)
     • Corpora can then be analysed using any corpus software (AntConc, WordSmith, SketchEngine etc.)
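Rows copied from a spreadsheet and pasted into a text file arrive as tab-separated lines, so turning them into download arguments is a small parsing job. A minimal sketch, assuming a hypothetical four-column layout (project, thread title, URL, requester); the slides do not say which columns the real spreadsheet uses.

```python
import csv
import io

# Hypothetical column order of the pasted spreadsheet rows.
FIELDS = ("project", "thread", "url", "requested_by")


def read_requests(pasted_text):
    """Parse tab-separated rows pasted from the spreadsheet into dicts."""
    reader = csv.reader(io.StringIO(pasted_text), delimiter="\t")
    return [
        dict(zip(FIELDS, row))
        for row in reader
        if any(cell.strip() for cell in row)  # skip empty lines
    ]
```

Each resulting record holds everything the download step needs for one thread, and the same rows remain in the spreadsheet as the record of what has been requested and downloaded.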

  11. Why this methodology?
     • Quick, consistent and flexible method for download
     • It is free, and uses only open-source tools
     • The scripts are adaptable → good starting point for future projects
     • The management system (Google Drive) is orderly, easily accessible, and a useful record of progress
     • Compatible with both large and small corpora
     • Encourages corpus linguists to gently acquire some coding skills (cf. BAAL Symposium @ Aston, May 6th)
