
HIST*4170 Data: Big and Small

29 January 2013


Presentation Transcript


  1. HIST*4170 Data: Big and Small, 29 January 2013

  2. Today’s Agenda • Blog Updates • A Short Introduction to Databases • A Big Data Project: People In Motion • Special Guest: Dr. Rebecca Lenihan

  3. Blog Highlights • Ambition • Consider scalability • Consider source availability – local advantage? • Keep your eye on the academic value • What do you want to teach? Learn? • Themes: war, sport, family, mapping • Intellectual property/privacy • Resources: • Google SketchUp • To make 3D buildings

  4. Data Deluge • Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte... • Library of Congress = 200 terabytes • “Transferring ‘Libraries of Congress’ of Data” • IP traffic is around 667 exabytes • It’s a deluge... • Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference) • “Big Data” • too large for current software to handle • Don’t be intimidated • Not all DH sources (yet)
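
To put the two figures on this slide side by side, here is a rough sketch that converts both to bytes and divides. It assumes decimal (SI) units and uses only the slide's numbers; the helper name `to_bytes` is made up for illustration.

```python
# Decimal (SI) byte units, in the order listed on the slide.
UNITS = {
    "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21,
}


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (hypothetical helper)."""
    return value * UNITS[unit]


library_of_congress = to_bytes(200, "TB")  # figure quoted on the slide
ip_traffic = to_bytes(667, "EB")           # figure quoted on the slide

# How many "Libraries of Congress" fit into the IP traffic figure?
libraries = ip_traffic // library_of_congress
print(f"{libraries:,} Libraries of Congress")
```

On these numbers the traffic figure works out to a little over three million Libraries of Congress.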

  5. Introduction to Databases • Database – a system that allows for the efficient storage and retrieval of information • We associate with... • Computers changed a lot • Problems: organization and efficient retrieval • Organization = requires data structure • Efficient retrieval = requires algorithms • Potential for Humanities? • ...new problems, questions, visualizations, and objects worthy of study and reflection.

  6. Database Design • The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain. • Relational databases are more efficient because they store each kind of information separately (in its own table) and connect the tables through relationships • Attributes • Relationships • Quamen reading is a nice introduction • Not as complicated as you might think, but following rules is important • We will apply...
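
A minimal sketch of the relational idea, using Python's built-in `sqlite3` module. The table names, columns, and rows are invented for illustration and are not taken from the course or from any real census. Households and people are stored separately, linked by a relationship (`household_id`), and a question is then asked across both tables.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables: each stores one kind of information (its attributes),
# and person.household_id expresses the relationship between them.
conn.executescript("""
CREATE TABLE household (
    household_id INTEGER PRIMARY KEY,
    township     TEXT NOT NULL
);
CREATE TABLE person (
    person_id    INTEGER PRIMARY KEY,
    household_id INTEGER NOT NULL REFERENCES household(household_id),
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    age          INTEGER
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO household VALUES (1, 'Logan')")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "Mary", "Smith", 34), (2, 1, "John", "Smith", 36)],
)

# Ask a question about the domain: who is over 35, and where do they live?
rows = conn.execute("""
    SELECT p.first_name, p.last_name, h.township
    FROM person AS p
    JOIN household AS h USING (household_id)
    WHERE p.age > 35
""").fetchall()
print(rows)
```

The township is stored once per household rather than once per person, which is the kind of saving the slide is pointing at.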

  7. New approach: Crowdsourcing • An “online, distributed problem-solving and production model.” • Daren C. Brabham (2008), "Crowdsourcing as a Model for Problem Solving: An Introduction and Cases", Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90 • Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...” • reCAPTCHA • Luis von Ahn • Others... • Google?

  8. There are limitations... • Organization • Quality Control • Selection

  9. A Database for Your Project? • Think about how you might use a database • but perhaps not too big! • Databases can be very small and still be DH-worthy • Are there public docs out there that you can digest? • Google Refine • Incorporate a search function into your website? • Resources • MS Excel (spreadsheet) • MS Access (relational database) • Google Refine • Cleaning data

  10. Assignment for Next Week • Reading: TBD (3D guns?) • Help someone else out with their project • Read their blog • Comment and provide detailed feedback • Find a collaborator?

  11. People in Motion: Creating Longitudinal Data from Canadian Historical Census

  12. What we are working towards • A comprehensive infrastructure of longitudinal data • ‘Unbiased’ links connecting individuals/households over several census years • [Diagram: linked censuses for 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916, plus the US 1880 and US 1900 censuses]

  13. Current Work • Automatic linking: 100% of 1871 Census (3,601,663 records) to 100% of 1881 Census (4,277,807 records) • Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

  14. Existing (True) Links • Ontario Industrial Proprietors – 8429 links • Logan Township – 1760 links • St. James Church, Toronto – 232 links • Quebec City Boys – 1403 links • Bias concerns • family context • others? • [Map labels: Guelph, Logan Twp]

  15. Attributes for Automatic Linking • Last Name – string • First Name – string • Gender – binary • Birthplace – code • Age – number • Marital status – single, married, divorced, widowed, unknown
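
One way to picture these six attributes is as a typed record. This is only a sketch: the class and field names, the use of a string for the birthplace code, and the sample values are all assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gender(Enum):
    FEMALE = "F"
    MALE = "M"


class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    """The six linking attributes from the slide (field names are hypothetical)."""
    last_name: str                  # string
    first_name: str                 # string
    gender: Gender                  # binary
    birthplace: str                 # code (assumed here to be a short string)
    age: Optional[int]              # number; None when not recorded
    marital_status: MaritalStatus   # one of five categories


# Hypothetical example record.
example = CensusRecord("Smith", "Mary", Gender.FEMALE, "ON", 34, MaritalStatus.MARRIED)
print(example)
```

Fixing the type of each attribute up front matters because, as the later slides show, each type is compared differently.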

  16. Automatic Linkage • The challenges: 1) Identify the same person 2) Deal with attribute characteristics 3) Manage computational expense • The system:

  17. Data Cleaning and Standardization • Cleaning • Names – remove non-alphanumeric characters; remove titles • Age – transform non-numerical representations to corresponding numbers (e.g. 3 months) • All attributes – deal with English/French notations (e.g. days/jours, married/mariée) • Standardization • Birthplace codes and granularity • Marital status
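
The cleaning steps above might look something like the following sketch. Everything specific here is an assumption made for illustration: the list of titles, the choice to express ages as fractions of a year, and the English/French vocabularies. The project's real rules are not given on the slide.

```python
import re
import unicodedata

# Assumed, incomplete list of titles to drop from names.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}

# Assumed English/French age units, as "units per year".
UNITS_PER_YEAR = {
    "day": 365, "days": 365, "jour": 365, "jours": 365,
    "week": 52, "weeks": 52, "semaine": 52, "semaines": 52,
    "month": 12, "months": 12, "mois": 12,
    "year": 1, "years": 1, "an": 1, "ans": 1,
}

# Assumed English/French marital-status vocabulary (accents already stripped).
MARITAL = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
}


def clean_name(raw):
    """Remove non-alphanumeric characters and titles from a name."""
    kept = "".join(ch for ch in raw if ch.isalnum() or ch.isspace())
    tokens = [t for t in kept.split() if t.lower() not in TITLES]
    return " ".join(tokens)


def parse_age(raw):
    """Turn '34', '3 months' or '12 jours' into an age in years, else None."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", raw)
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2).lower()
    if unit == "":
        return float(number)
    per_year = UNITS_PER_YEAR.get(unit)
    return None if per_year is None else number / per_year


def strip_accents(text):
    """Drop combining accent marks, so 'mariée' becomes 'mariee'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def standardize_marital(raw):
    """Map English/French notations onto one of five standard categories."""
    return MARITAL.get(strip_accents(raw).strip().lower(), "unknown")


print(clean_name("Dr. John  Smith!"))
print(parse_age("3 months"))
print(standardize_marital("Mariée"))
```

Anything the sketch does not recognise falls back to `None` or `"unknown"` rather than being guessed, so uncleanable values stay visible.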

  18. Computational Expense • Very expensive to compare all the possible pairs of records • Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census) • Run-time estimate: ((3.5M × 4M) record pairs × 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days, or about 40.5 days per attribute compared. (Big Data)
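
The run-time estimate can be re-derived in a few lines. This sketch uses only the rounded figures from the slide (3.5M and 4M records, 4M comparisons per second); the function name is made up. With these inputs, a single comparison per record pair takes about 40.5 days, and two attribute comparisons per pair take about 81 days.

```python
records_1871 = 3_500_000            # rounded figure from the slide
records_1881 = 4_000_000            # rounded figure from the slide
comparisons_per_second = 4_000_000  # rate assumed on the slide
SECONDS_PER_DAY = 60 * 60 * 24


def runtime_days(attributes_compared):
    """Days needed to compare every 1871 record with every 1881 record."""
    record_pairs = records_1871 * records_1881
    comparisons = record_pairs * attributes_compared
    return comparisons / comparisons_per_second / SECONDS_PER_DAY


print(round(runtime_days(1), 1))  # one attribute per pair
print(round(runtime_days(2), 1))  # two attributes per pair
```

Either way the answer is measured in weeks or months of computing time, which is why the next slide turns to ways of cutting down the number of pairs.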

  19. Managing Computational Expense • Blocking • By first letter of last name • By birthplace • Using high-performance computing (HPC) • Running the system on multiple processors in parallel
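
Blocking means only comparing records that share a cheap key, rather than every possible pair. Here is a small sketch under stated assumptions: it combines the two keys from the slide (first letter of last name, and birthplace) into one key, which may not be how the project applies them, and the records are invented.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of last name plus birthplace (assumed combination)."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(census_a, census_b):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for rec_b in census_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in census_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Hypothetical miniature "censuses".
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Stewart", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
    {"last_name": "Taylor", "birthplace": "ON"},
    {"last_name": "Brown", "birthplace": "ON"},
]

pairs = list(candidate_pairs(census_1871, census_1881))
all_pairs = len(census_1871) * len(census_1881)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off is that a true match is missed whenever the blocking attribute itself was recorded differently in the two censuses. Because each block is independent of the others, blocks are also a natural unit to hand out to separate processors.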

  20. Record Comparison • Comparing Strings • Jaro-Winkler • Edit Distance • Double Metaphone • Age • +/- 2 years • Exact matches • Gender • Birthplace
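
Two of the string measures named above can be written out in plain Python. This is an illustrative sketch of the textbook Jaro-Winkler and edit-distance (Levenshtein) definitions, not the project's code; Double Metaphone is left out because it is too long for a short example. The `compare_records` helper, its field names, and the 10-year gap between censuses used in the age check are assumptions.

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaro(s1, s2):
    """Jaro similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def compare_records(a, b, years_between=10, age_tolerance=2):
    """Compare two records attribute by attribute (hypothetical field names)."""
    return {
        "last_name_jw": jaro_winkler(a["last_name"], b["last_name"]),
        "first_name_edit": edit_distance(a["first_name"], b["first_name"]),
        "age_ok": abs((b["age"] - a["age"]) - years_between) <= age_tolerance,
        "gender_ok": a["gender"] == b["gender"],
        "birthplace_ok": a["birthplace"] == b["birthplace"],
    }


# Hypothetical pair of records from two censuses ten years apart.
rec_1871 = {"last_name": "SMITH", "first_name": "MARY", "age": 24,
            "gender": "F", "birthplace": "ON"}
rec_1881 = {"last_name": "SMYTH", "first_name": "MARY", "age": 35,
            "gender": "F", "birthplace": "ON"}
print(compare_records(rec_1871, rec_1881))
```

The sketch stops at producing one comparison result per attribute. How those results are combined into a link decision is a separate design choice that the slide does not spell out.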

  21. Linkage Results
