Understanding Big and Small Data in Historical Research: Insights and Innovations

HIST*4170 Data: Big and Small 29 January 2013

Today’s Agenda • Blog Updates • A Short Introduction to Databases • A Big Data Project: People In Motion • Special Guest: Dr. Rebecca Lenihan

Blog Highlights • Ambition • Consider scalability • Consider source availability – local advantage? • Keep your eye on the academic value • What do you want to teach? Learn? • Themes: war, sport, family, mapping • Intellectual property/privacy • Resources: • Google Sketchup • To make 3D buildings

Data Deluge • Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte... • Library of Congress = 200 terabytes • “Transferring “Libraries of Congress” of Data” • IP traffic is around 667 exabytes • It’s a deluge... • Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference) • “Big Data” • too large for current software to handle • Don’t be intimidated • Not all DH sources (yet)
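To put the two figures on this slide on the same scale, the short sketch below expresses the IP traffic figure in "Libraries of Congress". It assumes decimal (SI) prefixes, which the slide does not specify.

```python
# Assumption: decimal (SI) prefixes, i.e. 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
TERABYTE = 10**12
EXABYTE = 10**18

library_of_congress = 200 * TERABYTE  # figure quoted on the slide
ip_traffic = 667 * EXABYTE            # figure quoted on the slide

# How many 200 TB "Libraries of Congress" fit into 667 EB?
libraries = ip_traffic // library_of_congress
print(f"{libraries:,} Libraries of Congress")
```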

Introduction to Databases • Database – a system that allows for the efficient storage and retrieval of information • We associate with... • Computers changed a lot • Problems: organization and efficient retrieval • Organization = requires data structure • Efficient retrieval = requires algorithms • Potential for Humanities? • ...new problems, questions, visualization, and objects worthy of study and reflection.

Database Design • The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain. • Relational databases are more efficient because they store each kind of information separately, in related tables • Attributes • Relationships • Quamen reading is a nice introduction • Not as complicated as you might think, but following rules is important • We will apply...
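A minimal sketch of the ideas above (attributes, relationships, and asking a question of the data), using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and chosen only for illustration; they are not taken from the course materials.

```python
import sqlite3

# In-memory database, discarded when the program ends.
conn = sqlite3.connect(":memory:")

# Two tables: each stores one kind of information (its attributes).
# person.household_id expresses the relationship between the two.
conn.executescript("""
CREATE TABLE household (
    household_id INTEGER PRIMARY KEY,
    township     TEXT NOT NULL
);
CREATE TABLE person (
    person_id    INTEGER PRIMARY KEY,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    age          INTEGER,
    household_id INTEGER NOT NULL REFERENCES household(household_id)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO household VALUES (1, 'Logan')")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
    [(1, "Mary", "Smith", 34, 1), (2, "John", "Smith", 36, 1)],
)

# A question about the domain: who is older than 35, and where do they live?
rows = conn.execute("""
    SELECT p.first_name, p.last_name, h.township
    FROM person AS p JOIN household AS h USING (household_id)
    WHERE p.age > 35
""").fetchall()
print(rows)
```

The township is stored once per household rather than once per person, which is the sense in which information is kept separately and joined only when a question needs it.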

New approach: Crowdsourcing • An “online, distributed problem-solving and production model.” • Daren C. Brabham (2008), "Crowdsourcing as a Model for Problem Solving: An Introduction and Cases", Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90 • Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...” • reCAPTCHA • Luis von Ahn • Others... • Google?

There are limitations... • Organization • Quality Control • Selection

A Database for Your Project? • Think about how you might use a database • but perhaps not too big! • Databases can be very small and still be DH-worthy • Are there public docs out there that you can digest? • Google Refine • Incorporate a search function into your website? • Resources • MS Excel (spreadsheet) • MS Access (relational database) • Google Refine • Cleaning data

Assignment for Next Week • Reading: TBD (3D guns?) • Help someone else out with their project • Read their blog • Comment and provide detailed feedback • Find a collaborator?

People in Motion: Creating Longitudinal Data from Canadian Historical Census

What we are working towards • ‘Unbiased’ links connecting individuals/households over several census years • A comprehensive infrastructure of longitudinal data • [Diagram: Canadian censuses of 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916; US censuses of 1880 and 1900]

Current Work • Automatic linking: 100% of the 1871 Census (3,601,663 records) to 100% of the 1881 Census (4,277,807 records) • Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Existing (True) Links • Ontario Industrial Proprietors – 8429 links • Logan Township – 1760 links • St. James Church, Toronto – 232 links • Quebec City Boys – 1403 links • Bias concerns • family context • others? • [Map labels: Guelph, Logan Twp]

Attributes for Automatic Linking • Last Name – string • First Name – string • Gender – binary • Birthplace – code • Age – number • Marital status – single, married, divorced, widowed, unknown
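One way to picture these attributes is as a typed record. The sketch below is an illustration only: the class and field names, the use of a float for age, and the sample birthplace code are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gender(Enum):
    MALE = "M"
    FEMALE = "F"


class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    last_name: str                 # string
    first_name: str                # string
    gender: Gender                 # binary
    birthplace_code: int           # code (hypothetical numbering)
    age: Optional[float]           # number; None when not recorded
    marital_status: MaritalStatus  # one of five categories


# Hypothetical example record.
record = CensusRecord("SMITH", "MARY", Gender.FEMALE, 15, 34.0, MaritalStatus.MARRIED)
print(record)
```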

Automatic Linkage • The challenges: 1) Identify the same person 2) Deal with attribute characteristics 3) Manage computational expense • The system:

Data Cleaning and Standardization • Cleaning • Names – remove non-alphanumeric characters; remove titles • Age – transform non-numerical representations to corresponding numbers (e.g. 3 months) • All attributes – deal with English/French notations (e.g. days/jours, married/mariee) • Standardization • Birthplace codes and granularity • Marital status
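The cleaning steps listed above can be sketched roughly as follows. This is a simplified illustration, not the project's pipeline: the title list, the unit table, the upper-casing of names, and the marital-status mapping are all assumptions made for the example.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical, deliberately short lookup tables.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
UNITS_PER_YEAR = {
    "day": 365, "days": 365, "jour": 365, "jours": 365,
    "month": 12, "months": 12, "mois": 12,
    "year": 1, "years": 1, "an": 1, "ans": 1,
}
MARITAL = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
}


def strip_accents(text: str) -> str:
    """Remove diacritics so 'mariée' and 'mariee' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw: str) -> str:
    """Drop punctuation and titles; upper-case what remains."""
    letters_only = re.sub(r"[^\w\s]|_", "", raw)
    words = [w for w in letters_only.split() if w.lower() not in TITLES]
    return " ".join(words).upper()


def parse_age(raw: str) -> Optional[float]:
    """Convert '34', '3 months' or '6 jours' to an age in years."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", raw)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    if not unit:
        return float(value)
    per_year = UNITS_PER_YEAR.get(unit)
    return None if per_year is None else value / per_year


def standardize_marital(raw: str) -> str:
    """Map English/French spellings onto one vocabulary."""
    return MARITAL.get(strip_accents(raw).strip().lower(), "unknown")


print(clean_name("Dr. J. O'Brien"), parse_age("3 months"), standardize_marital("Mariée"))
```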

Computational Expense • Very expensive to compare all the possible pairs of records • Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census) • Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days, or 40.5 days per attribute compared. (Big Data)
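The run-time estimate can be checked with a few lines of arithmetic. The function name and its parameters are hypothetical; the numbers are the ones quoted on the slide.

```python
def linkage_runtime_days(records_a: int, records_b: int,
                         attributes: int, comparisons_per_second: int) -> float:
    """Days needed to compare every record in A with every record in B."""
    pairs = records_a * records_b
    seconds = pairs * attributes / comparisons_per_second
    return seconds / 60 / 60 / 24  # sec -> min -> hours -> days


two_attributes = linkage_runtime_days(3_500_000, 4_000_000, 2, 4_000_000)
one_attribute = linkage_runtime_days(3_500_000, 4_000_000, 1, 4_000_000)
print(round(two_attributes, 1), round(one_attribute, 1))
```

With these inputs, comparing two attributes per pair comes to about 81 days, and a single attribute to about 40.5 days.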

Managing Computational Expense • Blocking • By first letter of last name • By birthplace • Using HPC • Running the system on multiple processors in parallel
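Blocking by the first letter of the last name can be sketched as below. The record layout (dictionaries with a "last_name" key) and the sample names are hypothetical; the point is only that records are compared within a block rather than against the whole other census.

```python
from collections import defaultdict


def block_key(last_name: str) -> str:
    """Blocking key: first letter of the last name."""
    return last_name[:1].upper()


def candidate_pairs(census_a, census_b):
    """Yield only those record pairs whose last names share a first letter."""
    blocks = defaultdict(list)
    for rec_b in census_b:
        blocks[block_key(rec_b["last_name"])].append(rec_b)
    for rec_a in census_a:
        for rec_b in blocks.get(block_key(rec_a["last_name"]), []):
            yield rec_a, rec_b


# Hypothetical miniature censuses.
census_1871 = [{"last_name": n} for n in ("Smith", "Stewart", "Brown")]
census_1881 = [{"last_name": n} for n in ("Smith", "Browne", "Tremblay", "Sinclair")]

pairs = list(candidate_pairs(census_1871, census_1881))
all_pairs = len(census_1871) * len(census_1881)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Because each block is independent of the others, blocks are also a natural unit of work to hand to separate processors.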

Record Comparison • Comparing Strings • Jaro-Winkler • Edit Distance • Double Metaphone • Age • +/- 2 years • Exact matches • Gender • Birthplace
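As an illustration of two of the comparisons named above, the sketch below implements edit (Levenshtein) distance and an age check with a tolerance of 2 years. Jaro-Winkler and Double Metaphone are not shown. The function names, the normalised similarity score, and the assumption that the two censuses are 10 years apart are choices made for this example, not details taken from the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Scale edit distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def age_compatible(age_earlier: float, age_later: float,
                   years_between: int = 10, tolerance: int = 2) -> bool:
    """True if the later age is within +/- tolerance of the expected age."""
    return abs(age_later - (age_earlier + years_between)) <= tolerance


print(edit_distance("SMITH", "SMYTH"), name_similarity("SMITH", "SMYTH"))
print(age_compatible(24, 35), age_compatible(24, 40))
```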

Linkage Results
