Big Data Analysis Lecture 3: Data Collection
Outline of Today’s Lecture
• Online data have become increasingly prevalent and are useful for many applications
• Example applications:
  • Measurement: user sentiment about a brand name, an organization, etc.
  • Event detection and monitoring: flu/diseases, earthquakes/tsunamis/wildfires, sports events, etc.
  • Prediction: election outcomes, stock market, etc.
• This lecture focuses on how to acquire data from the Web
General Approaches
• Download raw data files
  • Raw ASCII or binary files made available to the public for download
• Crawling a website
  • Uses automated programs to scour the data on a website
  • Follows the links on web pages to move to other pages
  • Search engines use this mechanism to index websites
• Application Programming Interface (API)
  • Websites increasingly provide an API to gather their data
Download Raw Data Files • Some websites provide easy access to their raw data
Download Raw Data Files • Easy to automate the data downloading process
Download Raw Data Files • Some websites require users to fill out a form to select the range of data to download • Harder to automate the data downloading process
Download Raw Data Files
• Browser automation tools are available to perform repetitive web clicks, autofill web forms, etc.
• How does it work?
  • Records the actions you make when browsing a web site (including filling out forms, etc.)
  • Produces a script file that can be edited by the user
  • Allows the user to replay the script over and over again
• Example: https://www.youtube.com/watch?v=2ncKQxD3xVM
Website Crawling
• A crawler (also known as a Web robot or spider) is a computer program that automatically traverses the hyperlink structure of the World Wide Web to gather Web pages (e.g., for indexing by search engines)
• Snowball sampling: start from one or more seed URLs and recursively extract hyperlinks to other URLs
Anatomy of a Web Crawler
• Initialize: append the seed URLs to the queue
• Repeat until the termination condition is met:
  • Dequeue: remove a URL from the queue
  • Fetch: retrieve the web page associated with the URL
  • Parse: extract URLs from the retrieved web page
  • Enqueue: append the extracted URLs to the queue
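To make the loop concrete, here is a minimal sketch of the crawler above in Python, assuming the third-party requests and BeautifulSoup libraries (not named on this slide); it omits politeness delays, robots.txt checks, and robust error handling.

```python
# Minimal crawl loop: initialize, dequeue, fetch, parse, enqueue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50):
    queue = deque(seed_urls)                  # Initialize: seed URLs
    visited = set()
    while queue and len(visited) < max_pages:  # Terminate?
        url = queue.popleft()                  # Dequeue
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=5)  # Fetch
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        for link in soup.find_all("a", href=True):   # Parse
            next_url = urljoin(url, link["href"])
            if next_url.startswith("http"):
                queue.append(next_url)               # Enqueue
    return visited
```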
Robot Exclusion Protocol
• Web crawlers can overwhelm a server with too many requests
• The Robot Exclusion Protocol is a set of guidelines for robot behavior at a given Web site
• Enforced by a special file located at the root directory of the web server (called robots.txt) that specifies the restrictions at a site
  • Allow: list of pages that can be accessed
  • Disallow: list of pages that should not be indexed
• A “well-behaved” robot should follow the protocol
• A robot can choose to ignore the file but will have to face the consequences, e.g., being blacklisted by the web site administrator
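As a sketch of how a well-behaved crawler can honor the protocol, Python's standard urllib.robotparser module can check a site's robots.txt before fetching; the URLs below are placeholders.

```python
# Check robots.txt before fetching, using only the Python standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site
rp.read()

# True only if robots.txt allows this user agent to fetch the page
if rp.can_fetch("MyCrawler", "https://www.example.com/some/page.html"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt")
```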
Meta Tags
• META tags on a webpage also tell a crawler what not to do
• Meta tags are placed between <head> … </head> tags in HTML
  • <META NAME="ROBOTS" CONTENT="NOFOLLOW">: to not follow links on this page
  • <META NAME="GOOGLEBOT" CONTENT="NOINDEX">: to not appear in Google’s index
  • <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">: to not archive a copy in search results
• Source: http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html
Python Web Crawlers
• There are many Python libraries available: HTMLParser, lxml, BeautifulSoup, etc.
• Example: display all the links embedded in a web page (see the sketch below)
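A possible version of that example, assuming the requests and BeautifulSoup libraries (any of the listed parsers would work); the URL is a placeholder.

```python
# Display all the links embedded in a web page.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"          # placeholder URL
page = requests.get(url, timeout=5)
soup = BeautifulSoup(page.text, "html.parser")

# Every <a> tag with an href attribute is an embedded link
for link in soup.find_all("a", href=True):
    print(link["href"])
```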
Application Programming Interface (API) • Wikipedia: an application programming interface (API) is a set of routines, protocols, and tools for building software and applications • API defines the standard way for a program/application to request services from another program/application • Many websites provide APIs to access their data • Twitter: https://dev.twitter.com/ • Facebook: http://developers.facebook.com/ • Reddit: https://github.com/reddit/reddit/wiki/API
Acceptable Use Policy • Each website has its own policy • Read through the whole policy before development • Important details to note: • Rate limit • Authentication key • When in doubt, ask. • Most APIs have message boards where you can ask the company or other developers.
Rate Limiting
• Limitation imposed by an API on how many requests can be made per day or per hour
• If the rate is exceeded, the API returns an error
• If the rate is constantly exceeded, the API blocks the IP address from further requests
• Examples:
  • Google Geocode: 25,000 requests per day
  • Twitter REST API: 180 queries per 15-minute window
  • Reddit: 30 requests/minute
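One common way for a client to stay under a rate limit is to space out requests and back off when the server signals that the limit was hit. The sketch below is an illustration only, with a placeholder URL and limit, using HTTP status 429 (Too Many Requests) as the signal.

```python
# Space out requests and back off exponentially on HTTP 429 responses.
import time
import requests

REQUESTS_PER_MINUTE = 30                  # placeholder limit (cf. Reddit above)
DELAY = 60.0 / REQUESTS_PER_MINUTE        # seconds to wait between requests

def rate_limited_get(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:   # 429 = Too Many Requests
            return response
        time.sleep(DELAY * (2 ** attempt))  # back off before retrying
    return response

response = rate_limited_get("https://www.example.com/api/resource")
time.sleep(DELAY)                         # pause before the next request
```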
Google Maps Geocoding API
• Provides a service for geocoding and reverse geocoding of addresses: https://developers.google.com/maps/documentation/geocoding/start
• Geocoding: the process of converting addresses into geographic coordinates (e.g., latitude and longitude)
• Reverse geocoding: the process of converting geographic coordinates into an address
• Example: you can use the requests or geocoder Python libraries (see the sketch below)
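A sketch of the requests-based approach against the Geocoding web service endpoint; you must supply your own API key, and the address below is just an example.

```python
# Geocode an address with the requests library.
import requests

API_KEY = "YOUR_API_KEY"                 # placeholder key
address = "1600 Amphitheatre Parkway, Mountain View, CA"

resp = requests.get(
    "https://maps.googleapis.com/maps/api/geocode/json",
    params={"address": address, "key": API_KEY},
    timeout=10,
)
result = resp.json()

if result["status"] == "OK":
    # Latitude and longitude of the first matching result
    location = result["results"][0]["geometry"]["location"]
    print(location["lat"], location["lng"])
else:
    print("Geocoding failed:", result["status"])
```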
Python Example 2
• A simpler way is to use the geocoder library
  > pip install geocoder
• Other options: g.content, g.city, g.state, g.country, etc. (see the sketch below)
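A sketch using the geocoder library, following the attribute names on the slide; the address is an example, and depending on Google's current policy you may need to pass an API key (e.g., geocoder.google(address, key=...)).

```python
# Geocode an address with the geocoder library (pip install geocoder).
import geocoder

g = geocoder.google("1600 Amphitheatre Parkway, Mountain View, CA")

print(g.latlng)     # [latitude, longitude]
print(g.city)       # other options: g.content, g.state, g.country, etc.
print(g.state)
print(g.country)
```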
Twitter API (version 1.1)
• Streaming API
  • Delivers tweets containing a given keyword or posted by specific users as they are posted on Twitter (a filtered view of Twitter’s firehose)
• Search (REST) API
  • Submit a query to Twitter
  • Returns the most recent tweets that satisfy the query (15 by default)
Example: How to Use Twitter API • Step 1: Create an account on Twitter • Step 2: Register an application to obtain authentication keys (your app needs key and access tokens) • Step 3: Download the libraries (native to the programming language you want to use) • Step 4: Write your code using the functions provided by the libraries (see examples on how to call the functions in the libraries) • Step 5: Deploy the program
Create a Twitter Account • You need a Twitter account to use the API • Go to apps.twitter.com and sign in (or create a new account if you don’t have one yet)
Registering Your Twitter Application • After signing in, click on “Create a new application”
Registering Your Twitter Application • Fill in the application details
Authentication Tokens from Twitter • Click on the “Keys and Access Tokens” tab • Click on the buttons to generate • Consumer key and secret • Access token and secret
Authentication Tokens from Twitter • Click the Test OAuth button and note the consumer key, consumer secret, access token, and access token secret • These fields will be filled in with values specific to your application
Python for Twitter API
• You can install the tweepy library to query the Twitter API
  • pip install tweepy
• For the Twitter Search (REST) API (see the sketch below):
  • Import OAuthHandler and API from tweepy
  • Create an OAuthHandler object and set the consumer keys and access tokens
  • Create an API object
  • Call api.search(query) to retrieve the tweets
• For more information, go to http://docs.tweepy.org/en/v3.5.0/
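A sketch of those steps with tweepy 3.x; the four credential strings are placeholders for the values from your app's "Keys and Access Tokens" page, and the query is an example.

```python
# Query the Twitter Search (REST) API with tweepy.
from tweepy import OAuthHandler, API

consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Create the OAuthHandler object and set the keys and tokens
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Create the API object and run a search query
api = API(auth)
results = api.search(q="big data")   # returns the most recent matching tweets

for tweet in results:
    print(tweet.user.screen_name, tweet.text)
```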
JSON key fields
• To obtain the individual keys of a tweet, and the user information nested inside it, see the sketch below
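A sketch of pulling individual keys and nested user information out of a tweet's JSON, reusing the results list from the search example above; with tweepy, the raw JSON dictionary is available via the Status object's _json attribute.

```python
# Read fields from tweets returned by the v1.1 REST API.
for tweet in results:          # results from the api.search(...) example above
    data = tweet._json         # plain Python dict of the tweet's JSON

    # Individual keys of the tweet
    print(data["id"], data["created_at"], data["text"])

    # User information is nested under the "user" key
    user = data["user"]
    print(user["screen_name"], user["followers_count"], user["location"])
```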
Python Twitter Streaming API
• For the Twitter Streaming API (see the sketch below):
  • Create a class that inherits from the StreamListener class
  • Create a Stream object
  • Start the Stream
• When using the Twitter Streaming API, you should:
  • Set a timer for the data collection (stop if it exceeds the time limit)
  • Save the output to a file (especially if there are lots of tweets collected)
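A minimal sketch of those steps with tweepy 3.x, adding the recommended time limit and file output; the keyword and file name are placeholders, and auth is the OAuthHandler object from the REST sketch above.

```python
# Collect streaming tweets matching a keyword, save them to a file,
# and stop once the time limit is exceeded.
import json
import time

import tweepy

class MyListener(tweepy.StreamListener):
    def __init__(self, time_limit=60, outfile="tweets.json"):
        super().__init__()
        self.start = time.time()
        self.time_limit = time_limit          # seconds of collection
        self.outfile = open(outfile, "a")

    def on_status(self, status):
        # Save the raw JSON of each incoming tweet to the output file
        self.outfile.write(json.dumps(status._json) + "\n")
        # Time limit is checked whenever a new tweet arrives
        if time.time() - self.start > self.time_limit:
            self.outfile.close()
            return False                      # returning False stops the stream

# auth is the OAuthHandler object created in the REST API sketch above
stream = tweepy.Stream(auth, MyListener(time_limit=60))
stream.filter(track=["big data"])             # keyword is a placeholder
```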
Summary
• This lecture presents an overview of methods for downloading online data
• Many websites provide APIs for users to download their data
  • Some require authentication: the user must register an app and use the OAuth protocol to authenticate access
• Next lecture: data storing and querying with SQL