Language and Geography. Brendan O’Connor, Social Media Analysis, 3/18/2010. http://anyall.org/blog/2009/05/where-tweets-get-sent-from/
Analyze Geography and Language
Using Twitter data:
(1) Identify author & message locations
(2) Side note: users’ opinions about their own location
Applications:
(3) Analyze language use by geography
  • Example: find regional dialects
(4) Predict geographically embedded real-world phenomena
  • Example: per-state retail sales
U.S. State Identification
• String-matching approach
• Match on
  • Full names (“Pennsylvania”)
    • Case-insensitive
  • Abbreviations (“PA”)
    • Case-sensitive
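The two matching rules above can be sketched roughly as follows. This is a minimal illustration, not the code behind the slides: the `STATES` table is a four-entry subset, `identify_state` is a hypothetical helper name, and plain substring matching for full names is an assumption (it would let a run-together string such as "clevelandohio" match a full name).

```python
import re

# Illustrative subset only; a real table would cover every state and territory.
STATES = {
    "PA": "Pennsylvania",
    "CA": "California",
    "NY": "New York",
    "MI": "Michigan",
}

# Abbreviations: case-sensitive, delimited by word boundaries.
ABBR_RES = {abbr: re.compile(r"\b" + abbr + r"\b") for abbr in STATES}


def identify_state(location):
    """Return the first matching state abbreviation, or None (hypothetical helper)."""
    lowered = location.lower()
    # Full names: case-insensitive substring match (assumed, see lead-in).
    for abbr, name in STATES.items():
        if name.lower() in lowered:
            return abbr
    # Abbreviations: must appear in upper case as a separate token.
    for abbr, pattern in ABBR_RES.items():
        if pattern.search(location):
            return abbr
    return None


print(identify_state("Sacramento, CA"))
print(identify_state("new york"))
print(identify_state("pa"))
```

Checking full names first means "new york" is caught even though a lower-case "ny" would not be.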
Examples
State  Location field
AZ  Scottsdale, AZ
MO  St. Louis, MO
MI  Michigan
CA  Sacramento, CA
FL  Jacksonville, FL
CA  Santa Cruz, CA
IN  Indianapolis, Indiana
CA  2OH!9, California
TX  Dallas, TX
NY  new york
IL  Chicago, IL
CT  Hartford, CT
GA  Georgia
HI  Hawaii
WA  Seattle, WA, USA
CT  Watertown, CT
CA  Bay Area, California
DC  DC Metro Area
IA  Iowa
NC  Raleigh, NC
CA  California
CA  southern california
GA  Atlanta, GA
CA  Porn Valley, CA
TN  Newbern, TN
CA  Westlake Village, CA, USA
MS  Dourados, MS
ME  U GOTTA CATCH ME!
CA  Malibu, California
NC  North Carolina
NY  Windsor, NY
Problems? AL AK AS AZ AR CA CO CT DE DC FM FL GA GU HI ID IL IN IA KS KY LA ME MH MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND MP OH OK OR PW PA PR RI SC SD TN TX UT VT VI VA WA WV WI WY
Brazil
• Brazilian states have two-letter abbreviation conventions like the U.S., with many overlaps
  • Belém,PA
  • São Luís, MA
  • Maceió AL
• “SC”
  • Myrtle Beach, SC
  • Charleston, SC, U.S.A.
  • Joinville - SC
  • Mafra - SC
  • Palmtios – SC
  • FLORIANÓPOLIS, SC, BRASIL
U.S. State Identification
• String-matching approach
• Match on
  • Full names (“Pennsylvania”)
    • Case-insensitive
  • Abbreviations (“PA”)
    • Case-sensitive
• Brazil check
• Common words check
  • DE, ME
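The slides do not say how the Brazil check or the common-words check were implemented, so the following is only one possible reading. Everything here is an assumption made for illustration: the `accept_abbreviation` helper, the cue pattern (the words "Brasil"/"Brazil" or Portuguese diacritics), and the rule that DE and ME must appear in a "City, ST" position.

```python
import re

# Two-letter codes used by both U.S. states/territories and Brazilian states.
BRAZIL_OVERLAP = {"AL", "MA", "MS", "MT", "PA", "PR", "SC"}
# Abbreviations that are also everyday words when written in capitals.
COMMON_WORDS = {"DE", "ME"}

# Hypothetical cues that a location is Brazilian (not taken from the slides).
BRAZIL_CUES = re.compile(r"bra[sz]il|[ãáâçéêíóôõú]", re.IGNORECASE)


def accept_abbreviation(abbr, location):
    """Decide whether an abbreviation hit should count as a U.S. state."""
    # Brazil check: reject overlapping codes when the string looks Brazilian.
    if abbr in BRAZIL_OVERLAP and BRAZIL_CUES.search(location):
        return False
    # Common words check: require the "City, ST" shape for DE and ME.
    if abbr in COMMON_WORDS:
        shape = r",\s*" + abbr + r"\s*(,|$)"
        return re.search(shape, location) is not None
    return True


print(accept_abbreviation("SC", "FLORIANÓPOLIS, SC, BRASIL"))
print(accept_abbreviation("SC", "Myrtle Beach, SC"))
print(accept_abbreviation("ME", "U GOTTA CATCH ME!"))
```

A cue-based check like this still accepts strings such as "Joinville - SC" or "Dourados, MS", which carry no diacritics and no country name; catching those would need something like a list of Brazilian city names.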
Experiment
• 4,793,729 messages – stream sample
• 2,309,284 unique users
• 1,624,983 unique users with non-blank location
• Detections
  • 838,012 U.S. state
  • 346,553 latitude, longitude
  • 3,163 five-digit ZIP code
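The other two detection types could be approximated with regular expressions along these lines. The patterns and the helper names `find_latlon` and `find_zip` are assumptions for illustration; the slides do not show the detectors that produced the counts above.

```python
import re

# Two signed decimal numbers separated by a comma, e.g. "40.7128,-74.0060".
LATLON_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")
# Exactly five digits, not embedded in a longer run of digits.
ZIP_RE = re.compile(r"(?<!\d)\d{5}(?!\d)")


def find_latlon(location):
    """Return (lat, lon) if the string holds a plausible coordinate pair, else None."""
    match = LATLON_RE.search(location)
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject number pairs that cannot be coordinates.
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None


def find_zip(location):
    """Return the first standalone five-digit number, or None."""
    match = ZIP_RE.search(location)
    return match.group(0) if match else None


print(find_latlon("iPhone: 40.712800,-74.006000"))
print(find_zip("Pittsburgh 15213"))
```

The digit lookarounds keep the ZIP pattern from firing on the fractional part of a coordinate.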
OH  clevelandohio sadly :(
NY  Syracuse, NY :)
IL  Close to ur heart =],Illinois
TX  S.A TX :D
MN  Minnesota :)
CA  California, Newport Beach :)
SC  JERSEY but in Cola SC 4 now:-)
NC  Charlotte,NC =(
CA  Playboy Mansion California. :)
NY  Bronx,NY :)
States, happy:sad, %happy
ND  2:3  0.400
NV  2:3  0.400
MO  6:7  0.462
ID  2:2  0.500
WY  2:2  0.500
RI  5:3  0.625
UT  5:3  0.625
KY  12:6  0.667
MT  2:1  0.667
NE  6:3  0.667
NH  2:1  0.667
SD  4:2  0.667
MA  13:6  0.684
WI  11:5  0.688
WV  7:3  0.700
NM  5:2  0.714
AR  6:2  0.750
CT  12:4  0.750
DE  6:2  0.750
SC  10:3  0.769
MS  11:3  0.786
OR  11:3  0.786
IN  19:5  0.792
AK  4:1  0.800
GU  4:1  0.800
KS  12:3  0.800
MN  17:4  0.810
IA  9:2  0.818
OH  41:9  0.820
HI  15:3  0.833
MD  22:4  0.846
MI  29:5  0.853
AL  18:3  0.857
VA  25:4  0.862
PA  19:3  0.864
NC  15:2  0.882
CO  8:1  0.889
TN  16:2  0.889
WA  18:2  0.900
NJ  40:4  0.909
PR  10:1  0.909
OK  11:1  0.917
FL  90:8  0.918
GA  24:2  0.923
ME  12:1  0.923
LA  55:4  0.932
AZ  31:2  0.939
DC  16:1  0.941
NY  146:9  0.942
IL  19:1  0.950
TX  151:5  0.968
CA  211:6  0.972
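The third column is consistent with happy / (happy + sad); for example, ND has 2 happy and 3 sad, giving 2/5 = 0.400. A small sketch of that tally, assuming the input is a list of (state, label) pairs; the `happy_fraction` helper and the input shape are hypothetical.

```python
from collections import Counter


def happy_fraction(labels):
    """Map each state to (happy, sad, happy / (happy + sad)).

    `labels` is an iterable of (state, "happy" | "sad") pairs (assumed shape).
    """
    counts = Counter(labels)
    table = {}
    for state in {state for state, _ in counts}:
        happy = counts[(state, "happy")]
        sad = counts[(state, "sad")]
        # A state only appears here if it has at least one label, so no zero division.
        table[state] = (happy, sad, happy / (happy + sad))
    return table


sample = [("ND", "happy")] * 2 + [("ND", "sad")] * 3
sample += [("RI", "happy")] * 5 + [("RI", "sad")] * 3
for state, (happy, sad, frac) in sorted(happy_fraction(sample).items()):
    print(f"{state}  {happy}:{sad}  {frac:.3f}")
```

With counts this small for many states (ND, MT, NH have five or fewer labelled users), the fractions are very noisy.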
Emoticon parsing
Count  Emoticon
5658  :)
2032  :D
1391  ;)
845  =)
701  :]
583  :/
554  =]
461  ;D
437  :P
338  =D
278  :(
245  ;]
197  :-)
138  ;-)
128  =P
122  :p
93  :O
67  ;P
51  :o
44  =/
42  ;p
33  =p
31  =(
26  :\
25  ;o
22  =[
20  :-D
20  :[
15  :-P
15  ;O
14  =O
11  :-p
9  :-/
9  ;(
8  :-(
8  ;/
7  :d
7  ;d
7  :-]
5  =o
3  ;-P
3  :-O
3  ;-D
3  ;[
2  ;-p
2  =d
2  ;-(
2  =\
1  :-d
1  :-[
1  ;\
1  ;-]
1  =-]
1  =-)

NormalEyes = r'[:=]'
Wink = r'[;]'
NoseArea = r'(|o|O|-)'  # optional nose: nothing, o, O, or -
HappyMouths = r'[D\)\]]'
SadMouths = r'[\(\[]'
Tongue = r'[pP]'
OtherMouths = r'[doO/\\]'
Happy = NormalEyes + NoseArea + HappyMouths
Sad = NormalEyes + NoseArea + SadMouths
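To show how the Happy and Sad patterns might be applied to a location string, here is a short sketch. The pattern pieces are copied from the slide; the `classify` helper and its rule of returning None when both or neither kind appears are assumptions, not something the slides specify.

```python
import re

# Pattern pieces as given on the slide.
NormalEyes = r'[:=]'
NoseArea = r'(|o|O|-)'
HappyMouths = r'[D\)\]]'
SadMouths = r'[\(\[]'

HAPPY_RE = re.compile(NormalEyes + NoseArea + HappyMouths)
SAD_RE = re.compile(NormalEyes + NoseArea + SadMouths)


def classify(text):
    """Return "happy", "sad", or None (hypothetical tie-breaking rule)."""
    happy = HAPPY_RE.search(text) is not None
    sad = SAD_RE.search(text) is not None
    if happy == sad:
        # Neither kind present, or both: treat as unlabelled.
        return None
    return "happy" if happy else "sad"


for text in ["Syracuse, NY :)", "Charlotte,NC =(", "Seattle, WA"]:
    print(text, "->", classify(text))
```

Because only NormalEyes is used, winks such as ";)" are left unlabelled here, and a string like "Washington:DC" would be read as happy since it contains ":D".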