Lexicon: exploring language trends on Facebook Walls


Presentation Transcript


  1. Lexicon: exploring language trends on Facebook Walls • Roddy Lindsay • Data Team

  2. What’s a Wall?

  3. Walls are semi-public and public forums on profiles, groups, events, etc. (Images: Old, New)

  4. Numbers • Blogs • 1.6 million posts per day (Technorati) • ~18 posts per second • Walls • 12-20 million wall posts per day • ~180 posts per second • 5-9 million unique users per day • 2-2.5 GB of unstructured text per day

  5. Lexicon 101

  6. Brief History of Lexicon • First iteration: “Pulse” (2006) • Interests in profile fields ranked by count • E.g. “Top movies in San Francisco Network” • Pros • Structure through comma delimitation • Cons • Limited to profile field categories (movies, books, interests, TV shows, music) • Profile information is static (not updated frequently)

  7. Brief History of Lexicon • Attempt 2: • Extract terms from public and semi-public conversations between friends (on the Wall) • Anonymize user data to respect privacy • Plot time series data to show usage trends • Pros • Wall conversations closer to RL conversations • Topics are constantly changing, giving a strong temporal signal • Cons • No structure • Greater computational requirements

  8. How does Lexicon work? • Count occurrences of each word and bigram that is posted each day • Aggregate by unique user to minimize the effect of spam • Trim the long tail to handle data explosion • Normalize for intraweek and seasonal variance by putting total posts in the denominator (example term: “apple”) • Home-rolled interactive Flash charts (used internally and externally for all Facebook reporting products)
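
The counting steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the Data Team's code: the whitespace tokenizer, the `(day, user_id, text)` input shape, the function names, and the `min_users` cutoff used to trim the long tail are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real tokenization."""
    return text.lower().split()


def terms(text):
    """Yield each word and each bigram (adjacent word pair) in a post."""
    words = tokenize(text)
    yield from words
    for first, second in zip(words, words[1:]):
        yield first + " " + second


def daily_term_rates(posts, min_users=2):
    """Compute a normalized usage rate per (term, day).

    posts: iterable of (day, user_id, text) tuples (hypothetical shape).
    min_users: hypothetical cutoff for trimming the long tail.
    Returns {(term, day): unique users using the term / total posts that day}.
    """
    users_by_term_day = defaultdict(set)
    posts_per_day = defaultdict(int)

    for day, user_id, text in posts:
        posts_per_day[day] += 1
        for term in terms(text):
            # A set counts each user once per (term, day), so one account
            # repeating a term many times does not inflate the count.
            users_by_term_day[(term, day)].add(user_id)

    rates = {}
    for (term, day), users in users_by_term_day.items():
        if len(users) < min_users:
            continue  # trim rarely used terms
        # Total posts in the denominator, as described on the slide.
        rates[(term, day)] = len(users) / posts_per_day[day]
    return rates
```

Dividing by the day's total posts means a term's rate does not rise just because more people post on weekends or holidays.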

  9. How does Lexicon work? • More technically... • Use Scribe (distributed log file aggregation service built with Thrift) to collect wall post logs from web servers • Have a 180-node Hadoop cluster that loads the log files into Hive, our homegrown data warehouse sitting on top of Hadoop • Pipeline of Map-Reduce scripts (written in Python) that count the number of unique users for each (term, day) pair and trim the long tail • Load into horizontally partitioned MySQL tier for user queries • PHP front-end • Memcached sits in front to cache common queries • All of these are (or will be) open-source projects • Facebook is an active contributor to most of these projects
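
The serving side of this slide (a horizontally partitioned store with a cache in front) might look something like the sketch below. Everything here is an assumption made for illustration: plain dictionaries stand in for the MySQL partitions and for Memcached, the shard count is arbitrary, and hashing the term with CRC32 is just one plausible way to pick a partition. The slides do not say how terms are actually assigned to partitions.

```python
import zlib

NUM_SHARDS = 4  # hypothetical partition count


def shard_for(term, num_shards=NUM_SHARDS):
    """Pick a partition for a term using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


class TermStore:
    """Toy read-through cache over a partitioned store (hypothetical class)."""

    def __init__(self, num_shards=NUM_SHARDS):
        # Each dict stands in for one MySQL partition.
        self.shards = [dict() for _ in range(num_shards)]
        # Stands in for Memcached.
        self.cache = {}
        # Counts how many lookups actually reached a partition.
        self.shard_reads = 0

    def load(self, term, series):
        """Store a term's time series in the partition its hash selects."""
        self.shards[shard_for(term, len(self.shards))][term] = series

    def query(self, term):
        """Return the term's series, checking the cache before the partition."""
        if term in self.cache:
            return self.cache[term]
        self.shard_reads += 1
        series = self.shards[shard_for(term, len(self.shards))].get(term, [])
        self.cache[term] = series
        return series
```

Repeated queries for a popular term are answered from the cache, so only the first lookup touches a partition. This sketch never invalidates cached entries, which a real deployment would need to handle when a new day's data is loaded.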

  10. Demo

  11. What is Lexicon useful for?

  12. What is Lexicon useful for? • Tracking news • Lexicon shows relative chatter surrounding current events • Can understand which events are of interest to the Facebook audience • Example terms: “tibet”, “died” (Heath Ledger)

  13. What is Lexicon useful for? • Natural language trends • Words and phrases constantly enter and exit the lexicon • Track the popularity of terms that are used in everyday conversation • Example terms: “lulz”, “pwned”

  14. What is Lexicon useful for? • Understanding the Facebook audience • Lexicon trends can yield insights into Facebook demographics, user attitudes towards Facebook products, and how the products are used • Example term: “the add”

  15. What is Lexicon useful for? • Brand Mindshare • Brands and commercial products are mentioned in Wall conversations, just as in face-to-face conversations • Example terms: “verizon”, “juno”

  16. What is Lexicon useful for? • Categories that are social in nature yield the strongest signal • Entertainment, Mobile, Automotive, QSR, etc. • Example terms: “honda”, “toyota”

  17. What is Lexicon useful for? • Measuring the success of sponsored gift campaigns on Facebook • Sponsored gifts: images you can send to friends along with a Wall post • Example term: “coors”

  18. Challenges

  19. Challenges • Term disambiguation • Words are used in a variety of contexts • E.g. my cousin Wendy’s birthday vs. Wendy’s hamburgers • Tracking each different context automatically with machine learning techniques is difficult • Language classifiers, proper tokenization, and smart cleaning of the data can get us part way there

  20. Challenges • Sentiment • Is the mention of a term positive, negative, neutral, something else? • Most challenging aspects: irony, ambiguous sentiment terms, complex grammar • Many top companies use humans to rate a sizable percentage of posts • Numerous Ph.D. candidates have quit graduate school over this problem • Obviously a difficult task...

  21. Challenges • Sentiment • The language on Facebook wall posts is characterized by: • slang, lulz • mispellings • blunt sentences. • superfluous punctuation!!! • absent punctuation for example • emoticons ^_^ • acronyms, omg • a big freaking mess

  22. Challenges • Sentiment • Blunt language without complex grammar means that irony and sarcasm aren’t big issues • Synonym identification (figuring out that “hotttt” == “hot”), subjective/objective classification, and tokenization are more troublesome • Something to keep in mind: strong prior probability of a subjective post being positive (80-90% as rated by humans) • Walls are not blogs or movie reviews • Theory: users don’t want to appear to be negative, and so avoid making overtly negative comments for the most part • Sentiment classifier that guesses positive every time gives the least error • Maybe sentiment isn’t as important for us...
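
The point about a classifier that always guesses positive can be checked with a majority-class baseline. The sketch below is illustrative only: the function name is made up, and the 85/15 split in the usage note is a hypothetical sample chosen to fall inside the 80-90% range quoted on the slide, not real data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, error rate of always predicting it).

    labels: non-empty list of human-assigned sentiment labels.
    """
    counts = Counter(labels)
    label, hits = counts.most_common(1)[0]
    # Every example that does not carry the majority label is an error.
    return label, (len(labels) - hits) / len(labels)
```

With a hypothetical sample of 85 positive and 15 negative posts, always predicting "positive" is wrong on only 15% of posts. Any trained classifier has to do better than that error rate to be worth running.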

  23. Future trends for text analytics • Data visualization • Graph structure/Diffusion analysis • Cloud computing

  24. Thanks!
