
Fourth-Generation Content Analysis: Computational Linguistics for the Social Sciences

Fourth-Generation Content Analysis: Computational Linguistics for the Social Sciences. Douglas W. Oard. Joint work with Ping Wang, Ken Fleischmann, Tiffany Chao, An-Shou Cheng, Chia-jung Tsui and Lidan Wang. Outline: Content analysis; (some) Computational linguistics; Putting them together.


Presentation Transcript


  1. Fourth-Generation Content Analysis: Computational Linguistics for the Social Sciences • Douglas W. Oard • Joint work with Ping Wang, Ken Fleischmann, Tiffany Chao, An-Shou Cheng, Chia-jung Tsui and Lidan Wang

  2. Outline • Content analysis • (some) Computational linguistics • Putting them together • An example: adoption of IT concepts • Collaboration opportunities

  3. Insight through Triangulation • Think aloud • Observation notes • Interviews • Surveys • Content analysis • Citation analysis

  4. Content Analysis • “… any technique for making inferences by objectively and systematically identifying specified characteristics of messages …” (Holsti, 1969) • “… the study of recorded human communications such as books, Web sites, paintings, and laws …” (Babbie, 1975) • “… a summarizing, quantitative analysis of messages that relies on the scientific method …” (Neuendorf, 2002)

  5. Four Generations of Content Analysis • Read and understand something • Manually infer something, then count it • Directly observe something, then count it • Automatically infer something, then count it

  6. Second-Generation Content Analysis • [Process diagram; steps shown:] • Problem identification • Data selection • Conceptualization • Operationalization • Coding frame design • Analysis • Content acquisition • Manual coding

  7. Third-Generation Content Analysis (“Computer Assisted Text Analysis”) • “Dictionary-based” word counting • Person or organization names • Positive and negative sentiment terms • Vastly more scalable than manual coding • Alias list can accommodate synonymy • Focused domain can limit homonymy effects • Regression models some context-dependent effects
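Dictionary-based counting with an alias list can be sketched in a few lines. This is a minimal illustration, not the tooling used in the project: the `ALIASES` table, the `count_concepts` function, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical alias list: each concept maps to the surface forms that count as a mention.
ALIASES = {
    "VoIP": ["voip", "voice over ip"],
    "SOA": ["soa", "service-oriented architecture"],
}

def count_concepts(text, aliases=ALIASES):
    """Count mentions of each concept, matching any alias as a whole word or phrase."""
    counts = Counter()
    lowered = text.lower()
    for concept, forms in aliases.items():
        for form in forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    return counts

sample = "VoIP adoption grew as voice over IP matured; SOA projects stalled."
print(count_concepts(sample))
```

Folding "voice over IP" into the VoIP count is the synonymy handling the slide refers to. Homonymy (for example, an unrelated expansion of "SOA") is not addressed here and would depend on restricting the collection to a focused domain.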

  8. (Some of) Computational Linguistics • Transducers • Document image processing (e.g., OCR) • Speech processing (e.g., ASR) • Machine translation • “Text Mining” • Segmentation • Clustering • Classification

  9. Transducer Capabilities • [Figure: Transducer Capability Curve; labels: Searchable Fraction, OCR, MT, Handwriting, Speech]

  10. Segmentation • Find mentions of specific types of items in a sequence • Equivalently, learn to mark start and end points • Applicable at many scales • Coherent passages • Multi-word expressions • Named entities (e.g., people or organizations) • Noun phrases • Chinese words • Stems
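The equivalence between tagging a sequence and marking start and end points can be made concrete. The sketch below assumes a B/I/O tagging scheme (begin, inside, outside), which is one common convention and not necessarily the one used in the talk; `bio_to_spans` and the hand-written tags are illustrative.

```python
def bio_to_spans(tokens, tags):
    """Convert per-token B/I/O tags into (start, end, text) spans; end is exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":                 # a new mention starts here
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O":               # outside any mention
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = None
        # tag == "I" continues the current mention (a stray "I" after "O" is ignored)
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans

tokens = "Sun Microsystems Inc. hopes that open-source Solaris will draw in new users".split()
# Tags written by hand for illustration; a trained segmenter would predict them.
tags = ["B", "I", "I", "O", "O", "O", "B", "O", "O", "O", "O", "O"]
print(bio_to_spans(tokens, tags))
```

The same conversion applies at any of the scales listed on the slide; only the unit being tagged (characters, words, sentences) changes.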

  11. Clustering • Aggregate related items • Documents, based on topical similarity • Entities, based on detected relationships
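One simple way to group documents by topical similarity is cosine similarity over bag-of-words vectors with a greedy single pass. This is a sketch under stated assumptions: the talk does not say which clustering algorithm was used, and the threshold, function names, and four toy documents are invented for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed document is similar enough."""
    vectors = [vectorize(d) for d in docs]
    clusters = []                      # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])       # no cluster was close enough; start a new one
    return clusters

docs = [
    "open source browser released",
    "new open source browser gains users",
    "storage vendors adopt iSCSI",
    "iSCSI storage arrays ship",
]
print(cluster(docs))
```

The result depends on document order and on the threshold, which is the price of a single pass; hierarchical or k-means clustering would be the usual alternatives on a real collection.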

  12. Classification • Associate each item with a category • Document → topic • Passage → sentiment • Entity → type • [Process diagram; steps shown: Feature extraction, Model learning, Content acquisition, Classification, Evaluation]
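The feature extraction, model learning, and classification steps can be illustrated with a small text classifier. The sketch below swaps in multinomial Naive Bayes with add-one smoothing because it fits in a few lines of standard-library Python; the project itself reports using an SVM. The training sentences and labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Feature extraction: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> term counts
        self.doc_counts = Counter(labels)         # label -> number of training docs
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training passages for a passage -> sentiment task.
train_texts = [
    "great promising breakthrough excites users",
    "exciting innovation wins praise",
    "costly failure disappoints buyers",
    "projects stalled amid security flaws",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("users praise the breakthrough"))
```

Evaluation, the remaining step in the diagram, would compare predictions on held-out documents against manual codes.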

  13. Automating the Annotation Process • Example text: “There has been a lot of buzz over the arrival of Firefox, the open-source browser published by the Mozilla Foundation… Sun Microsystems Inc. hopes that open-source Solaris will draw in new users and new growth opportunities.” • Segmentation: Firefox, Mozilla Foundation, Sun Microsystems, Solaris • Classification: Firefox → Open Source Software, Mozilla Foundation → Organization, Sun Microsystems → Organization, Solaris → Open Source Software • Association and clustering (company, software): (Mozilla Foundation, Firefox), (Sun Microsystems, Solaris)
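The three annotation stages on this slide can be traced on the quoted passage with a toy pipeline. This is only a sketch: a hand-written `GAZETTEER` stands in for the learned segmentation and classification models, and same-sentence co-occurrence stands in for whatever association method the project used. Both are assumptions, as are the function names.

```python
import re
from itertools import product

# Hypothetical gazetteer standing in for learned segmentation and classification models.
GAZETTEER = {
    "Firefox": "Open Source Software",
    "Solaris": "Open Source Software",
    "Mozilla Foundation": "Organization",
    "Sun Microsystems": "Organization",
}

def annotate(sentence):
    """Segmentation + classification: (start, end, text, type) for each known mention."""
    mentions = []
    for name, kind in GAZETTEER.items():
        for m in re.finditer(re.escape(name), sentence):
            mentions.append((m.start(), m.end(), name, kind))
    return sorted(mentions)

def associate(text):
    """Association: pair each organization with each software item in the same sentence."""
    pairs = []
    # Split after ., !, ? or an ellipsis, but only before a capital letter,
    # so that "Inc. hopes" does not end a sentence.
    for sentence in re.split(r"(?<=[.!?\u2026])\s+(?=[A-Z])", text):
        found = annotate(sentence)
        orgs = [m[2] for m in found if m[3] == "Organization"]
        software = [m[2] for m in found if m[3] == "Open Source Software"]
        pairs.extend(product(orgs, software))
    return pairs

text = ("There has been a lot of buzz over the arrival of Firefox, the open-source "
        "browser published by the Mozilla Foundation\u2026 Sun Microsystems Inc. hopes that "
        "open-source Solaris will draw in new users and new growth opportunities.")
print(associate(text))
```

Each pair is reported as (company, software), matching the label on the slide.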

  14. Interdisciplinary Collaboration

  15. PopIT: Scalable Computational Analysis • Objectives • Describe, explain, and predict trends in technological fields (IT, biotech, nanotech…) • Develop methodology for domain-specific computational analysis of large-scale textual data from multiple sources • Advance theory development in social sciences • Period: September 2007-August 2010 • Sponsor: National Science Foundation http://www.wam.umd.edu/~pwang/PopIT/

  16. Theory-Based Iterative Inquiry

  17. Process Integration • [Process diagram; steps shown:] • Problem identification • Data selection • Content acquisition • Conceptualization • Feature extraction • Model learning • Operationalization • Classification • Coding frame design • Analysis • Evaluation

  18. Rethinking “Operationalization” • Coupled models as “boundary object” • Input representation • Transformation • Output representation • Layered uncertainty • Meaning of the text • Meaning of the coding frame • Purpose of the coding frame

  19. Why do some innovations become popular, but others don’t? • SaaS • Chatbots • Portable Personality • Ajax • RFID • Ultramobile Devices • BPO • Application Quality Dashboards • SOA • VoIP • Mashup • DRM • Identity Management • OSS • Thin Provisioning • Business Intelligence • Semantic Web • Web2.0 • SCM • Tera-architectures • CRM • Distributed Encryption • iSCSI

  20. IT Innovation Life Cycle • [Figure: Hype Cycle, Performance S-curve, and Adoption Curve plotted against Time] • Management Fashion Theory: Knowledge entrepreneurs create a transitory collective belief that an innovation is at the forefront of progress. • Linden & Fenn, 2003

  21. Hype Cycle: Emerging Technologies 2007 • [Figure: Gartner hype cycle] • Fenn et al., 2007

  22. Conceptual and Material Innovation

  23. Fluctuation Hypothesis • An innovation will be more prevalent when its environmental cues are more prevalent. (Abrahamson & Fairchild, 1999)

  24. Sentiment Hypothesis • Emotional and positive discourse characterizes the upswing of an innovation’s hype cycle, whereas reasoned, negative discourse characterizes the downswing. (Abrahamson & Fairchild, 1999)

  25. Competition Hypothesis • The discourse volume of an old concept is negatively associated with the discourse volume of a new and related concept. (Wang, 2007)
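A first check on a "negatively associated" claim of this kind is a correlation between the two volume series. The sketch below computes a Pearson coefficient; the monthly counts are invented purely for illustration and are not results from the study, and the talk does not say which statistic was used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly article counts, invented purely for illustration.
old_concept = [40, 35, 30, 22, 15, 10]
new_concept = [5, 9, 16, 24, 30, 38]
r = pearson(old_concept, new_concept)
print(round(r, 3))
```

A coefficient near -1 would be consistent with the hypothesis but would not establish it: both series may simply be trending over time, so a real analysis would need to control for trend and for overall publication volume.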

  26. Concept Popularity Evolution Model

  27. Manual Content Acquisition

  28. Automating Content Acquisition • Manually identify source(s): paid content, blogs • Locally cache Web pages: FlashGet • Parse HTML, build XML: Perl • Read XML, write tool’s format: LingPipe, SVMlight
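The "parse HTML, build XML" step can be sketched with the Python standard library. The project did this step in Perl; Python is used here only for illustration, and the `<doc>`/`<title>`/`<p>` output schema, the class and function names, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.current = None            # tag currently being captured
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self.current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current == "p":
            self.paragraphs[-1] += data

def html_to_xml(html):
    """Build a <doc> element with <title> and <p> children (a hypothetical schema)."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    doc = ET.Element("doc")
    ET.SubElement(doc, "title").text = parser.title.strip()
    for para in parser.paragraphs:
        ET.SubElement(doc, "p").text = para.strip()
    return ET.tostring(doc, encoding="unicode")

page = ("<html><head><title>VoIP grows</title></head>"
        "<body><p>Adoption rose.</p><p>Costs fell.</p></body></html>")
print(html_to_xml(page))
```

Real trade-press pages carry navigation, ads, and sidebars, so a production parser would need per-source rules for locating the article body.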

  29. Collections 6-month Pilot Study Collection (used to date) • Computerworld: 1 January 2005 – 30 June 2005 • 1,193 documents • 26 issues 10-year Trade Press Collection (now available) • Computerworld: 1 January 1998 – 9 June 2008 • 25,278 documents • 534 issues • Information Week: 1 January 1998 – 30 June 2008 • 31,112 documents • 527 issues

  30. Manual Document Classification • Coding frame: ProQuest innovation labels

  31. Automatic Document Classification • SVM, 6-month Computerworld pilot collection

  32. 10-Year Subject Label Distribution • 10-year Computerworld collection

  33. Automatic Annotation of Mentions • LingPipe, 6-month Computerworld pilot collection • [Figure: Precision, Recall, and F1]
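Precision, recall, and F1 for mention annotation are computed by comparing automatic annotations against manual ones. The sketch below assumes exact-match scoring over (start, end, type) triples, which is one common convention; the talk does not state the matching criterion, and the example mentions are made up.

```python
def prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of (start, end, type) mentions."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical annotations: the second predicted mention has the right span but the wrong type.
gold = {(0, 2, "ORG"), (6, 7, "SOFTWARE"), (10, 11, "ORG")}
predicted = {(0, 2, "ORG"), (6, 7, "ORG")}
print(prf(gold, predicted))
```

Under exact matching, a mention with correct boundaries but the wrong type counts as both a false positive and a false negative, which is why a more lenient partial-match score is sometimes reported alongside it.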

  34. Manual Selective Acquisition • In ABI/Inform, search for co-occurrence of two innovations

  35. Manual Co-occurrence Analysis
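Once documents are held locally, the co-occurrence counts gathered by manual search could in principle be computed directly. A minimal sketch, assuming document-level co-occurrence and a hypothetical `TERMS` list; the three sample documents are invented.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical term list; a real study would take these from the coding frame.
TERMS = ["SOA", "Web services", "VoIP"]

def cooccurrence(docs, terms=TERMS):
    """Count, for each pair of terms, the number of documents mentioning both."""
    counts = Counter()
    for doc in docs:
        present = [t for t in terms
                   if re.search(r"\b" + re.escape(t) + r"\b", doc, re.IGNORECASE)]
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

docs = [
    "SOA builds on Web services standards.",
    "Vendors pitch SOA and VoIP together.",
    "Web services and SOA dominate the agenda.",
]
print(cooccurrence(docs))
```

Pairs are keyed in the order the terms appear in `TERMS`, so (SOA, Web services) and (Web services, SOA) are never counted separately.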

  36. Next Steps for PopIT • Additional content types • Academic papers • Blogs • Interviews • Classification • Cross-domain (e.g., trade press : blogs) • Non-topical (e.g., sentiment) • Social network (e.g., opinion leaders) • Extraction • Non-entity (e.g., values)

  37. Build Tools, not “Solutions”

  38. Interdisciplinary Innovation Cycle • [Cycle diagram; labels: Language Technology, Transducer Technology, Application Systems, Application Technology]

  39. Some Collaboration Opportunities Content acquisition • API access to content providers • “Sandbox” collections, focused access • Diverse sources • Email, speech, multiple languages • Interaction trails • Query logs, clickstreams Computational linguistics • Cross-language co-reference Applications • Global diffusion of innovation

  40. References • Oard, D., “A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences,” CLIR-NEH Symposium on Cyberinfrastructure for the Humanities and Social Sciences, Washington, DC, September 2008. • Cheng, A.-S., Fleischmann, K.R., Wang, P. and Oard, D., “Advancing Social Science Research by Applying Computational Linguistics,” Annual Conference of the American Society for Information Science and Technology, Columbus, OH, October 2008. • Wang, P., “Chasing the Hottest IT: Effects of Information Technology Fashion on Organizations,” Best Paper Proceedings of the Academy of Management Annual Meeting, Philadelphia, PA, August 2007.

  41. The PopIT Team at Maryland • Ping Wang, pwang@umd.edu, +1-301-593-4518 • Doug Oard, oard@umd.edu, +1-301-405-7590 • Ken Fleischmann, kfleisch@umd.edu, +1-301-405-3989 • College of Information Studies, University of Maryland, Room 4105 Hornbake Bldg, South Wing, College Park, MD 20742-4325, USA • http://www.wam.umd.edu/~pwang/PopIT/
