
Textual and Quantitative Analysis: Towards a new, e-mediated Social Science

Khurshid Ahmad, Lee Gillam, and David Cheng, Department of Computing, University of Surrey.


Presentation Transcript


  1. Textual and Quantitative Analysis: Towards a new, e-mediated Social Science Khurshid Ahmad, Lee Gillam, and David Cheng Department of Computing, University of Surrey

  2. Outline • Think Tank • Rationality, Bounded Rationality and Sentiment • News Analysis and Sentiment Analysis • A method for identifying and extracting sentiment • Experiments and Evaluation • Conclusions and Future Work

  3. THINK TANK What is the connection between these pairs of terms: HAPPY & SAD; MORE & LESS; NORTH & SOUTH; AHEAD & BEHIND; HIGHER & LOWER; LOUDER & QUIETER; IN PROFIT & IN LOSS; OPERATIONAL & BROKEN; MORE EXPENSIVE & LESS EXPENSIVE; AT UNIVERSITY & AWAY FROM UNIVERSITY. METRO Thursday, June 28, 2005, p. 5.

  4. THINK TANK We rely on reviews and opinion polls of various kinds: • Film & TV reviews; book reviews; resort reviews; • Bank reviews; automobile reviews; white goods reviews; • Consumer surveys; ‘write your own’ reviews; • Newspaper editorials; editors’ choice. METRO Thursday, June 28, 2005, p. 5.

  5. THINK TANK • We rely on the sentiment of the reviewers, editors, investment experts, and …… • We do know the cost of durables, shares, holidays. • A reasonable price is rejected if the reviews are poor; an exorbitant price is acceptable if the reviews are good; • Bad reviews stick in the mind for longer than good reviews. METRO

  6. THINK TANK • We sometimes rely on the sentiment of the more vociferous in society; • The vociferous may call black white, and white black; • The vociferous may repudiate facts and purvey fiction. METRO

  7. THINK TANK An internal war may be due to bounded rationality: given certain structural conditions – emergent anarchy, economic scarcity, weakening state structures due to globalization – elites and groups make rational decisions to pursue their aims by violent means. Within the bounded context of their decision-making parameters, going to war may be entirely rational. Jackson, Richard (2004). ‘The Social Construction of Internal War’ In (Ed.) Richard Jackson. (Re)Constructing Cultures of Violence and Peace. Rodopi: Amsterdam/New York.

  8. THINK TANK • We rely on the sentiment of safety expressed by our near and dear, and the media • The dears may have been mugged or burgled: the falling crime rate does not alleviate the fear of crime → reassurance gap METRO

  9. THINK TANK A new bank has just been launched: Punter Smith has passed his judgement on the bank. Which of the two columns tells us that he likes the new outfit? Turney, Peter D. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL). Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).

  10. THINK TANK How can a machine detect the positive/negative sentiment from texts? We look at the collocation of words like excellent & poor in a text corpus. The pointwise mutual information (PMI) is computed between word1 & word2: $\mathrm{PMI}(word_1, word_2) = \log_2 \frac{p(word_1 \,\&\, word_2)}{p(word_1)\,p(word_2)}$. The semantic orientation (SO) of a phrase is given as: $\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{“excellent”}) - \mathrm{PMI}(phrase, \text{“poor”})$. Turney, Peter D. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL). Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).
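
A minimal sketch of the semantic orientation calculation cited on this slide, written in terms of co-occurrence counts. The function names, the example counts, and the 0.01 smoothing constant applied to the co-occurrence counts are illustrative assumptions, not values taken from the presentation.

```python
import math


def pmi(count_xy, count_x, count_y, total, smoothing=0.01):
    """Pointwise mutual information: log2( p(x & y) / (p(x) * p(y)) ).

    `smoothing` is added to the co-occurrence count so that a phrase
    never seen near a pivot word does not produce log(0) (assumption).
    """
    p_xy = (count_xy + smoothing) / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, hits_phrase, total):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor')."""
    return (pmi(hits_near_excellent, hits_phrase, hits_excellent, total)
            - pmi(hits_near_poor, hits_phrase, hits_poor, total))


# Hypothetical counts: the phrase occurs 200 times, 40 times near
# "excellent" and 10 times near "poor", in a corpus of 1,000,000 windows.
so = semantic_orientation(hits_near_excellent=40, hits_near_poor=10,
                          hits_excellent=1000, hits_poor=1000,
                          hits_phrase=200, total=1_000_000)
print(f"semantic orientation: {so:.3f}")  # positive means closer to "excellent"
```

A positive score suggests the phrase keeps company with the positive pivot, a negative score with the negative pivot; the choice of pivots remains the analyst's.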

  11. THINK TANK How can a machine detect the positive/negative sentiment from texts? We look at the collocation of words like excellent & poor in a number of texts.

  12. THINK TANK How can a machine detect the positive/negative sentiment from texts? We look at the collocation of words like excellent & poor in a number of texts. Note subjectivity: The analyst has chosen the pivotal words poor & excellent. How well can the method be adapted to other domains? Adaptive Information Extraction? For automatically choosing the pivots!

  13. THINK TANK Japanese yen/US dollar exchange rate (decreasing solid line); US consumer price index (increasing solid line); Japanese consumer price index (increasing dashed line), 1970:1 − 2003:5, monthly observations. Why is it that the Japanese consumer price index is following the same trend as the US CPI?

  14. THINK TANK The return series – the first difference values of the US dollar/Japanese yen exchange rate ($P_t - P_{t-1}$) between 1970-2003, monthly data

  15. THINK TANK The volatility series – the four-week moving average of the square of the changes in the values of the US dollar/Japanese yen exchange rate ($P_t - P_{t-1}$) between 1970-2003. [Figure annotation: High Volatility Clusters]
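
The return and volatility series described on these two slides can be sketched as follows. This is an illustration only: the function names and the short price list are made up, and the window of four observations simply mirrors the "four-week" moving average mentioned on the slide.

```python
def first_differences(prices):
    """Return series: P_t - P_{t-1} for each consecutive pair of prices."""
    return [prices[t] - prices[t - 1] for t in range(1, len(prices))]


def moving_average_of_squares(returns, window=4):
    """Volatility proxy: moving average of squared changes over `window` points."""
    squares = [r * r for r in returns]
    return [sum(squares[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(squares))]


# Hypothetical exchange-rate observations (not data from the presentation).
prices = [100, 102, 101, 105, 104, 104]
returns = first_differences(prices)              # [2, -1, 4, -1, 0]
volatility = moving_average_of_squares(returns)  # [5.5, 4.5]
print(returns, volatility)
```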

  16. THINK TANK • Robert Engle’s contribution: Volatility may vary considerably over time: large (small) changes in returns are followed by large (small) changes. Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica Vol. 50, pp. 987-1007.
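
To illustrate how volatility clustering arises, here is a small simulation of an ARCH(1) process, in which the conditional variance is $h_t = \omega + \alpha\,\varepsilon_{t-1}^2$ and the shock is $\varepsilon_t = \sqrt{h_t}\,z_t$ with standard normal $z_t$. The parameter values, the seed, and the function name are arbitrary assumptions chosen for the sketch, not estimates from the presentation.

```python
import math
import random


def simulate_arch1(n, omega=0.2, alpha=0.7, seed=0):
    """Simulate n steps of ARCH(1); returns (shocks, conditional variances)."""
    rng = random.Random(seed)
    shocks, variances = [], []
    previous_shock = 0.0
    for _ in range(n):
        # Today's variance depends on the size of yesterday's shock.
        h_t = omega + alpha * previous_shock ** 2
        eps_t = math.sqrt(h_t) * rng.gauss(0.0, 1.0)
        variances.append(h_t)
        shocks.append(eps_t)
        previous_shock = eps_t
    return shocks, variances


shocks, variances = simulate_arch1(500)
# A large shock raises the next period's variance, so large changes tend
# to be followed by large changes (of either sign).
print(max(variances), min(variances))
```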

  17. THINK TANK Engle and Ng have developed the concept of the news impact curve. • To condition at time t on the information available at t − 2 and thus consider the effect of the shock $\varepsilon_{t-1}$ on the conditional variance $h_t$ in isolation. • The conditional variance is affected by the latest information, “the news” $\varepsilon_{t-1}$: • The symmetric case: Both positive and negative news have the same effect. • The asymmetric case: a positive and an equally large negative piece of “news” do not have the same effect on the conditional variance. Engle, R. F. and Ng, V. K. (1993). Measuring and testing the impact of news on volatility, Journal of Finance Vol. 48, pp. 1749-1777.

  18. THINK TANK [Figure: news impact curves; panels: Asymmetric case, Symmetric case] Engle, R. F. and Ng, V. K. (1993). Measuring and testing the impact of news on volatility, Journal of Finance Vol. 48, pp. 1749-1777.
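
The symmetric and asymmetric cases can be sketched as two news impact functions that map the shock $\varepsilon_{t-1}$ to the conditional variance $h_t$. The symmetric curve is a plain quadratic; for the asymmetric curve a threshold term that adds extra weight to negative shocks is assumed here as one possible specification. The constants and function names are illustrative, not taken from the slides.

```python
def symmetric_impact(shock, base=0.1, alpha=0.3):
    """Symmetric curve: h_t = base + alpha * shock^2."""
    return base + alpha * shock ** 2


def asymmetric_impact(shock, base=0.1, alpha=0.3, gamma=0.2):
    """Asymmetric curve: negative shocks get the extra weight gamma (assumption)."""
    extra = gamma if shock < 0 else 0.0
    return base + (alpha + extra) * shock ** 2


for shock in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(shock, symmetric_impact(shock), asymmetric_impact(shock))
```

Under these assumed parameters, good and bad news of the same size give the same variance on the symmetric curve, while bad news gives a higher variance on the asymmetric one.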

  19. Rationality, Bounded Rationality and Sentiment • News Effects • I: News Announcements Matter, and Quickly; • II: Announcement Timing Matters • III: Volatility Adjusts to News Gradually • IV: Pure Announcement Effects are Present in Volatility • V: Announcement Effects are Asymmetric – Responses Vary with the Sign of the News; • VI: The effect on traded volume persists longer than on prices. Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959

  20. Rationality, Bounded Rationality and Sentiment The following statements are based entirely on statistical analysis of quantitative data: • Bad news in “good times” should have an unusually large impact • In a purely ‘good times’ sample, “bad news should have unusually large effects”.

  21. Rationality, Bounded Rationality and Sentiment • On average, the effect of macroeconomic news often varies with its sign. In particular, negative surprises often have greater impact than positive surprises. Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959

  22. Rationality, Bounded Rationality and Sentiment • So, where is the news? It is not the news but the timing of the announcement → the timings are used as an information proxy. Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959

  23. Rationality, Bounded Rationality and Sentiment • Firm-level Information Proxies: • Closed-end fund discount (CEFD); • Turnover ratio (in NYSE for example) (TURN); • Number of Initial Public Offerings (N-IPO); • Average first-day returns on IPOs (R-IPO); • Equity share (S); • Dividend premium (P); • Age of the firm, external finance, ‘size’ (log(equity)), ... • Each sentiment proxy is likely to include a sentiment component as well as idiosyncratic or non-sentiment-related components. Principal components analysis is typically used to isolate the common component. • A novel composite index built using Factor Analysis: • $\mathrm{Sentiment}_t = -0.358\,CEFD_t + 0.402\,TURN_{t-1} + 0.414\,NIPO_t + 0.464\,RIPO_t + 0.371\,S_t - 0.431\,P_{t-1}$ Baker, M., and Wurgler, J. (2004). "Investor Sentiment and the Cross-Section of Stock Returns," NBER Working Papers 10449, Cambridge, Mass.: National Bureau of Economic Research, Inc.
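
As a rough sketch, the composite index on this slide is a weighted sum of the six proxies, two of which (TURN and P) enter with a one-period lag. The weights and lags below are copied from the slide; the function name, the dictionary layout, and the assumption that each series has already been standardised are illustrative.

```python
# Weight and lag for each proxy, as written on the slide.
WEIGHTS = {
    "CEFD": (-0.358, 0),
    "TURN": (0.402, 1),
    "NIPO": (0.414, 0),
    "RIPO": (0.464, 0),
    "S": (0.371, 0),
    "P": (-0.431, 1),
}


def sentiment_index(proxies):
    """Composite sentiment for t = 1 .. T-1 (t = 0 is lost to the lag).

    `proxies` maps each proxy name to a list of (assumed standardised)
    values of equal length.
    """
    length = len(proxies["CEFD"])
    index = []
    for t in range(1, length):
        value = sum(weight * proxies[name][t - lag]
                    for name, (weight, lag) in WEIGHTS.items())
        index.append(value)
    return index


# Hypothetical input: every proxy equal to 1.0 in all three periods.
flat = {name: [1.0, 1.0, 1.0] for name in WEIGHTS}
print(sentiment_index(flat))  # each value is the sum of the six weights
```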

  24. Rationality, Bounded Rationality and Sentiment • So, where are the news and financial data? There is plenty of both, but in a noisy state. • Today’s news and figures may contradict yesterday’s or, worse still, reinforce false hopes and prejudices. • The financial news and data are truly organic data – not manufactured in a laboratory.

  25. The Surrey Society Grids Project A 24-node data and compute cluster (64 cpus) interfaced to a ‘real world’ data stream (Reuters News and Financial Time series Feed) for capturing, analysing and fusing quantitative and ‘qualitative’ data. A small but well-formed grid – for creating a data nursery Reuters Feed: 2 dedicated data lines, PC and Sun for feed management and associated networking

  26. Surrey Society Grids Architecture [Diagram labels: Text and Time Series Service; Streaming Textual Data; Streaming Numeric Data; Send Service Request; Distribute Tasks; Receive Results; Notify user about results; Main Cluster; GRID Cluster, 24 Slaves; Surrey Grid; steps numbered 1-4] • Given an allocated task, the corresponding data is retrieved from the data providers by the slave machines. • The main cluster monitors the slave machines until they have completed their tasks, and subsequently combines the interim results. • The final result is sent back to the client machine.
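
The distribute, compute, and combine cycle described on this slide can be sketched on a single machine, with a thread pool standing in for the slave machines. Everything here is hypothetical: the function names, the dictionary used as a "data provider", and the word-count task are placeholders, not the project's actual software.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_data(task_id, provider):
    """A worker retrieves the data for its allocated task from the provider."""
    return provider[task_id]


def worker(task_id, provider):
    """Process one task: here, count word frequencies in the task's texts."""
    counts = Counter()
    for text in fetch_data(task_id, provider):
        counts.update(text.lower().split())
    return counts


def run_job(provider, n_workers=4):
    """Main node: distribute tasks, wait for completion, combine interim results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker, task_id, provider) for task_id in provider]
        interim_results = [future.result() for future in futures]
    combined = Counter()
    for counts in interim_results:
        combined.update(counts)
    return combined  # the final result that would be sent back to the client


# Hypothetical provider holding two tasks' worth of news text.
provider = {0: ["shares rose", "shares fell"], 1: ["profit rose"]}
final = run_job(provider)
print(final.most_common(3))
```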

  27. Surrey Society Grids: Streaming Data STREAMING ECONOMIC/POLITICAL NEWS – Reuters; Yahoo; Bloomberg; BBC; Al Jazeera

  28. Surrey Society Grid: Performance • Increasing the throughput • We have created a 24-node grid infrastructure, which can provide access to up to 64 processors simultaneously • Processing the (complete) RCV1 corpus: 181 million words in 806,791 texts

  29. Surrey Society Grid: Performance Automatic extraction and annotation of sentiment-bearing words in a 1,000,000-word text corpus – four days’ output from the Reuters news feed – using automatically extracted key words and an automatically extracted local grammar for pattern identification.

  30. Surrey Society Grid: Algorithms and Methods • We have developed a for visualising and correlating the sentiment and instrument time series both as text (and numbers) and graphically as well.

  31. Surrey Society Grid: Algorithms and Methods • Interface the grid to local news media (e.g. Bradford Argus & Burnley Express) and local data repositories – crime statistics (crime surveys and police data), ethnicity compliance data, housing queues, field data

  32. Surrey Society Grid: Social Science Data? Language and text are constitutive (and not merely representational), but ‘society is not reducible to language and linguistic analysis’ (Hodgson 2000:62). ‘Discourses are broader than language, being constituted not just in texts, but also in definite institutional and organizational practices’ (Jackson 2004). But text is all we have after the event, the interview, the survey.

  33. Surrey Society Grid: Social Science Data?

  34. Surrey Society Grid: Social Science Data? • There is no visible technique in social science research methodology that can improve the researcher’s productivity in collecting and analysing large volumes of speech and text. • Social scientists survey, and occasionally interview, interesting individuals in various social groups – analyse the survey form and quantify. • So what about the data collected in the field? Data is buried in tombs, never to be taken out again. • Most text, if coded at all, is hand-coded by the social science researcher, and then the proxy of the interpretation of the codes is presented as objective analysis.

  35. Surrey Society Grid: A Case Study • We present a method for systematically identifying sentiment-bearing phrases in large volumes of streaming texts – a local grammar comprising templates to extract the phrases with a minimal number of false positives. • The sentiments are aligned with quantitative (time-varying) information and the results are co-integrated and tested for Granger causality. • The grammar itself is constructed automatically from a corpus of domain-specific texts.
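
To make the Granger causality step concrete, here is a deliberately small sketch with a single lag: it asks whether adding $x_{t-1}$ to a regression of $y_t$ on $y_{t-1}$ reduces the residual sum of squares, and reports the corresponding F statistic. It does not cover the co-integration step, uses synthetic data, and all names are illustrative; a real analysis would normally rely on an established statistics package and report p-values.

```python
import random


def ols_rss(rows, target):
    """Residual sum of squares of a least-squares fit (rows include the intercept)."""
    k = len(rows[0])
    # Build the normal equations (X'X) b = X'y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    # Solve by Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for j in range(col, k):
                a[row][j] -= factor * a[col][j]
            c[row] -= factor * c[col]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (c[i] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, target))


def granger_f_lag1(x, y):
    """F statistic for H0: x_{t-1} adds nothing to predicting y_t beyond y_{t-1}."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = ols_rss(restricted, target)
    rss_u = ols_rss(unrestricted, target)
    return (rss_r - rss_u) / (rss_u / (len(target) - 3))


# Synthetic example: "returns" follow yesterday's "sentiment" plus noise.
rng = random.Random(1)
sentiment = [rng.gauss(0.0, 1.0) for _ in range(200)]
returns = [0.0] + [0.8 * sentiment[t - 1] + 0.1 * rng.gauss(0.0, 1.0)
                   for t in range(1, 200)]
f_sentiment_to_returns = granger_f_lag1(sentiment, returns)
f_returns_to_sentiment = granger_f_lag1(returns, sentiment)
print(f_sentiment_to_returns, f_returns_to_sentiment)
```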

  36. Surrey Society Grid: A Case Study • Of all the contested boundaries that define the discipline of sociology, none is more crucial than the divide between sociology and economics […] Talcott Parsons, for all [his] synthesizing ambitions, solidified the divide. “Basically,” […] “Parsons made a pact ... you, economists, study value; we, the sociologists, will study values.” • If the financial markets are the core of many high-modern economies, so at their core is arbitrage: the exploitation of discrepancies in the prices of identical or similar assets. • Arbitrage is pivotal to the economic theory of financial markets. It allows markets to be posited as efficient without all individual investors having to be assumed to be economically rational. MacKenzie, Donald. 2000b. “Long-Term Capital Management: a Sociological Essay.” In Ökonomie und Gesellschaft, (Eds.) Herbert Kalthoff, Richard Rottenburg and Hans-Jürgen Wagener. Marburg: Metropolis. Pp. 277-287.

  37. Rationality, Bounded Rationality and Sentiment • A financial economist can analyse quantitative data using a large body of methods and techniques in statistical time series analysis on “fundamental data”, related, for example, to fixed assets of an enterprise, and on “technical data”, for example, share price movement; • The economist can study the behaviour of a financial instrument, for example individual shares or currencies, or aggregated indices associated with stock exchanges, by looking at the changes in the value of the instrument at different time scales – ranging from minutes to decades; • Financial investors/traders are trying to discover the market sentiment, looking for consensus in expectations, rising prices on falling volumes, and information/assistance from back-office analysts; • The efficient market hypothesis suggests that quirks caused by sentiments can be rectified by the supposed inherent rationality of the majority of the players in the market

  38. Rationality, Bounded Rationality and Sentiment • Recent developments in financial economics, signified by the emergence of derivatives and arbitrage, show the triumph of rational reasoning: such instruments/strategies were created on the basis of mathematical models (Black and Scholes 1972), and the trading can be monitored using the selfsame models (Miller 1990); • The assumption of overarching rational behaviour has been reviewed by Herbert Simon (1978/1992) and Daniel Kahneman (2003), and arguments have been presented in favour of a model of bounded rationality where the actors in a given social situation prefer to ignore facts and trust their own version of reality, and the efficient market mechanisms fail to operate.

  39. News Analysis and Sentiment Analysis • Qualitative research methods are being used in financial economics, and in sociological studies of financial markets, for systematically studying the hopes and fears of the traders, investors, and regulators in the analysis of the behaviour of the markets. • Since 2000, the analysis of news wire has become selective and targeted. • Some researchers choose news related to economic and financial topics, for example news about employment; • Some distinguish between scheduled and non-scheduled news announcements.

  40. News Analysis and Sentiment Analysis • Some pre-select keywords that indicate change in the value of a financial instrument – including metaphorical terms like above, below, up and down – and use them to ‘represent’ positive/negative news stories. • Some use the frequency of collocational patterns for assigning a ‘feel-good/bad’ score to the story • ‘Good’ news stories appear to comprise collocates like revenues rose, share rose; • ‘Bad’ news stories contain profit warning, poor expectation; • ‘Neutral’ stories contain collocates such as announces product, alliance made; • The ‘sentiment’ of the story is then correlated with that of a financial instrument cited in the stories and inferences made. DeGennaro, R., and R. Shrieves (1997): ‘Public information releases, private information arrival and volatility in the foreign exchange market’. Journal of Empirical Finance Vol. 4, pp. 295–315; Koppel, M. and Shtrimberg, I. (2004). ‘Good News or Bad News? Let the Market Decide’. In AAAI Spring Symposium on Exploring Attitude and Affect in Text. Palo Alto: AAAI Press. pp. 86-88.
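
A collocation-based ‘feel-good/bad’ score of the kind described on this slide might look like the sketch below. The collocates are the examples given on the slide; the scoring rule (good matches minus bad matches, with neutral collocates ignored), the plain substring matching, and the function name are simplifying assumptions.

```python
GOOD_COLLOCATES = ("revenues rose", "share rose")
BAD_COLLOCATES = ("profit warning", "poor expectation")


def feel_good_score(story):
    """Count good and bad collocates in a story and return (good, bad, score)."""
    text = story.lower()
    good = sum(text.count(phrase) for phrase in GOOD_COLLOCATES)
    bad = sum(text.count(phrase) for phrase in BAD_COLLOCATES)
    return good, bad, good - bad


# Hypothetical news story, not taken from the Reuters corpus.
story = ("Revenues rose sharply despite an earlier profit warning; "
         "revenues rose again in the second quarter.")
print(feel_good_score(story))  # (2, 1, 1)
```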

  41. A method for identifying and extracting sentiment • No proxies – but the real data • We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus, • A five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base

  42. An algorithm for identifying and extracting sentiment • Select training corpora: a randomly sampled special language corpus and a general language corpus. • Extract key words; • Extract key collocates; • Extract local grammar using collocation analysis and relevance feedback; • Assert the grammar as a finite state automaton.

  43. Experiments and Evaluation of sentiment analysis method I. Select training corpora • The British National Corpus, comprising 100 million tokens distributed over 4124 texts (Aston and Burnard 1998); • Reuters Corpus Volume 1 (RCV1), comprising news texts produced in 1996-1997 and containing 181 million words distributed over 806,791 texts.

  44. Experiments and Evaluation of sentiment analysis method • II. Extract key words • The frequencies of individual words in the RCV1 were computed using System Quirk; • For describing how our method works we will use a randomly selected component of the corpus – the output of February 1997, henceforth referred to as the RCV1-Feb97 corpus; • The RCV1-Feb97 corpus contains 14 million words distributed over 63,364 texts.
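
The slides do not spell out the statistic used to pick key words, so the sketch below assumes one simple option: rank each word by the ratio of its relative frequency in the special-language corpus to its relative frequency in the general-language corpus. The function name, the add-one adjustment for words absent from the general corpus, and the toy token lists are all illustrative.

```python
from collections import Counter


def keyword_ratios(special_tokens, general_tokens):
    """Rank words by relative frequency in the special vs. the general corpus."""
    special = Counter(special_tokens)
    general = Counter(general_tokens)
    n_special = len(special_tokens)
    n_general = len(general_tokens)
    ratios = {}
    for word, freq in special.items():
        rel_special = freq / n_special
        # Add one to the general count so unseen words do not divide by zero.
        rel_general = (general[word] + 1) / n_general
        ratios[word] = rel_special / rel_general
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy corpora standing in for the news corpus and the general corpus.
special_tokens = "shares rose shares fell the market".split()
general_tokens = "the cat sat on the mat the end".split()
ranked = keyword_ratios(special_tokens, general_tokens)
print(ranked[:3])
```

Words that are much more frequent in the specialist texts than in general language rise to the top, while common function words such as "the" fall below a ratio of one.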

  45. Experiments and Evaluation of sentiment analysis method

  46. Experiments and Evaluation of sentiment analysis method

  47. Experiments and Evaluation of sentiment analysis method III. Extract key collocates

  48. Experiments and Evaluation of sentiment analysis method IV. Extract local grammar using collocation and relevance feedback

  49. Experiments and Evaluation of sentiment analysis method V. Assert the grammar as a finite state automaton • The (re-)collocation patterns can then be asserted as finite state automata for each of the movement verbs and spatial preposition metaphors
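
As an illustration of asserting a collocation pattern as a finite state automaton, the sketch below accepts phrases of the form "instrument, movement verb, optional 'by', number, unit", such as "shares rose by 5 percent". The vocabulary, the state names, and the pattern itself are hypothetical examples, not the grammar extracted in the experiments.

```python
# Hypothetical vocabulary for one movement-verb pattern.
INSTRUMENTS = {"shares", "stocks", "prices", "index"}
MOVEMENT_VERBS = {"rose", "fell", "climbed", "dropped"}
UNITS = {"percent", "%"}


def token_class(token):
    """Map a token to the symbol class the automaton reads."""
    if token in INSTRUMENTS:
        return "INSTRUMENT"
    if token in MOVEMENT_VERBS:
        return "VERB"
    if token == "by":
        return "BY"
    if token.replace(".", "", 1).isdigit():
        return "NUMBER"
    if token in UNITS:
        return "UNIT"
    return "OTHER"


# Transition table: (current state, symbol class) -> next state.
TRANSITIONS = {
    ("start", "INSTRUMENT"): "subject",
    ("subject", "VERB"): "verb",
    ("verb", "BY"): "by",
    ("verb", "NUMBER"): "number",
    ("by", "NUMBER"): "number",
    ("number", "UNIT"): "accept",
}


def accepts(tokens):
    """Run the automaton; True only if it ends in the accepting state."""
    state = "start"
    for token in tokens:
        state = TRANSITIONS.get((state, token_class(token.lower())))
        if state is None:  # no transition defined: reject
            return False
    return state == "accept"


print(accepts("shares rose by 5 percent".split()))    # True
print(accepts("shares announced 5 percent".split()))  # False
```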

