
eTRACES at GESIS

eTRACES at GESIS. Brigitte Mathiak, Farag Ahmed and Andreas Oscar Kempf brigitte.mathiak@gesis.org Leipzig, 07-05-2012. eTRACES for Social Sciences. Text Re-Use. Context of the quotation. Knowledge transfer. Who cites whom? Transfer of ideas. Text Re-Use. Who influences whom?


Presentation Transcript


  1. eTRACES at GESIS Brigitte Mathiak, Farag Ahmed and Andreas Oscar Kempf brigitte.mathiak@gesis.org Leipzig, 07-05-2012

  2. eTRACES for Social Sciences Text Re-Use Context of the quotation Knowledge transfer

  3. Who cites whom? Transfer of ideas Text Re-Use Who influences whom? Why? Analysis • Tracking ideas through time for a number of applications: • Better ranking • Better filtering (based on ideas, not words) • Objective criteria on idea generation • To help literature analysis • Motivation of the author • Strengthening own arguments • Information for the reader • Separation • Critique • …

  4. Why eTRACES* is interesting • Text re-use instead of bibliometrics to find inter-document relationships • We are the first to use this on Social Sciences texts • Analysis of citation intention • Results become immediately available to the end user * from the GESIS point of view

  5. AP 5.1 Social Scientific Annotation • The Habermas-Luhmann debate • Habermas, Jürgen/Luhmann, Niklas (1971) Theorie der Gesellschaft oder Sozialtechnologie. Was leistet die Systemforschung? Frankfurt/Main: Suhrkamp. • We chose about 30 documents in that context • The texts are annotated with CiTO (Citation Typing Ontology) • Two dimensions: intention and type • The method is based on qualitative social science research • Especially reconstructive and sequential analysis

  6. Theoretical background for the Methodology „Erzähltheorie“ (narrative theory) by Fritz Schütze (1976, 1977) • Development of central categories for the formal analysis of stories („Erzählungen“) • Distinction between three different modes: story, description, argumentation/evaluation Expansion for this project: • Distinction between direct and indirect citation and paraphrasing/summarizing of authors in scientific texts

  7. Methodology • We start with reconstructive and sequential text analysis • When looking at citations, the functional reason for the citation is most important; it can be deduced from the overall context • Each text is segmented, and the segments are differentiated according to mode • That way describing, argumentative and evaluating passages can be identified and differentiated from the summarizing passages

  8. CiTO Excerpt from CiTO (Citation Typing Ontology)

  9. Direct Citation with Text-Reuse • Text-Reuse w/o source: Friedrich Schiller, Wilhelm Tell I,3 / Tell • Paraphrase with source

  10. Cites as authority • Includes quotation from • Refutes
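
The two annotation dimensions (intention and type) could be recorded roughly as follows. This is a minimal sketch, not the project's actual annotation format: the `Annotation` class, the document identifiers and the wording of the type labels are assumptions made for illustration. Only the three intention labels come from the slide, and they are mapped to the corresponding CiTO properties.

```python
from dataclasses import dataclass

CITO = "http://purl.org/spar/cito/"  # namespace of the Citation Typing Ontology

# Intention labels from the slide, mapped to CiTO object properties.
INTENTIONS = {
    "Cites as authority": CITO + "citesAsAuthority",
    "Includes quotation from": CITO + "includesQuotationFrom",
    "Refutes": CITO + "refutes",
}

# Hypothetical labels for the second dimension (type of citation).
TYPES = ("direct citation", "paraphrase with source", "text re-use without source")


@dataclass(frozen=True)
class Annotation:
    citing_doc: str      # hypothetical document identifier
    cited_doc: str       # hypothetical document identifier
    intention: str       # one of the keys of INTENTIONS
    citation_type: str   # one of TYPES

    def __post_init__(self):
        # Reject labels outside the two controlled vocabularies.
        if self.intention not in INTENTIONS:
            raise ValueError(f"unknown intention: {self.intention}")
        if self.citation_type not in TYPES:
            raise ValueError(f"unknown citation type: {self.citation_type}")

    def as_triple(self):
        """Return a (subject, predicate, object) triple using the CiTO property."""
        return (self.citing_doc, INTENTIONS[self.intention], self.cited_doc)
```

An annotation such as `Annotation("doc:17", "doc:habermas-luhmann-1971", "Refutes", "paraphrase with source")` can then be exported as a triple and used as a training label.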

  11. Annotation

  12. Goals of the annotation • The annotation will be (is already) used to • Train algorithms to find similar patterns automatically • Conduct original social science research • Support bibliometric research at our institute • We plan to annotate a second, distinctly different data set before the end of the project for comparison

  13. AP 2 Data Cleansing • The DGS corpus has 5,594 documents with 523,834 unique terms • It includes the proceedings of the German Society for the Social Sciences spanning 100 years • The texts are mostly German and available as PDF • Some are derived from OCR, newer ones have been converted directly

  14. AP 2 Data Cleansing [Diagram labels: (Re-)OCR; Text Extraction and Clean-up; Cleaned data; Unified Database; CitationRecommender with StandardSearch, FilteredSearch, VisualizedSearch, SentimentAnalysis]

  15. AP 2 Data Cleansing • PDF conversion proves to be difficult. Some of the OCR is based on bad scans, and there are missing line breaks and irregular spaces • By using a dictionary method, we are able to cope with most of those mistakes automatically [Figure captions: Scan with bad quality; Letters are spread too far apart, leading to spaces inside words]
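
The slide does not spell out the dictionary method, so the following is only a minimal sketch of one way such a repair could work: adjacent tokens are merged when their concatenation is a known word. The function name `repair_spaces`, the `max_parts` limit and the tiny word list in the usage example are assumptions, not the project's implementation.

```python
def repair_spaces(text, dictionary, max_parts=15):
    """Merge runs of adjacent tokens whose concatenation is a dictionary word.

    `dictionary` is a set of lower-case words. Punctuation attached to a
    token is not handled in this sketch.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        merged = None
        # Try the longest run first so "S i n n" is joined completely.
        for j in range(min(len(tokens), i + max_parts), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate.lower() in dictionary:
                merged = (candidate, j)
                break
        if merged:
            out.append(merged[0])
            i = merged[1]
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

With `dictionary = {"theorie", "der", "gesellschaft"}`, the input `"Theorie der Gesell schaft"` is repaired to `"Theorie der Gesellschaft"`. A greedy merge like this can also join two correct words that happen to form a compound, so a real pipeline would need an additional check, for example on word frequencies.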

  16. Data cleansing Statistics* • 155 untreated OCR documents were automatically identified and Re-OCRed * Function words were excluded from the corpus statistics

  17. Example: Topic Trends over time • Topic 1: frankfurt, luhmann, moderne, theorie, begriff, form, modernen, ordnung, macht, subjekt, soziologische, unterscheidung, differenz, sinn • Topic 2: internet, beziehungen, evaluation, daten, forschung, methoden, qualitative, online, verfahren, informationen, sozialforschung, netzwerke, gruppe, netzwerk • Topic 3: deutschen, menschen, geschichte, deutsche, deutschland, jahrhunderts, jahre, gesellschaft, jahren, jahrhundert, welt, kultur, revolution, krieg

  18. First results with text re-use • At first we found mainly references and duplicate documents • The algorithm is very robust against wrong space recognition • Example: • In fact, Weber’s sociology of religion turned into an ambivalent intellectualist and moralistic affirmation of asceticism, individualism, professionalism, and institutional rationalization. • The affirmative project of modernity is largely engaged in a reversion of Nietzsche’s critique, turning it into an ambivalent intellectualist and moralistic affirmation of asceticism, individualism, professionalism, and institutional rationalization. • We can see here that the context and intention are important
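
One way to make re-use detection insensitive to wrongly recognized spaces is to compare character n-grams after removing all whitespace and punctuation. The sketch below illustrates that idea only; it is not the algorithm of the Tracer tool, and the function names and the n-gram length of 12 are assumptions.

```python
import re


def shingles(text, n=12):
    """Character n-grams of the text with case, whitespace and punctuation removed."""
    normalized = re.sub(r"[\W_]+", "", text.lower())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def containment(a, b, n=12):
    """Share of the n-grams of the shorter text that also occur in the other text."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    small, large = (sa, sb) if len(sa) <= len(sb) else (sb, sa)
    return len(small & large) / len(small)
```

Because spaces are dropped before the n-grams are built, `"institutional rationalization"` and `"institu tional rational ization"` produce identical n-gram sets. Applied to the two example sentences above, the shared clause yields a high containment score even though the sentences attribute the statement to different sources, which is why context and intention still have to be analyzed separately.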

  19. Components – Current Status • StandardSearch is implemented ([Histo] Suche). • FilteredSearch: near duplicates can be filtered, based on the “Tracer” tool from ASV-Leipzig; more filters are to come • VisualisedSearch: supports users in exploring and navigating through the displayed results; it is still in the concept phase • SentimentSearch: improves the retrieved results by recommending specific articles to the user; it is also still in the concept phase

  20. CitationRecommender • Integrates all tools, available as mock-up

  21. The implemented component: [histo] Suche (search)

  22. Sentiment Analysis • The task of determining whether the opinion expressed in a piece of text is positive, negative, or neutral • Why sentiment analysis is important: • It supports decision making (hearing others' opinions about a certain thing) • For our goal, it supports citation search, e.g. ranking based on the quality of the work rather than on citation frequency

  23. Sentiment Analysis of Citation: Challenges • Citation context extraction: • Citation context boundaries can vary greatly, so a fixed window size might not include all citation terms • Citations in close proximity can interact with each other, which leads to ownership ambiguity for the surrounding words • The citing author's motivation is not easy to identify automatically, e.g. whether the citation is meant to persuade or to notify the reader about something, or whether its meaning is positive, negative or neutral
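
As a rough illustration of the alternative to a fixed window, the sketch below takes the whole sentence around each author-year marker as the citation context. The regular expression, the naive sentence splitter and the function name `citation_contexts` are assumptions for this example. It does not resolve the ownership ambiguity mentioned above: a sentence with two markers is simply returned once for each of them.

```python
import re

# Author-year markers such as "(Marcia 1980, S.161)"; other styles are not covered.
CITATION = re.compile(r"\(([A-ZÄÖÜ][\w-]+) (\d{4})[^)]*\)")


def citation_contexts(text):
    """Yield (author, year, sentence) for each author-year citation marker."""
    # Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        for match in CITATION.finditer(sentence):
            yield match.group(1), match.group(2), sentence
```

Abbreviations followed by a space (for example "S. 161") would split a sentence too early with this splitter, so a production version would need a proper sentence tokenizer.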

  24. Sentiment Analysis of Citation: Challenges • Not much work has been done in this regard, but it is very relevant, especially for the Humanities • It is a very tough problem, so even small advances are valuable, e.g. semi-automatic or partial processes

  25. Sentiment Analysis of Citation: Main Components • In order to perform a sentiment analysis of citation, we need: • Citing author information (name, address, organization etc.) • Citing article information (paper id, title, place of publication etc.) • Citation context (which words the citing author used to describe the cited article) • Cited article information • Author information • Paper information
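
The components listed above can be pictured as three linked records. The field names below are taken from the bullet points; the class names, types and defaults are assumptions, since the structures shown on the following slides are not part of this transcript.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    name: str
    address: str = ""
    organization: str = ""


@dataclass
class Paper:
    paper_id: str
    title: str
    place_of_publication: str = ""
    authors: List[Author] = field(default_factory=list)


@dataclass
class Citation:
    citing: Paper   # the article that contains the citation
    cited: Paper    # the article being cited
    context: str    # words the citing author used to describe the cited article
```

Keeping the citation context on the link between two papers, rather than on either paper, allows the same cited work to carry many contexts with different sentiment.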

  26. Author Information Structure

  27. Paper Information Structure

  28. Citation Information Structure

  29. Recommendation/Sentiment Search Overview [Diagram labels: query; Retrieval Model; Documents Database; Initial retrieved top n-documents; Citation context extraction; Author’s info. extraction; Sentiment analysis; Re-Ranking; Recommended top n-documents based on citation context analysis; Evaluation; User feedback] Example citation context: „Das Prozeßergebnis auf seiten des Individuums läßt sich in den Termini von Marcia (Marcia 1980, S.161) zwar gut beschreiben, die Prozeßqualität und insbesondere die 'Transaktionen' ... zwischen Außen- und Innenwelt bleiben im Verborgenen.“ (English: The outcome of the process on the part of the individual can be described well in Marcia's terms (Marcia 1980, p. 161), but the quality of the process and especially the 'transactions' ... between the outer and inner world remain hidden.) Cited reference: Marcia, James E. (1980). Identity in adolescence. In Joseph Adelson (Ed.), Handbook of adolescent psychology (pp. 159-187). New York: Wiley.

  30. Documents Re-ranking Cycle [Diagram labels: query; Initial retrieved documents; Process next document; Paper info. extraction; Paper id; Documents database; Cited on; Citation context; Citation context analysis to re-weight initial retrieved documents; Documents Re-ranking; document labels D1, D2, D3, Dm, Dn, D209, D1020]
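
The re-weighting step of this cycle might look like the sketch below. The scoring formula, the `alpha` weight and the polarity scale from -1 (negative) to +1 (positive) are assumptions chosen for illustration; the slides do not specify how citation context analysis changes the initial scores.

```python
def rerank(initial, context_polarity, alpha=0.5):
    """Re-weight retrieval scores by the mean polarity of each document's citation contexts.

    initial: list of (doc_id, retrieval_score) pairs from the retrieval model.
    context_polarity: dict mapping doc_id to a list of polarities in [-1, 1],
        one per citation context in which the document is cited.
    alpha: how strongly the citation sentiment shifts the retrieval score.
    """
    reranked = []
    for doc_id, score in initial:
        polarities = context_polarity.get(doc_id, [])
        # Documents without any citation context keep their original score.
        mean = sum(polarities) / len(polarities) if polarities else 0.0
        reranked.append((doc_id, score * (1.0 + alpha * mean)))
    return sorted(reranked, key=lambda item: item[1], reverse=True)
```

For `initial = [("D1", 1.0), ("D2", 0.9), ("D3", 0.8)]` and polarities `{"D1": [-1, -1], "D3": [1, 1]}`, the positively cited D3 moves ahead of D2 and the negatively cited D1 drops to the end.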

  31. More Future Work • Based on the citation temperature used in eAqua • We will color often cited works (e.g. Luhmann, Marx, Weber) based on the citation context • Agreeable sections will be distinguishable from controversial and from negatively judged sections

  32. Conclusion • Text re-use instead of bibliometrics to find inter-document relationships • Build an interactive tool that effectively supports citation context extraction • In eTRACES, citation ranking will be based on the quality of the work rather than on citation frequency • CitationRecommender supports social scientists in performing their information search tasks effectively
