
Stance Classification for Fact-Checking

Lecture: Web Science, 04.06.2019, Luca Brandt.


Presentation Transcript


  1. Stance Classification for Fact-Checking. Lecture: Web Science, 04.06.2019, Luca Brandt

  2. Table of Contents
     • Introduction
     • Motivation
     • Fact-Checking Process
     • Paper 1 – Fake News Challenge
     • Paper 2 – Relevant Document Discovery for Fact-Checking Articles
     • Future Work
     Luca Brandt Stance Classification Web Science 2019

  3. Introduction
     • What is Fake News?
       • Propaganda (not objective information)
       • Deliberate disinformation/hoaxes
       • Reporters paying sources for stories
       • Made-up stories
     • Problems with Fake News
       • Spreads misinformation
       • Reduces trust in news media
       • Manipulates society

  4. Introduction
     • What is Fact-Checking?
       • Process to check the veracity and correctness of a claim/statement
       • In non-fictional texts
       • A classic journalism task
     • Claim
       • Statement made by a politician
       • Story published by another journalist
       • Rumor on social media
       • Etc.

  5. Example Claim

  6. Motivation
     • Why Fact-Checking?
       • Goal is to provide a verdict on whether a claim is true, false, or mixed
       • Fight misinformation
       • Identify Fake News
       • Provide context so users can understand information better
     • Fact-Checking and AI
       • Automatically detect Fake News
       • Automatically gather documents relevant to a claim

  7. Fact-Checking Process
     • Given a claim/statement
     • Find documents relevant to the claim
       • Classic information retrieval task
     • Understand the stance of the relevant documents
       • Stance classification, a classification problem
     • Give a verdict on whether the claim is true or false
       • Classification problem

  8. Fake News Challenge (FNC-1)
     • Foster development of AI technology to detect Fake News
     • 50 teams participated from industry and academia
     • Task: stance detection of an entire document
       • Learn classifier f: (document, headline) -> stance {AGR, DSG, DSC, UNR}, i.e. agree, disagree, discuss, unrelated
     • Dataset: 300 topics – claims with 5-20 documents each
       • Every document summarized to a headline
       • Each document matched with every headline to generate the dataset
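
The pairing step on this slide can be sketched as follows. This is only an illustration: the function name `build_pairs`, the tuple layouts, and the labeling rule (cross-topic pairs become UNR, same-topic pairs inherit the document's annotated stance) are assumptions, not details given on the slide.

```python
from itertools import product

def build_pairs(documents, headlines):
    """Match every document with every headline.

    documents: list of (doc_id, topic, stance) tuples (hypothetical layout)
    headlines: list of (headline_id, topic) tuples (hypothetical layout)
    Assumption: a pair from different topics is labeled UNR; a pair from
    the same topic inherits the document's annotated stance.
    """
    pairs = []
    for (doc_id, doc_topic, stance), (head_id, head_topic) in product(documents, headlines):
        label = stance if doc_topic == head_topic else "UNR"
        pairs.append((doc_id, head_id, label))
    return pairs

# Toy data, invented for illustration only.
documents = [("d1", "t1", "AGR"), ("d2", "t1", "DSG"), ("d3", "t2", "DSC")]
headlines = [("h1", "t1"), ("h2", "t2")]
pairs = build_pairs(documents, headlines)
```

Under this assumption, the number of cross-topic pairs grows much faster than the number of same-topic pairs as topics are added, which is one plausible reason for UNR to dominate such a dataset.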

  9. A Retrospective Analysis of the Fake News Challenge Stance Detection Task
     • By Hanselowski et al.
     • 13 June 2018
     • Main contributions:
       • First paper to summarize and analyze the FNC-1
       • Reproduction of the results of the top-3 performers
       • Proposed a new evaluation metric
       • Proposed a new model

  10. Top-3 Performers
     • 1. TalosComb
       • Weighted average of a deep convolutional network (TalosCNN) and a gradient-boosted decision tree (TalosTree)
       • TalosCNN uses pre-trained word2vec embeddings
       • TalosTree is based on word counts, TF-IDF, sentiment, and word2vec embeddings
     • 2. Athene
       • Multilayer perceptron (MLP) with 6 hidden layers and handcrafted features
       • Unigrams, cosine similarity, topic models
     • 3. UCL Machine Reading (UCLMR)
       • Also an MLP, but with only 1 hidden layer
       • Term frequency vectors of the 5000 most frequent unigrams
       • Cosine similarity between the TF-IDF vectors of headline and document
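
The cosine-similarity feature listed for UCLMR can be illustrated with a small standard-library sketch. The helper names `tokenize`, `tf_idf_vectors`, and `cosine` are hypothetical, the smoothed IDF formula is one common variant rather than the one the team necessarily used, and the 5000-unigram vocabulary limit is not implemented here.

```python
import math
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def tf_idf_vectors(texts):
    """Return one sparse TF-IDF vector (dict word -> weight) per text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in term_freq.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy texts, invented for illustration only.
headline = "pope endorses candidate"
related_body = "the pope endorses the candidate in a speech"
unrelated_body = "new phone released this week"
vec_headline, vec_related, vec_unrelated = tf_idf_vectors(
    [headline, related_body, unrelated_body]
)
sim_related = cosine(vec_headline, vec_related)
sim_unrelated = cosine(vec_headline, vec_unrelated)
```

A single similarity score like this separates related from unrelated pairs fairly well, but it says nothing about whether a related document agrees or disagrees with the headline.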

  11. Problems with the Metric and Dataset
     • Hierarchical metric
       • .25 points if correctly classified as related {AGR, DSG, DSC} or unrelated {UNR}
       • .75 additional points if correctly classified as AGR, DSG, or DSC
     • But the related class is imbalanced
       • Not difficult to predict related vs. unrelated (best systems reach 99% on UNR)
       • Correctly predicting related vs. unrelated and always picking DSC would achieve an FNC-1 score of .833 -> better than the winner
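
A minimal sketch of the hierarchical scoring rule described on this slide, applied to the "always DSC" baseline. The function names are hypothetical, and the label distribution below is a toy example, not the real FNC-1 test set, so the resulting value differs from the .833 quoted on the slide.

```python
RELATED = {"AGR", "DSG", "DSC"}

def fnc1_score(gold, predicted):
    """Hierarchical score: .25 for the right related/unrelated decision,
    plus .75 when a related pair also gets the exact stance."""
    score = 0.0
    for g, p in zip(gold, predicted):
        if (g in RELATED) == (p in RELATED):
            score += 0.25
            if g in RELATED and g == p:
                score += 0.75
    return score

def relative_fnc1_score(gold, predicted):
    """Score normalized by the best achievable score on this gold set."""
    return fnc1_score(gold, predicted) / fnc1_score(gold, gold)

# Toy gold labels (hypothetical distribution, for illustration only).
gold = ["UNR"] * 6 + ["DSC"] * 4 + ["AGR", "DSG"]
# Baseline: perfect related/unrelated split, always DSC for related pairs.
baseline = ["UNR" if g == "UNR" else "DSC" for g in gold]
baseline_score = relative_fnc1_score(gold, baseline)
```

On this toy set the baseline reaches 6.0 of 7.5 possible points, a relative score of 0.8, without ever recognizing an agreeing or disagreeing document.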

  12. Their Model and Metric
     • F1m metric
       • Class-wise F1 scores, macro-averaged to the F1m score
       • F1 = 2 · precision · recall / (precision + recall)
       • Not affected by the large size of the majority class
       • Naive approach of predicting UNR correctly and always DSC for related pairs -> F1m = .444
     • StackLSTM
       • Combines the best features from their feature test
       • Concatenated GloVe word embeddings fed through 2 stacked LSTMs
       • Understand the meaning of the whole sentence
       • Hidden state of the LSTMs fed through a 3-layer neural network
       • Softmax to obtain class probabilities
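
The macro-averaged F1 can be sketched in a few lines. The function names are hypothetical and the labels are a toy distribution, not the FNC-1 test set, so the value differs from the .444 on the slide; the point is only to show how a baseline that never predicts AGR or DSG is penalized.

```python
LABELS = ("AGR", "DSG", "DSC", "UNR")

def f1_for_class(gold, predicted, label):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, predicted, labels=LABELS):
    """Unweighted mean of the class-wise F1 scores."""
    return sum(f1_for_class(gold, predicted, label) for label in labels) / len(labels)

# Toy gold labels (hypothetical distribution, for illustration only).
gold = ["UNR"] * 6 + ["DSC"] * 4 + ["AGR", "DSG"]
# Naive baseline: UNR predicted correctly, every related pair labeled DSC.
naive = ["UNR" if g == "UNR" else "DSC" for g in gold]
naive_f1m = macro_f1(gold, naive)
```

Here the class-wise scores are 1.0 (UNR), 0.8 (DSC), 0 (AGR), and 0 (DSG), giving a macro average of 0.45: every class counts equally regardless of its size.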

  13. Reproduction of Results

  14. Pros & Cons
     • Pros:
       • First paper to summarize and analyze the FNC-1
       • First paper to reproduce the results
       • Proposed a better metric
       • Proposed a new model that is better than the state of the art
     • Cons:
       • The proposed model still has low accuracy on the Disagreeing class

  15. Relevant Document Discovery for Fact-Checking Articles
     • Paper by Wang et al.
     • 23 April 2018
     • Main contributions:
       • End-to-end system for relevant document discovery for fact-checking articles
       • Better than state-of-the-art stance classification
       • Better than state-of-the-art relevance classification

  16. Fact-Checking Articles
     • Adopted the Schema.org ClaimReview markup
       • Provides the structure of an article
     • Key fields on top of the content:
       • Claim
       • Claimant
       • Verdict
     • Structured fields cannot provide the documents relevant to the claim
     • Identifying claim-relevant documents is extremely useful

  17. Fact-Checking Article & Claim-Relevant Doc

  18. Overview of Their System

  19. Candidate Generation
     • Via navigation
       • Outgoing links from the fact-checking article
       • But most of them are not relevant
     • Via search with Google
       • Key challenge: generating the right set of queries
       • Text from title and claim
       • Title and claim text transformed with entity annotations
       • Click-graph queries
     • Combining both generates about 2400 related documents per fact-checking article

  20. Relevance Classification
     • Classifier M: (f, d) -> {relevant, irrelevant}, with f the fact-checking article and d the related document
     • Features:
       • Building confidence vectors of entities
       • Cosine similarity between the confidence vectors of
         • Claim and text/sentence/paragraph of the related doc
         • Sentence of the fact-checking article and sentence of the related doc
         • And whole documents
       • Publication date
     • Gradient-boosted decision tree
       • Combines all features and predicts relevant or irrelevant
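
One way to picture the similarity features is to represent each text span as a sparse vector mapping entities to confidence scores and compare spans at several granularities. Everything below is an illustrative assumption: the entity names and scores are invented, the feature names are hypothetical, and taking the maximum over sentences is just one plausible way to aggregate.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def relevance_features(claim_vec, sentence_vecs, document_vec):
    """Similarity of the claim to the best-matching sentence and to the whole doc.

    Assumption: sentence-level similarity is aggregated with max().
    """
    best_sentence = max((cosine(claim_vec, s) for s in sentence_vecs), default=0.0)
    return {
        "claim_vs_best_sentence": best_sentence,
        "claim_vs_document": cosine(claim_vec, document_vec),
    }

# Hypothetical entity confidence vectors (entity -> confidence).
claim = {"Eiffel Tower": 0.9, "Paris": 0.7}
sentences = [{"Eiffel Tower": 0.8, "Lyon": 0.6}, {"Weather": 0.9}]
document = {"Eiffel Tower": 0.8, "Lyon": 0.6, "Weather": 0.9}
features = relevance_features(claim, sentences, document)
```

In this toy case the sentence-level feature is higher than the document-level one, because unrelated entities elsewhere in the document dilute the whole-document vector. Features like these would then be passed, together with the publication date, to the gradient-boosted decision tree.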

  21. Stance Classification
     • Build model M: (f, d) -> {contradict, support}
     • Similarity is not good for stance classification
     • Find key contradicting patterns in contexts similar to the claim
       • Collected 3.1k (claim, contradicting statement) pairs
       • Built a 900-dimensional lexicon from the uni- and bi-grams with the highest probability
       • Uni-grams: hoax, fake, purportedly, rumor
       • Bi-grams: made-up, fact check, not true, no evidence

  22. Stance Classification
     • From the relevant doc, use title, headline, and text, and prune away text whose similarity is smaller than a threshold
     • Concatenate the remaining text with one sentence before and after it -> key components
     • Extract uni- and bi-grams of the key components -> final feature vector
     • Use a gradient-boosted decision tree for prediction
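
The steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions: similarity is taken to mean similarity to the claim, Jaccard token overlap stands in for whatever similarity measure the paper uses, the threshold value is arbitrary, the four-entry lexicon is a tiny excerpt of the 900-dimensional one, and all function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens):
    """Return the set of uni-grams and bi-grams of a token list."""
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams

def overlap(a, b):
    """Jaccard overlap of two token lists (stand-in similarity, an assumption)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def key_components(claim, sentences, threshold=0.3):
    """Keep sentences similar to the claim, each with one neighbor on either side."""
    claim_tokens = tokenize(claim)
    components = []
    for i, sentence in enumerate(sentences):
        if overlap(claim_tokens, tokenize(sentence)) >= threshold:
            components.append(" ".join(sentences[max(0, i - 1): i + 2]))
    return components

def lexicon_features(components, lexicon):
    """Binary vector: does each lexicon entry occur in the key components?"""
    grams = set()
    for component in components:
        grams |= ngrams(tokenize(component))
    return [1 if term in grams else 0 for term in lexicon]

# Toy example, invented for illustration only.
lexicon = ["hoax", "fake", "not true", "no evidence"]
claim = "the moon is made of cheese"
sentences = [
    "Welcome to our site.",
    "A viral post says the moon is made of cheese.",
    "This is a hoax and there is no evidence.",
    "Subscribe to our newsletter.",
    "We sell cheese too.",
]
components = key_components(claim, sentences)
features = lexicon_features(components, lexicon)
```

Only the second sentence passes the threshold, but the contradicting cue words sit in its neighbor, which is why the one-sentence context window matters. The resulting vector would be the input to the gradient-boosted decision tree.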

  23. Results of Their Model

  24. Pros & Cons
     • Pros:
       • New state-of-the-art stance classification algorithm
       • Proposed a whole end-to-end system for relevant document discovery
     • Cons:
       • Not providing their dataset
       • Not providing the distribution of the dataset
       • Not providing the per-class scores
       • Evaluation in isolation
       • Ignored the Discussing class

  25. Conclusion
     • What is Fake News
     • What is Fact-Checking
     • Fake News Challenge
       • Top-3 Performers
       • StackLSTM
     • End-to-End System for Relevant Document Discovery
     • The Disagreeing class is not predicted well

  26. Future Work
     • All features are text-based
     • Include non-textual data as features
       • Videos -> image & speech recognition
       • Social media pages -> embedded image/graphic information
     • Develop ML techniques with deeper semantic understanding
       • Not relying on lexical features
     • The Disagreeing class has to be predicted better

  27. Thank you for your attention. Any questions?

  28. Dataset of Paper 2
     • Unlabeled corpus: 14731 fact-checking articles by highly reputable fact-checkers
     • Relevance-labeled corpus: their candidate generation algorithm yields about 2400 related documents per fact-checking article, 33.5M in total
       • With crowd workers and balancing for positive and negative examples, a total of 8000 (claim, doc) pairs
       • Crowd-working question: does this doc address the claim?
     • Stance-labeled corpus: randomly sampled 1200 of the positive instances of the relevance-labeled corpus and crowdsourced the labels -> support, contradict, neither, can't tell
       • For 12%, the workers couldn't agree -> removed
     • Manual corpus: to measure candidate generation, randomly sampled 450 fact-checking articles and let crowd workers search for candidates

  29. Features Description
     • Bag of words:
       • 1- and 2-grams with a 5000-token vocabulary for headline and doc
       • Added a negation flag "_NEG" as prefix to every word between a special negation word like "not", "never", "no" and the next punctuation mark
     • Topic models:
       • Non-negative matrix factorization
       • Latent semantic indexing
       • Latent Dirichlet allocation
       • Similarity between the topic models of headlines and bodies
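
The negation flag can be illustrated with a short sketch. The function name, the tokenizer, the punctuation set, and the exact shape of the flagged token ("_NEG" glued directly in front of the word) are assumptions made for this example.

```python
import re

NEGATION_WORDS = {"not", "never", "no"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def mark_negation(text):
    """Prefix '_NEG' to every word between a negation word and the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,;:!?]", text.lower())
    marked = []
    negating = False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False          # punctuation closes the negation scope
            marked.append(token)
        elif token in NEGATION_WORDS:
            negating = True           # negation word opens the scope
            marked.append(token)
        elif negating:
            marked.append("_NEG" + token)
        else:
            marked.append(token)
    return marked

tokens = mark_negation("This is not a true story, it is fake.")
```

With this marking, "true" and "_NEGtrue" become different vocabulary entries, so a bag-of-words model can tell "true" from "not true" without any notion of word order.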

  30. Feature Description II
     • Lexicon-based features
       • Based on lexicons which hold the sentiment/polarity for each word
       • Computed separately for headline and body
       • Count positively and negatively polarized words -> features
       • Find maximum positive and negative polarity -> features
       • Last word with negative or positive polarity -> feature
       • Refuting words list ("fake", "hoax") -> features
       • Concatenating all features from above
     • Readability features
       • Measured with different metrics (e.g. SMOG, Flesch-Kincaid, Gunning fog)
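
A minimal sketch of the lexicon-based features, assuming a polarity lexicon that maps words to signed scores. The lexicon entries, scores, feature order, and function name below are invented for illustration; real sentiment lexicons are far larger.

```python
def polarity_features(tokens, polarity, refuting_words):
    """Return count, maximum, and last-word polarity features for one text."""
    scores = [polarity[t] for t in tokens if polarity.get(t, 0.0) != 0.0]
    positives = [s for s in scores if s > 0]
    negatives = [s for s in scores if s < 0]
    return [
        len(positives),                                # number of positive words
        len(negatives),                                # number of negative words
        max(positives, default=0.0),                   # strongest positive polarity
        min(negatives, default=0.0),                   # strongest negative polarity
        scores[-1] if scores else 0.0,                 # polarity of last polarized word
        sum(1 for t in tokens if t in refuting_words), # refuting word count
    ]

# Hypothetical lexicon and refuting word list.
polarity = {"great": 0.8, "good": 0.5, "terrible": -0.9, "bad": -0.6}
refuting_words = {"fake", "hoax"}

headline_tokens = "a good story but a terrible fake".split()
body_tokens = "great news".split()

# Computed separately for headline and body, then concatenated.
headline_features = polarity_features(headline_tokens, polarity, refuting_words)
body_features = polarity_features(body_tokens, polarity, refuting_words)
feature_vector = headline_features + body_features
```

Keeping headline and body features separate lets a classifier pick up on mismatches, for example a negative headline paired with a positive body.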
