
Data and Text Mining: The Search for Unknown Knowns

This presentation explores data mining and text mining: the discovery of new, previously unknown information, in the case of text mining from unstructured text. It distinguishes data mining from information retrieval, information extraction, and information analysis taken individually. It also asks whether publishers, if "hiding" information in unstructured text is a problem, should be exploring new ways to "publish".


Presentation Transcript


  1. Data and text mining: the search for unknown knowns • Geoffrey Bilder • UKSG, 2007 • gbilder@crossref.org

  2. "Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."

  3. The Mining Metaphor

  4. Gold Mining

  5. Diamond Mining

  6. Data Mining

  7. Data Mining: What it isn’t

  8. ≠ Information Retrieval

  9. ≠ Information Extraction

  10. ≠ Information Analysis

  11. Information Retrieval + Information Extraction + Information Analysis

  12. Data Mining: new, previously unknown information

  13. And so what is text data mining?

  14. Text Mining

  15. Information Retrieval + Information Extraction + Information Analysis

  16. The crucial question for publishers is: “If ‘hiding’ information in unstructured text is a problem, then shouldn’t we be exploring new ways to ‘publish’?”

  17. So how did we get here?

  18. The word tobacco originates from the Taíno Indians. • There is no I in the word Team. • The book captured the zeitgeist of the time. • I am sure that I turned the gas off.

  19. The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time. • I am <emphasis>sure</emphasis> that I turned the gas off.
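The point of the markup on slide 19 is that a program can find what a reader would otherwise have to infer. A minimal sketch, assuming the two sentences are wrapped in a hypothetical `<text>` root with `<s>` sentence elements (neither is in the slide) so that they form one well-formed XML document:

```python
import xml.etree.ElementTree as ET

# The two marked-up sentences from slide 19. The <text> and <s> wrappers are
# assumptions added here so the fragment parses as a single XML document.
MARKED_UP = (
    "<text>"
    '<s>The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase>'
    " of the time.</s>"
    "<s>I am <emphasis>sure</emphasis> that I turned the gas off.</s>"
    "</text>"
)


def foreign_phrases(xml_text):
    """Return (language, phrase) pairs for every <foreign_phrase> element."""
    root = ET.fromstring(xml_text)
    return [(el.get("lang"), el.text) for el in root.iter("foreign_phrase")]


def emphasized(xml_text):
    """Return the text of every <emphasis> element."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("emphasis")]


print(foreign_phrases(MARKED_UP))  # [('DE', 'zeitgeist')]
print(emphasized(MARKED_UP))       # ['sure']
```

With the plain sentences on slide 18, the same questions ("which words are foreign?", "which words are stressed?") would need guesswork rather than a lookup.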

  20. Semantic Web “Light”

  21. But we can do more...

  22. The web as a database

  23. The Relational Model

  24. Rows represent things

  25. Columns are properties

  26. The thing’s property • The book (Subject) has an author (Predicate) “Jorge Luis Borges” (Object)

  27. The book (Subject, URI) has an author (Predicate, URI) “Jorge Luis Borges” (Object)

  28. RDF: Resource Description Framework • http://www.amazon.com/isbn/978-0140286809 has an author http://www.wikipedia.com/borges
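The subject, predicate, object statement on slides 26 to 28 can be sketched as a three-field record. The subject and object URIs below are the ones on slide 28; the predicate URI is hypothetical, since the slide only says "has an author". The output format is N-Triples, one common plain-text serialization of RDF.

```python
from collections import namedtuple

# One RDF statement: a subject, a predicate, and an object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical predicate URI; the slide does not name a vocabulary for
# "has an author".
HAS_AUTHOR = "http://example.org/terms/hasAuthor"

book = Triple(
    subject="http://www.amazon.com/isbn/978-0140286809",
    predicate=HAS_AUTHOR,
    object="http://www.wikipedia.com/borges",
)


def to_ntriples(triple):
    """Serialize a triple whose three parts are all URIs as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.object}> ."


print(to_ntriples(book))
```

Because every part is a URI, two sites that use the same URIs are describing the same book and the same author, without either site knowing about the other.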

  29. Blog • Journal A • Journal B • Wiki • Personal Website • OPAC

  30. Blog • Journal A • Journal B • Wiki • Personal Website • OPAC

  31. SPARQL • http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT DISTINCT ?name
      WHERE {
        ?x rdf:type foaf:Person .
        ?x foaf:name ?name
      }
      ORDER BY ?name
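To show what the query on slide 31 asks for, here is a rough sketch that evaluates the same two patterns over a small in-memory list of triples, using only the Python standard library. The triples are invented for illustration and are not the contents of the feed on the slide; a real SPARQL engine would also apply its own ordering rules for literals, which this sketch approximates with Python's string sort.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Invented sample data (assumption): three people, two sharing a name,
# plus one named thing that is not typed as a person.
TRIPLES = [
    ("http://example.org/people/a", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/people/a", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/people/b", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/people/b", FOAF_NAME, "Adolfo Bioy Casares"),
    ("http://example.org/people/c", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/people/c", FOAF_NAME, "Jorge Luis Borges"),
    ("http://example.org/books/1", FOAF_NAME, "Not a person"),
]


def person_names(triples):
    """Mimic: SELECT DISTINCT ?name WHERE { ?x rdf:type foaf:Person .
    ?x foaf:name ?name } ORDER BY ?name"""
    # First pattern: every ?x that is typed as foaf:Person.
    people = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Second pattern: names of those same ?x; the set makes them DISTINCT.
    names = {o for s, p, o in triples if p == FOAF_NAME and s in people}
    return sorted(names)


print(person_names(TRIPLES))  # ['Adolfo Bioy Casares', 'Jorge Luis Borges']
```

The shared variable `?x` is what joins the two patterns: a name is returned only if the thing it belongs to is also declared to be a person.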

  32. Creative Commons • FOAF • Geo • RSS 1.0 • FRBR • SKOS

  33. The Early Modern Internet

  34. Data Mining = Information retrieval + Information extraction + Information analysis... With the goal of discovering new, previously unknown information

  35. Data Mining = Information retrieval + Information extraction + Information analysis... With the goal of discovering new, previously unknown information Text Data Mining = Complex data extraction layer + data mining
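The summary formula can be read as a three-stage pipeline. The sketch below is a toy illustration under stated assumptions, not a method from the talk: the documents and vocabulary are invented, "extraction" is plain term spotting, and "analysis" is co-occurrence counting. The term spotting stands in for the complex extraction layer that text mining needs before data mining can begin.

```python
import re
from collections import Counter
from itertools import combinations

# Invented sample documents and vocabulary (assumptions for illustration).
DOCS = [
    "Borges wrote Ficciones.",
    "Ficciones returns often to the labyrinth.",
    "The labyrinth fascinated Borges.",
    "Gold mining is unrelated.",
]
TERMS = {"borges", "ficciones", "labyrinth"}


def words(doc):
    """Lowercased set of alphabetic words in a document."""
    return set(re.findall(r"[a-z]+", doc.lower()))


def retrieve(docs, vocabulary):
    """Information retrieval: keep documents that mention any term."""
    return [d for d in docs if vocabulary & words(d)]


def extract(doc, vocabulary):
    """Information extraction: the structured facts here are just the terms found."""
    return sorted(vocabulary & words(doc))


def analyze(term_lists):
    """Information analysis: count how often each pair of terms co-occurs."""
    counts = Counter()
    for terms in term_lists:
        counts.update(combinations(terms, 2))
    return counts


def mine(docs, vocabulary):
    """Retrieval, then extraction, then analysis, chained together."""
    hits = retrieve(docs, vocabulary)
    return analyze(extract(d, vocabulary) for d in hits)


for pair, count in sorted(mine(DOCS, TERMS).items()):
    print(pair, count)
```

No single document links all three terms, but the combined counts do, which is the sense in which the three steps together can surface something none of the sources states on its own.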
