Data and Text Mining: The Search for Unknown Knowns

Geoffrey Bilder • UKSG, 2007 • gbilder@crossref.org

"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."

The Mining Metaphor

Gold Mining

Diamond Mining

Data Mining

Data Mining: What It Isn’t

≠ Information Retrieval

≠ Information Extraction

≠ Information Analysis

Information Retrieval + Information Extraction + Information Analysis

→ new, previously unknown information (Data Mining)

And so what is text data mining?

Text Mining

Information Retrieval + Information Extraction + Information Analysis

The crucial question for publishers is: if “hiding” information in unstructured text is a problem, then shouldn’t we be exploring new ways to “publish”?

So how did we get here?

• The word tobacco originates from the Taino Indians.
• There is no I in the word Team.
• The book captured the zeitgeist of the time.
• I am sure that I turned the gas off.

The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.
I am <emphasis>sure</emphasis> that I turned the gas off.
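Once text carries markup like this, a program can pull out the tagged spans without guessing at their meaning. A minimal sketch in Python, assuming the two marked-up sentences are wrapped in a hypothetical <para> root element (the wrapper and the function name tagged_spans are illustrative and not part of the slides):

```python
import xml.etree.ElementTree as ET

# The two marked-up sentences from the slide, wrapped in a hypothetical
# <para> element so that they form one well-formed XML document.
MARKED_UP = (
    '<para>The book captured the '
    '<foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time. '
    'I am <emphasis>sure</emphasis> that I turned the gas off.</para>'
)


def tagged_spans(xml_text, tag):
    """Return (text, attributes) for every element named `tag`."""
    root = ET.fromstring(xml_text)
    return [(el.text, dict(el.attrib)) for el in root.iter(tag)]


foreign = tagged_spans(MARKED_UP, "foreign_phrase")
emphasised = tagged_spans(MARKED_UP, "emphasis")
print(foreign)      # the foreign phrase together with its language code
print(emphasised)   # the emphasised word
```

The same lookup on the unmarked sentences would need heuristics to decide which words are foreign or stressed; with the markup it is a plain element search.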

Semantic Web “Light”

But we can do more...

The web as a database

The Relational Model

Rows represent things

Columns are properties

The thing’s property: The book (Subject) has an author (Predicate) “Jorge Luis Borges” (Object)

The book (Subject, a URI) has an author (Predicate, a URI) “Jorge Luis Borges” (Object)
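The step from rows and columns to subject, predicate, and object can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name row_to_triples, the identifier urn:example:book, and the second column are hypothetical; only the author column comes from the slide.

```python
def row_to_triples(subject, row):
    """Turn one table row into (subject, predicate, object) triples.

    The row stands for a thing, each column name becomes a predicate,
    and each cell value becomes an object.
    """
    return [(subject, column, value) for column, value in row.items()]


# Hypothetical row for one book. "has an author" is taken from the slide;
# "has a title" and its value are placeholders for illustration only.
book_row = {
    "has an author": "Jorge Luis Borges",
    "has a title": "Example Title",
}

triples = row_to_triples("urn:example:book", book_row)
for triple in triples:
    print(triple)
```

Each row yields one triple per column, so a table with many columns becomes many small statements about the same subject.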

RDF: Resource Description Framework

http://www.amazon.com/isbn/978-0140286809 has an author http://www.wikipedia.com/borges
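One common way to write such a statement down is the N-Triples syntax, where URIs go in angle brackets and each statement ends with a period. The sketch below builds that line by hand in Python. The slide gives no URI for "has an author", so the Dublin Core creator term is used here as an assumption, and the helper name to_ntriples is hypothetical.

```python
# Assumption: the slide does not name a predicate URI, so the Dublin Core
# "creator" term stands in for "has an author".
HAS_AUTHOR = "http://purl.org/dc/terms/creator"


def to_ntriples(subject, predicate, obj):
    """Format one statement as an N-Triples line.

    Objects starting with http:// or https:// are written as URIs;
    anything else is written as a quoted string literal.
    """
    if obj.startswith(("http://", "https://")):
        obj_term = "<" + obj + ">"
    else:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = '"' + escaped + '"'
    return "<" + subject + "> <" + predicate + "> " + obj_term + " ."


line = to_ntriples(
    "http://www.amazon.com/isbn/978-0140286809",
    HAS_AUTHOR,
    "http://www.wikipedia.com/borges",
)
print(line)
```

Passing the plain string "Jorge Luis Borges" as the object instead produces a literal, which matches the earlier form of the statement where only the subject and predicate were URIs.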

Blog • Journal A • Journal B • Wiki • Personal Website • OPAC

SPARQL

http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name
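To make the shape of that query concrete, the sketch below imitates it in plain Python over a handful of in-memory triples. It is not a SPARQL engine, and the sample people and urn:example identifiers are invented for illustration; only the rdf:type, foaf:Person, and foaf:name URIs come from the query's prefixes.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Invented sample data: three people (two sharing a name) and one
# named resource that is not typed as a person.
TRIPLES = [
    ("urn:example:a", RDF_TYPE, FOAF_PERSON),
    ("urn:example:a", FOAF_NAME, "Alice Example"),
    ("urn:example:b", RDF_TYPE, FOAF_PERSON),
    ("urn:example:b", FOAF_NAME, "Bob Example"),
    ("urn:example:c", RDF_TYPE, FOAF_PERSON),
    ("urn:example:c", FOAF_NAME, "Alice Example"),
    ("urn:example:doc", FOAF_NAME, "Not A Person"),
]


def distinct_person_names(triples):
    """Mimic the query: names of everything typed foaf:Person."""
    # Pattern 1:  ?x rdf:type foaf:Person
    people = {s for s, p, o in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Pattern 2:  ?x foaf:name ?name   (same ?x as in pattern 1)
    names = {o for s, p, o in triples if p == FOAF_NAME and s in people}
    # The set gives DISTINCT; sorting gives ORDER BY ?name.
    return sorted(names)


names = distinct_person_names(TRIPLES)
print(names)
```

The two set comprehensions correspond to the two triple patterns in the WHERE clause, joined on the shared variable ?x.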

Creative Commons • FOAF • Geo • RSS 1.0 • FRBR • SKOS

The Early Modern Internet

Data Mining = Information retrieval + Information extraction + Information analysis, with the goal of discovering new, previously unknown information

Text Data Mining = Complex data extraction layer + data mining
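The three ingredients in that definition can be lined up as a toy pipeline. The Python sketch below is one possible reading, not a method taken from the slides: the documents, the vocabulary, and the function names are all invented, and "analysis" is reduced to finding pairs of terms that never appear together but share a common neighbour.

```python
from itertools import combinations

# Invented toy corpus and vocabulary, for illustration only.
DOCS = [
    "alpha is associated with beta",
    "beta is associated with gamma",
    "delta is unrelated",
]
VOCABULARY = {"alpha", "beta", "gamma", "delta"}


def retrieve(docs, keyword):
    """Information retrieval: keep documents that contain the keyword."""
    return [d for d in docs if keyword in d.lower().split()]


def extract(doc, vocabulary):
    """Information extraction: pull the known terms out of one document."""
    return {w for w in doc.lower().split() if w in vocabulary}


def analyse(term_sets):
    """Information analysis: pairs linked only through a shared neighbour."""
    direct = set()
    for terms in term_sets:
        direct.update(combinations(sorted(terms), 2))
    neighbours = {}
    for a, b in direct:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    indirect = set()
    for a, c in combinations(sorted(neighbours), 2):
        if (a, c) not in direct and neighbours[a] & neighbours[c]:
            indirect.add((a, c))
    return indirect


hits = retrieve(DOCS, "associated")
term_sets = [extract(d, VOCABULARY) for d in hits]
hidden = analyse(term_sets)
print(hidden)
```

No single document mentions alpha and gamma together, so the pair only surfaces once retrieval, extraction, and analysis are combined. In the text-mining case, the extract step is where the "complex data extraction layer" would sit.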
