
Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores


Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores


Presentation Transcript


  1. Machine Learning: Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores (Relational, Distributed). Harris T. Lin and Vasant Honavar, Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University. htlin@iastate.edu

  2. Resource Description Framework (RDF) Primer • RDF triple = subject-predicate-object triple • RDF graph = set of RDF triples, i.e. a directed labeled graph whose nodes are URIs • Example RDF data (triple representation): (Inception, hasActor, Ellen Page), (Inception, hasActor, Leonardo DiCaprio), (Titanic, hasActor, Leonardo DiCaprio), (Ellen Page, yearOfBirth, 1987), (Ellen Page, gender, F), (Leonardo DiCaprio, yearOfBirth, 1974), (Leonardo DiCaprio, gender, M) [Figure: the RDF schema (Movie -hasActor-> Actor, with yearOfBirth and gender properties) and the same RDF data drawn as a graph]
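The slide's triple and graph representations can be sketched in plain Python, treating an RDF graph as a set of (subject, predicate, object) tuples. This is illustrative only; real RDF systems use URIs and a store or library such as rdflib, and the helper `objects` below is a hypothetical convenience, not part of any RDF API.

```python
# An RDF graph as a Python set of (subject, predicate, object) triples,
# mirroring the movie/actor example on the slide.
triples = {
    ("Inception", "hasActor", "Ellen Page"),
    ("Inception", "hasActor", "Leonardo DiCaprio"),
    ("Titanic", "hasActor", "Leonardo DiCaprio"),
    ("Ellen Page", "yearOfBirth", 1987),
    ("Ellen Page", "gender", "F"),
    ("Leonardo DiCaprio", "yearOfBirth", 1974),
    ("Leonardo DiCaprio", "gender", "M"),
}

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(sorted(objects(triples, "Inception", "hasActor")))
# ['Ellen Page', 'Leonardo DiCaprio']
```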

  3. Introduction • Motivating scenario: Facebook + New York Times • Facebook users share posts about news items published in New York Times • Goal: predict the interest of a user in joining a group • Challenges for Machine Learning • Multiple interlinked data stores • Physically distributed data stores • Autonomously maintained data stores

  4. Introduction • Linked Open Data cloud • 300+ interlinked datasets • 30+ billion triples • Multiple interlinked, physically distributed, autonomously maintained data stores • Downloading all the data together is often prohibited by • Bandwidth limits • Access limits • Storage and memory limits • Privacy and confidentiality constraints • We need • Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. a SPARQL query interface) Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

  5. Summary of Contribution • Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores • Contributions • Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data • Distributed learning framework for RDF stores that form a chain [Not covered in this talk] • Identification of 3 special cases of RDF data fragmentation [Not covered in this talk] • Novel application of matrix reconstruction for approximating statistics, which dramatically reduces communication • Experimental results demonstrating feasibility

  6. Problem Formulation • Last.fm dataset (conceptual): each user is an instance ((Bi1, …, BiK), ci), where each Bik is a bag of attribute values linked to the user and ci is the class label [Table: User1 → Y, User2 → N, User3 → N]

  7. Learning with Indirect Access to Data • Single RDF data store: Lin et al. [10] • Multiple interlinked RDF data stores: this work [Figure: the learner obtains statistics from the RDF data stores via SPARQL queries, forms instances ((B11, …, B1K), c1), …, ((Bn1, …, BnK), cn), induces a classifier, and uses it to predict the class of a new instance]
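Indirect access means the learner ships aggregate queries to the store and receives only counts, never the raw triples. A minimal sketch of such a query follows; the prefix and predicate names (`ex:memberOf`, `ex:Group1`) are hypothetical placeholders, not taken from the paper or the Last.fm dataset.

```python
# An aggregate SPARQL query a learner might issue: only the single
# count crosses the network, not the set of matching users.
# The ex: vocabulary below is a made-up example.
COUNT_QUERY = """
PREFIX ex: <http://example.org/>
SELECT (COUNT(?user) AS ?n)
WHERE { ?user ex:memberOf ex:Group1 . }
"""
```

A query string like this could be posted to a SPARQL endpoint (e.g. via a client library such as SPARQLWrapper); the key point is that the result is one number rather than the underlying data.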

  8. Learning Algorithms • Aggregation • Simple aggregation (max, min, avg, etc.) • Vector distance aggregation (Perlich and Provost [12]) • Generative Models • Naïve Bayes (with 4 different distributions) • Bernoulli • Multinomial • Dirichlet • Polya (Dirichlet-Multinomial) • Key sufficient statistic: the count of each value, for each instance (= a histogram for each instance) • How to obtain this efficiently?
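The key sufficient statistic named above, one value histogram per instance, is straightforward to express. A minimal sketch with hypothetical toy data (the user names and tag values are made up for illustration):

```python
from collections import Counter

# For each instance (user), count how often each value (e.g. each tag)
# occurs; these per-instance histograms are the sufficient statistics
# the naive Bayes models consume.
values_per_user = {
    "User1": ["rock", "rock", "jazz"],
    "User2": ["jazz"],
}

histograms = {u: Counter(vs) for u, vs in values_per_user.items()}
# histograms["User1"] == Counter({"rock": 2, "jazz": 1})
```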

  9. Obtaining Statistics for Learning [Figure: the Last.fm schema forms a chain User → Track → Artist → Tag; the data graph instantiates this chain, and in matrix representation each link becomes an incidence matrix, yielding a User × Tag matrix of counts]

  10. Approximating Statistics [Figure: the User × Tag count matrix and its two projections, obtained along the chain User → Track → Artist → Tag: the column projection (the vector of column sums) and the row projection (the vector of row sums)]
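The two projections are just the column sums and row sums of the count matrix. A minimal sketch, reusing a hypothetical 2-user × 3-tag toy matrix:

```python
import numpy as np

M = np.array([[2, 2, 0],
              [1, 1, 1]])   # toy User x Tag count matrix

col_projection = M.sum(axis=0)   # per-tag totals:  [3, 3, 1]
row_projection = M.sum(axis=1)   # per-user totals: [4, 3]
```

Each vector is far smaller than the matrix itself, which is the source of the communication savings discussed below.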

  11. Approximating Statistics • Could we approximate this matrix from the two projections? • CT scans to the rescue! • CT: reconstructs a 3D object from its projected slices • We want: reconstruct a 2D matrix from its projections [Figure: a User × Tag matrix with unknown entries, alongside CT scan images] Sources: http://health-fts.blogspot.com/2012/01/brain-ct-mri.html, https://www.medicalradiation.com/types-of-medical-imaging/imaging-using-x-rays/computed-tomography-ct/

  12. Approximating Statistics • We adapted one of the simplest reconstruction methods: the Algebraic Reconstruction Technique (ART) • Proposed scheme • Use SPARQL queries to accumulate and pass along column and row vectors, ultimately sending them back to the learner • The learner uses a CT method to reconstruct the matrix from the projections • Use the approximated matrix to compute the statistics needed for learning • Dramatically reduces communication! • How accurate are the learned classifiers?
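A minimal sketch of ART for this setting, not the paper's exact scheme: each row sum and column sum is one linear equation on the unknown matrix, and ART (Kaczmarz iteration) cycles through the equations, projecting the current estimate onto each constraint in turn. The toy matrix is hypothetical.

```python
import numpy as np

def art_from_projections(row_sums, col_sums, n_iter=50):
    """Reconstruct a matrix consistent with the given row and column sums."""
    row_sums = np.asarray(row_sums, dtype=float)
    col_sums = np.asarray(col_sums, dtype=float)
    n_rows, n_cols = len(row_sums), len(col_sums)
    X = np.zeros((n_rows, n_cols))
    for _ in range(n_iter):
        for i in range(n_rows):   # project onto each row-sum constraint
            X[i, :] += (row_sums[i] - X[i, :].sum()) / n_cols
        for j in range(n_cols):   # project onto each column-sum constraint
            X[:, j] += (col_sums[j] - X[:, j].sum()) / n_rows
    return X

M = np.array([[2.0, 2.0, 0.0],
              [1.0, 1.0, 1.0]])
X = art_from_projections(M.sum(axis=1), M.sum(axis=0))
# X matches both projections exactly, but its entries are only an
# estimate of M: many matrices share the same row and column sums.
```

The final comment is the crux of the accuracy question on the slide: the reconstruction is consistent with the projections but not unique, so the quality of the learned classifier depends on how much the model relies on the individual entries.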

  13. Experimental Results • Two subsets of the Last.fm dataset • 2 aggregation and 4 naïve Bayes models • Compared against the centralized counterpart, which uses the exact matrix for learning • Accuracy results (10-fold cross validation) • ART approximation has different effects depending on the model • NB (Polya) is competitive, even in the ART-approximated case • NB (Multinomial) is competitive too, despite using less information than NB (Polya) • NB (Bernoulli) and NB (Multinomial) only need the projections for learning, hence their exact and approximated results are identical (*) • Sensitivity of ART for different models [Not covered in this talk]

  14. Communication Complexity • Size of query results transferred vs. size of the dataset (# users) • The size of the projections is several orders of magnitude smaller

  15. Conclusion • Challenges • Multiple interlinked, physically distributed, autonomously maintained RDF data stores • The learner may be prohibited from downloading all the data due to limits on bandwidth, access, storage and memory, or privacy and confidentiality constraints • We need • Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. a SPARQL query interface) • Contributions • Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data • Distributed learning framework for RDF stores that form a chain [Not covered in this talk] • Identification of 3 special cases of RDF data fragmentation [Not covered in this talk] • Novel application of matrix reconstruction from computerized tomography for approximating statistics, which dramatically reduces communication • Experimental results demonstrating feasibility

  16. Related Work and Future Work • Related Work • Most existing work on learning from RDF data assumes direct access • Lin et al. [10] learn relational Bayesian classifiers from a single remote RDF store via SPARQL queries • This work extends the remote access framework [20] to multiple RDF stores • Future Work • Consider more recent and complex CT methods • Explore other ways of taking projections • Consider more complex RDF data fragmentations • Consider richer classes of learning models
