
Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores


Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores


Presentation Transcript


  1. Machine Learning: Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores (Relational, Distributed). Harris T. Lin and Vasant Honavar, Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University. htlin@iastate.edu

  2. Resource Description Framework (RDF) Primer • RDF triple = subject-predicate-object triple • RDF graph = set of RDF triples, i.e. a directed labeled graph whose nodes are URIs • Example RDF data (triple representation): (Inception, hasActor, Ellen Page), (Inception, hasActor, Leonardo DiCaprio), (Titanic, hasActor, Leonardo DiCaprio), (Ellen Page, yearOfBirth, 1987), (Ellen Page, gender, F), (Leonardo DiCaprio, yearOfBirth, 1974), (Leonardo DiCaprio, gender, M) [Figure: the RDF schema (Movie -hasActor-> Actor, with yearOfBirth and gender properties) and the same RDF data drawn as a graph]
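The slide's triple and graph representations can be sketched in plain Python, treating an RDF graph as a set of (subject, predicate, object) tuples. This is illustrative only; real RDF systems use URIs and a store or library such as rdflib, and the helper `objects` below is a hypothetical convenience, not part of any RDF API.

```python
# An RDF graph as a Python set of (subject, predicate, object) triples,
# mirroring the movie/actor example on the slide.
triples = {
    ("Inception", "hasActor", "Ellen Page"),
    ("Inception", "hasActor", "Leonardo DiCaprio"),
    ("Titanic", "hasActor", "Leonardo DiCaprio"),
    ("Ellen Page", "yearOfBirth", 1987),
    ("Ellen Page", "gender", "F"),
    ("Leonardo DiCaprio", "yearOfBirth", 1974),
    ("Leonardo DiCaprio", "gender", "M"),
}

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(sorted(objects(triples, "Inception", "hasActor")))
# ['Ellen Page', 'Leonardo DiCaprio']
```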

  3. Introduction • Motivating scenario: Facebook + New York Times • Facebook users share posts about news items published in New York Times • Goal: predict the interest of a user in joining a group • Challenges for Machine Learning • Multiple interlinked data stores • Physically distributed data stores • Autonomously maintained data stores

  4. Introduction • Linked Open Data cloud • 300+ interlinked datasets • 30+ billion triples • Multiple interlinked, physically distributed, autonomously maintained data stores • Downloading all the data together is often prohibited by • Bandwidth limits • Access limits • Storage and memory limits • Privacy and confidentiality constraints • We need • Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. a SPARQL query interface) Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

  5. Summary of Contribution • Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores • Contributions • Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data • Distributed learning framework for RDF stores that form a chain [Not covered in this talk] • Identification of 3 special cases of RDF data fragmentation [Not covered in this talk] • Novel application of matrix reconstruction for approximating statistics, which dramatically reduces communication • Experimental results demonstrating feasibility

  6. Problem Formulation • Last.fm dataset (conceptual): each user is an instance ((Bi1, …, BiK), ci), where each Bik is a bag of attribute values linked to the user and ci is the class label [Table: User1 → Y, User2 → N, User3 → N]

  7. Learning with Indirect Access to Data • Single RDF data store: Lin et al. [10] • Multiple interlinked RDF data stores: this work [Figure: the learner obtains statistics from the RDF data stores via SPARQL queries, forms instances ((B11, …, B1K), c1), …, ((Bn1, …, BnK), cn), induces a classifier, and uses it to predict the class of a new instance]
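Indirect access means the learner ships aggregate queries to the store and receives only counts, never the raw triples. A minimal sketch of such a query follows; the prefix and predicate names (`ex:memberOf`, `ex:Group1`) are hypothetical placeholders, not taken from the paper or the Last.fm dataset.

```python
# An aggregate SPARQL query a learner might issue: only the single
# count crosses the network, not the set of matching users.
# The ex: vocabulary below is a made-up example.
COUNT_QUERY = """
PREFIX ex: <http://example.org/>
SELECT (COUNT(?user) AS ?n)
WHERE { ?user ex:memberOf ex:Group1 . }
"""
```

A query string like this could be posted to a SPARQL endpoint (e.g. via a client library such as SPARQLWrapper); the key point is that the result is one number rather than the underlying data.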

  8. Learning Algorithms • Aggregation • Simple aggregation (max, min, avg, etc.) • Vector distance aggregation (Perlich and Provost [12]) • Generative Models • Naïve Bayes (with 4 different distributions) • Bernoulli • Multinomial • Dirichlet • Polya (Dirichlet-Multinomial) • Key sufficient statistic: the count of each value, for each instance (= a histogram for each instance) • How to obtain this efficiently?
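The key sufficient statistic named above, one value histogram per instance, is straightforward to express. A minimal sketch with hypothetical toy data (the user names and tag values are made up for illustration):

```python
from collections import Counter

# For each instance (user), count how often each value (e.g. each tag)
# occurs; these per-instance histograms are the sufficient statistics
# the naive Bayes models consume.
values_per_user = {
    "User1": ["rock", "rock", "jazz"],
    "User2": ["jazz"],
}

histograms = {u: Counter(vs) for u, vs in values_per_user.items()}
# histograms["User1"] == Counter({"rock": 2, "jazz": 1})
```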

  9. Obtaining Statistics for Learning [Figure: the Last.fm schema forms a chain User → Track → Artist → Tag; the data graph instantiates this chain, and in matrix representation each link becomes an incidence matrix, yielding a User × Tag matrix of counts]

  10. Approximating Statistics [Figure: the User × Tag count matrix and its two projections, obtained along the chain User → Track → Artist → Tag: the column projection (the vector of column sums) and the row projection (the vector of row sums)]
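The two projections are just the column sums and row sums of the count matrix. A minimal sketch, reusing a hypothetical 2-user × 3-tag toy matrix:

```python
import numpy as np

M = np.array([[2, 2, 0],
              [1, 1, 1]])   # toy User x Tag count matrix

col_projection = M.sum(axis=0)   # per-tag totals:  [3, 3, 1]
row_projection = M.sum(axis=1)   # per-user totals: [4, 3]
```

Each vector is far smaller than the matrix itself, which is the source of the communication savings discussed below.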

  11. Approximating Statistics • Could we approximate this matrix from the two projections? • CT scans to the rescue! • CT: reconstructs a 3D object from its projected slices • We want: reconstruct a 2D matrix from its projections [Figure: a User × Tag matrix with unknown entries, alongside CT scan images] Sources: http://health-fts.blogspot.com/2012/01/brain-ct-mri.html, https://www.medicalradiation.com/types-of-medical-imaging/imaging-using-x-rays/computed-tomography-ct/

  12. Approximating Statistics • We adapted one of the simplest reconstruction methods: the Algebraic Reconstruction Technique (ART) • Proposed scheme • Use SPARQL queries to accumulate and pass along column and row vectors, ultimately sending them back to the learner • The learner uses a CT method to reconstruct the matrix from the projections • Use the approximated matrix to compute the statistics needed for learning • Dramatically reduces communication! • How accurate are the learned classifiers?
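A minimal sketch of ART for this setting, not the paper's exact scheme: each row sum and column sum is one linear equation on the unknown matrix, and ART (Kaczmarz iteration) cycles through the equations, projecting the current estimate onto each constraint in turn. The toy matrix is hypothetical.

```python
import numpy as np

def art_from_projections(row_sums, col_sums, n_iter=50):
    """Reconstruct a matrix consistent with the given row and column sums."""
    row_sums = np.asarray(row_sums, dtype=float)
    col_sums = np.asarray(col_sums, dtype=float)
    n_rows, n_cols = len(row_sums), len(col_sums)
    X = np.zeros((n_rows, n_cols))
    for _ in range(n_iter):
        for i in range(n_rows):   # project onto each row-sum constraint
            X[i, :] += (row_sums[i] - X[i, :].sum()) / n_cols
        for j in range(n_cols):   # project onto each column-sum constraint
            X[:, j] += (col_sums[j] - X[:, j].sum()) / n_rows
    return X

M = np.array([[2.0, 2.0, 0.0],
              [1.0, 1.0, 1.0]])
X = art_from_projections(M.sum(axis=1), M.sum(axis=0))
# X matches both projections exactly, but its entries are only an
# estimate of M: many matrices share the same row and column sums.
```

The final comment is the crux of the accuracy question on the slide: the reconstruction is consistent with the projections but not unique, so the quality of the learned classifier depends on how much the model relies on the individual entries.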

  13. Experimental Results • Two subsets of the Last.fm dataset • 2 aggregation and 4 naïve Bayes models • Compared against the centralized counterpart, which uses the exact matrix for learning • Accuracy results (10-fold cross validation) • ART approximation has different effects depending on the model • NB (Polya) is competitive, even in the ART-approximated case • NB (Multinomial) is competitive too, despite using less information than NB (Polya) • NB (Bernoulli) and NB (Multinomial) only need the projections for learning, hence their exact and approximated results are identical (*) • Sensitivity of ART for different models [Not covered in this talk]

  14. Communication Complexity • Size of query results transferred vs. size of the dataset (# users) • The size of the projections is several orders of magnitude smaller

  15. Conclusion • Challenges • Multiple interlinked, physically distributed, autonomously maintained RDF data stores • The learner may be prohibited from downloading all the data due to limits on bandwidth, access, storage and memory, or privacy and confidentiality constraints • We need • Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. a SPARQL query interface) • Contributions • Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data • Distributed learning framework for RDF stores that form a chain [Not covered in this talk] • Identification of 3 special cases of RDF data fragmentation [Not covered in this talk] • Novel application of matrix reconstruction from computerized tomography for approximating statistics, which dramatically reduces communication • Experimental results demonstrating feasibility

  16. Related Work and Future Work • Related Work • Most existing work on learning from RDF data assumes direct access • Lin et al. [10] learn relational Bayesian classifiers from a single remote RDF store via SPARQL queries • This work extends the remote access framework [20] to multiple RDF stores • Future Work • Consider more recent and complex CT methods • Explore other ways of taking projections • Consider more complex RDF data fragmentations • Consider richer classes of learning models
