
Scalable Data Science Systems


nbermudes

Presentation Transcript


  1. Scalable Data Science Systems

  2. My research on Data Science
     • Input? Large data sets, large files, many documents, many tables, fast growing => Big Data
     • How? Fast external algorithms; efficient data structures at two storage levels
     • Parallel: multi-threaded or multi-node, distributed
     • Ideal goals: linear time O(n), linear speedup
     • Hardware? Multicore CPU, GPU or parallel cluster
     • Infrastructure? Cloud, distributed memory, parallel file system
     • Analytics: queries, cubes, statistics, machine learning
     • Challenge: apply CS theory to programming
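The two ideal goals on this slide can be stated with the standard textbook definitions. The symbols below are notation assumed for this note and do not come from the slides: T(p) is the running time on p threads or nodes, S(p) the speedup, E(p) the efficiency.

```latex
\[
T(n) = O(n)
\qquad\text{(linear time in the input size } n\text{)}
\]
\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
\]
\[
\text{linear speedup:}\quad S(p) \approx p \iff E(p) \approx 1
\]
```

Under these definitions, doubling the number of threads or nodes would ideally halve the running time; in practice communication and I/O cost usually keep E(p) below 1.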

  3. Data Systems research today
     • Transaction processing? Moving more into main memory, lock-free
     • Efficient analysis? Joins, compiled queries, streams; exploit ample RAM and multi-core; leverage R/ScaLAPACK
     • Compiler versus interpreter? Development moving more into Python and JavaScript
     • Massive storage? POSIX file system vs HDFS
     • Fast external algorithms? Simple tasks
     • Parallel computation? Multi-core with threads, shared-nothing (embedded message-passing)
     • Exploiting new hardware? Interesting, difficult, but customized

  4. Data Science involves Core CS research: Theory + Programming
     • Theory we use:
         • Time complexity (big O()) and I/O cost (disk, solid state memory)
         • Many data structures (arrays, trees, hash tables, linked lists)
         • Relational model and information retrieval models
         • Linear algebra
         • Multivariate statistics, machine learning models
         • Compilers and programming languages: parsing/compiling/optimizing code; recursion
     • Programming:
         • Languages: C++ and Python; also Java, combined with R, SQL, Scala
         • Systems: Unix (Linux), Spark
         • OS: Unix, but we have a lot of past work on MS Windows .NET
         • Systems aspects: threads, text/binary I/O, parallel file systems, memory management, code generation, code optimization, ... a lot of fun

  5. Typical Problems
     • Summarization for linear models: vector outer products
     • Exploration: cubes, lattices
     • Graph transitive closure (linear recursion), clique enumeration
     • Bayesian models: MCMC, classification, regression, variable/feature selection
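As an illustration of "summarization for linear models: vector outer products", here is a minimal Python sketch of one common way such a summary can be computed. It is not taken from the slides. It assumes the summary consists of the row count n, the linear sum L = Σ x and the sum of outer products Q = Σ x xᵀ, accumulated in a single pass over chunks of rows, as an external algorithm would read them from disk. The function names `summarize` and `mean_and_covariance` and the sample data are hypothetical.

```python
def summarize(chunks, d):
    """Single pass over chunks of d-dimensional rows.

    Returns (n, L, Q): row count, linear sum, and sum of outer products.
    Names and layout are assumptions for this sketch.
    """
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for chunk in chunks:              # one chunk in memory at a time
        for x in chunk:
            n += 1
            for i in range(d):
                L[i] += x[i]
                for j in range(d):
                    Q[i][j] += x[i] * x[j]   # outer product x x^T
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean and population covariance from the summary alone."""
    d = len(L)
    mean = [L[i] / n for i in range(d)]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)]
           for i in range(d)]
    return mean, cov


# Hypothetical data: three 2-dimensional rows split across two chunks.
chunks = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
n, L, Q = summarize(chunks, d=2)
mean, cov = mean_and_covariance(n, L, Q)
```

The pass touches each row once, so the cost is linear in the number of rows for a fixed dimension d, and the partial sums from different chunks can be added together, which is what would make the same summary computable in parallel.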

  6. Why join my group?
     • Balance between theory (mathematics) and programming (C++)
     • Lots of machine learning and graph analytics
     • Build libraries and tools to help analysts
     • Many scientific applications
