Big Data Technologies Lecture 1: Big Data & Big Data Analysis

Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro

Lecture Structure • 1 lecture hour + 2 lab hours / week • Aim of this lecture? • Importance of Big Data analysis • Impact of Big Data in science and technology • Parallel and distributed architectures • Parallelization of algorithms • Importance of hardware and data structure in the design of Big Data processing algorithms • Analysis of independent, dependent and streaming data (homogeneous and heterogeneous) • Practical part (lab) • Using Google Cloud to run parallel and distributed applications • Parallelizing basic sequential algorithms • Design, test, evaluation

Minimal Requirements • Passing grade (5) • 1 parallel algorithm implemented (in a single technology) and evaluated • One scientific presentation (10 min presentation + 2 questions) about a scientific paper (published or tech report) with a focus on Big Data, Cloud Computing, Bioinformatics or Security in Big Data • Maximum grade (10) • All algorithms given during lab hours should be implemented, and a final technical report should be presented • One scientific presentation (10 min presentation + 2 questions) about a top scientific paper (IPDPS, Supercomputing, Europar, CCGrid, ICDCS, IEEE Trans. PDC, IEEE Trans. Computing, FGCS, TPDS) with a focus on Big Data, Cloud Computing, Bioinformatics or Security in Big Data

A world increasingly connected Je Suis Charlie: 6500 retweets per minute

A world increasingly interconnected Cyberphysical systems: IT + communication + intelligence

Knowledge = Power = data Data: decision  control autonomy intelligence

What is Big Data? • Oxford English Dictionary (OED) • data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges • Wikipedia • an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications • datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze • “The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.” • The broad range of new and massive data types that have appeared over the last decade or so • The new tools helping us find relevant data and analyze its implications • The convergence of enterprise and consumer IT • The shift (for enterprises) from processing internal data to mining external data • The shift (for individuals) from consuming data to creating data • The merger of Madame Olympe Maxime and Lieutenant Commander Data • The belief that the more data you have the more insights and answers will rise automatically from the pool of ones and zeros • A new attitude by businesses, non-profits, government agencies, and individuals that combining data from multiple sources could lead to better decisions. https://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/#66e783be13ae

What is Big Data? • Volume • Velocity • Variety • ...

What is Big Data? Furthermore, Big Data means: • Using multiple data sources • Data ambiguities and human/machine errors Big Data != Better Data Unprocessed data has no meaning! Data analysis increases its value → information!

Big Data in numbers

Big Data in the current global context

Why now? • “We could have gotten started a lot earlier. We simply weren’t stepping back and looking at how to use the data” – Brad Smith, Intuit • Data are too precious to be erased!

What do we do with the data? Ethics! • Private data • Sensitive data

Information extraction • Exploratory • Theory based on observing phenomena • Constructive • Theory based on axioms and theorems

The 4th paradigm • Big Data + analysis • Prediction of the future • Analysis • Follows an exploratory path and studies data • Infers knowledge based on statistics or machine learning techniques • Constructs models and validates them based on data

Data analysis • The process of studying various types of data to identify previously unknown correlations and extract other useful information • Based on data mining Data flow

Types of data analytics • Descriptive • What has happened? • Diagnostic • Why did it happen? • Predictive • What will happen? • Prescriptive • What should we do with the data? Level of understanding data & its value
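The difference between the descriptive and predictive questions above can be illustrated with a minimal sketch. It assumes a toy series of daily counts; the function names `describe` and `forecast_next` are hypothetical, and the least-squares line is only one simple stand-in for a predictive model.

```python
from statistics import mean

def describe(series):
    """Descriptive analytics: summarize what has already happened."""
    return {"mean": mean(series), "min": min(series), "max": max(series)}

def forecast_next(series):
    """Predictive analytics: extrapolate a least-squares line one step ahead.

    Assumption: the series is roughly linear in time; real data rarely is.
    """
    n = len(series)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(series)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept + slope * n

daily_counts = [10, 12, 14, 16]        # made-up observations
summary = describe(daily_counts)       # what has happened?
prediction = forecast_next(daily_counts)  # what will happen?
```

Diagnostic and prescriptive analytics need domain context (causes and possible actions) and are not covered by this sketch.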

A few examples • Medical monitoring of children to alert doctors and parents when an intervention is needed • Predicting the status of industrial machinery • Preventing traffic jams, saving fuel, cutting down pollution

Data value

Data analysis flow • Data acquisition • Data cleaning, annotation and extraction • Missing values, outliers, duplicates • Between 50-70% of the effort is put here! • Heterogeneous data integration and representation in a common format • Data analysis • Automated and visual interpretation of results • People often see patterns that algorithms fail to identify! • Decision making
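The cleaning step of the flow above (duplicates, missing values, outliers) can be sketched as follows. This is a minimal illustration under stated assumptions: records are hypothetical `(sensor_id, value)` pairs, missing values are imputed with the median, and outliers are detected with the median absolute deviation (MAD) and an arbitrary cutoff `k`. Other choices are equally valid.

```python
from statistics import median

def clean(readings, k=5.0):
    """Deduplicate, impute missing values, and drop outliers.

    `readings` is a list of (sensor_id, value) tuples; value may be None.
    The record layout and the cutoff k are assumptions for illustration.
    """
    # 1. Duplicates: keep only the first occurrence of each record.
    seen, unique = set(), []
    for rec in readings:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)

    # 2. Missing values: replace None with the median of the known values.
    known = [v for _, v in unique if v is not None]
    fill = median(known)
    imputed = [(sid, fill if v is None else v) for sid, v in unique]

    # 3. Outliers: drop values more than k MADs away from the median.
    values = [v for _, v in imputed]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return imputed  # no spread to judge outliers against
    return [(sid, v) for sid, v in imputed if abs(v - med) <= k * mad]

raw = [("a", 10.0), ("b", 11.0), ("a", 10.0), ("c", None),
       ("d", 9.0), ("e", 500.0), ("f", 10.5)]
cleaned = clean(raw)
```

In this made-up input the repeated `("a", 10.0)` record is dropped, the missing value for `"c"` is filled in, and the 500.0 reading is removed as an outlier.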

Big Data roles • Data scientist • Data science = systematic method dedicated to uncovering knowledge through data analysis • In business • Process optimizations for increased efficiency • In science • Observed/experimental data analysis with the aim of drawing a conclusion • Requirements • Statistics • Java, Python, R, ... • Domain knowledge • Data engineer • Data engineering = field that designs, implements and offers systems for Big Data analysis • Builds scalable and modular platforms for data scientists • Installs Big Data solutions • Requirements • Databases, software engineering, parallel and cloud processing, real-time processing • C++, Java, Python • Understanding performance factors and the limitations of the systems/algorithms

Areas of interest

Data vs. processing speed • Data • Annotated: L • Unannotated: U • Learning algorithm: Φ • f = Φ(L + U) • Minimize error function • Avoid overtraining (overfitting) • Results: • Scalability: • Supervised learning: f = Φ(L) • Training data is large but insufficient!!! • Semi-supervised learning: f = Φ(L* + U) • L* most relevant training data • L* + U is large • Unsupervised learning: f = Φ(U) • Nearest Neighbor, convolutional neural networks, unrestricted Boltzmann machines, Deep Learning
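The notation above can be made concrete with a small sketch that contrasts f = Φ(L) with f = Φ(L + U). It assumes one-dimensional data, a hypothetical learner `phi` that classifies by the nearest class mean, and self-training (pseudo-labeling U with the supervised model) as the semi-supervised strategy. The slide does not prescribe any of these choices.

```python
def phi(labeled):
    """Hypothetical learner Φ: returns a classifier f based on class means.

    `labeled` is a list of (x, label) pairs with scalar x.
    """
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    means = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    return lambda x: min(means, key=lambda y: abs(means[y] - x))

def self_train(L, U):
    """One round of self-training, one possible way to use L and U together."""
    f = phi(L)                        # supervised model: f = Φ(L)
    pseudo = [(x, f(x)) for x in U]   # label the unannotated data with f
    return phi(L + pseudo)            # refit on annotated + pseudo-labeled data

L = [(0.0, "low"), (10.0, "high")]    # annotated data (made up)
U = [1.0, 2.0, 3.0, 4.0, 9.0]         # unannotated data (made up)

supervised = phi(L)
semi_supervised = self_train(L, U)
```

With these made-up values the point 5.5 is labeled "high" by the supervised model and "low" after self-training, because the unannotated points shift the class means.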

Practical example: Classification in DNA microarray studies • Classifying and predicting the diagnosis based on the gene profile • Measuring gene expression on a sample of 4,026 genes from 59 patients (39 used for training) exhibiting lymphoma and divided in 3 classes based on the type of the lymphoma • Problem • Few classes, hard to classify data (volume) • Algorithm • Find the centroid (mean expression of each gene) of each lymphoma • Find the genes that belong to it http://statweb.stanford.edu/~tibs/ftp/ncshrink2.pdf
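The centroid step of the algorithm above can be sketched as a plain nearest-centroid classifier. This is an illustration only: it uses made-up two-gene expression profiles and Euclidean distance, the function names are hypothetical, and it omits the gene selection step, so it is not the full method of the linked paper.

```python
from math import dist

def fit_centroids(samples, labels):
    """Return {class: centroid}; a centroid is the per-gene mean expression."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda y: dist(centroids[y], x))

# Toy training set: 6 patients, 2 genes, 3 classes (values are made up).
train = [[1.0, 0.0], [1.2, 0.2], [5.0, 5.0], [5.2, 4.8], [0.0, 9.0], [0.2, 9.2]]
classes = ["A", "A", "B", "B", "C", "C"]

centroids = fit_centroids(train, classes)
diagnosis = predict(centroids, [4.0, 5.5])
```

In the real study each profile would have 4,026 entries instead of 2, which is why reducing the set of genes that contribute to each centroid matters.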

Useful links • http://www.comp.nus.edu.sg/~tankl/cs5344/slides/2016/intro.pdf • http://infolab.stanford.edu/~echang/BigDat2015/BigDat2015-Lecture1-Edward-Chang.pdf • https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2015_2016/bd-1516-einfuehrung.pdf • https://www.ee.columbia.edu/~cylin/course/bigdata/EECS6893-BigDataAnalytics-Lecture1.pdf

Next lecture • Parallel and distributed architectures • Parallel systems • Shared memory • Distributed memory • Distributed systems • Cloud computing
