Big Data

Presentation Transcript


  1. Big Data
  George K. Thiruvathukal, PhD, IEEE Computer Society / Loyola University Chicago

  2. Evolution of the tat gene from HIV isolates taken from the US, 1990-2009. We will come back to this in the case study.

  3. Topics
  • What is Big Data?
  • The Sliding Scale of Big Data
  • Brief Observations about Computing Education and Big Data
  • Sources of Big Data
  • Emerging Technologies/Techniques
    • NoSQL approaches (MongoDB)
    • Private Clouds and OpenStack
    • Post-Java Era and Scala
    • Using Python as a Glue Language
    • RESTful Thinking
  • Case Study: Building a Genomic Data Warehouse to Study HIV (and Other Virus) Evolution
  • Future Directions
  • Acknowledgments

  4. Big Data "Defined"
  "'Big Data' refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data—i.e., we don't define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)."
  Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey and Company), http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

  5. McKinsey Report on Big Data
  • $600 to buy a disk drive that can store all of the world's music
  • 5 billion mobile phones in use
  • 30 billion pieces of content shared on Facebook every month
  • 40% projected growth in global data generated per year
  • Corresponding 5% growth in global IT spending
  • 235 TB of data collected by the US Library of Congress by April 2011
  • 15 out of 17 US sectors have more data stored per company than the Library of Congress
  • $300 billion potential annual value to US health care
  • €250 billion potential annual value to Europe's public sector
  • Continued importance to retail operations (not just Wal-Mart anymore!)
  • Shortage of people with deep analytical skills
  • Shortage of data-savvy managers

  6. The Audacity of Storage Source: Seagate Electronics Web Site (seagate.com)

  7. Units of Measurement
  • Terabyte – 0.25 of these 4 TB drives
  • Petabyte – 250
  • Exabyte – 250,000
  • Zettabyte – 250,000,000
  • Yottabyte – 250,000,000,000
  • We're well on our way to exascale and beyond in most domains. Big Data today implies exascale thinking. (The counts above are sanity-checked in the sketch below.)
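
As a quick aside, the drive counts above are simple arithmetic; here is a minimal Python sketch (illustrative only, assuming decimal SI units and the 4 TB drives from the previous slide):

    DRIVE_TB = 4.0  # capacity of one of the 4 TB drives above

    # (unit name, size in terabytes)
    units = [("terabyte", 10 ** 0), ("petabyte", 10 ** 3), ("exabyte", 10 ** 6),
             ("zettabyte", 10 ** 9), ("yottabyte", 10 ** 12)]

    for name, tb in units:
        print("%-9s = %s x 4TB drives" % (name, tb / DRIVE_TB))
    # terabyte = 0.25, petabyte = 250.0, exabyte = 250000.0, ...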

  8. The CS Education Challenge
  • Big data skills are still not a primary focus at most universities
  • "Computational thinking" is good, but including "data science" is even better
  • Traditional HPC is more computationally driven; cloud computing is more application/data driven
  • Need to teach "out of core" techniques to excel in the data-driven world
  • Need to teach "web/distributed scale" in order to query the data effectively
  • Various initiatives are beginning to connect the dots
    • EduPar, to teach parallel/distributed computing principles early
    • Emergence of "data science" academic programs
  • In short, we need to rethink whether "CS" accurately describes what we are doing!

  9. Sources of Big Data
  • social media: tweets, likes (+1s), etc.
    • http://www.fastcompany.com/3013208/these-amazing-twitter-metadata-visualizations-will-blow-your-mind
  • user-generated content like photos and videos
  • VOIP traffic
  • customer and B2B transactions
  • GPS-equipped cell phones
  • mobile devices
  • system logs of all kinds
  • RFID tag sensors embedded in everything from airport runways to casino chips
  • e-mail
  • ...mostly unstructured: documents, as opposed to relations (a la RDBMS)

  10. Distributed Design Principles are Vitally Important to Big Data
  • NoSQL is emerging as the way to operate at web (distributed) scale
  • Transparency principles (from Coulouris, Dollimore, and Kindberg):
    • Access transparency: local and remote objects are accessed using identical operations
    • Location transparency: the location of resources is hidden
    • Migration transparency: resources can move without changing names
    • Replication transparency: users cannot tell how many copies exist
    • Concurrency transparency: multiple users can share resources automatically
    • Parallelism transparency: activities can happen in parallel without the user knowing about it
    • Failure transparency: concealment of faults
  • Modern NoSQL databases employ most of these, especially when combined with cloud computing (see the sketch below).
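
To make the transparency point concrete, here is a hedged pymongo sketch (the host names and database are hypothetical): the client addresses a replica set by name, and which physical member actually serves a request is hidden from the application.

    from pymongo import MongoClient

    # Three hypothetical replica-set members; the driver discovers the current
    # primary and routes operations accordingly (location/replication/failure
    # transparency in the Coulouris sense).
    client = MongoClient(
        "mongodb://db1.example.org,db2.example.org,db3.example.org/?replicaSet=rs0")
    doc = client["hiv"].posts.find_one({"gene": "gag"})  # identical call, local or remote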

  11. MongoDB: Secret Sauce of Many Big Data Environments
  • Document storage model (JSON/BSON)
  • Ad hoc/dynamic schemas (by default)
    • JSON does not imply the total absence of a schema: http://json-schema.org/
  • Distributed design principles/sharding
  • Replication and high availability
  • Distributed query processing
  • Full indexing support
  • Atomic updates and transactions (if required)
  • Map/reduce a la Hadoop (in JavaScript)
  • Other embedding possible with JavaScript (built around Node.js technology)
  • File storage via GridFS
  • ...these capabilities are not limited to MongoDB; see http://couchdb.apache.org/ (from the same organization behind Hadoop)
  • (A short pymongo sketch of a few of these capabilities follows.)
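
A hedged sketch of three of these capabilities from Python (the collection and field names are ours; the calls are standard pymongo of this era):

    from pymongo import MongoClient

    posts = MongoClient()["hiv"].posts  # localhost defaults

    # Dynamic schema: documents in the same collection need not share fields.
    posts.insert({"accession": "HM067748", "gene": "gag", "country": "China"})
    posts.insert({"accession": "JN248316", "gene": "env"})  # no country; still fine

    # Full indexing support: secondary index on any attribute.
    posts.ensure_index("gene")

    # Atomic single-document update.
    posts.update({"accession": "HM067748"}, {"$set": {"note": "reviewed"}})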

  12. Distributed Architecture of MongoDB
  • mongod – the main database process, one per shard
    • Replica sets allow for failover execution with a revolving-master model
  • mongos – routers that expose the collection of shards as a single server
  • Config servers – servers that hold metadata about the cluster and which chunks of data reside on each shard
  • MongoDB allows reasonable operation to continue (often in read-write mode) in the presence of single daemon/node failures
  • (A hedged setup sketch follows.)
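
The mongo shell's sh.addShard()/sh.enableSharding() helpers wrap admin commands that any driver can issue against a mongos router. A hedged sketch of cluster setup from Python; the host names, shard replica-set names, and the choice of shard key are hypothetical:

    from pymongo import MongoClient

    admin = MongoClient("mongos1.example.org", 27017).admin  # talk to a router

    admin.command("addShard", "rs0/shard0a.example.org:27018")  # one replica set per shard
    admin.command("addShard", "rs1/shard1a.example.org:27018")
    admin.command("enableSharding", "hiv")                      # shard this database
    admin.command("shardCollection", "hiv.posts", key={"accession": 1})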

  13. MongoDB Sharding/Replicas

  14. Sharding (Literature)
  • Google BigTable: http://research.google.com/archive/bigtable.html
  • PNUTS, Yahoo!'s hosted data serving platform: http://www.brianfrankcooper.net/pubs/pnuts.pdf

  15. NoSQL vs. and/or Hadoop
  • It doesn't need to be either-or; it can be "and".
  • Hadoop
    • Ideal for periodic map-reduce jobs (e.g. ETL, or building a data warehouse from multiple sources)
    • Transparent support for clustered execution
  • NoSQL (e.g. MongoDB, a front-runner in many projects)
    • Built-in aggregation, including map-reduce support a la Hadoop (see the sketch after this list)
    • Sharding allows for distributed query processing from any node (and in any language)
  • Additional reading:
    • http://docs.mongodb.org/ecosystem/use-cases/hadoop/
    • http://docs.mongodb.org/manual/tutorial/map-reduce-examples/
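
For instance, counting imported sequences per gene needs no Hadoop job at all. A hedged sketch using MongoDB's built-in aggregation framework from pymongo (database and field names are ours):

    from pymongo import MongoClient

    db = MongoClient()["hiv"]
    pipeline = [{"$group": {"_id": "$gene", "count": {"$sum": 1}}},
                {"$sort": {"count": -1}}]

    result = db.posts.aggregate(pipeline)
    rows = result["result"] if isinstance(result, dict) else result  # pymongo 2.x vs. later
    for row in rows:
        print("%s: %d" % (row["_id"], row["count"]))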

  16. OpenStack IaaS (I = Infrastructure)

  17. OpenStack Components
  • Object Store: storage and retrieval of files
  • Image: catalog/repository of virtual disk images
  • Compute: virtual servers on demand
  • Dashboard: user interface for accessing all components
  • Identity: authentication/authorization
  • Network: network connectivity as a service (on demand as well)
  • Block Storage: block storage for guest VMs (similar to iSCSI)
  • http://en.wikipedia.org/wiki/OpenStack

  18. Emerging Case Study
  • Phylogenetic Analysis of HIV-1
    • Evolution of HIV happens so rapidly that we need an online analytical system for understanding it in space and time.
  • Components/Pipeline
    • Genbank data ETL (Scala + Python); import into a MongoDB warehouse
    • RESTful querying to slice/dice gene information into FASTA (the format used by alignment tools)
    • Example of server-side JavaScript map-reduce embedded analytics using MongoDB
    • Alignment and visualization using existing (offline) tools
    • http://ecommons.luc.edu/cs_facpubs/68/
  • Our approach is best summarized: use the best tools and languages for the task at hand (polyglot).
  • The case study uses 3 programming languages, MongoDB, a web services framework (Flask), existing bioinformatics tools, and a test VMware cluster for hosting it all before moving to an IaaS solution.

  19. Working with Genbank Data
  • Not completely unstructured, but messy nevertheless.
  • Records often contain errors, owing to the complexity of the domain.
    • Errors in most cases are innocuous. We can correct them offline and add the records to the warehouse later (or whenever).
  • Parsers present unwanted complexity; see the API at http://www.biojava.org/docs/api16/org/biojava/bio/seq/io/SeqIOTools.html
  • Our approach: transform the Genbank data directly into MongoDB "documents" (JSON objects) for a posteriori and long-term analysis...
    • ...and never think about the Genbank format again! (A sample document appears below.)
  • The focus is on extracting features of interest, although our future effort will be to transform the entire Genbank corpus into documents for easier parsing/processing (for understanding other viruses, etc.)
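
For example, the HM067748 record shown on the next two slides boils down to one document per gene of interest. A sketch of its gag document as a Python dict (field names match our importer; the sequence is abridged, and the note value is a hypothetical placeholder since the record carries no /note qualifier):

    document = {
        "accession": "HM067748",
        "gene": "gag",
        "country": "China",
        "date": "16-Oct-2006",
        "note": "unknown",  # importer default (UNKNOWN_NOTE) when no /note qualifier exists
        "sequence": "atgggtgcgagagcgtcaatattaagagggga..."  # bases 747..2226
    }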

  20. LOCUS       HM067748                9680 bp    DNA     linear   VRL 27-JUN-2010
      DEFINITION  HIV-1 isolate nx2 from China, complete genome.
      ACCESSION   HM067748
      VERSION     HM067748.1  GI:298919707
      KEYWORDS    .
      SOURCE      Human immunodeficiency virus 1 (HIV-1)
        ORGANISM  Human immunodeficiency virus 1
                  Viruses; Retro-transcribing viruses; Retroviridae;
                  Orthoretrovirinae; Lentivirus; Primate lentivirus group.
      REFERENCE   1  (bases 1 to 9680)
        AUTHORS   Miao,W., Liu,Y., Wang,Z., Zhuang,D., Bao,Z., Li,H., Liu,S., Li,L. and Li,J.
        TITLE     Sequence and characterization of full-length genome of two HIV-1
                  strains isolated from two infected patients in China
        JOURNAL   Unpublished
      FEATURES             Location/Qualifiers
           source          1..9680
                           /organism="Human immunodeficiency virus 1"
                           /proviral
                           /mol_type="genomic DNA"
                           /isolate="nx2"
                           /host="Homo sapiens"
                           /db_xref="taxon:11676"
                           /country="China"
                           /collection_date="16-Oct-2006"
           LTR             1..591
           gene            747..2226
                           /gene="gag"
                           /note="gag protein"
           gene            <2030..5030
                           /gene="pol"
                           /note="pol protein"
           gene            4975..5556
      Most of the data in this flat file is not important to our study, Watson. We need to import the key fields (shown in bold on the original slide: accession, country, collection date, and the gene locations) and extract the DNA shown on the next slide. (Both of these slides are one Genbank data file.) [Image: Crick and Watson, DNA (1953)]

  21. ORIGIN
            1 ttgatttgtgggtctatcacacacaaggctacttccctgattggcacaactacacaccgg
           61 gaccagggaccagattcccgctgacttttgggtggtgcttcaagctagtaccagttgacc
          121 caagggaagtagaagaggccagcgaaggagaagacaacagtttgctacaccctgtctgcc
          181 agcatggaatggaggatgaacacagagaagtgttaaagtggaagtttgacagccaattag
          241 catacagacactgggcccgcgagctacatccggagttttacaagaactgctgatacagaa
          301 gggactttccgcgggactttccaccagggcgttccgggaggtgtggtctgggcggtactg
          361 ggagtggtcaaccctcagatgctgcatataagcagctgctttgcgcctgtaccgggtctc
          421 ttagttagaccagatctgagcctgggagctctctggctagctaggaacccactgcttaag
          481 cctcaataaagcttgccttgagtgctctgagcagtgtgtgcccatctgttgtgtgactct
          541 ggtaactagagatccctcagacccttgtggcagtgtggaaaatctctagcagtggcgccc
          601 gaacaggggcaagaaaaggaaaatgagacccgaggggatttcttgacgcaggactcggct
          661 tgctgaagtgcactcggcaagaggcgagaggggcgactggtgagtacgccaattttattt
          721 gactagcggaggctagaaggagagagatgggtgcgagagcgtcaatattaagaggggaaa
          781 aattggataaatgggaaagaattaggttaaggccagggggaaagaaacactatctgctaa
          841 aacacatagtatgggcaagcagagagctggaaaaatttgcacttaaccctggccttttag
          901 agacatcagaaggatgtaagcaaataataaaacagctacaaccagctcttcagacaggaa
          961 cagaggaacttaaatcattatacaacacagtagcagttctctattgtgtacatgaaaaaa
         1021 tagacatacgagacaccaaagaagccttagacaagatagaagaagaacaaaataaatgtc
         1081 agcagaaaacacagcaggcaaaaaaggatgatgagaaggttagtcaaaattatcctatag
         1141 tgcagaatctccaagggcacatggtacatcagcctctatcacctagaactttaattgcat
         1201 gggtaatagtagtggacagagaagactccttagctcagaagtaatacccctgttcacagc
         1261 ataatcagaaggagccaccccacaagatctaaactccatgttaaatacagtagggcgaca
         1321 tcaagcagctatgcaaatgttaaaagataccatcaatggagaggctgcagaatgagatag
         1381 attgcatccagtgcatgcagggccagtggcaccaggccagatgagagaaccaaggggtag
         1441 tgacatagcaggaactactagtactctccaggagcaaataggatggatgacaaataatcc
         1501 acctatcccagtaggagaaatctataaaagatggataatcgtcggattaaataaattagt
         [...]
         9541 gcctgggagctctctggctagctaggaacccactgcttaagcctcaataaagcttggctt
         9601 gagtgctctgagcagtgtgtgcccatctgttgtgtgactctggtaactagagatccctca
         9661 gacccttgtggcagtgtgga
      One of the genes of interest (gag) is at positions 747..2226. We need to extract lots of these for our HIV repository!

  22. Basic Structure
  • Genbank is a collection of Sequences
  • A Sequence is a collection of Features
  • Sequences and Features have annotations (e.g. simple key/value pairs, possibly with ad hoc structure within)
  • We use Scala (an emerging object-functional language that runs on the JVM) to parse this format.
    • The task is naturally suited to the stream-oriented facilities found in functional languages
    • Support for "failure" as a concept (Option) allows processing to continue meaningfully
    • Leverages the existing BioJava library, which is adapted to Scala
  • The Scala code writes a delimiter-separated file, which is postprocessed by Python to create/update MongoDB entries.

  23. Scala Genbank Parsing
  • The next few slides show:
    • the Scala Genbank file parser (shows how to extract the entire corpus of files as a stream)
    • the Python postprocessor that transforms the flattened stream produced by Scala into a MongoDB collection (Python is a great glue language!)
    • the Python RESTful API to query the collection for offline analytics (using Clustal and visualization tools)
    • an embedded-analytics example using native map-reduce with JavaScript in MongoDB
  • The full source for what we're doing is available from our Bitbucket repository: https://bitbucket.org/loyolachicagocs_bioi/hiv-biojava-scala
  • This is still a work in progress but is being used to build our computational biology data warehouse.

  24. Scala Genbank File Parser (Importer)

  25. // imports reconstructed for completeness; assumes BioJava 1.x and Scala's JavaConverters
      import java.io.{BufferedReader, FileReader}
      import java.util.{Iterator => JIterator, Map => JMap}
      import scala.collection.JavaConverters._
      import org.biojava.bio.Annotation
      import org.biojava.bio.seq.Sequence
      import org.biojava.bio.seq.io.SeqIOTools

      object TryBio {
        case class SourceInformation(country: String, collectionDate: String, note: String)
        case class SequenceInformation(accession: String, origin: String)
        case class GeneInformation(gene: String, start: Int, end: Int)

        /** Converts an annotation to a properly typed Scala map. */
        implicit def annotationAsScalaMap(annotation: Annotation) =
          annotation.asMap.asInstanceOf[JMap[String, String]].asScala

        // Pull each sequence from the Genbank file as a stream and emit one
        // pipe-delimited line per (sequence, gene) pair of interest.
        def processFile(file: java.io.FileReader) = {
          val sequences: JIterator[Sequence] = SeqIOTools.readGenbank(new BufferedReader(file))
          for {
            seq <- sequences.asScala
            seqInfo <- getSequenceInformation(seq)
            sourceInfo <- getSourceInformation(seq)
            gene <- getGenes(seq)
          } {
            val fields = List(seqInfo.accession, gene.gene, sourceInfo.country,
              sourceInfo.collectionDate, sourceInfo.note,
              seqInfo.origin.substring(gene.start, gene.end))
            println(fields.mkString("|"))
          }
        }

        def main(args: Array[String]) {
          for (arg <- args) {
            val f = new FileReader(arg)
            processFile(f)
            f.close()
          }
        }
      }

  26. def getSequenceInformation(sequence: Sequence): Option[SequenceInformation] =
        for {
          // returns None for sequences without accession so they get skipped in main
          acc <- sequence.getAnnotation get "ACCESSION"
          origin = sequence.seqString
        } yield SequenceInformation(acc, origin)

      def getSourceInformation(sequence: Sequence): Option[SourceInformation] =
        for {
          // returns None for non-source sequences so they get skipped in main
          f <- sequence.features.asScala.find { _.getType == "source" }
          a = f.getAnnotation
        } yield SourceInformation(
          a.getOrElse("country", UNKNOWN_COUNTRY),
          a.getOrElse("collection_date", UNKNOWN_DATE),
          a.getOrElse("note", UNKNOWN_NOTE))

      private val allowedGenes = Set("gag", "pol", "env", "tat", "vif", "rev", "vpr", "vpu", "nef")

      def getGenes(sequence: Sequence): Iterator[GeneInformation] =
        for {
          f <- sequence.features.asScala
          // skip features without gene annotation
          g <- f.getAnnotation get "gene"
          if f.getType == "CDS" && (allowedGenes contains g)
          l = f.getLocation
        } yield GeneInformation(g, l.getMin - 1, l.getMax - 1)

  27. Using Python to Import Genbank Stream into Mongo

  28. import sys
      from pymongo import MongoClient

      def main():
          mongo_db_name = sys.argv[1]
          # Assume Mongo is running on localhost at its defaults
          client = MongoClient()
          db = client[mongo_db_name]
          if db.posts.count() > 0:
              print("Mongo database %s is not empty. Please create a new one." % mongo_db_name)
              sys.exit(1)
          for line in sys.stdin:
              # Each input line is one pipe-delimited record from the Scala parser.
              (accession, gene, country, date, note, sequence) = line.strip().split("|")[:6]
              document = {
                  'accession': clean(accession),  # clean() normalizes a field; defined elsewhere in the importer
                  'gene': clean(gene),
                  'country': clean(country),
                  'date': clean(date),
                  'note': clean(note),
                  'sequence': sequence
              }
              db.posts.insert(document)
          print("Wrote %d documents" % db.posts.count())
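
The Scala parser writes pipe-delimited records to stdout and this importer reads stdin, so the two stages chain on the command line. A hedged sketch (the importer script name is hypothetical, and the Scala invocation assumes a compiled classpath):

    scala TryBio data/*.gb | python import_genbank.py hiv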

  29. RESTful Services using Python and Flask Micro Web Framework

  30. RESTful Queries for Querying the Genbank HIV Corpus
  • The RESTful architectural style exposes the collection as a discoverable hierarchy of resources:
    • <base>/genbank: returns the datasets that we've imported (hiv is the only one right now)
    • <base>/genbank/<collection>: returns the list of discovered genes
    • <base>/genbank/<collection>/<gene>: returns a FASTA file for all records where <gene> was present
    • <base>/genbank/<collection>/unknown/<thing>: produces a report of what data were not imported (<thing> is a code indicating the country, date, or notes needed to support the previous three common queries)
  • Self-hosted live sandbox (not OpenStack based yet): http://tirtha.cs.luc.edu:5000/genbank/
  • (A client-side sketch follows.)
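
Any HTTP client can walk this hierarchy; a minimal sketch with the Python requests library (the output file name is ours, and feeds Clustal/BioEdit offline):

    import requests

    BASE = "http://tirtha.cs.luc.edu:5000"

    print(requests.get(BASE + "/genbank").text)      # datasets -> "hiv"
    print(requests.get(BASE + "/genbank/hiv").text)  # genes -> gag, env, ...

    fasta = requests.get(BASE + "/genbank/hiv/gag").text  # FASTA for the gag gene
    with open("gag.fasta", "w") as f:                     # align offline with Clustal
        f.write(fasta)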

  31. Queries (Live)
  • Show datasets: http://tirtha.cs.luc.edu:5000/genbank/
    • You'll see "hiv", perhaps others (test databases)
  • Show genes within the "hiv" collection: http://tirtha.cs.luc.edu:5000/genbank/hiv/
    • You'll see gene names, e.g. gag, env, ...
  • Show a FASTA file for offline alignment/analytics (eventually we'll embed this in our web services)
    • http://tirtha.cs.luc.edu:5000/genbank/hiv/env (FASTA for the env gene across ALL datasets)
    • http://tirtha.cs.luc.edu:5000/genbank/hiv/gag (FASTA for the gag gene across ALL datasets)
  • These queries are all served by a RESTful service, written directly with the Flask micro web framework and MongoDB. We'll look at the code.

  32. # (fragments of the Flask app; imports of Flask, Response, and MongoClient omitted on these slides)
      @app.route("/genbank")
      def get_databases():
          client = MongoClient()
          db_names = client.database_names()
          text = '\n'.join(db_names)
          print("DB Names", text)
          resp = Response(text, status=200, mimetype='text/plain')
          return resp

  33. @app.route("/genbank/<collection>")
      def get_collection_gene_names(collection):
          text = '\n'.join(get_collection_genes(collection))
          resp = Response(text, status=200, mimetype='text/plain')
          return resp

      def get_collection_genes(collection):
          client = MongoClient()
          db = client[collection]
          return db.posts.distinct('gene')

  34. FASTATEMPLATE = """>%(accession)s|%(gene)s|%(country)s|%(date)s|%(note)s
%(sequence)s
"""
      # (trailing newline keeps successive FASTA records separated)

      def get_fasta(collection, gene):
          client = MongoClient()
          db = client[collection]
          cursor = db.posts.find({ 'gene' : gene })
          fasta = StringIO.StringIO()
          for item in cursor:
              fasta.write(FASTATEMPLATE % item)
          text = fasta.getvalue()
          fasta.close()
          return text

      @app.route("/genbank/<collection>/<gene>")
      def get_collection_gene(collection, gene):
          resp = Response(get_fasta(collection, gene), status=200, mimetype='text/plain')
          return resp

  35. Doing Hadoop Map-Reduce Style Processing directly (via Mongo Shell)

  36. Map-Reduce, Mongo Style
  • JSON is the native storage format of MongoDB; JavaScript is the native query language.
  • Much of what Hadoop does can be done in JavaScript without writing full Java programs/classes.
  • Map function: projects a list of key/value pairs from JSON documents (by selecting attributes of interest)
  • Reduce function: iterates over the keys and/or values of interest using JavaScript libraries for aggregate operations (or your own code)
  • Fully interactive execution makes it easy to test code without launching executable jobs
  • This doesn't fully replace Hadoop (no job control), but that gap can be addressed with off-the-shelf job schedulers/load balancers (say, in a cluster).
  • The following example shows a map-reduce computation that determines the average length of the nucleotide sequences discovered in our Genbank data set.
  • This can run in parallel/distributed mode in a sharded configuration.

  37. Map/Reduce Using MongoDB
      var computeAvgSequenceLength = function(accession, sequences) {
          var total = 0;
          for (var i = 0; i < sequences.length; i++) {
              total += sequences[i].length;
          }
          return (total + 0.0) / sequences.length;
      };

      var emitByAccessionSequence = function() {
          emit(this.accession, this.sequence);
      };

      db.results.remove();
      db.posts.mapReduce(emitByAccessionSequence, computeAvgSequenceLength, { out: "results" });

      var results = db.results.find();
      while (results.hasNext()) {
          var result = results.next();
          print("average(", result['_id'], ") = ", result['value']);
      }

  38. Output
      $ mongo localhost:27017/hiv mapreduce.js
      average( JN235962 ) = 1694.5714285714287
      average( JN235963 ) = 1705.857142857143
      average( JN235964 ) = 1703.7142857142858
      average( JN235965 ) = 1563.7142857142858
      average( JN248316 ) = 1443.375
      average( JN248317 ) = 1446
      average( JN248318 ) = 1570
      average( JN248319 ) = 1425
      average( JN248320 ) = 1431.857142857143
      average( JN248321 ) = 1591.4444444444443
      average( JN248322 ) = 1584.3333333333333
      average( JN248323 ) = 1441.875
      average( JN248324 ) = 1430.142857142857
      average( JN248325 ) = 1579
      average( JN248326 ) = 1558.8333333333333
      average( JN248327 ) = 1439.625
      average( JN248328 ) = 1562.5714285714287
      average( JN248329 ) = 1558.4444444444443
      average( JN248330 ) = 1570.5555555555557
      average( JN248331 ) = 1451.375
      ...
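
"In any language" includes Python: the same computation can be driven through pymongo, which ships the JavaScript to the server. A hedged sketch (database name is ours):

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["hiv"]

    mapper = Code("function() { emit(this.accession, this.sequence); }")
    reducer = Code("""
    function(accession, sequences) {
        var total = 0;
        for (var i = 0; i < sequences.length; i++) total += sequences[i].length;
        return total / sequences.length;
    }
    """)

    results = db.posts.map_reduce(mapper, reducer, "results")  # returns the output collection
    for r in results.find():
        print("average( %s ) = %s" % (r["_id"], r["value"]))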

  39. Early Visualizations of HIV Evolution (from our warehouse/web services)
  • The pipeline presented thus far allows us to use existing tools/services to do the visualization
    • The BioEdit workstation tool shows a colorized view/alignment of the sequence data obtained by gene.
    • The Dendroscope library shows the hierarchical decomposition (the phylogenetic tree) of how the virus has evolved.
  • The details are beyond the scope of this talk, but we know we can get from here to a real-time, longitudinal view of what HIV is doing.
  • Future work will embed all of this as RESTful services and use a cluster to render any visualization on demand.

  40. Colorized view of FASTA data (acquired by web service/Mongo collection)

  41. Examples of trees – evolution of the tat gene from HIV isolates taken from the US 1990-2009

  42. Same tree… different view

  43. Same tree… different view

  44. Future Directions
  • Deploy to an OpenStack-based private cloud (in progress); once the private cloud is established, we hope to import the entire Genbank corpus and other genomics data sets
  • Incorporate alignment as embedded analytics to precompute visualizations of interest
  • Integrate visualization into the web services
  • Work on a new predictive piece to identify emerging threats/mutations
  • Early results suggest MongoDB can do most queries on important slices of the data (virus/gene) in fractions of a second, including map-reduce style

  45. Acknowledgments
  • Debbie Sims and colleagues at the IEEE Computer Society, for the opportunity to give this webinar
  • Catherine Putonti, Loyola University Chicago (Biology and Computer Science)
  • Steven Reisman (graduate student, Computer Science), for work on the longitudinal visualizations
  • Joe Kaylor and Konstantin Läufer, Loyola University Chicago, for pairing on the Scala and RESTful services work
  • Manish Parashar, Rutgers University (Computer Science), for discussions of our shared view of big data
  • Rusty Eckman (Northrop Grumman), for his helpful input and feedback
