DataLines: a framework for building streaming data applications

  1. DataLines: a framework for building streaming data applications Mike Haberman, Senior Software/Network Engineer, mikeh@ncsa.edu

  2. The Problem • Data deluge: routers, switches, IDS, servers (web, mail, logs, etc.), software (tcpdump, web100, SNMP, tarpit, etc.), sensors, taps, … (help me)

  3. The Problem (continued) • Disparate data formats • Software (sometimes) to manage each • Tweaking to get what you want (custom software) • Correlating data (more custom software)

  4. DataLines • Can we build a framework that removes all (or at least most) of the tedium of working with these disparate data formats?

  5. DataLines Framework • designed to manage and build streaming data processing applications

  6. DataLines Framework • designed to manage and build streaming data processing applications

  7. DataLines Framework designed to manage and build streaming data processing applications • Manage: we would like one tool to handle all these different data sources.

  8. DataLines Framework designed to manage and build streaming data processing applications • Build: a uniform way of creating a data processing application.

  9. DataLines Framework designed to manage and build streaming data processing applications
  • Streaming data:
    • A never-ending stream of ‘manageable’ chunks of data
    • No random access, no blocking operators
    • One pass over the data: linear or sub-linear algorithms/data operations (see the sketch below)
    • Each data item (a tuple in DataLines) is an independent entity
    • Many tools were not designed for streaming data
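
To make the one-pass constraint concrete, here is a minimal sketch in Java (illustrative only, not DataLines code) of a statistic that fits the streaming model: Welford's running mean/variance looks at each item exactly once, keeps constant state, and never needs random access or a blocking operator.

// Illustrative one-pass statistic (Welford's algorithm): O(1) memory,
// one look per item, which is the kind of operation a streaming
// framework like DataLines is built around.
public class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // sum of squared deviations from the mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    public double mean()     { return mean; }
    public double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] { 4.0, 7.0, 13.0, 16.0 }) {
            stats.add(x);            // each tuple is seen once, then discarded
        }
        System.out.printf("mean=%.2f variance=%.2f%n", stats.mean(), stats.variance());
    }
}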

  10. DataLines Framework designed to manage and build streaming data processing applications • Processing: • Something you want to do to the data (e.g. reading, writing, parsing, event generation, filtering, statistics, reports, data synopsis, …)

  11. DataLines • Creating a DataLines application: XML file → “compile” → DataLines Application

  12. DataLines • The XML file defines 3 major components:
    • Data Processors: what one does with the data
    • Processing Order: the order in which the processors will operate on the data
    • Event Management: what to do when a processor generates an event

  13. DataLines Processors • Data Processors are the heart of DataLines
    • I/O: socket, file
    • Filters: inline, dispatch
    • Collectors: binning, windowing (with operators)
    • GUI: charts, picture taking
    • Converters: binary to tuple
    • Misc: printers, counters, iterators, timers, data generators, gates, delays
  • Processors can generate events
  • Processors can drop, mutate, or mutilate the tuple being processed, or generate new tuples (a sketch follows)
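
The deck never shows the processor API itself, so the Java sketch below is hypothetical: the Processor interface, its method name, and the Map used as a stand-in for a tuple are all invented for illustration. It only captures the contract just described: a processor may pass a tuple through, mutate it, or drop it.

import java.util.Map;

// Hypothetical processor contract (invented names, not the real API).
interface Processor {
    /** Return the tuple to forward downstream, or null to drop it. */
    Map<String, Object> process(Map<String, Object> tuple);
}

// An inline filter in that style: forward only tuples whose "length"
// field meets a minimum; everything else is dropped.
class MinLengthFilter implements Processor {
    private final int minLength;

    MinLengthFilter(int minLength) { this.minLength = minLength; }

    @Override
    public Map<String, Object> process(Map<String, Object> tuple) {
        Number length = (Number) tuple.get("length");
        return (length != null && length.intValue() >= minLength) ? tuple : null;
    }
}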

  14. DataLines Pipelines • Control tuple movement among processors • Can connect either processors or other pipelines • Two paths within a pipeline: binary and tuple
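
Continuing the same hypothetical sketch (again, invented names rather than the real DataLines API): if a pipeline itself implements the processor contract, then pipelines can connect either processors or other pipelines, as the slide says.

import java.util.List;
import java.util.Map;

// Same hypothetical contract as in the slide-13 sketch.
interface Processor {
    Map<String, Object> process(Map<String, Object> tuple); // null = drop
}

// A pipeline is an ordered chain of stages and is itself a Processor,
// so pipelines compose with processors and with other pipelines.
class Pipeline implements Processor {
    private final List<Processor> stages;

    Pipeline(List<Processor> stages) { this.stages = stages; }

    @Override
    public Map<String, Object> process(Map<String, Object> tuple) {
        for (Processor stage : stages) {
            tuple = stage.process(tuple);
            if (tuple == null) {
                return null;   // a stage dropped the tuple; stop the chain
            }
        }
        return tuple;
    }
}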

  15. Event Management
  • Allow processors to signal an event (timers, open/close, client connects, etc.)
  • Allow the user to tie in domain logic
  • Allow the user to call a processor-specific API

  16. DataLines Data
  • The generalization of data is a DlTuple
  • A tuple is just a set of values
  • DlTuple is the interface processors use:
    • String[] getFieldNames()
    • DlValue getValue(fieldName)
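
Only the two methods above come from the deck; everything else in this sketch (the shape of DlValue, the map-backed implementation) is an assumption added so the example is complete and compiles.

import java.util.LinkedHashMap;
import java.util.Map;

// The shape of DlValue is assumed; the deck only names the type.
interface DlValue {
    Object raw();
}

// The two methods here are the ones listed on the slide.
interface DlTuple {
    String[] getFieldNames();
    DlValue getValue(String fieldName);
}

// A trivial map-backed implementation, for illustration only.
class MapTuple implements DlTuple {
    private final Map<String, DlValue> fields = new LinkedHashMap<>();

    void put(String name, Object value) {
        fields.put(name, () -> value);   // DlValue as a lambda over the raw value
    }

    @Override
    public String[] getFieldNames() {
        return fields.keySet().toArray(new String[0]);
    }

    @Override
    public DlValue getValue(String fieldName) {
        return fields.get(fieldName);
    }
}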

  17. DataLines Data
  • Tuples can have virtual fields (calculated values, static values)
  • Tuples can have composite fields
  • The creation of the tuple is left to the processor in charge of conversion

  18. XML Syntax … run away!
  <application>
    <dataline name="dl">
      <processor name="reader" type="FileReader">
        <configInfo> </configInfo>
      </processor>
      <!-- the "parser" and "printer" processors would be declared the same way -->
      <pipeline name="p1">
        <pipe from="reader" to="parser" />
        <pipe from="parser" to="printer" />
      </pipeline>
      <eventManagement>
        <event name="start">
          <call method="start" target="reader" />
        </event>
        <event name="alert" from="reader">
          <call method="stop" target="parser" />
        </event>
      </eventManagement>
    </dataline>
  </application>

  19. Data Example
  <arg name="tupleField">
    <map name="name" value="Src Ip" />
    <map name="peer" value="IpV4AddressPeer" />
    <map name="length" value="4" />
  </arg>

  20. Data Example
  <arg name="tupleField">
    <map name="name" value="A" />
    <map name="peer" value="IntegerPeer" />
    <map name="length" value="4" />
  </arg>
  <arg name="tupleField">
    <map name="name" value="B" />
    <map name="peer" value="IntegerPeer" />
    <map name="length" value="4" />
  </arg>
  <arg name="tupleField">
    <map name="name" value="C" />
    <map name="peer" value="JepPeer" />
    <data name="expression"> ${A} + ${B} </data>
  </arg>
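
In this example A and B are four-byte integer fields parsed from the raw data, while C is a virtual field (slide 17) whose value is computed on access from the JEP expression ${A} + ${B}. The Java sketch below is illustrative only: ComputedTuple and its methods are invented names, not the DataLines API.

import java.util.Map;

// Illustrates the virtual-field idea: C is never stored; it is
// recomputed from A and B every time it is read.
class ComputedTuple {
    private final Map<String, Integer> stored;   // the real fields: A and B

    ComputedTuple(Map<String, Integer> stored) { this.stored = stored; }

    int getValue(String field) {
        if (field.equals("C")) {                 // virtual field: ${A} + ${B}
            return stored.get("A") + stored.get("B");
        }
        return stored.get(field);
    }

    public static void main(String[] args) {
        ComputedTuple t = new ComputedTuple(Map.of("A", 3, "B", 4));
        System.out.println(t.getValue("C"));     // prints 7
    }
}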

  21. DataLines Tutorial • Fast forward past a painful 3 hour tutorial covering each of those sections in detail (tuples, processors, pipelines, event management, configurations) • You have seen all the XML though!

  22. DataLines Distilled
  • A library of data processors that operate on “Tuples” (one of the processors takes the raw data and creates the tuple)
  • An XML compiler that takes the XML file and the library and creates an application

  23. DataLines Example

  24. DataLines in use • DataLines does make it easier to hit the ground running: much of the tedious work you would otherwise have to do is taken care of • For highly specific needs, you still need to write code, but that code then becomes part of the DataLines library that others can build on

  25. Balance Sheet
  • Positive
    • Flexible (vendor neutral, data, debugging)
    • Reusable (pipelines, processors)
    • Fast development time
    • “Easy” to change the client (CLI, desktop, web page)
  • Negative
    • May need to write domain-specific code
    • Learning curve: processor configuration, data expectations, events

  26. DataLines in Action
  • Network Engineering group: monitor router, tar pit, IDS, packet sampling, L2/L3 mappings
  • Security group: network forensics
  • Intergroup wiring: use DataLines to share data between groups/projects

  27. DataLines in Action
  • Network Research group: monitor cluster network activity from the MPI layer
  • Data Mining
  • Misc. NSF data-oriented projects

  28. Future • Open Source • More Info: mikeh@ncsa.edu • http://datalines.ncsa.uiuc.edu (a work in progress)
