
Integrating Cybersecurity Log Data Analysis in Hadoop
Bryan Stearns, Susan Urban, Sindhuri Juturu
Texas Tech University, 2014 NSF Research Experiences for Undergraduates Site Program



Presentation Transcript


1. Integrating Cybersecurity Log Data Analysis in Hadoop
Bryan Stearns, Susan Urban, Sindhuri Juturu
Texas Tech University 2014 NSF Research Experiences for Undergraduates Site Program

• Conclusion:
• The HVID design:
• Unifies data into a common format via a unique value-based structure
• Allows heterogeneous datasets to be merged by any field
• Supports external or internal data mining and manipulation
• Supports internal MapReduce on merged data
• Should allow value-based merge and join queries to run in time comparable to plain select queries
• Requires less space than comparable row-oriented solutions*
• Can utilize backup copies for improved data interconnectivity
• Requires further research and development
• Provides a means to unite dirty cybersecurity data in storage without the need to explicitly outline how information should be merged until it is needed.

• Introduction
• The amount of unstructured or semi-structured data recorded in cybersecurity endeavors is growing every day [1].
• Increasingly sophisticated cybersecurity threats require large-scale holistic pattern analysis for proper detection [1][2].
• Heterogeneous “dirty” data generated from multiple network sources needs customized unification to be useful for such purposes [3].
• MapReduce is desirable for analyzing unstructured data, but existing unification methods structure data externally from unstructured storage platforms, requiring additional I/O to feed merged data back into storage [3].
• A better method is needed to unify and manipulate disparate information from across services in a network!

• Abstract
• In cybersecurity, the growth of Advanced Persistent Threats (APTs) and other sophisticated attacks makes it increasingly important to analyze network and system activities from all event sources. If logs are recorded through different software packages, the resulting big data can be in a form known as dirty data that is difficult to merge and cross-analyze.
Additionally, storing logs in a big data architecture like Apache Hadoop can make data joins time-consuming. Merging data as it is stored rather than at access time can greatly simplify unified analysis, yet doing so requires knowledge of what kinds of merges will be required. Services wishing to holistically analyze dirty data must merge information externally each time data is pulled. This research is developing a file system called HVID (Hadoop Value-oriented Integration of Data) that will represent dirty log data as a single table while maintaining its raw form, using a novel variant of column-oriented storage. The system utilizes the open-source big data system Hadoop HBase to enable fast access to unified views without the need for predetermined joins. This design will allow more natural and efficient holistic analysis of stored cybersecurity data, both with external mining applications and with local MapReduce tools.

• Design:
• An inverted variant of column-oriented storage was developed in which values are the primary key and row IDs are the dependent values.
• This physical association of rows with shared values allows dynamic views on relations without lengthy scans and comparisons.
• HBase was chosen as the HVID platform for its flexibility, its support for read-intensive applications, and its ability to integrate with Hadoop MapReduce.
• Value-oriented data reside in row-key byte arrays for quick scanning.
• Rows are sorted for fast collection of values from contiguous ranges.
• Large unstructured data is stored in a separate row-oriented table.
• This row-oriented form can be used to store backups of value-oriented data while enhancing row ID resolution of value-based queries.
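As an illustration of this inverted layout, the sketch below simulates a sorted row-key space (as HBase maintains) with a Java `TreeSet`. The `value:table:rowID` key format and all class/method names here are hypothetical stand-ins for exposition, not the actual HVID implementation:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of a value-oriented layout (hypothetical names): each
// (value, source table, row ID) triple is flattened into one sortable key,
// so all rows sharing a value become physically contiguous once sorted.
public class ValueOrientedSketch {
    // Compose a value-oriented key: the value leads, the row ID is dependent.
    public static String key(String value, String table, String rowId) {
        return value + ":" + table + ":" + rowId;
    }

    // Collect every (table, rowID) holding a given value with one contiguous
    // range scan: keys for the value fall in ["v:", "v;") because ';'
    // immediately follows ':' in ASCII order. No full scan is needed.
    public static SortedSet<String> scanValue(TreeSet<String> keys, String value) {
        return keys.subSet(value + ":", value + ";");
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>();
        keys.add(key("a1", "T1", "row1"));
        keys.add(key("b1", "T1", "row1"));
        keys.add(key("b1", "T2", "row1"));
        keys.add(key("c1", "T1", "row1"));
        // Rows from T1 and T2 sharing value b1 sit side by side:
        System.out.println(scanValue(keys, "b1")); // [b1:T1:row1, b1:T2:row1]
    }
}
```

The same range-scan trick maps onto HBase start/stop row keys, which is why sorted row keys give "fast collection of values from contiguous ranges".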
• Objectives
• Design and prototype a file system that:
• Runs in Hadoop
• Supports structured, semi-structured, and unstructured data
• Provides quick access to merged tables
• Does not restrict which columns are used for merging
• Supports MapReduce operations upon merged data

• Implications
• Faster value-based retrieval of data
• Eliminates the need to individually merge tables containing shared features
• Reduces the I/O needed for holistic unstructured MapReduce analysis

* When using non-duplicated value-oriented storage

[Figure 1 diagram: example source tables T1 and T2 under classic HBase row-oriented storage, e.g. T1:row1 = {(A, a1), (B, b1), (C, c1)} and T2:row1 = {(B, b1), (C, c2), (D, d1)}, versus the same data in HVID value-oriented storage, where each value becomes a key such as b1:T1:row1 and b1:T2:row1, so entries sharing value b1 are adjacent.]

• Future Work:
• Complete a working implementation for data lookup
• Compare functionality with and without backup row-oriented records
• Modify BulkLoad to support multi-table output from a single MapReduce job
• Analyze row key structure for region balancing
• Load the implementation onto a cluster
• Benchmark various operations in various cluster configurations
• Create a Hive interface for increased functionality and ease of use
• Create an automatic upload system and interface
• Create a web interface for database access
• Explore support for varying data types within a field via qualifiers
• Explore inverted clustering techniques based on inverted data
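The unified view depicted in Figure 1 can be illustrated in miniature: grouping value-oriented keys by their leading value yields, for each value, the (table, rowID) locations that share it. A hedged Java sketch, again assuming the hypothetical `value:table:rowID` key format rather than the actual HVID API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Group value-oriented keys by their leading value, producing for each
// value the list of table:rowID locations that contain it. This is the
// dynamic, any-field merge: no T1-to-T2 join is precomputed or stored.
public class ValueMergeSketch {
    public static Map<String, List<String>> mergeByValue(TreeSet<String> keys) {
        Map<String, List<String>> merged = new TreeMap<>();
        for (String k : keys) {
            String[] parts = k.split(":", 2); // [value, table:rowID]
            merged.computeIfAbsent(parts[0], v -> new ArrayList<>()).add(parts[1]);
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(List.of(
                "b1:T1:row1", "b1:T2:row1", "b2:T1:row2", "b2:T2:row2"));
        // Value b1 links T1:row1 with T2:row1, as in Figure 1:
        System.out.println(mergeByValue(keys).get("b1")); // [T1:row1, T2:row1]
    }
}
```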
Figure 1: Value-based storage

• Preliminary Testing:
• Collective storage consumption by value-based HVID tables was found to be 86% of that required for a classic HBase table.
• Further compression can be realized when employing numeric data.
• Value-based access is not yet fully implemented for testing.
• Results show little difference between HVID and classic HBase for basic row-oriented retrieval (when using rows as a backup form).
• The system must be tested on a full Hadoop cluster before any speed tests can carry significant meaning.

• Methods
• Equipment
• 64-bit single-node virtual Linux machine
• IBM BigInsights V3.0.0.0
• Station with an 8-thread 1.87 GHz processor and 8 GB RAM
• Process
• Design – The design proceeded on two fronts: abstract and physical. Abstract design focused on the data to be merged, while physical design focused on the selection, implementation, and optimization of features available in Hadoop.
• Implement – The designed data architecture was created along with basic access features. Java was used for system creation and interfacing.
• Test – Basic speed tests were performed on working features. Generic system time properties at the start and completion of operations were used for the tests.

• References:
• [1] A. A. Cárdenas, P. K. Manadhata, and S. P. Rajan, "Big Data Analytics for Security," IEEE Security & Privacy, vol. 11, pp. 74-76, 2013.
• [2] A. K. Sood and R. J. Enbody, "Targeted Cyberattacks: A Superset of Advanced Persistent Threats," IEEE Security & Privacy, vol. 11, p. 7, 2013.
• [3] T.-F. Yen, A. Oprea, K. Onarlioglu, T. Leetham, W. Robertson, A. Juels, et al., "Beehive: Large-scale Log Analysis for Detecting Suspicious Activity in Enterprise Networks," in Proceedings of the 29th Annual Computer Security Applications Conference, New Orleans, Louisiana, 2013.

** Only row IDs were selected. Pulling content remains to be implemented.
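The basic speed tests described under Methods took system time at the start and completion of each operation; a minimal sketch of that measurement pattern (the timed operation here is a stand-in, not an actual HVID load or lookup):

```java
// Wall-clock timing around an operation, mirroring the start/completion
// system-time measurements used in the basic speed tests (sketch only).
public class OpTimer {
    public static long elapsedMillis(Runnable op) {
        long start = System.nanoTime();
        op.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(() -> {
            // Stand-in workload; in testing this would be a load or retrieval.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000_000; i++) sb.append(i % 10);
        });
        System.out.println("elapsed: " + ms + " ms");
    }
}
```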
Figure 2: Preliminary space and time comparisons

• Results:
• Custom database generation and access tools were created for HBase using Java. Full functionality is not yet complete, but load and basic data retrieval have been implemented for the row-oriented segment of HVID.
• Whether row-oriented versions of the data should be kept alongside value-oriented tables remains unknown until a full prototype is built and merge speeds are tested under various configurations.
• While results are inconclusive, selection of rows through value-oriented storage shows promise.

* Tests used a 25 MB .tsv text file as source data.

DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.
