Stream Processing with BigData: SSS-MapReduce

Stream Processing with BigData: SSS-MapReduce Hidemoto Nakada, Hirotaka Ogawa and Tomohiro Kudoh National Institute of Advanced Industrial Science and Technology, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, JAPAN Presenter: 蔡育龍

Outline • Introduction • Implementation • Overview of SSS-MapReduce • Stream Processing in SSS • Sliding Window Management • Preliminary Evaluation • Discussion • Related Work • Conclusion

1. Introduction • Existing stream processing systems mainly target low-latency data processing and work only on relatively small in-memory data sets • Such systems are very effective for a specific class of applications, such as algorithmic trading, but their applicable area is not large.

1. Introduction • We propose SSS, which can process streamed data along with large stored data • SSS is basically a KVS-based MapReduce system, but it can handle streamed data with continuous Mapper and Reducer processes, which are periodically invoked by the system.

2. Implementation • Overview of SSS-MapReduce • Server Configuration :

2. Implementation • Implementation of Distributed KVS: • When an SSS server puts a key-value pair into the distributed KVS, it selects the unitary KVS to write to from the hashed value of the key. • All SSS servers share the same hash function, which guarantees that key-value pairs with the same key go to the same unitary KVS. • SSS writes key-value pairs in bulk. The pairs are sorted by key beforehand to reduce the burden on Tokyo Cabinet. SSS also reads key-value pairs in bulk, specifying the beginning key and the ending key of the range of key-value pairs.
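The key-to-KVS routing and sorted bulk write described above can be sketched as follows. This is a minimal illustration, not the SSS implementation: the function names, the use of MD5 as the shared hash, and the inclusive range bounds are all assumptions. A `hashlib` digest is used because Python's built-in `hash()` for strings is randomized per process, so separate servers would not agree on it.

```python
import hashlib
from collections import defaultdict


def kvs_index(key: str, num_kvs: int) -> int:
    """Map a key to a unitary KVS index (hypothetical helper).

    MD5 is an assumed stand-in for whatever hash function SSS shares
    across servers; any deterministic hash would serve the same purpose.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_kvs


def partition_for_bulk_write(pairs, num_kvs):
    """Group pairs by destination KVS and sort each batch by key."""
    batches = defaultdict(list)
    for key, value in pairs:
        batches[kvs_index(key, num_kvs)].append((key, value))
    # Sorting before the write mirrors the "sorted by key beforehand" step.
    return {idx: sorted(batch, key=lambda kv: kv[0])
            for idx, batch in batches.items()}


def range_read(sorted_pairs, begin_key, end_key):
    """Bulk read of a key range; inclusive bounds are an assumption."""
    return [kv for kv in sorted_pairs if begin_key <= kv[0] <= end_key]
```

Because every server computes the same index for the same key, all values for a key end up in one unitary KVS regardless of which server wrote them.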

2. Implementation • We also implemented a network service layer, called the Data Server, that wraps Tokyo Cabinet so that it can be accessed from remote SSS servers. • The protocol is specially designed to leverage the specific access patterns described above and to enable pipeline processing in the SSS servers. • Tuple Group: • In SSS, the data space is divided into several sub-namespaces called 'Tuple Groups'. Mappers and Reducers read input from tuple group(s) and write their output into tuple group(s). • The Data Server allocates one data file to each Tuple Group. This design allows us to remove a whole file when we want to remove a Tuple Group.

2. Implementation • Stream Processing in SSS: • Stream Input and Output: • In SSS, stream input is represented as continuous writes to a specific tuple group. The tuple group works as an input buffer for the input stream. The processing Mapper / Reducer reads from the tuple group. • Periodic Mapper / Reducer: • We implemented streamed data processing by invoking Mappers and Reducers continuously and periodically. • The Mappers and Reducers read and delete key-value pairs from the specified Tuple Group, to ensure that no key-value pair is processed more than once.

2. Implementation • When a Periodic Mapper or Reducer kicks in on a Tuple Group, the Data Server creates a new database file and redirects successive write operations to the new file, while serving the old file for read operations from the Mapper or Reducer.
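The file switch described above can be illustrated with an in-memory sketch. Here a Python list stands in for the database file, and `TupleGroupBuffer`, `swap_for_processing`, and `run_once` are hypothetical names, not part of SSS. The point is only the pattern: writes are redirected to a fresh buffer at the moment a periodic run starts, and the old buffer is consumed and then dropped so no pair is processed twice.

```python
import threading


class TupleGroupBuffer:
    """In-memory stand-in for one Tuple Group's database file (assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = []  # receives all incoming stream writes

    def write(self, key, value):
        with self._lock:
            self._active.append((key, value))

    def swap_for_processing(self):
        """Redirect later writes to a new buffer and return the old one."""
        with self._lock:
            old, self._active = self._active, []
        # Tuple Group data is served sorted by key.
        return sorted(old, key=lambda kv: kv[0])


def run_once(buffer, process):
    """One periodic invocation: consume the swapped-out batch, then drop it."""
    batch = buffer.swap_for_processing()
    for key, value in batch:
        process(key, value)
    return len(batch)
```

A scheduler would call `run_once` at a fixed period; writers never block on a running Mapper or Reducer because they only touch the new buffer.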

2. Implementation • MergeReducer: • MergeReducers are special Reducers that can handle input from more than one tuple group. • The input to a MergeReducer is one key and two or more value lists, one per tuple group. • In the SSS Server, multiple threads read tuples from each Tuple Group.

2. Implementation • Note that the data in a Tuple Group are always sorted by key. • The SSS Server controls the read threads so that they hand the MergeReducer tuples that have the same key.
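Since each Tuple Group is already sorted by key, aligning tuples with the same key is a streaming merge. The sketch below shows that idea in a single thread using `heapq.merge` and `itertools.groupby`; SSS uses multiple reader threads, so this is a simplification, and `merge_reduce_inputs` is a hypothetical name.

```python
import heapq
from itertools import groupby


def _tag(idx, group):
    """Attach the source Tuple Group index to every (key, value) pair."""
    for key, value in group:
        yield key, idx, value


def merge_reduce_inputs(*tuple_groups):
    """Yield (key, [values_from_group_0, values_from_group_1, ...]).

    Each input must already be sorted by key. A group with no values for
    a key contributes an empty list.
    """
    streams = [_tag(i, g) for i, g in enumerate(tuple_groups)]
    # Order by (key, source index) so values themselves are never compared.
    merged = heapq.merge(*streams, key=lambda t: (t[0], t[1]))
    for key, items in groupby(merged, key=lambda t: t[0]):
        value_lists = [[] for _ in tuple_groups]
        for _, idx, value in items:
            value_lists[idx].append(value)
        yield key, value_lists
```

For example, merging a streamed tuple group with a stored one gives the reducer both value lists for each key in a single call, which is what allows stream data to be joined against stored data.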

2. Implementation • Sliding Window Management

2. Implementation • The PostReducer keeps a small amount of persistent in-memory storage organized as a ring buffer. • For each key, the buffer holds length/subwindowLength entries.
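The per-key ring buffer can be sketched with `collections.deque(maxlen=...)`, which discards the oldest entry automatically once length/subwindowLength entries are held. This is an illustrative assumption about how the buffer is used: the class name, the choice of summation as the aggregate, and filling a missing sub-window with 0 are not from the source.

```python
from collections import defaultdict, deque


class SlidingWindowSums:
    """Per-key ring buffers of sub-window partial results (hypothetical)."""

    def __init__(self, length, subwindow_length):
        if length % subwindow_length != 0:
            raise ValueError("length must be a multiple of subwindow_length")
        self.slots = length // subwindow_length  # ring size per key
        self._rings = defaultdict(lambda: deque(maxlen=self.slots))

    def push_subwindow(self, partials):
        """Record one sub-window's partial sums and return window totals.

        partials maps key -> partial aggregate for the newest sub-window.
        Known keys absent from this sub-window are recorded as 0 so that
        their older entries still age out of the ring.
        """
        for key in set(self._rings) | set(partials):
            self._rings[key].append(partials.get(key, 0))
        return {key: sum(ring) for key, ring in self._rings.items()}
```

With `length=30` and `subwindow_length=10`, each key keeps three partial sums, and every call slides the window forward by one sub-window.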

3. Preliminary Evaluation • We performed a preliminary evaluation to measure the data stream handling throughput of SSS on one node. • The input data was randomly generated to mimic Apache Web Server log records. • The record size was about 300 bytes. We repeatedly put 10,000 records at 10 ms intervals. The input stream was fed directly into the Data Server, bypassing the SSS server.

3. Preliminary Evaluation

4. Discussion • The event (key-value pair) stream is distributed to Mappers and Reducers on several nodes. • There is also a shuffle phase between Map and Reduce, where events are exchanged among the nodes. This means that Reducers receive events out of order.

5. Related Work • Hadoop Online Prototype • HOP (Hadoop Online Prototype) is a Hadoop variant that enables pipeline processing by directly connecting Mappers with Reducers, and even Reducers with the Mappers of the next iteration, using sockets, aiming at quick iteration of MapReduce operations. • Because it is based on Hadoop it can handle BigData, but its continuous queries work only on stream data arriving from outside and cannot handle static data stored in HDFS.

5. Related Work • C-MR (Continuous MapReduce) • C-MR is a stream processing system that targets a single node with multi-core processors. • C-MR adopts MapReduce as the programming interface and supports strict Sliding Window management. • Since it is meant for a single node, it cannot scale out by increasing the number of nodes.

5. Related Work • S4 • S4 is a distributed processing framework for stream data, which was originally implemented by Yahoo and contributed to Apache. • The basic data structure in S4 is called an Event, which is composed of a key and a value. • Operations are performed in modules called PEs (Processing Elements). • The basic concept of S4 is somewhat similar to SSS. The main difference is that SSS can handle off-memory big data while S4 only supports in-memory data.

6. Conclusion • SSS handles stream data with continuous MapReduce, which is periodically invoked by the system and operates on data in the storage. • With continuous MapReduce and the MergeReducer, SSS can perform stream processing based on static BigData stored in the system.
