Pig and IndexedHBase for Social Media Data Queries in Scientific Applications

Using Pig with Scientific Applications and IndexedHBase for Social Media Data Queries
Tak-Lon (Stephen) Wu

Motivation
• Provide Matlab-like UDF libraries in Pig for data scientists, saving the time spent reimplementing the same algorithm (e.g., Kmeans clustering).
• Investigate the benefits of using Pig:
  • Performance, lines of code, reusability, dataflow control
• Integrate Pig with different execution engines to meet different needs, such as iterative applications and streaming queries.
  • We are working on Harp integration to support iterative applications in a single Pig script.
  • Pig with Apache Storm is one of the proposed Pig projects in the open source community.

Pig Kmeans (Single Iteration)

    REGISTER pig-kmeans-udf-yarn.jar;
    raw = LOAD 'hdfs://KmeansInput' USING PigKmeans('$centroids', 50000) AS (datapoints);
    datapointsbag = FOREACH raw GENERATE FLATTEN(datapoints) AS datapointInString:chararray;
    datapoints = FOREACH datapointsbag GENERATE STRSPLIT(datapointInString, ',', 5) AS splitedDP;
    grouped = GROUP datapoints BY splitedDP.$0;
    newCentroids = FOREACH grouped GENERATE CalculateNewCentroids($1);
    STORE newCentroids INTO 'output';
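To make the dataflow of the script above easier to follow, here is a minimal plain-Python sketch of one Kmeans iteration. It is not the code inside the PigKmeans loader or the CalculateNewCentroids UDF, which the slides do not show. It assumes the loader tags each data point with the ID of its nearest centroid (which would explain why the script groups on splitedDP.$0) and that the UDF averages the points in each group. The function names nearest_centroid and kmeans_single_iteration are hypothetical, as is the use of squared Euclidean distance.

```python
from collections import defaultdict


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`.

    Assumption: squared Euclidean distance; the slides do not state the metric.
    """
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_single_iteration(points, centroids):
    """Run one assign-and-update step and return {centroid_index: new_centroid}."""
    # Assumed role of the loader: tag every point with its nearest centroid.
    groups = defaultdict(list)
    for point in points:
        groups[nearest_centroid(point, centroids)].append(point)

    # Counterpart of GROUP ... BY centroid ID followed by the centroid UDF:
    # average the member points of each group, dimension by dimension.
    new_centroids = {}
    for idx, members in groups.items():
        dims = len(members[0])
        new_centroids[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dims)
        )
    return new_centroids


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    centroids = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_single_iteration(points, centroids))
```

A centroid that attracts no points is simply absent from the result in this sketch. A full Kmeans run would repeat this step, feeding the new centroids back in, until they stop moving. The Pig script covers only one such pass.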

Performance

Lines of Code
