Explore the benefits of using Pig and IndexedHBase for social media data queries in scientific applications. Provide Matlab-like UDF libraries in Pig for data scientists to save reimplementation time. Investigate performance, lines of code, reusability, and data flow control. Integrate Pig with different execution engines and work on Harp Integration for iterative applications. Consider Pig with Apache Storm for streaming queries. Implement Pig Kmeans clustering in a single script for improved performance and efficiency.
Using Pig with Scientific Applications and IndexedHBase for Social Media Data Queries
Tak-Lon (Stephen) Wu
Motivation
• Provide Matlab-like UDF libraries in Pig for data scientists, saving the time spent reimplementing the same algorithm, e.g. Kmeans clustering
• Investigate the benefits of using Pig
  • Performance, lines of code, reusability, data flow control
• Integrate Pig with different execution engines to meet different needs, such as iterative applications and streaming queries
  • Harp integration is in progress, to support iterative applications in a single Pig script
  • Pig with Apache Storm is one of the proposed Pig projects in the open source community
Pig Kmeans (Single Iteration)
REGISTER pig-kmeans-udf-yarn.jar;
-- Load the input through the custom PigKmeans loader
raw = LOAD 'hdfs://KmeansInput' USING PigKmeans('$centroids', 50000) AS (datapoints);
-- Flatten each bag into one data point string per record
datapointsbag = FOREACH raw GENERATE FLATTEN(datapoints) AS datapointInString:chararray;
-- Split each string on commas into at most 5 fields
datapoints = FOREACH datapointsbag GENERATE STRSPLIT(datapointInString, ',', 5) AS splitedDP;
-- Group the data points by their first field
grouped = GROUP datapoints BY splitedDP.$0;
-- Compute a new centroid for each group
newCentroids = FOREACH grouped GENERATE CalculateNewCentroids($1);
STORE newCentroids INTO 'output';
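For readers less familiar with Pig, the data flow of the script above can be sketched in plain Python. This is only an illustrative analogue, not the implementation of the PigKmeans loader or the CalculateNewCentroids UDF, whose internals the slides do not show. It assumes that the first field used as the grouping key is the ID of the nearest centroid, and that the new centroid is the per-dimension mean of the group. The function names nearest_centroid and kmeans_single_iteration are hypothetical.

```python
from collections import defaultdict


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_single_iteration(points, centroids):
    """Run one K-means iteration and return the updated list of centroids.

    Assumed mapping to the Pig script: the assignment step stands in for the
    loader tagging each point, the dict plays the role of GROUP ... BY, and
    the mean plays the role of CalculateNewCentroids.
    """
    # Assignment step: bucket every point under its nearest centroid's index.
    groups = defaultdict(list)
    for point in points:
        groups[nearest_centroid(point, centroids)].append(point)

    # Update step: each new centroid is the per-dimension mean of its group.
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_single_iteration(pts, [(1, 1), (9, 9)]))
```

A full K-means run would call this function repeatedly until the centroids stop moving. That outer loop is the part a single Pig script cannot express on its own, which is the gap the Harp integration mentioned under Motivation is meant to address.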