Using Pig with Scientific Applications and IndexedHBase for Social Media Data Queries


Presentation Transcript


  1. Using Pig with Scientific Applications and IndexedHBase for Social Media Data Queries. Tak-Lon (Stephen) Wu

  2. Motivation
     • Provide Matlab-like UDF libraries in Pig for data scientists, saving the time spent reimplementing the same algorithm, e.g. K-means clustering.
     • Investigate the benefits of using Pig: performance, lines of code, reusability, and dataflow control.
     • Integrate Pig with different execution engines to meet different needs, such as iterative applications and streaming queries.
     • Harp integration is in progress to support iterative applications in a single Pig script.
     • Pig with Apache Storm is one of the proposed Pig projects in the open source community.

  3. Pig Kmeans (Single Iteration)
       REGISTER pig-kmeans-udf-yarn.jar;
       raw = LOAD 'hdfs://KmeansInput' USING PigKmeans('$centroids', 50000) AS (datapoints);
       datapointsbag = FOREACH raw GENERATE FLATTEN(datapoints) AS datapointInString:chararray;
       datapoints = FOREACH datapointsbag GENERATE STRSPLIT(datapointInString, ',', 5) AS splitedDP;
       grouped = GROUP datapoints BY splitedDP.$0;
       newCentroids = FOREACH grouped GENERATE CalculateNewCentroids($1);
       STORE newCentroids INTO 'output';
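
The script on slide 3 does not show what the `PigKmeans` loader or the `CalculateNewCentroids` UDF do internally. As a rough illustration only, the Python sketch below assumes that the loader tags each data point with the index of its nearest centroid (the field the script groups on as `splitedDP.$0`) and that the UDF averages the points in each group. The function names `assign_to_nearest`, `calculate_new_centroids`, and `kmeans_iteration` are hypothetical and are not part of the presented library.

```python
from collections import defaultdict
from math import dist


def assign_to_nearest(points, centroids):
    """Pair each point with the index of its closest centroid.

    Assumed to mirror what the PigKmeans loader emits; the slides do not confirm this.
    """
    assigned = []
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        assigned.append((nearest, point))
    return assigned


def calculate_new_centroids(assigned, dims):
    """Group points by centroid index and average each group.

    Assumed to mirror GROUP ... BY splitedDP.$0 followed by CalculateNewCentroids.
    """
    sums = defaultdict(lambda: [0.0] * dims)
    counts = defaultdict(int)
    for index, point in assigned:
        counts[index] += 1
        for d, value in enumerate(point):
            sums[index][d] += value
    return {index: tuple(s / counts[index] for s in sums[index]) for index in sums}


def kmeans_iteration(points, centroids):
    """Run one assignment step and one centroid update step."""
    assigned = assign_to_nearest(points, centroids)
    return calculate_new_centroids(assigned, len(centroids[0]))


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    centroids = [(1, 1), (9, 9)]
    print(kmeans_iteration(points, centroids))
```

A full K-means run would repeat `kmeans_iteration` with the updated centroids until they stop moving, which is the kind of loop the Harp integration mentioned on slide 2 is meant to support inside a single Pig script.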

  4. Performance

  5. Lines of Code
