Student Workshop Readout
Student Workshop Readout: Big Data and Cloud
Read by: Parag Deshmukh
Schedule
Sample Research Areas in Big Data and Cloud
• Warehouse scale computing
  • Big Data Algorithms and Data Structures by Giridhar Nag Yasa
• Resource management at scale
  • Cloud Resource Management using Machine Learning by P C Nagesh
• Issues in multi-tenant environments
  • Security in Cloud by Srinivasan Narayanamurthy
• Reliable computing with unreliable components
  • Reliability in Cloud by Ranjit Kumar
Brainstorming Outcome
Group 1 (Sai Susarla)
Sunil Kumar (IISc), Sandeep Kumar (IISc), Shashank Gupta (IIT Bombay)
• The lemma model used for index building has grown beyond memory size
• Tasks are uneven in their complexities
• Problem: distribute the work for even utilization of the cluster while handling a lemma model that is larger than memory (a sketch of one possible sharding scheme follows below)
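One way to read the distribution problem is to shard the lemma model by key across workers, so each worker only needs its slice in memory, and route work items to the owning shard. A minimal sketch in Python, assuming the model is keyed by lemma strings; NUM_WORKERS, owner() and route_work() are illustrative names, not the group's design.

    import zlib
    from collections import defaultdict

    NUM_WORKERS = 8  # illustrative cluster size

    def owner(lemma: str) -> int:
        # Shard the lemma model by key so each worker holds only ~1/NUM_WORKERS of it in memory.
        # crc32 is used instead of Python's built-in hash() so the mapping is stable across runs.
        return zlib.crc32(lemma.encode("utf-8")) % NUM_WORKERS

    def route_work(docs):
        """Group (lemma, doc_id) work items by owning worker.
        docs: iterable of (doc_id, [lemma, ...]) pairs.
        Skewed lemma frequencies show up as oversized buckets here, which is where an
        evening-out step (splitting hot shards, work stealing) would hook in."""
        per_worker = defaultdict(list)
        for doc_id, lemmas in docs:
            for lemma in lemmas:
                per_worker[owner(lemma)].append((lemma, doc_id))
        return per_worker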
Group 2 (Vipul Mathur)
Vineet P (ATG), Lavanya T (IISc), B. Ramakrishna (IIT Delhi), Nikhil Krishnan (IISc), S. Sree Vivek (IIT Chennai)
• How do we secure inline-deduped uploads? A scheme for making sure a user actually has the data before deduplicating uploads.
• Data redundancy: dedup, replication and erasure coding. Can we find the appropriate level of redundancy to feed dedup vs. replication vs. erasure coding mechanisms?
• Accessing petabytes of data at small block granularity is inefficient. Can we learn the "appropriate" block size for a file using regression and change it dynamically?
Group 3 (Ajay Bakre)
Birenjith Sasidharan (IISc), Manjeet Dahiya (IIT Delhi), Priyanka Kumar (IIT Patna)
• The "Aadhaar" dedup problem
• What data structures can be used to avoid perturbations in the fingerprint store?
• What should the layout of the data store be, and/or how should the dedup algorithm change, so that the dedup algorithm has a deterministic response time irrespective of repository size? (A sketch of one possible layout follows below.)
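For the deterministic-response-time question, one common layout, sketched here purely as an illustration rather than the group's answer, is to partition the fingerprint store by a fixed prefix of the fingerprint, so that every lookup touches exactly one bounded partition that can be pinned to a node or a contiguous on-disk region. NUM_PARTITIONS and the class below are illustrative.

    NUM_PARTITIONS = 4096  # fixed fan-out; illustrative value

    class PartitionedFingerprintStore:
        """Route each fingerprint to one of a fixed number of partitions.
        A lookup reads exactly one partition, so the work per query is bounded by the
        partition size (which capacity planning keeps roughly constant) rather than by
        the total repository size."""

        def __init__(self):
            self.partitions = [dict() for _ in range(NUM_PARTITIONS)]

        def _partition(self, fingerprint: bytes) -> dict:
            idx = int.from_bytes(fingerprint[:4], "big") % NUM_PARTITIONS
            return self.partitions[idx]

        def contains(self, fingerprint: bytes) -> bool:
            return fingerprint in self._partition(fingerprint)

        def add(self, fingerprint: bytes, location) -> None:
            self._partition(fingerprint)[fingerprint] = location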
Group 4 (Ameya Usgaonkar)
N. Prakash (IISc), V. Lalitha (IISc), Priyanka Singla (IISc)
• De-duplication and RAIDing: both operate at the level of 4K blocks, so is there any advantage to designing them jointly?
Table arrangement for breakout session
Workshop Readout (Three Ideas): Table 2
Students: Nikhil, Vivek, Lavanya, Ramakrishna
NetApp: Vineet, Vipul
A: Secure Deduped Uploads
• How do we secure inline-deduped uploads? We need a scheme for making sure a user actually has the data before deduplicating uploads.
• Insecure scheme: the user sends H1(D) to the server, which matches and dedups. If a malicious person gets hold of H1(D), they can ask for D.
• Secure scheme: the server generates a nonce r, the user sends H2(H1(D), r), and the server matches and dedups. H1(D) is never sent over the network. (A sketch of this exchange follows below.)
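A minimal sketch of the exchange described above, assuming SHA-256 for H1 and HMAC-SHA-256 for H2; the class and function names are illustrative. The server accepts an upload as a duplicate only if the client can answer the nonce challenge with H2(H1(D), r), so H1(D) itself never crosses the network.

    import hashlib, hmac, os

    def h1(data: bytes) -> bytes:
        # H1(D): content hash the client computes locally; per the slide it is never transmitted
        return hashlib.sha256(data).digest()

    def h2(fingerprint: bytes, nonce: bytes) -> bytes:
        # H2(H1(D), r): response keyed by the server's nonce
        return hmac.new(nonce, fingerprint, hashlib.sha256).digest()

    class DedupServer:
        def __init__(self, objects):
            # objects: blobs the server already stores, indexed by their H1 fingerprint
            self.objects = {h1(d): d for d in objects}

        def challenge(self) -> bytes:
            return os.urandom(16)  # fresh nonce r for every upload attempt

        def try_dedup(self, nonce: bytes, response: bytes):
            # Accept the upload as a duplicate only if the response matches one recomputed
            # from data the server actually holds (a linear scan is fine for a sketch).
            for fp in self.objects:
                if hmac.compare_digest(h2(fp, nonce), response):
                    return fp      # dedup hit: no upload needed
            return None            # unknown data: ask the client to upload D

    # Client side: prove possession without sending D or H1(D).
    def client_response(data: bytes, nonce: bytes) -> bytes:
        return h2(h1(data), nonce)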
B: Data Redundancy: Dedup, Replication and Erasure Coding
• Considerations:
  • Dedup removes redundancy in data; replication for performance adds redundancy
  • Replication for reliability vs. erasure coding
  • Can we find the appropriate level of redundancy to feed dedup vs. replication vs. erasure coding mechanisms?
• Ideas:
  • Learn or specify an activity level for m users
  • No dedup, and possibly replication, for active data
  • Heavy dedup for cold data
  • Erasure coding for reliability is not needed if the performance replicas also provide reliability, or if non-deduped copies exist
• Summary: derive and use a function f(m) to select the appropriate redundancy level, taking dedup, replication and erasure coding into account (a sketch follows below).
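A minimal sketch of what such an f could look like, assuming the learned activity level is normalized to [0, 1]; the thresholds and the Plan fields are illustrative, since the slide leaves f(m) unspecified.

    from dataclasses import dataclass

    HOT_ACTIVITY = 0.7   # illustrative thresholds only
    COLD_ACTIVITY = 0.1

    @dataclass
    class Plan:
        dedup: bool
        replicas: int        # extra copies kept for performance (and, implicitly, reliability)
        erasure_code: bool

    def f(activity: float) -> Plan:
        """Map a learned activity level (0..1) for a dataset to a redundancy plan."""
        if activity >= HOT_ACTIVITY:
            # Hot data: no dedup, replicate for performance; the replicas already provide
            # reliability, so erasure coding is skipped.
            return Plan(dedup=False, replicas=2, erasure_code=False)
        if activity <= COLD_ACTIVITY:
            # Cold data: dedup heavily and erasure-code the single remaining copy.
            return Plan(dedup=True, replicas=0, erasure_code=True)
        # Warm data: dedup, keep one replica, skip erasure coding while a replica exists.
        return Plan(dedup=True, replicas=1, erasure_code=False)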
C: Variable Block Sizes
• Accessing petabytes of data at small block granularity is inefficient.
• Can we learn the "appropriate" block size for a file using regression based on:
  • Access patterns: sequential vs. random
  • File sizes
  • Duplication factor
• Track changes in patterns over time and vary the block size to adapt
• Reliability methods affected: block checksums
• Considerations:
  • Can a single file have variable block sizes?
  • Is it possible to change block sizes over time?
  • Use multiples of a single base block size
  • Start with a prediction based on the user's profile
  • Hot vs. cold data should have different block sizes
• A sketch of the regression idea follows below.
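A minimal sketch of the regression idea, assuming three per-file features (sequential-access fraction, log2 of file size, duplication factor) and a 4 KB base block size; the feature set, model, and function names are illustrative. Predictions are quantized to multiples of the base block size, and the model can simply be refit periodically as access patterns change.

    import numpy as np

    BASE_BLOCK = 4096  # base block size; predicted sizes are multiples of this

    def fit_block_size_model(features, best_sizes):
        """Ordinary least squares: features is an (n, 3) array of
        [sequential_fraction, log2(file_size), duplication_factor] rows, and
        best_sizes holds the block size that worked best for each file."""
        X = np.column_stack([np.asarray(features, dtype=float),
                             np.ones(len(features))])            # add a bias column
        coef, *_ = np.linalg.lstsq(X, np.asarray(best_sizes, dtype=float), rcond=None)
        return coef

    def predict_block_size(coef, sequential_fraction, file_size, dup_factor):
        x = np.array([sequential_fraction, np.log2(file_size), dup_factor, 1.0])
        raw = float(x @ coef)
        # Quantize to a multiple of the base block size, never below one block.
        return max(BASE_BLOCK, int(round(raw / BASE_BLOCK)) * BASE_BLOCK)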