Explore how to introduce scalability into smart grid projects by managing heterogeneous data, training prediction models, and efficiently updating training models. Learn about data organization, prediction models, clustering techniques, and parallelization challenges and solutions.
Introducing Scalability into Smart Grid • Presented by Vasileios Zois, CS at USC • 09/20/2013
Smart Grid Project Services • Manage Data • Sparse Data • Heterogeneous Data • Semantic Representation • Train Prediction Models • Data Intensive Application • On Demand Procedure • Make Prediction & Update Models • Fast Access to Trained Models • Update with new values
Steps to Scalability • Management of Data • Choose Underlying Technology • Evaluate provided services • Training of Models • Design Training Tools • Take Advantage of Infrastructure • Give Efficient Solutions to Training • Access & Update Training Models • Update: Change Invariants that Affect Prediction • Do it Efficiently
Managing Data • Requirements • Efficient Usage of Storage • Client Access to Data • Semantic Organization of Data • Possible Solutions • Distributed File System (HDFS) • Raw Data • Work out a Structure (XML, Ontology Schemas) • Column Oriented NoSQL Systems (HBase, Cassandra) • Structure offered – Column Families • Implemented Operations • Still Needs Reasoning Operations
Prediction Models • Regression Tree • Support Features • Tree Building • Scalable Implementation OpenPlanet • ARIMA Model • Short Term Prediction • Does Not Support Features? • On Demand Training • Small Prediction Window
Scalable Prediction • Brute Force • Efficient use of resources • Build a system from scratch • Decrease Problem Size • Group Data and Pick Representatives • Clustering of Data with Similar Features • Introduce Features into ARIMA model • Use features to cluster Data • Execute Model on Clustered Data • Customer SuperCustomer
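The "group data and pick representatives" idea can be sketched as follows. This is a minimal illustration, not the project's implementation: exact matching on a feature tuple stands in for a real clustering step, and the function name, the feature values, and the equal-length load series are all assumptions made for the example. Each group's averaged series plays the role of the SuperCustomer on which a model would then be trained once.

```python
from collections import defaultdict


def build_super_customers(customers):
    """Average the load series of customers that share a feature tuple.

    customers: list of (features, series) pairs, where features is a
    hashable tuple and every series has the same length (assumed).
    Returns a dict mapping each feature tuple to its averaged series.
    """
    groups = defaultdict(list)
    for features, series in customers:
        groups[features].append(series)
    # zip(*members) walks the series time step by time step.
    return {
        features: [sum(values) / len(values) for values in zip(*members)]
        for features, members in groups.items()
    }
```

A prediction model then runs once per group instead of once per customer, which is where the reduction in problem size comes from.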
Parallel Clustering • Problem • Computationally Expensive • High Dimensional • Inevitable Parallelization • Challenges to Parallelization • Partitioning of Data to achieve Load Balance • Reduction of the Communication Cost • Approaches • Hierarchical Clustering: PBirch • Evolutionary Strategies Clustering • Density Based Clustering: PDBSCAN • Model Based Clustering: AutoClass System
Parallel Hierarchical Clustering • PBirch • Single Program Multiple Data (SPMD) • Message Passing Interface (MPI) • Steps • Distribute Data Equally • Build Tree on Each Processor • Execute Clustering on Leaf nodes - Parallel K-means • Results • Linear Speedup • Increased Communication Latency • http://www.cs.gsu.edu/~wkim/index_files/papers/pbirch.pdf
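The three steps above can be simulated sequentially in a few lines. This is a rough sketch under stated assumptions, not PBirch itself: the workers are plain list slices rather than MPI processes, a flat list of clustering-feature entries (count and linear sum) stands in for the CF-tree, and the final k-means runs in one place over the gathered entries instead of in parallel. All function names and the threshold parameter are hypothetical.

```python
import math


def split_equally(points, n_workers):
    """Step 1: give each worker an (almost) equal share of the data."""
    return [points[w::n_workers] for w in range(n_workers)]


def build_cf_entries(points, threshold):
    """Step 2 (simplified): summarize a worker's points as [count, linear_sum]
    entries; a point joins the first entry whose centroid is within threshold."""
    entries = []
    for p in points:
        for entry in entries:
            centroid = tuple(s / entry[0] for s in entry[1])
            if math.dist(p, centroid) <= threshold:
                entry[0] += 1
                entry[1] = tuple(s + x for s, x in zip(entry[1], p))
                break
        else:
            entries.append([1, tuple(p)])
    return entries


def farthest_first(candidates, k):
    """Deterministic seeding: repeatedly pick the candidate farthest from the seeds."""
    centers = [candidates[0]]
    while len(centers) < k:
        centers.append(max(candidates,
                           key=lambda c: min(math.dist(c, s) for s in centers)))
    return centers


def weighted_kmeans(entries, k, iterations=10):
    """Step 3: k-means over entry centroids, each weighted by its point count."""
    centroids = [tuple(s / n for s in ls) for n, ls in entries]
    weights = [n for n, _ in entries]
    dim = len(centroids[0])
    centers = farthest_first(centroids, k)
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0] * k
        for c, w in zip(centroids, weights):
            j = min(range(k), key=lambda idx: math.dist(c, centers[idx]))
            totals[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], c)]
        centers = [tuple(s / totals[j] for s in sums[j]) if totals[j] else centers[j]
                   for j in range(k)]
    return centers


def pbirch_sketch(points, n_workers, threshold, k):
    chunks = split_equally(points, n_workers)
    entries = [e for chunk in chunks for e in build_cf_entries(chunk, threshold)]
    return weighted_kmeans(entries, k)
```

Only the compact entries cross worker boundaries, not the raw points, which is why the communication cost stays small relative to the data size.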
Clustering with Evolutionary Strategies • Model • Stochastic Optimization • Biological Evolution Concepts • Recombination, Mutation • Motive: Huge Range of Possible Solutions • Parallelization Techniques • Master – Slave Model • Master in charge of parent solutions • Slave in charge of recombination and mutation • Fits into MapReduce model • Proposed Solution • http://www.cs.gsu.edu/~wkim/index_files/papers/clusteringwithes.pdf
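A minimal sketch of the master/slave split, assuming a (mu + lambda) evolution strategy in which each candidate solution is a set of k cluster centers and fitness is the sum of squared distances to the nearest center. These choices, along with every name and default parameter below, are assumptions for illustration and are not taken from the cited paper. The `make_offspring` call is the unit of work a slave (or a map task) would perform; the selection step is what the master keeps.

```python
import math
import random


def sse(centers, points):
    """Fitness: sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def make_offspring(parent_a, parent_b, sigma, rng):
    """Slave task: discrete recombination of two parents, then Gaussian mutation."""
    child = [rng.choice(pair) for pair in zip(parent_a, parent_b)]
    return [tuple(x + rng.gauss(0, sigma) for x in center) for center in child]


def evolve(points, k, mu=4, lam=12, generations=30, sigma=0.5, seed=0):
    """Master loop: keep the mu best solutions among parents and offspring."""
    rng = random.Random(seed)
    parents = [rng.sample(points, k) for _ in range(mu)]
    history = []
    for _ in range(generations):
        # Each offspring is independent, so this list could be built in parallel.
        offspring = [make_offspring(*rng.sample(parents, 2), sigma, rng)
                     for _ in range(lam)]
        parents = sorted(parents + offspring, key=lambda c: sse(c, points))[:mu]
        history.append(sse(parents[0], points))
    return parents[0], history
```

Because parents compete with their offspring, the best fitness never gets worse from one generation to the next.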
Parallel Density Based Clustering • PDBSCAN • Based on original DBSCAN Algorithm • Shared Nothing Architecture • Execution • Divide Input into Several Partitions • Concurrently Cluster Data Locally with DBSCAN • Merge Local Clusters into Global Clusters • dR*-Tree Introduced • Decreased Communication Cost – Efficient Access of Data • Distributed Data Pages • Replicated Indices on all Machines • Results • Near-Linear Speedup with the Number of Machines • http://www.cs.gsu.edu/~wkim/index_files/papers/fastParallel_XU.pdf
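The partition, cluster locally, then merge flow can be sketched as below. This is a simplified stand-in, not PDBSCAN: partitions are slabs along the first coordinate given by hypothetical cut points, each slab also sees a margin of 2 * eps so that points near a border keep complete neighborhoods, a brute-force neighbor search replaces the dR*-tree, and the partitions are processed in a loop rather than on separate machines. Local clusters are merged when they share a point that is a core point in at least one view.

```python
import math


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns (labels, core_flags); label -1 means noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]  # a point counts itself
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            for j in neighbors[stack.pop()]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:  # only core points extend the cluster
                        stack.append(j)
        cluster += 1
    return labels, core


def partitioned_dbscan(points, eps, min_pts, cuts):
    """Cluster each slab (plus margin) locally, then merge via union-find."""
    edges = [-math.inf] + sorted(cuts) + [math.inf]
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    seen = {}  # point index -> [((partition, local_label), is_core), ...]
    for part in range(len(edges) - 1):
        lo, hi = edges[part] - 2 * eps, edges[part + 1] + 2 * eps
        idx = [i for i, p in enumerate(points) if lo <= p[0] < hi]
        labels, core = dbscan([points[i] for i in idx], eps, min_pts)
        for i, label, is_core in zip(idx, labels, core):
            if label != -1:
                seen.setdefault(i, []).append(((part, label), is_core))

    for hits in seen.values():
        if any(is_core for _, is_core in hits):
            root = find(hits[0][0])
            for key, _ in hits[1:]:
                parent[find(key)] = root

    ids, result = {}, [-1] * len(points)
    for i, hits in seen.items():
        result[i] = ids.setdefault(find(hits[0][0]), len(ids))
    return result
```

The margin trades some duplicated work for a merge step that needs only cluster labels of the shared points, not the points' full neighborhoods.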
Parallel Model Based Clustering • AutoClass System • Bayesian Classification • Probability of an Instance belonging to a class • Approach • SIMD (Single Instruction Multiple Data) • Divide Input among Processors • Update Parameters for Classification Locally • No Need for Load Balancing • Results • Good Scaling • After a certain threshold the communication starts to hinder the performance
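The "update parameters locally" step can be illustrated with one expectation-maximization step of a one-dimensional Gaussian mixture. This is only an analogy under stated assumptions: AutoClass uses its own Bayesian class models and search, and the functions and parameter names here are hypothetical. What the sketch shows is the communication pattern: every processor computes class membership probabilities and a few sums for its own chunk, and only those sums are combined globally.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def local_statistics(chunk, weights, means, variances):
    """Per-processor work: membership probabilities and their weighted sums."""
    k = len(weights)
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]  # sum r, sum r*x, sum r*x^2
    for x in chunk:
        joint = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        for j in range(k):
            r = joint[j] / total  # probability that x belongs to class j
            stats[j][0] += r
            stats[j][1] += r * x
            stats[j][2] += r * x * x
    return stats


def em_step(chunks, weights, means, variances):
    """Combine the per-chunk sums (the only global exchange) and re-estimate."""
    k = len(weights)
    partials = [local_statistics(c, weights, means, variances) for c in chunks]
    totals = [[sum(p[j][m] for p in partials) for m in range(3)] for j in range(k)]
    n = sum(t[0] for t in totals)
    new_weights = [t[0] / n for t in totals]
    new_means = [t[1] / t[0] for t in totals]
    new_vars = [max(t[2] / t[0] - (t[1] / t[0]) ** 2, 1e-6) for t in totals]
    return new_weights, new_means, new_vars
```

Each chunk costs the same amount of work per instance, which matches the slide's point that equal-sized chunks need no further load balancing; the exchanged sums grow with the number of classes, not with the data size.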
Clustering By Sorting Potential Values • Main Idea • Potential Model • Derived from Gravitational Force Model in Euclidean Space • Parameters: • Gravitational Constant • Bandwidth Distance B (Max Distance from center of cluster) • δ threshold distance (avoid singularity problem) • Execution • Calculate Potential at each Point • Sort Points According to the Calculated Potential • Choose Cluster Centers by iteration over sorted array • If distance between two points in array > B create new cluster • Results • Near optimal Solution • http://www.sciencedirect.com/science/article/pii/S0031320312001136
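The execution steps above can be sketched as follows. The exact potential function and assignment rule of the cited paper are not reproduced here; this sketch assumes a potential of -G / max(r, δ) summed over all other points, and a simplified rule in which a point farther than B from every existing center starts a new cluster. Function names and default values are hypothetical.

```python
import math


def potentials(points, G=1.0, delta=0.1):
    """Potential at each point; distances below delta are clamped to delta
    so that coincident points do not cause a division by zero."""
    result = []
    for i, p in enumerate(points):
        total = 0.0
        for j, q in enumerate(points):
            if i != j:
                total -= G / max(math.dist(p, q), delta)
        result.append(total)
    return result


def cluster_by_potential(points, B, G=1.0, delta=0.1):
    """Visit points from lowest potential (densest region) upward.

    Returns (centers, labels): centers holds point indices, labels[i] is the
    position in centers of the cluster that point i was assigned to.
    """
    phi = potentials(points, G, delta)
    order = sorted(range(len(points)), key=lambda i: phi[i])
    centers = []
    labels = [None] * len(points)
    for i in order:
        nearest, best = None, math.inf
        for c_pos, c_idx in enumerate(centers):
            d = math.dist(points[i], points[c_idx])
            if d < best:
                nearest, best = c_pos, d
        if nearest is None or best > B:
            centers.append(i)  # too far from every center: new cluster
            labels[i] = len(centers) - 1
        else:
            labels[i] = nearest
    return centers, labels
```

The potential computation is quadratic in the number of points and each point's sum is independent of the others, so that step is the natural candidate for parallelization.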
Thank you for your attention! Vasilis Zois vzois@usc.edu