
Adaptability

Presentation Transcript


  1. Adaptability • Data vary • Resources vary • Applications: adaptation is needed to improve performance

  2. Mini-Symposium: Adaptive Algorithms for Scientific Computing • 9h45 Adaptive algorithms - Theory and applications Collective work - AHA Team, Jean-Louis Roch, INRIA-CNRS Grenoble, France • 10h15 Hybrids in exact linear algebra Dave Saunders, U. Delaware, USA • 10h45 Adaptive programming with hierarchical multiprocessor tasks Thomas Rauber, U. Bayreuth, Germany • 11h15 Cache-Oblivious algorithms Michael Bender, Stony Brook U., USA

  3. Why adaptive algorithms? Resource availability is versatile. Data vary. Goal of AHA: an integrated view of adaptation. Algorithmic approach: self-adaptive combination of algorithms with a global behavior that is theoretically justified. Measurements on resources. Measurements on data. Adaptations: • Choice of algorithm: sequential / parallel, approximate / exact, in-memory / out-of-core • Scheduling: planning (computation volume / heterogeneity), redistribution (load balancing) • Calibration: pre-tuning (block / cache sizes, instruction choice), priority management

  4. Parallel algorithms with adaptive grain: the prefix example. Jean-Louis.Roch@imag.fr - MOAIS project (www-id.imag.fr/MOAIS), ID-IMAG Laboratory (CNRS-INRIA, INPG-UJF)

  5. How to adapt the application? • By minimizing communications, e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity • By controlling latency (interactivity constraints): FlowVR [Allard, Menier, Raffin]: overhead • By managing node failures and resilience [checkpoint/restart] [checkers]: FlowCert [Jafar, Krings, Leprevost; Roch, Varrette] • By adapting granularity: malleable tasks [Trystram, Mounié]; dataflow cactus-stack: Athapascan/Kaapi [Gautier]; recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, ...] [Bender-Rabin 2002] • Self-adaptive grain algorithms: dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2005] [Roch, Traore, Bernard - …]

  6. Parallel algorithms with adaptive grain: some examples • Scheduling fine-grain parallel programs: work-stealing • Adaptive-grain algorithms: principle of a dynamic « cascade » - example of the iterated product • Sequential-parallel coupling: the prefix example

  7. (diagram: a dataflow graph of tasks F(2,a), G(a,b), H(b), H(a), O(b,7).) High potential degree of parallelism. In « practice »: coarse granularity, splitting into p = #resources. Drawback: heterogeneous, dynamic architectures, where Π_i(t) is the speed of processor i at time t. In « theory »: fine granularity, maximal parallelism. Drawback: overhead of task management. How to choose/adapt the granularity?

  8. Greedy scheduling. « Depth »: parallel time on an unbounded number of resources, W_∞ = #ops on a critical path. « Work »: sequential time, W_1 = #operations. Homogeneous case [Graham 69], greedy scheduling: no task is ready while a processor is idle ⇒ T_p ≤ W_1/p + (1 − 1/p)·W_∞ < W_1/p + W_∞. Heterogeneous case [Jaffe 80], maximum-utilization schedule: if i < p tasks are ready, assign them to the i fastest processors. High-utilization schedule [Bender 02], with parameter B: if i < p tasks are ready, the fastest idle processor is at most B times faster than the slowest busy one ⇒ T_p ≤ W_1/(p·Π_ave) + B·W_∞/Π_ave.
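For concreteness, a worked instance of Graham's bound (the numbers here are illustrative, not from the slides):

\[
p = 4,\; W_1 = 10^6,\; W_\infty = 10^3:\qquad
T_4 \;\le\; \frac{10^6}{4} + \Bigl(1 - \frac{1}{4}\Bigr)\cdot 10^3 \;=\; 250\,750,
\]

within 0.3% of the ideal W_1/p = 250 000: the bound is near-tight as soon as the critical path is negligible against the work.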

  9. Work stealing • Distributed randomized implementation of greedy scheduling • Each processor manages locally the tasks it creates • When idle, a processor steals the oldest ready task of a remote - non-idle - processor (randomly chosen) • Implementation: local stack = deque [Cilk, Kaapi]; local parallelism is implemented by sequential function calls; local sequential execution must be correct ⇒ restrictions: series-parallel [Cilk], reference order [Kaapi] • On heterogeneous processors, a slight modification: when a processor steals from a B-times slower busy processor, it preempts its task • Interest: with good probability, #successful steals < p·W_∞, hence few task migrations [Blumofe 98, Narlikar 01, Bender 02, Revire-Roch 03, ...]; suited to heterogeneous architectures [Bender-Rabin 02]; T_p ≤ W_1/(p·Π_ave) + O(W_∞/Π_ave) with good probability ⇒ how to get W_∞ small while W_1 = #ops of the sequential algorithm?
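A minimal C++ sketch of this discipline (illustrative only - not the Cilk or Kaapi implementation): each worker pops the newest task of its own deque, and an idle worker steals the oldest ready task of a random victim; the busy-wait and per-deque locks are simplifications.

    // Minimal work-stealing sketch: local LIFO pop (newest task, good
    // locality), remote FIFO steal (oldest ready task, biggest work).
    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    constexpr int P = 4;                         // number of workers

    struct Worker { std::deque<std::function<void()>> deq; std::mutex m; };
    std::vector<Worker> workers(P);
    std::atomic<long> pending{0};                // tasks created, not yet run
    thread_local int self = 0;                   // id of the executing worker

    void spawn(std::function<void()> t) {        // push on the local deque
        ++pending;
        std::lock_guard<std::mutex> l(workers[self].m);
        workers[self].deq.push_back(std::move(t));
    }

    bool take(int w, bool newest, std::function<void()>& t) {
        std::lock_guard<std::mutex> l(workers[w].m);
        if (workers[w].deq.empty()) return false;
        if (newest) { t = std::move(workers[w].deq.back());  workers[w].deq.pop_back(); }
        else        { t = std::move(workers[w].deq.front()); workers[w].deq.pop_front(); }
        return true;
    }

    void worker_loop(int id) {
        self = id;
        std::mt19937 rng(id);
        std::uniform_int_distribution<int> victim(0, P - 1);
        std::function<void()> t;
        while (pending.load() > 0)               // spin until all work is done
            if (take(self, true, t)              // local: newest task (LIFO)
                || take(victim(rng), false, t))  // steal: oldest task (FIFO)
                { t(); --pending; }
    }

    void tree(int depth) {                       // toy recursive computation
        if (depth == 0) return;
        spawn([=] { tree(depth - 1); });         // fork one child...
        tree(depth - 1);                         // ...continue the other locally
    }

    int main() {
        spawn([] { tree(12); });
        std::vector<std::thread> ts;
        for (int i = 0; i < P; ++i) ts.emplace_back(worker_loop, i);
        for (auto& t : ts) t.join();
    }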

  10. Best case: the parallel algorithm is efficient. W_∞ is small and W_1 = W_seq: the parallel algorithm is also an optimal sequential one. Examples: parallel divide-and-conquer algorithms. Implementation: work-first principle - no overhead when tasks execute locally. Examples: Cilk: THE protocol; Kaapi: compare&swap only.

  11. Experimentation: knary benchmark. Distributed architecture: iCluster / Athapascan. SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan. Ts = 2397 s, T1 = 2435 s.

  12. But usually, when W_∞ is small, W_1 >> W_seq. • Solution: mix the sequential and the parallel algorithms. • Basic technique: run the parallel algorithm down to a certain « grain », then switch to the sequential one. Problem: T_∞ increases too, as do the number of migrations… and the inefficiency ;o( • Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja 92]: a careful interplay of both algorithms builds one with both T_∞ small and T_1 = O(T_s): divide the sequential algorithm into blocks, and compute each block with the (non-optimal) parallel algorithm. Drawback: sequential at coarse grain and parallel at fine grain ;o( • Adaptive grain, the dual approach: parallelism is extracted from any sequential task.

  13. How to obtain an efficient fine-grain algorithm? • Hypotheses for efficiency of work-stealing: the parallel algorithm is « work-optimal »; T_∞ is very small (recursive parallelism) • Problem: fine-grain (T_∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm: overhead due to parallelism creation and synchronization, but also arithmetic overhead.

  14. Self-adaptive grain algorithms • Recursive computations with local sequential computation • Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm • Hypothesis, two algorithms: 1 sequential: SeqCompute; 1 parallel: LastPartComputation ⇒ at any time, parallelism can be extracted from the remaining computations of the sequential algorithm • Examples: iterated product [Vernizzi]; gzip / compression [Kerfali]; MPEG-4 / H264 [Bernard ….]; prefix computation [Traore]

  15. Self-adaptive grain algorithm. (diagram: a SeqCompute from which Extract_par splits off a LastPartComputation, itself a coupling of SeqCompute and a further LastPartComputation.) Principle: save the parallelism overhead by favoring a sequential algorithm ⇒ use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation. Hypothesis, two algorithms: 1 sequential: SeqCompute; 1 parallel: LastPartComputation ⇒ at any time, parallelism can be extracted from the remaining computations of the sequential algorithm • Examples: iterated product [Vernizzi]; gzip / compression [Kerfali]; MPEG-4 / H264 [Bernard ….]; prefix computation [Traore]

  16. Indeed parallelism often costs… E.g. prefix computation: P_1 = a_0*a_1, P_2 = a_0*a_1*a_2, …, P_n = a_0*a_1*…*a_n. (diagram: pairwise products a_0*a_1, a_2*a_3, … feed a recursive Prefix(n/2); the remaining prefixes are completed with one extra product each.) • Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i]; hence T_1 = n • Parallel algorithm: T_∞ = 2·log n, but T_1 = 2n.
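A C++ sketch of where the factor 2 comes from (a blocked two-pass scheme; the block splitting below is illustrative): every element is multiplied once when its block is summed and once when its block is re-scanned, hence ~2n products, while each pass is independent over blocks and thus parallelizable.

    #include <algorithm>
    #include <vector>

    // Two-pass blocked prefix product: ~2n multiplications in total,
    // but both passes are independent over the k blocks.
    void blocked_prefix(const double* a, double* P, int n, int k) {
        int bs = (n + k - 1) / k;                      // block size
        std::vector<double> blocksum(k, 1.0), offset(k, 1.0);
        for (int b = 0; b < k; ++b)                    // pass 1: per-block product
            for (int i = b * bs; i < std::min(n, (b + 1) * bs); ++i)
                blocksum[b] *= a[i];
        for (int b = 1; b < k; ++b)                    // small prefix over k blocks
            offset[b] = offset[b - 1] * blocksum[b - 1];
        for (int b = 0; b < k; ++b) {                  // pass 2: re-scan each block
            double acc = offset[b];
            for (int i = b * bs; i < std::min(n, (b + 1) * bs); ++i)
                P[i] = (acc *= a[i]);
        }
    }

With k = p blocks this is the 2n-operation parallel scheme the slide compares against the n-operation sequential loop.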

  17. Adaptive prefix computation. Any (parallel) algorithm of depth T_∞ = d performs at least 2n − d operations. Lower bound on p identical processors: 2n/(p+1). Block algorithm + pipeline: [Nicolau 2000]. Adaptive scheme: one process performs the sequential computation, while p−1 processes perform a parallel « segmented » prefix computation: T_p ≤ 2n/((p+1)·Π_ave) + O(log n/Π_ave).
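The 2n/(p+1) lower bound follows in one line from the two facts just stated, using d ≤ T on any p-processor schedule:

\[
T \;\ge\; \frac{\#\text{ops}}{p} \;\ge\; \frac{2n - d}{p} \;\ge\; \frac{2n - T}{p}
\quad\Longrightarrow\quad (p+1)\,T \;\ge\; 2n
\quad\Longrightarrow\quad T \;\ge\; \frac{2n}{p+1}.
\]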

  18. Adaptive prefix versus optimal, on identical processors (experimental plot).

  19. Adaptive prefix with variable speeds • Lower bound: decreasing the parallel time makes the number of operations increase, beyond 2n·(1 − 1/p) • Adaptive grain algorithm with provable performance: dynamic cascading of two algorithms (sequential / parallel) [TSI 2005] • Theorem: T = 2n/(p*+1) + O(log n), i.e. near-optimal on processors of average speed p* [2006]. (plots: external load; parallel vs. adaptive runs.) • Single-user context: adaptive is equivalent to sequential on 1 proc, to the optimal 2-proc parallel algorithm on 2 processors, …, to the optimal 8-proc parallel algorithm on 8 processors • Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm.

  20. The race: sequential / fixed parallel / adaptive prefix. (plot legend: Adaptive 8 proc.; Parallel 8, 7, 6, 5, 4, 3, 2 proc.; Sequential.)

  21. Conclusion. An adaptive algorithm with provable performance, also confirmed by first experiments. To experiment: on SMP at fine grain [floating-point prefix sum] (memory, pinning the work-stealers on CPUs); on distributed heterogeneous architectures. The scheme (and its complexity analysis) appears general: the technique remains to be applied to other problems [AHA].

  22. Annex

  23. Implementation of work-stealing. (diagram: on processor P, f1() { …. fork f2; … } pushes the forked f2 on P's stack, and an idle processor P’ steals it.) Hypothesis: a sequential schedule is valid + non-preemptive execution of ready tasks • Benefit: « statically » fine grain, but dynamic control • Drawback: possible overhead of the parallel algorithm [e.g. prefix computations].

  24. Generic self-adaptive grain algorithm
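A minimal C++ skeleton of this generic scheme (the names WorkDesc, extract_seq and extract_par are illustrative, not the Kaapi API; the thief below is created eagerly for simplicity, whereas in the real scheme extraction happens only when a processor becomes idle):

    #include <functional>
    #include <mutex>
    #include <thread>

    // Shared descriptor of the remaining work [first, last).
    struct WorkDesc {
        std::mutex m;
        int first, last;
        WorkDesc(int f, int l) : first(f), last(l) {}
        bool extract_seq(int& i) {           // owner: one unit at the front
            std::lock_guard<std::mutex> g(m);
            if (first >= last) return false;
            i = first++;
            return true;
        }
        bool extract_par(int& f, int& l) {   // thief: last half of what remains
            std::lock_guard<std::mutex> g(m);
            if (last - first < 2) return false;
            int mid = (first + last) / 2;
            f = mid; l = last; last = mid;
            return true;
        }
    };

    // SeqCompute coupled with a (recursive) LastPartComputation.
    void adaptive(WorkDesc& w, const std::function<void(int)>& f) {
        std::thread thief([&] {              // plays an idle processor
            int lo, hi;
            if (w.extract_par(lo, hi)) {     // LastPartComputation
                WorkDesc w2(lo, hi);
                adaptive(w2, f);
            }
        });
        int i;
        while (w.extract_seq(i)) f(i);       // SeqCompute: plain sequential loop
        thief.join();
    }

    int main() {
        WorkDesc w(1, 101);                  // f(i), i = 1..100, as in the slides
        adaptive(w, [](int) { /* one unit of work */ });
    }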

  25. Illustration: f(i), i = 1..100. LastPart(w): w = 2..100. SeqComp(w) on CPU A: f(1)

  26. Illustration: f(i), i = 1..100. LastPart(w): w = 3..100. SeqComp(w) on CPU A: f(1); f(2)

  27. Illustration: f(i), i = 1..100. LastPart(w) on CPU B: w = 3..100. SeqComp(w) on CPU A: f(1); f(2)

  28. Illustration: f(i), i = 1..100. LastPart(w) on CPU B splits: LastPart(w) with w = 3..51, LastPart(w’) with w’ = 52..100. SeqComp(w) on CPU A: f(1); f(2). SeqComp(w’)

  29. Illustration: f(i), i = 1..100. LastPart(w): w = 3..51. LastPart(w’): w’ = 52..100. SeqComp(w) on CPU A: f(1); f(2). SeqComp(w’)

  30. Illustration: f(i), i = 1..100. LastPart(w): w = 3..51. LastPart(w’): w’ = 53..100. SeqComp(w) on CPU A: f(1); f(2). SeqComp(w’) on CPU B: f(52)

  31. Adaptivity • Kaapi: reification, interaction with the environment (adding resources), … (interaction) • But also: an impact on algorithm design / scheduling • Example: work-stealing based algorithms • Recursive parallel computations with local sequential computation • Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm • Example: prefix computation • Sequential: n operations • Parallel on p identical resources: at least 2n·(p/(p+1)) operations • Adaptive with work-stealing: coupling sequential and parallel partial-prefix computations; may benefit from an unbounded number of resources • Performance on p processors of variable speeds: 2n/(p+1) + O(log n)

  32. Adaptive algorithms • Recursive computations with local sequential computation • Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm • Example: prefix computation • Sequential: n operations • Parallel on p identical resources: at least 2n·(p/(p+1)) operations • Adaptive with work-stealing: coupling sequential and parallel partial-prefix computations; may benefit from an unbounded number of resources • Performance on p processors of variable speeds: 2n/(p+1) + O(log n)

  33. E.g. triangular system solving. Sequential algorithm: T_1 = n²/2, T_∞ = n (fine grain). (diagram: lower-triangular systems A·x = b of decreasing dimension.) 1/ x_1 = b_1 / a_11; 2/ for k = 2..n: b_k = b_k − a_k1·x_1. A system of dimension n is reduced to a system of dimension n−1.
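A direct C++ rendering of this sequential scheme (dense row-major storage assumed for illustration):

    // Forward substitution on a lower-triangular system A.x = b:
    // ~n^2/2 multiply-adds, critical path of length ~n.
    void trsv_lower(int n, const double* A, double* b) {  // b: rhs in, x out
        for (int k = 0; k < n; ++k) {
            b[k] /= A[k * n + k];                // x_k = b_k / a_kk
            for (int i = k + 1; i < n; ++i)
                b[i] -= A[i * n + k] * b[k];     // fold x_k into the smaller system
        }
    }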

  34. E.g. triangular system solving (continued). Sequential algorithm: T_1 = n²/2, T_∞ = n (fine grain). • Using parallel matrix inversion: T_1 = n³, T_∞ = log² n (fine grain): write A = [ A_11 0 ; A_21 A_22 ]; then A⁻¹ = [ A_11⁻¹ 0 ; S A_22⁻¹ ] with S = −A_22⁻¹·A_21·A_11⁻¹, and x = A⁻¹·b. • Self-adaptive granularity algorithm: T_1 = n², T_∞ = n·log n. (diagram: the self-adaptive solve on an m×h block couples a self-adaptive sequential algorithm with self-adaptive matrix inversion and self-adaptive scalar products; ExtractPar chooses the split h = m.)
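Written out, the block identity this relies on is:

\[
A = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix},\qquad
A^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ S & A_{22}^{-1} \end{pmatrix},\qquad
S = -A_{22}^{-1} A_{21} A_{11}^{-1},
\]

which one checks by block multiplication: the off-diagonal block of A·A⁻¹ is A_{21}A_{11}^{-1} + A_{22}S = A_{21}A_{11}^{-1} − A_{21}A_{11}^{-1} = 0, and both diagonal blocks are identities, so x = A⁻¹·b.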

  35. Parallel algorithms with adaptive grain: some examples • Scheduling fine-grain parallel programs: work-stealing and efficiency • Adaptive-grain algorithms: principle of a dynamic « cascade » • Examples: iterated product, prefix; gzip compression; triangular system solving; 3D vision / oct-tree computation

  36. Experiment: parallel <=> adaptive. Iterated product: sequential, parallel, adaptive [Davide Vernizzi] • Sequential: input: an array of n values; output: the accumulated result; C/C++ code: for (i=0; i<n; i++) res += atoi(x[i]); • Parallel algorithm: recursive computation by blocks (binary tree with merge); block size = pagesize • Kaapi code: Athapascan API

  37. Experiment: the parallel algorithm costs about twice as much as the sequential one; the adaptive algorithm reaches an efficiency close to 1. Variant: sum of pages • Input: a set of n pages; each page is an array of values • Output: one page where each element is the sum of the elements of the same index in the input pages • C/C++ code: for (i=0; i<n; i++) for (j=0; j<pageSize; j++) res[j] += f(pages[i][j]);

  38. Demonstration on ensibull. Script:

    [vernizzd@ensibull demo]$ more go-tout.sh
    #!/bin/sh
    ./spg /tmp/data &
    ./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
    ./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result:

    [vernizzd@ensibull demo]$ ./go-tout.sh
    Page size: 4096
    Memory allocated
    Memory allocated
    0: In main: th = 1, parallel
    0: res = -2.048e+07
    0: time = 0.408178 s          ADAPTIVE (3 procs)
    0: Threads created: 54
    0: res = -2.048e+07
    0: time = 0.964014 s          PARALLEL (3 procs)
    0: #fork = 7497
    : res = -2.048e+07
    : time = 1.15204 s            SEQUENTIAL (1 proc)

  39. Where does the difference come from? … The program sources. Sources of the page-sum codes: parallel: binary tree; adaptive: by coupling - sequential + Fork<LastPartComp>, where LastPartComp (recursively) generates 3 tasks

  40. Parallel algorithm:

    struct Iterated {
      void operator() (a1::Shared_w<Page> res, int start, int stop) {
        if ((stop - start) < 2) {
          // Max number of pages reached: sequential algorithm
          Page resLocal (pageSize);
          IteratedSeq(start, resLocal);
          res.write(resLocal);
        } else {
          // Otherwise split the range in two and merge the results
          int half = (start + stop) / 2;
          a1::Shared<Page> res1;                    // first thread result
          a1::Shared<Page> res2;                    // second thread result
          a1::Fork<Iterated>() (res1, start, half); // first thread
          a1::Fork<Iterated>() (res2, half, stop);  // second thread
          a1::Fork<Merge>() (res, res1, res2);      // merging results
        }
      }
    };

  41. Adaptive parallelization • Block computation on an input split into k blocks: 1 block = pagesize • Independent execution of the k tasks • Merge of the results

  42. Adaptive algorithm (1/3) • Hypothesis: non-preemptive, work-stealing scheduling • Coupling of the adaptive sequential computation:

    void Adaptative (a1::Shared_w<Page>* resLocal, DescWork dw) {
      // cout << "Adaptative" << endl;
      a1::Shared<Page> resLPC;                       // result of the stolen last part
      a1::Fork<LPC>() (resLPC, dw);                  // expose LastPartComputation to thieves
      Page resSeq (pageSize);
      AdaptSeq (dw, &resSeq);                        // local sequential computation
      a1::Fork<Merge>() (resLPC, *resLocal, resSeq); // merge both results
    }

  43. Adaptive algorithm (2/3) • Sequential side:

    void AdaptSeq (DescWork dw, Page* resSeq) {
      DescLocalWork w;
      Page resLoc (pageSize);
      double k;
      while (!dw.desc->extractSeq(&w)) {   // grab the next local unit of work
        for (int i = 0; i < pageSize; i++) {
          k = resLoc.get(i) + (double) buff[w * pageSize + i];
          resLoc.put(i, k);                // accumulate page w into the local result
        }
      }
      *resSeq = resLoc;
    }

  44. Adaptive algorithm (3/3) • Extraction side = parallel algorithm:

    struct LPC {
      void operator() (a1::Shared_w<Page> resLPC, DescWork dw) {
        DescWork dw2;
        dw2.Allocate();
        dw2.desc->l.initialize();
        if (dw.desc->extractPar(&dw2)) {   // steal the last part of the remaining work
          a1::Shared<Page> res2;
          a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
          a1::Shared<Page> resLPCold;
          a1::Fork<LPC>() (resLPCold, dw); // keep exposing what remains
          a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
        }
      }
    };

  45. Adaptive parallelization • A single computation task is started for the whole input • The remaining work is divided only when a processor becomes idle • Fewer tasks, fewer merges

  46. Example 2: parallelizing gzip • Gzip: widely used (web) and costly, although of linear complexity; source code: 10000 lines of C with complex data structures; principle: LZ77 + Huffman tree • Why gzip? A P-complete problem, yet a practical parallelization is possible; drawback: every (known) parallelization induces an overhead -> loss of compression ratio

  47. How to parallelize gzip? (diagram: input file -> static partition in blocks -> parallel compression -> compressed blocks -> compressed file; versus dynamic partition in blocks -> on-the-fly compression.) An « easy » parallelization, 100% compatible with gzip/gunzip. Problems: loss of compression ratio, the grain depends on the machine, overhead. A sketch of this static-block scheme follows.
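A C++ sketch of the static-block scheme, assuming zlib is available (the helpers gzip_block and parallel_gzip are illustrative, not gzip's actual code): each block is emitted as an independent gzip member, and gunzip accepts concatenated members, which is what keeps the output 100% compatible; the per-block loss of LZ77 context is exactly the compression-ratio loss mentioned above.

    #include <zlib.h>
    #include <algorithm>
    #include <string>
    #include <thread>
    #include <vector>

    std::string gzip_block(const char* data, size_t len) {
        z_stream zs{};
        // windowBits = 15+16 selects the gzip wrapper instead of zlib's
        deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY);
        std::string out(deflateBound(&zs, len), '\0');
        zs.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(data));
        zs.avail_in  = static_cast<uInt>(len);
        zs.next_out  = reinterpret_cast<Bytef*>(&out[0]);
        zs.avail_out = static_cast<uInt>(out.size());
        deflate(&zs, Z_FINISH);            // output buffer is large enough
        out.resize(out.size() - zs.avail_out);
        deflateEnd(&zs);
        return out;
    }

    std::string parallel_gzip(const std::string& in, size_t block) {
        size_t k = (in.size() + block - 1) / block;
        std::vector<std::string> parts(k);
        std::vector<std::thread> ts;
        for (size_t b = 0; b < k; ++b)     // one compression task per block
            ts.emplace_back([&, b] {
                size_t off = b * block;
                parts[b] = gzip_block(in.data() + off,
                                      std::min(block, in.size() - off));
            });
        for (auto& t : ts) t.join();
        std::string out;
        for (auto& p : parts) out += p;    // the "cat" step of the diagram
        return out;
    }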

  48. Adaptive-grain gzip parallelization. (diagram: input file -> dynamic partition in blocks; SeqComp compresses on the fly while LastPartComputation hands the remainder to parallel compression; the compressed blocks are concatenated (cat) into the output compressed file.)

  49. Overhead in compressed file size versus gain in execution time (experimental plots).

  50. Performance (Pentium, 4 × 200 MHz).
