1 / 32

Tailoring Data-Intensive Systems via Source Code Generation

Tailoring Data-Intensive Systems via Source Code Generation. Disclaimer. I’m not a data base guy!. This is a “position talk”!. Generating Data-Intensive Systems. My view: Data Management = DBMS + Application Code My View: Most tailoring on application side! My View:

errol
Télécharger la présentation

Tailoring Data-Intensive Systems via Source Code Generation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Tailoring Data-Intensive Systems viaSource Code Generation

  2. Disclaimer I’m not adata base guy! This is a “position talk”!

  3. Generating Data-Intensive Systems My view: Data Management = DBMS + Application Code My View: Most tailoring on application side! My View: Tailoring = Source Code Generation from Models

  4. Generating Data-Intensive Systems • Object-Relational Mappings (ORM)

  5. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) create table `Department` (`ID` int not null auto_increment, `DeptName` varchar(255), primary key (`ID`)) create table `Staff` (`ID` int not null auto_increment, `Name` varchar(255), `Address` varchar(255), `Phone` varchar(255), `Email` varchar(255), `OfficeExtension` varchar(255), `DepartmentID` int not null, primary key (`ID`)) alter table `Staff` add index `FK_Staff_4271` (`DepartmentID`), add constraint `FK_Staff_4271` foreign key (`DepartmentID`) references `Department` (`ID`)

  6. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • based on model transformations • archetypical MDA application • implemented in gazillions of free/academic/commercial tools

  7. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating complete Database Application Code (CRUD)

  8. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating complete Database Application Code (CRUD) • often based on template techniques • implemented in gazillions of free/academic/commercial tools

  9. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating Database Access Code (CRUD) • Generating Web Applications (WebDSL)

  10. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating Database Access Code (CRUD) • Generating Web Applications (WebDSL) • variety of techniques • implemented in free/academic/commercial tools

  11. Generating Data-Intensive Systems • Object-Relational Mappings (ORM) • Generating Database Access Code (CRUD) • Generating Web Applications (WebDSL) • Generating Advanced Applications (RBAC) • …

  12. Code Generation for Data Analysis . . . • Ground cover map: • multiple Landsat-bands • estimate pixel classes • estimate class parameters • Implementation problems: • which model? • which algorithm? • efficient C/C++ code? • correctness?

  13. Initial model: model landsat as ‘Landsat Clustering’. const nat N as ‘number of pixels’. const nat B as ‘number of bands’. const nat C := 5 as ‘number of classes’ where C << N. double phi(1..N) as ‘class weights’ where 1 = sum(I := 1..C, phi(I)). double mu(1..C), sig(1..C) where 0 < sig(_). int c(1..N) as ‘class assignments’. c(_) ~ discrete(phi). data double x(1..N, 1..B) as ‘pixels’. x(I,_) ~ gauss(mu(c(I)), sig(c(I))). max pr(x | {phi,mu,sig}) for {phi,mu,sig}. Model refinements: sig(_) ~ invgamma(delta/2+1,sig0*delta/2). Model changes: x(I,_) ~ cauchy(mu(c(I)), sig(c(I))). x(I,_) ~ mix(c(I) cases 1 -> gauss(0, error), _ -> cauchy(mu(c(I)),sig(c(I)))). Code Generation for Data Analysis . . . • Ground cover map: • multiple Landsat-bands • estimate pixel classes • estimate class parameters • Implementation problems: • which model? • which algorithm? • efficient C/C++ code? • correctness?

  14. //----------------------------------------------------------------------------//---------------------------------------------------------------------------- // Code file generated by AutoBayes V0.0.1 // (c) 1999-2004 ASE Group, NASA Ames Research Center // Problem: Mixture of Gaussians // Source: examples/cover.ab // Command: ./autobayes // examples/cover.ab // Generated: Wed Mar 10 16:07:54 2004 //---------------------------------------------------------------------------- #include "autobayes.h" //---------------------------------------------------------------------------- // Octave Function: cover //---------------------------------------------------------------------------- DEFUN_DLD(cover,input_args,output_args, "") { octave_value_list retval; if (input_args.length () != 4 || output_args != 3 ){ octave_stdout << "wrong usage"; return retval; } //-- Input declarations ---------------------------------------------------- // Number of classes octave_value arg_n_classes = input_args(0); if (!arg_n_classes.is_real_scalar()){ gripe_wrong_type_arg("n_classes", (const string &)"Scalar expected"); return retval; } double n_classes = (double)(arg_n_classes.double_value()); octave_value arg_x = input_args(1); if (!arg_x.is_real_matrix()){ gripe_wrong_type_arg("x", (const string &)"Matrix expected"); return retval; } Matrix x = (Matrix)(arg_x.matrix_value()); // Iteration tolerance for convergence loop octave_value arg_tolerance1 = input_args(2); if (!arg_tolerance1.is_real_scalar()){ gripe_wrong_type_arg("tolerance1", (const string &)"Scalar expected"); return retval; } double tolerance1 = (double)(arg_tolerance1.double_value()); // maximal number of EM iterations octave_value arg_maxiteration = input_args(3); if (!arg_maxiteration.is_real_scalar()){ gripe_wrong_type_arg("maxiteration", (const string &)"Scalar expected"); return retval; } double maxiteration = (double)(arg_maxiteration.double_value()); //-- Constant declarations ------------------------------------------------- // Number of data dimensions int n_dim = arg_x.rows(); // Number of data points int n_points = arg_x.columns(); // tolerance for appr. 600 data points double tolerance = 0.0003; //-- Output declarations --------------------------------------------------- Matrix mu(n_dim, n_classes); ColumnVector rho(n_classes); Matrix sigma(n_dim, n_classes); //-- Local declarations ---------------------------------------------------- // class assignment vector ColumnVector c(n_points); // Label: label0 // class membership table used in Discrete EM-algorithm Matrix q(n_points, n_classes); // local centers used for center-based initialization Matrix center(n_classes, n_dim); // Random index of data point int pick; // Loop variable int pv68; int pv65; int pv66; int pv42; int pv43; int pv13; int pv24; int pv56; int pv80; // Summation accumulator double pv83; double pv84; double pv85; // Lagrange-multiplier double l; // Common subexpression double pv52; // Memoized common subexpression ColumnVector pv58(n_classes); // Common subexpression double pv60; int pv70; int pv69; int pv71; Matrix muold(n_dim, n_classes); ColumnVector rhoold(n_classes); Matrix sigmaold(n_dim, n_classes); int pv73; int pv74; int pv75; int pv76; int pv77; // convergence loop counter int loopcounter; // sum up the Diffs double pv86; // Summation accumulator double pv92; double pv94; double pv95; double pv99; double pv98; double pv100; int pv26; int pv49; int pv48; int pv50; // Product accumulator double pv96; int pv53; int pv55; int pv89; int pv88; int pv87; int pv90; int pv91; // Check constraints on inputs ab_assert( 0 < n_classes ); ab_assert( 10 * n_classes < n_points ); ab_assert( 0 < n_dim ); ab_assert( 0 < n_points ); // Label: label1 // Label: label2 // Label: label4 // Discrete EM-algorithm // // The model describes a discrete latent (or hidden) variable problem with // the latent variable c and the data variable x. The problem to optimize // the conditional probability pr(x | {mu, rho, sigma}) w.r.t. the variables // mu, rho, and sigma can thus be solved by an application of the (discrete) // EM-algorithm. // The algorithm maintains as central data structure a class membership // table q (see "label0") such that q(pv13, pv61) is the probability that // data point pv13 belongs to class pv61, i.e., // q(pv13, pv61) == pr([c(pv13) == pv61]) // The algorithm consists of an initialization phase for q (see "label2"), // followed by a convergence phase (see "label5"). // // Initialization // The initialization is center-based, i.e., for each class (i.e., value of // the hidden variable c) a center value center is chosen first // (see "label4"). Then, the values for the local distribution are // calculated as distances between the data points and these center values // (see "label6"). // // Random initialization of the centers center with data points; // note that a data point can be picked as center more than once. for( pv65 = 0;pv65 <= n_classes - 1;pv65++ ) { pick = uniform_int_rnd(n_points - 1 - 0); for( pv66 = 0;pv66 <= n_dim - 1;pv66++ ) center(pv65, pv66) = x(pv66, pick); } // Label: label6 for( pv13 = 0;pv13 <= n_points - 1;pv13++ ) for( pv68 = 0;pv68 <= n_classes - 1;pv68++ ) { pv83 = 0; for( pv69 = 0;pv69 <= n_dim - 1;pv69++ ) pv83 += (center(pv68, pv69) - x(pv69, pv13)) * (center(pv68, pv69) - x(pv69, pv13)); pv85 = 0; for( pv70 = 0;pv70 <= n_classes - 1;pv70++ ) { pv84 = 0; for( pv71 = 0;pv71 <= n_dim - 1;pv71++ ) pv84 += (center(pv70, pv71) - x(pv71, pv13)) * (center(pv70, pv71) - x(pv71, pv13)); pv85 += sqrt(pv84); } q(pv13, pv68) = sqrt(pv83) / pv85; } // Label: label5 // EM-loop // // The EM-loop iterates two steps, expectation (or E-Step) (see "label7"), // and maximization (or M-Step) (see "label8"); however, due to the form of // the initialization used here, the are ordered the other way around. The // loop runs until convergence in the values of the variables mu, rho, and // sigma is achieved. loopcounter = 0; // repeat at least once pv86 = tolerance1; while( ((loopcounter < maxiteration) && (pv86 >= tolerance1)) ) { loopcounter = 1 + loopcounter; if ( loopcounter > 1 ) { // assign current values to old values for( pv73 = 0;pv73 <= n_dim - 1;pv73++ ) for( pv74 = 0;pv74 <= n_classes - 1;pv74++ ) muold(pv73, pv74) = mu(pv73, pv74); for( pv75 = 0;pv75 <= n_classes - 1;pv75++ ) rhoold(pv75) = rho(pv75); for( pv76 = 0;pv76 <= n_dim - 1;pv76++ ) for( pv77 = 0;pv77 <= n_classes - 1;pv77++ ) sigmaold(pv76, pv77) = sigma(pv76, pv77); } else ; // Label: label8 label3 // M-Step // // Decomposition I // The problem to optimize the conditional probability pr({c, x} | {mu, // rho, sigma}) w.r.t. the variables mu, rho, and sigma can under the // given dependencies by Bayes rule be decomposed into two independent // subproblems: // max pr(c | rho) for rho // max pr(x | {c, mu, sigma}) for {mu, sigma} // The conditional probability pr(c | rho) is under the dependencies // given in the model equivalent to // prod([pv17 := 0 .. n_points - 1], pr(c(pv17) | rho)) // The probability occuring here is atomic and can thus be replaced by // the respective probability density function given in the model. // Summing out the expected variable c(pv13) yields the log-likelihood // function // sum_domain([pv13 := 0 .. n_points - 1], // [pv18 := 0 .. n_classes - 1], [c(pv13)], q(pv13, pv18), // log(prod([pv17 := 0 .. n_points - 1], rho(c(pv17))))) // which can be simplified to // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) // This function is then optimized w.r.t. the goal variable rho. // The expression // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) // is maximized w.r.t. the variable rho under the constraint // 1 == sum([pv23 := 0 .. n_classes - 1], rho(pv23)) // using the Lagrange-multiplier l. l = n_points; for( pv24 = 0;pv24 <= n_classes - 1;pv24++ ) { // The summand // l // is constant with respect to the goal variable rho(pv24) and can // thus be ignored for maximization. The function // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * // sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) - // l * sum([pv23 := 0 .. n_classes - 1], rho(pv23)) // is then symbolically maximized w.r.t. the goal variable rho(pv24). // The differential // sum([pv17 := 0 .. n_points - 1], q(pv17, pv24)) / rho(pv24) - l // is set to zero; this equation yields the solution // sum([pv26 := 0 .. n_points - 1], q(pv26, pv24)) / l pv92 = 0; for( pv26 = 0;pv26 <= n_points - 1;pv26++ ) pv92 += q(pv26, pv24); rho(pv24) = pv92 / l; } // The conditional probability pr(x | {c, mu, sigma}) is under the // dependencies given in the model equivalent to // prod([pv32 := 0 .. n_dim - 1, pv33 := 0 .. n_points - 1], // pr(x(pv32, pv33) | {c(pv33), mu(pv32, *), sigma(pv32, *)})) // The probability occuring here is atomic and can thus be replaced by // the respective probability density function given in the model. // Summing out the expected variable c(pv13) yields the log-likelihood // function // sum_domain([pv13 := 0 .. n_points - 1], // [pv34 := 0 .. n_classes - 1], [c(pv13)], q(pv13, pv34), // log(prod([pv32 := 0 .. n_dim - 1, // pv33 := 0 .. n_points - 1], // exp((x(pv32, pv33) - mu(pv32, c(pv33))) ** 2 / // -2 / sigma(pv32, c(pv33)) ** 2) / // (sqrt(2 * pi) * sigma(pv32, c(pv33)))))) // which can be simplified to // -((n_dim * n_points * (log(2) + log(pi)) + // sum([pv33 := 0 .. n_points - 1, pv34 := 0 .. n_classes - 1], // q(pv33, pv34) * // sum([pv32 := 0 .. n_dim - 1], // (x(pv32, pv33) - mu(pv32, pv34)) ** 2 / // sigma(pv32, pv34) ** 2))) / 2 + // sum([pv34 := 0 .. n_classes - 1], // sum([pv32 := 0 .. n_dim - 1], log(sigma(pv32, pv34))) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv34)))) // This function is then optimized w.r.t. the goal variables mu and // sigma. // The summands // n_dim * n_points * log(2) / -2 // n_dim * n_points * log(pi) / -2 // are constant with respect to the goal variables mu and sigma and can // thus be ignored for maximization. // Index decomposition // The function // -(sum([pv33 := 0 .. n_points - 1, pv34 := 0 .. n_classes - 1], // q(pv33, pv34) * // sum([pv32 := 0 .. n_dim - 1], // (x(pv32, pv33) - mu(pv32, pv34)) ** 2 / // sigma(pv32, pv34) ** 2)) / 2 + // sum([pv34 := 0 .. n_classes - 1], // sum([pv32 := 0 .. n_dim - 1], log(sigma(pv32, pv34))) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv34)))) // can be optimized w.r.t. the variables mu(pv42, pv43) and // sigma(pv42, pv43) element by element (i.e., along the index variables // pv42 and pv43) because there are no dependencies along thats dimensions. for( pv42 = 0;pv42 <= n_dim - 1;pv42++ ) for( pv43 = 0;pv43 <= n_classes - 1;pv43++ ) // The factors // n_classes // n_dim // are non-negative and constant with respect to the goal variables // mu(pv42, pv43) and sigma(pv42, pv43) and can thus be ignored for // maximization. // The function // -(log(sigma(pv42, pv43)) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43)) + // sum([pv33 := 0 .. n_points - 1], // (x(pv42, pv33) - mu(pv42, pv43)) ** 2 * q(pv33, pv43)) / // (2 * sigma(pv42, pv43) ** 2)) // is then symbolically maximized w.r.t. the goal variables // mu(pv42, pv43) and sigma(pv42, pv43). The partial differentials // df / d_mu(pvar(42), pvar(43)) == // (sum([pv33 := 0 .. n_points - 1], // q(pv33, pv43) * x(pv42, pv33)) - // mu(pv42, pv43) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43))) / // sigma(pv42, pv43) ** 2 // df / d_sigma(pvar(42), pvar(43)) == // sum([pv33 := 0 .. n_points - 1], // (x(pv42, pv33) - mu(pv42, pv43)) ** 2 * q(pv33, pv43)) / // sigma(pv42, pv43) ** 3 - // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43)) / // sigma(pv42, pv43) // are set to zero; these equations yield the solutions // mu(pv42, pv43) == // cond(0 == sum([pv46 := 0 .. n_points - 1], q(pv46, pv43)), // fail(division_by_zero), // sum([pv48 := 0 .. n_points - 1], // q(pv48, pv43) * x(pv42, pv48)) / // sum([pv47 := 0 .. n_points - 1], q(pv47, pv43))) // sigma(pv42, pv43) == // cond(0 == sum([pv49 := 0 .. n_points - 1], q(pv49, pv43)), // fail(division_by_zero), // sqrt(sum([pv50 := 0 .. n_points - 1], // (x(pv42, pv50) - mu(pv42, pv43)) ** 2 * // q(pv50, pv43)) / // sum([pv51 := 0 .. n_points - 1], q(pv51, pv43)))) { // Initialization of common subexpression pv52 = 0; for( pv49 = 0;pv49 <= n_points - 1;pv49++ ) pv52 += q(pv49, pv43); if ( 0 == pv52 ) { ab_error( division_by_zero ); } else { pv94 = 0; for( pv48 = 0;pv48 <= n_points - 1;pv48++ ) pv94 += q(pv48, pv43) * x(pv42, pv48); mu(pv42, pv43) = pv94 / pv52; } if ( 0 == pv52 ) { ab_error( division_by_zero ); } else { pv95 = 0; for( pv50 = 0;pv50 <= n_points - 1;pv50++ ) pv95 += (x(pv42, pv50) - mu(pv42, pv43)) * (x(pv42, pv50) - mu(pv42, pv43)) * q(pv50, pv43); sigma(pv42, pv43) = sqrt(pv95 / pv52); } } // Label: label7 // E-Step // Update the current values of the class membership table q. for( pv13 = 0;pv13 <= n_points - 1;pv13++ ) { // Initialization of common subexpression for( pv56 = 0;pv56 <= n_classes - 1;pv56++ ) { pv96 = 1; for( pv53 = 0;pv53 <= n_dim - 1;pv53++ ) pv96 *= exp((x(pv53, pv13) - mu(pv53, pv56)) * (x(pv53, pv13) - mu(pv53, pv56)) / (double)(-2) / (sigma(pv53, pv56) * sigma(pv53, pv56))) / (sqrt(M_PI * (double)(2)) * sigma(pv53, pv56)); pv58(pv56) = pv96 * rho(pv56); } pv60 = 0; for( pv55 = 0;pv55 <= n_classes - 1;pv55++ ) pv60 += pv58(pv55); for( pv80 = 0;pv80 <= n_classes - 1;pv80++ ) // The denominator pv60 can become zero due to round-off errors. // In that case, each class is considered to be equally likely. if ( pv60 == 0 ) q(pv13, pv80) = (double)(1) / (double)(n_classes); else q(pv13, pv80) = pv58(pv80) / pv60; } if ( loopcounter > 1 ) { pv98 = 0; for( pv87 = 0;pv87 <= n_dim - 1;pv87++ ) for( pv88 = 0;pv88 <= n_classes - 1;pv88++ ) pv98 += abs(mu(pv87, pv88) - muold(pv87, pv88)) / (abs(mu(pv87, pv88)) + abs(muold(pv87, pv88))); pv99 = 0; for( pv89 = 0;pv89 <= n_classes - 1;pv89++ ) pv99 += abs(rho(pv89) - rhoold(pv89)) / (abs(rho(pv89)) + abs(rhoold(pv89))); pv100 = 0; for( pv90 = 0;pv90 <= n_dim - 1;pv90++ ) for( pv91 = 0;pv91 <= n_classes - 1;pv91++ ) pv100 += abs(sigma(pv90, pv91) - sigmaold(pv90, pv91)) / (abs(sigma(pv90, pv91)) + abs(sigmaold(pv90, pv91))); pv86 = pv98 + pv99 + pv100; } else ; } retval.resize(3); retval(0) = mu; retval(1) = rho; retval(2) = sigma; return retval; } //-- End of code --------------------------------------------------------------- Code Generation for Data Analysis Generated program: . . . • Ground cover map: • multiple Landsat-bands • estimate pixel classes • estimate class parameters • Implementation problems: • which model? • which algorithm? • efficient C/C++ code? • correctness?

  15. //----------------------------------------------------------------------------//---------------------------------------------------------------------------- // Code file generated by AutoBayes V0.0.1 // (c) 1999-2004 ASE Group, NASA Ames Research Center // Problem: Mixture of Gaussians // Source: examples/cover.ab // Command: ./autobayes // examples/cover.ab // Generated: Wed Mar 10 16:07:54 2004 //---------------------------------------------------------------------------- #include "autobayes.h" //---------------------------------------------------------------------------- // Octave Function: cover //---------------------------------------------------------------------------- DEFUN_DLD(cover,input_args,output_args, "") { octave_value_list retval; if (input_args.length () != 4 || output_args != 3 ){ octave_stdout << "wrong usage"; return retval; } //-- Input declarations ---------------------------------------------------- // Number of classes octave_value arg_n_classes = input_args(0); if (!arg_n_classes.is_real_scalar()){ gripe_wrong_type_arg("n_classes", (const string &)"Scalar expected"); return retval; } double n_classes = (double)(arg_n_classes.double_value()); octave_value arg_x = input_args(1); if (!arg_x.is_real_matrix()){ gripe_wrong_type_arg("x", (const string &)"Matrix expected"); return retval; } Matrix x = (Matrix)(arg_x.matrix_value()); // Iteration tolerance for convergence loop octave_value arg_tolerance1 = input_args(2); if (!arg_tolerance1.is_real_scalar()){ gripe_wrong_type_arg("tolerance1", (const string &)"Scalar expected"); return retval; } double tolerance1 = (double)(arg_tolerance1.double_value()); // maximal number of EM iterations octave_value arg_maxiteration = input_args(3); if (!arg_maxiteration.is_real_scalar()){ gripe_wrong_type_arg("maxiteration", (const string &)"Scalar expected"); return retval; } double maxiteration = (double)(arg_maxiteration.double_value()); //-- Constant declarations ------------------------------------------------- // Number of data dimensions int n_dim = arg_x.rows(); // Number of data points int n_points = arg_x.columns(); // tolerance for appr. 600 data points double tolerance = 0.0003; //-- Output declarations --------------------------------------------------- Matrix mu(n_dim, n_classes); ColumnVector rho(n_classes); Matrix sigma(n_dim, n_classes); //-- Local declarations ---------------------------------------------------- // class assignment vector ColumnVector c(n_points); // Label: label0 // class membership table used in Discrete EM-algorithm Matrix q(n_points, n_classes); // local centers used for center-based initialization Matrix center(n_classes, n_dim); // Random index of data point int pick; // Loop variable int pv68; int pv65; int pv66; int pv42; int pv43; int pv13; int pv24; int pv56; int pv80; // Summation accumulator double pv83; double pv84; double pv85; // Lagrange-multiplier double l; // Common subexpression double pv52; // Memoized common subexpression ColumnVector pv58(n_classes); // Common subexpression double pv60; int pv70; int pv69; int pv71; Matrix muold(n_dim, n_classes); ColumnVector rhoold(n_classes); Matrix sigmaold(n_dim, n_classes); int pv73; int pv74; int pv75; int pv76; int pv77; // convergence loop counter int loopcounter; // sum up the Diffs double pv86; // Summation accumulator double pv92; double pv94; double pv95; double pv99; double pv98; double pv100; int pv26; int pv49; int pv48; int pv50; // Product accumulator double pv96; int pv53; int pv55; int pv89; int pv88; int pv87; int pv90; int pv91; // Check constraints on inputs ab_assert( 0 < n_classes ); ab_assert( 10 * n_classes < n_points ); ab_assert( 0 < n_dim ); ab_assert( 0 < n_points ); // Label: label1 // Label: label2 // Label: label4 // Discrete EM-algorithm // // The model describes a discrete latent (or hidden) variable problem with // the latent variable c and the data variable x. The problem to optimize // the conditional probability pr(x | {mu, rho, sigma}) w.r.t. the variables // mu, rho, and sigma can thus be solved by an application of the (discrete) // EM-algorithm. // The algorithm maintains as central data structure a class membership // table q (see "label0") such that q(pv13, pv61) is the probability that // data point pv13 belongs to class pv61, i.e., // q(pv13, pv61) == pr([c(pv13) == pv61]) // The algorithm consists of an initialization phase for q (see "label2"), // followed by a convergence phase (see "label5"). // // Initialization // The initialization is center-based, i.e., for each class (i.e., value of // the hidden variable c) a center value center is chosen first // (see "label4"). Then, the values for the local distribution are // calculated as distances between the data points and these center values // (see "label6"). // // Random initialization of the centers center with data points; // note that a data point can be picked as center more than once. for( pv65 = 0;pv65 <= n_classes - 1;pv65++ ) { pick = uniform_int_rnd(n_points - 1 - 0); for( pv66 = 0;pv66 <= n_dim - 1;pv66++ ) center(pv65, pv66) = x(pv66, pick); } // Label: label6 for( pv13 = 0;pv13 <= n_points - 1;pv13++ ) for( pv68 = 0;pv68 <= n_classes - 1;pv68++ ) { pv83 = 0; for( pv69 = 0;pv69 <= n_dim - 1;pv69++ ) pv83 += (center(pv68, pv69) - x(pv69, pv13)) * (center(pv68, pv69) - x(pv69, pv13)); pv85 = 0; for( pv70 = 0;pv70 <= n_classes - 1;pv70++ ) { pv84 = 0; for( pv71 = 0;pv71 <= n_dim - 1;pv71++ ) pv84 += (center(pv70, pv71) - x(pv71, pv13)) * (center(pv70, pv71) - x(pv71, pv13)); pv85 += sqrt(pv84); } q(pv13, pv68) = sqrt(pv83) / pv85; } // Label: label5 // EM-loop // // The EM-loop iterates two steps, expectation (or E-Step) (see "label7"), // and maximization (or M-Step) (see "label8"); however, due to the form of // the initialization used here, the are ordered the other way around. The // loop runs until convergence in the values of the variables mu, rho, and // sigma is achieved. loopcounter = 0; // repeat at least once pv86 = tolerance1; while( ((loopcounter < maxiteration) && (pv86 >= tolerance1)) ) { loopcounter = 1 + loopcounter; if ( loopcounter > 1 ) { // assign current values to old values for( pv73 = 0;pv73 <= n_dim - 1;pv73++ ) for( pv74 = 0;pv74 <= n_classes - 1;pv74++ ) muold(pv73, pv74) = mu(pv73, pv74); for( pv75 = 0;pv75 <= n_classes - 1;pv75++ ) rhoold(pv75) = rho(pv75); for( pv76 = 0;pv76 <= n_dim - 1;pv76++ ) for( pv77 = 0;pv77 <= n_classes - 1;pv77++ ) sigmaold(pv76, pv77) = sigma(pv76, pv77); } else ; // Label: label8 label3 // M-Step // // Decomposition I // The problem to optimize the conditional probability pr({c, x} | {mu, // rho, sigma}) w.r.t. the variables mu, rho, and sigma can under the // given dependencies by Bayes rule be decomposed into two independent // subproblems: // max pr(c | rho) for rho // max pr(x | {c, mu, sigma}) for {mu, sigma} // The conditional probability pr(c | rho) is under the dependencies // given in the model equivalent to // prod([pv17 := 0 .. n_points - 1], pr(c(pv17) | rho)) // The probability occuring here is atomic and can thus be replaced by // the respective probability density function given in the model. // Summing out the expected variable c(pv13) yields the log-likelihood // function // sum_domain([pv13 := 0 .. n_points - 1], // [pv18 := 0 .. n_classes - 1], [c(pv13)], q(pv13, pv18), // log(prod([pv17 := 0 .. n_points - 1], rho(c(pv17))))) // which can be simplified to // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) // This function is then optimized w.r.t. the goal variable rho. // The expression // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) // is maximized w.r.t. the variable rho under the constraint // 1 == sum([pv23 := 0 .. n_classes - 1], rho(pv23)) // using the Lagrange-multiplier l. l = n_points; for( pv24 = 0;pv24 <= n_classes - 1;pv24++ ) { // The summand // l // is constant with respect to the goal variable rho(pv24) and can // thus be ignored for maximization. The function // sum([pv18 := 0 .. n_classes - 1], // log(rho(pv18)) * // sum([pv17 := 0 .. n_points - 1], q(pv17, pv18))) - // l * sum([pv23 := 0 .. n_classes - 1], rho(pv23)) // is then symbolically maximized w.r.t. the goal variable rho(pv24). // The differential // sum([pv17 := 0 .. n_points - 1], q(pv17, pv24)) / rho(pv24) - l // is set to zero; this equation yields the solution // sum([pv26 := 0 .. n_points - 1], q(pv26, pv24)) / l pv92 = 0; for( pv26 = 0;pv26 <= n_points - 1;pv26++ ) pv92 += q(pv26, pv24); rho(pv24) = pv92 / l; } // The conditional probability pr(x | {c, mu, sigma}) is under the // dependencies given in the model equivalent to // prod([pv32 := 0 .. n_dim - 1, pv33 := 0 .. n_points - 1], // pr(x(pv32, pv33) | {c(pv33), mu(pv32, *), sigma(pv32, *)})) // The probability occuring here is atomic and can thus be replaced by // the respective probability density function given in the model. // Summing out the expected variable c(pv13) yields the log-likelihood // function // sum_domain([pv13 := 0 .. n_points - 1], // [pv34 := 0 .. n_classes - 1], [c(pv13)], q(pv13, pv34), // log(prod([pv32 := 0 .. n_dim - 1, // pv33 := 0 .. n_points - 1], // exp((x(pv32, pv33) - mu(pv32, c(pv33))) ** 2 / // -2 / sigma(pv32, c(pv33)) ** 2) / // (sqrt(2 * pi) * sigma(pv32, c(pv33)))))) // which can be simplified to // -((n_dim * n_points * (log(2) + log(pi)) + // sum([pv33 := 0 .. n_points - 1, pv34 := 0 .. n_classes - 1], // q(pv33, pv34) * // sum([pv32 := 0 .. n_dim - 1], // (x(pv32, pv33) - mu(pv32, pv34)) ** 2 / // sigma(pv32, pv34) ** 2))) / 2 + // sum([pv34 := 0 .. n_classes - 1], // sum([pv32 := 0 .. n_dim - 1], log(sigma(pv32, pv34))) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv34)))) // This function is then optimized w.r.t. the goal variables mu and // sigma. // The summands // n_dim * n_points * log(2) / -2 // n_dim * n_points * log(pi) / -2 // are constant with respect to the goal variables mu and sigma and can // thus be ignored for maximization. // Index decomposition // The function // -(sum([pv33 := 0 .. n_points - 1, pv34 := 0 .. n_classes - 1], // q(pv33, pv34) * // sum([pv32 := 0 .. n_dim - 1], // (x(pv32, pv33) - mu(pv32, pv34)) ** 2 / // sigma(pv32, pv34) ** 2)) / 2 + // sum([pv34 := 0 .. n_classes - 1], // sum([pv32 := 0 .. n_dim - 1], log(sigma(pv32, pv34))) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv34)))) // can be optimized w.r.t. the variables mu(pv42, pv43) and // sigma(pv42, pv43) element by element (i.e., along the index variables // pv42 and pv43) because there are no dependencies along thats dimensions. for( pv42 = 0;pv42 <= n_dim - 1;pv42++ ) for( pv43 = 0;pv43 <= n_classes - 1;pv43++ ) // The factors // n_classes // n_dim // are non-negative and constant with respect to the goal variables // mu(pv42, pv43) and sigma(pv42, pv43) and can thus be ignored for // maximization. // The function // -(log(sigma(pv42, pv43)) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43)) + // sum([pv33 := 0 .. n_points - 1], // (x(pv42, pv33) - mu(pv42, pv43)) ** 2 * q(pv33, pv43)) / // (2 * sigma(pv42, pv43) ** 2)) // is then symbolically maximized w.r.t. the goal variables // mu(pv42, pv43) and sigma(pv42, pv43). The partial differentials // df / d_mu(pvar(42), pvar(43)) == // (sum([pv33 := 0 .. n_points - 1], // q(pv33, pv43) * x(pv42, pv33)) - // mu(pv42, pv43) * // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43))) / // sigma(pv42, pv43) ** 2 // df / d_sigma(pvar(42), pvar(43)) == // sum([pv33 := 0 .. n_points - 1], // (x(pv42, pv33) - mu(pv42, pv43)) ** 2 * q(pv33, pv43)) / // sigma(pv42, pv43) ** 3 - // sum([pv33 := 0 .. n_points - 1], q(pv33, pv43)) / // sigma(pv42, pv43) // are set to zero; these equations yield the solutions // mu(pv42, pv43) == // cond(0 == sum([pv46 := 0 .. n_points - 1], q(pv46, pv43)), // fail(division_by_zero), // sum([pv48 := 0 .. n_points - 1], // q(pv48, pv43) * x(pv42, pv48)) / // sum([pv47 := 0 .. n_points - 1], q(pv47, pv43))) // sigma(pv42, pv43) == // cond(0 == sum([pv49 := 0 .. n_points - 1], q(pv49, pv43)), // fail(division_by_zero), // sqrt(sum([pv50 := 0 .. n_points - 1], // (x(pv42, pv50) - mu(pv42, pv43)) ** 2 * // q(pv50, pv43)) / // sum([pv51 := 0 .. n_points - 1], q(pv51, pv43)))) { // Initialization of common subexpression pv52 = 0; for( pv49 = 0;pv49 <= n_points - 1;pv49++ ) pv52 += q(pv49, pv43); if ( 0 == pv52 ) { ab_error( division_by_zero ); } else { pv94 = 0; for( pv48 = 0;pv48 <= n_points - 1;pv48++ ) pv94 += q(pv48, pv43) * x(pv42, pv48); mu(pv42, pv43) = pv94 / pv52; } if ( 0 == pv52 ) { ab_error( division_by_zero ); } else { pv95 = 0; for( pv50 = 0;pv50 <= n_points - 1;pv50++ ) pv95 += (x(pv42, pv50) - mu(pv42, pv43)) * (x(pv42, pv50) - mu(pv42, pv43)) * q(pv50, pv43); sigma(pv42, pv43) = sqrt(pv95 / pv52); } } // Label: label7 // E-Step // Update the current values of the class membership table q. for( pv13 = 0;pv13 <= n_points - 1;pv13++ ) { // Initialization of common subexpression for( pv56 = 0;pv56 <= n_classes - 1;pv56++ ) { pv96 = 1; for( pv53 = 0;pv53 <= n_dim - 1;pv53++ ) pv96 *= exp((x(pv53, pv13) - mu(pv53, pv56)) * (x(pv53, pv13) - mu(pv53, pv56)) / (double)(-2) / (sigma(pv53, pv56) * sigma(pv53, pv56))) / (sqrt(M_PI * (double)(2)) * sigma(pv53, pv56)); pv58(pv56) = pv96 * rho(pv56); } pv60 = 0; for( pv55 = 0;pv55 <= n_classes - 1;pv55++ ) pv60 += pv58(pv55); for( pv80 = 0;pv80 <= n_classes - 1;pv80++ ) // The denominator pv60 can become zero due to round-off errors. // In that case, each class is considered to be equally likely. if ( pv60 == 0 ) q(pv13, pv80) = (double)(1) / (double)(n_classes); else q(pv13, pv80) = pv58(pv80) / pv60; } if ( loopcounter > 1 ) { pv98 = 0; for( pv87 = 0;pv87 <= n_dim - 1;pv87++ ) for( pv88 = 0;pv88 <= n_classes - 1;pv88++ ) pv98 += abs(mu(pv87, pv88) - muold(pv87, pv88)) / (abs(mu(pv87, pv88)) + abs(muold(pv87, pv88))); pv99 = 0; for( pv89 = 0;pv89 <= n_classes - 1;pv89++ ) pv99 += abs(rho(pv89) - rhoold(pv89)) / (abs(rho(pv89)) + abs(rhoold(pv89))); pv100 = 0; for( pv90 = 0;pv90 <= n_dim - 1;pv90++ ) for( pv91 = 0;pv91 <= n_classes - 1;pv91++ ) pv100 += abs(sigma(pv90, pv91) - sigmaold(pv90, pv91)) / (abs(sigma(pv90, pv91)) + abs(sigmaold(pv90, pv91))); pv86 = pv98 + pv99 + pv100; } else ; } retval.resize(3); retval(0) = mu; retval(1) = rho; retval(2) = sigma; return retval; } //-- End of code --------------------------------------------------------------- Code Generation for Data Analysis • Generated program: • ~ 1sec. generation time • ≥ 600 lines • ≥ 1:30 leverage • fully documented • deeply nested loops • complex calculations • “correct-by-construction” . . . • Ground cover map: • multiple Landsat-bands • estimate pixel classes • estimate class parameters • Implementation problems: • which model? • which algorithm? • efficient C/C++ code? • correctness?

  16. Transformation Schemas • Big-step transformations • horizontal (model decompositions / transformations) • vertical (domain-specific algorithms) • Implemented as combination of techniques • meta-program (check conditions) • graph rewriting (transform model) • templates (represent code fragments)

  17. [...] // Initialization for(v44=0;v44<=n-1;v44++) for(v45=0;v45<=c-1;v45++ ) q(v44,v45)=0; for(v46=0;v46<=n-1;v46++) q(v46,z(v46))=1; [...] for(v12=0;v12<=n-1;pv12++) for(v13=0;v13<=c-1;pv13++) { pv68 = 0; for(v41=0;v41 <= c-1;pv41++ ) v68+=exp((x(v12)-mu(v41))* (x(v12)-mu(v41))/ (double)(-2)/ [...]); [...] model mog as 'Mixture of Gaussians'. [...] % Class probabilities double rho(1..c). where 1 = sum(I:=1..c, rho(I)). % Class parameters double mu(1..c). double sigma(1..c). where 0 < sigma(_). % Hidden variable nat z(1..n) ~ discrete(rho). % Data data double x(1..n). x(I) ~ gauss(mu(z(I)),sigma(z(I))). % Goal max pr(x|{rho,mu,sigma}) for {rho,mu,sigma}. Code Model Generator Assurance Should you trust a code generator? • Correctness of the generated code depends oncorrectness of the generator • Correctness of the generator is difficult to show • very large • very complicated • very dynamic So what do you do?

  18. Trust me – I’m a doctor Generator Assurance Approaches (I) Correctness-by-construction: Generator is based on logical framework; code is derived by correctness-preserving transformations Techniques: • deductive program synthesis • refinement and transformation systems Advantages: • highest degree of confidence (“proofs as programs”) Disadvantages: • expensive – systems difficult to build and maintain • opaque – correctness argument convoluted (and buried in generator)

  19. Trust me – I’m an engineer Generator Assurance Approaches (II) Generator Qualification: Generator is tested to same level of criticality asgenerated code Advantages: • currently only technique allowed by FAA • state of the practice Disadvantages: • expensive – testing effort very high • expensive – re-qualification required after changes • limited – only partial assurance • opaque – no explicit correctness argument

  20. [...] // Initialization for(v44=0;v44<=n-1;v44++) for(v45=0;v45<=c-1;v45++ ) q(v44,v45)=0; for(v46=0;v46<=n-1;v46++) q(v46,z(v46))=1; [...] for(v12=0;v12<=n-1;pv12++) for(v13=0;v13<=c-1;pv13++) { pv68 = 0; for(v41=0;v41 <= c-1;pv41++ ) v68+=exp((x(v12)-mu(v41))* (x(v12)-mu(v41))/ (double)(-2)/ [...]); [...] model mog as 'Mixture of Gaussians'. [...] % Class probabilities double rho(1..c). where 1 = sum(I:=1..c, rho(I)). % Class parameters double mu(1..c). double sigma(1..c). where 0 < sigma(_). % Hidden variable nat z(1..n) ~ discrete(rho). % Data data double x(1..n). x(I) ~ gauss(mu(z(I)),sigma(z(I))). % Goal max pr(x|{rho,mu,sigma}) for {rho,mu,sigma}. Code Model Generator Assurance Should you trust a code generator? • Correctness of the generated code depends oncorrectness of the generator • Correctness of the generator is difficult to show • very large • very complicated • very dynamic So what??? • Don’t care whether generator is buggy for other peopleas long as it works for me now!  Certifiable Code Generation

  21. Don’t trust me – I’m a computer scientist! Generator Assurance Approaches (III) Certifiable Code Generation: Generator is extended to support independent post-generation verification • certify generated programs, not the generator • product-oriented approach, not process-oriented Advantages • No need to re-certify generator • Minimizes trusted component base [...] // Initialization for(v44=0;v44<=n-1;v44++) for(v45=0;v45<=c-1;v45++ ) q(v44,v45)=0; for(v46=0;v46<=n-1;v46++) q(v46,z(v46))=1; [...] for(v12=0;v12<=n-1;pv12++) for(v13=0;v13<=c-1;pv13++) { pv68 = 0; for(v41=0;v41 <= c-1;pv41++ ) v68+=exp((x(v12)-mu(v41))* (x(v12)-mu(v41))/ (double)(-2)/ [...]); [...] model mog as 'Mixture of Gaussians'. [...] % Class probabilities double rho(1..c). where 1 = sum(I:=1..c, rho(I)). % Class parameters double mu(1..c). double sigma(1..c). where 0 < sigma(_). % Hidden variable nat z(1..n) ~ discrete(rho). % Data data double x(1..n). x(I) ~ gauss(mu(z(I)),sigma(z(I))). % Goal max pr(x|{rho,mu,sigma}) for {rho,mu,sigma}. Proofs Model Code

  22. Generator Assurance Approaches (III) Certifiable Code Generation: Generator is extended to support independent post-generation verification • generate code and annotations in parallel

  23. Generator Assurance Approaches (III) Certifiable Code Generation: Generator is extended to support independent post-generation verification • generate code and annotations in parallel Errm… Can you do this? What about errors in the extension? What about the effort?

  24. Generator Assurance Approaches (III) Certifiable Code Generation: Generator is extended to support independent post-generation verification • generate code and annotations in parallel • use standard program verification techniques • annotations (pre-/postconditions, loop invariants) give hints only • proofs are independently verifiable evidence (certificates) • keeps certification independent from code generation • focus on specific safety properties • language-specific: array bounds, partial operators, … • domain-specific: units, frames, vector norm, matrix symmetry, … • keeps annotation generation tractable • keeps certification tractable

  25. Certification Framework Safety property: formal characterization of intuitively safe programs “All automatic variables shall have been assigned a value before being used” (MISRA 9.1) Formal: • introduce “shadow variables” to record safety information • extend operational semantics by effects on shadow variables:

  26. Certification Framework Safety property: formal characterization of intuitively safe programs “All automatic variables shall have been assigned a value before being used” (MISRA 9.1) Formal: • introduce “shadow variables” to record safety information • extend operational semantics by effects on shadow variables • define semantic safety judgements on expressions/statements:

  27. Certification Framework Safety property: formal characterization of intuitively safe programs “All automatic variables shall have been assigned a value before being used” (MISRA 9.1) Formal: • introduce “shadow variables” to record safety information • extend operational semantics by effects on shadow variables • define semantic safety judgements on expressions/statements • prove safety reduction (i.e., consistency of safety property):  “safe programs don’t go wrong”

  28. Certification Framework Safety policy: proof rules designed to show that safety property holds for program • responsible for • maintenance of shadow variables • construction of safety obligations • extend Hoare-rules by safety predicate and shadow variables:

  29. Certification Framework Safety policy: proof rules designed to show that safety property holds for program • responsible for • maintenance of shadow variables • construction of safety obligations • extend Hoare-rules by safety predicate and shadow variables • define safety predicate corresponding to safety judgements:

  30. Certification Framework Safety policy: proof rules designed to show that safety property holds for program • responsible for • maintenance of shadow variables • construction of safety obligations • extend Hoare-rules by safety predicate and shadow variables • define safety predicate corresponding to safety judgements • prove soundness and completeness:

  31. Certification Architecture Approach: • generate and refine annotations in parallel with code • generate safety obligations from annotated code(i.e., VCG applies safety policy to program) • simplify, prove, & check

  32. Conclusions • Data Management = DBMS + Application Code • Most tailoring is on application side • Tailoring via source code generation from models • Generator assurance

More Related