Towards New Models and Languages for Data Mining and Integration



Presentation Transcript


  1. Towards New Models and Languages for Data Mining and Integration
     Peter Brezany, Institute of Scientific Computing, University of Vienna, Austria
     Presentation at the NeSC, Edinburgh, August 13, 2008

  2. Outline
     • Introduction
     • CRISP-DM Model and Methodology
       • What is CRISP-DM
       • Why update it
     • From CRISP-DM to CRISP-DMI
     • Impact of CRISP-DMI on the DMI Workflow Language
     • State of the Art in Language Design
     • Discussion of the 1st Language Design Ideas
     • Conclusions and Future Work
     Edinburgh, 13 Aug, 2008

  3. What is CRISP-DM? Phases of the CRoss Industry Standard Process for Data Mining

  4. CRISP-DM Phases
     • Business Understanding: understanding the project objectives from a business perspective
     • Data Understanding: collecting the data and becoming familiar with it
     • Data Preparation: selecting and cleansing the data that will be fed into the modeling tools
     • Modeling: applying modeling techniques to the data so that conclusions can be drawn
     • Evaluation: evaluating the model and its conclusions
     • Deployment: applying the conclusions to the business

  5. Why Update CRISP-DM?
     • Support for large-scale data mining
       • many distributed, heterogeneous and large datasets (primary data, derived data, background data, catalogs): from data to a “space of data”
       • data integration is of great importance
       • new actors (domain expert, data analyst, data publisher, system administrator)
       • support by new components (e.g. provenance)
       • etc.
     • Our approach: from CRISP-DM to CRISP-DMI (Cross Research & Industry Standard Process for Data Mining and Integration)

  6. CRISP-DMI Model

  7. Space of Data and Services (Author: Ibrahim Elsayed)

  8. TCM Workflow

  9. Subworkflow Targeted by Provenance

  10. Visualization of Provenance Data (Authors: Y. Han & F.A. Khan)

  11. Use case
      The fields in the data are:
      • Age
      • Sex: M or F
      • BP: blood pressure (High, Normal, or Low)
      • Cholesterol: blood cholesterol level (Normal or High)
      • Na: blood sodium concentration
      • K: blood potassium concentration
      • Drug: the drug to which this patient responded
      The business question: Can we find which drug is appropriate for any future patient?
      (from P. Caron, C. Shearer, Interactive Visual Workflow: The Key to Streamlining the Data Mining Process)
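
The business question treats Drug as the target to predict from the other six fields. A minimal Python sketch of that record layout follows. The Python types (whole years for age, floats for Na and K), the helper name `split_features_target`, and the sample values are illustrative assumptions, not data from the cited study.

```python
from dataclasses import asdict, dataclass
from typing import Literal


@dataclass(frozen=True)
class PatientRecord:
    """One row of the use-case data; field names follow the slide."""
    age: int                                 # assumed to be whole years
    sex: Literal["M", "F"]
    bp: Literal["High", "Normal", "Low"]
    cholesterol: Literal["Normal", "High"]
    na: float                                # blood sodium concentration
    k: float                                 # blood potassium concentration
    drug: str                                # drug the patient responded to


def split_features_target(record: PatientRecord):
    """Separate the six predictor fields from the target field (drug)."""
    features = asdict(record)
    target = features.pop("drug")
    return features, target


# Hypothetical values, for illustration only.
sample = PatientRecord(age=47, sex="M", bp="Low", cholesterol="High",
                       na=0.74, k=0.06, drug="drugA")
features, target = split_features_target(sample)
print(sorted(features), target)
```

Any classifier (the later slides use a decision tree and a neural network) would be trained on `features` to predict `target`.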

  12. DmiFlow: DMI Workflow Language
      • Emerging DMI applications create demand for a powerful DMI workflow language
      • Interactive GUIs can be developed on top of it
      • It should enable optimized implementation of language processors

  13. DMI Process to be Composed by DmiFlow (figure labels: Space of Source and Destination Data and Services; Composition; DMI Process)

  14. A Possible Position of DmiFlow in the Workflow Management Systems

  15. Principles for DMI Language Design
      • Programmer Responsibilities
        • Identification of parallelism
        • Specifying the communication mode between workflow components
        • Providing hints (sometimes based on domain knowledge) enabling advanced optimization
      • Language Desiderata
        • High abstraction level, not too complex (high productivity)
        • Advanced compositional features
        • Execution of data mining queries (support for the inductive database model)
        • Extensibility
        • Efficient implementation (high performance)

  16. Related Work
      • Low-level workflow notations:
        • XML-based: BPEL4WS, DSCL, WSFL, etc.
        • Other: Scufl (Taverna), MoML (Kepler), etc.
      • High-level languages (only for workflows integrating business processes):
        • Workflow Prolog
        • Valmont: includes a process model, an information model, and an organization model (which registers organizational structure and resources)
        • C & Co: a C-based language
        • F#: functional workflow specification at a script level (Microsoft development)
        • Martlet: functional workflow specification
      • Compositional languages (Strand, PCN, etc.)

  17. Workplan for the Language Design
      • Phase 1 (ongoing): proposing the semantic structure and outlining the compositional structure of programs, while leaving open some aspects of their concrete representation as strings of symbols.
      • Phase 2: finalizing the 1st version of the language definition.

  18. Basic Features of DmiFlow
      • Code modules: managing complexity
      • Activities: their types, parameters, locations
      • Virtual communication channels between activities, which can be represented by
        • Persistent explicit datasets
        • Internal datasets (implementation dependent)
        • Ports used for streaming data
      • Control structures: parallel and sequential statements, loop statements, conditional statements
      • Embedded data mining query execution

  19. Declaration of Activities and Datasets
      activity activity_name: ActivityType at (activity_location);
        ActivityType: predefined (type of parameters and semantics)
        activity_location ∊ {url, discover, default} (this is optional)
      dataset dataset_name represents (source = source_spec, hints_list);
        source_spec ∊ {url, internal, port}
        hint ∊ {org = dataset_organization, size = estimated_size, …}
        dataset_organization ∊ {set, sequence, bag, …}
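
One way to read these declarations is as plain records with a small amount of classification logic. The Python sketch below is an assumption-laden illustration, not part of DmiFlow: the class names, the `location_kind` and `source_kind` helpers, and the rule that anything other than the listed keywords counts as a URL are all hypothetical. The sample values are taken from the example slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Activity:
    """Mirrors: activity activity_name: ActivityType at (activity_location);"""
    name: str
    activity_type: str
    location: str = "default"   # the location is optional on the slide

    def location_kind(self) -> str:
        # Assumption: anything that is not a keyword is treated as a URL.
        return self.location if self.location in ("discover", "default") else "url"


@dataclass(frozen=True)
class Dataset:
    """Mirrors: dataset dataset_name represents (source = ..., hints_list);"""
    name: str
    source: str                                  # a URL, "internal", or "port"
    org: Optional[str] = None                    # hint: set, sequence, bag, ...
    size: Optional[Tuple[float, float]] = None   # hint: estimated size range

    def source_kind(self) -> str:
        # Assumption: anything that is not a keyword is treated as a URL.
        return self.source if self.source in ("internal", "port") else "url"


dt = Activity("dt", "decisionTreeActType", "/serverB/dmi/services/decisionTreeService1")
miss_vals = Activity("missVals", "MissingValuesActType", "discover")
ds1 = Dataset("ds1", "http://www.myproject/d1.dat", org="set", size=(1.5, 2.0))
cleaned = Dataset("cleaned", "internal", org="set")

print(dt.location_kind(), miss_vals.location_kind())
print(ds1.source_kind(), cleaned.source_kind())
```

Keeping hints as optional fields reflects their role on the slide: they inform optimization but are not needed to run the workflow.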

  20. Basic Control Structures
      Concurrent execution:
        cobegin { activity1(…); … activityn(…); }
      Sequential execution:
        block { activity1(…); … activityn(…); }
      Data mining query execution:
        execdmq (arguments) byactivity (activity_name) { dmq_query_specification }
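
The intended behaviour of `cobegin` and `block` can be approximated in Python as two combinators. This is only a sketch under stated assumptions: the slides do not define the execution semantics, so the choice of threads, the join-at-the-end behaviour of `cobegin`, and the convention that each construct returns a list of results are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def block(*steps):
    """Sequential execution: run the steps one after another, in order."""
    def run():
        return [step() for step in steps]
    return run


def cobegin(*branches):
    """Concurrent execution: start all branches, then wait for all of them.

    Assumption: cobegin completes only when every branch has completed.
    """
    def run():
        if not branches:
            return []
        with ThreadPoolExecutor(max_workers=len(branches)) as pool:
            futures = [pool.submit(branch) for branch in branches]
            return [future.result() for future in futures]
    return run


# Constructs nest, as in the workflow example: a block inside a cobegin.
workflow = block(
    lambda: "prepare",
    cobegin(
        block(lambda: "normalise", lambda: "train nn"),
        lambda: "train dt",
    ),
)
print(workflow())
```

Because both combinators return a callable, they compose freely, which matches the nested `cobegin { block { … } … }` form used in the example workflow.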

  21. Workflow Example – Graphical Form

  22. DmiFlow Example (1)
      module WorkflowExample {
        const replaceMethod = "average",
              splittingMethod = "gini", //hint
              url1 = "/serverA/dmi/services/integrationService1",
              url2 = "/serverB/dmi/services/decisionTreeService1",
              url3 = "/serverB/dmi/services/neuralNetworkService3";
        activity integrDS: dataIntegrationActType at (url1),
                 missVals: MissingValuesActType at (discover),
                 normalise: NormalisForNNActType at (default),
                 dt: decisionTreeActType at (url2),
                 nn: NeuralNetworkActType at (url3);
        dataset ….

  23. DmiFlow Example (2)
        dataset ds1 represents (source = "http://www.myproject/d1.dat", org = set, size = [1.5, 2.0]),
                ds2 represents (source = "http://www.myproject/d2.dat", org = set),
                intConf represents (source = "/server/dmi/config/integr.conf"),
                outIntegr represents (source = internal, org = set),
                cleaned represents (source = internal, org = set),
                normalised represents (source = internal, org = set),
                nnConf represents (source = "/server/dmi/configs/nn.conf"),
                nnMod represents (source = "/server/dmi/models/nn.pmml"),
                dtMod represents (source = "/server/dmi/models/dt.pmml");
        defworkflow { . . . }

  24. DmiFlow Example (3)
        defworkflow main () {
          integrDS (in ds1, ds2, intConf; out outIntegr);
          missVals (in outIntegr, replaceMethod; out cleaned);
          cobegin {
            block {
              normalise (in cleaned; out normalised);
              nn (in normalised, nnConf; out nnMod);
            }
            dt (in cleaned, splittingMethod; out dtMod);
          }
        }
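
The data dependencies in this workflow can be traced with a small Python simulation. It is a sketch only: the activities are hypothetical stubs that record their invocation and return a label instead of doing real integration or model training, and threads stand in for whatever execution model DmiFlow eventually adopts. Activity names follow the declarations in example (1).

```python
from concurrent.futures import ThreadPoolExecutor


def make_activity(name, trace):
    """Return a stub activity that logs its call and labels its output."""
    def activity(*inputs):
        trace.append(name)  # list.append is atomic in CPython
        return f"{name}({', '.join(inputs)})"
    return activity


def run_workflow():
    trace = []
    integr_ds = make_activity("integrDS", trace)
    miss_vals = make_activity("missVals", trace)
    normalise = make_activity("normalise", trace)
    nn = make_activity("nn", trace)
    dt = make_activity("dt", trace)

    # Sequential prefix: integration, then missing-value replacement.
    out_integr = integr_ds("ds1", "ds2", "intConf")
    cleaned = miss_vals(out_integr, "average")

    def nn_branch():
        # block { normalise; nn }: nn needs the normalised data.
        normalised = normalise(cleaned)
        return nn(normalised, "nnConf")

    def dt_branch():
        # dt reads the cleaned data directly, so it can run alongside nn_branch.
        return dt(cleaned, "gini")

    # cobegin { ... }: both branches depend only on `cleaned`.
    with ThreadPoolExecutor(max_workers=2) as pool:
        nn_future = pool.submit(nn_branch)
        dt_future = pool.submit(dt_branch)
        nn_mod, dt_mod = nn_future.result(), dt_future.result()
    return nn_mod, dt_mod, trace


nn_mod, dt_mod, trace = run_workflow()
print(nn_mod)
print(dt_mod)
```

The two branches share no intermediate dataset, which is what makes the `cobegin` safe: only `cleaned` is read by both, and neither branch writes to it.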

  25. Future Work
      • Extend language functionality
      • Investigate the DmiFlow execution model for the ADMIRE architecture
      • Define a functional specification of the DmiFlow language processor
      • Specify the concrete language syntax
