Earth Data Science


Presentation Transcript


  1. Earth Data Science Lindsay Barbieri

  2. Earth Science Data Analytics A Grad Student Perspective Background Data Science - formalized (?) courses Lessons Learned ESDA directions → Bringing it all together

  3. Earth Science Data Analytics A Grad Student Perspective Background Data Science - formalized (?) courses Lessons Learned ESDA directions → Bringing it all together

  4. Rubenstein School of Environment & Natural Resources “Our mission is to understand, nurture, and enrich the interdependence of people with healthy ecological systems.” Natural Resources = Interdisciplinary Earth Sciences?

  5. American Geophysical Union: Sections and Focus Groups connect you with other scientists in your research area. Historically, Sections are disciplinary while Focus Groups are interdisciplinary.

  7. I work with: • Water • Soil • Greenhouse Gas (at the interface of land and atmosphere) • Land Management, Land Use, Land Cover • GIS & Remote Sensing (goes along with all this monitoring!)

  8. Hydrology Surface Runoff & Collection Stations Soil Leaching & Lysimeters Modeling Field Inundation

  9. Biogeochemistry Nutrient Cycling Soil Chemistry Greenhouse Gas Emissions

  10. Atmospheric Science Meteorological Stations Greenhouse Gas Concentrations How Climate Affects Land

  11. GIS & Remote Sensing Drones NLCD & Landsat Imagery & Other “Remote” Data Collection

  12. “At the Gund Institute for Ecological Economics, we integrate natural and social sciences to understand the interactions between people and nature and to help build a sustainable future.” “In complex physical, biological, social and engineered systems, the self-organizing dynamics of interacting entities (be they molecules, cells, genes, bacteria, plants, birds, humans, nanobots, electrical substations, etc.) give rise to emergent system properties (such as consciousness, cancer, global warming, societies, etc.). Fortunately, many essential properties of such systems may be studied, modeled and understood using similar approaches, regardless of the application domain”

  13. Earth Science Data Analytics A Grad Student Perspective Background Data Science - formalized (?) courses Lessons Learned ESDA directions → Bringing it all together

  14. Data Science Masters Program Broad training in computational and theoretical techniques for: (1) describing and understanding complex natural and sociotechnical systems, enabling them to then, as possible, (2) predict, control, manage, and create such systems.

  15. Data Science Masters Program • Industry standard methods of data acquisition, storage, manipulation, and curation • Visualization techniques with a focus on building high quality web-based applications • Finding complex patterns and correlations through, for example, machine learning and data mining • Powerful ways of hypothesizing, searching for, and extracting explanatory, mechanistic stories underlying complex systems, not just how to use black box techniques • Combining the formulation of mechanistic models (e.g., toy physics models) with genetic programming

  16. Data Science: http://bagrow.com/ Transitions in climate and energy discourse between Hurricanes Katrina and Sandy Emily M. Cody · Jennie C. Stephens · James P. Bagrow · Peter Sheridan Dodds · Christopher M. Danforth Crowdsourcing Predictors of Residential Electric Energy Usage M. D. Wagy, J. C. Bongard, J. P. Bagrow, P. D. H. Hines (preprint, 2016)

  17. Data Science → Road Map
      Programming / Computers
      • Unix, ssh, git
      • Python - possibly controlling R in Python, or R and Python
      • Unit testing -> test-driven development (software engineering)
      Data Cleaning
      • Exploration / Checklist
      • ETL --> extract, transform, load (“tidy data”)
      Penalized / Regularized Regression
      • Best subsets
      • Ridge
      • Lasso - 1996
      • Elastic net - 2005 (like statisticians doing machine learning)
      • Leading to: prediction error, bias vs variance, cross validation
      Clustering
      Classifiers
      Principal components (PCA)
      Inverse covariance matrix
      Network Data
      • Representing networks
      • Community detection
      • Network reconstruction (graphical LASSO)
      Text Data
      • Computational linguistics
      • Preprocessing techniques
      • Stemming algorithms
      • TF-IDF
      • Laplace smoothing
      • Text classifier --> spam detector
      • Topic models
      • Vector embeddings. Neural networks. Semantics.
      Current Tools:
      • Spark
      • Hadoop / MapReduce
      • TensorFlow - November (open-sourced by Google for machine learning algorithms)
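The "Laplace smoothing" and "text classifier --> spam detector" items on the road map can be illustrated with a small naive Bayes classifier. This is only a sketch, not the course's actual material: the training sentences, the labels `spam` and `ham`, and the function names `train` and `predict` are all hypothetical, and whitespace tokenization stands in for real preprocessing such as stemming.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
TRAINING_DOCS = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("soil sample data for review", "ham"),
]


def train(docs):
    """Count words per label, documents per label, and the overall vocabulary."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(text, model, alpha=1.0):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: fraction of training documents carrying this label.
        score = math.log(label_counts[label] / n_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing: add alpha to every count so that a word never
            # seen with this label does not force the probability to zero.
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DOCS)
print(predict("free money prize", model))
print(predict("soil data meeting", model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and `alpha=1.0` is the classic add-one choice; other values are equally valid.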

  19. Earth Science Data Analytics A Grad Student Perspective Background Data Science - formalized (?) courses Lessons Learned ESDA directions → Bringing it all together

  20. Programming / Computers “Enough to be dangerous”

  21. Programming / Computers • Scientists spend 30% of their time programming, but 90% are self-taught. • They carefully validate their laboratory and field equipment but often don’t know how reliable their software is. • Computing errors can have disproportionate impacts on the scientific process. • Fairly easy things to implement: Best Practices

  22. Summary Table of Best Practices
      • DRY principle: Don’t Repeat Yourself - modularize rather than copy and paste
        • every piece of data must have a SINGLE authoritative representation in the system (authoritative, but not the only copy, i.e. keep BACKUPS) -- one is none, two is one
      • Optimize software only after it works correctly -- first focus on what the code is supposed to do; only if it’s really slow, then optimize it
        • use a profiler to identify bottlenecks
      • Code commenting:
        • document interfaces and reasons for the code… NOT what the code is doing
        • embed the documentation for a piece of software in that software
        • docstrings -- put the documentation inside the code itself
        • tools can then extract the documentation from your code: Javadoc, doxygen, sphinx
      • Collaborating
        • code reviews
        • git repository
        • pair programming: two people writing code together
      • Write Programs for People, Not Computers
        • naming conventions - descriptive (and non-confusing) variable names
        • make code style and formatting consistent: see style guides (for Python, PEP8)
      • Let the Computer do the Work
        • applies wherever computational tasks are repeated: if you do the same thing over and over again, make the computer repeat it
        • SAVE recent commands in a text file for re-use
        • SHELL SCRIPTS
        • tool called “Make” -- www.gnu.org/software/make
        • use a tool to automate the workflow
      • Reproducibility
        • provenance of data -- track the chain of custody of data
      • Work in small steps with frequent feedback and course corrections
        • “Agile Development”
      • Keeping track of changes: USE VERSION CONTROL (e.g. git… but could also be Dropbox)
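Several of the practices above (DRY, descriptive names, docstrings that explain the interface and the reason rather than the mechanics, and small tests) fit into one short Python sketch. The function `mean_ignoring_missing` and its tests are hypothetical examples, not code from the talk, and `None` is assumed to be the marker for a missing reading.

```python
def mean_ignoring_missing(readings):
    """Return the mean of `readings`, skipping entries that are None.

    Why it exists: several analysis steps need the same "average, but skip
    gaps" rule, so it lives in one place instead of being copied around (DRY).

    Raises ValueError if every reading is missing.
    """
    valid = [value for value in readings if value is not None]
    if not valid:
        raise ValueError("no valid readings to average")
    return sum(valid) / len(valid)


def test_skips_missing_values():
    # The None entry is ignored, so the mean is taken over 1.0 and 3.0.
    assert mean_ignoring_missing([1.0, None, 3.0]) == 2.0


def test_rejects_all_missing():
    # With nothing to average, the function should fail loudly.
    try:
        mean_ignoring_missing([None, None])
    except ValueError:
        return
    raise AssertionError("expected ValueError")


if __name__ == "__main__":
    test_skips_missing_values()
    test_rejects_all_missing()
    print("all tests passed")
```

Because the docstring lives in the code, a documentation tool such as Sphinx can extract it, and the `test_` naming lets a test runner collect the checks automatically.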

  23. Lessons Learned • Exposure • Vocabulary • How to Search • Repetition • Community

  24. Network Data

  25. Network Data

  26. Network Data

  27. Network Data

  28. Network Data

  29. Network Data: Community Detection

  30. Network Data: Connectivity

  31. Network Data: Connectivity Forest Connectivity: Priority Reaches

  32. Lessons Learned • Exposure • Vocabulary • How to Search • Repetition • Community

  33. Earth Science Data Analytics A Grad Student Perspective Background Data Science - formalized (?) courses Lessons Learned ESDA directions → Bringing it all together

  34. What is “Data Analytics” anyway?
      • Data Preparation – Preparing heterogeneous data so that they can be jointly analyzed
      • Data Reduction – Correcting, ordering and simplifying data in support of analytic objectives
      • Data Analysis – Applying techniques/methods to derive results
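The three stages above can be sketched as a tiny Python pipeline. Everything in it is assumed for illustration: the two station records, their field names (`temp_c`, `temp_f`, `date`, `day`), and the choice of "change between first and last date" as the analysis step.

```python
from statistics import mean

# Hypothetical records from two sources with different field names and units.
station_a = [
    {"date": "2016-06-01", "temp_c": 18.0},
    {"date": "2016-06-02", "temp_c": None},   # missing reading
    {"date": "2016-06-03", "temp_c": 22.0},
]
station_b = [
    {"day": "2016-06-01", "temp_f": 68.0},
    {"day": "2016-06-03", "temp_f": 75.2},
]


def prepare(a_rows, b_rows):
    """Data preparation: map both sources onto one schema (date, Celsius)."""
    rows = [{"date": r["date"], "temp_c": r["temp_c"]} for r in a_rows]
    rows += [
        {"date": r["day"], "temp_c": (r["temp_f"] - 32) * 5 / 9}
        for r in b_rows
    ]
    return rows


def reduce_rows(rows):
    """Data reduction: drop missing values, order by date, average per date."""
    by_date = {}
    for row in rows:
        if row["temp_c"] is not None:
            by_date.setdefault(row["date"], []).append(row["temp_c"])
    # ISO-formatted date strings sort chronologically as plain strings.
    return [(date, mean(values)) for date, values in sorted(by_date.items())]


def analyze(series):
    """Data analysis: here, simply the change from the first to the last date."""
    return series[-1][1] - series[0][1]


daily = reduce_rows(prepare(station_a, station_b))
print(daily)
print(analyze(daily))
```

Keeping each stage as its own function makes it possible to swap in a different reduction or analysis without touching the preparation step.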

  35. Lessons Learned ESDA → What scales? ALL SCALES! • Individual Research Project (e.g. 5 PIs on an interdisciplinary grant?) • “Big Picture” and synthesis / analytics across all Earth Science data
