
Making Good Use of Data at Hand: Open Source Tools



Presentation Transcript


  1. Making Good Use of Data at Hand: Open Source Tools
     Mark C. Cooke, Ph.D., Tax Management Associates, Inc.

  2. Overview
     • Open Data concept: data is produced for various purposes but can be reused to derive novel insights, i.e. “Business Intelligence (BI)”
     • Open source tools exist for making good use of existing data sets
     • ETL (“Extract, Transform, Load”) + analytics
     • Knime and the R language are two of the most powerful resources for leveraging data
     Open Source Tools for Data Analysis. Mark C Cooke - Tax Management Associates, Inc.
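The ETL idea on this slide can be sketched in a few lines. The talk itself uses Knime and R; the Python below is only an illustration using the standard library, and the column names, sample rows, and `accounts` table are made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; columns and values are invented for illustration.
RAW_CSV = """account_id,county,taxable_value
1001, Alachua ,"12,500"
1002,BAKER,8000
1003,alachua,
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize county names, parse values, drop rows with no value."""
    cleaned = []
    for row in rows:
        value = row["taxable_value"].replace(",", "").strip()
        if not value:
            continue  # no reported value, so the row is skipped
        county = row["county"].strip().title()
        cleaned.append((int(row["account_id"]), county, float(value)))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE accounts "
        "(account_id INTEGER PRIMARY KEY, county TEXT, taxable_value REAL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Analytics step: total taxable value per county.
totals = conn.execute(
    "SELECT county, SUM(taxable_value) FROM accounts "
    "GROUP BY county ORDER BY county"
).fetchall()
print(totals)
```

Keeping the three stages as separate functions mirrors the stepwise structure of an ETL tool: each stage can be revised or swapped without touching the others.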

  3. Open Data
     • Open Data concept: through their existing management systems, governments collect enormous quantities of data that can be leveraged in alternative and novel ways to find solutions.
     • The goal is often to engage the broader community in developing solutions that governments may not have previously conceived.
     • Open Data and Business Intelligence should be used by internal consumers as well.

  4. Open Data

  5. “Data Scientist”

  6. Doing Data the Old Way
     • Data is locked inside systems :-(
     • Software systems are designed to wrap a Graphical User Interface (GUI) around data.
     • Historically, the GUI has had to be programmed to produce each report, view, and analysis.
     • The GUI is driven by the sole purpose of the software, but the data has many purposes…

  7. Open Data – Way Forward
     • Making data talk across platforms: AS400, SQL, XML, Excel, PDFs, text files, image files (.png, .jpeg, etc.), shape files (ESRI), email archives, web scraping, social media APIs, etc.
     • Connecting data across multiple platforms
     • Using data for novel insight
     • Tools now exist for importing, cleaning, standardizing, and analyzing data using complex algorithms built into accessible packages
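As a rough illustration of connecting data across formats, the sketch below merges a CSV export, a JSON feed, and an XML file on a shared key. It is Python standard library only, not the Knime or R tooling from the talk; the three sample sources, the `account_id` key, and the field names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Three hypothetical sources describing the same accounts in different formats.
ACCOUNTS_CSV = "account_id,owner\nA1,Acme Tools\nA2,Bay Diner\n"
PERMITS_JSON = '[{"account_id": "A1", "permit": "P-17"}]'
VALUES_XML = (
    "<values>"
    "<value account_id='A1'>12500</value>"
    "<value account_id='A2'>8000</value>"
    "</values>"
)

def merge_sources():
    """Join all three sources on account_id into one record per account."""
    accounts = {
        row["account_id"]: dict(row)
        for row in csv.DictReader(io.StringIO(ACCOUNTS_CSV))
    }
    # JSON: attach any permits to the matching account.
    for permit in json.loads(PERMITS_JSON):
        accounts[permit["account_id"]].setdefault("permits", []).append(permit["permit"])
    # XML: attach the reported value to the matching account.
    for node in ET.fromstring(VALUES_XML).iter("value"):
        accounts[node.get("account_id")]["taxable_value"] = float(node.text)
    return accounts

merged = merge_sources()
print(merged["A1"])
```

Once every source is reduced to records sharing a key, the original platform no longer matters to the downstream analysis.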

  8. Open Data
     • These systems are known as “data agnostic.”
     • “Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.”
       http://searchdatamanagement.techtarget.com/definition/database-agnostic

  9. Data Science
     • What is the breadth of the tool base?
       • Reading in data from various sources
       • Transforming data: merging sources, translating data into a usable format, or adding new data elements
       • Analyzing data, from basic logical and statistical functions to higher-level machine learning tools and algorithms
     “Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.” http://en.wikipedia.org/wiki/Machine_learning

  10. Data Science
     • What is the output?
       • “Business Intelligence”: actionable information that drives business decisions through insight
       • New insights created from existing data
       • Visualizations: representations of that BI that make it consumable by a non-specialist audience
     According to Friedman (2008), the “main goal of data visualization is to communicate information clearly and effectively through graphical means.” http://en.wikipedia.org/wiki/Data_visualization

  11. Open Source Tools for Data Analysis

  12. Knime is a GUI-based, data-agnostic tool for ETL, analytics, and visualization.
     • Knime is an open source platform for the desktop, with commercial enterprise server layers that add collaboration tools and web services (web portal).
     • Knime supports other analytics languages, including the R language for statistical computing.
     www.Knime.org

  13. The advantages of Knime:
     • Rapid development environment
     • Powerful processing that handles large datasets on commodity hardware
     • Allows working with the full data set (100% samples) rather than a subsample, up to millions of rows
     • Workflows can be saved, shared, and duplicated
     • Nodes are stepwise, allowing for quick revisions
     • Nodes provide access to complex algorithms

  14. What is Knime?

  15. The Knime Workbench

  16. Knime Nodes
     • Nodes are the workers inside a workflow
     • Every node serves at least one function
     • Nodes can also be grouped into Meta-Nodes, which are collections of nodes performing common functions
     • A connected collection of nodes is called a “workflow”
     • You can develop your own nodes in Java using the node development support
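The node and workflow concepts can be mimicked in ordinary code to make them concrete. The sketch below is a conceptual analogy in Python, not Knime's actual Java node API: `workflow`, `row_filter`, and `add_column` are hypothetical helpers, and a table is simply a list of dicts.

```python
from functools import reduce

def workflow(*nodes):
    """Chain nodes so the output table of one node feeds the next."""
    def run(table):
        return reduce(lambda current, node: node(current), nodes, table)
    return run

def row_filter(predicate):
    """Node factory: keep only the rows for which predicate(row) is true."""
    def node(table):
        return [row for row in table if predicate(row)]
    return node

def add_column(name, func):
    """Node factory: append a column computed from each row."""
    def node(table):
        return [{**row, name: func(row)} for row in table]
    return node

# A "meta-node": a reusable bundle of nodes that behaves like a single node.
clean_values = workflow(
    row_filter(lambda row: row["value"] is not None),
    add_column("value_k", lambda row: row["value"] / 1000),
)

# The full workflow reuses the meta-node as one of its steps.
pipeline = workflow(clean_values, row_filter(lambda row: row["value_k"] >= 10))

table = [
    {"id": 1, "value": 12500},
    {"id": 2, "value": None},
    {"id": 3, "value": 8000},
]
result = pipeline(table)
print(result)
```

Because each node takes a table and returns a table, steps can be inserted, removed, or reordered without rewriting the rest of the chain, which is the property the slide describes as stepwise revision.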

  17. Knime Nodes
     • For example, the File Reader node inspects a file and works out its format automatically
     • It also lets the end user adjust the detected parameters
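A reader that guesses the file layout but still accepts overrides might look like the following. This is a Python standard-library sketch of the general idea, not how the Knime File Reader is implemented; `read_table` and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical semicolon-delimited export.
SAMPLE = "id;county;value\n1;Alachua;12500\n2;Baker;8000\n"

def read_table(text, delimiter=None):
    """Read delimited text; guess the delimiter unless the caller supplies one."""
    if delimiter is None:
        # csv.Sniffer inspects a sample of the text and infers the dialect.
        dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t|")
        delimiter = dialect.delimiter
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    return delimiter, rows

# Automatic detection.
detected, rows = read_table(SAMPLE)

# Manual override, as when a user adjusts the guessed settings.
_, piped = read_table("a|b\n1|2\n", delimiter="|")
print(detected, rows[0], piped)
```

Detection of this kind is heuristic, which is why an explicit override parameter is worth keeping.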

  18. Knime Nodes
     • The aptly named Column Filter node lets users remove unwanted columns from a table
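The same operation is easy to express in code. The function below is a hypothetical Python equivalent for a table held as a list of dicts; the `include` and `exclude` parameters are assumptions for this sketch, not the node's real settings.

```python
def column_filter(table, include=None, exclude=None):
    """Return a copy of the table keeping only the wanted columns.

    include: if given, keep only these columns.
    exclude: if given, drop these columns from whatever remains.
    """
    exclude = set(exclude or [])
    filtered = []
    for row in table:
        names = include if include is not None else row.keys()
        filtered.append(
            {name: row[name] for name in names if name in row and name not in exclude}
        )
    return filtered

table = [{"id": 1, "owner": "Acme Tools", "value": 12500}]
kept = column_filter(table, include=["id", "value"])
dropped = column_filter(table, exclude=["owner"])
print(kept, dropped)
```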

  19. Knime Nodes (sample)

  20. Knime Integrates with R
     • R integration is key to expanding the data analysis and visualization capabilities of Knime
     • R supports data ingestion of complex files (including ESRI)
     • R supports complex data manipulation and statistical analysis
     • R supports a wide variety of highly customizable visualizations
     • So, what is R, exactly?

  21. R Project for Statistical Computing (www.r-project.org)
     • R is an open source scripting language that can be run inside Knime or independently from the command line
     • Several graphical front ends for R exist, such as RStudio, whose maker also provides training and extension packages (www.rstudio.com)
     • Community contributions make up the bulk of R packages, which now total more than 4,700

  22. R Project for Statistical Computing (www.r-project.org)
     • The R base package (the standard installation) provides methods for reading data, ETL, analysis, and visualization
     • Community-provided packages build on this base, each according to the interests of its author
     • Packages stretch across all imaginable data uses, including advanced statistical analyses, machine learning and data mining, and advanced graphical visualizations (including sophisticated mapping)

  23. Popular R Packages
     • A (very) brief overview of popular packages:
       • plyr – for advanced data manipulation
       • maps – for mapping datasets onto georeferenced outputs
       • ggplot2 – for advanced data visualizations
       • RCurl – for reading data from webpages and repositories
       • tm – for text mining applications
       • sna – for social network analysis

  24. R Inside Knime: Basic Data Manipulation

  25. R Inside Knime: Basic Visual Using Maps

  26. Knime + R + TPP
     • Case examples for working with tangible personal property (TPP):
       • Look at the distribution of TPP accounts across a county, state, or region
       • Map entities or create a choropleth (shaded map) of the distribution of personal property values
       • Compare personal property reporting by schedule across industry sectors (e.g. machinery and equipment across manufacturing types)
       • Compare like-kind entity reporting (franchises, big-box stores) for consistency in values
       • Compare personal property accounts with other data resources (real property accounts, permits, etc.)
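One way to approach the like-kind comparison is to flag locations whose reported value sits far from the median of their peer group. The Python sketch below shows that idea under stated assumptions: the records, field names, and the 50% tolerance are all invented for illustration and are not a method or threshold from the talk.

```python
from collections import defaultdict
from statistics import median

# Hypothetical accounts: several locations of the same franchise or chain.
ACCOUNTS = [
    {"account_id": "B1", "entity": "BurgerCo", "reported_value": 100000},
    {"account_id": "B2", "entity": "BurgerCo", "reported_value": 110000},
    {"account_id": "B3", "entity": "BurgerCo", "reported_value": 95000},
    {"account_id": "B4", "entity": "BurgerCo", "reported_value": 20000},
    {"account_id": "M1", "entity": "MegaMart", "reported_value": 500000},
    {"account_id": "M2", "entity": "MegaMart", "reported_value": 520000},
]

def flag_inconsistent(accounts, tolerance=0.5):
    """Return IDs whose value differs from the peer-group median by more than tolerance."""
    values_by_entity = defaultdict(list)
    for account in accounts:
        values_by_entity[account["entity"]].append(account["reported_value"])

    flagged = []
    for account in accounts:
        peer_median = median(values_by_entity[account["entity"]])
        if peer_median == 0:
            continue  # relative deviation is undefined for a zero median
        deviation = abs(account["reported_value"] - peer_median) / peer_median
        if deviation > tolerance:
            flagged.append(account["account_id"])
    return flagged

flagged = flag_inconsistent(ACCOUNTS)
print(flagged)
```

A flag here is only a prompt for review; a low value may have a legitimate explanation such as a smaller or newer location.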

  27. Brief Demonstration
     • Data:
       • Florida
       • 67 counties
       • More than 1.24 million personal property accounts
     • Goals:
       • Group all data by industry to illustrate the taxable value and exempted value by type
       • Subset the data to include only a particular industry
       • Map the state-wide exempt value in a choropleth
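The three demonstration goals reduce to a group-by, a filter, and a per-county aggregate. The sketch below walks through them in plain Python on four invented rows; the live demonstration used Knime and R on the real Florida data, and the figures and field names here are placeholders.

```python
from collections import defaultdict

# Invented sample rows; the real data set has more than 1.24 million accounts.
ACCOUNTS = [
    {"county": "Alachua", "industry": "Manufacturing", "taxable": 120000.0, "exempt": 25000.0},
    {"county": "Alachua", "industry": "Retail", "taxable": 40000.0, "exempt": 25000.0},
    {"county": "Baker", "industry": "Manufacturing", "taxable": 60000.0, "exempt": 10000.0},
    {"county": "Baker", "industry": "Retail", "taxable": 15000.0, "exempt": 15000.0},
]

def totals_by(rows, key):
    """Sum taxable and exempt value for each distinct value of `key`."""
    totals = defaultdict(lambda: {"taxable": 0.0, "exempt": 0.0})
    for row in rows:
        totals[row[key]]["taxable"] += row["taxable"]
        totals[row[key]]["exempt"] += row["exempt"]
    return dict(totals)

# Goal 1: taxable and exempt value grouped by industry.
by_industry = totals_by(ACCOUNTS, "industry")

# Goal 2: subset the data to a single industry.
manufacturing = [row for row in ACCOUNTS if row["industry"] == "Manufacturing"]

# Goal 3: one exempt total per county, the input a choropleth needs.
exempt_by_county = {
    county: totals["exempt"] for county, totals in totals_by(ACCOUNTS, "county").items()
}
print(by_industry, len(manufacturing), exempt_by_county)
```

Drawing the map itself then only requires joining `exempt_by_county` to county boundary shapes in a mapping tool.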

  28. Questions?
     Thank you for your time and attention. I am always happy to discuss data, so please feel free to contact me using any of the details below.
     Mark C Cooke
     Mark.Cooke@tma1.com
     704.847.1234 (office)
     704.953.6349 (cell)
     www.linkedin.com/in/markccooke
