1 / 31

Preparing your Data using Python

Learn the importance of preparing your data for analysis, different data formats, and how Python can help you import, prepare, and save your data.

kendram
Télécharger la présentation

Preparing your Data using Python

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Preparing your Data using Python Samuel G. Mori, CISA Managing Partner, Analytics & Advisory Services Spyrion LLC April 12, 2018

  2. Background Samuel G. Mori, CISA, Six Sigma Green Belt • Managing Partner, Analytics and Advisory Services • Software Quality Assurance, Internal/External Audit, Business Intelligence and Reporting, Advisory Services (GRC and Analytics) • Subject matter expertise within commercial, manufacturing, healthcare, biomedical and entertainment sectors • B.S. Cognitive Science – Human Computer Interaction (UC San Diego) • M.S. Accountancy – Accounting Information Systems (San Diego State) • M.S. Data Science – Analytics & Modeling (Northwestern)

  3. Agenda • Learning Objectives • Why should I prepare my data? • What types of data might I encounter? • How can Python help me?

  4. Learning Objectives • Understand the importance of preparing your data for analysis • Understand different types of data formats you may encounter • Understand what Python is and why you should use it • Understand strategies and techniques for importing, preparing, and saving your data using Python

  5. Why should I prepare my data? • Garbage in, garbage out • Reduce errors • Remove duplicate records • Fix missing values • Correct range values • Fix formatting (i.e. date, text, number)

  6. Experience Check • How many people have experience with Python? • What types of data formats do you use in your organizations? • CSV, Excel, PDF, JSON, XML, SQL databases, etc • What types of tools do you use? • Excel, ACL, IDEA, SQL Server, Python, R, SAS, Cognos, etc

  7. What types of data formats might I encounter? • Comma Separated Value (CSV) • Excel • JavaScript Object Notation (JSON) • Structured Query Language (SQL) • And more! Python can help with these!

  8. CSV Example • SFO Airport Survey Results

  9. Excel Example • SFO Airport Survey Results

  10. JSON Examples • Trip Advisor JSON file • Yelp JSON file

  11. SQL Example • Sample Customer Data

  12. What is Python? Definition • Object-oriented, high-level programming language • Used as a scripting or glue language to connect existing components together • Simple, easy to learn syntax emphasizes readability • Supports modules and packages • Python interpreter and the extensive standard library are FREE!

  13. What is Python? (cont.) Key Python Package: • Pandas • Open source library that allows you to work with CSV, Excel, JSON, and SQL database files, pull them into tables (called dataframes), and perform various data analysis techniques.

  14. Coding Basics Some basic python syntax to keep in mind: • Declaring a variable (always to the left of equal sign) • File names (can use “ “ or ‘ ‘) • dataframe = pd.read_excel(‘file_name.xlsx', ‘sheet_name’) Or • file_name = ‘file_name.xlsx’ • sheet_name = ‘sheet_name’ • dataframe = pandas.read_excel(file_name, sheet_name)

  15. Coding Basics (cont.) Some basic python syntax to keep in mind: • Using library packages • Import pandas as pd#calling pandas library and creating reference ‘pd’ • dataframe = pd.read_excel(‘file_name.xlsx', ‘sheet_name’) Or • dataframe = pandas.read_excel(‘file_name.xlsx', ‘sheet_name’)

  16. Case Study SFO Airport Customer Survey Data – Excel & CSV files

  17. Importing the Data How do I import an Excel file?

  18. Data Characteristics What columns do we have?

  19. Data Characteristics What if I just want a subset of these columns?

  20. Data Characteristics What columns do I have and what are their data types?

  21. Data Characteristics How many columns and records do I have? Can I do a count of different values within a column?

  22. Modifying Data Values Lets look at the data dictionary How do I replace values to make them meaningful?

  23. Saving to Excel How do I save this new file? What does my file look like?

  24. Importing the Data How do I import a CSV file? What is NaN?

  25. Fixing Error Values How do I fix NaN values?

  26. Adding Custom Columns What if I want to add the Year in a column?

  27. Identifying Value Ranges How do I look at the data value ranges for multiple columns?

  28. Saving to CSV • How do I save this new file? • What does my file look like?

  29. Appendix

  30. Additional Information Python Development Environments • Enthought Canopy • https://www.enthought.com/product/canopy/ • Anaconda/Spyder • https://www.anaconda.com/download/ Python Libraries • Pandas • http://pandas.pydata.org/

  31. Questions?

More Related