The infographic also guides you on how to create a sample dataframe from GitHub to work on. Check out this infographic and master Pandas' one-liners for data cleaning. https://www.usdsi.org/data-science-insights/top-5-automation-tools-for-data-cleaning
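A generator in the spirit of the `generate_df.py` script mentioned later can be sketched as follows. This is an illustrative assumption, not the actual GitHub script: the column names mirror the sample output, and the values are hand-picked to reproduce the same quality problems (mixed price formats, inconsistent customer IDs, a malformed email, an outlier quantity, plus a blank row and a duplicate).

```python
import numpy as np
import pandas as pd

def generate_df() -> pd.DataFrame:
    """Hypothetical stand-in for generate_df.py: a small customer-orders
    frame seeded with deliberate data-quality issues."""
    df = pd.DataFrame({
        'order_date': ['2024-04-12', '2024-12-14', '2024-09-27',
                       '2024-04-16', '2024-03-12'],
        'customer_id': ['7376', 'Customer 3393', 'C8240', 'C7190', 'CUS-7493'],
        'email': ['user208@hotmail.com', 'user349company.com',  # missing '@'
                  'user958@company.com', 'user951@yahoo.com', 'user519@yahoo.com'],
        'product_id': ['PROD-642', 'PROD-626', 'PROD-645', 'PROD-942', 'PROD-115'],
        'quantity': [5.0, 9.0, 874.0, 1.0, 7.0],               # 874 is an outlier
        'price': ['27.347234', '99.343948', '$77.172318',       # mixed formats
                  '-147.403597', '159'],
        'shipping_status': ['DELIVERED', 'Shipped', 'In Transit',
                            'Shipped', 'delivered'],
    })
    # Append a completely blank row and an exact duplicate of the first row,
    # so dropna(how='all') and drop_duplicates() have something to remove.
    blank = pd.DataFrame([{c: np.nan for c in df.columns}])
    return pd.concat([df, blank, df.iloc[[0]]], ignore_index=True)

df = generate_df()
print(df.head())
```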
DATA CLEANING MADE EASY
PANDAS ONE-LINERS TO CLEAN DATA FASTER

Real-world data is often messy and unfit for analysis. To extract actionable insights, it must first be cleaned and formatted properly.

CREATE A SAMPLE DATAFRAME
Run generate_df.py to generate a sample dataframe of customer orders.

OUTPUT:
   order_date    customer_id                email product_id  quantity  \
0  2024-04-12           7376  user208@hotmail.com   PROD-642       5.0
1  2024-12-14  Customer 3393   user349company.com   PROD-626       9.0
2  2024-09-27          C8240  user958@company.com   PROD-645     874.0
3  2024-04-16          C7190    user951@yahoo.com   PROD-942       1.0
4  2024-03-12       CUS-7493    user519@yahoo.com   PROD-115       7.0

        price shipping_status
0   27.347234       DELIVERED
1   99.343948         Shipped
2   77.172318      In Transit
3  147.403597         Shipped
4         159       delivered

This dataframe contains:
- Prices in various formats, e.g. with negative (-) or dollar ($) signs
- Inconsistent customer IDs
- Incorrect email addresses and formatting issues
- Inconsistent shipping status values
- Missing product IDs
- Duplicate or completely blank rows
- Outliers in the quantity column

PANDAS ONE-LINERS TO CLEAN THE DATA

1. DROP ROWS WITH ALL MISSING VALUES
df_clean = df.dropna(how='all')

2. STANDARDIZE TEXT CASE AND REMOVE WHITESPACE
df_clean.loc[:, 'shipping_status'] = df_clean['shipping_status'].str.lower().str.strip()

3. EXTRACT PATTERNS WITH REGULAR EXPRESSIONS
df_clean.loc[:, 'customer_id'] = 'CUS-' + df_clean['customer_id'].str.extract(r'(\d+)', expand=False).fillna('0000')

4. CONVERT MIXED DATA TYPES TO NUMERIC
df_clean.loc[:, 'price'] = pd.to_numeric(df_clean['price'].astype(str).str.replace('$', '', regex=False), errors='coerce')

5. FIX COMMON EMAIL FORMATTING ISSUES
df_clean.loc[:, 'email'] = df_clean['email'].str.strip().str.replace(r'^([^@]+)([^@]*\.com)$', r'\1@\2', regex=True)

6. HANDLE OUTLIERS USING THE IQR METHOD
df_clean.loc[:, 'quantity'] = df_clean['quantity'].clip(upper=df_clean['quantity'].quantile(0.75) + 1.5 * (df_clean['quantity'].quantile(0.75) - df_clean['quantity'].quantile(0.25)))

7. 
STANDARDIZE CATEGORIES WITH MAPPING
df_clean.loc[:, 'shipping_status'] = df_clean['shipping_status'].replace({'in transit': 'in_transit', 'intransit': 'in_transit', 'shipped': 'shipped', 'delivered': 'delivered', 'pending': 'pending'})

8. REMOVE DUPLICATES BASED ON SPECIFIC COLUMNS
df_clean = df_clean.drop_duplicates(subset=['customer_id', 'order_date', 'product_id'])

9. CREATE VALIDATION FLAGS FOR DATA QUALITY
df_clean['is_valid_email'] = df_clean['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)

10. FILL MISSING VALUES WITH FORWARD FILL
df_clean = df_clean.sort_values('order_date').groupby('customer_id').ffill()

Use these one-liners to address common data quality issues faster and more effectively. Learn how to properly understand your data and prepare it for analysis with top data science certifications. Get certified today.

REGISTER NOW
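The one-liners above can be chained end to end. Below is a minimal, self-contained sketch on a tiny hand-built frame (the data is illustrative, not the sample from generate_df.py). Two steps are skipped on purpose: the email repair in step 5 is a heuristic that only applies to @-less .com addresses, and `groupby(...).ffill()` in step 10 excludes the grouping column from its output in recent pandas, so it would drop customer_id here.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: a messy row, a duplicate of it, and a blank row.
df = pd.DataFrame({
    'order_date': ['2024-04-12', '2024-12-14', '2024-12-14', None],
    'customer_id': ['7376', 'Customer 3393', 'Customer 3393', None],
    'email': [' user208@hotmail.com ', 'user349company.com',
              'user349company.com', None],
    'product_id': ['PROD-642', 'PROD-626', 'PROD-626', None],
    'quantity': [5.0, 9.0, 9.0, np.nan],
    'price': ['$27.35', '99.34', '99.34', None],
    'shipping_status': ['DELIVERED', 'In Transit', 'In Transit', None],
})

# 1. Drop rows where every column is missing.
df_clean = df.dropna(how='all')
# 2. Normalize text case and whitespace.
df_clean.loc[:, 'shipping_status'] = df_clean['shipping_status'].str.lower().str.strip()
# 3. Normalize customer IDs to CUS-<digits>; expand=False keeps a Series.
df_clean.loc[:, 'customer_id'] = 'CUS-' + df_clean['customer_id'].str.extract(r'(\d+)', expand=False).fillna('0000')
# 4. Strip '$' and coerce prices to numbers.
df_clean.loc[:, 'price'] = pd.to_numeric(df_clean['price'].astype(str).str.replace('$', '', regex=False), errors='coerce')
# 6. Cap quantity outliers at Q3 + 1.5 * IQR.
q1, q3 = df_clean['quantity'].quantile([0.25, 0.75])
df_clean.loc[:, 'quantity'] = df_clean['quantity'].clip(upper=q3 + 1.5 * (q3 - q1))
# 7. Map category spellings onto one vocabulary.
df_clean.loc[:, 'shipping_status'] = df_clean['shipping_status'].replace({'in transit': 'in_transit', 'intransit': 'in_transit'})
# 8. Drop duplicates on the identifying columns.
df_clean = df_clean.drop_duplicates(subset=['customer_id', 'order_date', 'product_id'])
# 9. Flag rows whose (stripped) email matches a basic pattern.
df_clean['is_valid_email'] = df_clean['email'].str.strip().str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)

print(df_clean)
```

The blank row falls to step 1, the duplicate to step 8, so two rows survive: one with a valid email flag and one (the @-less address) flagged invalid.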