Data Cleaning Python Cheat Sheet

Overview

The data mining process is a sequence of the following steps:

Data Cleaning – remove noise and outliers
Data Integration – combine data from various sources
Data Selection – select relevant variables
Data Transformation – transform or consolidate data into forms appropriate for mining
Data Mining – apply methods to extract patterns
Pattern Evaluation – identify interesting patterns
Presentation – use visualization to present the knowledge

Types of Patterns

Data mining tasks can be classified into two categories, descriptive and predictive. They mine the following kinds of patterns:

Characterization and Discrimination
Association and Correlation (frequent patterns)
Classification and Regression for prediction
Cluster Analysis
Outlier Analysis

Interesting Patterns

Depending on the type of data mining task (as listed above), interesting patterns can be extracted based on some threshold. For instance, association mining uses measures such as ‘Support’ and ‘Confidence’, while classification uses measures such as ‘accuracy’, ‘precision’ and ‘recall’. Subjective interestingness measures, based on our knowledge of the data, are also used.
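
As a rough illustration of threshold-based interestingness in association mining, the sketch below computes support and confidence for one rule and compares them with minimum thresholds. The transactions, the rule and both threshold values are invented for the example; they are assumptions, not part of any real data set.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Assumed thresholds, chosen arbitrarily for this sketch.
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.6

# Rule under test: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
```

A rule is kept only if it clears both thresholds; raising either one yields fewer, stronger rules.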

1.1 Data Cleaning

Data may be incomplete, noisy and inconsistent. Data cleaning is required to deal with these issues.

Missing Values – One of the following solutions can be applied:
- Ignore the tuple
- Fill in the missing value manually
- Use a global constant to fill in the missing value (such as ‘unknown’)
- Use a measure of central tendency (e.g. mean or median) to fill in the missing value
- Use attribute mean or median for all samples belonging to the same class as the given tuple
- Use the most probable value (can be determined with regression or decision tree induction)
Noisy Data – Noise is a random error or variance in a variable. Outliers can represent noise. The goal is to smooth out the data to remove the noise. Some smoothing techniques are given below:
- Binning – Sort the data, divide it into equal-frequency bins, then replace each value with its bin mean or bin median
- Regression
- Outlier Analysis – Identify outliers by way of clustering.
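
Two of the options above can be sketched in plain Python: filling a missing value with the mean of its class, and smoothing by bin means over equal-frequency bins. The records, the function names and the sample values are hypothetical, and the binning helper assumes the number of values divides evenly by the number of bins.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 80.0), ("high", 100.0), ("high", None),
]


def fill_with_class_mean(records):
    """Replace each missing value with the mean of the known values in its class."""
    by_class = {}
    for label, value in records:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_mean = {label: mean(values) for label, values in by_class.items()}
    return [(label, class_mean[label] if value is None else value)
            for label, value in records]


def smooth_by_bin_means(values, n_bins):
    """Sort, split into equal-frequency bins, replace each value by its bin mean."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


filled = fill_with_class_mean(records)
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], n_bins=3)
```

Swapping `mean` for `statistics.median` gives the median-based variants of both steps.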

1.2 Data Integration

Combining data from multiple sources may be a necessary step in the data mining process. While integrating data from multiple sources, avoid redundancies and inconsistencies.
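
A minimal sketch of integration, assuming two hypothetical sources that describe the same customers under a shared id. Identical values are treated as redundancy and stored once; disagreeing values are reported as inconsistencies instead of being silently overwritten. All names and records here are invented for illustration.

```python
# Hypothetical customer records from two sources, keyed by customer id.
crm = {
    1: {"name": "Ann", "city": "Oslo"},
    2: {"name": "Bob", "city": "Rome"},
}
billing = {
    2: {"name": "Bob", "city": "Milan"},
    3: {"name": "Cy", "city": "Lima"},
}


def integrate(primary, secondary):
    """Merge two keyed sources and list attributes whose values disagree."""
    merged = {key: dict(row) for key, row in primary.items()}
    conflicts = []
    for key, row in secondary.items():
        target = merged.setdefault(key, {})
        for attr, value in row.items():
            if attr in target and target[attr] != value:
                # Inconsistency: keep the primary value, record the clash.
                conflicts.append((key, attr, target[attr], value))
            else:
                # New attribute, or a redundant copy of the same value.
                target[attr] = value
    return merged, conflicts


merged, conflicts = integrate(crm, billing)
```

Keeping the primary source's value on a clash is only one possible policy; the conflict list is there so a person or a later rule can resolve each case.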

1.3 Data Selection/Reduction

If the data set is huge, data reduction techniques such as dimensionality reduction, numerosity reduction, and data compression can be applied.

Dimensionality Reduction – the process of reducing the number of random variables or attributes under consideration. The following techniques can be applied:
- Wavelet transforms: Linear signal processing technique to transform a data vector to another vector.
- Principal Component Analysis: Searches for the dimensions that represent the data best.
- Attribute subset selection: Removing irrelevant or redundant attributes. Some techniques for attribute selection are stepwise forward selection, stepwise backward selection, a combination of forward and backward selection, and decision tree induction (attributes that do not appear in the tree are considered to be irrelevant).
Numerosity Reduction – replace the original data by smaller forms of data representation.
- Parametric techniques such as Regression and Log-Linear models are used to approximate the data and hence reduce it.
- Non-parametric techniques include histograms, clustering, sampling and data cube aggregation (e.g. total sales per quarter instead of monthly).
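
The data cube aggregation example mentioned above (quarterly totals instead of monthly ones) can be sketched as follows. The sales figures and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical monthly sales totals (month number -> amount), for illustration.
monthly_sales = {
    1: 100, 2: 120, 3: 90,
    4: 150, 5: 130, 6: 110,
    7: 160, 8: 170, 9: 140,
    10: 200, 11: 210, 12: 250,
}


def aggregate_by_quarter(monthly):
    """Collapse twelve monthly totals into four quarterly totals."""
    quarterly = defaultdict(int)
    for month, amount in monthly.items():
        quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        quarterly[quarter] += amount
    return dict(quarterly)


quarterly_sales = aggregate_by_quarter(monthly_sales)
```

Twelve values become four, which is the sense in which aggregation reduces the data while preserving the totals needed for analysis at the coarser level.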

1.4 Data Transformation and Discretization

Transform the data into forms appropriate for mining.

Smoothing: remove noise, using techniques such as regression, clustering and binning.
Attribute Construction: add new attributes built from the existing ones.
Aggregation: summarize the data, e.g. converting daily sales to monthly totals.
Normalization: scale attributes so that they fall within a smaller range.
Discretization: raw values are replaced by intervals or labels.
Concept Hierarchy: attributes such as street can be replaced by higher levels such as city or country.
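
As a sketch of normalization and discretization, the code below shows min-max scaling, z-score standardization, and a simple mapping from raw values to labelled intervals. These are two common normalization choices, not the only ones; the age values, interval edges and labels are assumptions made up for the example.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumes the values are not all identical
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # assumes sigma > 0
    return [(v - mu) / sigma for v in values]


def discretize(value, edges, labels):
    """Return the label of the first interval whose upper edge exceeds `value`."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # at or above the last edge


ages = [18, 25, 40, 62]  # hypothetical raw attribute values
scaled = min_max(ages)
standardized = z_score(ages)
age_groups = [
    discretize(a, edges=[30, 60], labels=["young", "middle", "senior"])
    for a in ages
]
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values, while z-scores are often preferred when the minimum and maximum are unknown or dominated by outliers.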

1.5 Data Mining

Classification Techniques:

Decision Tree
Naive Bayes
Rule-Based Classification
Ensemble Methods
- Bagging
- Boosting
- Random Forests
Neural Network
Support Vector Machines
K-Nearest-Neighbor

Clustering Techniques:

K-Means
Single Link (Min)
Complete Link (Max)
Group Average
DBSCAN

1.6 Pattern Evaluation

Classification Evaluation Criteria:

Confusion Matrix – True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate
Precision
Recall
F1 Measure (combination of precision and recall)
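
The criteria above all derive from the four cells of a binary confusion matrix. The sketch below computes them from hypothetical counts (the numbers are invented); F1 is taken as the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common evaluation measures from binary confusion-matrix counts."""
    precision = tp / (tp + fp)  # how many predicted positives are correct
    recall = tp / (tp + fn)     # true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

The function assumes every denominator is non-zero, which holds whenever the data contains both classes and the classifier predicts at least one positive.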

Association Evaluation Criteria:

Lift
Correlation Analysis
IS Measure
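
A small sketch of two of these measures for a rule A -> B, using the usual definitions: lift = s(A, B) / (s(A) * s(B)) and IS = s(A, B) / sqrt(s(A) * s(B)), where s is support. The transactions are hypothetical and chosen so that a rule with high confidence still has a lift below 1.

```python
from math import sqrt

# Hypothetical transactions, invented for illustration.
transactions = [
    {"tea", "coffee"}, {"tea", "coffee"}, {"tea", "coffee"}, {"tea"},
    {"coffee"}, {"coffee"}, {"coffee"}, {"coffee"}, {"coffee"}, {"juice"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


s_tea = support({"tea"})
s_coffee = support({"coffee"})
s_both = support({"tea", "coffee"})

# Rule under test: tea -> coffee
confidence = s_both / s_tea
lift = s_both / (s_tea * s_coffee)             # 1 means independence
is_measure = s_both / sqrt(s_tea * s_coffee)   # cosine-style measure
```

Here the confidence is high, yet the lift is below 1 because coffee is bought even more often by customers in general than by tea buyers. A lift above 1 would indicate a positive association.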

Clustering Evaluation Criteria:

Cohesion
Separation
Silhouette Coefficient
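
The silhouette coefficient combines cohesion and separation into one score per point: s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below is a plain-Python version under simplifying assumptions (at least two clusters, at least two distinct points per cluster, Euclidean distance via `math.dist`, which needs Python 3.8 or later); the points and labels are hypothetical.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    Assumes at least two clusters and at least two points in every cluster.
    """
    clusters = set(labels)
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: cohesion, mean distance from p to the other members of its cluster.
        a = mean(dist(p, q)
                 for j, (q, lab) in enumerate(zip(points, labels))
                 if lab == own and j != i)
        # b: separation, smallest mean distance from p to any other cluster.
        b = min(mean(dist(p, q) for q, lab in zip(points, labels) if lab == other)
                for other in clusters if other != own)
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad_score = silhouette_score(points, [0, 1, 0, 1])   # groups mixed up
```

Scores close to 1 indicate tight, well-separated clusters, while negative scores suggest that points sit closer to another cluster than to their own.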

References: Introduction to Data Mining, Data Mining Concepts and Techniques