Overview
The data mining process is a sequence of the following steps:
- Data Cleaning – removing noise and outliers
- Data Integration – combine data from various sources
- Data Selection – select relevant variables
- Data Transformation – transform or consolidate data into forms appropriate for mining
- Data Mining – apply methods to extract patterns
- Pattern Evaluation – identify interesting patterns
- Presentation – use visualization to present the knowledge
Types of Patterns
Data Mining tasks can be classified into two categories: Descriptive and Predictive
- Characterization and Discrimination
- Association and Correlation (frequent patterns)
- Classification and Regression for prediction
- Cluster Analysis
- Outlier Analysis
Interesting Patterns
Depending on the type of data mining task (as listed above), interesting patterns can be extracted based on some threshold. For instance, association mining uses measures such as support and confidence, while classification uses measures such as accuracy, precision, and recall. Subjective interestingness measures, based on our knowledge of the data, are also used.
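As a concrete illustration, support and confidence can be computed directly from a list of transactions. This is a minimal sketch; the basket data below is invented for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread"}]
print(support(baskets, {"milk"}))                      # 0.75
print(confidence(baskets, {"milk"}, {"bread"}))        # roughly 0.667
```

A rule would be reported as interesting only if both values clear the chosen thresholds.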
1.1 Data Cleaning
Data may be incomplete, noisy and inconsistent. Data cleaning is required to deal with these issues.
- Missing Values – One of the following solutions can be applied
- Ignore the tuple
- Fill in the missing value manually
- Use a global constant to fill in the missing value (such as ‘unknown’)
- Use a measure of central tendency (e.g. mean or median) to fill
- Use attribute mean or median for all samples belonging to the same class as the given tuple
- Use the most probable value (can be determined with regression or decision tree induction)
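Several of these strategies are one-liners in pandas. A sketch, with an invented two-column data set:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class": ["a", "a", "b", "b"],
    "income": [50.0, np.nan, 30.0, np.nan],
})

# Ignore the tuple: drop rows that contain missing values
dropped = df.dropna()

# Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# Fill with the overall mean (a measure of central tendency)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```

The per-class fill is usually preferable when the classes have very different distributions.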
- Noisy Data – Noise is a random error or variance in a variable. Outliers can represent noise. The goal is to smooth out the data to remove the noise. Some smoothing techniques are given below.
- Binning – Sort the data and partition it into (e.g. equal-frequency) bins, then smooth each bin by its mean, median, or boundaries
- Regression – Fit a function (e.g. linear regression) to the data and use the fitted values
- Outlier Analysis – Identify outliers by way of clustering.
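Equal-frequency binning followed by smoothing with bin means can be sketched with pandas; the price values below are invented for illustration.

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equal-depth) binning into 3 bins of 4 values each
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
```

After smoothing, every value in the first bin becomes 9.0 (the mean of 4, 8, 9, 15), flattening small fluctuations while preserving the coarse trend.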
1.2 Data Integration
Combining data from multiple sources may be a necessary step in the data mining process. While integrating data from multiple sources, avoid redundancies and inconsistencies.
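With pandas, integrating two sources while avoiding redundant rows might look like the sketch below; the table and column names are invented.

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 2], "city": ["NY", "LA", "LA"]})
orders = pd.DataFrame({"cust_id": [1, 2], "amount": [100, 250]})

# Remove redundant (duplicate) rows before integrating
customers = customers.drop_duplicates()

# Combine the two sources on a shared key
merged = customers.merge(orders, on="cust_id", how="inner")
```

Checking key uniqueness before the merge prevents an accidental many-to-many join from silently multiplying rows.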
1.3 Data Selection/Reduction
If the data set is huge, data reduction techniques such as dimensionality reduction, numerosity reduction, and data compression can be applied.
- Dimensionality Reduction – process of reducing the number of random variables or attributes under consideration. The following techniques can be applied
- Wavelet transforms: Linear signal processing technique to transform a data vector to another vector.
- Principal Component Analysis: Searches for a smaller set of orthogonal dimensions (principal components) that represent the data best.
- Attribute subset selection: Removing irrelevant or redundant attributes. Some techniques for attribute selection are – stepwise forward selection, stepwise backward selection, a combination of forward and backward selection, and decision tree induction (attributes that do not appear in the tree are considered irrelevant).
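PCA can be sketched in plain NumPy via the SVD of the centered data matrix; the random data here is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 attributes

Xc = X - X.mean(axis=0)                # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal components, ordered by variance captured;
# project onto the top 2 to reduce 5 dimensions to 2
X_reduced = Xc @ Vt[:2].T
```

The singular values in `S` are sorted in decreasing order, so the projection keeps the directions of greatest variance.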
- Numerosity Reduction – replace the original data by smaller forms of data representation.
- Parametric techniques such as Regression and Log-Linear models are used to approximate the data and hence reduce it.
- Non-parametric techniques such as histograms, clustering, sampling, and data cube aggregation (e.g. total sales per quarter instead of monthly) can also be used.
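The data cube aggregation example (quarterly totals instead of monthly figures) is a one-liner with pandas; the sales numbers are invented.

```python
import pandas as pd

months = pd.date_range("2023-01-01", periods=12, freq="MS")
monthly_sales = pd.Series(range(1, 13), index=months)

# Replace 12 monthly values with 4 quarterly totals
quarterly_sales = monthly_sales.resample("QS").sum()
```

The reduced series carries a quarter of the rows while preserving the totals needed for quarter-level analysis.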
1.4 Data Transformation and Discretization
Transform the data into forms appropriate for mining.
- Smoothing: To remove noise. Techniques such as Regression, Clustering and Binning can be applied.
- Attribute Construction: Add new attributes
- Aggregation: E.g. converting daily sales to monthly totals
- Normalization: Scale attributes so that they fall within a smaller range
- Discretization: Raw values are replaced by intervals or labels.
- Concept Hierarchy: Attributes such as street can be replaced by higher-level concepts such as city or country.
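Normalization and discretization can be sketched in NumPy; the values and the interval boundaries below are invented for the example.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Min-max normalization: scale into [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit variance
zscore = (x - x.mean()) / x.std()

# Discretization: replace raw values with interval labels
labels = np.digitize(x, bins=[15.0, 35.0])   # 0: low, 1: mid, 2: high
```

Min-max is sensitive to outliers (a single extreme value squashes the rest of the range), so z-score normalization is often the safer default.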
1.5 Data Mining
Classification Techniques:
- Decision Tree
- Naive Bayes
- Rule-Based Classification
- Bagging
- Boosting
- Random Forests
- Neural Network
- Support Vector Machines
- K-Nearest-Neighbor
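Of these, k-nearest-neighbor is simple enough to sketch in a few lines of NumPy; the training points and labels below are invented.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify point x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest_labels = y_train[np.argsort(dists)[:k]]  # labels of k closest
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]                 # majority vote

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.0, 0.5])))  # 0
```

Because distances drive the vote, attributes should be normalized first (see 1.4), or one large-scale attribute will dominate.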
Clustering Techniques:
- K-Means
- Single Link (Min)
- Complete Link (Max)
- Group Average
- DBSCAN
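A bare-bones K-Means (Lloyd's algorithm) illustrates the clustering step; initialization and empty-cluster handling are deliberately simplified, and the data is invented.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Lloyd's algorithm: assign points to nearest center, recompute centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Distance from every point to every center: shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, k=2)
```

On these two well-separated pairs, the algorithm converges in a couple of iterations; production code would also restart from several seeds and guard against empty clusters.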
1.6 Pattern Evaluation
Classification Evaluation Criteria:
- Confusion Matrix – True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate
- Precision
- Recall
- F1 Measure (combination of precision and recall)
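All of these measures derive from the four confusion-matrix counts; a minimal sketch (the counts in the usage line are made up):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```

Note that with imbalanced classes (here 90 negatives to 10 positives) accuracy looks high even when precision and recall are mediocre, which is why F1 is preferred for skewed data.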
Association Evaluation Criteria:
- Lift
- Correlation Analysis
- IS Measure
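Lift, for example, compares how often the items co-occur with what independence would predict:

```python
def lift(support_ab, support_a, support_b):
    """Lift of rule A -> B: P(A and B) / (P(A) * P(B)).

    Lift > 1: A and B occur together more often than if independent.
    Lift = 1: independent.  Lift < 1: negatively correlated.
    """
    return support_ab / (support_a * support_b)
```

For instance, `lift(0.4, 0.5, 0.5)` gives 1.6, indicating positive correlation between A and B.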
Clustering Evaluation Criteria:
- Cohesion
- Separation
- Silhouette Coefficient
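For a single point, the silhouette coefficient combines cohesion and separation into one number:

```python
def silhouette(a, b):
    """Silhouette coefficient for one point.

    a: mean distance to the other points in its own cluster (cohesion).
    b: mean distance to the points in the nearest other cluster (separation).
    Ranges from -1 (badly placed) to 1 (well clustered).
    """
    return (b - a) / max(a, b)
```

The score for a whole clustering is the average over all points; values near 1 indicate tight, well-separated clusters.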
References: Introduction to Data Mining; Data Mining: Concepts and Techniques