In statistics, imputation is the process of replacing missing data with substituted values. When substituting for an entire data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Imputation replaces missing values with substitute values so that most of the information in the dataset is retained. These techniques are used because simply removing incomplete records is often not feasible: it can shrink the dataset to a large extent, which not only risks biasing the dataset but can also lead to incorrect analysis.
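As a rough illustration of item imputation (not taken from the original text), here is a minimal sketch in Python using pandas, where a single missing component of one record is replaced with the column mean; the column names and values are invented for the example:

```python
import pandas as pd

# A small dataset with one missing component (item) in the "age" column.
df = pd.DataFrame({"age": [25.0, 30.0, None, 40.0],
                   "income": [50_000, 60_000, 55_000, 65_000]})

# Item imputation: replace the missing "age" value with the column mean,
# keeping the rest of the record (the "income" value) intact.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```

Unit imputation would instead substitute an entire record, not just one field of it.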
Now that we have seen what data imputation is, let us examine why it is important.
We employ imputation since missing data can lead to the following problems:
Distorts Dataset:
Large amounts of missing data can distort a variable's distribution, changing the relative importance of different categories in the dataset.
Unable to work with most machine learning-related Python libraries:
When utilizing ML libraries (scikit-learn is the most popular), errors may occur because most estimators do not handle missing data automatically; the missing values must be imputed or dropped before fitting (see the sketch after this list).
Impacts on the Final Model:
Missing data may lead to bias in the dataset, which could affect the final model's analysis.
Desire to retain the entire dataset:
This typically occurs when we do not want to lose any (or any more) of the data in our dataset because all of it is crucial. Additionally, when the dataset is not very large, eliminating even a portion of it could have a substantial effect on the final model.
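To illustrate the scikit-learn point above, here is a minimal sketch, assuming a small toy feature matrix with NaN entries; SimpleImputer and LinearRegression are standard scikit-learn classes, while the data values themselves are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Fitting LinearRegression directly on X would raise a ValueError,
# because the estimator does not handle NaN values automatically.
imputer = SimpleImputer(strategy="mean")   # replace NaNs with column means
X_imputed = imputer.fit_transform(X)

model = LinearRegression().fit(X_imputed, y)
print(model.predict(X_imputed))
```

Mean imputation is only one strategy; SimpleImputer also supports, for example, median and most-frequent strategies, and the right choice depends on the data.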
The following classification of missing-data mechanisms is ordered by increasing confidence in the logical reason behind the missing values, or, equivalently, by decreasing randomness in the dataset.