In statistics, imputation is the process of replacing missing data with substituted values. When substituting for an entire data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Imputation replaces missing values with substitute values so that most of the information in the dataset is retained. These techniques are used because simply removing incomplete records is often not feasible: it can shrink the dataset to a large extent, which not only risks biasing the dataset but can also lead to incorrect analysis.
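As a rough illustration of item imputation (not taken from the original text), here is a minimal sketch in Python using pandas, where a single missing component of one record is replaced with the column mean; the column names and values are invented for the example:

```python
import pandas as pd

# A small dataset with one missing component (item) in the "age" column.
df = pd.DataFrame({"age": [25.0, 30.0, None, 40.0],
                   "income": [50_000, 60_000, 55_000, 65_000]})

# Item imputation: replace the missing "age" value with the column mean,
# keeping the rest of the record (the "income" value) intact.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```

Unit imputation would instead substitute an entire record, not just one field of it.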
Now that we have seen what data imputation is, let us examine why it is important.
We employ imputation since missing data can lead to the following problems:
Distorts Dataset:
Large amounts of missing data can distort a variable's distribution, changing the relative importance of different categories in the dataset.
Unable to work with most machine learning-related Python libraries:
When utilizing ML libraries (scikit-learn is the most popular), errors may occur because most estimators do not handle missing data automatically; the missing values must be imputed or dropped before fitting (see the sketch after this list).
Impacts on the Final Model:
Missing data may lead to bias in the dataset, which could affect the final model's analysis.
Desire to retain the entire dataset:
This typically occurs when we do not want to lose any (or any more) of the data in our dataset because all of it is crucial. Additionally, when the dataset is not very large, eliminating even a portion of it could have a substantial effect on the final model.
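To illustrate the scikit-learn point above, here is a minimal sketch, assuming a small toy feature matrix with NaN entries; SimpleImputer and LinearRegression are standard scikit-learn classes, while the data values themselves are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Fitting LinearRegression directly on X would raise a ValueError,
# because the estimator does not handle NaN values automatically.
imputer = SimpleImputer(strategy="mean")   # replace NaNs with column means
X_imputed = imputer.fit_transform(X)

model = LinearRegression().fit(X_imputed, y)
print(model.predict(X_imputed))
```

Mean imputation is only one strategy; SimpleImputer also supports, for example, median and most-frequent strategies, and the right choice depends on the data.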
The following classification of missing-data mechanisms is ordered by increasing confidence in the logical reason behind the missing values, or, equivalently, by decreasing randomness in the dataset.