Basic Data Cleaning — Removing Unknowns
As a beginning data scientist, I’m learning that most of my time is spent preparing data for analysis. Much as writing is really about clarifying and polishing ideas, data must be thoroughly cleaned and prepared before it can tell any compelling stories.
This might not seem very interesting, but it is necessary if we want to extract any interesting stories from the data.
Here is an example dataset for us to work with:
import numpy as np
import pandas as pd

example_df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
example_df.head()
We’ve created a dataframe with 4 columns and 100 rows (indexing starts at zero, so 99 is the label of our last row) that is populated with random integers from 0 to 99 — randint’s upper bound is exclusive, so 100 itself never appears.
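As a quick sanity check (my own aside, not part of the post’s code), you can confirm that randint’s upper bound is exclusive:

```python
import numpy as np

# np.random.randint's upper bound is exclusive: randint(0, 100) draws
# integers from 0 through 99 and can never return 100.
samples = np.random.randint(0, 100, size=10_000)
print(samples.min(), samples.max())  # the max will never exceed 99
```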
NaN (not a number) values are gaps in data, left when an observation has been missed or perhaps the data isn’t available. They make it impossible to run most analysis tools on a dataset, but if you are recording data and simply throw out all observations containing NaNs, you will end up losing a lot of potentially useful data. Therefore, as data scientists, we want to observe how many NaN values there are, where they are in the data, and the most appropriate way to “clean” them.
You can verify that there are no NaN values by running:

example_df.isna().sum()

A    0
B    0
C    0
D    0
dtype: int64
The .isna() function searches for NaN values and returns a boolean True when it finds one, then the .sum() function counts each True and returns the total for each column. If we wanted to check the number of NaN values organized by row instead of by column, we would pass axis=1:

example_df.isna().sum(axis=1)

But I won’t run it here because it’s just a series of 100 zeros.
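To see a non-trivial row-wise count, here’s a tiny hand-made frame (my own illustration, not data from the post) where the per-row NaN counts are easy to verify by eye:

```python
import numpy as np
import pandas as pd

# NaNs placed by hand so the expected row-wise counts are obvious
tiny = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [np.nan, np.nan, 6.0]})
print(tiny.isna().sum(axis=1))  # rows have 1, 2, and 0 NaNs respectively
```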
We don’t actually have any NaN values yet, so let’s add some:
nan_df = example_df.applymap(lambda x: np.nan if np.random.randint(0, 100) <= 15 else x)
nan_df.head()
What we’ve done is create a new dataframe with roughly 16% NaN values by mapping a function over every element of the dataframe. The function replaces the existing cell with a NaN value if the random number it generates is <= 15 — and since randint(0, 100) draws from the 100 values 0 through 99, the condition x <= 15 covers 16 of them, so each cell has a 16% chance of being blanked.
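One caveat with the applymap approach is that the result changes on every run. Here’s a sketch of a reproducible alternative using a seeded generator and a boolean mask — the seed value and variable names are my own choices, not from the post:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the NaN pattern is reproducible
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))

# Draw one random integer per cell; blank the cell wherever it lands <= 15.
mask = rng.integers(0, 100, size=df.shape) <= 15
nan_df = df.mask(mask)  # .mask() writes NaN where the condition is True

print(nan_df.isna().sum().sum(), "NaN cells out of", df.size)
```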
We can verify that the NaNs exist, and see how they are distributed, by running nan_df.isna().sum():

A    18
B    12
C    12
D    19
dtype: int64
import statistics
statistics.mean(nan_df.isna().sum())
So if we take the mean of the NaN counts in our four columns we get 15.25. The expected count is 16 per column (a 16% chance per cell, as explained above), so the gap between 15.25 and 16 is just ordinary random sampling variation, not anything odd about how numpy interprets <= 15.
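You can check the true per-cell probability directly: randint(0, 100) is uniform over the integers 0 through 99, and exactly 16 of those satisfy the condition.

```python
import numpy as np

# Exactly 16 of the 100 possible values (0..99) satisfy x <= 15,
# so each cell is blanked with probability 0.16, not 0.15.
p = (np.arange(100) <= 15).mean()
print(p)  # prints 0.16
```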
Feel free to let me know if anything about that function is unclear, and I can add a clarification.
Now that we have some NaN datapoints, a fairly standard cleaning algorithm is as follows:
1) run df.isna().sum() to confirm the presence of NaN values (which we’ve done)
2) determine the appropriate way to handle your NaN values.
For step 2, we should consider the characteristics of this dataframe. It’s an array of integers, so it has a shape and some statistical properties, which we can see with:

print(nan_df.shape)
print(nan_df.describe())

(100, 4)
               A          B          C          D
count  82.000000  88.000000  88.000000  81.000000
mean   52.048780  52.079545  50.568182  40.172840
std    28.533672  26.836337  28.489077  26.579969
min     0.000000   2.000000   3.000000   1.000000
25%    26.000000  31.750000  27.750000  19.000000
50%    51.000000  48.000000  52.500000  35.000000
75%    77.000000  73.250000  73.500000  60.000000
max    98.000000  99.000000  99.000000  99.000000
*I use the print() function here because when you have multiple expressions in a Jupyter (or Colab) notebook code cell, only the last one produces an output when you run the cell.
You can see that our dataframe’s shape is 100 rows by 4 columns, and that the describe function lists each of the 4 columns’ mean, standard deviation, and distribution. You can also see that each count is the number of rows minus that column’s NaN count from above.
Since data science is concerned with rapidly deploying predictive models and descriptive statistics, there is an “art” to this science (this may cause some readers to see red, and I think I know what you’re going to say, but please hear me out).
If we were to delete all columns with NaN values, we would lose the entire dataset, so that’s not an option. If we were to delete all rows with NaN values, we would lose far more than 15% of our data — a row is dropped if any one of its four cells is NaN — which is an unattractive option.
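To put a rough number on that row loss, here’s a quick seeded simulation (my own sketch — the seed and variable names are arbitrary) of how many rows survive dropna() when each cell is blanked 16% of the time:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded sketch; the exact count varies by seed
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))
nan_df = df.mask(rng.integers(0, 100, size=df.shape) <= 15)

# A row survives dropna() only if all 4 of its cells are non-NaN:
# roughly 0.84 ** 4, i.e. about half the rows -- far worse than 16%.
kept = nan_df.dropna()
print(f"rows kept: {len(kept)} of {len(nan_df)}")
```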
A better option might be to mask the NaN values without changing the shape of the dataset. We can replace each NaN with the mean of its column, which should preserve the general shape of the data.
Let’s see what happens:
masked_df = nan_df.fillna(nan_df.mean())
print(masked_df.shape)
print(masked_df.describe())
(100, 4)
                A           B           C           D
count  100.000000  100.000000  100.000000  100.000000
mean    52.048780   52.079545   50.568182   40.172840
std     25.809677   25.157372   26.706711   23.893615
min      0.000000    2.000000    3.000000    1.000000
25%     36.000000   35.500000   31.000000   23.000000
50%     52.048780   52.079545   50.568182   40.172840
75%     71.250000   68.750000   70.000000   53.000000
max     98.000000   99.000000   99.000000   99.000000
The mean is preserved, as are the min and max. However, our standard deviation and quartiles have shifted toward the mean. This is to be expected: every NaN that was excluded from the earlier description of nan_df has now been filled with exactly the column mean, concentrating more of the observations’ weight at the mean.
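A tiny hand-made series (my own illustration, not the post’s data) makes the tradeoff explicit: mean-filling preserves the mean exactly, while the standard deviation shrinks.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])
filled = s.fillna(s.mean())  # both NaNs become 30.0, the series mean

print(s.mean(), filled.mean())  # mean is unchanged: 30.0 and 30.0
print(s.std(), filled.std())    # std drops from 20.0 to about 14.14
```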
This is a tradeoff, but perhaps a small one since now we get to keep all our data.
I will complicate this situation a little more in my next post.