import pandas as pd
import numpy as np
Some people work with data where the meaning of the features (columns) is clear from common sense alone. For instance, even without a schema, in a housing price dataset a column called “number of rooms” would be the number of rooms in a housing unit, and its values are very likely to be integers.
In hardware (microprocessor) verification, it’s often impossible to understand the meaning of the columns. If you are an ML practitioner without a hardware engineering background, you can nag verification engineers to explain them, but it’s very likely that you still wouldn’t completely understand, and there are hundreds or thousands of columns that need explanation. Even if you do have the background, depending on the product type, you likely can’t have a full understanding of all the columns.
Besides, sometimes you need to work with so-called “mixed data type” arrays. An example would be an array of a boolean and a float, such as [True, 0.0]. If you use pandas to read this kind of data, you should know that it quite often infers the data type of such an array as object; this is the inference performed by the pandas.DataFrame.infer_objects method. However, many different kinds of mixed arrays are all inferred as the object dtype. This “blanket” approach may be useful for practical data handling, but it is not suitable for more accurate, granular type inference, i.e., when the goal is to understand the actual content of the arrays.
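As a quick illustration of this blanket behavior (a minimal sketch, not part of the dataset used below), pandas falls back to the catch-all object dtype when a column mixes booleans and floats:

```python
import pandas as pd

# A mixed bool/float array: pandas cannot pick a single numeric dtype,
# so it falls back to the catch-all object dtype
s = pd.Series([True, 0.0])
print(s.dtype)  # object
```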
You may not have known that pandas has another type inference method in its API: pandas.api.types.infer_dtype, which provides granular type inference and allows you to ignore null values (skipna=True). This method returns the name of the inferred type as a string, such as "boolean" or "floating". For the comprehensive list of type names, see the pandas documentation.
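For a first taste of the granularity (the concrete values here match the results tabulated later in this notebook), compare the two skipna settings on a NaN/bool mixture:

```python
import numpy as np
from pandas.api.types import infer_dtype

# skipna=True ignores the NaN and reports the type of the remaining values;
# skipna=False counts the NaN as a member of the mixture
print(infer_dtype([np.nan, True], skipna=True))   # 'boolean'
print(infer_dtype([np.nan, True], skipna=False))  # 'mixed'
```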
This notebook compares the two type inference methods of pandas (pandas.DataFrame.infer_objects and pandas.api.types.infer_dtype) when they are faced with various cases of mixed data type arrays. For the comparison, I used exhaustive combinations of np.nan, None, array (list), str, bool, float, and int data to generate various mixed arrays, and then applied the two inference methods to compare the results.
Testing data: generating arrays of mixed data types
Here I generated a dataframe with various mixed types: "nan" (np.nan), "none" (None), "array" (list), "str", "bool", "float", and "int". Using their exhaustive combinations (\(N_{type}=2\)), I created a 2-element array for each combination. For a fair comparison, I assigned the object dtype to all columns.
example = pd.DataFrame(
    {'nan': [np.nan, np.nan],
     'nan_none': [np.nan, None],
     'nan_array': [np.nan, []],
     'nan_str': [np.nan, "a"],
     'nan_bool': [np.nan, True],
     'nan_float': [np.nan, 1.0],
     'nan_int': [np.nan, 1],
     'none': [None, None],
     'none_array': [None, []],
     'none_str': [None, "a"],
     'none_bool': [None, True],
     'none_float': [None, 0.0],
     'none_int': [None, 1],
     'array': [[], []],
     'array_str': [[], "a"],
     'array_bool': [[], True],
     'array_float': [[], 1.0],
     'array_int': [[], 1],
     'str': ["a", "b"],
     'str_bool': ["a", True],
     'str_float': ["a", 1.0],
     'str_int': ["a", 1],
     'bool': [True, False],
     'bool_float': [True, 0.0],
     'bool_int': [True, 1],
     'float': [1.0, 0.0],
     'float_int': [1.0, 0],
     'int': [1, 0],
     },
    dtype=object,
)
print(example.dtypes.value_counts())
object 28
dtype: int64
example.head()
nan | nan_none | nan_array | nan_str | nan_bool | nan_float | nan_int | none | none_array | none_str | ... | str | str_bool | str_float | str_int | bool | bool_float | bool_int | float | float_int | int | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | None | None | ... | a | a | a | a | True | True | True | 1.0 | 1.0 | 1 |
1 | NaN | None | [] | a | True | 1.0 | 1 | None | [] | a | ... | b | True | 1.0 | 1 | False | 0.0 | 1 | 0.0 | 0 | 0 |
2 rows × 28 columns
Type inference with pandas.DataFrame.infer_objects
example_results = example.T
example_results['pd_infer_objects'] = example.infer_objects().dtypes
example_results
0 | 1 | pd_infer_objects | |
---|---|---|---|
nan | NaN | NaN | float64 |
nan_none | NaN | None | float64 |
nan_array | NaN | [] | object |
nan_str | NaN | a | object |
nan_bool | NaN | True | object |
nan_float | NaN | 1.0 | float64 |
nan_int | NaN | 1 | float64 |
none | None | None | object |
none_array | None | [] | object |
none_str | None | a | object |
none_bool | None | True | object |
none_float | None | 0.0 | float64 |
none_int | None | 1 | float64 |
array | [] | [] | object |
array_str | [] | a | object |
array_bool | [] | True | object |
array_float | [] | 1.0 | object |
array_int | [] | 1 | object |
str | a | b | object |
str_bool | a | True | object |
str_float | a | 1.0 | object |
str_int | a | 1 | object |
bool | True | False | bool |
bool_float | True | 0.0 | object |
bool_int | True | 1 | object |
float | 1.0 | 0.0 | float64 |
float_int | 1.0 | 0 | float64 |
int | 1 | 0 | int64 |
At a glance, this method infers most of these mixed arrays as object, which naturally doesn’t deliver much information about the exact mixture of types in each array. Plus, some object arrays can be cast to int or float (e.g., [True, 1]), but some can’t (e.g., ['a', 1]).
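The castability difference can be checked directly (a small sketch; the variable names are mine):

```python
import pandas as pd

# A bool/int mixture can be cast to int...
numeric_mix = pd.Series([True, 1], dtype=object)
print(numeric_mix.astype(int).tolist())  # [1, 1]

# ...but a str/int mixture cannot
text_mix = pd.Series(['a', 1], dtype=object)
try:
    text_mix.astype(int)
except (ValueError, TypeError) as err:
    print("cannot cast:", err)
```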
Type inference with pandas.api.types.infer_dtype
This method has two variants: skipping na values and not skipping them. Let’s get inference results from both.
example_results['pd_infer_dtype'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=False))
example_results['pd_infer_dtype_skipna'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=True))
example_results
0 | 1 | pd_infer_objects | pd_infer_dtype | pd_infer_dtype_skipna | |
---|---|---|---|---|---|
nan | NaN | NaN | float64 | floating | empty |
nan_none | NaN | None | float64 | mixed | empty |
nan_array | NaN | [] | object | mixed | mixed |
nan_str | NaN | a | object | mixed | string |
nan_bool | NaN | True | object | mixed | boolean |
nan_float | NaN | 1.0 | float64 | floating | floating |
nan_int | NaN | 1 | float64 | integer-na | integer |
none | None | None | object | mixed | empty |
none_array | None | [] | object | mixed | mixed |
none_str | None | a | object | mixed | string |
none_bool | None | True | object | mixed | boolean |
none_float | None | 0.0 | float64 | mixed | mixed-integer-float |
none_int | None | 1 | float64 | mixed-integer | integer |
array | [] | [] | object | mixed | mixed |
array_str | [] | a | object | mixed | mixed |
array_bool | [] | True | object | mixed | mixed |
array_float | [] | 1.0 | object | mixed | mixed |
array_int | [] | 1 | object | mixed-integer | mixed-integer |
str | a | b | object | string | string |
str_bool | a | True | object | mixed | mixed |
str_float | a | 1.0 | object | mixed | mixed |
str_int | a | 1 | object | mixed-integer | mixed-integer |
bool | True | False | bool | boolean | boolean |
bool_float | True | 0.0 | object | mixed | mixed |
bool_int | True | 1 | object | mixed-integer | mixed-integer |
float | 1.0 | 0.0 | float64 | floating | floating |
float_int | 1.0 | 0 | float64 | mixed-integer-float | mixed-integer-float |
int | 1 | 0 | int64 | integer | integer |
Comparison: with vs. without na values in pandas.api.types.infer_dtype
When we don’t skip na values (skipna=False), we often get "mixed" results from pandas.api.types.infer_dtype for arrays that are inferred as object by pandas.DataFrame.infer_objects. This means the inference results are no more granular than what we just saw from pandas.DataFrame.infer_objects. For instance, in the table above, "nan_array", "nan_str", and "nan_bool" are all identified as "mixed" when we don’t ignore nan.
example_results.loc[["nan_array", "nan_str", "nan_bool"], "pd_infer_dtype"]
nan_array mixed
nan_str mixed
nan_bool mixed
Name: pd_infer_dtype, dtype: object
However, when we ignore na values, we get more granular results, which identify the correct data types of the non-missing values.
example_results.loc[["nan_array", "nan_str", "nan_bool"], "pd_infer_dtype_skipna"]
nan_array mixed
nan_str string
nan_bool boolean
Name: pd_infer_dtype_skipna, dtype: object
Comparison: pandas.DataFrame.infer_objects vs. pandas.api.types.infer_dtype(skipna=True)
Because pandas.DataFrame.infer_objects takes a blanket approach to mixed data arrays, applying it to various mixed arrays gives us a lot of object columns. Let’s take a deeper look at the columns inferred as object by pandas.DataFrame.infer_objects, and examine the inference results from pandas.api.types.infer_dtype(skipna=True).
example_results[example_results['pd_infer_objects'] == object].drop('pd_infer_dtype', axis=1).sort_values(by='pd_infer_dtype_skipna')
0 | 1 | pd_infer_objects | pd_infer_dtype_skipna | |
---|---|---|---|---|
nan_bool | NaN | True | object | boolean |
none_bool | None | True | object | boolean |
none | None | None | object | empty |
nan_array | NaN | [] | object | mixed |
str_float | a | 1.0 | object | mixed |
str_bool | a | True | object | mixed |
array_float | [] | 1.0 | object | mixed |
array_bool | [] | True | object | mixed |
array_str | [] | a | object | mixed |
array | [] | [] | object | mixed |
none_array | None | [] | object | mixed |
bool_float | True | 0.0 | object | mixed |
array_int | [] | 1 | object | mixed-integer |
str_int | a | 1 | object | mixed-integer |
bool_int | True | 1 | object | mixed-integer |
none_str | None | a | object | string |
str | a | b | object | string |
nan_str | NaN | a | object | string |
This shows that a variety of mixed arrays are inferred as object by pandas.DataFrame.infer_objects, but pandas.api.types.infer_dtype(skipna=True) can often identify the true types. It’s true that the latter still reports a lot of different arrays as "mixed", but most of those contain non-numerical values such as strings or arrays.
One interesting observation is that [True, 0.0] is inferred as "mixed" but [True, 1] as "mixed-integer", which implies that pandas.api.types.infer_dtype is designed to highlight the presence of integers in the inferred type information.
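The contrast can be reproduced in isolation (these values match the table above):

```python
from pandas.api.types import infer_dtype

# A mixture containing an int is flagged as "mixed-integer";
# a bool/float mixture collapses to plain "mixed"
print(infer_dtype([True, 1]))    # 'mixed-integer'
print(infer_dtype([True, 0.0]))  # 'mixed'
```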
Finally, we can compare the values returned by the two methods:
for val in set(example_results['pd_infer_objects']):
print(val, type(val))
int64 <class 'numpy.dtype[int64]'>
float64 <class 'numpy.dtype[float64]'>
bool <class 'numpy.dtype[bool_]'>
object <class 'numpy.dtype[object_]'>
for val in set(example_results['pd_infer_dtype_skipna']):
print(val, type(val))
mixed <class 'str'>
floating <class 'str'>
boolean <class 'str'>
mixed-integer <class 'str'>
string <class 'str'>
integer <class 'str'>
mixed-integer-float <class 'str'>
empty <class 'str'>
This shows that pandas.DataFrame.infer_objects returns readily usable numpy dtypes as inference results, whereas pandas.api.types.infer_dtype returns string values that need to be further processed or mapped if we want to cast these mixed arrays to more granular data types.
Conclusions
Hardware verification datasets often have no schema, and the feature meanings cannot be understood without extremely specialized domain knowledge. The fact that these datasets often contain mixed data type arrays makes it difficult for ML practitioners to understand their content. Therefore, type inference becomes an important step in the data digestion stage.
We can use pandas for type inference, and it offers two methods: pandas.DataFrame.infer_objects and pandas.api.types.infer_dtype. The former is designed to return practical dtypes that can immediately be applied to arrays, so its inference takes a blanket approach: the inferred type must work for handling the data without any further steps.
On the other hand, pandas.api.types.infer_dtype does a more granular type inference job and can also ignore na values. However, it returns string values as results, not python types, so we need a further step, such as mapping "boolean" -> bool, to use this information for type casting.
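Such a mapping step could be sketched as follows. Note that the mapping table and the helper function are my own illustrative choices, not a pandas API; which target dtypes are appropriate depends on your data.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical mapping from infer_dtype's string labels to castable dtypes;
# the label selection and target dtypes are assumptions for this sketch
LABEL_TO_DTYPE = {
    "boolean": "bool",
    "integer": "int64",
    "floating": "float64",
    "mixed-integer-float": "float64",
}

def cast_by_inference(s: pd.Series) -> pd.Series:
    """Cast a Series to the dtype suggested by infer_dtype, if we know one."""
    label = infer_dtype(s, skipna=True)
    target = LABEL_TO_DTYPE.get(label)
    return s.astype(target) if target else s

# A float/int mixture is labeled "mixed-integer-float" and cast to float64;
# an unmapped label (e.g. "mixed") leaves the Series unchanged
casted = cast_by_inference(pd.Series([1.0, 0], dtype=object))
print(casted.dtype)  # float64
```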