Comparing type inference methods for mixed data arrays

Pandas has two type inference methods. Let’s compare them by inferring data types for mixed-type arrays.
ML
data preprocessing
Author

Hongsup Shin

Published

March 2, 2023

Some people work with data where the meaning of features (columns) is very clear because only common sense is required. For instance, even without a schema, in a housing price dataset, a column called “number of rooms” would be the number of rooms in a housing unit, and it’s very likely that the values of this column will be integers.

In hardware (microprocessor) verification, it’s often impossible to understand the meaning of the columns. If you are an ML practitioner without a hardware engineering background, you can nag verification engineers to explain them, but it’s very likely that you still won’t completely understand, and there can be hundreds or thousands of columns that need explanation. Even if you do have the background, depending on the product type, it’s likely that you can’t fully understand all of the columns.

Besides, sometimes you need to work with so-called “mixed data type” arrays. An example would be an array of a boolean and a float, such as [True, 0.0]. If you use pandas to read this type of data, you should know that it quite often infers the data type of such an array as object. This inference is done by the pandas.DataFrame.infer_objects method. However, many different kinds of mixed arrays are all inferred as the object dtype. This “blanket” approach might be useful for practical data handling, but it is not suitable for more accurate and granular type inference, especially if the goal is to understand the actual content of the arrays.
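To see this behavior in isolation, here is a small sketch using a two-element Series rather than a full DataFrame:

```python
import pandas as pd

# A boolean/float mixture is stored under the catch-all object dtype
s = pd.Series([True, 0.0], dtype=object)
print(s.dtype)                  # object
print(s.infer_objects().dtype)  # still object: no finer type is recovered
```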

You may not have known that pandas has another type inference method in its API: pandas.api.types.infer_dtype, which provides granular type inference and allows ignoring null values (skipna=True). This method returns the name of the inferred type as a string, such as "boolean" or "floating". For the comprehensive list of type names, see the pandas documentation.
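As a quick illustration of the string labels it returns:

```python
from pandas.api.types import infer_dtype

print(infer_dtype([True, False]))              # 'boolean'
print(infer_dtype([None, "a"], skipna=False))  # 'mixed'
print(infer_dtype([None, "a"], skipna=True))   # 'string'
```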

This notebook compares the two pandas type inference methods (pandas.DataFrame.infer_objects and pandas.api.types.infer_dtype) when they are faced with various mixed-type arrays. For the comparison, I used exhaustive pairwise combinations of None, array (list), str, bool, float, and int data to generate various mixed arrays, and then applied the two inference methods to compare the results.

import pandas as pd
import numpy as np

Testing data: generating arrays of mixed data types

Here I generated a dataframe with various mixed types: "nan" (np.nan), "none", "array" (list), "str", "bool", "float", "int". Using their exhaustive pairwise combinations (\(N_{type}=2\)), I created a 2-element array for each combination. For a fair comparison, I assigned the object dtype to all columns.

example = pd.DataFrame(
    {
        'nan': [np.nan, np.nan],
        'nan_none': [np.nan, None],
        'nan_array': [np.nan, []],
        'nan_str': [np.nan, "a"],
        'nan_bool': [np.nan, True],        
        'nan_float': [np.nan, 1.0],        
        'nan_int': [np.nan, 1],
        'none': [None, None],        
        'none_array': [None, []],
        'none_str': [None, "a"],        
        'none_bool': [None, True],
        'none_float': [None, 0.0],
        'none_int': [None, 1],
        'array': [[], []],
        'array_str': [[], "a"],
        'array_bool': [[], True],
        'array_float': [[], 1.0],
        'array_int': [[], 1],
        'str': ["a", "b"],
        'str_bool': ["a", True],
        'str_float': ["a", 1.0],
        'str_int': ["a", 1],
        'bool': [True, False],
        'bool_float': [True, 0.0],
        'bool_int': [True, 1],
        'float': [1.0, 0.0],
        'float_int': [1.0, 0],
        'int': [1, 0],
    },
    dtype=object
)
print(example.dtypes.value_counts())
object    28
dtype: int64
example.head()
nan nan_none nan_array nan_str nan_bool nan_float nan_int none none_array none_str ... str str_bool str_float str_int bool bool_float bool_int float float_int int
0 NaN NaN NaN NaN NaN NaN NaN None None None ... a a a a True True True 1.0 1.0 1
1 NaN None [] a True 1.0 1 None [] a ... b True 1.0 1 False 0.0 1 0.0 0 0

2 rows × 28 columns

Type inference with pandas.DataFrame.infer_objects

example_results = example.T
example_results['pd_infer_objects'] = example.infer_objects().dtypes
example_results
0 1 pd_infer_objects
nan NaN NaN float64
nan_none NaN None float64
nan_array NaN [] object
nan_str NaN a object
nan_bool NaN True object
nan_float NaN 1.0 float64
nan_int NaN 1 float64
none None None object
none_array None [] object
none_str None a object
none_bool None True object
none_float None 0.0 float64
none_int None 1 float64
array [] [] object
array_str [] a object
array_bool [] True object
array_float [] 1.0 object
array_int [] 1 object
str a b object
str_bool a True object
str_float a 1.0 object
str_int a 1 object
bool True False bool
bool_float True 0.0 object
bool_int True 1 object
float 1.0 0.0 float64
float_int 1.0 0 float64
int 1 0 int64

At a glance, this method infers most of these mixed arrays as object, which naturally doesn’t deliver much information about the exact mixture of types an array contains. Plus, some object arrays can receive int or float casting (e.g., [np.nan, 1] becomes float64), but others can’t (e.g., ['a', 1] stays object).
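This asymmetry can be reproduced directly; the two series below mirror the "nan_int" and "str_int" rows of the table:

```python
import numpy as np
import pandas as pd

nan_int = pd.Series([np.nan, 1], dtype=object)
str_int = pd.Series(["a", 1], dtype=object)

print(nan_int.infer_objects().dtype)  # float64: NaN and int share a numeric type
print(str_int.infer_objects().dtype)  # object: no common concrete type exists
```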

Type inference with pandas.api.types.infer_dtype

This method has two variants: skipping na values and not skipping them. Let’s get inference results from both.

example_results['pd_infer_dtype'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=False))
example_results['pd_infer_dtype_skipna'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=True))
example_results
0 1 pd_infer_objects pd_infer_dtype pd_infer_dtype_skipna
nan NaN NaN float64 floating empty
nan_none NaN None float64 mixed empty
nan_array NaN [] object mixed mixed
nan_str NaN a object mixed string
nan_bool NaN True object mixed boolean
nan_float NaN 1.0 float64 floating floating
nan_int NaN 1 float64 integer-na integer
none None None object mixed empty
none_array None [] object mixed mixed
none_str None a object mixed string
none_bool None True object mixed boolean
none_float None 0.0 float64 mixed mixed-integer-float
none_int None 1 float64 mixed-integer integer
array [] [] object mixed mixed
array_str [] a object mixed mixed
array_bool [] True object mixed mixed
array_float [] 1.0 object mixed mixed
array_int [] 1 object mixed-integer mixed-integer
str a b object string string
str_bool a True object mixed mixed
str_float a 1.0 object mixed mixed
str_int a 1 object mixed-integer mixed-integer
bool True False bool boolean boolean
bool_float True 0.0 object mixed mixed
bool_int True 1 object mixed-integer mixed-integer
float 1.0 0.0 float64 floating floating
float_int 1.0 0 float64 mixed-integer-float mixed-integer-float
int 1 0 int64 integer integer

Comparison: with vs. without na values in pandas.api.types.infer_dtype

When we don’t skip na values (skipna=False), pandas.api.types.infer_dtype often returns "mixed" for arrays that pandas.DataFrame.infer_objects infers as object. In other words, the results are no more granular than what we just saw from pandas.DataFrame.infer_objects. For instance, in the table above, "nan_array", "nan_str", and "nan_bool" are all identified as "mixed" when we don’t ignore nan.

example_results.loc[["nan_array", "nan_str", "nan_bool"], "pd_infer_dtype"]
nan_array    mixed
nan_str      mixed
nan_bool     mixed
Name: pd_infer_dtype, dtype: object

However, when we ignore na values, we get more granular results that identify the correct data types of the non-missing values.

example_results.loc[["nan_array", "nan_str", "nan_bool"], "pd_infer_dtype_skipna"]
nan_array      mixed
nan_str       string
nan_bool     boolean
Name: pd_infer_dtype_skipna, dtype: object

Comparison: pandas.DataFrame.infer_objects vs. pandas.api.types.infer_dtype(skipna=True)

Because pandas.DataFrame.infer_objects takes a blanket approach to mixed data arrays, applying it to various mixed arrays yields a lot of object columns. Let’s take a closer look at the columns inferred as object by pandas.DataFrame.infer_objects, and examine the corresponding results from pandas.api.types.infer_dtype(skipna=True).

example_results[example_results['pd_infer_objects'] == object].drop('pd_infer_dtype', axis=1).sort_values(by='pd_infer_dtype_skipna')
0 1 pd_infer_objects pd_infer_dtype_skipna
nan_bool NaN True object boolean
none_bool None True object boolean
none None None object empty
nan_array NaN [] object mixed
str_float a 1.0 object mixed
str_bool a True object mixed
array_float [] 1.0 object mixed
array_bool [] True object mixed
array_str [] a object mixed
array [] [] object mixed
none_array None [] object mixed
bool_float True 0.0 object mixed
array_int [] 1 object mixed-integer
str_int a 1 object mixed-integer
bool_int True 1 object mixed-integer
none_str None a object string
str a b object string
nan_str NaN a object string

This shows that a variety of mixed arrays are inferred as object by pandas.DataFrame.infer_objects, while pandas.api.types.infer_dtype(skipna=True) can often identify the true types. It’s true that the latter returns "mixed" for many different arrays, but most of those contain non-numerical values such as strings or arrays.

One interesting observation is that [True, 0.0] is inferred as "mixed" but [True, 1] as "mixed-integer", which suggests that pandas.api.types.infer_dtype is designed to highlight the presence of integers in the inferred type information.
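The two rows in question can be checked directly:

```python
from pandas.api.types import infer_dtype

print(infer_dtype([True, 0.0]))  # 'mixed'
print(infer_dtype([True, 1]))    # 'mixed-integer'
```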

Finally, we can compare the returned values of two methods:

for val in set(example_results['pd_infer_objects']):
    print(val, type(val))
int64 <class 'numpy.dtype[int64]'>
float64 <class 'numpy.dtype[float64]'>
bool <class 'numpy.dtype[bool_]'>
object <class 'numpy.dtype[object_]'>
for val in set(example_results['pd_infer_dtype_skipna']):
    print(val, type(val))
mixed <class 'str'>
floating <class 'str'>
boolean <class 'str'>
mixed-integer <class 'str'>
string <class 'str'>
integer <class 'str'>
mixed-integer-float <class 'str'>
empty <class 'str'>

This shows that pandas.DataFrame.infer_objects returns readily usable NumPy dtypes as inference results, whereas pandas.api.types.infer_dtype returns strings, which need to be further processed or mapped if we want to cast these mixed arrays to more granular data types.

Conclusions

Hardware verification datasets often do not have a schema, and the feature meanings cannot be understood without extremely specialized domain knowledge. The fact that these datasets often contain mixed-type arrays makes it difficult for ML practitioners to understand their content. Therefore, type inference becomes an important step in the data digestion stage.

We can use pandas for type inference, and it offers two methods: pandas.DataFrame.infer_objects and pandas.api.types.infer_dtype. The former (pandas.DataFrame.infer_objects) is designed to return practical dtypes that can be directly cast onto arrays. Thus its type inference adopts a blanket approach, where the inferred type works immediately without any further steps to handle the data.

On the other hand, pandas.api.types.infer_dtype does a more granular type inference job and can also ignore na values. However, it returns strings as results, not Python types. Therefore, we need a further step to use this information for type casting, such as mapping "boolean" -> bool.
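As a minimal sketch of that further step, one could map a few of the returned labels to castable pandas dtypes. The LABEL_TO_DTYPE mapping and cast_by_inference helper below are hypothetical names of my own, and only a handful of labels are covered; nullable pandas dtypes are used so that missing values survive the cast:

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical mapping from infer_dtype labels to castable (nullable) dtypes
LABEL_TO_DTYPE = {
    "boolean": "boolean",
    "string": "string",
    "integer": "Int64",
    "floating": "Float64",
}

def cast_by_inference(s: pd.Series) -> pd.Series:
    """Cast a series based on its inferred (na-skipping) type label."""
    label = infer_dtype(s, skipna=True)
    target = LABEL_TO_DTYPE.get(label)
    return s.astype(target) if target else s

s = pd.Series([None, True], dtype=object)
print(cast_by_inference(s).dtype)  # boolean
```

Labels not in the mapping (e.g., "mixed") simply leave the series unchanged, which mirrors the blanket object fallback of pandas.DataFrame.infer_objects.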