Comparing type inference methods for mixed data arrays

Pandas has two type inference methods. Let’s compare them by inferring data types for mixed-type arrays.
ML
data preprocessing
Author

Hongsup Shin

Published

March 2, 2023

Some people work with data where the meaning of features (columns) is very clear because only common sense is required. For instance, even without a schema, in a housing price dataset, a column called “number of rooms” would be the number of rooms in a housing unit, and it’s very likely that the values of this column will be integers.

In hardware (microprocessor) verification, it’s often impossible to understand the meaning of the columns. If you are an ML practitioner without a hardware engineering background, you can nag verification engineers to explain them, but it’s very likely that you still won’t completely understand, and there can be hundreds or thousands of columns that need explanation. Even if you do have the background, depending on the product type, it’s likely that you can’t fully understand all of the columns.

Besides, sometimes you need to work with so-called “mixed data type” arrays. An example would be an array of a boolean and a float, such as [True, 0.0]. If you use pandas to read this type of data, you should know that it quite often infers the data type of such an array as object. This inference is done by the pandas.DataFrame.infer_objects method. However, many different kinds of mixed arrays are all inferred as the object dtype. This “blanket” approach might be useful for practical data handling, but it is not suitable for more accurate and granular type inference, especially if the goal is to understand the actual content of the arrays.
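To see this behavior in isolation, here is a small sketch using a two-element Series rather than a full DataFrame:

```python
import pandas as pd

# A boolean/float mixture is stored under the catch-all object dtype
s = pd.Series([True, 0.0], dtype=object)
print(s.dtype)                  # object
print(s.infer_objects().dtype)  # still object: no finer type is recovered
```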

You may not have known that pandas has another type inference method in its API: pandas.api.types.infer_dtype, which provides granular type inference and allows ignoring null values (skipna=True). This method returns the name of the inferred type as a string, such as "boolean" or "floating". For the comprehensive list of type names, see the pandas documentation.
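As a quick illustration of the string labels it returns:

```python
from pandas.api.types import infer_dtype

print(infer_dtype([True, False]))              # 'boolean'
print(infer_dtype([None, "a"], skipna=False))  # 'mixed'
print(infer_dtype([None, "a"], skipna=True))   # 'string'
```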

This notebook compares the two pandas type inference methods (pandas.DataFrame.infer_objects and pandas.api.types.infer_dtype) when they are faced with various mixed-type arrays. For the comparison, I used exhaustive pairwise combinations of None, array (list), str, bool, float, and int data to generate various mixed arrays, and then applied the two inference methods to compare the results.

import pandas as pd
import numpy as np

Testing data: generating arrays of mixed data types

Here I generated a dataframe with various mixed types: "nan" (np.nan), "none", "array" (list), "str", "bool", "float", "int". Using their exhaustive pairwise combinations (\(N_{type}=2\)), I created a 2-element array for each combination. For a fair comparison, I assigned the object dtype to all columns.

example = pd.DataFrame(
    {
        'nan': [np.nan, np.nan],
        'nan_none': [np.nan, None],
        'nan_array': [np.nan, []],
        'nan_str': [np.nan, "a"],
        'nan_bool': [np.nan, True],        
        'nan_float': [np.nan, 1.0],        
        'nan_int': [np.nan, 1],
        'none': [None, None],        
        'none_array': [None, []],
        'none_str': [None, "a"],        
        'none_bool': [None, True],
        'none_float': [None, 0.0],
        'none_int': [None, 1],
        'array': [[], []],
        'array_str': [[], "a"],
        'array_bool': [[], True],
        'array_float': [[], 1.0],
        'array_int': [[], 1],
        'str': ["a", "b"],
        'str_bool': ["a", True],
        'str_float': ["a", 1.0],
        'str_int': ["a", 1],
        'bool': [True, False],
        'bool_float': [True, 0.0],
        'bool_int': [True, 1],
        'float': [1.0, 0.0],
        'float_int': [1.0, 0],
        'int': [1, 0],
    },
    dtype=object
)
print(example.dtypes.value_counts())
object    28
dtype: int64
example.head()
nan nan_none nan_array nan_str nan_bool nan_float nan_int none none_array none_str ... str str_bool str_float str_int bool bool_float bool_int float float_int int
0 NaN NaN NaN NaN NaN NaN NaN None None None ... a a a a True True True 1.0 1.0 1
1 NaN None [] a True 1.0 1 None [] a ... b True 1.0 1 False 0.0 1 0.0 0 0

2 rows × 28 columns

Type inference with pandas.DataFrame.infer_objects

example_results = example.T
example_results['pd_infer_objects'] = example.infer_objects().dtypes
example_results
0 1 pd_infer_objects
nan NaN NaN float64
nan_none NaN None float64
nan_array NaN [] object
nan_str NaN a object
nan_bool NaN True object
nan_float NaN 1.0 float64
nan_int NaN 1 float64
none None None object
none_array None [] object
none_str None a object
none_bool None True object
none_float None 0.0 float64
none_int None 1 float64
array [] [] object
array_str [] a object
array_bool [] True object
array_float [] 1.0 object
array_int [] 1 object
str a b object
str_bool a True object
str_float a 1.0 object
str_int a 1 object
bool True False bool
bool_float True 0.0 object
bool_int True 1 object
float 1.0 0.0 float64
float_int 1.0 0 float64
int 1 0 int64

At a glance, this method infers most of these mixed arrays as object, which naturally doesn’t deliver much information about the exact mixture of types an array contains. Plus, some object arrays can receive int or float casting (e.g., [np.nan, 1] becomes float64), but others can’t (e.g., ['a', 1] stays object).
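This asymmetry can be reproduced directly; the two series below mirror the "nan_int" and "str_int" rows of the table:

```python
import numpy as np
import pandas as pd

nan_int = pd.Series([np.nan, 1], dtype=object)
str_int = pd.Series(["a", 1], dtype=object)

print(nan_int.infer_objects().dtype)  # float64: NaN and int share a numeric type
print(str_int.infer_objects().dtype)  # object: no common concrete type exists
```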

Type inference with pandas.api.types.infer_dtype

This method has two variants: skipping na values and not skipping them. Let’s get inference results from both.

example_results['pd_infer_dtype'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=False))
example_results['pd_infer_dtype_skipna'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=True))
example_results
0 1 pd_infer_objects pd_infer_dtype pd_infer_dtype_skipna
nan NaN NaN float64 floating empty
nan_none NaN None float64 mixed empty
nan_array NaN [] object mixed mixed
nan_str NaN a object mixed string
nan_bool NaN True object mixed boolean
nan_float NaN 1.0 float64 floating floating
nan_int NaN 1 float64 integer-na integer
none None None object mixed empty
none_array None [] object mixed mixed
none_str None a object mixed string
none_bool None True object mixed boolean
none_float None 0.0 float64 mixed mixed-integer-float
none_int None 1 float64 mixed-integer integer
array [] [] object mixed mixed
array_str [] a object mixed mixed
array_bool [] True object mixed mixed
array_float [] 1.0 object mixed mixed
array_int [] 1 object mixed-integer mixed-integer
str a b object string string
str_bool a True object mixed mixed
str_float a 1.0 object mixed mixed
str_int a 1 object mixed-integer mixed-integer
bool True False bool boolean boolean
bool_float True 0.0 object mixed mixed
bool_int True 1 object mixed-integer mixed-integer
float 1.0 0.0 float64 floating floating
float_int 1.0 0 float64 mixed-integer-float mixed-integer-float
int 1 0 int64 integer integer

Comparison: with vs. without na values in pandas.api.types.infer_dtype

When we don’t skip na values (skipna=False), pandas.api.types.infer_dtype often returns "mixed" for arrays that pandas.DataFrame.infer_objects infers as object. In other words, the results are no more granular than what we just saw from pandas.DataFrame.infer_objects. For instance, in the table above, "nan_array", "nan_str", and "nan_bool" are all identified as "mixed" when we don’t ignore nan.

example_results.loc[["nan_array", "nan_str", "nan_bool"], "pd_infer_dtype"]
nan_array    mixed
nan_str      mixed
nan_bool     mixed
Name: pd_infer_dtype, dtype: object

However, when we ignore na values, we get more granular results that identify the correct data types of the non-missing values.

example_results.loc[["nan_array", "nan_str", "nan_bool"], "pd_infer_dtype_skipna"]
nan_array      mixed
nan_str       string
nan_bool     boolean
Name: pd_infer_dtype_skipna, dtype: object

Comparison: pandas.DataFrame.infer_objects vs. pandas.api.types.infer_dtype(skipna=True)

Because pandas.DataFrame.infer_objects takes a blanket approach to mixed data arrays, applying it to various mixed arrays yields a lot of object columns. Let’s take a closer look at the columns inferred as object by pandas.DataFrame.infer_objects, and examine the corresponding results from pandas.api.types.infer_dtype(skipna=True).

example_results[example_results['pd_infer_objects'] == object].drop('pd_infer_dtype', axis=1).sort_values(by='pd_infer_dtype_skipna')
0 1 pd_infer_objects pd_infer_dtype_skipna
nan_bool NaN True object boolean
none_bool None True object boolean
none None None object empty
nan_array NaN [] object mixed
str_float a 1.0 object mixed
str_bool a True object mixed
array_float [] 1.0 object mixed
array_bool [] True object mixed
array_str [] a object mixed
array [] [] object mixed
none_array None [] object mixed
bool_float True 0.0 object mixed
array_int [] 1 object mixed-integer
str_int a 1 object mixed-integer
bool_int True 1 object mixed-integer
none_str None a object string
str a b object string
nan_str NaN a object string

This shows that a variety of mixed arrays are inferred as object by pandas.DataFrame.infer_objects, while pandas.api.types.infer_dtype(skipna=True) can often identify the true types. It’s true that the latter returns "mixed" for many different arrays, but most of those contain non-numerical values such as strings or arrays.

One interesting observation is that [True, 0.0] is inferred as "mixed" but [True, 1] as "mixed-integer", which suggests that pandas.api.types.infer_dtype is designed to highlight the presence of integers in the inferred type information.
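The two rows in question can be checked directly:

```python
from pandas.api.types import infer_dtype

print(infer_dtype([True, 0.0]))  # 'mixed'
print(infer_dtype([True, 1]))    # 'mixed-integer'
```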

Finally, we can compare the returned values of two methods:

for val in set(example_results['pd_infer_objects']):
    print(val, type(val))
int64 <class 'numpy.dtype[int64]'>
float64 <class 'numpy.dtype[float64]'>
bool <class 'numpy.dtype[bool_]'>
object <class 'numpy.dtype[object_]'>
for val in set(example_results['pd_infer_dtype_skipna']):
    print(val, type(val))
mixed <class 'str'>
floating <class 'str'>
boolean <class 'str'>
mixed-integer <class 'str'>
string <class 'str'>
integer <class 'str'>
mixed-integer-float <class 'str'>
empty <class 'str'>

This shows that pandas.DataFrame.infer_objects returns readily usable NumPy dtypes as inference results, whereas pandas.api.types.infer_dtype returns strings, which need to be further processed or mapped if we want to cast these mixed arrays to more granular data types.

Conclusions

Hardware verification datasets often do not have a schema, and the feature meanings cannot be understood without extremely specialized domain knowledge. The fact that these datasets often contain mixed-type arrays makes it difficult for ML practitioners to understand their content. Therefore, type inference becomes an important step in the data digestion stage.

We can use pandas for type inference, and it offers two methods: pandas.DataFrame.infer_objects and pandas.api.types.infer_dtype. The former (pandas.DataFrame.infer_objects) is designed to return practical dtypes that can be directly cast onto arrays. Thus its type inference adopts a blanket approach, where the inferred type works immediately without any further steps to handle the data.

On the other hand, pandas.api.types.infer_dtype does a more granular type inference job and can also ignore na values. However, it returns strings as results, not Python types. Therefore, we need a further step to use this information for type casting, such as mapping "boolean" -> bool.
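As a minimal sketch of that further step, one could map a few of the returned labels to castable pandas dtypes. The LABEL_TO_DTYPE mapping and cast_by_inference helper below are hypothetical names of my own, and only a handful of labels are covered; nullable pandas dtypes are used so that missing values survive the cast:

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical mapping from infer_dtype labels to castable (nullable) dtypes
LABEL_TO_DTYPE = {
    "boolean": "boolean",
    "string": "string",
    "integer": "Int64",
    "floating": "Float64",
}

def cast_by_inference(s: pd.Series) -> pd.Series:
    """Cast a series based on its inferred (na-skipping) type label."""
    label = infer_dtype(s, skipna=True)
    target = LABEL_TO_DTYPE.get(label)
    return s.astype(target) if target else s

s = pd.Series([None, True], dtype=object)
print(cast_by_inference(s).dtype)  # boolean
```

Labels not in the mapping (e.g., "mixed") simply leave the series unchanged, which mirrors the blanket object fallback of pandas.DataFrame.infer_objects.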