niimpy.preprocessing.util module

niimpy.preprocessing.util.aggregate(df, freq, method_numerical='mean', method_categorical='first', groups=['user'], **resample_kwargs)

Group and resample the data. This function resamples numerical and categorical columns separately, each with its own method.

Parameters:

df: pandas.DataFrame: Dataframe to resample.
freq: string: Frequency to resample the data to. Requires the dataframe to have a datetime-like index.
method_numerical: str: Resampling method for numerical columns. Possible values: ‘sum’, ‘mean’, ‘median’. Default value is ‘mean’.
method_categorical: str: Resampling method for categorical columns. Possible values: ‘first’, ‘mode’, ‘last’. Default value is ‘first’.
groups: list: Columns used for the groupby operation.
resample_kwargs: dict: Keyword arguments passed to the pandas resampling function.

Returns:

An aggregated and resampled multi-index dataframe.
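The separate handling of numerical and categorical columns can be illustrated with plain pandas. The following is a minimal sketch, not the library's implementation: the helper name `aggregate_sketch`, the toy data, and the fixed choice of `mean` and `first` are assumptions made for illustration.

```python
import pandas as pd

# Toy data: two users, one numerical and one categorical column,
# indexed by timestamp (a datetime-like index is required).
index = pd.to_datetime([
    "2024-01-01 00:10", "2024-01-01 00:40", "2024-01-01 01:20",
    "2024-01-01 00:05", "2024-01-01 00:50",
])
df = pd.DataFrame(
    {
        "user": ["a", "a", "a", "b", "b"],
        "steps": [10.0, 30.0, 5.0, 2.0, 4.0],
        "activity": ["walk", "run", "still", "still", "walk"],
    },
    index=index,
)


def aggregate_sketch(df, freq, groups=("user",)):
    """Hypothetical helper: resample each column type with its own method."""
    groups = list(groups)
    value_cols = df.columns.difference(groups)
    numerical = df[value_cols].select_dtypes(include="number").columns
    categorical = value_cols.difference(numerical)
    # Numerical columns: mean per group and time bin.
    num = df.groupby(groups)[list(numerical)].resample(freq).mean()
    # Categorical columns: first observed value per group and time bin.
    cat = df.groupby(groups)[list(categorical)].resample(freq).first()
    # Both results share the same (group, time) multi-index.
    return pd.concat([num, cat], axis=1)


result = aggregate_sketch(df, "1h")
print(result)
```

The result is indexed by the grouping column and the resampled timestamp, which matches the multi-index shape described above.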

niimpy.preprocessing.util.date_range(df, start, end)

Extract a date range from a DataFrame.

The index must contain the dates, and it must be sorted.
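The sorted-index requirement is easiest to see with ordinary pandas label slicing. This sketch uses only pandas and does not call `date_range`; whether the function treats its endpoints as inclusive is not stated here, so the slice below only shows the general idea.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["2024-01-03", "2024-01-01", "2024-01-04", "2024-01-02"]
    ),
)

# Slicing a datetime index by label is only reliable once it is sorted.
df = df.sort_index()

# Plain pandas label slicing; both endpoints are included here.
subset = df.loc["2024-01-02":"2024-01-03"]
print(subset)
```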

niimpy.preprocessing.util.df_normalize(df, tz=None, old_tz=None)

Normalize a df (from sql) before presenting it to the user.

This sets the dataframe index to the time values and converts times to pandas.Timestamp objects. The dataframe is modified in place.

niimpy.preprocessing.util.ensure_dataframe(df)

niimpy.preprocessing.util.group_data(df, additional_columns=None, id_columns=['user', 'device', 'group']): Group the dataframe by the standard Niimpy user identifier columns present in the dataframe. The columns are ‘user’, ‘device’, and ‘group’. An additional column may be added and used for grouping.

niimpy.preprocessing.util.identifier_columns(df, id_columns=['user', 'device', 'group']): Build a list of the standard Niimpy identifier columns present in the dataframe.
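As an illustration of the idea behind these two helpers, the sketch below picks out whichever identifier columns exist and groups by them. The function name `identifier_columns_sketch` and the example data are hypothetical, and the real functions may behave differently in details such as handling of index levels.

```python
import pandas as pd


def identifier_columns_sketch(df, id_columns=("user", "device", "group")):
    """Hypothetical helper: keep only the identifier columns present in df."""
    return [col for col in id_columns if col in df.columns]


df = pd.DataFrame(
    {
        "user": ["a", "a", "b"],
        "device": ["phone", "phone", "watch"],
        "value": [1.0, 3.0, 5.0],
    }
)

# 'group' is absent from this dataframe, so it is skipped.
cols = identifier_columns_sketch(df)

# Group by the identifier columns that were found.
means = df.groupby(cols)["value"].mean()
print(cols)
print(means)
```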

niimpy.preprocessing.util.install_extensions()

Automatically install sqlite extension functions.

Only works on Linux for now, improvements welcome.

niimpy.preprocessing.util.occurrence(series, bins=5, interval='1h')

Resamples the series by interval and, for each interval, counts the number of bins that contain data.

With default options, this reproduces the logic of the “occurrence” database function, without needing the database.

Parameters:

series: pandas.Series: A pandas series of pandas.Timestamps.
bins: int: The number of bins each time interval is divided into.
interval: str: Length of the time interval. Default is “1h”.

Returns:

pandas.DataFrame: Dataframe with timestamp index and ‘occurance’ column.
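The counting logic described above can be sketched in plain pandas. This is an illustration under assumptions, not the library's code: the name `occurrence_sketch` and the output column name are chosen here, and intervals without any data are simply left out of the result.

```python
import pandas as pd


def occurrence_sketch(timestamps, bins=5, interval="1h"):
    """Hypothetical helper: count the bins with data in each interval."""
    ts = pd.Series(pd.to_datetime(list(timestamps)))
    # Start of the interval each timestamp falls into.
    interval_start = ts.dt.floor(interval)
    # Width of one bin, e.g. 12 minutes for five bins per hour.
    bin_width = pd.Timedelta(interval) / bins
    # Index of the bin within its interval (0 .. bins - 1).
    bin_number = (ts - interval_start) // bin_width
    # Several timestamps in the same bin count only once.
    occupied = pd.DataFrame(
        {"interval": interval_start, "bin": bin_number}
    ).drop_duplicates()
    counts = occupied.groupby("interval").size()
    return counts.rename("occurrence").to_frame()


stamps = [
    "2024-01-01 10:01", "2024-01-01 10:05", "2024-01-01 10:13",
    "2024-01-01 10:50", "2024-01-01 11:30",
]
result = occurrence_sketch(stamps)
print(result)
```

With five 12-minute bins per hour, the first two timestamps share a bin, so the 10:00 interval has three occupied bins and the 11:00 interval has one.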

niimpy.preprocessing.util.read_preprocess(df, add_group=None)

Standard preprocessing arguments when reading.

This is a preprocessing filter which handles some standard arguments when reading files. This should be considered a private, unstable function.

Parameters:

df: pandas.DataFrame: Input data frame
add_group: string, optional: If given, add a new ‘group’ column with all values set to this given identifier.

Returns:

df: dataframe: Resulting dataframe (modified in-place if possible, but may also be a copy)

niimpy.preprocessing.util.reset_groups(df, additional_columns=None, id_columns=['user', 'device', 'group']): Reset id columns and optional additional columns in the dataframe index.

niimpy.preprocessing.util.select_columns(df, columns, id_columns=['user', 'device', 'group']): Select Niimpy identifier columns and listed feature columns

niimpy.preprocessing.util.set_conserved_index(df, additional_columns=None, id_columns=['user', 'device', 'group']): Set standard id columns as index. This allows concatenating dataframes with different measurements.

niimpy.preprocessing.util.set_encoding(df, to_encoding='utf-8', from_encoding='iso-8859-1')

Recode the dataframe to a different encoding. This is useful when the encoding of a data file was set incorrectly and UTF-8 characters appear garbled.

Parameters:

df: pandas.DataFrame: Dataframe to recode
to_encoding: str: Encoding to convert to. Default is ‘utf-8’.
from_encoding: str: Encoding to convert from. Default is ‘iso-8859-1’.

Returns:

pandas.DataFrame: Recoded dataframe.
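The kind of garbling this addresses can be reproduced on a single string with the standard library. The sketch below assumes the usual round trip of encoding with the wrong codec and decoding with the right one; the helper name `recode_sketch` is hypothetical, and how the function applies this across a dataframe is not shown.

```python
def recode_sketch(text, to_encoding="utf-8", from_encoding="iso-8859-1"):
    """Hypothetical helper: undo a wrong decoding of a single string."""
    # Recover the original bytes, then decode them with the right codec.
    return text.encode(from_encoding).decode(to_encoding)


original = "äiti"
# Simulate the problem: UTF-8 bytes read as if they were ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")
fixed = recode_sketch(garbled)

print(repr(garbled))
print(repr(fixed))
```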

niimpy.preprocessing.util.set_tz(tz): Globally set the preferred local timezone

niimpy.preprocessing.util.tmp_timezone(new_tz)

Temporarily override the global timezone for a block.

This is used as a context manager:

with tmp_timezone('Europe/Berlin'):
    ....

Note: this overrides the global timezone. In the future, there will be a way to handle timezones as non-global variables, which should be preferred.
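The pattern of temporarily overriding a global and restoring it afterwards can be sketched with `contextlib`. All names below (`_TZ`, `tmp_timezone_sketch`, the default value) are hypothetical stand-ins, and the library's own implementation may differ.

```python
from contextlib import contextmanager

# Hypothetical module-level setting standing in for the global timezone.
_TZ = "Europe/Helsinki"


@contextmanager
def tmp_timezone_sketch(new_tz):
    """Hypothetical helper: override the global for the duration of a block."""
    global _TZ
    old_tz = _TZ
    _TZ = new_tz
    try:
        yield
    finally:
        # Restore the previous value even if the block raises.
        _TZ = old_tz


with tmp_timezone_sketch("Europe/Berlin"):
    inside = _TZ  # the override is visible inside the block

after = _TZ  # the original value is back once the block exits
print(inside, after)
```

Because the override is a process-wide global, code running concurrently would see it too, which is one reason a non-global mechanism is preferable.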

niimpy.preprocessing.util.to_datetime(value)

niimpy.preprocessing.util.uninstall_extensions(): Uninstall any installed extensions