niimpy.preprocessing.util module

niimpy.preprocessing.util.aggregate(df, freq, method_numerical='mean', method_categorical='first', groups=['user'], **resample_kwargs)[source]

Group and resample the data. Numerical and categorical columns are resampled separately, each with its own method.

Parameters:
df: pandas.DataFrame

Dataframe to resample

freq: string

Frequency at which to resample the data. Requires the dataframe to have a datetime-like index.

method_numerical: str

Resampling method for numerical columns. Possible values: ‘sum’, ‘mean’, ‘median’. Default value is ‘mean’.

method_categorical: str

Resampling method for categorical columns. Possible values: ‘first’, ‘mode’, ‘last’. Default value is ‘first’.

groups: list

Columns used for groupby operation.

resample_kwargs: dict

Keyword arguments passed on to the pandas resampling function.

Returns:
An aggregated and resampled multi-index dataframe.
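
The separate handling of numerical and categorical columns can be sketched in plain pandas. This is an illustration only, not niimpy's implementation: the column names and data are made up, and pd.Grouper is used for the time binning, so empty bins are not filled the way a true resample would fill them.

```python
import pandas as pd

# Illustrative data: datetime index, one grouping column ("user"),
# one numerical column and one categorical column.
index = pd.to_datetime([
    "2024-01-01 00:10", "2024-01-01 00:20",
    "2024-01-01 00:40", "2024-01-01 01:10",
])
df = pd.DataFrame(
    {
        "user": ["a", "b", "a", "a"],
        "steps": [10.0, 7.0, 30.0, 5.0],
        "activity": ["walk", "walk", "run", "rest"],
    },
    index=index,
)

# Group by user and by one-hour bins of the datetime index.
keys = ["user", pd.Grouper(freq="1h")]

# Numerical columns get one method, categorical columns another.
numerical = df.groupby(keys)[["steps"]].mean()
categorical = df.groupby(keys)[["activity"]].first()

# Both parts share the same (user, timestamp) multi-index.
aggregated = pd.concat([numerical, categorical], axis=1)
print(aggregated)
```

The result has one row per user and hour that contains data, which matches the multi-index shape described under Returns.
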
niimpy.preprocessing.util.date_range(df, start, end)[source]

Extract a date range from a DataFrame.

The index must contain the dates and must be sorted.
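
The equivalent operation in plain pandas is label slicing on a sorted datetime index. The sketch below uses made-up data and does not call niimpy; note that pandas label slicing includes both endpoints, and whether date_range treats the end the same way is not stated here.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
)

# Slicing by label is only reliable on a sorted index.
df = df.sort_index()

# Both endpoints are included in pandas label slicing.
subset = df.loc["2024-01-02":"2024-01-03"]
print(subset)
```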

niimpy.preprocessing.util.df_normalize(df, tz=None, old_tz=None)[source]

Normalize a dataframe (read from SQL) before presenting it to the user.

This sets the dataframe index to the time values and converts times to pandas.Timestamp objects. Modifies the dataframe in place.
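
A rough idea of this kind of normalization, in plain pandas. The ‘time’ column name and the assumption that it holds Unix seconds are illustrative guesses, not taken from the function's source.

```python
import pandas as pd

# Hypothetical raw table: a "time" column holding Unix timestamps in seconds.
raw = pd.DataFrame({"time": [1704067200, 1704070800], "value": [1, 2]})

# Convert the numeric times to timezone-aware pandas Timestamps and use
# them as the index. The assignment changes `raw` itself.
raw.index = pd.to_datetime(raw["time"], unit="s", utc=True)
print(raw)
```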

niimpy.preprocessing.util.ensure_dataframe(df)[source]
niimpy.preprocessing.util.group_data(df, additional_columns=None, id_columns=['user', 'device', 'group'])[source]

Group the dataframe by the standard Niimpy user identifier columns that are present in the dataframe. The columns are ‘user’, ‘device’, and ‘group’. Additional columns may be added and used for grouping.
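
The idea of grouping only by the identifier columns that exist can be sketched as follows. The data and the ‘value’ column are made up, and the code shows the pattern rather than niimpy's own implementation.

```python
import pandas as pd

# The standard identifier columns named above.
ID_COLUMNS = ["user", "device", "group"]

# Illustrative dataframe with no "group" column.
df = pd.DataFrame({
    "user": ["a", "a", "b"],
    "device": ["phone", "phone", "watch"],
    "value": [1, 2, 3],
})

# Keep only the identifier columns that are actually present.
present = [col for col in ID_COLUMNS if col in df.columns]

# Group by whatever identifiers were found.
totals = df.groupby(present)["value"].sum()
print(totals)
```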

niimpy.preprocessing.util.identifier_columns(df, id_columns=['user', 'device', 'group'])[source]

Build a list of the standard Niimpy identifier columns present in the dataframe.

niimpy.preprocessing.util.install_extensions()[source]

Automatically install sqlite extension functions.

Only works on Linux for now, improvements welcome.

niimpy.preprocessing.util.occurrence(series, bins=5, interval='1h')[source]

Resamples by interval and counts, for each interval, the number of bins that contain data.

With default options, this reproduces the logic of the “occurrence” database function, without needing the database.

Parameters:
series: pandas.Series

A pandas series of pandas.Timestamps.

bins: int

The number of bins each time interval is divided into.

interval: str

Length of the time interval. Default is “1h”.

Returns:
pandas.DataFrame

Dataframe with timestamp index and ‘occurance’ column.
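
The counting logic can be illustrated in plain pandas: split each interval into equal bins, note which bins contain at least one timestamp, and count those bins per interval. This is a sketch under assumptions (bins aligned by flooring timestamps, made-up data); niimpy's actual implementation may align or label things differently.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2024-01-01 00:01", "2024-01-01 00:05",  # same 12-minute bin
    "2024-01-01 00:30",                      # another bin in the same hour
    "2024-01-01 01:59",                      # one bin in the next hour
]))

interval = pd.Timedelta("1h")
bins = 5
bin_width = interval / bins  # 12 minutes per bin

# Floor every timestamp to the start of its bin and drop repeats,
# leaving one entry per bin that contains data.
occupied_bins = times.dt.floor(bin_width).drop_duplicates()

# Count the occupied bins that fall inside each interval.
counts = occupied_bins.groupby(occupied_bins.dt.floor(interval)).size()
print(counts)
```

With the default of 5 bins per hour, each count lies between 1 and 5 for every interval that has data.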

niimpy.preprocessing.util.read_preprocess(df, add_group=None)[source]

Standard preprocessing arguments when reading.

This is a preprocessing filter which handles some standard arguments when reading files. This should be considered a private, unstable function.

Parameters:
df: pandas.DataFrame

Input data frame

add_group: string, optional

If given, add a new ‘group’ column with all values set to this given identifier.

Returns:
df: dataframe

Resulting dataframe (modified in-place if possible, but may also be a copy)

niimpy.preprocessing.util.reset_groups(df, additional_columns=None, id_columns=['user', 'device', 'group'])[source]

Reset the id columns, and any optional additional columns, in the dataframe index.

niimpy.preprocessing.util.select_columns(df, columns, id_columns=['user', 'device', 'group'])[source]

Select the Niimpy identifier columns and the listed feature columns.

niimpy.preprocessing.util.set_conserved_index(df, additional_columns=None, id_columns=['user', 'device', 'group'])[source]

Set the standard id columns as the index. This allows concatenating dataframes that contain different measurements.

niimpy.preprocessing.util.set_encoding(df, to_encoding='utf-8', from_encoding='iso-8859-1')[source]

Recode the dataframe to a different encoding. This is useful when the encoding of a data file was set incorrectly and UTF-8 characters appear garbled.

Parameters:
df: pandas.DataFrame

Dataframe to recode

to_encoding: str

Encoding to convert to. Default is ‘utf-8’.

from_encoding: str

Encoding to convert from. Default is ‘iso-8859-1’.

Returns:
pandas.DataFrame

Recoded dataframe.
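
The kind of garbling this addresses, and its repair, can be shown on a single string with the default encodings. The example uses only the Python standard library and does not call niimpy; applying the same step to every string cell of a dataframe is left out.

```python
# Text that was stored as UTF-8 bytes but read as ISO-8859-1:
# each "ä" turns into the two characters "Ã¤".
original = "Jyväskylä"
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reverse the mistake: go back to the raw bytes using the wrong codec,
# then decode them with the correct one.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(repaired)
```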

niimpy.preprocessing.util.set_tz(tz)[source]

Globally set the preferred local timezone.

niimpy.preprocessing.util.tmp_timezone(new_tz)[source]

Temporarily override the global timezone for a block.

This is used as a context manager:

with tmp_timezone('Europe/Berlin'):
    ....

Note: this overrides the global timezone. In the future, there will be a way to handle timezones as non-global variables, which should be preferred.
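
The general pattern behind such a context manager is to save the global value, replace it, and restore it when the block exits. The sketch below uses the standard library only; the _settings dictionary and the temporary_timezone name are hypothetical stand-ins and are not niimpy's code.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a module-level timezone setting.
_settings = {"tz": "Europe/Helsinki"}


@contextmanager
def temporary_timezone(new_tz):
    """Override the global timezone for the duration of a with-block."""
    old_tz = _settings["tz"]
    _settings["tz"] = new_tz
    try:
        yield
    finally:
        # Restore the previous value even if the block raises.
        _settings["tz"] = old_tz


with temporary_timezone("Europe/Berlin"):
    inside = _settings["tz"]

after = _settings["tz"]
print(inside, after)
```

Because the value is global, any code running inside the block sees the override, which is why a non-global mechanism is described above as preferable.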

niimpy.preprocessing.util.to_datetime(value)[source]
niimpy.preprocessing.util.uninstall_extensions()[source]

Uninstall any installed extensions.