niimpy.preprocessing.util module

niimpy.preprocessing.util.aggregate(df, freq, method_numerical='mean', method_categorical='first', groups=['user'], **resample_kwargs)[source]

Group and resample the data. This function resamples numerical and categorical columns separately.

Parameters:
df : pandas.DataFrame

Dataframe to resample.

freq : str

Frequency at which to resample the data. Requires the dataframe to have a datetime-like index.

method_numerical : str

Resampling method for numerical columns. Possible values: ‘sum’, ‘mean’, ‘median’. Default value is ‘mean’.

method_categorical : str

Resampling method for categorical columns. Possible values: ‘first’, ‘mode’, ‘last’. Default value is ‘first’.

groups : list

Columns used for the groupby operation. Default value is [‘user’].

resample_kwargs : dict

Keyword arguments to pass to the pandas resampling function.

Returns:
An aggregated and resampled multi-index dataframe.
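
The idea of resampling numerical and categorical columns separately can be sketched with plain pandas. This is an illustrative sketch, not niimpy's implementation: the example data, the column names `steps` and `device`, and the use of `pd.Grouper` are assumptions made for the example. Note that `pd.Grouper` only returns bins that contain data, which may differ from what `aggregate` returns.

```python
import pandas as pd

# Hypothetical example data with a datetime index and a 'user' column.
df = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u2"],
        "steps": [10.0, 20.0, 30.0, 5.0],
        "device": ["phone", "watch", "watch", "phone"],
    },
    index=pd.to_datetime(
        [
            "2024-01-01 00:10",
            "2024-01-01 00:40",
            "2024-01-01 01:20",
            "2024-01-01 00:30",
        ]
    ),
)

# Group by user and by hourly bins of the datetime index.
grouped = df.groupby(["user", pd.Grouper(freq="1h")])

# Numerical columns are averaged, categorical columns keep the first value.
numerical = grouped["steps"].mean()
categorical = grouped["device"].first()

# Combine both parts into one multi-index dataframe (user, time).
result = pd.concat([numerical, categorical], axis=1)
print(result)
```

The result is indexed by user and by the start of each hourly bin, which mirrors the multi-index shape described under Returns.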

niimpy.preprocessing.util.date_range(df, start, end)[source]

Extract a date range from a DataFrame.

The index must contain the dates and must be sorted.
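
The requirement of a sorted date index matches how label-based slicing works in pandas. The following is a minimal sketch of that kind of slicing, assuming both endpoints are included; whether `date_range` treats its `start` and `end` the same way is not stated here, and the example data is hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with a sorted datetime index.
df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
)

start = pd.Timestamp("2024-01-02")
end = pd.Timestamp("2024-01-03")

# Label-based slicing on a sorted DatetimeIndex includes both endpoints.
subset = df.loc[start:end]
print(subset)
```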

niimpy.preprocessing.util.df_normalize(df, tz=None, old_tz=None)[source]

Normalize a dataframe (from SQL) before presenting it to the user.

This sets the dataframe index to the time values and converts times to pandas.Timestamp objects. Modifies the dataframe in place.
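
As a rough illustration of this kind of normalization, the sketch below converts Unix timestamps to timezone-aware pandas.Timestamp values and uses them as the index. The column name `time`, the use of seconds as the unit, and the target timezone are assumptions for the example, not details confirmed by this page.

```python
import pandas as pd

# Hypothetical raw data: 'time' holds Unix timestamps in seconds.
raw = pd.DataFrame({"time": [1704067200, 1704070800], "value": [1, 2]})

# Convert to timezone-aware timestamps (UTC first, then local time).
timestamps = pd.to_datetime(raw["time"], unit="s", utc=True)
timestamps = timestamps.dt.tz_convert("Europe/Helsinki")

# Replace the index on the existing dataframe object.
raw.index = timestamps
print(raw)
```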

niimpy.preprocessing.util.format_column_names(df)[source]

niimpy.preprocessing.util.install_extensions()[source]

Automatically install sqlite extension functions.

Only works on Linux for now, improvements welcome.

niimpy.preprocessing.util.occurrence(series, bins=5, interval='1h')[source]

Resamples by interval and aggregates by the number of bins with data.

With default options, this reproduces the logic of the “occurrence” database function, without needing the database.

Parameters:
series : pandas.Series

A pandas series of pandas.Timestamps.

bins : int

The number of bins each time interval is divided into. Default is 5.

interval : str

Length of the time interval. Default is “1h”.

Returns:
pandas.DataFrame

Dataframe with a timestamp index and an ‘occurrence’ column.
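
The counting logic described above (split each interval into bins, then count how many bins contain data) can be sketched as follows. This is an assumption-laden illustration rather than the function's actual code: the example timestamps and the output column name `bins_with_data` are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps: four in the 10:00 hour, one in the 11:00 hour.
times = pd.Series(
    pd.to_datetime(
        [
            "2024-01-01 10:01",
            "2024-01-01 10:05",
            "2024-01-01 10:30",
            "2024-01-01 10:59",
            "2024-01-01 11:15",
        ]
    )
)

bins = 5
interval = pd.Timedelta("1h")
bin_width = interval / bins  # 12 minutes with the defaults

frame = pd.DataFrame({"time": times})
# Start of the interval and of the bin that each timestamp falls into.
frame["interval_start"] = frame["time"].dt.floor(interval)
frame["bin_start"] = frame["time"].dt.floor(bin_width)

# Count the distinct bins that contain data within each interval.
occupied = frame.groupby("interval_start")["bin_start"].nunique()
result = occupied.rename("bins_with_data").to_frame()
print(result)
```

Here 10:01 and 10:05 fall into the same 12-minute bin, so the 10:00 interval has three occupied bins out of five.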

niimpy.preprocessing.util.set_encoding(df, to_encoding='utf-8', from_encoding='iso-8859-1')[source]

Recode the dataframe to a different encoding. This is useful when the encoding of a data file is set incorrectly and UTF-8 characters are garbled.

Parameters:
df : pandas.DataFrame

Dataframe to recode.

to_encoding : str

Encoding to convert to. Default is ‘utf-8’.

from_encoding : str

Encoding to convert from. Default is ‘iso-8859-1’.

Returns:
pandas.DataFrame

Recoded dataframe.
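
To show what this kind of recoding repairs, here is a minimal standard-library sketch on a single string. The helper name `recode` and the sample text are hypothetical; how `set_encoding` applies the conversion across a dataframe is not shown here.

```python
def recode(text, to_encoding="utf-8", from_encoding="iso-8859-1"):
    """Re-interpret text that was decoded with the wrong codec."""
    return text.encode(from_encoding).decode(to_encoding)


original = "Hyvää päivää"

# Simulate the problem: UTF-8 bytes read as if they were ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reverse the wrong decoding step to recover the original text.
repaired = recode(garbled)
print(garbled, "->", repaired)
```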

niimpy.preprocessing.util.set_tz(tz)[source]

Globally set the preferred local timezone.

niimpy.preprocessing.util.tmp_timezone(new_tz)[source]

Temporarily override the global timezone for a block.

This is used as a context manager:

from niimpy.preprocessing.util import tmp_timezone

with tmp_timezone('Europe/Berlin'):
    ...

Note: this overrides the global timezone. In the future, there will be a way to handle timezones as non-global variables, which should be preferred.
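
The pattern of temporarily overriding a global setting and restoring it afterwards can be sketched with `contextlib`. The names `_settings` and `override_timezone` and the default timezone are hypothetical and stand in for the module's own state; this is not the library's implementation.

```python
from contextlib import contextmanager

# Hypothetical module-level state holding the preferred timezone.
_settings = {"tz": "Europe/Helsinki"}


@contextmanager
def override_timezone(new_tz):
    """Temporarily replace the global timezone, restoring it on exit."""
    old_tz = _settings["tz"]
    _settings["tz"] = new_tz
    try:
        yield
    finally:
        # Restore the previous value even if the block raised an error.
        _settings["tz"] = old_tz


with override_timezone("Europe/Berlin"):
    inside = _settings["tz"]
after = _settings["tz"]
print(inside, after)
```

Because the state is global, nested or concurrent use affects every caller, which is the limitation the note above refers to.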

niimpy.preprocessing.util.to_datetime(value)[source]

niimpy.preprocessing.util.uninstall_extensions()[source]

Uninstall any installed extensions.