Audio Data
1. Introduction
Audio data - as recorded by smartphones or other portable devices - can carry important information about an individual’s environment, which may give insights into activity, sleep, and social interaction. However, using these data can be tricky due to privacy concerns; for example, conversations are highly identifiable. A possible solution is to compute more general characteristics (e.g. frequency) and extract features from those instead. To address this last part, niimpy includes the function extract_features_audio to clean, downsample, and extract features from audio snippets that have already been anonymized. This function employs other functions to extract the following features:
- audio_count_silent: number of times the environment has been silent
- audio_count_speech: number of times there has been sound in the environment that matches the range of human speech frequency (65 - 255 Hz)
- audio_count_loud: number of times there has been sound in the environment above 70 dB
- audio_min_freq: minimum frequency of the recorded audio snippets
- audio_max_freq: maximum frequency of the recorded audio snippets
- audio_mean_freq: mean frequency of the recorded audio snippets
- audio_median_freq: median frequency of the recorded audio snippets
- audio_std_freq: standard deviation of the frequency of the recorded audio snippets
- audio_min_db: minimum decibel level of the recorded audio snippets
- audio_max_db: maximum decibel level of the recorded audio snippets
- audio_mean_db: mean decibel level of the recorded audio snippets
- audio_median_db: median decibel level of the recorded audio snippets
- audio_std_db: standard deviation of the decibel levels of the recorded audio snippets
In the following, we will analyze audio snippets provided by niimpy as an example, to illustrate the use of niimpy’s audio preprocessing functions.
2. Read data
Let’s start by reading the example data provided in niimpy. These data have already been shaped into a format that meets the requirements of the data schema. First, we import the needed modules: the niimpy package itself, and then the module we will use (audio), which we give a short name for convenience.
[1]:
import niimpy
from niimpy import config
import niimpy.preprocessing.audio as au
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
Now let’s read the example data provided in niimpy. The example data is in csv format, so we need to use the read_csv function. When reading the data, we can specify the timezone where the data was collected with the argument tz; this makes handling daylight saving time easier. The output is a dataframe. We can also check the number of rows and columns in the dataframe.
[2]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_AUDIO_PATH, tz='Europe/Helsinki')
data.shape
[2]:
(33, 7)
The data was successfully read. We can see that there are 33 datapoints with 7 columns in the dataset. However, we do not yet know what the data really looks like, so let’s have a quick look:
[3]:
data.head()
[3]:
| | user | device | time | is_silent | double_decibels | double_frequency | datetime |
|---|---|---|---|---|---|---|---|
2020-01-09 02:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578528e+09 | 0 | 84 | 4935 | 2020-01-09 02:08:03.896000+02:00 |
2020-01-09 02:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 0 | 89 | 8734 | 2020-01-09 02:38:03.896000+02:00 |
2020-01-09 03:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578532e+09 | 0 | 99 | 1710 | 2020-01-09 03:08:03.896000+02:00 |
2020-01-09 03:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578534e+09 | 0 | 77 | 9054 | 2020-01-09 03:38:03.896000+02:00 |
2020-01-09 04:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578536e+09 | 0 | 80 | 12265 | 2020-01-09 04:08:03.896000+02:00 |
[4]:
data.tail()
[4]:
| | user | device | time | is_silent | double_decibels | double_frequency | datetime |
|---|---|---|---|---|---|---|---|
2019-08-13 15:02:17.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565698e+09 | 1 | 44 | 2914 | 2019-08-13 15:02:17.657999872+03:00 |
2019-08-13 15:28:59.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565699e+09 | 1 | 49 | 7195 | 2019-08-13 15:28:59.657999872+03:00 |
2019-08-13 15:59:01.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565701e+09 | 0 | 55 | 91 | 2019-08-13 15:59:01.657999872+03:00 |
2019-08-13 16:29:03.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565703e+09 | 0 | 76 | 3853 | 2019-08-13 16:29:03.657999872+03:00 |
2019-08-13 16:59:05.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565705e+09 | 0 | 84 | 7419 | 2019-08-13 16:59:05.657999872+03:00 |
By exploring the head and tail of the dataframe we can form an idea of its entirety. From the data, we can see that:

- rows are observations, indexed by timestamps, i.e. each row represents a snippet that has been recorded at a given time and date
- columns are characteristics for each observation, for example, the user whose data we are analyzing
- there are at least two different users in the dataframe
- there are two main measurement columns: double_decibels and double_frequency
In fact, we can check the first three elements for each user:
[5]:
data.drop_duplicates(['user','time']).groupby('user').head(3)
[5]:
| | user | device | time | is_silent | double_decibels | double_frequency | datetime |
|---|---|---|---|---|---|---|---|
2020-01-09 02:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578528e+09 | 0 | 84 | 4935 | 2020-01-09 02:08:03.896000+02:00 |
2020-01-09 02:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 0 | 89 | 8734 | 2020-01-09 02:38:03.896000+02:00 |
2020-01-09 03:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578532e+09 | 0 | 99 | 1710 | 2020-01-09 03:08:03.896000+02:00 |
2019-08-13 07:28:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565671e+09 | 0 | 51 | 7735 | 2019-08-13 07:28:27.657999872+03:00 |
2019-08-13 07:58:29.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565672e+09 | 0 | 90 | 13609 | 2019-08-13 07:58:29.657999872+03:00 |
2019-08-13 08:28:31.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565674e+09 | 0 | 81 | 7690 | 2019-08-13 08:28:31.657999872+03:00 |
Sometimes the data may arrive in a disordered manner, so, just to make sure, let’s sort the dataframe and compare the results. We will sort by the columns “user” and “datetime”, since we would like the information ordered first by participant and then chronologically. Luckily, in our dataframe, the index and the datetime column are the same.
[6]:
data.sort_values(by=['user', 'datetime'], inplace=True)
data.drop_duplicates(['user','time']).groupby('user').head(3)
[6]:
| | user | device | time | is_silent | double_decibels | double_frequency | datetime |
|---|---|---|---|---|---|---|---|
2019-08-13 07:28:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565671e+09 | 0 | 51 | 7735 | 2019-08-13 07:28:27.657999872+03:00 |
2019-08-13 07:58:29.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565672e+09 | 0 | 90 | 13609 | 2019-08-13 07:58:29.657999872+03:00 |
2019-08-13 08:28:31.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.565674e+09 | 0 | 81 | 7690 | 2019-08-13 08:28:31.657999872+03:00 |
2020-01-09 02:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578528e+09 | 0 | 84 | 4935 | 2020-01-09 02:08:03.896000+02:00 |
2020-01-09 02:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 0 | 89 | 8734 | 2020-01-09 02:38:03.896000+02:00 |
2020-01-09 03:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578532e+09 | 0 | 99 | 1710 | 2020-01-09 03:08:03.896000+02:00 |
Ok, the rows are now grouped by user and, within each user, ordered chronologically. We can start extracting features. However, we need to understand the data format requirements first.
* TIP! Data format requirements (or what should our data look like)
Data can take other shapes and formats. However, the niimpy data schema requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:

1. One row per audio snippet. Each row should store information about one snippet only
2. Each row’s index should be a timestamp
3. The following five columns are required:
   - index: date and time when the event happened (timestamp)
   - user: stores the user name whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)
   - is_silent: stores whether the decibel level is below a set threshold (usually 50dB)
   - double_decibels: stores the decibels of the recorded snippet
   - double_frequency: the frequency of the recorded snippet in Hz
   - NOTE: most of our audio examples come from data recorded with the Aware Framework. If you want to know more about the frequency and decibels, please read https://github.com/denzilferreira/com.aware.plugin.ambient_noise
4. Additional columns are allowed
5. The names of the columns do not need to be exactly “user”, “is_silent”, “double_decibels”, or “double_frequency”, as we can pass our own names in an argument (to be explained later)
Below is an example of a dataframe that complies with these minimum requirements
[7]:
example_dataschema = data[['user','is_silent','double_decibels','double_frequency']]
example_dataschema.head(3)
[7]:
| | user | is_silent | double_decibels | double_frequency |
|---|---|---|---|---|
2019-08-13 07:28:27.657999872+03:00 | iGyXetHE3S8u | 0 | 51 | 7735 |
2019-08-13 07:58:29.657999872+03:00 | iGyXetHE3S8u | 0 | 90 | 13609 |
2019-08-13 08:28:31.657999872+03:00 | iGyXetHE3S8u | 0 | 81 | 7690 |
4. Extracting features
There are two ways to extract features. We could use each function separately, or we could use niimpy’s ready-made wrapper. Both ways require us to specify arguments to pass to the functions/wrapper in order to customize the way they work. These arguments are specified in dictionaries. Let’s first understand how to extract features using stand-alone functions.
4.1 Extract features using stand-alone functions
We can use niimpy’s functions to compute audio features. Each function requires two inputs:

- (mandatory) a dataframe that complies with the minimum requirements (see the ‘* TIP! Data format requirements’ section above)
- (optional) an argument dictionary for stand-alone functions
4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works)
In this dictionary, we can input two main options to customize the way a stand-alone function works:

- the name of the column to be preprocessed: since the dataframe may have several columns, we need to specify which column has the data we would like to preprocess. To do so, we simply pass the name of the column to the argument audio_column_name.
- the way we resample: resampling options are specified in niimpy as a dictionary. niimpy’s resampling and aggregating relies on pandas.DataFrame.resample, so mastering the use of this pandas function will help us greatly in niimpy’s preprocessing. Please familiarize yourself with the pandas resample function before continuing. Briefly, to use pandas.DataFrame.resample, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15 seconds, 30 minutes, 1 hour). Nevertheless, we can pass more details to the function to specify the exact sampling we would like. For example, we could use the closed argument if we would like to specify which side of each interval is closed, or the offset argument if we would like to start our binning with an offset, etc. There are plenty of options, so we strongly recommend having the pandas.DataFrame.resample documentation at hand. All arguments for pandas.DataFrame.resample are specified in a dictionary whose keys are the arguments’ names and whose values are the values chosen for each of these arguments. This dictionary is passed as the value of the key resample_args in niimpy.
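Because niimpy delegates the resampling to pandas, a quick pandas-only illustration may help build intuition for the rule and offset arguments before we assemble niimpy’s dictionaries. This sketch is independent of niimpy; the series and timestamps are made up for the example.

import pandas as pd

idx = pd.date_range("2020-01-01 00:07", periods=6, freq="20T")
s = pd.Series(range(6), index=idx)

s.resample("30T").sum()                 # 30-minute bins anchored at midnight
s.resample("30T", offset="5min").sum()  # the same bins shifted by 5 minutes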
Let’s see some basic examples of these dictionaries:
[8]:
feature_dict1 = {"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}}
feature_dict2 = {"audio_column_name":"random_name","resample_args":{"rule":"30T"}}
feature_dict3 = {"audio_column_name":"other_name","resample_args":{"rule":"45T","origin":"end"}}
Here, we have three basic feature dictionaries.

- feature_dict1 will be used to analyze the data stored in the column double_frequency in our dataframe. The data will be binned in one-day periods.
- feature_dict2 will be used to analyze the data stored in the column random_name in our dataframe. The data will be aggregated in 30-minute bins.
- feature_dict3 will be used to analyze the data stored in the column other_name in our dataframe. The data will be binned in 45-minute bins, but the binning will start from the last timestamp in the dataframe.
Default values: if no arguments are passed, niimpy will aggregate the data in 30-minute bins and will select the audio_column_name according to the most suitable column. For example, if we are computing the minimum frequency, niimpy will select double_frequency as the column name.
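As a minimal sketch of this default behavior (assuming, as the examples below do, that stand-alone functions take the argument dictionary as a second positional argument, and that an empty dictionary triggers the defaults), we could call a function without customizing anything:

min_freq_default = au.audio_min_freq(data, {})  # defaults: 30-minute bins, double_frequency column
min_freq_default.head()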
4.1.2 Using the functions
Now that we understand how the functions are customized, it is time to compute our first audio feature. Suppose we are interested in extracting the total number of times our recordings were loud every 50 minutes. We will need niimpy’s audio_count_loud function, the data, and a dictionary to customize the function. Let’s create the dictionary first:
[9]:
function_features={"audio_column_name":"double_decibels","resample_args":{"rule":"50T"}}
Now let’s use the function to preprocess the data.
[10]:
my_loud_times = au.audio_count_loud(data, function_features)
Let’s look at some values for one of the subjects.
[11]:
my_loud_times[my_loud_times["user"]=="jd9INuQ5BBlW"]
[11]:
| datetime | user | device | audio_count_loud |
|---|---|---|---|
2020-01-09 01:40:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1 |
2020-01-09 02:30:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 03:20:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 04:10:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1 |
2020-01-09 05:00:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 05:50:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 06:40:00+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1 |
2020-01-09 07:30:00+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1 |
2020-01-09 08:20:00+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1 |
2020-01-09 09:10:00+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1 |
2020-01-09 10:00:00+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 2 |
Let’s recall what the original data looks like for this subject:
[12]:
data[data["user"]=="jd9INuQ5BBlW"].head(7)
[12]:
| | user | device | time | is_silent | double_decibels | double_frequency | datetime |
|---|---|---|---|---|---|---|---|
2020-01-09 02:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578528e+09 | 0 | 84 | 4935 | 2020-01-09 02:08:03.896000+02:00 |
2020-01-09 02:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 0 | 89 | 8734 | 2020-01-09 02:38:03.896000+02:00 |
2020-01-09 03:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578532e+09 | 0 | 99 | 1710 | 2020-01-09 03:08:03.896000+02:00 |
2020-01-09 03:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578534e+09 | 0 | 77 | 9054 | 2020-01-09 03:38:03.896000+02:00 |
2020-01-09 04:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578536e+09 | 0 | 80 | 12265 | 2020-01-09 04:08:03.896000+02:00 |
2020-01-09 04:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578537e+09 | 0 | 52 | 7281 | 2020-01-09 04:38:03.896000+02:00 |
2020-01-09 05:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578539e+09 | 0 | 63 | 14408 | 2020-01-09 05:08:03.896000+02:00 |
We see that the bins are indeed 50-minute bins; however, they are adjusted to fixed, predetermined intervals, i.e. a bin does not start at the time of the first datapoint. Instead, pandas starts the binning at 00:00:00 of every day and counts 50-minute intervals from there. If we want the binning to start from the first datapoint in our dataset, we need the origin parameter and a for loop.
[13]:
users = list(data['user'].unique())
results = []
for user in users:
    start_time = data[data["user"]==user].index.min()
    function_features = {"audio_column_name":"double_decibels","resample_args":{"rule":"50T","origin":start_time}}
    results.append(au.audio_count_loud(data[data["user"]==user], function_features))
my_loud_times = pd.concat(results)
[14]:
my_loud_times
[14]:
| datetime | user | device | audio_count_loud |
|---|---|---|---|
2019-08-13 07:28:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 08:18:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 09:08:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2019-08-13 09:58:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 10:48:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 11:38:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2019-08-13 12:28:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2019-08-13 13:18:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 0 |
2019-08-13 14:08:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2019-08-13 14:58:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 15:48:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 16:38:27.657999872+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2020-01-09 02:08:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 02:58:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 03:48:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1 |
2020-01-09 04:38:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 05:28:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 2 |
2020-01-09 06:18:03.896000+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1 |
2020-01-09 07:08:03.896000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 2 |
2020-01-09 07:58:03.896000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 0 |
2020-01-09 08:48:03.896000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1 |
2020-01-09 09:38:03.896000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 2 |
2020-01-09 10:28:03.896000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1 |
4.2 Extract features using the wrapper
We can use niimpy’s ready-made wrapper to extract one or several features at the same time. The wrapper requires two inputs:

- (mandatory) a dataframe that complies with the minimum requirements (see the ‘* TIP! Data format requirements’ section above)
- (optional) an argument dictionary for the wrapper
4.2.1 The argument dictionary for wrapper (or how we specify the way the wrapper works)
This argument dictionary will use dictionaries created for stand-alone functions. If you do not know how to create those argument dictionaries, please read the section 4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works) first.
The wrapper dictionary is simple. Its keys are the names of the features we want to compute. Its values are argument dictionaries created for each stand-alone function we will employ. Let’s see some examples of wrapper dictionaries:
[15]:
wrapper_features1 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1D"}},
au.audio_max_freq:{"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}}}
wrapper_features1 will be used to analyze two features, audio_count_loud and audio_max_freq. For the feature audio_count_loud, we will use the data stored in the column double_decibels in our dataframe, and the data will be binned in one-day periods. For the feature audio_max_freq, we will use the data stored in the column double_frequency in our dataframe, and the data will be binned in one-day periods.
[16]:
wrapper_features2 = {au.audio_mean_db:{"audio_column_name":"random_name","resample_args":{"rule":"1D"}},
au.audio_count_speech:{"audio_column_name":"double_decibels", "audio_freq_name":"double_frequency", "resample_args":{"rule":"5H","offset":"5min"}}}
wrapper_features2 will be used to analyze two features, audio_mean_db and audio_count_speech. For the feature audio_mean_db, we will use the data stored in the column random_name in our dataframe, and the data will be binned in one-day periods. For the feature audio_count_speech, we will use the data stored in the column double_decibels in our dataframe, and the data will be binned in 5-hour periods with a 5-minute offset. Note that for this feature we also need to pass a frequency column via the audio_freq_name argument, because speech is defined not only by the amplitude of the recording but also by its frequency range (see the stand-alone sketch after these examples).
[17]:
wrapper_features3 = {au.audio_mean_db:{"audio_column_name":"one_name","resample_args":{"rule":"1D","offset":"5min"}},
au.audio_min_freq:{"audio_column_name":"one_name","resample_args":{"rule":"5H"}},
au.audio_count_silent:{"audio_column_name":"another_name","resample_args":{"rule":"30T","origin":"end_day"}}}
wrapper_features3 will be used to analyze three features, audio_mean_db, audio_min_freq, and audio_count_silent. For the feature audio_mean_db, we will use the data stored in the column one_name, and the data will be binned in one-day periods with a 5-minute offset. For the feature audio_min_freq, we will use the data stored in the column one_name in our dataframe, and the data will be binned in 5-hour periods. Finally, for the feature audio_count_silent, we will use the data stored in the column another_name in our dataframe; the data will be binned in 30-minute periods and the origin of the bins will be the ceiling midnight of the last day.
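For reference, below is a hedged stand-alone sketch of audio_count_speech with both column arguments passed at once. It reuses our example data; the resulting counts depend on the 70 dB threshold and the 65 - 255 Hz speech range described in the introduction.

speech_features = {"audio_column_name": "double_decibels",
                   "audio_freq_name": "double_frequency",
                   "resample_args": {"rule": "1H"}}
my_speech_times = au.audio_count_speech(data, speech_features)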
Default values: if no arguments are passed, niimpy’s default values are either “double_decibels”, “double_frequency”, or “is_silent” for the audio_column_name, and 30-minute aggregation bins. The column name depends on the function to be called. Moreover, the wrapper will compute all the available functions in the absence of the argument dictionary.
4.2.2 Using the wrapper
Now that we understand how the wrapper is customized, it is time to compute our first audio feature using the wrapper. Suppose we are interested in extracting the audio_count_loud feature every 50 minutes. We will need niimpy’s extract_features_audio function, the data, and a dictionary to customize the wrapper. Let’s create the dictionary first:
[18]:
wrapper_features1 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"50T"}}}
Now, let’s use the wrapper:
[19]:
results_wrapper = au.extract_features_audio(data, features=wrapper_features1)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
[19]:
| datetime | user | device | audio_count_loud |
|---|---|---|---|
2019-08-13 06:40:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2019-08-13 07:30:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
2019-08-13 08:20:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 09:10:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 |
2019-08-13 10:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 |
Our first attempt was successful. Now, let’s try something more. Let’s assume we want to compute audio_count_loud and audio_min_freq in 1-hour bins.
[20]:
wrapper_features2 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1H"}},
au.audio_min_freq:{"audio_column_name":"double_frequency", "resample_args":{"rule":"1H"}}}
results_wrapper = au.extract_features_audio(data, features=wrapper_features2)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
[20]:
| datetime | user | device | audio_count_loud | audio_min_freq |
|---|---|---|---|---|
2019-08-13 07:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 | 7735.0 |
2019-08-13 08:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 | 7690.0 |
2019-08-13 09:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 | 756.0 |
2019-08-13 10:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 | 3059.0 |
2019-08-13 11:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 2 | 12278.0 |
Great! Another successful attempt. We see from the results that more columns were added with the required calculations. This is how the wrapper works when all features are computed with the same bins. Now, let’s see how the wrapper performs when each function has different binning requirements. Let’s assume we need to compute the audio_count_loud every day, and the audio_min_freq every 5 hours with an offset of 5 minutes.
[21]:
wrapper_features3 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1D"}},
au.audio_min_freq:{"audio_column_name":"double_frequency", "resample_args":{"rule":"5H", "offset":"5min"}}}
results_wrapper = au.extract_features_audio(data, features=wrapper_features3)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
[21]:
| datetime | user | device | audio_count_loud | audio_min_freq |
|---|---|---|---|---|
2019-08-13 00:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 17.0 | NaN |
2020-01-09 00:00:00+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 10.0 | NaN |
2020-01-09 00:00:00+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 6.0 | NaN |
2019-08-13 05:05:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | NaN | 756.0 |
2019-08-13 10:05:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | NaN | 2914.0 |
The output is once again a dataframe. In this case, two aggregation grids are shown. The first one is the daily aggregation computed for the audio_count_loud feature. The second one is the 5-hour aggregation with a 5-minute offset for audio_min_freq. Note that because audio_min_freq is not required to be aggregated daily, it has NaN values at the daily timestamps. Similarly, because audio_count_loud is not required to be aggregated in 5-hour windows, it has NaN values at the 5-hour timestamps.
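If this mixed grid is inconvenient for further analysis, a small pandas-only sketch (not a niimpy feature) can split the output back into one series per aggregation:

loud_daily = results_wrapper["audio_count_loud"].dropna()  # keeps only the daily grid
minfreq_5h = results_wrapper["audio_min_freq"].dropna()    # keeps only the 5-hour grid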
4.2.3 Wrapper and its default option
The default option will compute all features in 30-minute aggregation windows. To use the extract_features_audio function with its default options, simply call the function.
[22]:
default = au.extract_features_audio(data, features=None)
computing <function audio_count_silent at 0x7f5c3a65f420>...
computing <function audio_count_speech at 0x7f5c3a65f4c0>...
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
computing <function audio_max_freq at 0x7f5c3a65f6a0>...
computing <function audio_mean_freq at 0x7f5c3a65f740>...
computing <function audio_median_freq at 0x7f5c3a65f7e0>...
computing <function audio_std_freq at 0x7f5c3a65f880>...
computing <function audio_min_db at 0x7f5c3a65f920>...
computing <function audio_max_db at 0x7f5c3a65f9c0>...
computing <function audio_mean_db at 0x7f5c3a65fa60>...
computing <function audio_median_db at 0x7f5c3a65fb00>...
computing <function audio_std_db at 0x7f5c3a65fba0>...
[23]:
default.head()
[23]:
| datetime | user | device | audio_count_silent | audio_count_speech | audio_count_loud | audio_min_freq | audio_max_freq | audio_mean_freq | audio_median_freq | audio_std_freq | audio_min_db | audio_max_db | audio_mean_db | audio_median_db | audio_std_db |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-08-13 07:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 0 | NaN | 1 | 7735.0 | 7735.0 | 7735.0 | 7735.0 | NaN | 51.0 | 51.0 | 51.0 | 51.0 | NaN |
2019-08-13 07:30:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 0 | NaN | 1 | 13609.0 | 13609.0 | 13609.0 | 13609.0 | NaN | 90.0 | 90.0 | 90.0 | 90.0 | NaN |
2019-08-13 08:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 0 | NaN | 1 | 7690.0 | 7690.0 | 7690.0 | 7690.0 | NaN | 81.0 | 81.0 | 81.0 | 81.0 | NaN |
2019-08-13 08:30:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 0 | NaN | 1 | 8347.0 | 8347.0 | 8347.0 | 8347.0 | NaN | 58.0 | 58.0 | 58.0 | 58.0 | NaN |
2019-08-13 09:00:00+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1 | NaN | 1 | 13592.0 | 13592.0 | 13592.0 | 13592.0 | NaN | 36.0 | 36.0 | 36.0 | 36.0 | NaN |
5. Implementing own features
If none of the provided functions suits our needs, we can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by timestamps, with a column identifying the user (the example below builds this from a per-user grouping). To make the feature readily available in the default options, we need to add the audio prefix to the new function’s name (e.g. audio_my_new_feature). Let’s assume we need a new function that sums all frequencies. Let’s first define the function.
[24]:
def audio_sum_freq(df, config=None):
    # Fall back to an empty config so the defaults below apply
    if config is None:
        config = {}
    if not "audio_column_name" in config:
        col_name = "double_frequency"
    else:
        col_name = config["audio_column_name"]
    if not "resample_args" in config.keys():
        config["resample_args"] = {"rule": "30T"}
    if len(df) > 0:
        # Sum the chosen column per user within each resampling bin
        result = df.groupby('user')[col_name].resample(**config["resample_args"]).sum()
        result = result.to_frame(name='audio_sum_freq')
        result = result.reset_index("user")
        result.index.rename("datetime", inplace=True)
        return result
    return None
Then, we can call our new function in the stand-alone way or using the extract_features_audio function. Because the stand-alone way is the common way to call functions in Python, we will not show it. Instead, we will show how to integrate this new function with the wrapper. Let’s use the data again and assume we want the default behavior of the wrapper.
[25]:
customized_features = au.extract_features_audio(data, features={audio_sum_freq: {}})
computing <function audio_sum_freq at 0x7f5c683977e0>...
[26]:
customized_features.head()
[26]:
| datetime | user | audio_sum_freq |
|---|---|---|
2019-08-13 07:00:00+03:00 | iGyXetHE3S8u | 7735 |
2019-08-13 07:30:00+03:00 | iGyXetHE3S8u | 13609 |
2019-08-13 08:00:00+03:00 | iGyXetHE3S8u | 7690 |
2019-08-13 08:30:00+03:00 | iGyXetHE3S8u | 8347 |
2019-08-13 09:00:00+03:00 | iGyXetHE3S8u | 13592 |