{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Application data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"Application data refers to the information about which apps are open at a certain time. These data can reveal important information about people's circadian rhythm, social patterns, and activity. Application data is event data, which means it is not sampled at a regular frequency. Instead, we only have information about the events that occurred.\n",
"\n",
"An application data dataframe contains a list of application events. Each row describes a single event and is indexed by a timestamp. The dataframe contains the following columns (column names can be different, but in that case they must be provided as parameters):\n",
"- `user`: Subject ID.\n",
"- `device`: Device ID.\n",
"- `app_column`: Contains the name of an application used.\n",
"\n",
"There are two main issues with application data: (1) missing data detection, and (2) privacy concerns.\n",
"\n",
"Regarding missing data detection, we may never know whether all events were detected and reported, and unfortunately there is little we can do about it. Nevertheless, we can take into account some factors that may interfere with the correct detection of all events (e.g. when the phone's battery is depleted). Therefore, to correctly process application data, we need to consider other information, such as the battery status of the phone. \n",
"Regarding the privacy concerns, application names can reveal too much about a subject; for example, the use of an uncommon app may help identify a subject. Consequently, we try to anonymize the data by grouping the apps. \n",
"\n",
"To address both of these issues, `niimpy` includes the function `extract_features_app` to clean, downsample, and extract features from application data while taking into account factors like the battery level and app groups. In addition, `niimpy` provides a map with some of the common apps for pseudo-anonymization. This function employs other functions to extract the following features:\n",
"\n",
"- `app_count`: number of times an app group has been used \n",
"- `app_duration`: how long an app group has been used\n",
"\n",
"The app module has one internal function that helps classify the apps into groups. \n",
"\n",
"In the following, we will analyze the example application data provided by `niimpy` to illustrate how these functions are used."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Read data\n",
"\n",
"Let's start by reading the example data provided in `niimpy`. These data have already been shaped in a format that meets the requirements of the data schema. First, we import the needed modules: the `niimpy` package and the module we will use (application), which we give a short name for convenience. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import niimpy\n",
"from niimpy import config\n",
"import niimpy.preprocessing.application as app\n",
"import pandas as pd\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's read the example data provided in `niimpy`. The example data is in `csv` format, so we need to use the `read_csv` function. When reading the data, we can specify the timezone where the data was collected with the argument `tz`. This makes it easier to handle daylight saving time. The output is a dataframe. We can also check the number of rows and columns in the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(99, 6)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = niimpy.read_csv(config.SINGLEUSER_AWARE_APP_PATH, tz='Europe/Helsinki')\n",
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data was successfully read. We can see that there are 99 datapoints with 6 columns in the dataset. However, we do not know yet what the data really looks like, so let's have a quick look:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user device \\\n",
"2019-08-05 14:02:51.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n",
"2019-08-05 14:02:58.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n",
"2019-08-05 14:19:57.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n",
"2019-08-05 14:19:35.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n",
"2019-08-05 14:20:12.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n",
"\n",
" time application_name \\\n",
"2019-08-05 14:02:51.009999990+03:00 1.565003e+09 Android System \n",
"2019-08-05 14:02:58.009999990+03:00 1.565003e+09 Android System \n",
"2019-08-05 14:19:57.009999990+03:00 1.565004e+09 Google Play Music \n",
"2019-08-05 14:19:35.009999990+03:00 1.565004e+09 Google Play Music \n",
"2019-08-05 14:20:12.009999990+03:00 1.565004e+09 Gmail \n",
"\n",
" package_name \\\n",
"2019-08-05 14:02:51.009999990+03:00 android \n",
"2019-08-05 14:02:58.009999990+03:00 android \n",
"2019-08-05 14:19:57.009999990+03:00 com.google.android.music \n",
"2019-08-05 14:19:35.009999990+03:00 com.google.android.music \n",
"2019-08-05 14:20:12.009999990+03:00 com.google.android.gm \n",
"\n",
" datetime \n",
"2019-08-05 14:02:51.009999990+03:00 2019-08-05 14:02:51.009999990+03:00 \n",
"2019-08-05 14:02:58.009999990+03:00 2019-08-05 14:02:58.009999990+03:00 \n",
"2019-08-05 14:19:57.009999990+03:00 2019-08-05 14:19:57.009999990+03:00 \n",
"2019-08-05 14:19:35.009999990+03:00 2019-08-05 14:19:35.009999990+03:00 \n",
"2019-08-05 14:20:12.009999990+03:00 2019-08-05 14:20:12.009999990+03:00 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By exploring the head of the dataframe we can form an idea of its entirety. From the data, we can see that:\n",
"\n",
"- rows are observations, indexed by timestamps, i.e. each row records that an app has been brought up on the smartphone screen\n",
"- columns are characteristics for each observation, for example, the user whose data we are analyzing\n",
"- there is one main column: `application_name`, which stores the Android name for the application.\n",
"\n",
"#### A few words on missing data\n",
"Missing application data is difficult to detect. Firstly, this sensor is triggered by events (i.e. not sampled at a fixed frequency). Secondly, different phones, operating systems, and settings change how easy it is to detect apps. Thirdly, events not related to the application sensor may affect its behavior, e.g. the battery running out. Unfortunately, we can only correct missing data for events such as the screen turning off, by using data from the screen sensor and the battery level. These can be taken into account in `niimpy` if we provide the screen and battery data. We will see some examples below.\n",
"\n",
"#### A few words on grouping the apps\n",
"As previously mentioned, the application name may reveal too much about a subject and privacy problems may arise. A possible solution to this problem is to classify the apps into more generic groups. For example, apps like WhatsApp, Signal, Telegram, etc. are commonly used for texting, so we can group them under the label *texting*. `niimpy` provides a default map, but this should be adapted to the characteristics of the sample, since apps are available depending on countries and populations. "
]
},
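{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grouping idea can be sketched with plain `pandas`. This is a minimal illustration under stated assumptions: `example_map` and the label `na` for unmapped apps are hypothetical choices made for this sketch, not `niimpy`'s built-in map or its internal grouping function.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical app-to-group map, for illustration only (not niimpy's default map).\n",
"example_map = {\n",
"    'WhatsApp': 'texting',\n",
"    'Signal': 'texting',\n",
"    'Telegram': 'texting',\n",
"    'Gmail': 'comm',\n",
"}\n",
"\n",
"apps = pd.Series(['WhatsApp', 'Gmail', 'Signal', 'Some Rare App'])\n",
"\n",
"# Names missing from the map become NaN; replace them with a generic label.\n",
"groups = apps.map(example_map).fillna('na')\n",
"print(groups.tolist())\n",
"```\n",
"\n",
"Because unmapped names fall back to a generic label, an uncommon app never appears by name in the grouped data."
]
},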
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A few words on the role of the battery and screen\n",
"As mentioned before, sometimes the screen may be OFF and these events will not be caught by the application data sensor. For example, we can open an app and let it remain open until the phone screen turns off automatically. Another example is when the battery is depleted and the phone shuts down automatically. Having this information is crucial for correctly computing how long a subject used each app group. `niimpy`'s application module is adapted to take into account both the screen and the battery data. \n",
"For this example, we have both, so let's load the screen and battery data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"bat_data = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')\n",
"screen_data = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user device time \\\n",
"2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n",
"2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n",
"2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n",
"\n",
" battery_level battery_status \\\n",
"2020-01-09 02:20:02.924999952+02:00 74 3 \n",
"2020-01-09 02:21:30.405999899+02:00 73 3 \n",
"2020-01-09 02:24:12.805999994+02:00 72 3 \n",
"2020-01-09 02:35:38.561000109+02:00 72 2 \n",
"2020-01-09 02:35:38.953000069+02:00 72 2 \n",
"\n",
" battery_health battery_adaptor \\\n",
"2020-01-09 02:20:02.924999952+02:00 2 0 \n",
"2020-01-09 02:21:30.405999899+02:00 2 0 \n",
"2020-01-09 02:24:12.805999994+02:00 2 0 \n",
"2020-01-09 02:35:38.561000109+02:00 2 0 \n",
"2020-01-09 02:35:38.953000069+02:00 2 2 \n",
"\n",
" datetime \n",
"2020-01-09 02:20:02.924999952+02:00 2020-01-09 02:20:02.924999952+02:00 \n",
"2020-01-09 02:21:30.405999899+02:00 2020-01-09 02:21:30.405999899+02:00 \n",
"2020-01-09 02:24:12.805999994+02:00 2020-01-09 02:24:12.805999994+02:00 \n",
"2020-01-09 02:35:38.561000109+02:00 2020-01-09 02:35:38.561000109+02:00 \n",
"2020-01-09 02:35:38.953000069+02:00 2020-01-09 02:35:38.953000069+02:00 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bat_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataframe looks fine. In this case, we are interested in the battery_status information. This is standard information provided by Android. However, if the dataframe stores this information in a column with a different name, we can use the argument `battery_column_name` and input our custom battery column name (again, we will have an example below)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user device time \\\n",
"2020-01-09 02:06:41.573999882+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578528e+09 \n",
"2020-01-09 02:09:29.151999950+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 \n",
"2020-01-09 02:09:32.790999889+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 \n",
"2020-01-09 02:11:41.996000051+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 \n",
"2020-01-09 02:16:19.010999918+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 \n",
"\n",
" screen_status \\\n",
"2020-01-09 02:06:41.573999882+02:00 0 \n",
"2020-01-09 02:09:29.151999950+02:00 1 \n",
"2020-01-09 02:09:32.790999889+02:00 3 \n",
"2020-01-09 02:11:41.996000051+02:00 0 \n",
"2020-01-09 02:16:19.010999918+02:00 1 \n",
"\n",
" datetime \n",
"2020-01-09 02:06:41.573999882+02:00 2020-01-09 02:06:41.573999882+02:00 \n",
"2020-01-09 02:09:29.151999950+02:00 2020-01-09 02:09:29.151999950+02:00 \n",
"2020-01-09 02:09:32.790999889+02:00 2020-01-09 02:09:32.790999889+02:00 \n",
"2020-01-09 02:11:41.996000051+02:00 2020-01-09 02:11:41.996000051+02:00 \n",
"2020-01-09 02:16:19.010999918+02:00 2020-01-09 02:16:19.010999918+02:00 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"screen_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataframe looks fine too. In this case, we are interested in the `screen_status` information, which also consists of standardized values provided by Android. The column does not need to be named \"screen_status\", as we can pass its name later on. We will see an example later. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## * TIP! Data format requirements (or what our data should look like)\n",
"\n",
"Data can take other shapes and formats. However, the `niimpy` data schema requires it to be in a certain shape. This means the application dataframe needs to have at least the following characteristics:\n",
"1. One row per app prompt. Each row should store information about one app prompt only\n",
"2. Each row's index should be a timestamp\n",
"3. There should be at least three columns: \n",
" - index: date and time when the event happened (timestamp)\n",
" - user: stores the user name whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)\n",
" - application_name: stores the Android application name\n",
"4. Columns additional to those listed in item 3 are allowed\n",
"5. The names of the columns do not need to be exactly \"user\" and \"application_name\", as we can pass our own names in an argument (to be explained later).\n",
"\n",
"Below is an example of a dataframe that complies with these minimum requirements:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user application_name\n",
"2019-08-05 14:02:51.009999990+03:00 dvWdLQesv21a Android System\n",
"2019-08-05 14:02:58.009999990+03:00 dvWdLQesv21a Android System\n",
"2019-08-05 14:19:57.009999990+03:00 dvWdLQesv21a Google Play Music"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_dataschema = data[['user','application_name']]\n",
"example_dataschema.head(3)"
]
},
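{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minimum requirements listed above can also be checked programmatically. The helper below is a hypothetical sketch written for this tutorial (`check_app_schema` is not part of `niimpy`); it only checks that the index is a timestamp index and that the two required columns are present.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def check_app_schema(df, user_column='user', app_column='application_name'):\n",
"    # Hypothetical helper, not part of niimpy: returns a list of schema problems.\n",
"    problems = []\n",
"    if not isinstance(df.index, pd.DatetimeIndex):\n",
"        problems.append('index is not a DatetimeIndex')\n",
"    for column in (user_column, app_column):\n",
"        if column not in df.columns:\n",
"            problems.append(f'missing column: {column}')\n",
"    return problems\n",
"\n",
"# Toy dataframe with made-up values that follows the minimum schema.\n",
"toy = pd.DataFrame(\n",
"    {'user': ['u1', 'u1'], 'application_name': ['Gmail', 'Android System']},\n",
"    index=pd.to_datetime(['2019-08-05 14:02:51', '2019-08-05 14:02:58']).tz_localize('Europe/Helsinki'),\n",
")\n",
"print(check_app_schema(toy))\n",
"```\n",
"\n",
"An empty list means that no problems were found. If the columns have other names, they can be passed through `user_column` and `app_column`."
]
},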
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, if we employ screen and battery data, these dataframes need to fulfill the minimum data schema requirements. We will briefly show examples of screen and battery dataframes that comply with the minimum requirements."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user screen_status\n",
"2020-01-09 02:06:41.573999882+02:00 jd9INuQ5BBlW 0\n",
"2020-01-09 02:09:29.151999950+02:00 jd9INuQ5BBlW 1\n",
"2020-01-09 02:09:32.790999889+02:00 jd9INuQ5BBlW 3"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_screen_dataschema = screen_data[['user','screen_status']]\n",
"example_screen_dataschema.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" user battery_status\n",
"2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3\n",
"2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3\n",
"2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_battery_dataschema = bat_data[['user','battery_status']]\n",
"example_battery_dataschema.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Extracting features\n",
"There are two ways to extract features. We could use each function separately or we could use `niimpy`'s ready-made wrapper. Both ways will require us to specify arguments to pass to the functions/wrapper in order to customize the way the functions work. These arguments are specified in dictionaries. Let's first understand how to extract features using stand-alone functions.\n",
"\n",
"### 4.1 Extract features using stand-alone functions\n",
"We can use `niimpy`'s functions to compute application features. Each function will require two inputs:\n",
"- (mandatory) dataframe that must comply with the minimum requirements (see '* TIP! Data format requirements' above)\n",
"- (optional) arguments for stand-alone functions\n",
"\n",
"#### 4.1.1 The arguments for stand-alone functions (or how we specify the way a function works)\n",
"We can input two types of arguments to customize the way a stand-alone function works:\n",
"- the name of the columns to be preprocessed: Since the dataframe may have different columns, we need to specify which column has the data we would like to be preprocessed. To do so, we can simply pass the name of the column to the argument `app_column_name`. \n",
"\n",
"- the way we resample: resampling options are specified in `niimpy` as a dictionary. `niimpy`'s resampling and aggregating relies on `pandas.DataFrame.resample`, so mastering the use of this pandas function will help us greatly in `niimpy`'s preprocessing. Please familiarize yourself with the pandas resample function before continuing. \n",
" Briefly, to use the `pandas.DataFrame.resample` function, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15-seconds, 30-minutes, 1-hour). Nevertheless, we can input more details into the function to specify the exact sampling we would like. For example, we could use the *closed* argument if we would like to specify which side of the interval is closed, or we could use the *offset* argument if we would like to start our binning with an offset, etc. There are plenty of options for this command, so we strongly recommend having the `pandas.DataFrame.resample` documentation at hand. All options for `pandas.DataFrame.resample` are specified in a dictionary whose keys are the names of the arguments of `pandas.DataFrame.resample` and whose values are the values for those arguments. This dictionary is passed to the argument `resample_args` in `niimpy`.\n",
"\n",
"Let's see some examples of these parameters:"
]
},
{
"cell_type": "markdown",
"id": "db2d0c37",
"metadata": {},
"source": [
"```python\n",
"app.app_count(data, app_column_name = \"application_name\", resample_args = {\"rule\":\"1D\"})\n",
"app.app_count(data, app_column_name = \"other_name\", screen_column_name = \"screen_name\", resample_args = {\"rule\":\"1D\"})\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we have two feature function calls. \n",
"\n",
"- The first example will analyze the data stored in the column `application_name` in our dataframe. The data will be binned in one-day periods.\n",
"- The second example will analyze the data stored in the column `other_name` in our dataframe. In addition, we state that the screen data is stored in the column \"screen_name\". The data will again be binned in one-day periods. \n",
"\n",
"**Default values:** if no arguments are passed, `niimpy`'s default values are \"application_name\" for the app_column_name, \"screen_status\" for the screen_column_name, and \"battery_status\" for the battery_column_name. We will also use the default 30-min aggregation bins."
]
},
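{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the role of the resampling dictionary concrete, here is a minimal sketch that uses only `pandas`. The timestamps are made up for illustration, and unpacking the dictionary directly into `resample` is an assumption of this sketch about how such a dictionary can be used, not a description of `niimpy`'s internals.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy events (one per timestamp); the values are made up for illustration.\n",
"events = pd.Series(\n",
"    1,\n",
"    index=pd.to_datetime([\n",
"        '2019-08-05 14:02:51',\n",
"        '2019-08-05 14:19:35',\n",
"        '2019-08-05 14:20:12',\n",
"        '2019-08-05 15:40:00',\n",
"    ]),\n",
")\n",
"\n",
"# Keys are argument names of pandas resample; values are their settings.\n",
"resample_args = {'rule': '30min', 'closed': 'left'}\n",
"counts = events.resample(**resample_args).sum()\n",
"print(counts)\n",
"```\n",
"\n",
"Each 30-minute bin holds the number of events that fall into it, and bins without events are reported as 0."
]
},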
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.1.2 Using the functions\n",
"Now that we understand how the functions are customized, it is time to compute our first application feature. Suppose that we are interested in extracting the number of times each app group has been used within 1-minute bins. We will need `niimpy`'s `app_count` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" app_group device user count\n",
"datetime \n",
"2019-08-05 14:20:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 2\n",
"2019-08-05 14:21:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 14:22:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 14:23:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 14:24:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_app_count = app.app_count(\n",
" data,\n",
" bat_data,\n",
" screen_data,\n",
" app_column_name = \"application_name\",\n",
" resample_args = {\"rule\":\"1T\"}\n",
")\n",
"my_app_count.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the bins are indeed 1-minute bins. However, they are aligned to fixed, predetermined intervals, i.e. a bin does not start at the time of the first datapoint. Instead, `pandas` starts the binning at 00:00:00 of each day and counts 1-minute intervals from there. \n",
"\n",
"If we want the binning to start from the first datapoint in our dataset, we need the `origin` parameter and a for loop."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" app_group device user count\n",
"datetime \n",
"2019-08-05 14:20:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 2\n",
"2019-08-05 14:21:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 14:22:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 14:23:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 14:24:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"... ... ... ... ...\n",
"2019-08-07 06:49:00+03:00 work i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-07 06:50:00+03:00 work i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-07 06:51:00+03:00 work i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-07 06:52:00+03:00 work i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-07 06:53:00+03:00 work i8jmoIuoe12Mo dvWdLQesv21a 1\n",
"\n",
"[10872 rows x 4 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users = list(data['user'].unique())\n",
"results = []\n",
"for user in users:\n",
" # Start the bins at this user's first datapoint\n",
" start_time = data[data[\"user\"]==user].index.min()\n",
" results.append(app.app_count(\n",
" data[data[\"user\"]==user],\n",
" bat_data[bat_data[\"user\"]==user],\n",
" screen_data[screen_data[\"user\"]==user],\n",
" app_column_name = \"application_name\",\n",
" resample_args = {\"rule\":\"1T\", \"origin\": start_time}\n",
" ))\n",
"my_app_count = pd.concat(results)\n",
"my_app_count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare the timestamps and notice the small difference in this example. In the previous `app_count` call, the first timestamp is at 14:02:00, whereas in the new `app_count`, the first timestamp is at 14:02:42."
]
},
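{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of the `origin` argument can be reproduced with plain `pandas` on a toy series. This is a minimal sketch with made-up timestamps; it does not use the example dataset or any `niimpy` function.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"index = pd.to_datetime(['2019-08-05 14:02:51', '2019-08-05 14:03:20', '2019-08-05 14:04:55'])\n",
"events = pd.Series(1, index=index)\n",
"\n",
"# Default origin ('start_day'): bins are aligned to midnight, hence to whole minutes.\n",
"default_bins = events.resample('1min').sum()\n",
"\n",
"# origin set to the first timestamp: the first bin starts exactly at the first event.\n",
"shifted_bins = events.resample('1min', origin=events.index.min()).sum()\n",
"\n",
"print(default_bins.index[0], shifted_bins.index[0])\n",
"```\n",
"\n",
"With the default origin the first bin starts at 14:02:00, while with `origin` set to the first timestamp it starts at 14:02:51. The counts in each bin change accordingly."
]
},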
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The functions can also be called in the absence of battery or screen data. In this case, the function does not account for when the screen is turned off or when the battery is depleted."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"no_bat = app.app_count(data, screen=screen_data) #no battery information\n",
"no_screen = app.app_count(data, bat=bat_data) #no screen information\n",
"no_bat_no_screen = app.app_count(data) #no battery and no screen information"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" app_group device user count\n",
"datetime \n",
"2019-08-05 14:00:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 2\n",
"2019-08-05 14:30:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 1\n",
"2019-08-05 15:00:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 15:30:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 16:00:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"no_bat.head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" app_group | \n",
" device | \n",
" user | \n",
" count | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-05 14:00:00+03:00 | \n",
" comm | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" 2 | \n",
"
\n",
" \n",
" 2019-08-05 14:30:00+03:00 | \n",
" comm | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" 1 | \n",
"
\n",
" \n",
" 2019-08-05 15:00:00+03:00 | \n",
" comm | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" 0 | \n",
"
\n",
" \n",
" 2019-08-05 15:30:00+03:00 | \n",
" comm | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" 0 | \n",
"
\n",
" \n",
" 2019-08-05 16:00:00+03:00 | \n",
" comm | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" app_group device user count\n",
"datetime \n",
"2019-08-05 14:00:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 2\n",
"2019-08-05 14:30:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 1\n",
"2019-08-05 15:00:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 15:30:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0\n",
"2019-08-05 16:00:00+03:00 comm i8jmoIuoe12Mo dvWdLQesv21a 0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"no_screen.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see some small differences between these two dataframes. For example, the no_screen dataframe includes the app_group \"off\", as it has taken into account the battery data and knows when the phone has been shut down. "
]
},
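{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check which app groups appear in one result but not in the other is to compare the sets of `app_group` values. The cell below is a minimal sketch on toy dataframes; `groups_only_in`, `toy_no_bat`, and `toy_no_screen` are hypothetical names introduced for illustration and are not part of `niimpy`. With the real results, the same helper could be called as `groups_only_in(no_screen, no_bat)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def groups_only_in(left, right, column='app_group'):\n",
"    # Values of `column` that appear in `left` but not in `right`\n",
"    return sorted(set(left[column]) - set(right[column]))\n",
"\n",
"# Toy stand-ins for the outputs of app.app_count (hypothetical values)\n",
"toy_no_bat = pd.DataFrame({'app_group': ['comm', 'work'], 'count': [2, 1]})\n",
"toy_no_screen = pd.DataFrame({'app_group': ['comm', 'work', 'off'], 'count': [2, 1, 1]})\n",
"\n",
"groups_only_in(toy_no_screen, toy_no_bat)"
]
},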
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 Extract features using the wrapper\n",
"We can use `niimpy`'s ready-made wrapper to extract one or several features at the same time. The wrapper will require two inputs:\n",
"- (mandatory) dataframe that must comply with the minimum requirements (see '* TIP! Data requirements above)\n",
"- (optional) an argument dictionary for wrapper\n",
"\n",
"#### 4.2.1 The argument dictionary for wrapper (or how we specify the way the wrapper works)\n",
"The argument dictionary contains the arguments for each stand-alone function we would like to employ. Its keys are the feature functions we want to compute. Its values are argument dictionaries created for each stand-alone function we will employ. \n",
"Let's see some examples of wrapper dictionaries:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"wrapper_features1 = {app.app_count:{\"app_column_name\":\"application_name\", \"resample_args\":{\"rule\":\"1T\"}},\n",
" app.app_duration:{\"app_column_name\":\"some_name\", \"screen_column_name\":\"screen_name\", \"battery_column_name\":\"battery_name\", \"resample_args\":{\"rule\":\"1T\"}}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `wrapper_features1` will be used to analyze two features, `app_count` and `app_duration`. For the feature app_count, we will use the data stored in the column `application_name` in our dataframe and the data will be binned in one-minute periods. For the feature app_duration, we will use the data stored in the column `some_name` in our dataframe and the data will be binned in one day periods. In addition, we will also employ screen and battery data which are stored in the columns `screen_name` and `battery_name`."
]
},
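{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feeling for what the binning arguments do, the cell below is a minimal sketch that applies them directly with pandas' `resample` on toy event data. It assumes that the `resample_args` dictionary is forwarded to `resample` as keyword arguments, which is how the custom feature in section 5 uses it; the toy dataframe `events` is hypothetical. The sketch shows `rule` on its own and together with `offset`, which shifts the bin edges. It writes the rule as `1min` and the offset as a `Timedelta`, because recent pandas versions deprecate the `T` and `S` aliases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy event data (hypothetical): one row per app event, indexed by timestamp\n",
"events = pd.DataFrame(\n",
"    {'application_name': ['Mail', 'Chat', 'Chat', 'Mail']},\n",
"    index=pd.to_datetime([\n",
"        '2019-08-05 14:02:42',\n",
"        '2019-08-05 14:02:55',\n",
"        '2019-08-05 14:03:10',\n",
"        '2019-08-05 14:05:01',\n",
"    ]),\n",
")\n",
"\n",
"# One-minute bins aligned to full minutes\n",
"per_minute = events['application_name'].resample(rule='1min').count()\n",
"\n",
"# One-minute bins whose edges are shifted by 15 seconds\n",
"per_minute_offset = events['application_name'].resample(\n",
"    rule='1min', offset=pd.Timedelta(seconds=15)\n",
").count()\n",
"\n",
"print(per_minute)\n",
"print(per_minute_offset)"
]
},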
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"wrapper_features2 = {app.app_count:{\"app_column_name\":\"application_name\", \"resample_args\":{\"rule\":\"1T\", \"offset\":\"15S\"}},\n",
" app.app_duration:{\"app_column_name\":\"some_name\", \"screen_column_name\":\"screen_name\", \"battery_column_name\":\"battery_name\", \"resample_args\":{\"rule\":\"30S\"}}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `wrapper_features2` will be used to analyze two features, `app_count` and `app_duration`. For the feature app_count, we will use the data stored in the column `application_name` in our dataframe and the data will be binned in one-minute periods with a 15-seconds offset. For the feature app_duration, we will use the data stored in the column `some_name` in our dataframe and the data will be binned in 30-second periods. In addition, we will also employ screen and battery data which are stored in the columns `screen_name` and `battery_name`.\n",
"\n",
"**Default values:** if no arguments are passed, `niimpy`'s default values are \"application_name\" for the app_column_name, \"screen_status\" for the screen_column_name, \"battery_status\" for the battery_column_name, and 30-min aggregation bins. Moreover, the wrapper will compute all the available functions in absence of the argument dictionary. Similarly to the use of functions, we may input empty dataframes if we do not have screen or battery data. \n",
"\n",
"#### 4.2.2 Using the wrapper\n",
"Now that we understand how the wrapper is customized, it is time we compute our first application feature using the wrapper. Suppose that we are interested in extracting the call total duration every 30 seconds. We will need `niimpy`'s `extract_features_apps` function, the data, and we will also need to create a dictionary to customize our function. Let's create the dictionary first"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"wrapper_features1 = {app.app_count:{\"app_column_name\":\"application_name\", \"group_by_columns\":['user','device','app_group'], \"resample_args\":{\"rule\":\"30S\"}}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use the wrapper"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" device | \n",
" user | \n",
" app_group | \n",
" count | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-05 14:20:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 2 | \n",
"
\n",
" \n",
" 2019-08-05 14:20:30+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0 | \n",
"
\n",
" \n",
" 2019-08-05 14:21:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0 | \n",
"
\n",
" \n",
" 2019-08-05 14:21:30+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0 | \n",
"
\n",
" \n",
" 2019-08-05 14:22:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" device user app_group count\n",
"datetime \n",
"2019-08-05 14:20:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 2\n",
"2019-08-05 14:20:30+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0\n",
"2019-08-05 14:21:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0\n",
"2019-08-05 14:21:30+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0\n",
"2019-08-05 14:22:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results_wrapper = app.extract_features_app(data, bat_data, screen_data, features=wrapper_features1)\n",
"results_wrapper.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our first attempt was succesful. Now, let's try something more. Let's assume we want to compute the app_count and app_duration in 20-seconds bins. Moreover, let's assume we do not want to use the screen or battery data this time. Note that the app_duration values are in seconds. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(141, 5)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" device | \n",
" user | \n",
" app_group | \n",
" count | \n",
" duration | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-05 14:20:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 2.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2019-08-05 14:20:20+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2019-08-05 14:20:40+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2019-08-05 14:21:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2019-08-05 14:21:20+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" device user app_group count \\\n",
"datetime \n",
"2019-08-05 14:20:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 2.0 \n",
"2019-08-05 14:20:20+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 14:20:40+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 14:21:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 14:21:20+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"\n",
" duration \n",
"datetime \n",
"2019-08-05 14:20:00+03:00 NaN \n",
"2019-08-05 14:20:20+03:00 NaN \n",
"2019-08-05 14:20:40+03:00 NaN \n",
"2019-08-05 14:21:00+03:00 NaN \n",
"2019-08-05 14:21:20+03:00 NaN "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wrapper_features2 = {app.app_count:{\"app_column_name\":\"application_name\", \"resample_args\":{\"rule\":\"20S\"}},\n",
" app.app_duration:{\"app_column_name\":\"application_name\", \"resample_args\":{\"rule\":\"1H\"}}}\n",
"results_wrapper = app.extract_features_app(data, features=wrapper_features2)\n",
"results_wrapper.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great! Another successful attempt. We see from the results that more columns were added with the required calculations. We also see that some durations are in negative numbers, this may be due to the lack of screen and battery data. This is how the wrapper works when all features are computed with the same bins. Now, let's see how the wrapper performs when each function has different binning requirements. Let's assume we need to compute the app_count every 20 seconds, and the app_duration every 10 seconds."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(161705, 5)\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" device | \n",
" user | \n",
" app_group | \n",
" count | \n",
" duration | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-05 14:20:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 2.0 | \n",
" 14.99 | \n",
"
\n",
" \n",
" 2019-08-05 14:20:20+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 20.00 | \n",
"
\n",
" \n",
" 2019-08-05 14:20:40+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 20.00 | \n",
"
\n",
" \n",
" 2019-08-05 14:21:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 20.00 | \n",
"
\n",
" \n",
" 2019-08-05 14:21:20+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 20.00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" device user app_group count \\\n",
"datetime \n",
"2019-08-05 14:20:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 2.0 \n",
"2019-08-05 14:20:20+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 14:20:40+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 14:21:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 14:21:20+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"\n",
" duration \n",
"datetime \n",
"2019-08-05 14:20:00+03:00 14.99 \n",
"2019-08-05 14:20:20+03:00 20.00 \n",
"2019-08-05 14:20:40+03:00 20.00 \n",
"2019-08-05 14:21:00+03:00 20.00 \n",
"2019-08-05 14:21:20+03:00 20.00 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wrapper_features3 = {app.app_count:{\"app_column_name\":\"application_name\", \"group_by_columns\": ['user','group', 'device', 'app_group'], \"resample_args\":{\"rule\":\"20S\"}},\n",
" app.app_duration:{\"app_column_name\":\"application_name\", \"group_by_columns\": ['user','group', 'device'], \"resample_args\":{\"rule\":\"20S\"}}}\n",
"results_wrapper = app.extract_features_app(data, bat_data, screen_data, features=wrapper_features3)\n",
"results_wrapper.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" device | \n",
" user | \n",
" app_group | \n",
" count | \n",
" duration | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-07 03:48:40+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" utility | \n",
" NaN | \n",
" 20.00 | \n",
"
\n",
" \n",
" 2019-08-07 03:49:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" utility | \n",
" NaN | \n",
" 20.00 | \n",
"
\n",
" \n",
" 2019-08-07 03:49:20+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" utility | \n",
" NaN | \n",
" 20.00 | \n",
"
\n",
" \n",
" 2019-08-07 03:49:40+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" utility | \n",
" NaN | \n",
" 18.01 | \n",
"
\n",
" \n",
" 2019-08-07 06:53:20+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" work | \n",
" NaN | \n",
" 15.01 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" device user app_group count \\\n",
"datetime \n",
"2019-08-07 03:48:40+03:00 i8jmoIuoe12Mo dvWdLQesv21a utility NaN \n",
"2019-08-07 03:49:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a utility NaN \n",
"2019-08-07 03:49:20+03:00 i8jmoIuoe12Mo dvWdLQesv21a utility NaN \n",
"2019-08-07 03:49:40+03:00 i8jmoIuoe12Mo dvWdLQesv21a utility NaN \n",
"2019-08-07 06:53:20+03:00 i8jmoIuoe12Mo dvWdLQesv21a work NaN \n",
"\n",
" duration \n",
"datetime \n",
"2019-08-07 03:48:40+03:00 20.00 \n",
"2019-08-07 03:49:00+03:00 20.00 \n",
"2019-08-07 03:49:20+03:00 20.00 \n",
"2019-08-07 03:49:40+03:00 18.01 \n",
"2019-08-07 06:53:20+03:00 15.01 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results_wrapper.tail(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output is once again a dataframe. In this case, two aggregations are shown. The first one is the 20-seconds aggregation computed for the `app_count` feature (head). The second one is the 10-seconds aggregation period with 5-seconds offset for the `app_duration` (tail). Because the `app_count` feature is not required to be aggregated every 10 seconds, the aggregation timestamps have a NaN value. Similarly, because the `app_duration` is not required to be aggregated in 20-seconds windows, its values are NaN for all subjects. "
]
},
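{
"cell_type": "markdown",
"metadata": {},
"source": [
"NaN values like these typically appear when per-feature tables are combined on keys that do not fully overlap. The cell below is a minimal sketch with toy data that uses an outer join in pandas; this is an assumption made for illustration and not a description of `niimpy`'s internals. The dataframes `count` and `duration` and their values are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy per-feature results (hypothetical values)\n",
"count = pd.DataFrame({\n",
"    'datetime': pd.to_datetime(['2019-08-05 14:20:00', '2019-08-05 14:20:20']),\n",
"    'user': ['u1', 'u1'],\n",
"    'app_group': ['comm', 'comm'],\n",
"    'count': [2, 0],\n",
"})\n",
"duration = pd.DataFrame({\n",
"    'datetime': pd.to_datetime(['2019-08-05 14:20:00', '2019-08-05 14:20:20']),\n",
"    'user': ['u1', 'u1'],\n",
"    'app_group': ['comm', 'utility'],\n",
"    'duration': [14.99, 20.0],\n",
"})\n",
"\n",
"# Outer join: keep every key combination found in either table;\n",
"# a feature is NaN where it has no row for that combination\n",
"merged = count.merge(duration, on=['datetime', 'user', 'app_group'], how='outer')\n",
"merged"
]
},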
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4.2.3 Wrapper and its default option\n",
"The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_apps` function with its default options, simply call the function. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2004, 5)\n"
]
}
],
"source": [
"default = app.extract_features_app(data, bat_data, screen_data, features=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function prints the computed features so you can track its process. Now, let's have a look at the outputs"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" device | \n",
" user | \n",
" app_group | \n",
" count | \n",
" duration | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-05 14:00:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 2.0 | \n",
" 594.99 | \n",
"
\n",
" \n",
" 2019-08-05 14:30:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 1.0 | \n",
" 1800.00 | \n",
"
\n",
" \n",
" 2019-08-05 15:00:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 1800.00 | \n",
"
\n",
" \n",
" 2019-08-05 15:30:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 1800.00 | \n",
"
\n",
" \n",
" 2019-08-05 16:00:00+03:00 | \n",
" i8jmoIuoe12Mo | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 0.0 | \n",
" 1800.00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" device user app_group count \\\n",
"datetime \n",
"2019-08-05 14:00:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 2.0 \n",
"2019-08-05 14:30:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 1.0 \n",
"2019-08-05 15:00:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 15:30:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"2019-08-05 16:00:00+03:00 i8jmoIuoe12Mo dvWdLQesv21a comm 0.0 \n",
"\n",
" duration \n",
"datetime \n",
"2019-08-05 14:00:00+03:00 594.99 \n",
"2019-08-05 14:30:00+03:00 1800.00 \n",
"2019-08-05 15:00:00+03:00 1800.00 \n",
"2019-08-05 15:30:00+03:00 1800.00 \n",
"2019-08-05 16:00:00+03:00 1800.00 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"default.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Implementing own features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If none of the provided functions suits well, We can implement our own customized features easily. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and app_groups (multiindex).\n",
"Let's assume we need a new function that computes the maximum duration. Let's first define the function."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"def app_max_duration(df, bat=None, screen=None, group_map=app.MAP_APP, resample_args = {\"rule\":\"30T\"}):\n",
" if group_map is None:\n",
" group_map = app.MAP_APP\n",
" \n",
" df2 = app.classify_app(df, group_map=group_map)\n",
" df2['duration']=np.nan\n",
" df2['duration']=df2['datetime'].diff()\n",
" df2['duration'] = df2['duration'].shift(-1)\n",
" thr = pd.Timedelta('10 hours')\n",
" df2 = df2[~(df2.duration>thr)]\n",
" df2 = df2[~(df2.duration>thr)]\n",
" df2[\"duration\"] = df2[\"duration\"].dt.total_seconds()\n",
" \n",
" df2.dropna(inplace=True)\n",
" \n",
" if len(df2)>0:\n",
" df2['datetime'] = pd.to_datetime(df2['datetime'])\n",
" df2.set_index('datetime', inplace=True)\n",
" result = df2.groupby([\"user\",\"app_group\"])[\"duration\"].resample(**resample_args).max()\n",
" \n",
" return result.reset_index([\"user\",\"app_group\"])"
]
},
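{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the duration logic inside such a function does, the cell below is a minimal sketch that applies the same steps (time to the next event, then the maximum per user, app group, and time bin) to a small toy dataframe. It uses only pandas; the dataframe `toy` and its values are made up for illustration, and the `30min` rule stands in for `30T`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy app events (hypothetical values), already classified into app groups\n",
"toy = pd.DataFrame({\n",
"    'user': ['u1'] * 4,\n",
"    'app_group': ['comm', 'work', 'comm', 'work'],\n",
"    'datetime': pd.to_datetime([\n",
"        '2019-08-05 14:00:00',\n",
"        '2019-08-05 14:05:00',\n",
"        '2019-08-05 14:20:00',\n",
"        '2019-08-05 14:50:00',\n",
"    ]),\n",
"})\n",
"\n",
"# Duration of an event: seconds elapsed until the next event\n",
"toy['duration'] = toy['datetime'].diff().shift(-1).dt.total_seconds()\n",
"\n",
"# The last event has no successor, so its duration is NaN and the row is dropped\n",
"toy = toy.dropna().set_index('datetime')\n",
"\n",
"# Maximum duration per user, app group, and 30-minute bin\n",
"max_duration = toy.groupby(['user', 'app_group'])['duration'].resample('30min').max()\n",
"max_duration"
]
},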
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we can call our new function in the stand-alone way or using the `extract_features_app` function. Alternatively, we can pass the feature function to the wrapper. Let's read again the data and assume we want the default behavior of the wrapper. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"customized_features = app.extract_features_app(data, bat_data, screen_data, features={app_max_duration: {}})"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" app_group | \n",
" duration | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-08-05 14:00:00+03:00 | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 997.0 | \n",
"
\n",
" \n",
" 2019-08-05 14:30:00+03:00 | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" 7018.0 | \n",
"
\n",
" \n",
" 2019-08-05 15:00:00+03:00 | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" NaN | \n",
"
\n",
" \n",
" 2019-08-05 15:30:00+03:00 | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" NaN | \n",
"
\n",
" \n",
" 2019-08-05 16:00:00+03:00 | \n",
" dvWdLQesv21a | \n",
" comm | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user app_group duration\n",
"datetime \n",
"2019-08-05 14:00:00+03:00 dvWdLQesv21a comm 997.0\n",
"2019-08-05 14:30:00+03:00 dvWdLQesv21a comm 7018.0\n",
"2019-08-05 15:00:00+03:00 dvWdLQesv21a comm NaN\n",
"2019-08-05 15:30:00+03:00 dvWdLQesv21a comm NaN\n",
"2019-08-05 16:00:00+03:00 dvWdLQesv21a comm NaN"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"customized_features.head()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" {}\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" device | \n",
" user | \n",
" battery_gap | \n",
"
\n",
" \n",
" datetime | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2019-01-17 09:30:00+02:00 | \n",
" 3Zkk0bhWmyny | \n",
" Afxzi7oI0yyp | \n",
" 0 days 00:04:26.149666560 | \n",
"
\n",
" \n",
" 2019-01-17 09:00:00+02:00 | \n",
" iMTB2alwYk1B | \n",
" wAzQNrdKZZax | \n",
" 0 days 00:00:36.003333376 | \n",
"
\n",
" \n",
" 2019-01-17 09:30:00+02:00 | \n",
" n8rndM6J5_4B | \n",
" lb983ODxEFUD | \n",
" 0 days 00:01:00.453499904 | \n",
"
\n",
" \n",
" 2019-01-17 10:00:00+02:00 | \n",
" n8rndM6J5_4B | \n",
" lb983ODxEFUD | \n",
" 0 days 00:01:16.099000064 | \n",
"
\n",
" \n",
" 2019-01-17 10:30:00+02:00 | \n",
" n8rndM6J5_4B | \n",
" lb983ODxEFUD | \n",
" 0 days 00:31:43.900999936 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" device user \\\n",
"datetime \n",
"2019-01-17 09:30:00+02:00 3Zkk0bhWmyny Afxzi7oI0yyp \n",
"2019-01-17 09:00:00+02:00 iMTB2alwYk1B wAzQNrdKZZax \n",
"2019-01-17 09:30:00+02:00 n8rndM6J5_4B lb983ODxEFUD \n",
"2019-01-17 10:00:00+02:00 n8rndM6J5_4B lb983ODxEFUD \n",
"2019-01-17 10:30:00+02:00 n8rndM6J5_4B lb983ODxEFUD \n",
"\n",
" battery_gap \n",
"datetime \n",
"2019-01-17 09:30:00+02:00 0 days 00:04:26.149666560 \n",
"2019-01-17 09:00:00+02:00 0 days 00:00:36.003333376 \n",
"2019-01-17 09:30:00+02:00 0 days 00:01:00.453499904 \n",
"2019-01-17 10:00:00+02:00 0 days 00:01:16.099000064 \n",
"2019-01-17 10:30:00+02:00 0 days 00:31:43.900999936 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from pandas import Timestamp\n",
"import numpy as np\n",
"import pytest\n",
"\n",
"import niimpy\n",
"from niimpy.preprocessing.util import TZ\n",
"\n",
"df11 = pd.DataFrame(\n",
" {\"user\": ['wAzQNrdKZZax'] * 3 + ['Afxzi7oI0yyp'] * 3 + ['lb983ODxEFUD'] * 4,\n",
" \"device\": ['iMTB2alwYk1B'] * 3 + ['3Zkk0bhWmyny'] * 3 + ['n8rndM6J5_4B'] * 4,\n",
" \"time\": [1547709614.05, 1547709686.036, 1547709722.06, 1547710540.99, 1547710688.469, 1547711339.439,\n",
" 1547711831.275, 1547711952.182, 1547712028.281, 1547713932.182],\n",
" \"battery_level\": [96, 96, 95, 95, 94, 93, 94, 94, 94, 92],\n",
" \"battery_status\": ['3'] * 5 + ['-2', '2', '3', '-2', '2'],\n",
" \"battery_health\": ['2'] * 10,\n",
" \"battery_adaptor\": ['0'] * 5 + ['1', '1', '0', '0', '1'],\n",
" \"datetime\": ['2019-01-17 09:20:14.049999872+02:00', '2019-01-17 09:21:26.036000+02:00',\n",
" '2019-01-17 09:22:02.060000+02:00',\n",
" '2019-01-17 09:35:40.990000128+02:00', '2019-01-17 09:38:08.469000192+02:00',\n",
" '2019-01-17 09:48:59.438999808+02:00',\n",
" '2019-01-17 09:57:11.275000064+02:00', '2019-01-17 09:59:12.181999872+02:00',\n",
" '2019-01-17 10:00:28.280999936+02:00', '2019-01-17 10:32:12.181999872+02:00'\n",
" ]\n",
" })\n",
"df11['datetime'] = pd.to_datetime(df11['datetime'])\n",
"df11 = df11.set_index('datetime', drop=False)\n",
"\n",
"df = df11.copy()\n",
"k = niimpy.preprocessing.battery.battery_gaps\n",
"gaps = niimpy.preprocessing.battery.extract_features_battery(df, features={k: {}})\n",
"\n",
"gaps\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "niimpy",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}