Screen On/Off Data

Introduction

Screen data refers to the information about the status of the screen as reported by Android. These data can reveal important information about people’s circadian rhythm, social patterns, and activity. Screen data is event data: it cannot be sampled at a regular frequency, so we only have information about the events that occurred.

A dataframe with screen data should contain the following columns (column names can be different, but in that case they must be provided as parameters):

user: Subject ID
device: Device ID
screen_status: An integer indicating the new screen status. https://awareframework.com/battery/
- Status 0: off
- Status 1: on
- Status 2: locked
- Status 3: unlocked

Some factors may interfere with the correct detection of all screen events (e.g. when the phone’s battery is depleted). Therefore, to correctly process screen data, we need to take into account other information, like the battery status of the phone, which may complicate the preprocessing. To address this, niimpy includes a set of functions to clean, downsample, and extract features from screen data while taking into account factors like the battery level. The functions allow us to extract the following features:

screen_off: reports when the screen has been turned off
screen_count: number of times the screen has turned on, off, or has been in use
screen_duration: duration in seconds of the screen on, off, and in use statuses
screen_duration_min: minimum duration in seconds of the screen on, off, and in use statuses
screen_duration_max: maximum duration in seconds of the screen on, off, and in use statuses
screen_duration_median: median duration in seconds of the screen on, off, and in use statuses
screen_duration_mean: mean duration in seconds of the screen on, off, and in use statuses
screen_duration_std: standard deviation of the duration in seconds of the screen on, off, and in use statuses
screen_first_unlock: reports the first time when the phone was unlocked every day
extract_features_screen: wrapper-like function to extract several features at the same time

In addition, the screen module has three internal functions that help classify the events and calculate their status duration.

In the following, we will analyze screen data provided by niimpy as an example to illustrate the use of screen data.

2. Read data

We will use the example data provided in niimpy. These data have already been shaped in a format that meets the requirements of the data schema. Let’s start by importing the needed modules: first the niimpy package, and then the module we will use (screen), which we give a short name for convenience.

[1]:

import niimpy
from niimpy import config
import niimpy.preprocessing.screen as s
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Now let’s read the example data provided in niimpy. The example data is in csv format, so we need to use the read_csv function. When reading the data, we can specify the timezone where the data was collected with the argument tz. This will help us handle daylight saving time more easily. The output is a dataframe. We can also check the number of rows and columns in the dataframe.

[2]:

data = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')
data.shape

[2]:

(343, 5)

The data was successfully read. We can see that there are 343 datapoints with 5 columns in the dataset. However, we do not know yet what the data really looks like, so let’s have a quick look:

[3]:

data.head()

[3]:

	user	device	time	screen_status	datetime
2020-01-09 02:06:41.573999882+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578528e+09	0	2020-01-09 02:06:41.573999882+02:00
2020-01-09 02:09:29.151999950+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	1	2020-01-09 02:09:29.151999950+02:00
2020-01-09 02:09:32.790999889+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	3	2020-01-09 02:09:32.790999889+02:00
2020-01-09 02:11:41.996000051+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	0	2020-01-09 02:11:41.996000051+02:00
2020-01-09 02:16:19.010999918+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	1	2020-01-09 02:16:19.010999918+02:00

[4]:

data.tail()

[4]:

	user	device	time	screen_status	datetime
2019-08-07 17:42:41.009999990+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	1.565189e+09	2	2019-08-07 17:42:41.009999990+03:00
2019-08-07 18:32:41.009999990+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	1.565192e+09	1	2019-08-07 18:32:41.009999990+03:00
2019-08-07 19:22:41.009999990+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	1.565195e+09	0	2019-08-07 19:22:41.009999990+03:00
2019-08-07 20:12:41.009999990+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	1.565198e+09	1	2019-08-07 20:12:41.009999990+03:00
2019-08-07 21:02:41.009999990+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	1.565201e+09	2	2019-08-07 21:02:41.009999990+03:00

By exploring the head and tail of the dataframe we can form an idea of its entirety. From the data, we can see that:

rows are observations, indexed by timestamps, i.e. each row represents a screen event at a given time and date
columns are characteristics for each observation, for example, the user whose data we are analyzing
there are at least two different users in the dataframe
the main column is screen_status. This screen status is coded in numbers as: 0=off, 1=on, 2=locked, 3=unlocked.

* TIP! Data format requirements (or what should our data look like)

Data can take other shapes and formats. However, the niimpy data schema requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:

One row per screen status. Each row should store information about one screen status only
Each row’s index should be a timestamp
There should be at least three columns:
- index: date and time when the event happened (timestamp)
- user: stores the user name whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)
- screen_status: stores the screen status (0,1,2, or 3) as defined by Android.
Columns additional to those listed in item 3 are allowed
The names of the columns do not need to be exactly “screen_status” as we can pass our own names in an argument (to be explained later).

Below is an example of a dataframe that complies with these minimum requirements

[5]:

example_dataschema = data[['user','screen_status']]
example_dataschema.head(3)

[5]:

	user	screen_status
2020-01-09 02:06:41.573999882+02:00	jd9INuQ5BBlW	0
2020-01-09 02:09:29.151999950+02:00	jd9INuQ5BBlW	1
2020-01-09 02:09:32.790999889+02:00	jd9INuQ5BBlW	3
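The requirements above can also be checked programmatically. The following is a minimal sketch in plain pandas, not part of niimpy: the helper name check_screen_schema and the toy values are made up for illustration, and the check only covers the index type, the two required columns, and the status codes.

```python
import pandas as pd

def check_screen_schema(df, user_column="user", screen_column_name="screen_status"):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    # Each row's index should be a timestamp
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    # The user and screen status columns must be present
    for column in (user_column, screen_column_name):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    # Screen statuses are expected to be 0, 1, 2, or 3
    if screen_column_name in df.columns:
        unexpected = set(df[screen_column_name].unique().tolist()) - {0, 1, 2, 3}
        if unexpected:
            problems.append(f"unexpected status codes: {sorted(unexpected)}")
    return problems

# Toy frame (made-up values) that satisfies the requirements
toy = pd.DataFrame(
    {"user": ["u1", "u1", "u1"], "screen_status": [0, 1, 3]},
    index=pd.to_datetime(
        ["2020-01-09 02:06:41", "2020-01-09 02:09:29", "2020-01-09 02:09:32"]
    ),
)
ok_problems = check_screen_schema(toy)

# The same frame without a timestamp index and with a renamed user column
broken = toy.reset_index(drop=True).rename(columns={"user": "subject"})
bad_problems = check_screen_schema(broken)
print(ok_problems)   # []
print(bad_problems)  # ['index is not a DatetimeIndex', 'missing column: user']
```

If our columns have other names, we would pass them through user_column and screen_column_name, mirroring the way niimpy accepts custom column names.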

A few words on missing data

Missing data for screen is difficult to detect. Firstly, this sensor is triggered by events and not sampled at a fixed frequency. Secondly, different phones, OS versions, and settings change how the screen is turned on/off; for example, one phone may go from OFF to ON to UNLOCKED, while another phone may go from OFF to UNLOCKED directly. Thirdly, events not related to the screen may affect its behavior, e.g. the battery running out. Nevertheless, some event transitions are impossible, like a status followed by itself (e.g. two consecutive 0s). These impossible transitions help us detect missing data.
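As a minimal sketch of that last idea, the snippet below flags positions where a status is followed by itself. It is plain Python with made-up status codes, and it is not how niimpy detects missing data internally; in a real dataset the check would have to be run separately for each user and device.

```python
def find_repeated_statuses(statuses):
    """Return the positions i where statuses[i] repeats statuses[i - 1]."""
    return [i for i in range(1, len(statuses)) if statuses[i] == statuses[i - 1]]

# Made-up sequence of screen statuses for a single device
statuses = [0, 1, 3, 0, 0, 1, 2, 2]
suspect_positions = find_repeated_statuses(statuses)
print(suspect_positions)  # [4, 7]
```

Each flagged position suggests that at least one event between the two identical statuses was not recorded.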

A few words on the classification of the events

We can know the status of the screen at a certain timepoint. However, we need a bit more to know the duration and the meaning of it. Consequently, we need to look at the numbers of two consecutive events and classify the transitions (going from one state to another consecutively) as:

from 3 to 0,1,2: the phone was in use
from 1 to 0,1,3: the phone was on
from 0 to 1,2,3: the phone was off

Other transitions are irrelevant.
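To make the classification concrete, here is a small sketch in plain Python. The transition table is copied from the list above; the function name classify_transitions, the event times (whole seconds), and the labels "use", "on", and "off" are our own choices for illustration and do not reproduce niimpy’s internal functions.

```python
# Transition table copied from the list above. The label describes the state
# the phone was in between the first and the second event of the pair.
TRANSITIONS = {
    3: ({0, 1, 2}, "use"),
    1: ({0, 1, 3}, "on"),
    0: ({1, 2, 3}, "off"),
}

def classify_transitions(events):
    """events: list of (time_in_seconds, status) sorted by time.
    Returns one (label, duration_in_seconds) pair per relevant transition."""
    intervals = []
    for (t_prev, s_prev), (t_next, s_next) in zip(events, events[1:]):
        targets, label = TRANSITIONS.get(s_prev, (set(), None))
        if s_next in targets:
            intervals.append((label, t_next - t_prev))
    return intervals

# Made-up events: (seconds since the first event, screen status)
events = [(0, 0), (168, 1), (172, 3), (301, 0), (578, 1), (589, 0), (589, 2)]
intervals = classify_transitions(events)

# Total duration per label
totals = {}
for label, duration in intervals:
    totals[label] = totals.get(label, 0) + duration
print(totals)  # {'off': 445, 'on': 15, 'use': 129}
```

The duration of a state is simply the time between the event that starts it and the next event, which is why we need pairs of consecutive events rather than single rows.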

A few words on the role of the battery

As mentioned before, battery statuses can affect the screen behavior. In particular, when the battery is depleted and the phone shuts down automatically, the screen sensor does not emit any events. So even though the screen is technically OFF because the phone has no battery left, we will not see that 0 in the screen status column. Thus, it is important to take the battery information into account when analyzing screen data. niimpy’s screen module is adapted to take the battery data into account. Since we do have some battery data, we will load it.

[6]:

bat_data = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')
bat_data.head()

[6]:

	user	device	time	battery_level	battery_status	battery_health	battery_adaptor	datetime
2020-01-09 02:20:02.924999952+02:00	jd9INuQ5BBlW	3p83yASkOb_B	1.578529e+09	74	3	2	0	2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00	jd9INuQ5BBlW	3p83yASkOb_B	1.578529e+09	73	3	2	0	2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00	jd9INuQ5BBlW	3p83yASkOb_B	1.578529e+09	72	3	2	0	2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00	jd9INuQ5BBlW	3p83yASkOb_B	1.578530e+09	72	2	2	0	2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00	jd9INuQ5BBlW	3p83yASkOb_B	1.578530e+09	72	2	2	2	2020-01-09 02:35:38.953000069+02:00

In this case, we are interested in the battery_status information. This is standard information provided by Android. However, if the dataframe has this information in a column with a different name, we can use the argument battery_column_name similarly to the use of screen_column_name (more info about this topic below).

4. Extracting features

There are two ways to extract features. We could use each function separately or we could use niimpy’s ready-made wrapper. Both ways will require us to specify arguments to pass to the functions/wrapper in order to customize the way the functions work. These arguments are specified in dictionaries. Let’s first understand how to extract features using stand-alone functions.

We can use niimpy’s functions to compute screen features. Each function will require two inputs:

(mandatory) dataframe that must comply with the minimum requirements (see ‘* TIP! Data format requirements’ above)
(optional) arguments for stand-alone functions

4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works)

We can input two types of arguments to customize the way a stand-alone function works:

the name of the columns to be preprocessed: Since the dataframe may have different columns, we need to specify which column has the data we would like to be preprocessed. To do so, we can simply pass the name of the column to the argument screen_column_name.
the way we resample: resampling options are specified in niimpy as a dictionary. niimpy’s resampling and aggregating relies on pandas.DataFrame.resample, so mastering the use of this pandas function will help us greatly in niimpy’s preprocessing. Please familiarize yourself with the pandas resample function before continuing. Briefly, to use the pandas.DataFrame.resample function, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15-seconds, 30-minutes, 1-hour). Nevertheless, we can pass more details to the function to specify the exact sampling we would like. For example, we could use the closed argument if we would like to specify which side of the interval is closed, or we could use the offset argument if we would like to start our binning with an offset, etc. There are plenty of options, so we strongly recommend having the pandas.DataFrame.resample documentation at hand. All arguments for pandas.DataFrame.resample are specified in a dictionary whose keys are the argument names and whose values are the values for each of these arguments. This dictionary is passed to niimpy as the value of resample_args.

Let’s see some basic examples of these dictionaries:

s.screen_count(data, screen_column_name = "screen_status", resample_args = {"rule":"1D"})
s.screen_count(data, screen_column_name = "random_name", resample_args = {"rule":"30T"})
s.screen_count(data, screen_column_name = "other_name", resample_args = {"rule":"45T","origin":"end"})

Here, we have three basic feature dictionaries.

The first example will analyze the data stored in the column screen_status in our dataframe. The data will be binned in one-day periods
The second example will analyze the data stored in the column random_name in our dataframe. The data will be aggregated in 30-minute bins
The third example will analyze the data stored in the column other_name in our dataframe. The data will be binned in 45-minute bins, but the binning will start from the last timestamp in the dataframe.

Default values: if no arguments are passed, niimpy’s default values are “screen_status” for the screen_column_name, and 30-min aggregation bins.
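Since the resampling dictionary maps directly onto the arguments of the pandas resample function, it can help to try a dictionary in plain pandas first. The sketch below uses made-up timestamps and no niimpy code; it unpacks two dictionaries into resample and compares the resulting bins. The rule is written as "30min", which denotes the same frequency as the "30T" alias used in this notebook.

```python
import pandas as pd

# Four made-up events, each counted as 1
events = pd.Series(
    1,
    index=pd.to_datetime([
        "2020-01-09 00:05", "2020-01-09 00:20",
        "2020-01-09 00:40", "2020-01-09 01:10",
    ]),
)

plain_args = {"rule": "30min"}
offset_args = {"rule": "30min", "offset": "5min"}

# The dictionary keys become keyword arguments of resample
plain = events.resample(**plain_args).sum()
shifted = events.resample(**offset_args).sum()

print(plain.index[0], plain.tolist())      # 2020-01-09 00:00:00 [2, 1, 1]
print(shifted.index[0], shifted.tolist())  # 2020-01-09 00:05:00 [2, 1, 1]
```

The counts happen to be identical here, but the bin edges differ: with the offset, every bin starts 5 minutes later.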

4.1.2 Using the functions

Now that we understand how the functions are customized, it is time we compute our first screen feature. Suppose that we are interested in counting the screen events every 20 minutes. We will need niimpy’s screen_count function.

[7]:

my_screen_count = s.screen_count(data, bat_data, screen_column_name = "screen_status", resample_args = {"rule":"20T"})

Let’s look at some values for one of the subjects.

[8]:

my_screen_count[my_screen_count["user"] == "jd9INuQ5BBlW"]

[8]:

	user	screen_on_count	screen_off_count	screen_use_count	device
2020-01-09 02:00:00+02:00	jd9INuQ5BBlW	2	2	2	OWd1Uau8POix
2020-01-09 02:20:00+02:00	jd9INuQ5BBlW	3	4	2	OWd1Uau8POix
2020-01-09 02:40:00+02:00	jd9INuQ5BBlW	2	2	1	OWd1Uau8POix
2020-01-09 03:00:00+02:00	jd9INuQ5BBlW	0	0	0	OWd1Uau8POix
2020-01-09 03:20:00+02:00	jd9INuQ5BBlW	0	0	0	OWd1Uau8POix
...	...	...	...	...	...
2020-01-09 21:40:00+02:00	jd9INuQ5BBlW	1	1	0	OWd1Uau8POix
2020-01-09 22:00:00+02:00	jd9INuQ5BBlW	1	1	0	OWd1Uau8POix
2020-01-09 22:20:00+02:00	jd9INuQ5BBlW	0	0	0	OWd1Uau8POix
2020-01-09 22:40:00+02:00	jd9INuQ5BBlW	0	0	0	OWd1Uau8POix
2020-01-09 23:00:00+02:00	jd9INuQ5BBlW	4	3	0	OWd1Uau8POix

64 rows × 5 columns

Let’s remember what the original data looked like for this subject

[9]:

data[data["user"]=="jd9INuQ5BBlW"].head(7)

[9]:

	user	device	time	screen_status	datetime
2020-01-09 02:06:41.573999882+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578528e+09	0	2020-01-09 02:06:41.573999882+02:00
2020-01-09 02:09:29.151999950+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	1	2020-01-09 02:09:29.151999950+02:00
2020-01-09 02:09:32.790999889+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	3	2020-01-09 02:09:32.790999889+02:00
2020-01-09 02:11:41.996000051+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	0	2020-01-09 02:11:41.996000051+02:00
2020-01-09 02:16:19.010999918+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	1	2020-01-09 02:16:19.010999918+02:00
2020-01-09 02:16:29.648999929+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	0	2020-01-09 02:16:29.648999929+02:00
2020-01-09 02:16:29.657999992+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.578529e+09	2	2020-01-09 02:16:29.657999992+02:00

We see that the bins are indeed 20-minute bins. However, they are aligned to fixed, predetermined intervals, i.e. the bins do not start at the time of the first datapoint. Instead, pandas starts the binning at 00:00:00 of every day and counts 20-minute intervals from there.

If we want the binning to start from the first datapoint in our dataset, we need the origin parameter and a for loop.

[10]:

users = list(data['user'].unique())
results = []
for user in users:
    start_time = data[data["user"]==user].index.min()
    results.append(s.screen_count(
        data[data["user"]==user],
        bat_data[bat_data["user"]==user],
        screen_column_name = "screen_status",
        resample_args = {"rule":"20T","origin":start_time}
    ))
my_screen_count = pd.concat(results)

[11]:

my_screen_count

[11]:

	user	screen_on_count	screen_off_count	screen_use_count	device
2020-01-09 02:06:41.573999882+02:00	jd9INuQ5BBlW	4	3	3	OWd1Uau8POix
2020-01-09 02:26:41.573999882+02:00	jd9INuQ5BBlW	2	3	1	OWd1Uau8POix
2020-01-09 02:46:41.573999882+02:00	jd9INuQ5BBlW	2	2	1	OWd1Uau8POix
2020-01-09 03:06:41.573999882+02:00	jd9INuQ5BBlW	0	0	0	OWd1Uau8POix
2020-01-09 03:26:41.573999882+02:00	jd9INuQ5BBlW	0	0	0	OWd1Uau8POix
...	...	...	...	...	...
2019-08-07 17:52:41.009999990+03:00	dvWdLQesv21a	0	0	0	i8jmoIuoe12Mo
2019-08-07 18:12:41.009999990+03:00	dvWdLQesv21a	0	0	0	i8jmoIuoe12Mo
2019-08-07 18:32:41.009999990+03:00	dvWdLQesv21a	1	0	0	i8jmoIuoe12Mo
2019-08-07 18:52:41.009999990+03:00	dvWdLQesv21a	0	0	0	i8jmoIuoe12Mo
2019-08-07 19:12:41.009999990+03:00	dvWdLQesv21a	0	1	0	i8jmoIuoe12Mo

2689 rows × 5 columns
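The effect of the origin argument can be reproduced in plain pandas, independently of niimpy. The sketch below uses three made-up timestamps and compares the default alignment with bins anchored at the first datapoint.

```python
import pandas as pd

# Three made-up events, each counted as 1
stamps = pd.to_datetime([
    "2020-01-09 02:06:41", "2020-01-09 02:09:29", "2020-01-09 02:30:00",
])
events = pd.Series(1, index=stamps)

# Default: bins are aligned to midnight, so the first bin starts at 02:00:00
default_bins = events.resample("20min").sum()

# origin set to the first timestamp: the first bin starts at 02:06:41
anchored_bins = events.resample("20min", origin=events.index.min()).sum()

print(default_bins.index[0])   # 2020-01-09 02:00:00
print(anchored_bins.index[0])  # 2020-01-09 02:06:41
```

Because each user has a different first timestamp, a single origin value cannot serve all users, which is why the cell above loops over the users.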

The functions can also be called without any optional arguments. In this case, the binning will be automatically set to 30 minutes.

[12]:

my_screen_count = s.screen_count(data, bat_data)
my_screen_count.head()

[12]:

	user	screen_on_count	screen_off_count	device
2019-08-05 15:30:00+03:00	dvWdLQesv21a	0	0	i8jmoIuoe12Mo
2019-08-05 16:00:00+03:00	dvWdLQesv21a	0	0	i8jmoIuoe12Mo
2019-08-05 16:30:00+03:00	dvWdLQesv21a	1	0	i8jmoIuoe12Mo
2019-08-05 17:00:00+03:00	dvWdLQesv21a	0	1	i8jmoIuoe12Mo
2019-08-05 17:30:00+03:00	dvWdLQesv21a	0	0	i8jmoIuoe12Mo

If we do not have battery data, the functions can also be called without it. In this case, either omit the battery argument, as in the call below, or input an empty dataframe in the second position of the function. For example,

[13]:

empty_bat = pd.DataFrame()
no_bat = s.screen_count(
    data,
    screen_column_name = "screen_status",
    resample_args = {"rule":"20T","origin":start_time}
) #no battery information
no_bat.head()

[13]:

	user	screen_on_count	device
2019-08-05 15:32:41.009999990+03:00	dvWdLQesv21a	0	i8jmoIuoe12Mo
2019-08-05 15:52:41.009999990+03:00	dvWdLQesv21a	0	i8jmoIuoe12Mo
2019-08-05 16:12:41.009999990+03:00	dvWdLQesv21a	0	i8jmoIuoe12Mo
2019-08-05 16:32:41.009999990+03:00	dvWdLQesv21a	1	i8jmoIuoe12Mo
2019-08-05 16:52:41.009999990+03:00	dvWdLQesv21a	0	i8jmoIuoe12Mo

4.2 Extract features using the wrapper

We can use niimpy’s ready-made wrapper to extract one or several features at the same time. The wrapper will require two inputs:

(mandatory) dataframe that must comply with the minimum requirements (see ‘* TIP! Data format requirements’ above)
(optional) an argument dictionary for wrapper

4.2.1 The argument dictionary for wrapper (or how we specify the way the wrapper works)

The argument dictionary contains the arguments for each stand-alone function we would like to employ. Its keys are the feature functions we want to compute. Its values are argument dictionaries created for each stand-alone function we will employ. Let’s see some examples of wrapper dictionaries:

[14]:

wrapper_features1 = {s.screen_count:{"screen_column_name":"screen_status","resample_args":{"rule":"1D"}},
                     s.screen_duration_min:{"screen_column_name":"screen_status","resample_args":{"rule":"1D"}}}

wrapper_features1 will be used to analyze two features, screen_count and screen_duration_min. For the feature screen_count, we will use the data stored in the column screen_status in our dataframe and the data will be binned in one day periods. For the feature screen_duration_min, we will use the data stored in the column screen_status in our dataframe and the data will be binned in one day periods.

[15]:

wrapper_features2 = {s.screen_count:{"screen_column_name":"screen_status", "battery_column_name":"battery_status", "resample_args":{"rule":"1D"}},
                     s.screen_duration:{"screen_column_name":"random_name","resample_args":{"rule":"5H","offset":"5min"}}}

wrapper_features2 will be used to analyze two features, screen_count and screen_duration. For the feature screen_count, we will use the data stored in the column screen_status in our dataframe and the data will be binned in one-day periods. In addition, we will use battery data stored in a column called “battery_status”. For the feature screen_duration, we will use the data stored in the column random_name in our dataframe and the data will be binned in 5-hour periods with a 5-minute offset.

[16]:

wrapper_features3 = {s.screen_count:{"screen_column_name":"one_name","resample_args":{"rule":"1D","offset":"5min"}},
                     s.screen_duration:{"screen_column_name":"one_name", "battery_column_name":"some_column","resample_args":{}},
                     s.screen_duration_min:{"screen_column_name":"another_name","resample_args":{"rule":"30T","origin":"end_day"}}}

wrapper_features3 will be used to analyze three features, screen_count, screen_duration, and screen_duration_min. For the feature screen_count, we will use the data stored in the column one_name and the data will be binned in one day periods with a 5-min offset. For the feature screen_duration, we will use the data stored in the column one_name in our dataframe and the data will be binned using the default settings, i.e. 30-min bins. In addition, we will use data from the battery sensor, which will be passed in a column called “some_column”. Finally, for the feature screen_duration_min, we will use the data stored in the column another_name in our dataframe and the data will be binned in 30-minute periods and the origin of the bins will be the ceiling midnight of the last day.

Default values: if no arguments are passed, niimpy’s default values are “screen_status” for the screen_column_name, and 30-min aggregation bins. Moreover, the wrapper will compute all the available functions in absence of the argument dictionary.
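The pattern behind such a dictionary, with functions as keys and keyword arguments as values, can be illustrated with a toy example. The sketch below is plain Python and does not reproduce niimpy’s implementation of extract_features_screen; the two toy features and the helper extract_features are hypothetical and only show how one dictionary can drive several function calls, including the fallback to all features when no dictionary is given.

```python
def count_events(statuses, status=1):
    """Toy feature: how many times a given status code appears."""
    return sum(1 for s in statuses if s == status)

def longest_run(statuses, status=0):
    """Toy feature: longest consecutive run of a given status code."""
    best = current = 0
    for s in statuses:
        current = current + 1 if s == status else 0
        best = max(best, current)
    return best

def extract_features(statuses, features=None):
    """Call every feature function with its own keyword arguments."""
    if features is None:
        # No dictionary given: run every known feature with its defaults
        features = {count_events: {}, longest_run: {}}
    return {func.__name__: func(statuses, **kwargs) for func, kwargs in features.items()}

statuses = [0, 1, 3, 0, 0, 1, 2]  # made-up status codes
custom = extract_features(statuses, {count_events: {"status": 0}})
default = extract_features(statuses)
print(custom)   # {'count_events': 3}
print(default)  # {'count_events': 2, 'longest_run': 2}
```

An empty inner dictionary means "use the defaults of that function", which matches how the empty dictionaries in the examples above are described.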

4.2.2 Using the wrapper

Now that we understand how the wrapper is customized, it is time we compute our first screen feature using the wrapper. Suppose that we are interested in extracting the total screen duration every 50 minutes. We will need niimpy’s extract_features_screen function, the data, and we will also need to create a dictionary to customize our function. Let’s create the dictionary first

[17]:

wrapper_features1 = {s.screen_duration:{"screen_column_name":"screen_status","resample_args":{"rule":"50T"}}}

Now, let’s use the wrapper

[18]:

results_wrapper = s.extract_features_screen(data, bat_data, features=wrapper_features1)
results_wrapper.head(5)

[18]:

	user	device	screen_off_durationtotal	screen_use_durationtotal	screen_on_durationtotal
2019-08-05 15:50:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaN	NaN	3000.0
2019-08-05 16:40:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	3000.0	NaN	0.0
2019-08-05 17:30:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	0.0	3000.0	0.0
2019-08-05 18:20:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	0.0	0.0	0.0
2019-08-05 19:10:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	0.0	0.0	3000.0

Our first attempt was successful. Now, let’s try something more. Let’s assume we want to compute the screen_duration and screen_count in 50-minute bins.

[19]:

wrapper_features2 = {s.screen_duration:{"screen_column_name":"screen_status","resample_args":{"rule":"50T"}},
                     s.screen_count:{"screen_column_name":"screen_status","resample_args":{"rule":"50T"}}}
results_wrapper = s.extract_features_screen(data, bat_data, features=wrapper_features2)
results_wrapper.head(5)

[19]:

	user	device	screen_off_durationtotal	screen_use_durationtotal	screen_on_durationtotal	screen_on_count	screen_off_count	screen_use_count
2019-08-05 15:50:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaN	NaN	3000.0	1	0	0
2019-08-05 16:40:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	3000.0	NaN	0.0	0	1	0
2019-08-05 17:30:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	0.0	3000.0	0.0	0	0	1
2019-08-05 18:20:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	0.0	0.0	0.0	0	0	0
2019-08-05 19:10:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	0.0	0.0	3000.0	1	0	0

Great! Another successful attempt. We see from the results that more columns were added with the required calculations. This is how the wrapper works when all features are computed with the same bins. Now, let’s see how the wrapper performs when each function has different binning requirements. Let’s assume we need to compute the screen_duration every day, and the screen_count every 5 hours with an offset of 5 minutes.

[20]:

wrapper_features3 = {s.screen_duration:{"screen_column_name":"screen_status","resample_args":{"rule":"1D"}},
                     s.screen_count:{"screen_column_name":"screen_status","resample_args":{"rule":"5H","offset":"5min"}}}
results_wrapper = s.extract_features_screen(data, bat_data, features=wrapper_features3)
results_wrapper.head(5)

[20]:

	user	device	screen_off_durationtotal	screen_use_durationtotal	screen_on_durationtotal	screen_on_count	screen_off_count	screen_use_count
2019-08-05 00:00:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	6000.000000	6000.000	9000.000	NaN	NaN	NaN
2019-08-06 00:00:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	24000.000000	15000.000	27000.000	NaN	NaN	NaN
2019-08-07 00:00:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	15000.000000	15000.000	21000.000	NaN	NaN	NaN
2019-08-05 00:00:00+03:00	iGyXetHE3S8u	Cq9vueHh3zVs	287266.927999	1.189	276.382	NaN	NaN	NaN
2019-08-06 00:00:00+03:00	iGyXetHE3S8u	Cq9vueHh3zVs	0.000000	0.000	0.000	NaN	NaN	NaN

[21]:

results_wrapper.tail(5)

[21]:

	user	device	screen_off_durationtotal	screen_use_durationtotal	screen_on_durationtotal	screen_on_count	screen_off_count	screen_use_count
2020-01-09 00:05:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	NaN	NaN	NaN	7.0	8.0	5.0
2020-01-09 05:05:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	NaN	NaN	NaN	0.0	0.0	0.0
2020-01-09 10:05:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	NaN	NaN	NaN	9.0	9.0	3.0
2020-01-09 15:05:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	NaN	NaN	NaN	17.0	17.0	7.0
2020-01-09 20:05:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	NaN	NaN	NaN	12.0	11.0	3.0

The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the screen_duration feature (head). The second one is the 5-hour aggregation period with 5-min offset for the screen_count (tail). Note that because the screen_count feature is not aggregated daily, its columns are NaN at the daily timestamps. Similarly, because screen_duration is not aggregated in 5-hour windows, its columns are NaN at the 5-hour timestamps for all subjects.
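The same pattern of NaN values appears whenever two dataframes with different timestamps and different columns are stacked. The sketch below shows this in plain pandas with made-up values; whether niimpy combines the feature tables in exactly this way is an assumption, but the output above is consistent with it.

```python
import pandas as pd

# Made-up daily feature
daily = pd.DataFrame(
    {"duration_total": [6000.0]},
    index=pd.to_datetime(["2019-08-05 00:00"]),
)
# Made-up 5-hour feature with a 5-minute offset
five_hourly = pd.DataFrame(
    {"on_count": [1, 0]},
    index=pd.to_datetime(["2019-08-05 00:05", "2019-08-05 05:05"]),
)

# Stacking keeps every row and fills the columns a row does not have with NaN
combined = pd.concat([daily, five_hourly])
print(combined["on_count"].isna().tolist())        # [True, False, False]
print(combined["duration_total"].isna().tolist())  # [False, True, True]
```

If a single table without NaN values is needed, one option is to compute all features with the same resampling arguments, as in the 50-minute example above.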

4.2.3 Wrapper and its default option

The default option will compute all features in 30-minute aggregation windows. To use the extract_features_screen function with its default options, simply call the function.

[22]:

default = s.extract_features_screen(data, bat_data)

The function prints the computed features so you can track its progress. Now let’s have a look at the outputs

[23]:

default.tail(10)

[23]:

	user	device	screen_on_count	screen_off_count	screen_use_count	screen_off_durationtotal	screen_use_durationtotal	screen_on_durationtotal	screen_off_durationminimum	screen_on_durationminimum	...	screen_off_durationmean	screen_on_durationmean	screen_use_durationmean	screen_on_durationmedian	screen_use_durationmedian	screen_off_durationmedian	screen_on_durationstd	screen_use_durationstd	screen_off_durationstd	first_unlock
2020-01-09 20:00:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	0.0	0.0	0.0	0.000	0.000	0.000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
2020-01-09 20:30:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.0	1.0	1.0	0.005	28.930	8.253	0.005	8.253	...	0.005000	8.253000	28.930	8.2530	28.930	0.005	NaN	NaN	NaN	NaT
2020-01-09 21:00:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	2.0	1.0	1.0	0.010	39.087	11.158	0.010	5.234	...	0.010000	5.579000	39.087	5.5790	39.087	0.010	0.487904	NaN	NaN	NaT
2020-01-09 21:30:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	4.0	5.0	1.0	46.028	101.062	376.930	0.006	33.834	...	9.205600	94.232500	101.062	73.2835	101.062	0.012	71.990324	NaN	20.561987	NaT
2020-01-09 22:00:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	1.0	1.0	0.0	0.011	NaN	154.643	0.011	154.643	...	0.011000	154.643000	NaN	154.6430	NaN	0.011	NaN	NaN	NaN	NaT
2020-01-09 22:30:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	0.0	0.0	0.0	0.000	NaN	0.000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
2020-01-09 23:00:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	4.0	3.0	0.0	0.025	NaN	6.931	0.008	2.079	...	0.008333	2.310333	NaN	2.2620	NaN	0.008	0.258906	NaN	0.000577	NaT
2019-08-05 00:00:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2019-08-05 16:32:41.009999990+03:00
2019-08-05 00:00:00+03:00	iGyXetHE3S8u	Cq9vueHh3zVs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2019-08-05 14:03:42.322000027+03:00
2020-01-09 00:00:00+02:00	jd9INuQ5BBlW	OWd1Uau8POix	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2020-01-09 02:16:19.010999918+02:00

10 rows × 24 columns

Implementing own features

If none of the provided functions suits our needs, we can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamps (multiindex). Let’s assume we need a new function that detects the last time the screen is unlocked. Let’s first define the function

[24]:

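def screen_last_unlock(df, bat, screen_column_name = "screen_status", resample_args = {"rule":"30T"}):
    # Preprocess the screen events together with the battery data, then classify the transitions
    df2 = s.util_screen(df, bat, screen_column_name=screen_column_name)
    df2 = s.event_classification_screen(df2, screen_column_name=screen_column_name)

    # Keep the event time as a column so that it can be aggregated
    df2["time"] = df2.index
    # For each user and device, take the latest flagged event in every bin
    result = df2[df2.on==1].groupby(["user", "device"])["time"].resample(**resample_args).max()
    result = result.to_frame(name="first_unlock")
    # Move user and device from the index back to columns
    result = result.reset_index(["user", "device"])

    return result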

Then, we can call our new function in the stand-alone way or using the extract_features_screen function. Because the stand-alone way is the common way to call functions in Python, we will not show it. Instead, we will show how to integrate this new function into the wrapper. Let’s use the same data again and assume we want the default behavior of the wrapper.

[25]:

customized_features = s.extract_features_screen(data, bat_data, features={screen_last_unlock: {}})

[26]:

customized_features.head()

[26]:

	user	device	first_unlock
2019-08-05 16:30:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	2019-08-05 16:32:41.009999990+03:00
2019-08-05 17:00:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaT
2019-08-05 17:30:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaT
2019-08-05 18:00:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaT
2019-08-05 18:30:00+03:00	dvWdLQesv21a	i8jmoIuoe12Mo	NaT