Basic transformations

This page shows some basic transformations you can do once you have read data. In practice it is a short pandas crash course: pandas already provides all the operations you are likely to need, so there is no reason for us to reinvent them. Pandas provides a solid yet flexible base on which to build more advanced operations.

You can read more at the Pandas documentation.

Extracting single rows and columns

Let’s first import mobile phone battery status data.

[1]:
TZ = 'Europe/Helsinki'
[2]:
import niimpy
from niimpy import config
import warnings
warnings.filterwarnings("ignore")
[3]:
# Read the data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')

Then check the first rows of the dataframe.

[4]:
df.head()
[4]:
user device time battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000069+02:00

Get a single column, in this case all users:

[5]:
df['user']
[5]:
2020-01-09 02:20:02.924999952+02:00    jd9INuQ5BBlW
2020-01-09 02:21:30.405999899+02:00    jd9INuQ5BBlW
2020-01-09 02:24:12.805999994+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.561000109+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.953000069+02:00    jd9INuQ5BBlW
                                           ...
2019-08-09 00:30:48.073999882+03:00    dvWdLQesv21a
2019-08-09 00:32:40.717999935+03:00    dvWdLQesv21a
2019-08-09 00:34:23.114000082+03:00    dvWdLQesv21a
2019-08-09 00:36:05.505000114+03:00    dvWdLQesv21a
2019-08-07 06:53:35.671000004+03:00    dvWdLQesv21a
Name: user, Length: 634, dtype: object

Get a single row, in this case the 5th (rows are numbered from zero, so we ask for position 4):

[6]:
df.iloc[4]
[6]:
user                                      jd9INuQ5BBlW
device                                    3p83yASkOb_B
time                                    1578530138.953
battery_level                                       72
battery_status                                       2
battery_health                                       2
battery_adaptor                                      2
datetime           2020-01-09 02:35:38.953000069+02:00
Name: 2020-01-09 02:35:38.953000069+02:00, dtype: object
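.iloc selects rows by integer position; its label-based counterpart is .loc. A minimal sketch with an invented index:

```python
import pandas as pd

# Toy dataframe with string index labels (invented for illustration)
df2 = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df2.iloc[1])   # second row, selected by integer position
print(df2.loc['b'])  # the same row, selected by index label
```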

Listing unique users

We can list unique users using the pandas.Series.unique() function.

[7]:
df['user'].unique()
[7]:
array(['jd9INuQ5BBlW', 'iGyXetHE3S8u', 'dvWdLQesv21a'], dtype=object)
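Relatedly, pandas.Series.value_counts() shows how many observations each user has. A toy sketch, where the user IDs are invented stand-ins for the hashed IDs above:

```python
import pandas as pd

users = pd.Series(['u1', 'u1', 'u2', 'u1', 'u3'], name='user')

print(users.unique())        # distinct users, in order of first appearance
print(users.value_counts())  # number of rows per user, largest first
```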

List unique values

The same applies to other data features (columns).

[8]:
df['battery_status'].unique()
[8]:
array([ 3,  2,  5, -1, -3])

Extract data of only one subject

We can extract the data of a single subject with a boolean mask:

[9]:
df[df['user'] == 'jd9INuQ5BBlW']
[9]:
user device time battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000069+02:00
... ... ... ... ... ... ... ... ...
2020-01-09 23:02:13.938999891+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 73 3 2 0 2020-01-09 23:02:13.938999891+02:00
2020-01-09 23:10:37.262000084+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 73 3 2 0 2020-01-09 23:10:37.262000084+02:00
2020-01-09 23:22:13.966000080+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578605e+09 72 3 2 0 2020-01-09 23:22:13.966000080+02:00
2020-01-09 23:32:13.959000111+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 71 3 2 0 2020-01-09 23:32:13.959000111+02:00
2020-01-09 23:39:06.799999952+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 71 3 2 0 2020-01-09 23:39:06.799999952+02:00

373 rows × 8 columns
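Boolean masks can be combined with & and | to filter on several columns at once. A sketch with a toy dataframe; the column names match the battery data, but the values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'battery_level': [80, 20, 90, 15],
})

# Rows for user 'a' with battery below 50.
# Note the parentheses around each condition: & binds tighter than ==.
low_a = df[(df['user'] == 'a') & (df['battery_level'] < 50)]
print(low_a)
```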

Renaming a column or columns

A dataframe column can be renamed using the pandas.DataFrame.rename() function.

[10]:
df.rename(columns={'time': 'timestamp'}, inplace=True)
df.head()
[10]:
user device timestamp battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000069+02:00
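By default rename() returns a new dataframe and leaves the original untouched; passing inplace=True (as above) modifies the original instead. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'time': [1.0], 'battery_level': [74]})

renamed = df.rename(columns={'time': 'timestamp'})
print(renamed.columns.tolist())  # renamed copy
print(df.columns.tolist())       # original unchanged
```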

Change datatypes

Let’s then check the dataframe datatypes:

[11]:
df.dtypes
[11]:
user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                               int64
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object

We can change the datatypes with the pandas.DataFrame.astype() function. Here we change the battery_health datatype to categorical:

[12]:
df.astype({'battery_health': 'category'}).dtypes
[12]:
user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                            category
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object
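astype is one route; another common conversion is turning a Unix-seconds column like timestamp into timezone-aware datetimes with pandas.to_datetime. A sketch with a single invented timestamp value:

```python
import pandas as pd

# Seconds since the Unix epoch, as in the timestamp column
ts = pd.Series([1578530138.953], name='timestamp')

# Parse as UTC, then convert to the local timezone
dt = pd.to_datetime(ts, unit='s').dt.tz_localize('UTC').dt.tz_convert('Europe/Helsinki')
print(dt.iloc[0])
```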

Transforming a column to a new value

Dataframe values can be transformed (decoded, etc.) into new values using the pandas.Series.transform() function.

Here we add one to the column values.

[13]:
df['battery_adaptor'].transform(lambda x: x + 1)
[13]:
2020-01-09 02:20:02.924999952+02:00    1
2020-01-09 02:21:30.405999899+02:00    1
2020-01-09 02:24:12.805999994+02:00    1
2020-01-09 02:35:38.561000109+02:00    1
2020-01-09 02:35:38.953000069+02:00    3
                                      ..
2019-08-09 00:30:48.073999882+03:00    2
2019-08-09 00:32:40.717999935+03:00    2
2019-08-09 00:34:23.114000082+03:00    2
2019-08-09 00:36:05.505000114+03:00    2
2019-08-07 06:53:35.671000004+03:00    2
Name: battery_adaptor, Length: 634, dtype: int64
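transform is convenient for arithmetic; for decoding integer codes into readable labels, pandas.Series.map() is often clearer. The code-to-label mapping below is invented for illustration, not the actual sensor coding:

```python
import pandas as pd

status = pd.Series([3, 2, 5], name='battery_status')

# Hypothetical mapping -- consult the sensor documentation for the real meanings
labels = status.map({2: 'discharging', 3: 'charging', 5: 'full'})
print(labels.tolist())
```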

Resample

Dataframes can be down- or upsampled with the pandas.DataFrame.resample() function.

Here we downsample the data by hour and aggregate the mean:

[14]:
df['battery_level'].resample('h').agg("mean")
[14]:
2019-08-05 14:00:00+03:00    45.500000
2019-08-05 15:00:00+03:00    44.000000
2019-08-05 16:00:00+03:00    43.000000
2019-08-05 17:00:00+03:00    42.000000
2019-08-05 18:00:00+03:00    41.000000
                               ...
2020-01-09 19:00:00+02:00    86.166667
2020-01-09 20:00:00+02:00    82.000000
2020-01-09 21:00:00+02:00    78.428571
2020-01-09 22:00:00+02:00    75.000000
2020-01-09 23:00:00+02:00    72.000000
Freq: h, Name: battery_level, Length: 3779, dtype: float64
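With several users in one dataframe, you usually want to resample each user separately; groupby combines with resample for this. A sketch with invented readings:

```python
import pandas as pd

idx = pd.date_range('2020-01-09 02:00', periods=4, freq='30min',
                    tz='Europe/Helsinki')
df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'battery_level': [74, 72, 90, 88]}, index=idx)

# Hourly mean battery level, computed separately for each user
hourly = df.groupby('user')['battery_level'].resample('h').mean()
print(hourly)
```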

Groupby

For groupwise data inspection, we can use the pandas.DataFrame.groupby() function.

Let's first load a dataframe with several subjects belonging to different groups.

[15]:
df = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz='Europe/Helsinki')
df.set_index('timestamp', inplace=True)
df
[15]:
user activity group
timestamp
2013-03-27 06:00:00-05:00 u00 2 none
2013-03-27 07:00:00-05:00 u00 1 none
2013-03-27 08:00:00-05:00 u00 2 none
2013-03-27 09:00:00-05:00 u00 3 none
2013-03-27 10:00:00-05:00 u00 4 none
... ... ... ...
2013-05-31 18:00:00-05:00 u59 5 mild
2013-05-31 19:00:00-05:00 u59 5 mild
2013-05-31 20:00:00-05:00 u59 4 mild
2013-05-31 21:00:00-05:00 u59 5 mild
2013-05-31 22:00:00-05:00 u59 1 mild

55907 rows × 3 columns

We can summarize the data by grouping the observations by group and user, and then aggregating the mean:

[16]:
df.groupby(['group','user']).agg("mean")
[16]:
activity
group user
mild u02 0.922348
u04 1.466960
u07 0.914457
u16 0.702918
u20 0.277946
u24 0.938028
u27 0.653724
u31 0.929495
u35 0.519455
u43 0.809045
u49 1.159767
u58 0.620621
u59 1.626263
moderate u18 0.445323
u52 1.051735
moderately severe u17 0.489510
u23 0.412884
none u00 1.182973
u03 0.176737
u05 0.606742
u09 1.095908
u10 0.662612
u14 1.005859
u15 0.295990
u30 0.933036
u32 1.113593
u36 0.936281
u42 0.378851
u44 0.292580
u47 0.396026
u51 0.828662
u56 0.840967
severe u01 1.063660
u19 0.571792
u33 0.733115
u34 0.454789
u45 0.441134
u53 0.389404
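agg also accepts several statistics at once, returning one column per statistic. A sketch with a toy activity column:

```python
import pandas as pd

df = pd.DataFrame({'group': ['mild', 'mild', 'none', 'none'],
                   'activity': [2, 4, 1, 3]})

# Mean, minimum and maximum activity per group in one call
summary = df.groupby('group')['activity'].agg(['mean', 'min', 'max'])
print(summary)
```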

Summary statistics

There are many ways you may want to get an overview of your data.

Let’s first load mobile phone screen activity data.

[17]:
df = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')
df
[17]:
user device time screen_status datetime
2020-01-09 02:06:41.573999882+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578528e+09 0 2020-01-09 02:06:41.573999882+02:00
2020-01-09 02:09:29.151999950+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 1 2020-01-09 02:09:29.151999950+02:00
2020-01-09 02:09:32.790999889+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 3 2020-01-09 02:09:32.790999889+02:00
2020-01-09 02:11:41.996000051+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 0 2020-01-09 02:11:41.996000051+02:00
2020-01-09 02:16:19.010999918+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 1 2020-01-09 02:16:19.010999918+02:00
... ... ... ... ... ...
2019-08-07 17:42:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565189e+09 2 2019-08-07 17:42:41.009999990+03:00
2019-08-07 18:32:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565192e+09 1 2019-08-07 18:32:41.009999990+03:00
2019-08-07 19:22:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565195e+09 0 2019-08-07 19:22:41.009999990+03:00
2019-08-07 20:12:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565198e+09 1 2019-08-07 20:12:41.009999990+03:00
2019-08-07 21:02:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565201e+09 2 2019-08-07 21:02:41.009999990+03:00

343 rows × 5 columns

Hourly data

It is easy to get the number of observations in each hour:

[18]:
hourly = df.groupby([df.index.date, df.index.hour]).size()
hourly
[18]:
2019-08-05  14    20
            15     1
            16     1
            17     1
            18     1
                  ..
2020-01-09  18     4
            20     4
            21    19
            22     3
            23    12
Length: 94, dtype: int64
[19]:
# The index is the (day, hour) pairs and the
# value is the number at that time
print('%s had %d data points'%(hourly.index[0], hourly.iloc[0]))
(datetime.date(2019, 8, 5), 14) had 20 data points
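The (day, hour) MultiIndex can also be unstacked into a day-by-hour table, with missing hours shown as NaN. A toy sketch with invented counts:

```python
import pandas as pd

counts = pd.Series(
    [20, 1, 4],
    index=pd.MultiIndex.from_tuples(
        [('2019-08-05', 14), ('2019-08-05', 15), ('2020-01-09', 18)],
        names=['day', 'hour']),
)

# Rows are days, columns are hours; missing (day, hour) pairs become NaN
table = counts.unstack()
print(table)
```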

Occurrence

In niimpy, occurrence is a measure of data completeness.

Occurrence is calculated as follows:

* Divide the data into time intervals (1 hour by default).
* Divide each interval into bins (5 bins by default).
* Count the number of bins that have data. This is \(occurrence\).
* For each interval, report \(occurrence\).

This isn't a perfect measure, but it is reasonably effective and simple to calculate. For data that isn't continuous (such as the screen data used here), it shows how much the sensor has been used.

Note that the assumption of uniformly present data does not hold for all data sources.
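The steps above can be sketched in plain pandas. This is an illustrative reimplementation, not niimpy's actual code; niimpy.util.occurrence handles the details:

```python
import pandas as pd

# Invented timestamps: three in hour 14, one in hour 15
idx = pd.to_datetime(['2019-08-05 14:01', '2019-08-05 14:02',
                      '2019-08-05 14:40', '2019-08-05 15:59'])

# Assign each timestamp to a 12-minute bin (5 bins per hour), then count
# the distinct non-empty bins within each hour
bins = pd.Series(idx).dt.floor('12min')
occurrence = bins.groupby(bins.dt.floor('h')).nunique()
print(occurrence)
```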

[20]:
occurrences = niimpy.util.occurrence(df.index)
occurrences.head()
[20]:
occurrence
time
2019-08-05 14:00:00+03:00 4
2019-08-05 15:00:00+03:00 1
2019-08-05 16:00:00+03:00 1
2019-08-05 17:00:00+03:00 1
2019-08-05 18:00:00+03:00 1

We can create a simplified presentation (a pivot table) of the data using the pandas.DataFrame.pivot_table() function:

[21]:
occurrences["day"] = occurrences.index.date
occurrences["hour"] = occurrences.index.hour
occurrences.head(5).pivot_table(index='day',columns='hour')
[21]:
occurrence
hour 14 15 16 17 18
day
2019-08-05 4.0 1.0 1.0 1.0 1.0