Basic transformations

This page shows some basic transformations you can do once you have read data. In practice it is a short pandas crash course: pandas already provides all the operations you are likely to need, so there is no reason for us to reinvent them. Pandas provides a solid yet flexible base on which to build more advanced operations.

You can read more at the Pandas documentation.

Extracting single rows and columns

Let’s first import mobile phone battery status data.

[1]:
TZ = 'Europe/Helsinki'
[2]:
import niimpy
from niimpy import config
import warnings
warnings.filterwarnings("ignore")
[3]:
# Read the data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')

Then check the first rows of the dataframe.

[4]:
df.head()
[4]:
user device time battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000069+02:00

Get a single column, in this case all users:

[5]:
df['user']
[5]:
2020-01-09 02:20:02.924999952+02:00    jd9INuQ5BBlW
2020-01-09 02:21:30.405999899+02:00    jd9INuQ5BBlW
2020-01-09 02:24:12.805999994+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.561000109+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.953000069+02:00    jd9INuQ5BBlW
                                           ...
2019-08-09 00:30:48.073999882+03:00    dvWdLQesv21a
2019-08-09 00:32:40.717999935+03:00    dvWdLQesv21a
2019-08-09 00:34:23.114000082+03:00    dvWdLQesv21a
2019-08-09 00:36:05.505000114+03:00    dvWdLQesv21a
2019-08-07 06:53:35.671000004+03:00    dvWdLQesv21a
Name: user, Length: 634, dtype: object

Get a single row, in this case the 5th (rows are numbered from zero, so we ask for position 4):

[6]:
df.iloc[4]
[6]:
user                                      jd9INuQ5BBlW
device                                    3p83yASkOb_B
time                                    1578530138.953
battery_level                                       72
battery_status                                       2
battery_health                                       2
battery_adaptor                                      2
datetime           2020-01-09 02:35:38.953000069+02:00
Name: 2020-01-09 02:35:38.953000069+02:00, dtype: object
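.iloc selects rows by integer position; its label-based counterpart is .loc. A minimal sketch with an invented index:

```python
import pandas as pd

# Toy dataframe with string index labels (invented for illustration)
df2 = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df2.iloc[1])   # second row, selected by integer position
print(df2.loc['b'])  # the same row, selected by index label
```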

Listing unique users

We can list unique users using the pandas.Series.unique() function.

[7]:
df['user'].unique()
[7]:
array(['jd9INuQ5BBlW', 'iGyXetHE3S8u', 'dvWdLQesv21a'], dtype=object)
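Relatedly, pandas.Series.value_counts() shows how many observations each user has. A toy sketch, where the user IDs are invented stand-ins for the hashed IDs above:

```python
import pandas as pd

users = pd.Series(['u1', 'u1', 'u2', 'u1', 'u3'], name='user')

print(users.unique())        # distinct users, in order of first appearance
print(users.value_counts())  # number of rows per user, largest first
```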

List unique values

The same applies to other data features (columns).

[8]:
df['battery_status'].unique()
[8]:
array([ 3,  2,  5, -1, -3])

Extract data of only one subject

We can extract the data of a single subject with a boolean mask:

[9]:
df[df['user'] == 'jd9INuQ5BBlW']
[9]:
user device time battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000069+02:00
... ... ... ... ... ... ... ... ...
2020-01-09 23:02:13.938999891+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 73 3 2 0 2020-01-09 23:02:13.938999891+02:00
2020-01-09 23:10:37.262000084+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 73 3 2 0 2020-01-09 23:10:37.262000084+02:00
2020-01-09 23:22:13.966000080+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578605e+09 72 3 2 0 2020-01-09 23:22:13.966000080+02:00
2020-01-09 23:32:13.959000111+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 71 3 2 0 2020-01-09 23:32:13.959000111+02:00
2020-01-09 23:39:06.799999952+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 71 3 2 0 2020-01-09 23:39:06.799999952+02:00

373 rows × 8 columns
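Boolean masks can be combined with & and | to filter on several columns at once. A sketch with a toy dataframe; the column names match the battery data, but the values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'battery_level': [80, 20, 90, 15],
})

# Rows for user 'a' with battery below 50.
# Note the parentheses around each condition: & binds tighter than ==.
low_a = df[(df['user'] == 'a') & (df['battery_level'] < 50)]
print(low_a)
```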

Renaming a column or columns

A dataframe column can be renamed using the pandas.DataFrame.rename() function.

[10]:
df.rename(columns={'time': 'timestamp'}, inplace=True)
df.head()
[10]:
user device timestamp battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000069+02:00
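By default rename() returns a new dataframe and leaves the original untouched; passing inplace=True (as above) modifies the original instead. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'time': [1.0], 'battery_level': [74]})

renamed = df.rename(columns={'time': 'timestamp'})
print(renamed.columns.tolist())  # renamed copy
print(df.columns.tolist())       # original unchanged
```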

Change datatypes

Let’s then check the dataframe datatypes:

[11]:
df.dtypes
[11]:
user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                               int64
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object

We can change the datatypes with the pandas.DataFrame.astype() function. Here we change the battery_health datatype to categorical:

[12]:
df.astype({'battery_health': 'category'}).dtypes
[12]:
user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                            category
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object
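astype is one route; another common conversion is turning a Unix-seconds column like timestamp into timezone-aware datetimes with pandas.to_datetime. A sketch with a single invented timestamp value:

```python
import pandas as pd

# Seconds since the Unix epoch, as in the timestamp column
ts = pd.Series([1578530138.953], name='timestamp')

# Parse as UTC, then convert to the local timezone
dt = pd.to_datetime(ts, unit='s').dt.tz_localize('UTC').dt.tz_convert('Europe/Helsinki')
print(dt.iloc[0])
```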

Transforming a column to a new value

Dataframe values can be transformed (decoded, etc.) into new values using the pandas.Series.transform() function.

Here we add one to the column values.

[13]:
df['battery_adaptor'].transform(lambda x: x + 1)
[13]:
2020-01-09 02:20:02.924999952+02:00    1
2020-01-09 02:21:30.405999899+02:00    1
2020-01-09 02:24:12.805999994+02:00    1
2020-01-09 02:35:38.561000109+02:00    1
2020-01-09 02:35:38.953000069+02:00    3
                                      ..
2019-08-09 00:30:48.073999882+03:00    2
2019-08-09 00:32:40.717999935+03:00    2
2019-08-09 00:34:23.114000082+03:00    2
2019-08-09 00:36:05.505000114+03:00    2
2019-08-07 06:53:35.671000004+03:00    2
Name: battery_adaptor, Length: 634, dtype: int64
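transform is convenient for arithmetic; for decoding integer codes into readable labels, pandas.Series.map() is often clearer. The code-to-label mapping below is invented for illustration, not the actual sensor coding:

```python
import pandas as pd

status = pd.Series([3, 2, 5], name='battery_status')

# Hypothetical mapping -- consult the sensor documentation for the real meanings
labels = status.map({2: 'discharging', 3: 'charging', 5: 'full'})
print(labels.tolist())
```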

Resample

Dataframes can be down- or upsampled with the pandas.DataFrame.resample() function.

Here we downsample the data by hour and aggregate the mean:

[14]:
df['battery_level'].resample('h').agg("mean")
[14]:
2019-08-05 14:00:00+03:00    45.500000
2019-08-05 15:00:00+03:00    44.000000
2019-08-05 16:00:00+03:00    43.000000
2019-08-05 17:00:00+03:00    42.000000
2019-08-05 18:00:00+03:00    41.000000
                               ...
2020-01-09 19:00:00+02:00    86.166667
2020-01-09 20:00:00+02:00    82.000000
2020-01-09 21:00:00+02:00    78.428571
2020-01-09 22:00:00+02:00    75.000000
2020-01-09 23:00:00+02:00    72.000000
Freq: h, Name: battery_level, Length: 3779, dtype: float64
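With several users in one dataframe, you usually want to resample each user separately; groupby combines with resample for this. A sketch with invented readings:

```python
import pandas as pd

idx = pd.date_range('2020-01-09 02:00', periods=4, freq='30min',
                    tz='Europe/Helsinki')
df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'battery_level': [74, 72, 90, 88]}, index=idx)

# Hourly mean battery level, computed separately for each user
hourly = df.groupby('user')['battery_level'].resample('h').mean()
print(hourly)
```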

Groupby

For groupwise data inspection, we can use the pandas.DataFrame.groupby() function.

Let's first load a dataframe with several subjects belonging to different groups.

[15]:
df = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz='Europe/Helsinki')
df.set_index('timestamp', inplace=True)
df
[15]:
user activity group
timestamp
2013-03-27 06:00:00-05:00 u00 2 none
2013-03-27 07:00:00-05:00 u00 1 none
2013-03-27 08:00:00-05:00 u00 2 none
2013-03-27 09:00:00-05:00 u00 3 none
2013-03-27 10:00:00-05:00 u00 4 none
... ... ... ...
2013-05-31 18:00:00-05:00 u59 5 mild
2013-05-31 19:00:00-05:00 u59 5 mild
2013-05-31 20:00:00-05:00 u59 4 mild
2013-05-31 21:00:00-05:00 u59 5 mild
2013-05-31 22:00:00-05:00 u59 1 mild

55907 rows × 3 columns

We can summarize the data by grouping the observations by group and user, and then aggregating the mean:

[16]:
df.groupby(['group','user']).agg("mean")
[16]:
activity
group user
mild u02 0.922348
u04 1.466960
u07 0.914457
u16 0.702918
u20 0.277946
u24 0.938028
u27 0.653724
u31 0.929495
u35 0.519455
u43 0.809045
u49 1.159767
u58 0.620621
u59 1.626263
moderate u18 0.445323
u52 1.051735
moderately severe u17 0.489510
u23 0.412884
none u00 1.182973
u03 0.176737
u05 0.606742
u09 1.095908
u10 0.662612
u14 1.005859
u15 0.295990
u30 0.933036
u32 1.113593
u36 0.936281
u42 0.378851
u44 0.292580
u47 0.396026
u51 0.828662
u56 0.840967
severe u01 1.063660
u19 0.571792
u33 0.733115
u34 0.454789
u45 0.441134
u53 0.389404
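agg also accepts several statistics at once, returning one column per statistic. A sketch with a toy activity column:

```python
import pandas as pd

df = pd.DataFrame({'group': ['mild', 'mild', 'none', 'none'],
                   'activity': [2, 4, 1, 3]})

# Mean, minimum and maximum activity per group in one call
summary = df.groupby('group')['activity'].agg(['mean', 'min', 'max'])
print(summary)
```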

Summary statistics

There are many ways you may want to get an overview of your data.

Let’s first load mobile phone screen activity data.

[17]:
df = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')
df
[17]:
user device time screen_status datetime
2020-01-09 02:06:41.573999882+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578528e+09 0 2020-01-09 02:06:41.573999882+02:00
2020-01-09 02:09:29.151999950+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 1 2020-01-09 02:09:29.151999950+02:00
2020-01-09 02:09:32.790999889+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 3 2020-01-09 02:09:32.790999889+02:00
2020-01-09 02:11:41.996000051+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 0 2020-01-09 02:11:41.996000051+02:00
2020-01-09 02:16:19.010999918+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 1 2020-01-09 02:16:19.010999918+02:00
... ... ... ... ... ...
2019-08-07 17:42:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565189e+09 2 2019-08-07 17:42:41.009999990+03:00
2019-08-07 18:32:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565192e+09 1 2019-08-07 18:32:41.009999990+03:00
2019-08-07 19:22:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565195e+09 0 2019-08-07 19:22:41.009999990+03:00
2019-08-07 20:12:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565198e+09 1 2019-08-07 20:12:41.009999990+03:00
2019-08-07 21:02:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo 1.565201e+09 2 2019-08-07 21:02:41.009999990+03:00

343 rows × 5 columns

Hourly data

It is easy to get the number of observations in each hour:

[18]:
hourly = df.groupby([df.index.date, df.index.hour]).size()
hourly
[18]:
2019-08-05  14    20
            15     1
            16     1
            17     1
            18     1
                  ..
2020-01-09  18     4
            20     4
            21    19
            22     3
            23    12
Length: 94, dtype: int64
[19]:
# The index is the (day, hour) pairs and the
# value is the number at that time
print('%s had %d data points'%(hourly.index[0], hourly.iloc[0]))
(datetime.date(2019, 8, 5), 14) had 20 data points
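The (day, hour) MultiIndex can also be unstacked into a day-by-hour table, with missing hours shown as NaN. A toy sketch with invented counts:

```python
import pandas as pd

counts = pd.Series(
    [20, 1, 4],
    index=pd.MultiIndex.from_tuples(
        [('2019-08-05', 14), ('2019-08-05', 15), ('2020-01-09', 18)],
        names=['day', 'hour']),
)

# Rows are days, columns are hours; missing (day, hour) pairs become NaN
table = counts.unstack()
print(table)
```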

Occurrence

In niimpy, occurrence is a measure of data completeness.

Occurrence is calculated as follows:

* Divide the data into time intervals (1 hour by default).
* Divide each interval into bins (5 bins by default).
* Count the number of bins that have data. This is \(occurrence\).
* For each interval, report \(occurrence\).

This isn't a perfect measure, but it is reasonably effective and simple to calculate. For data that isn't continuous (such as the screen data used here), it shows how much the sensor has been used.

Note that the assumption of uniformly present data does not hold for all data sources.
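The steps above can be sketched in plain pandas. This is an illustrative reimplementation, not niimpy's actual code; niimpy.util.occurrence handles the details:

```python
import pandas as pd

# Invented timestamps: three in hour 14, one in hour 15
idx = pd.to_datetime(['2019-08-05 14:01', '2019-08-05 14:02',
                      '2019-08-05 14:40', '2019-08-05 15:59'])

# Assign each timestamp to a 12-minute bin (5 bins per hour), then count
# the distinct non-empty bins within each hour
bins = pd.Series(idx).dt.floor('12min')
occurrence = bins.groupby(bins.dt.floor('h')).nunique()
print(occurrence)
```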

[20]:
occurrences = niimpy.util.occurrence(df.index)
occurrences.head()
[20]:
occurrence
time
2019-08-05 14:00:00+03:00 4
2019-08-05 15:00:00+03:00 1
2019-08-05 16:00:00+03:00 1
2019-08-05 17:00:00+03:00 1
2019-08-05 18:00:00+03:00 1

We can create a simplified presentation (a pivot table) of the data using the pandas.DataFrame.pivot_table() function:

[21]:
occurrences["day"] = occurrences.index.date
occurrences["hour"] = occurrences.index.hour
occurrences.head(5).pivot_table(index='day',columns='hour')
[21]:
occurrence
hour 14 15 16 17 18
day
2019-08-05 4.0 1.0 1.0 1.0 1.0