Basic transformations
This page shows some basic transformations you can do once you have read the data. In essence, it is a short pandas crash course: pandas already provides all the operations you are likely to need, so there is no reason to reinvent them. Pandas gives us a solid yet flexible base on which to build more advanced operations.
You can read more at the Pandas documentation.
Extracting single rows and columns
Let’s first import mobile phone battery status data.
[1]:
TZ = 'Europe/Helsinki'
[2]:
import niimpy
from niimpy import config
import warnings
warnings.filterwarnings("ignore")
[3]:
# Read the data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')
Then check the first rows of the dataframe:
[4]:
df.head()
[4]:
user | device | time | battery_level | battery_status | battery_health | battery_adaptor | datetime | |
---|---|---|---|---|---|---|---|---|
2020-01-09 02:20:02.924999952+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999952+02:00 |
2020-01-09 02:21:30.405999899+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999899+02:00 |
2020-01-09 02:24:12.805999994+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999994+02:00 |
2020-01-09 02:35:38.561000109+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000109+02:00 |
2020-01-09 02:35:38.953000069+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000069+02:00 |
Get a single column, in this case the user of every observation:
[5]:
df['user']
[5]:
2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW
2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW
2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW
2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW
2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW
...
2019-08-09 00:30:48.073999882+03:00 dvWdLQesv21a
2019-08-09 00:32:40.717999935+03:00 dvWdLQesv21a
2019-08-09 00:34:23.114000082+03:00 dvWdLQesv21a
2019-08-09 00:36:05.505000114+03:00 dvWdLQesv21a
2019-08-07 06:53:35.671000004+03:00 dvWdLQesv21a
Name: user, Length: 634, dtype: object
Get a single row, in this case the 5th (the first row is zero):
[6]:
df.iloc[4]
[6]:
user jd9INuQ5BBlW
device 3p83yASkOb_B
time 1578530138.953
battery_level 72
battery_status 2
battery_health 2
battery_adaptor 2
datetime 2020-01-09 02:35:38.953000069+02:00
Name: 2020-01-09 02:35:38.953000069+02:00, dtype: object
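Label-based `.loc` lookups complement the position-based `.iloc` shown above. The following is a minimal, self-contained sketch on synthetic data; the column names merely mirror the battery frame above:

```python
import pandas as pd

# Synthetic stand-in for the battery dataframe (values are illustrative)
df = pd.DataFrame(
    {"user": ["a", "a", "b"], "battery_level": [74, 73, 50]},
    index=pd.to_datetime(["2020-01-09 02:20", "2020-01-09 02:21", "2020-01-09 02:24"]),
)

row = df.iloc[1]                                     # position-based: second row
cell = df.loc["2020-01-09 02:24", "battery_level"]   # label-based: row by index label, column by name
print(row["battery_level"], cell)
```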
Listing unique users
We can list the unique users with the pandas.Series.unique()
function.
[7]:
df['user'].unique()
[7]:
array(['jd9INuQ5BBlW', 'iGyXetHE3S8u', 'dvWdLQesv21a'], dtype=object)
List unique values
The same applies to other columns:
[8]:
df['battery_status'].unique()
[8]:
array([ 3, 2, 5, -1, -3])
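Beyond listing unique values, `pandas.Series.value_counts()` also reports how often each value occurs. A small sketch on synthetic status codes:

```python
import pandas as pd

# Synthetic battery_status codes (values are illustrative)
status = pd.Series([3, 3, 2, 5, -1, 3, 2])

# Count occurrences of each distinct value, most frequent first
counts = status.value_counts()
print(counts)
```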
Extract data of only one subject
We can extract the data of a single subject as follows:
[9]:
df[df['user'] == 'jd9INuQ5BBlW']
[9]:
user | device | time | battery_level | battery_status | battery_health | battery_adaptor | datetime | |
---|---|---|---|---|---|---|---|---|
2020-01-09 02:20:02.924999952+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999952+02:00 |
2020-01-09 02:21:30.405999899+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999899+02:00 |
2020-01-09 02:24:12.805999994+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999994+02:00 |
2020-01-09 02:35:38.561000109+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000109+02:00 |
2020-01-09 02:35:38.953000069+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000069+02:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-01-09 23:02:13.938999891+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578604e+09 | 73 | 3 | 2 | 0 | 2020-01-09 23:02:13.938999891+02:00 |
2020-01-09 23:10:37.262000084+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578604e+09 | 73 | 3 | 2 | 0 | 2020-01-09 23:10:37.262000084+02:00 |
2020-01-09 23:22:13.966000080+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578605e+09 | 72 | 3 | 2 | 0 | 2020-01-09 23:22:13.966000080+02:00 |
2020-01-09 23:32:13.959000111+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578606e+09 | 71 | 3 | 2 | 0 | 2020-01-09 23:32:13.959000111+02:00 |
2020-01-09 23:39:06.799999952+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578606e+09 | 71 | 3 | 2 | 0 | 2020-01-09 23:39:06.799999952+02:00 |
373 rows × 8 columns
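To select several subjects at once, `pandas.Series.isin()` generalizes the equality filter. A sketch with made-up user IDs:

```python
import pandas as pd

# Synthetic multi-user frame (user IDs are made up)
df = pd.DataFrame(
    {"user": ["u1", "u2", "u3", "u1"], "battery_level": [74, 60, 55, 73]}
)

# Keep only the rows belonging to the listed subjects
subset = df[df["user"].isin(["u1", "u3"])]
print(subset)
```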
Renaming a column or columns
Dataframe columns can be renamed using the pandas.DataFrame.rename()
function.
[10]:
df.rename(columns={'time': 'timestamp'}, inplace=True)
df.head()
[10]:
user | device | timestamp | battery_level | battery_status | battery_health | battery_adaptor | datetime | |
---|---|---|---|---|---|---|---|---|
2020-01-09 02:20:02.924999952+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999952+02:00 |
2020-01-09 02:21:30.405999899+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999899+02:00 |
2020-01-09 02:24:12.805999994+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999994+02:00 |
2020-01-09 02:35:38.561000109+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000109+02:00 |
2020-01-09 02:35:38.953000069+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000069+02:00 |
Change datatypes
Let’s then check the dataframe datatypes:
[11]:
df.dtypes
[11]:
user object
device object
timestamp float64
battery_level int64
battery_status int64
battery_health int64
battery_adaptor int64
datetime datetime64[ns, Europe/Helsinki]
dtype: object
We can change datatypes with the pandas.DataFrame.astype()
function. Here we change the battery_health column to a categorical type:
[12]:
df.astype({'battery_health': 'category'}).dtypes
[12]:
user object
device object
timestamp float64
battery_level int64
battery_status int64
battery_health category
battery_adaptor int64
datetime datetime64[ns, Europe/Helsinki]
dtype: object
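A categorical column stores each distinct value only once and keeps a small integer code per row, which can save memory for repetitive data. A sketch:

```python
import pandas as pd

# Synthetic battery_health values converted to a categorical dtype
health = pd.Series([2, 2, 3, 2], name="battery_health").astype("category")

# The distinct categories, and the per-row integer codes pointing at them
print(health.cat.categories.tolist())
print(health.cat.codes.tolist())
```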
Transforming a column to a new value
Dataframe values can be transformed (decoded, rescaled, etc.) into new values with the pandas.Series.transform()
function.
Here we add one to each value in the column.
[13]:
df['battery_adaptor'].transform(lambda x: x + 1)
[13]:
2020-01-09 02:20:02.924999952+02:00 1
2020-01-09 02:21:30.405999899+02:00 1
2020-01-09 02:24:12.805999994+02:00 1
2020-01-09 02:35:38.561000109+02:00 1
2020-01-09 02:35:38.953000069+02:00 3
..
2019-08-09 00:30:48.073999882+03:00 2
2019-08-09 00:32:40.717999935+03:00 2
2019-08-09 00:34:23.114000082+03:00 2
2019-08-09 00:36:05.505000114+03:00 2
2019-08-07 06:53:35.671000004+03:00 2
Name: battery_adaptor, Length: 634, dtype: int64
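A common decoding transform maps integer codes to human-readable labels with `pandas.Series.map()`; the code-to-label mapping below is invented purely for illustration:

```python
import pandas as pd

# Synthetic status codes (the labels below are hypothetical, not AWARE's actual coding)
status = pd.Series([2, 3, 5, 2])
labels = status.map({2: "charging", 3: "full", 5: "discharging"})
print(labels.tolist())
```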
Resample
Dataframe downsampling and upsampling can be done with the pandas.DataFrame.resample()
function.
Here we downsample the data to hourly intervals and aggregate the mean:
[14]:
df['battery_level'].resample('h').agg("mean")
[14]:
2019-08-05 14:00:00+03:00 45.500000
2019-08-05 15:00:00+03:00 44.000000
2019-08-05 16:00:00+03:00 43.000000
2019-08-05 17:00:00+03:00 42.000000
2019-08-05 18:00:00+03:00 41.000000
...
2020-01-09 19:00:00+02:00 86.166667
2020-01-09 20:00:00+02:00 82.000000
2020-01-09 21:00:00+02:00 78.428571
2020-01-09 22:00:00+02:00 75.000000
2020-01-09 23:00:00+02:00 72.000000
Freq: h, Name: battery_level, Length: 3779, dtype: float64
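The same pattern on a small synthetic series, so the resampling arithmetic is easy to verify by hand:

```python
import pandas as pd

# Six readings, 20 minutes apart: three fall in hour 02, three in hour 03
idx = pd.date_range("2020-01-09 02:00", periods=6, freq="20min", tz="Europe/Helsinki")
level = pd.Series([74, 73, 72, 72, 71, 70], index=idx)

# Downsample to hourly means: (74+73+72)/3 = 73, (72+71+70)/3 = 71
hourly_mean = level.resample("h").mean()
print(hourly_mean)
```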
Groupby
For groupwise data inspection, we can use the pandas.DataFrame.groupby()
function.
Let's first load a dataframe with several subjects belonging to different groups.
[15]:
df = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz='Europe/Helsinki')
df.set_index('timestamp', inplace=True)
df
[15]:
user | activity | group | |
---|---|---|---|
timestamp | |||
2013-03-27 06:00:00-05:00 | u00 | 2 | none |
2013-03-27 07:00:00-05:00 | u00 | 1 | none |
2013-03-27 08:00:00-05:00 | u00 | 2 | none |
2013-03-27 09:00:00-05:00 | u00 | 3 | none |
2013-03-27 10:00:00-05:00 | u00 | 4 | none |
... | ... | ... | ... |
2013-05-31 18:00:00-05:00 | u59 | 5 | mild |
2013-05-31 19:00:00-05:00 | u59 | 5 | mild |
2013-05-31 20:00:00-05:00 | u59 | 4 | mild |
2013-05-31 21:00:00-05:00 | u59 | 5 | mild |
2013-05-31 22:00:00-05:00 | u59 | 1 | mild |
55907 rows × 3 columns
We can summarize the data by grouping the observations by group and user, and then aggregating the mean:
[16]:
df.groupby(['group','user']).agg("mean")
[16]:
activity | ||
---|---|---|
group | user | |
mild | u02 | 0.922348 |
u04 | 1.466960 | |
u07 | 0.914457 | |
u16 | 0.702918 | |
u20 | 0.277946 | |
u24 | 0.938028 | |
u27 | 0.653724 | |
u31 | 0.929495 | |
u35 | 0.519455 | |
u43 | 0.809045 | |
u49 | 1.159767 | |
u58 | 0.620621 | |
u59 | 1.626263 | |
moderate | u18 | 0.445323 |
u52 | 1.051735 | |
moderately severe | u17 | 0.489510 |
u23 | 0.412884 | |
none | u00 | 1.182973 |
u03 | 0.176737 | |
u05 | 0.606742 | |
u09 | 1.095908 | |
u10 | 0.662612 | |
u14 | 1.005859 | |
u15 | 0.295990 | |
u30 | 0.933036 | |
u32 | 1.113593 | |
u36 | 0.936281 | |
u42 | 0.378851 | |
u44 | 0.292580 | |
u47 | 0.396026 | |
u51 | 0.828662 | |
u56 | 0.840967 | |
severe | u01 | 1.063660 |
u19 | 0.571792 | |
u33 | 0.733115 | |
u34 | 0.454789 | |
u45 | 0.441134 | |
u53 | 0.389404 |
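`agg` also accepts a list of functions, computing several statistics per group in one call. A sketch on synthetic activity data:

```python
import pandas as pd

# Synthetic group/activity data (values are illustrative)
df = pd.DataFrame({
    "group": ["mild", "mild", "none", "none"],
    "activity": [1.0, 2.0, 0.5, 1.5],
})

# Mean, standard deviation, and observation count per group
summary = df.groupby("group")["activity"].agg(["mean", "std", "count"])
print(summary)
```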
Summary statistics
There are many ways you may want to get an overview of your data.
Let’s first load mobile phone screen activity data.
[17]:
df = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')
df
[17]:
user | device | time | screen_status | datetime | |
---|---|---|---|---|---|
2020-01-09 02:06:41.573999882+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578528e+09 | 0 | 2020-01-09 02:06:41.573999882+02:00 |
2020-01-09 02:09:29.151999950+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 1 | 2020-01-09 02:09:29.151999950+02:00 |
2020-01-09 02:09:32.790999889+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 3 | 2020-01-09 02:09:32.790999889+02:00 |
2020-01-09 02:11:41.996000051+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 0 | 2020-01-09 02:11:41.996000051+02:00 |
2020-01-09 02:16:19.010999918+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 1 | 2020-01-09 02:16:19.010999918+02:00 |
... | ... | ... | ... | ... | ... |
2019-08-07 17:42:41.009999990+03:00 | dvWdLQesv21a | i8jmoIuoe12Mo | 1.565189e+09 | 2 | 2019-08-07 17:42:41.009999990+03:00 |
2019-08-07 18:32:41.009999990+03:00 | dvWdLQesv21a | i8jmoIuoe12Mo | 1.565192e+09 | 1 | 2019-08-07 18:32:41.009999990+03:00 |
2019-08-07 19:22:41.009999990+03:00 | dvWdLQesv21a | i8jmoIuoe12Mo | 1.565195e+09 | 0 | 2019-08-07 19:22:41.009999990+03:00 |
2019-08-07 20:12:41.009999990+03:00 | dvWdLQesv21a | i8jmoIuoe12Mo | 1.565198e+09 | 1 | 2019-08-07 20:12:41.009999990+03:00 |
2019-08-07 21:02:41.009999990+03:00 | dvWdLQesv21a | i8jmoIuoe12Mo | 1.565201e+09 | 2 | 2019-08-07 21:02:41.009999990+03:00 |
343 rows × 5 columns
Hourly data
It is easy to get the number of observations in each hour:
[18]:
hourly = df.groupby([df.index.date, df.index.hour]).size()
hourly
[18]:
2019-08-05 14 20
15 1
16 1
17 1
18 1
..
2020-01-09 18 4
20 4
21 19
22 3
23 12
Length: 94, dtype: int64
[19]:
# The index is the (day, hour) pairs and the
# value is the number at that time
print(f'{hourly.index[0]} had {hourly.iloc[0]} data points')
(datetime.date(2019, 8, 5), 14) had 20 data points
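The (day, hour) MultiIndex can be reshaped into a day-by-hour table with `unstack`, filling missing hours with zero. A self-contained sketch:

```python
import pandas as pd

# Synthetic timestamps: two observations in one hour, then two more hours
idx = pd.to_datetime([
    "2019-08-05 14:01", "2019-08-05 14:30", "2019-08-05 15:10", "2019-08-06 14:05",
])

# Count observations per (day, hour), then pivot hours into columns
hourly = pd.Series(1, index=idx).groupby([idx.date, idx.hour]).size()
table = hourly.unstack(fill_value=0)  # rows: days, columns: hours
print(table)
```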
Occurrence
In niimpy, occurrence is a simple measure of data completeness.
Occurrence is calculated as follows:
- Divide the data into time intervals (1 hour by default).
- Divide each interval into bins (5 bins by default).
- Count the number of bins in each interval that contain data; this count is the occurrence.
- Report the occurrence for each interval.
This is not a perfect measure, but it is reasonably effective and simple to calculate. For event-based data that is not collected continuously (such as the screen data used here), it indicates how actively the sensor has been used rather than how complete the recording is. Note that the assumption of uniformly present data does not hold for all data sources.
[20]:
occurrences = niimpy.util.occurrence(df.index)
occurrences.head()
[20]:
occurrence | |
---|---|
time | |
2019-08-05 14:00:00+03:00 | 4 |
2019-08-05 15:00:00+03:00 | 1 |
2019-08-05 16:00:00+03:00 | 1 |
2019-08-05 17:00:00+03:00 | 1 |
2019-08-05 18:00:00+03:00 | 1 |
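The same measure can be sketched by hand with plain pandas: floor each timestamp to its hour and to a 12-minute bin (5 bins per hour), then count the distinct bins per hour. This mirrors the defaults described above, not niimpy's exact implementation:

```python
import pandas as pd

# Synthetic timestamps: three distinct 12-minute bins in hour 14, one in hour 15
idx = pd.to_datetime([
    "2019-08-05 14:01", "2019-08-05 14:05",   # same 12-minute bin
    "2019-08-05 14:20", "2019-08-05 14:50",   # two more bins
    "2019-08-05 15:30",                       # one bin in the next hour
])

bins = idx.floor("12min")                     # 5 bins per hour
occurrence = pd.Series(bins).groupby(bins.floor("h")).nunique()
print(occurrence)
```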
We can create a simplified presentation (a pivot table) of the data with the pandas.DataFrame.pivot_table()
function:
[21]:
occurrences["day"] = occurrences.index.date
occurrences["hour"] = occurrences.index.hour
occurrences.head(5).pivot_table(index='day',columns='hour')
[21]:
occurrence | |||||
---|---|---|---|---|---|
hour | 14 | 15 | 16 | 17 | 18 |
day | |||||
2019-08-05 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 |
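Pivoting the full occurrence data works the same way; `fill_value` substitutes a default for day/hour cells with no observations. A self-contained sketch with made-up values:

```python
import pandas as pd

# Synthetic occurrence records (values are illustrative)
occ = pd.DataFrame({
    "day": ["2019-08-05", "2019-08-05", "2019-08-06"],
    "hour": [14, 15, 14],
    "occurrence": [4, 1, 2],
})

# Days as rows, hours as columns; missing (day, hour) cells become 0
table = occ.pivot_table(index="day", columns="hour", values="occurrence", fill_value=0)
print(table)
```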