{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Basic transformations\n", "\n", "This page shows some basic transformations you can do once you have read data. Really, it is simply a pandas crash course, since pandas provides all the operations you may need and there is no need for us to re-invent things. Pandas provides a solid but flexible base for us to build advanced operations on top of.\n", "\n", "You can read more at the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting single rows and columns\n", "Let's first import mobile phone battery status data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "TZ = 'Europe/Helsinki'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/u/24/rantahj1/unix/miniconda3/envs/niimpy/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import niimpy\n", "from niimpy import config\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Read the data\n", "df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Then check first rows of the dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimebattery_levelbattery_statusbattery_healthbattery_adaptordatetime
2020-01-09 02:20:02.924999952+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09743202020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09733202020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09723202020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+09722202020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+09722222020-01-09 02:35:38.953000069+02:00
\n", "
" ], "text/plain": [ " user device time \\\n", "2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "\n", " battery_level battery_status \\\n", "2020-01-09 02:20:02.924999952+02:00 74 3 \n", "2020-01-09 02:21:30.405999899+02:00 73 3 \n", "2020-01-09 02:24:12.805999994+02:00 72 3 \n", "2020-01-09 02:35:38.561000109+02:00 72 2 \n", "2020-01-09 02:35:38.953000069+02:00 72 2 \n", "\n", " battery_health battery_adaptor \\\n", "2020-01-09 02:20:02.924999952+02:00 2 0 \n", "2020-01-09 02:21:30.405999899+02:00 2 0 \n", "2020-01-09 02:24:12.805999994+02:00 2 0 \n", "2020-01-09 02:35:38.561000109+02:00 2 0 \n", "2020-01-09 02:35:38.953000069+02:00 2 2 \n", "\n", " datetime \n", "2020-01-09 02:20:02.924999952+02:00 2020-01-09 02:20:02.924999952+02:00 \n", "2020-01-09 02:21:30.405999899+02:00 2020-01-09 02:21:30.405999899+02:00 \n", "2020-01-09 02:24:12.805999994+02:00 2020-01-09 02:24:12.805999994+02:00 \n", "2020-01-09 02:35:38.561000109+02:00 2020-01-09 02:35:38.561000109+02:00 \n", "2020-01-09 02:35:38.953000069+02:00 2020-01-09 02:35:38.953000069+02:00 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Get a single column, in this case all **users**:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW\n", "2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW\n", "2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW\n", "2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW\n", "2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW\n", " ... \n", "2019-08-09 00:30:48.073999882+03:00 dvWdLQesv21a\n", "2019-08-09 00:32:40.717999935+03:00 dvWdLQesv21a\n", "2019-08-09 00:34:23.114000082+03:00 dvWdLQesv21a\n", "2019-08-09 00:36:05.505000114+03:00 dvWdLQesv21a\n", "2019-08-07 06:53:35.671000004+03:00 dvWdLQesv21a\n", "Name: user, Length: 634, dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['user']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Get a single row, in this case the **5th** (the first row is zero):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "user jd9INuQ5BBlW\n", "device 3p83yASkOb_B\n", "time 1578530138.953\n", "battery_level 72\n", "battery_status 2\n", "battery_health 2\n", "battery_adaptor 2\n", "datetime 2020-01-09 02:35:38.953000069+02:00\n", "Name: 2020-01-09 02:35:38.953000069+02:00, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[4]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Listing unique users\n", "We can list unique users by using `pandas.unique()` function." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['jd9INuQ5BBlW', 'iGyXetHE3S8u', 'dvWdLQesv21a'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['user'].unique()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## List unique values\n", "Same applies to other data features/columns." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 3, 2, 5, -1, -3])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['battery_status'].unique()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Extract data of only one subject\n", "We can extract data of only one subject by following:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimebattery_levelbattery_statusbattery_healthbattery_adaptordatetime
2020-01-09 02:20:02.924999952+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09743202020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09733202020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09723202020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+09722202020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+09722222020-01-09 02:35:38.953000069+02:00
...........................
2020-01-09 23:02:13.938999891+02:00jd9INuQ5BBlWOWd1Uau8POix1.578604e+09733202020-01-09 23:02:13.938999891+02:00
2020-01-09 23:10:37.262000084+02:00jd9INuQ5BBlWOWd1Uau8POix1.578604e+09733202020-01-09 23:10:37.262000084+02:00
2020-01-09 23:22:13.966000080+02:00jd9INuQ5BBlWOWd1Uau8POix1.578605e+09723202020-01-09 23:22:13.966000080+02:00
2020-01-09 23:32:13.959000111+02:00jd9INuQ5BBlWOWd1Uau8POix1.578606e+09713202020-01-09 23:32:13.959000111+02:00
2020-01-09 23:39:06.799999952+02:00jd9INuQ5BBlWOWd1Uau8POix1.578606e+09713202020-01-09 23:39:06.799999952+02:00
\n", "

373 rows × 8 columns

\n", "
" ], "text/plain": [ " user device time \\\n", "2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "... ... ... ... \n", "2020-01-09 23:02:13.938999891+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 \n", "2020-01-09 23:10:37.262000084+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 \n", "2020-01-09 23:22:13.966000080+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578605e+09 \n", "2020-01-09 23:32:13.959000111+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 \n", "2020-01-09 23:39:06.799999952+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 \n", "\n", " battery_level battery_status \\\n", "2020-01-09 02:20:02.924999952+02:00 74 3 \n", "2020-01-09 02:21:30.405999899+02:00 73 3 \n", "2020-01-09 02:24:12.805999994+02:00 72 3 \n", "2020-01-09 02:35:38.561000109+02:00 72 2 \n", "2020-01-09 02:35:38.953000069+02:00 72 2 \n", "... ... ... \n", "2020-01-09 23:02:13.938999891+02:00 73 3 \n", "2020-01-09 23:10:37.262000084+02:00 73 3 \n", "2020-01-09 23:22:13.966000080+02:00 72 3 \n", "2020-01-09 23:32:13.959000111+02:00 71 3 \n", "2020-01-09 23:39:06.799999952+02:00 71 3 \n", "\n", " battery_health battery_adaptor \\\n", "2020-01-09 02:20:02.924999952+02:00 2 0 \n", "2020-01-09 02:21:30.405999899+02:00 2 0 \n", "2020-01-09 02:24:12.805999994+02:00 2 0 \n", "2020-01-09 02:35:38.561000109+02:00 2 0 \n", "2020-01-09 02:35:38.953000069+02:00 2 2 \n", "... ... ... \n", "2020-01-09 23:02:13.938999891+02:00 2 0 \n", "2020-01-09 23:10:37.262000084+02:00 2 0 \n", "2020-01-09 23:22:13.966000080+02:00 2 0 \n", "2020-01-09 23:32:13.959000111+02:00 2 0 \n", "2020-01-09 23:39:06.799999952+02:00 2 0 \n", "\n", " datetime \n", "2020-01-09 02:20:02.924999952+02:00 2020-01-09 02:20:02.924999952+02:00 \n", "2020-01-09 02:21:30.405999899+02:00 2020-01-09 02:21:30.405999899+02:00 \n", "2020-01-09 02:24:12.805999994+02:00 2020-01-09 02:24:12.805999994+02:00 \n", "2020-01-09 02:35:38.561000109+02:00 2020-01-09 02:35:38.561000109+02:00 \n", "2020-01-09 02:35:38.953000069+02:00 2020-01-09 02:35:38.953000069+02:00 \n", "... ... \n", "2020-01-09 23:02:13.938999891+02:00 2020-01-09 23:02:13.938999891+02:00 \n", "2020-01-09 23:10:37.262000084+02:00 2020-01-09 23:10:37.262000084+02:00 \n", "2020-01-09 23:22:13.966000080+02:00 2020-01-09 23:22:13.966000080+02:00 \n", "2020-01-09 23:32:13.959000111+02:00 2020-01-09 23:32:13.959000111+02:00 \n", "2020-01-09 23:39:06.799999952+02:00 2020-01-09 23:39:06.799999952+02:00 \n", "\n", "[373 rows x 8 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['user'] == 'jd9INuQ5BBlW']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Renaming a column or columns\n", "Dataframe column can be renamed using `pandas.DataFrame.rename()` function." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimestampbattery_levelbattery_statusbattery_healthbattery_adaptordatetime
2020-01-09 02:20:02.924999952+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09743202020-01-09 02:20:02.924999952+02:00
2020-01-09 02:21:30.405999899+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09733202020-01-09 02:21:30.405999899+02:00
2020-01-09 02:24:12.805999994+02:00jd9INuQ5BBlW3p83yASkOb_B1.578529e+09723202020-01-09 02:24:12.805999994+02:00
2020-01-09 02:35:38.561000109+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+09722202020-01-09 02:35:38.561000109+02:00
2020-01-09 02:35:38.953000069+02:00jd9INuQ5BBlW3p83yASkOb_B1.578530e+09722222020-01-09 02:35:38.953000069+02:00
\n", "
" ], "text/plain": [ " user device timestamp \\\n", "2020-01-09 02:20:02.924999952+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:21:30.405999899+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:24:12.805999994+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 \n", "2020-01-09 02:35:38.561000109+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "2020-01-09 02:35:38.953000069+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 \n", "\n", " battery_level battery_status \\\n", "2020-01-09 02:20:02.924999952+02:00 74 3 \n", "2020-01-09 02:21:30.405999899+02:00 73 3 \n", "2020-01-09 02:24:12.805999994+02:00 72 3 \n", "2020-01-09 02:35:38.561000109+02:00 72 2 \n", "2020-01-09 02:35:38.953000069+02:00 72 2 \n", "\n", " battery_health battery_adaptor \\\n", "2020-01-09 02:20:02.924999952+02:00 2 0 \n", "2020-01-09 02:21:30.405999899+02:00 2 0 \n", "2020-01-09 02:24:12.805999994+02:00 2 0 \n", "2020-01-09 02:35:38.561000109+02:00 2 0 \n", "2020-01-09 02:35:38.953000069+02:00 2 2 \n", "\n", " datetime \n", "2020-01-09 02:20:02.924999952+02:00 2020-01-09 02:20:02.924999952+02:00 \n", "2020-01-09 02:21:30.405999899+02:00 2020-01-09 02:21:30.405999899+02:00 \n", "2020-01-09 02:24:12.805999994+02:00 2020-01-09 02:24:12.805999994+02:00 \n", "2020-01-09 02:35:38.561000109+02:00 2020-01-09 02:35:38.561000109+02:00 \n", "2020-01-09 02:35:38.953000069+02:00 2020-01-09 02:35:38.953000069+02:00 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(columns={'time': 'timestamp'}, inplace=True)\n", "df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Change datatypes\n", "Let's then check the dataframe datatypes:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "user object\n", "device object\n", "timestamp float64\n", "battery_level int64\n", "battery_status int64\n", "battery_health int64\n", "battery_adaptor int64\n", "datetime datetime64[ns, Europe/Helsinki]\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can change the datatypes with `pandas.astype()` function. Here we change **battery_health** datatype to categorical:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "user object\n", "device object\n", "timestamp float64\n", "battery_level int64\n", "battery_status int64\n", "battery_health category\n", "battery_adaptor int64\n", "datetime datetime64[ns, Europe/Helsinki]\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.astype({'battery_health': 'category'}).dtypes" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Transforming a column to a new value\n", "Dataframe values can be transformed (decoded etc.) into new values by using `pandas.transform()`function.\n", "\n", "Here we add one to the column values.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-01-09 02:20:02.924999952+02:00 1\n", "2020-01-09 02:21:30.405999899+02:00 1\n", "2020-01-09 02:24:12.805999994+02:00 1\n", "2020-01-09 02:35:38.561000109+02:00 1\n", "2020-01-09 02:35:38.953000069+02:00 3\n", " ..\n", "2019-08-09 00:30:48.073999882+03:00 2\n", "2019-08-09 00:32:40.717999935+03:00 2\n", "2019-08-09 00:34:23.114000082+03:00 2\n", "2019-08-09 00:36:05.505000114+03:00 2\n", "2019-08-07 06:53:35.671000004+03:00 2\n", "Name: battery_adaptor, Length: 634, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['battery_adaptor'].transform(lambda x: x + 1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Resample\n", "Dataframe down/upsampling can be done with `pandas.resample()` function.\n", "\n", "Here we downsample the data by hour and aggregate the mean:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-08-05 14:00:00+03:00 45.500000\n", "2019-08-05 15:00:00+03:00 44.000000\n", "2019-08-05 16:00:00+03:00 43.000000\n", "2019-08-05 17:00:00+03:00 42.000000\n", "2019-08-05 18:00:00+03:00 41.000000\n", " ... \n", "2020-01-09 19:00:00+02:00 86.166667\n", "2020-01-09 20:00:00+02:00 82.000000\n", "2020-01-09 21:00:00+02:00 78.428571\n", "2020-01-09 22:00:00+02:00 75.000000\n", "2020-01-09 23:00:00+02:00 72.000000\n", "Freq: h, Name: battery_level, Length: 3779, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['battery_level'].resample('h').agg(\"mean\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Groupby\n", "For groupwise data inspection, we can use `pandas.DataFrame.groupby()` function.\n", "\n", "Let's first load dataframe having several subjects belonging to different groups." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
useractivitygroup
timestamp
2013-03-27 06:00:00-05:00u002none
2013-03-27 07:00:00-05:00u001none
2013-03-27 08:00:00-05:00u002none
2013-03-27 09:00:00-05:00u003none
2013-03-27 10:00:00-05:00u004none
............
2013-05-31 18:00:00-05:00u595mild
2013-05-31 19:00:00-05:00u595mild
2013-05-31 20:00:00-05:00u594mild
2013-05-31 21:00:00-05:00u595mild
2013-05-31 22:00:00-05:00u591mild
\n", "

55907 rows × 3 columns

\n", "
" ], "text/plain": [ " user activity group\n", "timestamp \n", "2013-03-27 06:00:00-05:00 u00 2 none\n", "2013-03-27 07:00:00-05:00 u00 1 none\n", "2013-03-27 08:00:00-05:00 u00 2 none\n", "2013-03-27 09:00:00-05:00 u00 3 none\n", "2013-03-27 10:00:00-05:00 u00 4 none\n", "... ... ... ...\n", "2013-05-31 18:00:00-05:00 u59 5 mild\n", "2013-05-31 19:00:00-05:00 u59 5 mild\n", "2013-05-31 20:00:00-05:00 u59 4 mild\n", "2013-05-31 21:00:00-05:00 u59 5 mild\n", "2013-05-31 22:00:00-05:00 u59 1 mild\n", "\n", "[55907 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz='Europe/Helsinki')\n", "df.set_index('timestamp', inplace=True)\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can summarize the data by grouping the observations by **group** and **user**, and then aggregating the mean:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
activity
groupuser
mildu020.922348
u041.466960
u070.914457
u160.702918
u200.277946
u240.938028
u270.653724
u310.929495
u350.519455
u430.809045
u491.159767
u580.620621
u591.626263
moderateu180.445323
u521.051735
moderately severeu170.489510
u230.412884
noneu001.182973
u030.176737
u050.606742
u091.095908
u100.662612
u141.005859
u150.295990
u300.933036
u321.113593
u360.936281
u420.378851
u440.292580
u470.396026
u510.828662
u560.840967
severeu011.063660
u190.571792
u330.733115
u340.454789
u450.441134
u530.389404
\n", "
" ], "text/plain": [ " activity\n", "group user \n", "mild u02 0.922348\n", " u04 1.466960\n", " u07 0.914457\n", " u16 0.702918\n", " u20 0.277946\n", " u24 0.938028\n", " u27 0.653724\n", " u31 0.929495\n", " u35 0.519455\n", " u43 0.809045\n", " u49 1.159767\n", " u58 0.620621\n", " u59 1.626263\n", "moderate u18 0.445323\n", " u52 1.051735\n", "moderately severe u17 0.489510\n", " u23 0.412884\n", "none u00 1.182973\n", " u03 0.176737\n", " u05 0.606742\n", " u09 1.095908\n", " u10 0.662612\n", " u14 1.005859\n", " u15 0.295990\n", " u30 0.933036\n", " u32 1.113593\n", " u36 0.936281\n", " u42 0.378851\n", " u44 0.292580\n", " u47 0.396026\n", " u51 0.828662\n", " u56 0.840967\n", "severe u01 1.063660\n", " u19 0.571792\n", " u33 0.733115\n", " u34 0.454789\n", " u45 0.441134\n", " u53 0.389404" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['group','user']).agg(\"mean\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Summary statistics\n", "\n", "There are many ways you may want to get an overview of your data.\n", "\n", "Let's first load mobile phone screen activity data." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userdevicetimescreen_statusdatetime
2020-01-09 02:06:41.573999882+02:00jd9INuQ5BBlWOWd1Uau8POix1.578528e+0902020-01-09 02:06:41.573999882+02:00
2020-01-09 02:09:29.151999950+02:00jd9INuQ5BBlWOWd1Uau8POix1.578529e+0912020-01-09 02:09:29.151999950+02:00
2020-01-09 02:09:32.790999889+02:00jd9INuQ5BBlWOWd1Uau8POix1.578529e+0932020-01-09 02:09:32.790999889+02:00
2020-01-09 02:11:41.996000051+02:00jd9INuQ5BBlWOWd1Uau8POix1.578529e+0902020-01-09 02:11:41.996000051+02:00
2020-01-09 02:16:19.010999918+02:00jd9INuQ5BBlWOWd1Uau8POix1.578529e+0912020-01-09 02:16:19.010999918+02:00
..................
2019-08-07 17:42:41.009999990+03:00dvWdLQesv21ai8jmoIuoe12Mo1.565189e+0922019-08-07 17:42:41.009999990+03:00
2019-08-07 18:32:41.009999990+03:00dvWdLQesv21ai8jmoIuoe12Mo1.565192e+0912019-08-07 18:32:41.009999990+03:00
2019-08-07 19:22:41.009999990+03:00dvWdLQesv21ai8jmoIuoe12Mo1.565195e+0902019-08-07 19:22:41.009999990+03:00
2019-08-07 20:12:41.009999990+03:00dvWdLQesv21ai8jmoIuoe12Mo1.565198e+0912019-08-07 20:12:41.009999990+03:00
2019-08-07 21:02:41.009999990+03:00dvWdLQesv21ai8jmoIuoe12Mo1.565201e+0922019-08-07 21:02:41.009999990+03:00
\n", "

343 rows × 5 columns

\n", "
" ], "text/plain": [ " user device \\\n", "2020-01-09 02:06:41.573999882+02:00 jd9INuQ5BBlW OWd1Uau8POix \n", "2020-01-09 02:09:29.151999950+02:00 jd9INuQ5BBlW OWd1Uau8POix \n", "2020-01-09 02:09:32.790999889+02:00 jd9INuQ5BBlW OWd1Uau8POix \n", "2020-01-09 02:11:41.996000051+02:00 jd9INuQ5BBlW OWd1Uau8POix \n", "2020-01-09 02:16:19.010999918+02:00 jd9INuQ5BBlW OWd1Uau8POix \n", "... ... ... \n", "2019-08-07 17:42:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n", "2019-08-07 18:32:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n", "2019-08-07 19:22:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n", "2019-08-07 20:12:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n", "2019-08-07 21:02:41.009999990+03:00 dvWdLQesv21a i8jmoIuoe12Mo \n", "\n", " time screen_status \\\n", "2020-01-09 02:06:41.573999882+02:00 1.578528e+09 0 \n", "2020-01-09 02:09:29.151999950+02:00 1.578529e+09 1 \n", "2020-01-09 02:09:32.790999889+02:00 1.578529e+09 3 \n", "2020-01-09 02:11:41.996000051+02:00 1.578529e+09 0 \n", "2020-01-09 02:16:19.010999918+02:00 1.578529e+09 1 \n", "... ... ... \n", "2019-08-07 17:42:41.009999990+03:00 1.565189e+09 2 \n", "2019-08-07 18:32:41.009999990+03:00 1.565192e+09 1 \n", "2019-08-07 19:22:41.009999990+03:00 1.565195e+09 0 \n", "2019-08-07 20:12:41.009999990+03:00 1.565198e+09 1 \n", "2019-08-07 21:02:41.009999990+03:00 1.565201e+09 2 \n", "\n", " datetime \n", "2020-01-09 02:06:41.573999882+02:00 2020-01-09 02:06:41.573999882+02:00 \n", "2020-01-09 02:09:29.151999950+02:00 2020-01-09 02:09:29.151999950+02:00 \n", "2020-01-09 02:09:32.790999889+02:00 2020-01-09 02:09:32.790999889+02:00 \n", "2020-01-09 02:11:41.996000051+02:00 2020-01-09 02:11:41.996000051+02:00 \n", "2020-01-09 02:16:19.010999918+02:00 2020-01-09 02:16:19.010999918+02:00 \n", "... ... \n", "2019-08-07 17:42:41.009999990+03:00 2019-08-07 17:42:41.009999990+03:00 \n", "2019-08-07 18:32:41.009999990+03:00 2019-08-07 18:32:41.009999990+03:00 \n", "2019-08-07 19:22:41.009999990+03:00 2019-08-07 19:22:41.009999990+03:00 \n", "2019-08-07 20:12:41.009999990+03:00 2019-08-07 20:12:41.009999990+03:00 \n", "2019-08-07 21:02:41.009999990+03:00 2019-08-07 21:02:41.009999990+03:00 \n", "\n", "[343 rows x 5 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Hourly data\n", "It is easy to get the amount of data (observations) in each hour" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-08-05 14 20\n", " 15 1\n", " 16 1\n", " 17 1\n", " 18 1\n", " ..\n", "2020-01-09 18 4\n", " 20 4\n", " 21 19\n", " 22 3\n", " 23 12\n", "Length: 94, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hourly = df.groupby([df.index.date, df.index.hour]).size()\n", "hourly" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(datetime.date(2019, 8, 5), 14) had 20 data points\n" ] } ], "source": [ "# The index is the (day, hour) pairs and the\n", "# value is the number at that time\n", "print('%s had %d data points'%(hourly.index[0], hourly.iloc[0]))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Occurrence\n", "\n", "In niimpy, **occurrence** is a way to see the completeness of data.\n", "\n", "Occurence is calculated as:\n", "* Divide the data into time intervals (1 hour by default).\n", "* Divide each interval into bins (5 bins by default).\n", "* Count the number of bins that have data. This is $occurrence$.\n", "* For each interval, report $occurrence$.\n", "\n", "This isn't the perfect measure, but is reasonably effective and simple to calculate. For data which isn't continuous (like screen data we are actually using), it shows how much the sensor has been used.\n", "\n", "Note that the \"uniformly present data\" is not true for all data sources." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
occurrence
time
2019-08-05 14:00:00+03:004
2019-08-05 15:00:00+03:001
2019-08-05 16:00:00+03:001
2019-08-05 17:00:00+03:001
2019-08-05 18:00:00+03:001
\n", "
" ], "text/plain": [ " occurrence\n", "time \n", "2019-08-05 14:00:00+03:00 4\n", "2019-08-05 15:00:00+03:00 1\n", "2019-08-05 16:00:00+03:00 1\n", "2019-08-05 17:00:00+03:00 1\n", "2019-08-05 18:00:00+03:00 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "occurrences = niimpy.util.occurrence(df.index)\n", "occurrences.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can create a simplified presentation (pivot table) for the data by using `pandas.pivot()`function:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
occurrence
hour1415161718
day
2019-08-054.01.01.01.01.0
\n", "
" ], "text/plain": [ " occurrence \n", "hour 14 15 16 17 18\n", "day \n", "2019-08-05 4.0 1.0 1.0 1.0 1.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "occurrences[\"day\"] = occurrences.index.date\n", "occurrences[\"hour\"] = occurrences.index.hour\n", "occurrences.head(5).pivot_table(index='day',columns='hour')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }