{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading\n",
    "\n",
    "In principle, Niimpy can deal with any files of any format - you only need to convert them to a DataFrame.  Still, it is very useful to have some common formats, so we present two standard formats with default readers:\n",
    "\n",
    "* **CSV files** are very standard and normal to create and understand, but in order to deal with them everything must be loaded into memory.\n",
    "* **Google TakeOut** provides a large selection of data in different formats. We provide readers most commonly used data types.\n",
    "* **MHealth** is a common format for health data.\n",
    "* **sqlite3 databases**, which requires sqlite3 to read, provides more power for filtering and automatic processing without reading everything into memory.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DataFrame format (in-memory)\n",
    "\n",
    "In-memory, data is stored in a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).  This is basically a normal dataframe.  There are some standardized columns (see the [schema](/about/schema.html)) and the index is a DatetimeIndex."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CSV files\n",
    "\n",
    "CSV files should have a header that lists the column names and generally be readable by `pandas.read_csv`.\n",
    "\n",
    "Reading these can be done with `niimpy.read_csv`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import niimpy \n",
    "import niimpy.config as config\n",
    "\n",
    "# Read the battery data\n",
    "df= niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sqlite3 databases\n",
    "\n",
    "For the purposes of niimpy, sqlite3 databases can generally be seen as supercharged CSV files.\n",
    "\n",
    "A single database file could contain multiple datasets within it, thus when reading them a **table name** must be specified.\n",
    "\n",
    "One reads the entire database into memory using `sqlite.read_sqlite`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the sqlite3 data\n",
    "df= niimpy.read_sqlite(config.SQLITE_SINGLEUSER_PATH, table=\"AwareScreen\", tz='Europe/Helsinki')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can list the tables within a database using `niimpy.reading.read.read_sqlite_tables`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'AwareScreen'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "niimpy.reading.read.read_sqlite_tables(config.SQLITE_SINGLEUSER_PATH)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sqlite3 files are highly recommended as a data storage format, since many common exploration options can be done within the database itself without reading the whole data into memory or writing an iterator.  However, the interface is more difficult to use.  Niimpy (before 2021-07) used this as its primary interface, but since then this interface has been de-emphasized.  You can read more in [the database section](/database.html), but this is only recommended if you need efficiency when using massive amounts of data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note of incorrect encoding\n",
    "\n",
    "Sometimes a sqlite3 database has incorrect metadata for character encoding. This is a common issue with languages that cannot be encoded with `utf-8`, such as Finnish. When this happens, use the function `niimpy.utils.set_encoding()`.\n",
    "\n",
    "```Python\n",
    "import niimpy\n",
    "df = niimpy.read_sqlite(\n",
    "    config.SQLITE_SINGLEUSER_PATH,\n",
    "    table=\"AwareScreen\",\n",
    "    tz='Europe/Helsinki'\n",
    ")\n",
    "df = niimpy.utils.set_encoding(\n",
    "    df,\n",
    "    to_encoding = 'utf-8',\n",
    "    from_encoding = 'iso-8859-1'\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Google TakeOut\n",
    "\n",
    "Google takeout contains a many different types of data and new types are added as Google creates services or changes data storage methods. Readers are currently available for location data, emails, and activity data from the fit app. For other data types, the user needs to manually convert them into a Niimpy compatible Pandas DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data downloaded from Google Takeout is compressed as a zip archive to conserve disk space. To \n",
    "# demonstrate reading for the zip file, we will first compress our example data into the zip format.\n",
    "import zipfile\n",
    "test_zip = zipfile.ZipFile(\"test.zip\", mode=\"w\")\n",
    "\n",
    "for dirpath,dirs,files in os.walk(config.GOOGLE_TAKEOUT_DIR):\n",
    "    for f in files:\n",
    "        filename = os.path.join(dirpath, f)\n",
    "        filename_in_zip = filename.replace(config.GOOGLE_TAKEOUT_DIR, \"\")\n",
    "        test_zip.write(filename, filename_in_zip)\n",
    "\n",
    "test_zip.close()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Location"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>accuracy</th>\n",
       "      <th>source</th>\n",
       "      <th>device</th>\n",
       "      <th>placeid</th>\n",
       "      <th>formfactor</th>\n",
       "      <th>altitude</th>\n",
       "      <th>verticalaccuracy</th>\n",
       "      <th>platformtype</th>\n",
       "      <th>servertimestamp</th>\n",
       "      <th>devicetimestamp</th>\n",
       "      <th>batterycharging</th>\n",
       "      <th>velocity</th>\n",
       "      <th>heading</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>inferred_latitude</th>\n",
       "      <th>inferred_longitude</th>\n",
       "      <th>activity_type</th>\n",
       "      <th>activity_inference_confidence</th>\n",
       "      <th>user</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2016-08-12 19:29:43.821000+00:00</th>\n",
       "      <td>25</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.997488</td>\n",
       "      <td>-78.922194</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-08-12 19:30:49.531000+00:00</th>\n",
       "      <td>21</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.997559</td>\n",
       "      <td>-78.922504</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>STILL</td>\n",
       "      <td>62.0</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-08-12 19:31:49.531000+00:00</th>\n",
       "      <td>21</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>ChIJS_5Nmuz1jUYRGYf3QiiZco4</td>\n",
       "      <td>PHONE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.997559</td>\n",
       "      <td>-78.922504</td>\n",
       "      <td>60.187135</td>\n",
       "      <td>24.824478</td>\n",
       "      <td>STILL</td>\n",
       "      <td>62.0</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-08-12 21:15:55.295000+00:00</th>\n",
       "      <td>1500</td>\n",
       "      <td>CELL</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>36.000870</td>\n",
       "      <td>-78.923343</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ON_FOOT</td>\n",
       "      <td>54.0</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-08-12 21:16:33+00:00</th>\n",
       "      <td>8</td>\n",
       "      <td>GPS</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.997250</td>\n",
       "      <td>-78.923989</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-08-12 21:16:48+00:00</th>\n",
       "      <td>3</td>\n",
       "      <td>GPS</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.997236</td>\n",
       "      <td>-78.924124</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-11-21 11:29:21.730000+00:00</th>\n",
       "      <td>13</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>1832214273</td>\n",
       "      <td>ChIJw1WKQev1jUYRCdZmYR-HCiI</td>\n",
       "      <td>PHONE</td>\n",
       "      <td>28.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>ANDROID</td>\n",
       "      <td>2023-11-21T11:29:24.747Z</td>\n",
       "      <td>2023-11-21T11:29:24.350Z</td>\n",
       "      <td>False</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>60.186818</td>\n",
       "      <td>24.821288</td>\n",
       "      <td>60.186816</td>\n",
       "      <td>24.821288</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>f1066868-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                  accuracy source      device  \\\n",
       "timestamp                                                       \n",
       "2016-08-12 19:29:43.821000+00:00        25   WIFI  -577680260   \n",
       "2016-08-12 19:30:49.531000+00:00        21   WIFI  -577680260   \n",
       "2016-08-12 19:31:49.531000+00:00        21   WIFI  -577680260   \n",
       "2016-08-12 21:15:55.295000+00:00      1500   CELL  -577680260   \n",
       "2016-08-12 21:16:33+00:00                8    GPS  -577680260   \n",
       "2016-08-12 21:16:48+00:00                3    GPS  -577680260   \n",
       "2023-11-21 11:29:21.730000+00:00        13   WIFI  1832214273   \n",
       "\n",
       "                                                      placeid formfactor  \\\n",
       "timestamp                                                                  \n",
       "2016-08-12 19:29:43.821000+00:00                          NaN        NaN   \n",
       "2016-08-12 19:30:49.531000+00:00                          NaN        NaN   \n",
       "2016-08-12 19:31:49.531000+00:00  ChIJS_5Nmuz1jUYRGYf3QiiZco4      PHONE   \n",
       "2016-08-12 21:15:55.295000+00:00                          NaN        NaN   \n",
       "2016-08-12 21:16:33+00:00                                 NaN        NaN   \n",
       "2016-08-12 21:16:48+00:00                                 NaN        NaN   \n",
       "2023-11-21 11:29:21.730000+00:00  ChIJw1WKQev1jUYRCdZmYR-HCiI      PHONE   \n",
       "\n",
       "                                  altitude  verticalaccuracy platformtype  \\\n",
       "timestamp                                                                   \n",
       "2016-08-12 19:29:43.821000+00:00       NaN               NaN          NaN   \n",
       "2016-08-12 19:30:49.531000+00:00       NaN               NaN          NaN   \n",
       "2016-08-12 19:31:49.531000+00:00       NaN               NaN          NaN   \n",
       "2016-08-12 21:15:55.295000+00:00       NaN               NaN          NaN   \n",
       "2016-08-12 21:16:33+00:00              NaN               NaN          NaN   \n",
       "2016-08-12 21:16:48+00:00              NaN               NaN          NaN   \n",
       "2023-11-21 11:29:21.730000+00:00      28.0               2.0      ANDROID   \n",
       "\n",
       "                                           servertimestamp  \\\n",
       "timestamp                                                    \n",
       "2016-08-12 19:29:43.821000+00:00                       NaN   \n",
       "2016-08-12 19:30:49.531000+00:00                       NaN   \n",
       "2016-08-12 19:31:49.531000+00:00                       NaN   \n",
       "2016-08-12 21:15:55.295000+00:00                       NaN   \n",
       "2016-08-12 21:16:33+00:00                              NaN   \n",
       "2016-08-12 21:16:48+00:00                              NaN   \n",
       "2023-11-21 11:29:21.730000+00:00  2023-11-21T11:29:24.747Z   \n",
       "\n",
       "                                           devicetimestamp batterycharging  \\\n",
       "timestamp                                                                    \n",
       "2016-08-12 19:29:43.821000+00:00                       NaN             NaN   \n",
       "2016-08-12 19:30:49.531000+00:00                       NaN             NaN   \n",
       "2016-08-12 19:31:49.531000+00:00                       NaN             NaN   \n",
       "2016-08-12 21:15:55.295000+00:00                       NaN             NaN   \n",
       "2016-08-12 21:16:33+00:00                              NaN             NaN   \n",
       "2016-08-12 21:16:48+00:00                              NaN             NaN   \n",
       "2023-11-21 11:29:21.730000+00:00  2023-11-21T11:29:24.350Z           False   \n",
       "\n",
       "                                 velocity heading   latitude  longitude  \\\n",
       "timestamp                                                                 \n",
       "2016-08-12 19:29:43.821000+00:00      NaN     NaN  35.997488 -78.922194   \n",
       "2016-08-12 19:30:49.531000+00:00      NaN     NaN  35.997559 -78.922504   \n",
       "2016-08-12 19:31:49.531000+00:00      NaN     NaN  35.997559 -78.922504   \n",
       "2016-08-12 21:15:55.295000+00:00      NaN     NaN  36.000870 -78.923343   \n",
       "2016-08-12 21:16:33+00:00             NaN     NaN  35.997250 -78.923989   \n",
       "2016-08-12 21:16:48+00:00             NaN     NaN  35.997236 -78.924124   \n",
       "2023-11-21 11:29:21.730000+00:00                   60.186818  24.821288   \n",
       "\n",
       "                                  inferred_latitude  inferred_longitude  \\\n",
       "timestamp                                                                 \n",
       "2016-08-12 19:29:43.821000+00:00                NaN                 NaN   \n",
       "2016-08-12 19:30:49.531000+00:00                NaN                 NaN   \n",
       "2016-08-12 19:31:49.531000+00:00          60.187135           24.824478   \n",
       "2016-08-12 21:15:55.295000+00:00                NaN                 NaN   \n",
       "2016-08-12 21:16:33+00:00                       NaN                 NaN   \n",
       "2016-08-12 21:16:48+00:00                       NaN                 NaN   \n",
       "2023-11-21 11:29:21.730000+00:00          60.186816           24.821288   \n",
       "\n",
       "                                 activity_type  activity_inference_confidence  \\\n",
       "timestamp                                                                       \n",
       "2016-08-12 19:29:43.821000+00:00           NaN                            NaN   \n",
       "2016-08-12 19:30:49.531000+00:00         STILL                           62.0   \n",
       "2016-08-12 19:31:49.531000+00:00         STILL                           62.0   \n",
       "2016-08-12 21:15:55.295000+00:00       ON_FOOT                           54.0   \n",
       "2016-08-12 21:16:33+00:00                  NaN                            NaN   \n",
       "2016-08-12 21:16:48+00:00                  NaN                            NaN   \n",
       "2023-11-21 11:29:21.730000+00:00           NaN                            NaN   \n",
       "\n",
       "                                                                  user  \n",
       "timestamp                                                               \n",
       "2016-08-12 19:29:43.821000+00:00  f1066868-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2016-08-12 19:30:49.531000+00:00  f1066868-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2016-08-12 19:31:49.531000+00:00  f1066868-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2016-08-12 21:15:55.295000+00:00  f1066868-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2016-08-12 21:16:33+00:00         f1066868-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2016-08-12 21:16:48+00:00         f1066868-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2023-11-21 11:29:21.730000+00:00  f1066868-2e1d-11ef-abd1-b0dcef010c43  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Next we read location data from the zip file.\n",
    "import niimpy\n",
    "import niimpy.config as config\n",
    "import niimpy.preprocessing.location as nilo\n",
    "\n",
    "data = niimpy.reading.google_takeout.location_history(\"test.zip\")\n",
    "data = nilo.filter_location(\n",
    "    data,\n",
    "    latitude_column = \"latitude\",\n",
    "    longitude_column = \"longitude\",\n",
    "    remove_disabled=False, remove_network=False, remove_zeros=True\n",
    ")\n",
    "data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Activity\n",
    "\n",
    "Activity data is read similarly. The data contains many columns with missing data, so in order to use the step count data, for example, we must set the NaN values to 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>calories_(kcal)</th>\n",
       "      <th>step_count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2023-11-20 00:00:00+02:00</th>\n",
       "      <td>1752.250027</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-11-21 00:00:00+02:00</th>\n",
       "      <td>1996.456746</td>\n",
       "      <td>89.900002</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           calories_(kcal)  step_count\n",
       "timestamp                                             \n",
       "2023-11-20 00:00:00+02:00      1752.250027    0.000000\n",
       "2023-11-21 00:00:00+02:00      1996.456746   89.900002"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = niimpy.reading.google_takeout.activity(\"test.zip\")\n",
    "data.loc[data[\"step_count\"].isna(), \"step_count\"] = 0\n",
    "data[[\"calories_(kcal)\", \"step_count\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Google Fit Takeout data can contain more detailed information. To access this data, we first list all data types stored in Google Fit data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filename</th>\n",
       "      <th>derived</th>\n",
       "      <th>content</th>\n",
       "      <th>source</th>\n",
       "      <th>source type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>raw_com.google.step_count.delta_fi.polar.polar...</td>\n",
       "      <td>raw</td>\n",
       "      <td>step_count.delta</td>\n",
       "      <td>fi.test.test</td>\n",
       "      <td>step count</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>derived_com.google.step_count.delta_com.google...</td>\n",
       "      <td>derived</td>\n",
       "      <td>step_count.delta</td>\n",
       "      <td>com.google.android.fit</td>\n",
       "      <td>top_level</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>derived_com.google.respiratory_rate_com.google...</td>\n",
       "      <td>derived</td>\n",
       "      <td>respiratory_rate</td>\n",
       "      <td>com.google.android.gms</td>\n",
       "      <td>merge_respiratory_rate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>raw_com.google.sleep.segment_com.urbandroid.sl...</td>\n",
       "      <td>raw</td>\n",
       "      <td>sleep.segment</td>\n",
       "      <td>com.urbandroid.sleep</td>\n",
       "      <td>saa-generic</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>raw_com.google.heart_rate.summary_fi.polar.pol...</td>\n",
       "      <td>raw</td>\n",
       "      <td>heart_rate.summary</td>\n",
       "      <td>fi.test.testtest</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>raw_com.google.calories.expended_fi.polar.pola...</td>\n",
       "      <td>raw</td>\n",
       "      <td>calories.expended</td>\n",
       "      <td>fi.manual.manualtest</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>derived_com.google.sleep.segment_com.google.an...</td>\n",
       "      <td>derived</td>\n",
       "      <td>sleep.segment</td>\n",
       "      <td>com.google.android.gms</td>\n",
       "      <td>merged</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>raw_com.google.nutrition_com.fourtechnologies....</td>\n",
       "      <td>raw</td>\n",
       "      <td>nutrition</td>\n",
       "      <td>com.manual.test</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>raw_com.google.heart_rate.bpm_nl.appyhapps.hea...</td>\n",
       "      <td>raw</td>\n",
       "      <td>heart_rate.bpm</td>\n",
       "      <td>fi.test.test</td>\n",
       "      <td>Test - heart rate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>raw_com.google.sleep.segment_nl.appyhapps.heal...</td>\n",
       "      <td>raw</td>\n",
       "      <td>sleep.segment</td>\n",
       "      <td>nl.test.healthsync</td>\n",
       "      <td>HealthSync_sleep_by_Health_Sync_1715724879000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>raw_com.google.heart_minutes_com.google.androi...</td>\n",
       "      <td>raw</td>\n",
       "      <td>heart_minutes</td>\n",
       "      <td>com.google.android.apps.fitness</td>\n",
       "      <td>user_input</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             filename  derived  \\\n",
       "0   raw_com.google.step_count.delta_fi.polar.polar...      raw   \n",
       "1   derived_com.google.step_count.delta_com.google...  derived   \n",
       "2   derived_com.google.respiratory_rate_com.google...  derived   \n",
       "3   raw_com.google.sleep.segment_com.urbandroid.sl...      raw   \n",
       "4   raw_com.google.heart_rate.summary_fi.polar.pol...      raw   \n",
       "5   raw_com.google.calories.expended_fi.polar.pola...      raw   \n",
       "6   derived_com.google.sleep.segment_com.google.an...  derived   \n",
       "7   raw_com.google.nutrition_com.fourtechnologies....      raw   \n",
       "8   raw_com.google.heart_rate.bpm_nl.appyhapps.hea...      raw   \n",
       "9   raw_com.google.sleep.segment_nl.appyhapps.heal...      raw   \n",
       "10  raw_com.google.heart_minutes_com.google.androi...      raw   \n",
       "\n",
       "               content                           source  \\\n",
       "0     step_count.delta                     fi.test.test   \n",
       "1     step_count.delta           com.google.android.fit   \n",
       "2     respiratory_rate           com.google.android.gms   \n",
       "3        sleep.segment             com.urbandroid.sleep   \n",
       "4   heart_rate.summary                 fi.test.testtest   \n",
       "5    calories.expended             fi.manual.manualtest   \n",
       "6        sleep.segment           com.google.android.gms   \n",
       "7            nutrition                  com.manual.test   \n",
       "8       heart_rate.bpm                     fi.test.test   \n",
       "9        sleep.segment               nl.test.healthsync   \n",
       "10       heart_minutes  com.google.android.apps.fitness   \n",
       "\n",
       "                                      source type  \n",
       "0                                      step count  \n",
       "1                                       top_level  \n",
       "2                          merge_respiratory_rate  \n",
       "3                                     saa-generic  \n",
       "4                                                  \n",
       "5                                                  \n",
       "6                                          merged  \n",
       "7                                                  \n",
       "8                               Test - heart rate  \n",
       "9   HealthSync_sleep_by_Health_Sync_1715724879000  \n",
       "10                                     user_input  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "niimpy.reading.google_takeout.fit_list_data(\"test.zip\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have two types of step count data. The original raw data is in \"raw_com.google.step_count.delta_fi.polar.polar.json\". You can read this data using `fit_read_data`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>measurement_index</th>\n",
       "      <th>id</th>\n",
       "      <th>value</th>\n",
       "      <th>end_time</th>\n",
       "      <th>modified_time</th>\n",
       "      <th>datatype</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2024-04-30 17:59:58.999</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>2024-04-30 19:17:00</td>\n",
       "      <td>2024-05-01 03:15:57.130</td>\n",
       "      <td>step_count.delta</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-04-30 17:59:58.999</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>2024-04-30 19:17:30</td>\n",
       "      <td>2024-04-30 20:02:14.506</td>\n",
       "      <td>step_count.delta</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         measurement_index  id  value            end_time  \\\n",
       "timestamp                                                                   \n",
       "2024-04-30 17:59:58.999                  0   0      8 2024-04-30 19:17:00   \n",
       "2024-04-30 17:59:58.999                  1   0      8 2024-04-30 19:17:30   \n",
       "\n",
       "                                  modified_time          datatype  \n",
       "timestamp                                                          \n",
       "2024-04-30 17:59:58.999 2024-05-01 03:15:57.130  step_count.delta  \n",
       "2024-04-30 17:59:58.999 2024-04-30 20:02:14.506  step_count.delta  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "niimpy.reading.google_takeout.fit_read_data(\"test.zip\", \"raw_com.google.step_count.delta_fi.polar.polar.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Email and Chat\n",
    "\n",
    "The `google_takeout.email_activity` and `google_takeout.chat` function will read and process all emails in the GMail mailbox and all Google chat messages respectively. They return a dataframe containing metadata and statistics of each message. Email addresses, email IDs and names are replaced by numerical indexes.\n",
    "\n",
    "The email files can be large and processing them could take some time. You can also include sentiment analysis of each email using the `sentiment` parameter. For this, we recommend using a system with a GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:466: UserWarning: Could not parse message timestamp: 2023-12-15 12:19:43+00:00\n",
      "  warnings.warn(f\"Could not parse message timestamp: {received}\")\n",
      "/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:480: UserWarning: Failed to format received time: Sat, 15 DeNot a timec 2023 12:19:43 0000\n",
      "  warnings.warn(f\"Failed to format received time: {received}\")\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>received</th>\n",
       "      <th>from</th>\n",
       "      <th>to</th>\n",
       "      <th>cc</th>\n",
       "      <th>bcc</th>\n",
       "      <th>message_id</th>\n",
       "      <th>in_reply_to</th>\n",
       "      <th>character_count</th>\n",
       "      <th>word_count</th>\n",
       "      <th>message_type</th>\n",
       "      <th>user</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:19:43+00:00</th>\n",
       "      <td>NaT</td>\n",
       "      <td>0</td>\n",
       "      <td>[2]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>33</td>\n",
       "      <td>6</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>f10e4d26-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:29:43+00:00</th>\n",
       "      <td>NaT</td>\n",
       "      <td>0</td>\n",
       "      <td>[6, 2]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>31</td>\n",
       "      <td>6</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>f10e4d26-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:29:43+00:00</th>\n",
       "      <td>NaT</td>\n",
       "      <td>0</td>\n",
       "      <td>[6, 2]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>28</td>\n",
       "      <td>5</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>f10e4d26-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:39:43+00:00</th>\n",
       "      <td>2023-12-15 12:19:43+00:00</td>\n",
       "      <td>6</td>\n",
       "      <td>[0]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>30</td>\n",
       "      <td>5</td>\n",
       "      <td>incoming</td>\n",
       "      <td>f10e4d26-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:39:43+00:00</th>\n",
       "      <td>Sat, 15 DeNot a timec 2023 12:19:43 0000</td>\n",
       "      <td>6</td>\n",
       "      <td>[0]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>51</td>\n",
       "      <td>7</td>\n",
       "      <td>incoming</td>\n",
       "      <td>f10e4d26-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                           received  from  \\\n",
       "timestamp                                                                   \n",
       "2023-12-15 12:19:43+00:00                                       NaT     0   \n",
       "2023-12-15 12:29:43+00:00                                       NaT     0   \n",
       "2023-12-15 12:29:43+00:00                                       NaT     0   \n",
       "2023-12-15 12:39:43+00:00                 2023-12-15 12:19:43+00:00     6   \n",
       "2023-12-15 12:39:43+00:00  Sat, 15 DeNot a timec 2023 12:19:43 0000     6   \n",
       "\n",
       "                               to   cc  bcc  message_id in_reply_to  \\\n",
       "timestamp                                                             \n",
       "2023-12-15 12:19:43+00:00     [2]   []   []           2        <NA>   \n",
       "2023-12-15 12:29:43+00:00  [6, 2]   []   []           1        <NA>   \n",
       "2023-12-15 12:29:43+00:00  [6, 2]   []   []           1        <NA>   \n",
       "2023-12-15 12:39:43+00:00     [0]  [4]  [4]           3           2   \n",
       "2023-12-15 12:39:43+00:00     [0]  [4]  [4]           3           2   \n",
       "\n",
       "                           character_count  word_count message_type  \\\n",
       "timestamp                                                             \n",
       "2023-12-15 12:19:43+00:00               33           6     outgoing   \n",
       "2023-12-15 12:29:43+00:00               31           6     outgoing   \n",
       "2023-12-15 12:29:43+00:00               28           5     outgoing   \n",
       "2023-12-15 12:39:43+00:00               30           5     incoming   \n",
       "2023-12-15 12:39:43+00:00               51           7     incoming   \n",
       "\n",
       "                                                           user  \n",
       "timestamp                                                        \n",
       "2023-12-15 12:19:43+00:00  f10e4d26-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2023-12-15 12:29:43+00:00  f10e4d26-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2023-12-15 12:29:43+00:00  f10e4d26-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2023-12-15 12:39:43+00:00  f10e4d26-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2023-12-15 12:39:43+00:00  f10e4d26-2e1d-11ef-abd1-b0dcef010c43  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "niimpy.reading.google_takeout.email_activity(\"test.zip\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Google chat data is read similarly.\n",
    "\n",
    "Additionally, you can run sentiment analysis on email and chat messages using the `sentiment` option. To run sentiment analysis, you need to install the optional dependency by running.\n",
    "\n",
    "```bash\n",
    "pip install niimpy[sentiment]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:648: UserWarning: Multiple user emails found. Using the first one.\n",
      "  warnings.warn(\"Multiple user emails found. Using the first one.\")\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/4 [00:01<?, ?it/s]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>topic_id</th>\n",
       "      <th>message_id</th>\n",
       "      <th>chat_group</th>\n",
       "      <th>creator_name</th>\n",
       "      <th>creator_email</th>\n",
       "      <th>creator_user_type</th>\n",
       "      <th>character_count</th>\n",
       "      <th>word_count</th>\n",
       "      <th>user</th>\n",
       "      <th>message_type</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>sentiment_score</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2024-01-30 13:27:33+00:00</th>\n",
       "      <td>iDImYGRudHk</td>\n",
       "      <td>9guW_0AAAAE/iDImYGRudHk/iDImYGRudHk</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Human</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>f1144500-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>none</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-01-30 13:29:10+00:00</th>\n",
       "      <td>cVEoT9zu63M</td>\n",
       "      <td>9guW_0AAAAE/cVEoT9zu63M/cVEoT9zu63M</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Human</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>f1144500-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>incoming</td>\n",
       "      <td>none</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-01-30 13:29:17+00:00</th>\n",
       "      <td>qEfkUgUvX80</td>\n",
       "      <td>9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Human</td>\n",
       "      <td>11</td>\n",
       "      <td>3</td>\n",
       "      <td>f1144500-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>incoming</td>\n",
       "      <td>positive</td>\n",
       "      <td>0.535310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-01-30 13:29:17+00:00</th>\n",
       "      <td>qEfkUgUvX80</td>\n",
       "      <td>9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Human</td>\n",
       "      <td>22</td>\n",
       "      <td>5</td>\n",
       "      <td>f1144500-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>positive</td>\n",
       "      <td>0.912528</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              topic_id                           message_id  \\\n",
       "timestamp                                                                     \n",
       "2024-01-30 13:27:33+00:00  iDImYGRudHk  9guW_0AAAAE/iDImYGRudHk/iDImYGRudHk   \n",
       "2024-01-30 13:29:10+00:00  cVEoT9zu63M  9guW_0AAAAE/cVEoT9zu63M/cVEoT9zu63M   \n",
       "2024-01-30 13:29:17+00:00  qEfkUgUvX80  9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80   \n",
       "2024-01-30 13:29:17+00:00  qEfkUgUvX80  9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80   \n",
       "\n",
       "                           chat_group  creator_name  creator_email  \\\n",
       "timestamp                                                            \n",
       "2024-01-30 13:27:33+00:00           0             0              0   \n",
       "2024-01-30 13:29:10+00:00           0             1              1   \n",
       "2024-01-30 13:29:17+00:00           0             1              1   \n",
       "2024-01-30 13:29:17+00:00           0             0              0   \n",
       "\n",
       "                          creator_user_type  character_count  word_count  \\\n",
       "timestamp                                                                  \n",
       "2024-01-30 13:27:33+00:00             Human                5           1   \n",
       "2024-01-30 13:29:10+00:00             Human                5           1   \n",
       "2024-01-30 13:29:17+00:00             Human               11           3   \n",
       "2024-01-30 13:29:17+00:00             Human               22           5   \n",
       "\n",
       "                                                           user message_type  \\\n",
       "timestamp                                                                      \n",
       "2024-01-30 13:27:33+00:00  f1144500-2e1d-11ef-abd1-b0dcef010c43     outgoing   \n",
       "2024-01-30 13:29:10+00:00  f1144500-2e1d-11ef-abd1-b0dcef010c43     incoming   \n",
       "2024-01-30 13:29:17+00:00  f1144500-2e1d-11ef-abd1-b0dcef010c43     incoming   \n",
       "2024-01-30 13:29:17+00:00  f1144500-2e1d-11ef-abd1-b0dcef010c43     outgoing   \n",
       "\n",
       "                          sentiment  sentiment_score  \n",
       "timestamp                                             \n",
       "2024-01-30 13:27:33+00:00      none         0.000000  \n",
       "2024-01-30 13:29:10+00:00      none         0.000000  \n",
       "2024-01-30 13:29:17+00:00  positive         0.535310  \n",
       "2024-01-30 13:29:17+00:00  positive         0.912528  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "niimpy.reading.google_takeout.chat(\"test.zip\", sentiment=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Youtube Activity\n",
    "\n",
    "Finally, we have a reader for extracting Youtube watch history data. We do not, by default, return video identifiers, but replace them with numerical IDs. The only available information then is the recorded time, which corresponds to video start time.\n",
    "\n",
    "Importantly, we have no information on how long the user watched a given video, as this is not stored in the TakeOut data. You can deduce whether the user has rewatched a given video, watched multiple videos in a row, or started another video quickly without finishing the previous one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>video_title</th>\n",
       "      <th>channel_title</th>\n",
       "      <th>user</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2024-02-13 08:36:49+02:00</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>f1f6ae72-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-02-13 08:36:05+02:00</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>f1f6ae72-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-02-13 08:35:38+02:00</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>f1f6ae72-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2024-02-13 08:35:03+02:00</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>f1f6ae72-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           video_title  channel_title  \\\n",
       "timestamp                                               \n",
       "2024-02-13 08:36:49+02:00            0              0   \n",
       "2024-02-13 08:36:05+02:00            1              1   \n",
       "2024-02-13 08:35:38+02:00            2              2   \n",
       "2024-02-13 08:35:03+02:00            0              0   \n",
       "\n",
       "                                                           user  \n",
       "timestamp                                                        \n",
       "2024-02-13 08:36:49+02:00  f1f6ae72-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2024-02-13 08:36:05+02:00  f1f6ae72-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2024-02-13 08:35:38+02:00  f1f6ae72-2e1d-11ef-abd1-b0dcef010c43  \n",
       "2024-02-13 08:35:03+02:00  f1f6ae72-2e1d-11ef-abd1-b0dcef010c43  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "niimpy.reading.google_takeout.youtube_watch_history(\"test.zip\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since Google takeout may provide the mailbox as a single uncompressed file, it is also possible to provide it's file path directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:466: UserWarning: Could not parse message timestamp: 2023-12-15 12:19:43+00:00\n",
      "  warnings.warn(f\"Could not parse message timestamp: {received}\")\n",
      "/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:480: UserWarning: Failed to format received time: Sat, 15 DeNot a timec 2023 12:19:43 0000\n",
      "  warnings.warn(f\"Failed to format received time: {received}\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running sentiment analysis on 5 messages.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/5 [00:00<?, ?it/s]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>received</th>\n",
       "      <th>from</th>\n",
       "      <th>to</th>\n",
       "      <th>cc</th>\n",
       "      <th>bcc</th>\n",
       "      <th>message_id</th>\n",
       "      <th>in_reply_to</th>\n",
       "      <th>character_count</th>\n",
       "      <th>word_count</th>\n",
       "      <th>message_type</th>\n",
       "      <th>user</th>\n",
       "      <th>sentiment</th>\n",
       "      <th>sentiment_score</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:19:43+00:00</th>\n",
       "      <td>NaT</td>\n",
       "      <td>0</td>\n",
       "      <td>[1]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>33</td>\n",
       "      <td>6</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>f1f97e90-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>positive</td>\n",
       "      <td>0.993223</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:29:43+00:00</th>\n",
       "      <td>NaT</td>\n",
       "      <td>0</td>\n",
       "      <td>[6, 1]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>31</td>\n",
       "      <td>6</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>f1f97e90-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>negative</td>\n",
       "      <td>0.980209</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:29:43+00:00</th>\n",
       "      <td>NaT</td>\n",
       "      <td>0</td>\n",
       "      <td>[6, 1]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>1</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>28</td>\n",
       "      <td>5</td>\n",
       "      <td>outgoing</td>\n",
       "      <td>f1f97e90-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>negative</td>\n",
       "      <td>0.968588</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:39:43+00:00</th>\n",
       "      <td>2023-12-15 12:19:43+00:00</td>\n",
       "      <td>6</td>\n",
       "      <td>[0]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>30</td>\n",
       "      <td>5</td>\n",
       "      <td>incoming</td>\n",
       "      <td>f1f97e90-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>positive</td>\n",
       "      <td>0.997529</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-15 12:39:43+00:00</th>\n",
       "      <td>Sat, 15 DeNot a timec 2023 12:19:43 0000</td>\n",
       "      <td>6</td>\n",
       "      <td>[0]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>[4]</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>51</td>\n",
       "      <td>7</td>\n",
       "      <td>incoming</td>\n",
       "      <td>f1f97e90-2e1d-11ef-abd1-b0dcef010c43</td>\n",
       "      <td>neutral</td>\n",
       "      <td>0.477333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                           received  from  \\\n",
       "timestamp                                                                   \n",
       "2023-12-15 12:19:43+00:00                                       NaT     0   \n",
       "2023-12-15 12:29:43+00:00                                       NaT     0   \n",
       "2023-12-15 12:29:43+00:00                                       NaT     0   \n",
       "2023-12-15 12:39:43+00:00                 2023-12-15 12:19:43+00:00     6   \n",
       "2023-12-15 12:39:43+00:00  Sat, 15 DeNot a timec 2023 12:19:43 0000     6   \n",
       "\n",
       "                               to   cc  bcc  message_id in_reply_to  \\\n",
       "timestamp                                                             \n",
       "2023-12-15 12:19:43+00:00     [1]   []   []           2        <NA>   \n",
       "2023-12-15 12:29:43+00:00  [6, 1]   []   []           1        <NA>   \n",
       "2023-12-15 12:29:43+00:00  [6, 1]   []   []           1        <NA>   \n",
       "2023-12-15 12:39:43+00:00     [0]  [4]  [4]           3           2   \n",
       "2023-12-15 12:39:43+00:00     [0]  [4]  [4]           3           2   \n",
       "\n",
       "                           character_count  word_count message_type  \\\n",
       "timestamp                                                             \n",
       "2023-12-15 12:19:43+00:00               33           6     outgoing   \n",
       "2023-12-15 12:29:43+00:00               31           6     outgoing   \n",
       "2023-12-15 12:29:43+00:00               28           5     outgoing   \n",
       "2023-12-15 12:39:43+00:00               30           5     incoming   \n",
       "2023-12-15 12:39:43+00:00               51           7     incoming   \n",
       "\n",
       "                                                           user sentiment  \\\n",
       "timestamp                                                                   \n",
       "2023-12-15 12:19:43+00:00  f1f97e90-2e1d-11ef-abd1-b0dcef010c43  positive   \n",
       "2023-12-15 12:29:43+00:00  f1f97e90-2e1d-11ef-abd1-b0dcef010c43  negative   \n",
       "2023-12-15 12:29:43+00:00  f1f97e90-2e1d-11ef-abd1-b0dcef010c43  negative   \n",
       "2023-12-15 12:39:43+00:00  f1f97e90-2e1d-11ef-abd1-b0dcef010c43  positive   \n",
       "2023-12-15 12:39:43+00:00  f1f97e90-2e1d-11ef-abd1-b0dcef010c43   neutral   \n",
       "\n",
       "                           sentiment_score  \n",
       "timestamp                                   \n",
       "2023-12-15 12:19:43+00:00         0.993223  \n",
       "2023-12-15 12:29:43+00:00         0.980209  \n",
       "2023-12-15 12:29:43+00:00         0.968588  \n",
       "2023-12-15 12:39:43+00:00         0.997529  \n",
       "2023-12-15 12:39:43+00:00         0.477333  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = os.path.join(config.GOOGLE_TAKEOUT_DIR, \"Takeout\", \"Mail\", \"All mail Including Spam and Trash.mbox\")\n",
    "niimpy.reading.google_takeout.email_activity(path, sentiment=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### General Notes on Google Takeout\n",
    "\n",
    "Each subject Downloads their Google TakeOut data as a separate zip file. The Zipfile package, which is included in the Python standard, is convenient for reading the data files contained in the zip file. For example, one could read the location data with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>latitudeE7</th>\n",
       "      <th>longitudeE7</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>source</th>\n",
       "      <th>deviceTag</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>activity</th>\n",
       "      <th>locationMetadata</th>\n",
       "      <th>placeId</th>\n",
       "      <th>formFactor</th>\n",
       "      <th>...</th>\n",
       "      <th>deviceDesignation</th>\n",
       "      <th>altitude</th>\n",
       "      <th>verticalAccuracy</th>\n",
       "      <th>platformType</th>\n",
       "      <th>osLevel</th>\n",
       "      <th>serverTimestamp</th>\n",
       "      <th>deviceTimestamp</th>\n",
       "      <th>batteryCharging</th>\n",
       "      <th>velocity</th>\n",
       "      <th>heading</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>359974880</td>\n",
       "      <td>-789221943</td>\n",
       "      <td>25</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>2016-08-12T19:29:43.821Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>359975588</td>\n",
       "      <td>-789225036</td>\n",
       "      <td>21</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>2016-08-12T19:30:49.531Z</td>\n",
       "      <td>[{'activity': [{'type': 'STILL', 'confidence':...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>359975588</td>\n",
       "      <td>-789225036</td>\n",
       "      <td>21</td>\n",
       "      <td>WIFI</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>2016-08-12T19:31:49.531Z</td>\n",
       "      <td>[{'activity': [{'type': 'STILL', 'confidence':...</td>\n",
       "      <td>[{'wifiScan': {'accessPoints': [{'mac': '12410...</td>\n",
       "      <td>ChIJS_5Nmuz1jUYRGYf3QiiZco4</td>\n",
       "      <td>PHONE</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>360008703</td>\n",
       "      <td>-789233433</td>\n",
       "      <td>1500</td>\n",
       "      <td>CELL</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>2016-08-12T21:15:55.295Z</td>\n",
       "      <td>[{'activity': [{'type': 'ON_FOOT', 'confidence...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>359972502</td>\n",
       "      <td>-789239894</td>\n",
       "      <td>8</td>\n",
       "      <td>GPS</td>\n",
       "      <td>-577680260</td>\n",
       "      <td>2016-08-12T21:16:33Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 22 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   latitudeE7  longitudeE7  accuracy source  deviceTag  \\\n",
       "0   359974880   -789221943        25   WIFI -577680260   \n",
       "1   359975588   -789225036        21   WIFI -577680260   \n",
       "2   359975588   -789225036        21   WIFI -577680260   \n",
       "3   360008703   -789233433      1500   CELL -577680260   \n",
       "4   359972502   -789239894         8    GPS -577680260   \n",
       "\n",
       "                  timestamp  \\\n",
       "0  2016-08-12T19:29:43.821Z   \n",
       "1  2016-08-12T19:30:49.531Z   \n",
       "2  2016-08-12T19:31:49.531Z   \n",
       "3  2016-08-12T21:15:55.295Z   \n",
       "4      2016-08-12T21:16:33Z   \n",
       "\n",
       "                                            activity  \\\n",
       "0                                                NaN   \n",
       "1  [{'activity': [{'type': 'STILL', 'confidence':...   \n",
       "2  [{'activity': [{'type': 'STILL', 'confidence':...   \n",
       "3  [{'activity': [{'type': 'ON_FOOT', 'confidence...   \n",
       "4                                                NaN   \n",
       "\n",
       "                                    locationMetadata  \\\n",
       "0                                                NaN   \n",
       "1                                                NaN   \n",
       "2  [{'wifiScan': {'accessPoints': [{'mac': '12410...   \n",
       "3                                                NaN   \n",
       "4                                                NaN   \n",
       "\n",
       "                       placeId formFactor  ... deviceDesignation altitude  \\\n",
       "0                          NaN        NaN  ...               NaN      NaN   \n",
       "1                          NaN        NaN  ...               NaN      NaN   \n",
       "2  ChIJS_5Nmuz1jUYRGYf3QiiZco4      PHONE  ...               NaN      NaN   \n",
       "3                          NaN        NaN  ...               NaN      NaN   \n",
       "4                          NaN        NaN  ...               NaN      NaN   \n",
       "\n",
       "  verticalAccuracy  platformType  osLevel serverTimestamp  deviceTimestamp  \\\n",
       "0              NaN           NaN      NaN             NaN              NaN   \n",
       "1              NaN           NaN      NaN             NaN              NaN   \n",
       "2              NaN           NaN      NaN             NaN              NaN   \n",
       "3              NaN           NaN      NaN             NaN              NaN   \n",
       "4              NaN           NaN      NaN             NaN              NaN   \n",
       "\n",
       "  batteryCharging velocity heading  \n",
       "0             NaN      NaN     NaN  \n",
       "1             NaN      NaN     NaN  \n",
       "2             NaN      NaN     NaN  \n",
       "3             NaN      NaN     NaN  \n",
       "4             NaN      NaN     NaN  \n",
       "\n",
       "[5 rows x 22 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from zipfile import ZipFile\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "zip_file = ZipFile(\"test.zip\")\n",
    "json_data  = zip_file.read(\"Takeout/Location History (Timeline)/Records.json\")\n",
    "json_data = json.loads(json_data)\n",
    "data = pd.json_normalize(json_data[\"locations\"])\n",
    "data = pd.DataFrame(data)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Location data is stored in the json format. Other types of data are stored in various formats and with different files structures. The user must find how each type of data they need is stored and how it can be read in Python."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MHealth\n",
    "\n",
    "We have implemented readers for 3 data types formatted according to the [MHealth schema](https://www.openmhealth.org/documentation/#/schema-docs/schema-library). These are total sleep time, heart rate and geolocation. Other data types may be added as needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>descriptive_statistic</th>\n",
       "      <th>descriptive_statistic_denominator</th>\n",
       "      <th>date</th>\n",
       "      <th>part_of_day</th>\n",
       "      <th>total_sleep_time</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2016-02-06 04:35:00+00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0 days 07:45:00</td>\n",
       "      <td>2016-02-06 04:35:00+00:00</td>\n",
       "      <td>2016-02-06 14:35:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2016-02-05 15:00:00+00:00</th>\n",
       "      <td>average</td>\n",
       "      <td>d</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0 days 07:15:00</td>\n",
       "      <td>2016-02-05 15:00:00+00:00</td>\n",
       "      <td>2016-06-06 15:00:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2013-01-26 07:35:00+00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0 days 03:00:00</td>\n",
       "      <td>2013-01-26 07:35:00+00:00</td>\n",
       "      <td>2013-02-05 07:35:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2013-02-05 00:00:00+00:00</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2013-02-05 00:00:00+00:00</td>\n",
       "      <td>evening</td>\n",
       "      <td>0 days 03:00:00</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          descriptive_statistic  \\\n",
       "timestamp                                         \n",
       "2016-02-06 04:35:00+00:00                   NaN   \n",
       "2016-02-05 15:00:00+00:00               average   \n",
       "2013-01-26 07:35:00+00:00                   NaN   \n",
       "2013-02-05 00:00:00+00:00                   NaN   \n",
       "\n",
       "                          descriptive_statistic_denominator  \\\n",
       "timestamp                                                     \n",
       "2016-02-06 04:35:00+00:00                               NaN   \n",
       "2016-02-05 15:00:00+00:00                                 d   \n",
       "2013-01-26 07:35:00+00:00                               NaN   \n",
       "2013-02-05 00:00:00+00:00                               NaN   \n",
       "\n",
       "                                               date part_of_day  \\\n",
       "timestamp                                                         \n",
       "2016-02-06 04:35:00+00:00                       NaT         NaN   \n",
       "2016-02-05 15:00:00+00:00                       NaT         NaN   \n",
       "2013-01-26 07:35:00+00:00                       NaT         NaN   \n",
       "2013-02-05 00:00:00+00:00 2013-02-05 00:00:00+00:00     evening   \n",
       "\n",
       "                          total_sleep_time                     start  \\\n",
       "timestamp                                                              \n",
       "2016-02-06 04:35:00+00:00  0 days 07:45:00 2016-02-06 04:35:00+00:00   \n",
       "2016-02-05 15:00:00+00:00  0 days 07:15:00 2016-02-05 15:00:00+00:00   \n",
       "2013-01-26 07:35:00+00:00  0 days 03:00:00 2013-01-26 07:35:00+00:00   \n",
       "2013-02-05 00:00:00+00:00  0 days 03:00:00                       NaT   \n",
       "\n",
       "                                                end  \n",
       "timestamp                                            \n",
       "2016-02-06 04:35:00+00:00 2016-02-06 14:35:00+00:00  \n",
       "2016-02-05 15:00:00+00:00 2016-06-06 15:00:00+00:00  \n",
       "2013-01-26 07:35:00+00:00 2013-02-05 07:35:00+00:00  \n",
       "2013-02-05 00:00:00+00:00                       NaT  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reading total sleep time data:\n",
    "filename = config.MHEALTH_TOTAL_SLEEP_TIME_PATH\n",
    "niimpy.reading.mhealth.total_sleep_time_from_file(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>temporal_relationship_to_sleep</th>\n",
       "      <th>heart_rate</th>\n",
       "      <th>effective_time_frame.date_time</th>\n",
       "      <th>descriptive_statistic</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>timestamp</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>NaT</th>\n",
       "      <td>on waking</td>\n",
       "      <td>70</td>\n",
       "      <td>2023-11-20T07:25:00-08:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-20 09:50:00+00:00</th>\n",
       "      <td>on waking</td>\n",
       "      <td>65</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2023-12-20 09:50:00+00:00</td>\n",
       "      <td>2023-12-20 10:00:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2023-12-20 03:50:00+00:00</th>\n",
       "      <td>during sleep</td>\n",
       "      <td>35</td>\n",
       "      <td>NaN</td>\n",
       "      <td>average</td>\n",
       "      <td>2023-12-20 03:50:00+00:00</td>\n",
       "      <td>2023-12-20 04:00:00+00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          temporal_relationship_to_sleep  heart_rate  \\\n",
       "timestamp                                                              \n",
       "NaT                                            on waking          70   \n",
       "2023-12-20 09:50:00+00:00                      on waking          65   \n",
       "2023-12-20 03:50:00+00:00                   during sleep          35   \n",
       "\n",
       "                          effective_time_frame.date_time  \\\n",
       "timestamp                                                  \n",
       "NaT                            2023-11-20T07:25:00-08:00   \n",
       "2023-12-20 09:50:00+00:00                            NaN   \n",
       "2023-12-20 03:50:00+00:00                            NaN   \n",
       "\n",
       "                          descriptive_statistic                     start  \\\n",
       "timestamp                                                                   \n",
       "NaT                                         NaN                       NaT   \n",
       "2023-12-20 09:50:00+00:00                   NaN 2023-12-20 09:50:00+00:00   \n",
       "2023-12-20 03:50:00+00:00               average 2023-12-20 03:50:00+00:00   \n",
       "\n",
       "                                                end  \n",
       "timestamp                                            \n",
       "NaT                                             NaT  \n",
       "2023-12-20 09:50:00+00:00 2023-12-20 10:00:00+00:00  \n",
       "2023-12-20 03:50:00+00:00 2023-12-20 04:00:00+00:00  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reading heart rate data:\n",
    "filename = config.MHEALTH_HEART_RATE_PATH\n",
    "niimpy.reading.mhealth.heart_rate_from_file(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>positioning_system</th>\n",
       "      <th>latitude</th>\n",
       "      <th>latitude.unit</th>\n",
       "      <th>longitude</th>\n",
       "      <th>longitude.unit</th>\n",
       "      <th>effective_time_frame.time_interval.start_date_time</th>\n",
       "      <th>effective_time_frame.time_interval.end_date_time</th>\n",
       "      <th>elevation.value</th>\n",
       "      <th>elevation.unit</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>GPS</td>\n",
       "      <td>60.1867</td>\n",
       "      <td>deg</td>\n",
       "      <td>24.8283</td>\n",
       "      <td>deg</td>\n",
       "      <td>2016-02-05T20:35:00-08:00</td>\n",
       "      <td>2016-02-06T06:35:00-08:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>GPS</td>\n",
       "      <td>60.1867</td>\n",
       "      <td>deg</td>\n",
       "      <td>24.8283</td>\n",
       "      <td>deg</td>\n",
       "      <td>2016-02-05T20:35:00-08:00</td>\n",
       "      <td>2016-02-06T06:35:00-08:00</td>\n",
       "      <td>20.4</td>\n",
       "      <td>m</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  positioning_system  latitude latitude.unit  longitude longitude.unit  \\\n",
       "0                GPS   60.1867           deg    24.8283            deg   \n",
       "1                GPS   60.1867           deg    24.8283            deg   \n",
       "\n",
       "  effective_time_frame.time_interval.start_date_time  \\\n",
       "0                          2016-02-05T20:35:00-08:00   \n",
       "1                          2016-02-05T20:35:00-08:00   \n",
       "\n",
       "  effective_time_frame.time_interval.end_date_time  elevation.value  \\\n",
       "0                        2016-02-06T06:35:00-08:00              NaN   \n",
       "1                        2016-02-06T06:35:00-08:00             20.4   \n",
       "\n",
       "  elevation.unit  \n",
       "0            NaN  \n",
       "1              m  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reading geolocation data:\n",
    "filename = config.MHEALTH_GEOLOCATION_PATH\n",
    "niimpy.reading.mhealth.geolocation_from_file(filename)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other formats\n",
    "\n",
    "You can add readers for any types of formats which you can convert into a Pandas dataframe (so basically anything).  For examples of readers, see `niimpy/reading/read.py`.  Apply the function `niimpy.preprocessing.util.df_normalize` in order to apply some standardizations to get the standard Niimpy format."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}