Reading

In principle, Niimpy can deal with files of any format: you only need to convert them to a DataFrame. Still, it is very useful to have some common formats, so we provide default readers for the following standard formats:

CSV files are simple to create and understand, but everything must be loaded into memory in order to work with them.
Google Takeout provides a large selection of data in different formats. We provide readers for the most commonly used data types.
MHealth is a common format for health data.
sqlite3 databases, which require sqlite3 to read, provide more power for filtering and automatic processing without reading everything into memory.

DataFrame format (in-memory)

In memory, data is stored in a pandas DataFrame. This is an ordinary DataFrame with some standardized columns (see the schema) and a DatetimeIndex as its index.
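
As a minimal sketch of what such a DataFrame can look like, the following builds one with plain pandas. The column names ("time", "user", "battery_level") and the values are made up for illustration and are not taken from the Niimpy schema.

```python
import pandas as pd

# Hypothetical raw observations; column names and values are illustrative only.
raw = pd.DataFrame({
    "time": [1546300800, 1546300860, 1546300920],  # Unix seconds (UTC)
    "user": ["u1", "u1", "u2"],
    "battery_level": [74, 73, 90],
})

# Convert the Unix times to timezone-aware timestamps and use them as the index.
raw["timestamp"] = (
    pd.to_datetime(raw["time"], unit="s", utc=True).dt.tz_convert("Europe/Helsinki")
)
df = raw.set_index("timestamp")

print(type(df.index).__name__)
```

The index is now a timezone-aware DatetimeIndex, so the usual pandas time-based selection and resampling tools apply.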

CSV files

CSV files should have a header that lists the column names and generally be readable by pandas.read_csv.
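
As a rough illustration of the expected layout, the sketch below parses a small CSV text with pandas.read_csv. The file contents and column names are hypothetical; only the presence of a header row is taken from the description above.

```python
import io
import pandas as pd

# Hypothetical file contents: a header row, then one row per observation.
csv_text = """time,user,battery_level
1546300800,u1,74
1546300860,u1,73
"""

# The header row becomes the column names of the DataFrame.
csv_df = pd.read_csv(io.StringIO(csv_text))
print(list(csv_df.columns))
```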

Reading these can be done with niimpy.read_csv:

[1]:

import os
import niimpy
import niimpy.config as config

# Read the battery data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')

sqlite3 databases

For the purposes of niimpy, sqlite3 databases can generally be seen as supercharged CSV files.

A single database file could contain multiple datasets within it, thus when reading them a table name must be specified.

You can read an entire table into memory using niimpy.read_sqlite:

[2]:

# Read the sqlite3 data
df = niimpy.read_sqlite(config.SQLITE_SINGLEUSER_PATH, table="AwareScreen", tz='Europe/Helsinki')

You can list the tables within a database using niimpy.reading.read.read_sqlite_tables:

[3]:

niimpy.reading.read.read_sqlite_tables(config.SQLITE_SINGLEUSER_PATH)

[3]:

{'AwareScreen'}

sqlite3 files are highly recommended as a data storage format, since many common exploration operations can be done within the database itself without reading all of the data into memory or writing an iterator. However, the interface is more difficult to use. Before 2021-07, Niimpy used this as its primary interface, but it has since been de-emphasized. You can read more in the database section, but this is only recommended if you need efficiency when working with massive amounts of data.
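
To illustrate what working inside the database can look like, the sketch below uses only the standard-library sqlite3 module and pandas.read_sql_query, not any Niimpy function. The in-memory database, its column names, and its rows are made up for the example.

```python
import sqlite3
import pandas as pd

# Build a small in-memory database so the example is self-contained.
# The column names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AwareScreen (time REAL, user TEXT, screen_status INTEGER)")
conn.executemany(
    "INSERT INTO AwareScreen VALUES (?, ?, ?)",
    [(1.0, "u1", 0), (2.0, "u1", 1), (3.0, "u2", 1), (4.0, "u2", 0)],
)

# List the tables stored in the database.
tables = {row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}

# Count rows per user inside the database; only the summary is loaded into pandas.
counts = pd.read_sql_query(
    "SELECT user, COUNT(*) AS n FROM AwareScreen GROUP BY user ORDER BY user", conn
)

# Load a single user's rows instead of the whole table.
u1 = pd.read_sql_query("SELECT * FROM AwareScreen WHERE user = ?", conn, params=("u1",))
conn.close()
```

Filtering and aggregating in SQL keeps memory use proportional to the result, not to the size of the table.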

Google Takeout

Google Takeout contains many different types of data, and new types are added as Google creates services or changes data storage methods. Readers are currently available for location data, emails, chat messages, YouTube watch history, and activity data from the Fit app. For other data types, the user needs to manually convert them into a Niimpy-compatible pandas DataFrame.

[4]:

# Data downloaded from Google Takeout is compressed as a zip archive to conserve disk space. To
# demonstrate reading from a zip file, we first compress our example data into the zip format.
import zipfile

# The context manager closes the archive even if writing fails
with zipfile.ZipFile("test.zip", mode="w") as test_zip:
    for dirpath, dirs, files in os.walk(config.GOOGLE_TAKEOUT_DIR):
        for f in files:
            filename = os.path.join(dirpath, f)
            # Store each file under its path relative to the Takeout directory
            filename_in_zip = os.path.relpath(filename, config.GOOGLE_TAKEOUT_DIR)
            test_zip.write(filename, filename_in_zip)

Location

[5]:

# Next we read location data from the zip file.
import niimpy
import niimpy.config as config
import niimpy.preprocessing.location as nilo

data = niimpy.reading.google_takeout.location_history("test.zip")
data = nilo.filter_location(
    data,
    latitude_column = "latitude",
    longitude_column = "longitude",
    remove_disabled=False, remove_network=False, remove_zeros=True
)
data

[5]:

	accuracy	source	device	placeid	formfactor	altitude	verticalaccuracy	platformtype	servertimestamp	devicetimestamp	batterycharging	velocity	heading	latitude	longitude	inferred_latitude	inferred_longitude	activity_type	activity_inference_confidence	user
timestamp
2016-08-12 19:29:43.821000+00:00	25	WIFI	-577680260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35.997488	-78.922194	NaN	NaN	NaN	NaN	f1066868-2e1d-11ef-abd1-b0dcef010c43
2016-08-12 19:30:49.531000+00:00	21	WIFI	-577680260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35.997559	-78.922504	NaN	NaN	STILL	62.0	f1066868-2e1d-11ef-abd1-b0dcef010c43
2016-08-12 19:31:49.531000+00:00	21	WIFI	-577680260	ChIJS_5Nmuz1jUYRGYf3QiiZco4	PHONE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35.997559	-78.922504	60.187135	24.824478	STILL	62.0	f1066868-2e1d-11ef-abd1-b0dcef010c43
2016-08-12 21:15:55.295000+00:00	1500	CELL	-577680260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	36.000870	-78.923343	NaN	NaN	ON_FOOT	54.0	f1066868-2e1d-11ef-abd1-b0dcef010c43
2016-08-12 21:16:33+00:00	8	GPS	-577680260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35.997250	-78.923989	NaN	NaN	NaN	NaN	f1066868-2e1d-11ef-abd1-b0dcef010c43
2016-08-12 21:16:48+00:00	3	GPS	-577680260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35.997236	-78.924124	NaN	NaN	NaN	NaN	f1066868-2e1d-11ef-abd1-b0dcef010c43
2023-11-21 11:29:21.730000+00:00	13	WIFI	1832214273	ChIJw1WKQev1jUYRCdZmYR-HCiI	PHONE	28.0	2.0	ANDROID	2023-11-21T11:29:24.747Z	2023-11-21T11:29:24.350Z	False			60.186818	24.821288	60.186816	24.821288	NaN	NaN	f1066868-2e1d-11ef-abd1-b0dcef010c43

Activity

Activity data is read similarly. The data contains many columns with missing data, so in order to use the step count data, for example, we must set the NaN values to 0.

[6]:

data = niimpy.reading.google_takeout.activity("test.zip")
data.loc[data["step_count"].isna(), "step_count"] = 0
data[["calories_(kcal)", "step_count"]]

[6]:

	calories_(kcal)	step_count
timestamp
2023-11-20 00:00:00+02:00	1752.250027	0.000000
2023-11-21 00:00:00+02:00	1996.456746	89.900002

Google Fit Takeout data can contain more detailed information. To access this data, we first list all data types stored in Google Fit data.

[15]:

niimpy.reading.google_takeout.fit_list_data("test.zip")

[15]:

	filename	derived	content	source	source type
0	raw_com.google.step_count.delta_fi.polar.polar...	raw	step_count.delta	fi.test.test	step count
1	derived_com.google.step_count.delta_com.google...	derived	step_count.delta	com.google.android.fit	top_level
2	derived_com.google.respiratory_rate_com.google...	derived	respiratory_rate	com.google.android.gms	merge_respiratory_rate
3	raw_com.google.sleep.segment_com.urbandroid.sl...	raw	sleep.segment	com.urbandroid.sleep	saa-generic
4	raw_com.google.heart_rate.summary_fi.polar.pol...	raw	heart_rate.summary	fi.test.testtest
5	raw_com.google.calories.expended_fi.polar.pola...	raw	calories.expended	fi.manual.manualtest
6	derived_com.google.sleep.segment_com.google.an...	derived	sleep.segment	com.google.android.gms	merged
7	raw_com.google.nutrition_com.fourtechnologies....	raw	nutrition	com.manual.test
8	raw_com.google.heart_rate.bpm_nl.appyhapps.hea...	raw	heart_rate.bpm	fi.test.test	Test - heart rate
9	raw_com.google.sleep.segment_nl.appyhapps.heal...	raw	sleep.segment	nl.test.healthsync	HealthSync_sleep_by_Health_Sync_1715724879000
10	raw_com.google.heart_minutes_com.google.androi...	raw	heart_minutes	com.google.android.apps.fitness	user_input

We have two types of step count data. The original raw data is in “raw_com.google.step_count.delta_fi.polar.polar.json”. You can read this data using fit_read_data.

[17]:

niimpy.reading.google_takeout.fit_read_data("test.zip", "raw_com.google.step_count.delta_fi.polar.polar.json")

[17]:

	measurement_index	id	value	end_time	modified_time	datatype
timestamp
2024-04-30 17:59:58.999	0	0	8	2024-04-30 19:17:00	2024-05-01 03:15:57.130	step_count.delta
2024-04-30 17:59:58.999	1	0	8	2024-04-30 19:17:30	2024-04-30 20:02:14.506	step_count.delta

Email and Chat

The google_takeout.email_activity and google_takeout.chat functions read and process all emails in the Gmail mailbox and all Google Chat messages, respectively. They return a DataFrame containing metadata and statistics for each message. Email addresses, email IDs and names are replaced by numerical indexes.

The email files can be large and processing them could take some time. You can also include sentiment analysis of each email using the sentiment parameter. For this, we recommend using a system with a GPU.

[7]:

niimpy.reading.google_takeout.email_activity("test.zip")

/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:466: UserWarning: Could not parse message timestamp: 2023-12-15 12:19:43+00:00
  warnings.warn(f"Could not parse message timestamp: {received}")
/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:480: UserWarning: Failed to format received time: Sat, 15 DeNot a timec 2023 12:19:43 0000
  warnings.warn(f"Failed to format received time: {received}")

[7]:

	received	from	to	cc	bcc	message_id	in_reply_to	character_count	word_count	message_type	user
timestamp
2023-12-15 12:19:43+00:00	NaT	0	[2]	[]	[]	2	<NA>	33	6	outgoing	f10e4d26-2e1d-11ef-abd1-b0dcef010c43
2023-12-15 12:29:43+00:00	NaT	0	[6, 2]	[]	[]	1	<NA>	31	6	outgoing	f10e4d26-2e1d-11ef-abd1-b0dcef010c43
2023-12-15 12:29:43+00:00	NaT	0	[6, 2]	[]	[]	1	<NA>	28	5	outgoing	f10e4d26-2e1d-11ef-abd1-b0dcef010c43
2023-12-15 12:39:43+00:00	2023-12-15 12:19:43+00:00	6	[0]	[4]	[4]	3	2	30	5	incoming	f10e4d26-2e1d-11ef-abd1-b0dcef010c43
2023-12-15 12:39:43+00:00	Sat, 15 DeNot a timec 2023 12:19:43 0000	6	[0]	[4]	[4]	3	2	51	7	incoming	f10e4d26-2e1d-11ef-abd1-b0dcef010c43

Google chat data is read similarly.

Additionally, you can run sentiment analysis on email and chat messages using the sentiment option. To run sentiment analysis, you need to install the optional dependency by running:

pip install niimpy[sentiment]

[8]:

niimpy.reading.google_takeout.chat("test.zip", sentiment=True)

/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:648: UserWarning: Multiple user emails found. Using the first one.
  warnings.warn("Multiple user emails found. Using the first one.")
  0%|          | 0/4 [00:01<?, ?it/s]

[8]:

	topic_id	message_id	chat_group	creator_name	creator_email	creator_user_type	character_count	word_count	user	message_type	sentiment	sentiment_score
timestamp
2024-01-30 13:27:33+00:00	iDImYGRudHk	9guW_0AAAAE/iDImYGRudHk/iDImYGRudHk	0	0	0	Human	5	1	f1144500-2e1d-11ef-abd1-b0dcef010c43	outgoing	none	0.000000
2024-01-30 13:29:10+00:00	cVEoT9zu63M	9guW_0AAAAE/cVEoT9zu63M/cVEoT9zu63M	0	1	1	Human	5	1	f1144500-2e1d-11ef-abd1-b0dcef010c43	incoming	none	0.000000
2024-01-30 13:29:17+00:00	qEfkUgUvX80	9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80	0	1	1	Human	11	3	f1144500-2e1d-11ef-abd1-b0dcef010c43	incoming	positive	0.535310
2024-01-30 13:29:17+00:00	qEfkUgUvX80	9guW_0AAAAE/qEfkUgUvX80/qEfkUgUvX80	0	0	0	Human	22	5	f1144500-2e1d-11ef-abd1-b0dcef010c43	outgoing	positive	0.912528

YouTube Activity

Finally, we have a reader for extracting YouTube watch history data. By default, we do not return video titles or channel names, but replace them with numerical IDs. The only other information available is the recorded time, which corresponds to the time the video was started.

Importantly, we have no information on how long the user watched a given video, as this is not stored in the Takeout data. You can, however, deduce whether the user has rewatched a given video, watched multiple videos in a row, or started another video quickly without finishing the previous one.

[9]:

niimpy.reading.google_takeout.youtube_watch_history("test.zip")

[9]:

	video_title	channel_title	user
timestamp
2024-02-13 08:36:49+02:00	0	0	f1f6ae72-2e1d-11ef-abd1-b0dcef010c43
2024-02-13 08:36:05+02:00	1	1	f1f6ae72-2e1d-11ef-abd1-b0dcef010c43
2024-02-13 08:35:38+02:00	2	2	f1f6ae72-2e1d-11ef-abd1-b0dcef010c43
2024-02-13 08:35:03+02:00	0	0	f1f6ae72-2e1d-11ef-abd1-b0dcef010c43
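
As a sketch of how such deductions could be made, the following recreates a table shaped like the output above and derives rewatches and quick switches with plain pandas. The 30-second threshold and the new column names ("time_to_next", "rewatch", "quick_switch") are assumptions chosen for illustration, not part of Niimpy.

```python
import pandas as pd

# Watch history shaped like the reader output: numerical video IDs indexed
# by the time each video was started.
history = pd.DataFrame(
    {"video_title": [0, 1, 2, 0]},
    index=pd.to_datetime(
        ["2024-02-13 08:36:49", "2024-02-13 08:36:05",
         "2024-02-13 08:35:38", "2024-02-13 08:35:03"]
    ).tz_localize("Europe/Helsinki"),
).sort_index()

# Time from each video start to the next video start (NaT for the last row).
history["time_to_next"] = history.index.to_series().diff().shift(-1)

# A video counts as a rewatch if the same ID appeared earlier in the history.
history["rewatch"] = history.duplicated("video_title")

# Assumed threshold: another video started within 30 seconds.
history["quick_switch"] = history["time_to_next"] < pd.Timedelta(seconds=30)

print(history)
```

The gap between starts is only an upper bound on watch time, since the data does not record when playback stopped.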

Since Google Takeout may provide the mailbox as a single uncompressed file, it is also possible to pass its file path directly to email_activity.

[10]:

path = os.path.join(config.GOOGLE_TAKEOUT_DIR, "Takeout", "Mail", "All mail Including Spam and Trash.mbox")
niimpy.reading.google_takeout.email_activity(path, sentiment=True)

/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:466: UserWarning: Could not parse message timestamp: 2023-12-15 12:19:43+00:00
  warnings.warn(f"Could not parse message timestamp: {received}")
/u/24/rantahj1/unix/src/niimpy/niimpy/reading/google_takeout.py:480: UserWarning: Failed to format received time: Sat, 15 DeNot a timec 2023 12:19:43 0000
  warnings.warn(f"Failed to format received time: {received}")

Running sentiment analysis on 5 messages.

  0%|          | 0/5 [00:00<?, ?it/s]

[10]:

	received	from	to	cc	bcc	message_id	in_reply_to	character_count	word_count	message_type	user	sentiment	sentiment_score
timestamp
2023-12-15 12:19:43+00:00	NaT	0	[1]	[]	[]	2	<NA>	33	6	outgoing	f1f97e90-2e1d-11ef-abd1-b0dcef010c43	positive	0.993223
2023-12-15 12:29:43+00:00	NaT	0	[6, 1]	[]	[]	1	<NA>	31	6	outgoing	f1f97e90-2e1d-11ef-abd1-b0dcef010c43	negative	0.980209
2023-12-15 12:29:43+00:00	NaT	0	[6, 1]	[]	[]	1	<NA>	28	5	outgoing	f1f97e90-2e1d-11ef-abd1-b0dcef010c43	negative	0.968588
2023-12-15 12:39:43+00:00	2023-12-15 12:19:43+00:00	6	[0]	[4]	[4]	3	2	30	5	incoming	f1f97e90-2e1d-11ef-abd1-b0dcef010c43	positive	0.997529
2023-12-15 12:39:43+00:00	Sat, 15 DeNot a timec 2023 12:19:43 0000	6	[0]	[4]	[4]	3	2	51	7	incoming	f1f97e90-2e1d-11ef-abd1-b0dcef010c43	neutral	0.477333

General Notes on Google Takeout

Each subject downloads their Google Takeout data as a separate zip file. The zipfile module, which is included in the Python standard library, is convenient for reading the data files contained in the zip file. For example, one could read the location data with the following code:

[11]:

from zipfile import ZipFile
import json
import pandas as pd

# Open the archive and parse the location records
with ZipFile("test.zip") as zip_file:
    json_data = json.loads(zip_file.read("Takeout/Location History (Timeline)/Records.json"))

# json_normalize already returns a DataFrame
data = pd.json_normalize(json_data["locations"])
data.head()

[11]:

	latitudeE7	longitudeE7	accuracy	source	deviceTag	timestamp	activity	locationMetadata	placeId	formFactor	...	deviceDesignation	altitude	verticalAccuracy	platformType	osLevel	serverTimestamp	deviceTimestamp	batteryCharging	velocity	heading
0	359974880	-789221943	25	WIFI	-577680260	2016-08-12T19:29:43.821Z	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	359975588	-789225036	21	WIFI	-577680260	2016-08-12T19:30:49.531Z	[{'activity': [{'type': 'STILL', 'confidence':...	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	359975588	-789225036	21	WIFI	-577680260	2016-08-12T19:31:49.531Z	[{'activity': [{'type': 'STILL', 'confidence':...	[{'wifiScan': {'accessPoints': [{'mac': '12410...	ChIJS_5Nmuz1jUYRGYf3QiiZco4	PHONE	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	360008703	-789233433	1500	CELL	-577680260	2016-08-12T21:15:55.295Z	[{'activity': [{'type': 'ON_FOOT', 'confidence...	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	359972502	-789239894	8	GPS	-577680260	2016-08-12T21:16:33Z	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 22 columns

Location data is stored in the JSON format. Other types of data are stored in various formats and with different file structures. The user must find out how each type of data they need is stored and how it can be read in Python.
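
The raw records still need some conversion before they resemble the output of the location reader. As a minimal sketch, the following turns the E7 coordinates (degrees multiplied by 10^7) into decimal degrees and parses the timestamps, using two records copied from the table above. It is plain pandas and not the code Niimpy itself uses.

```python
import pandas as pd

# Two raw records copied from the table above.
records = pd.DataFrame({
    "latitudeE7": [359974880, 359975588],
    "longitudeE7": [-789221943, -789225036],
    "timestamp": ["2016-08-12T19:29:43.821Z", "2016-08-12T19:30:49.531Z"],
})

# E7 values are degrees multiplied by 10**7.
records["latitude"] = records["latitudeE7"] / 1e7
records["longitude"] = records["longitudeE7"] / 1e7

# Parse the ISO 8601 timestamps and use them as a timezone-aware index.
records["timestamp"] = pd.to_datetime(records["timestamp"], utc=True)
records = records.set_index("timestamp")

print(records[["latitude", "longitude"]])
```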

MHealth

We have implemented readers for three data types formatted according to the MHealth schema: total sleep time, heart rate and geolocation. Other data types may be added as needed.

[12]:

# Reading total sleep time data:
filename = config.MHEALTH_TOTAL_SLEEP_TIME_PATH
niimpy.reading.mhealth.total_sleep_time_from_file(filename)

[12]:

	descriptive_statistic	descriptive_statistic_denominator	date	part_of_day	total_sleep_time	start	end
timestamp
2016-02-06 04:35:00+00:00	NaN	NaN	NaT	NaN	0 days 07:45:00	2016-02-06 04:35:00+00:00	2016-02-06 14:35:00+00:00
2016-02-05 15:00:00+00:00	average	d	NaT	NaN	0 days 07:15:00	2016-02-05 15:00:00+00:00	2016-06-06 15:00:00+00:00
2013-01-26 07:35:00+00:00	NaN	NaN	NaT	NaN	0 days 03:00:00	2013-01-26 07:35:00+00:00	2013-02-05 07:35:00+00:00
2013-02-05 00:00:00+00:00	NaN	NaN	2013-02-05 00:00:00+00:00	evening	0 days 03:00:00	NaT	NaT

[13]:

# Reading heart rate data:
filename = config.MHEALTH_HEART_RATE_PATH
niimpy.reading.mhealth.heart_rate_from_file(filename)

[13]:

	temporal_relationship_to_sleep	heart_rate	effective_time_frame.date_time	descriptive_statistic	start	end
timestamp
NaT	on waking	70	2023-11-20T07:25:00-08:00	NaN	NaT	NaT
2023-12-20 09:50:00+00:00	on waking	65	NaN	NaN	2023-12-20 09:50:00+00:00	2023-12-20 10:00:00+00:00
2023-12-20 03:50:00+00:00	during sleep	35	NaN	average	2023-12-20 03:50:00+00:00	2023-12-20 04:00:00+00:00

[14]:

# Reading geolocation data:
filename = config.MHEALTH_GEOLOCATION_PATH
niimpy.reading.mhealth.geolocation_from_file(filename)

[14]:

	positioning_system	latitude	latitude.unit	longitude	longitude.unit	effective_time_frame.time_interval.start_date_time	effective_time_frame.time_interval.end_date_time	elevation.value	elevation.unit
0	GPS	60.1867	deg	24.8283	deg	2016-02-05T20:35:00-08:00	2016-02-06T06:35:00-08:00	NaN	NaN
1	GPS	60.1867	deg	24.8283	deg	2016-02-05T20:35:00-08:00	2016-02-06T06:35:00-08:00	20.4	m

Other formats

You can add readers for any format that you can convert into a pandas DataFrame (so basically anything). For examples of readers, see niimpy/reading/read.py. Apply the function niimpy.preprocessing.util.df_normalize to standardize the result into the standard Niimpy format.
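
A custom reader can be quite small. The sketch below is a hypothetical reader for a JSON-lines file where each line is an object with a Unix "time" field. The function name read_jsonlines, the field names, and the sample data are all assumptions made for this example. The resulting DataFrame could then be passed to niimpy.preprocessing.util.df_normalize, which is not called here.

```python
import io
import json
import pandas as pd

def read_jsonlines(file, tz="Europe/Helsinki"):
    """Hypothetical reader: one JSON object per line, with a Unix 'time' field."""
    rows = [json.loads(line) for line in file if line.strip()]
    df = pd.DataFrame(rows)
    # Build a timezone-aware timestamp column and use it as the index.
    df["timestamp"] = pd.to_datetime(df["time"], unit="s", utc=True).dt.tz_convert(tz)
    return df.set_index("timestamp")

# Usage with made-up data held in memory instead of a real file.
sample = io.StringIO(
    '{"time": 1546300800, "user": "u1", "value": 1}\n'
    '{"time": 1546300860, "user": "u1", "value": 2}\n'
)
custom_df = read_jsonlines(sample)
print(custom_df)
```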