Data schema
This page documents the expected data schema of Niimpy. This does not extend to the contents of data from sensors (yet), but relates to the metadata applicable to all sensors.
By using a standardized schema (mainly column names), we can promote interoperability of various tools.
Format
Data is in a tabular (relational) format. A row is an observation, and columns are properties of observations. (At this level of abstraction, an “observation” may be one sensor observation, or some data which contains a package of multiple observations.)
In Niimpy, this is internally stored and handled as a pandas.DataFrame. The schema naturally maps to the columns/rows of the DataFrames.
The on-disk format is currently irrelevant, as long as the producers can create a DataFrame of the necessary format. Currently, we provide readers for sqlite3 and csv. Other standards may be implemented later.
Standard columns in DataFrames
By having standard columns, we can create portable functions that easily operate on diverse data types.
The DataFrame index should be a pandas.DatetimeIndex.
user: opaque identifier for the user. Often a string or integer.
device: unique identifier for a user’s device (not the device type). For example, a user could have multiple phones, and each would have a separate device identifier.
time: timestamp of the observation, in unixtime (seconds since 00:00 UTC on 1970-01-01), stored as an integer. Unixtime identifies an instant of time unambiguously anywhere on Earth; to get local time, it is combined with a timezone.
    In on-disk formats, time is considered the master timestamp, and many other time-based properties are computed from it (though you could produce your own DataFrames other ways). In some of the standard formats (CSV/sqlite3), when a file is read, this integer column is automatically converted to the datetime column below and the DataFrame index.
datetime: a DateTime-compatible object, such as a numpy.datetime64 object in pandas, used only in in-memory representations (not usually written to portable save files). This should be a timezone-aware object, and the data loader handles the timezone conversion. It is automatically added to DataFrames when loaded.
    It is the responsibility of each loader (or preprocessor) to add this column to the in-memory representation by converting the time column to this format. This happens automatically with readers included in niimpy.
timezone: timezone in some format. Not yet used, to be decided.
For questionnaire data:
id: a question identifier. String, should be of the form QUESTIONNAIRE_QUESTION, for example PHQ9_01. The common prefix is used to group questions of the same series.
answer: the answer to the question. Opaque identifier.
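The conversion from the integer time column to a timezone-aware datetime column and index can be sketched with plain pandas. This is a minimal illustration, not Niimpy's own loader code: the function name add_datetime, the sample values, and the Europe/Helsinki timezone are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw data following the schema: `time` is unixtime in integer seconds.
raw = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "device": ["d1", "d1", "d2"],
    "time": [1617969600, 1617973200, 1617976800],
})

def add_datetime(df, tz="Europe/Helsinki"):
    """Return a copy with a timezone-aware `datetime` column and DatetimeIndex.

    Hypothetical helper for illustration; Niimpy's readers do this step themselves.
    """
    out = df.copy()
    # Interpret the integers as seconds since the epoch (UTC), then convert
    # to the requested timezone so the values are timezone-aware.
    out["datetime"] = pd.to_datetime(out["time"], unit="s", utc=True).dt.tz_convert(tz)
    # Use the same values as the index, keeping `datetime` as a column too.
    out = out.set_index("datetime", drop=False)
    # Clear the index name so it does not clash with the column name.
    out.index.name = None
    return out

df = add_datetime(raw)
```

The original time column is left untouched, so the master timestamp stays available even after the in-memory columns have been added.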
Sensor-specific schemas are defined elsewhere. Columns which are not defined here are allowed and are considered part of the sensor-specific data. Most APIs should pass unknown columns through unchanged, for handling in a later layer (sensor analysis).
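As a rough sketch of what passing unknown columns through can look like, the function below selects rows using only a standard column and returns every other column unchanged. The names STANDARD_COLUMNS, split_columns, and filter_user, and the battery_level column, are hypothetical and are not part of the Niimpy API.

```python
import pandas as pd

# Column names taken from the schema above; the set itself is illustrative.
STANDARD_COLUMNS = {"user", "device", "time", "datetime", "timezone"}

def split_columns(df):
    """Return (standard, extra) column name lists, preserving column order."""
    standard = [c for c in df.columns if c in STANDARD_COLUMNS]
    extra = [c for c in df.columns if c not in STANDARD_COLUMNS]
    return standard, extra

def filter_user(df, user):
    """Select one user's rows using only the standard `user` column.

    Columns the function does not know about are returned untouched.
    """
    return df[df["user"] == user]

# Hypothetical sensor data with one non-standard column.
data = pd.DataFrame({
    "user": ["u1", "u2", "u1"],
    "time": [1617969600, 1617973200, 1617976800],
    "battery_level": [87, 55, 85],
})

standard, extra = split_columns(data)
u1 = filter_user(data, "u1")
```

Because filter_user never names battery_level, it keeps working if a sensor adds or renames its own columns.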
Other standard columns in Niimpy
These are not part of the primary schema, but are standard in Niimpy.
day: e.g. 2021-04-09 (str)
hour: hour of day, e.g. 15 (int)
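Both columns can be derived from a timezone-aware DatetimeIndex. The sketch below uses plain pandas and assumes the values are taken in local time; the sample data and the Europe/Helsinki timezone are made up, and Niimpy's own functions may compute these columns differently.

```python
import pandas as pd

# Hypothetical frame with a timezone-aware DatetimeIndex, as the schema expects.
index = pd.to_datetime([1617969600, 1618036200], unit="s", utc=True)
index = index.tz_convert("Europe/Helsinki")
df = pd.DataFrame({"user": ["u1", "u1"]}, index=index)

# `day` as a string such as "2021-04-09", taken in the index's local timezone.
df["day"] = df.index.strftime("%Y-%m-%d")
# `hour` as an integer hour of day (0-23), also in local time.
df["hour"] = df.index.hour
```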
Standard columns in on-disk formats
For the most part, this maps directly to the columns you see above.
An on-disk format should have a time column (unixtime, integer), plus whatever else is needed for that particular sensor, based on the above.
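As an illustration, a minimal sqlite3 table that follows this layout could look like the sketch below. The table name battery and the battery_level column are invented for the example, and the data is read back with plain pandas rather than with Niimpy's own reader.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE battery ('
    'time INTEGER, "user" TEXT, device TEXT, battery_level INTEGER)'
)
conn.executemany(
    "INSERT INTO battery VALUES (?, ?, ?, ?)",
    [
        (1617969600, "u1", "d1", 87),
        (1617973200, "u1", "d1", 85),
    ],
)
conn.commit()

# Only the integer `time` column is stored; `datetime` is added after loading.
stored = pd.read_sql_query("SELECT * FROM battery ORDER BY time", conn)
conn.close()
```

Storing only the integer time keeps the file portable, and the in-memory datetime column can be rebuilt by whichever loader reads it.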