Location Data

Introduction

GPS location data contain rich information about people’s behavioral and mobility patterns. However, working with such data is a challenging task since there exists a lot of noise and missingness. Also, designing relevant features to gain knowledge about the mobility pattern of subjects is a crucial task.

Location data is expected to have the following columns (column names can be different, but in that case they must be provided as parameters):

user: Subject ID
device: Device ID
latitude: Latitude as a floating point number
longitude: Longitude as a floating point number

Optional columns include:

speed: Speed measured at the location

Niimpy provides these main functions to clean, downsample, and extract features from GPS location data:

niimpy.preprocessing.location.filter_location: removes low-quality location data points
niimpy.util.aggregate: downsamples data points to reduce noise
niimpy.preprocessing.location.extract_features_location: feature extraction from location data

In the following, we go through analysing a subset of location data provided in StudentLife dataset.

Read data

[1]:

import niimpy
from niimpy import config
import niimpy.preprocessing.location as nilo
import warnings
warnings.filterwarnings("ignore")

/u/24/rantahj1/unix/miniconda3/envs/niimpy/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[2]:

data = niimpy.read_csv(config.GPS_PATH, tz='Europe/Helsinki')
data.shape

[2]:

(9857, 6)

There are 9857 location datapoints with 6 columns in the dataset. Let us have a quick look at the data:

[3]:

data.head()

[3]:

	time	double_latitude	double_longitude	double_speed	user	datetime
2013-03-27 06:03:29+02:00	1364357009	43.706667	-72.289097	0.00	gps_u01	2013-03-27 06:03:29+02:00
2013-03-27 06:23:29+02:00	1364358209	43.706637	-72.289066	0.00	gps_u01	2013-03-27 06:23:29+02:00
2013-03-27 06:43:25+02:00	1364359405	43.706678	-72.289018	0.25	gps_u01	2013-03-27 06:43:25+02:00
2013-03-27 07:03:29+02:00	1364360609	43.706665	-72.289087	0.00	gps_u01	2013-03-27 07:03:29+02:00
2013-03-27 07:23:25+02:00	1364361805	43.706808	-72.289370	0.00	gps_u01	2013-03-27 07:23:25+02:00

For further analysis we need a latitude, longitude, speed, and user column. user refers to a unique identifier for a subject.

These columsn exist in the data, but some column names are different. We could provide these column names as arguments, but it is easier to rename them here:

[4]:

data = data.rename(columns={"double_latitude": "latitude", "double_longitude": "longitude", "double_speed": "speed"})
data.head()

[4]:

	time	latitude	longitude	speed	user	datetime
2013-03-27 06:03:29+02:00	1364357009	43.706667	-72.289097	0.00	gps_u01	2013-03-27 06:03:29+02:00
2013-03-27 06:23:29+02:00	1364358209	43.706637	-72.289066	0.00	gps_u01	2013-03-27 06:23:29+02:00
2013-03-27 06:43:25+02:00	1364359405	43.706678	-72.289018	0.25	gps_u01	2013-03-27 06:43:25+02:00
2013-03-27 07:03:29+02:00	1364360609	43.706665	-72.289087	0.00	gps_u01	2013-03-27 07:03:29+02:00
2013-03-27 07:23:25+02:00	1364361805	43.706808	-72.289370	0.00	gps_u01	2013-03-27 07:23:25+02:00

Filter data

Three different methods for filtering low-quality data points are implemented in niimpy:

remove_disabled: removes data points whose disabled column is True.
remove_network: removes data points whose provider column is network. This method keeps only gps-derived data points.
remove_zeros: removes data points close to the point <lat=0, lon=0>.

[5]:

data = nilo.filter_location(data, remove_disabled=False, remove_network=False, remove_zeros=True)
data.shape

[5]:

(9857, 6)

There is no such data points in this dataset; therefore the dataset does not change after this step and the number of datapoints remains the same.

Downsample

Because GPS records are not always very accurate and they have random errors, it is a good practice to downsample or aggregate data points which are recorded in close time windows. In other words, all the records in the same time window are aggregated to form one GPS record associated to that time window. There are a few parameters to adjust the aggregation setting:

freq: represents the length of time window. This parameter follows the formatting of the pandas time offset aliases function. For example ‘5T’ means 5 minute intervals.
method_numerical: specifies how numerical columns should be aggregated. Options are ‘mean’, ‘median’, ‘sum’.
method_categorical: specifies how categorical columns should be aggregated. Options are ‘first’, ‘mode’ (most frequent), ‘last’.

The aggregation is performed for each user (subject) separately.

[6]:

binned_data = niimpy.util.aggregate(data, freq='5min', method_numerical='median')
binned_data = binned_data.dropna()
binned_data.shape

[6]:

(9755, 5)

[7]:

binned_data.head()

[7]:

	user	time	latitude	longitude
2013-03-27 06:00:00+02:00	gps_u00	1.364357e+09	43.759135	-72.329240
2013-03-27 06:20:00+02:00	gps_u00	1.364358e+09	43.759503	-72.329018
2013-03-27 06:40:00+02:00	gps_u00	1.364359e+09	43.759134	-72.329238
2013-03-27 07:00:00+02:00	gps_u00	1.364361e+09	43.759135	-72.329240
2013-03-27 07:20:00+02:00	gps_u00	1.364362e+09	43.759135	-72.329240

After binning, the number of datapoints (bins) reduces to 9755.

Feature extraction

Here is the list of features niimpy extracts from location data:

Distance based features (niimpy.preprocessing.location.location_distance_features):

Feature	Description
`dist_total`	Total distance a person traveled in meters
`variance`, `log_variance`	Variance is defined as sum of variance in latitudes and longitudes
`speed_average`, `speed_variance`, and `speed_max`	Statistics of speed (m/s). Speed, if not given, can be calculated by dividing the distance between two consequitive bins by their time difference
`n_bins`	Number of location bins that a user recorded in dataset

Significant place related features (niimpy.preprocessing.location.location_significant_place_features):

Feature	Description
`n_static`	Number of static points. Static points are defined as bins whose speed is lower than a threshold
`n_moving`	Number of moving points. Equivalent to `n_bins - n_static`
`n_home`	Number of static bins which are close to the person’s home. Home is defined the place most visited during nights. More formally, all the locations recorded during 12 Am and 6 AM are clusterd and the center of largest cluster is assumed to be home
`max_dist_home`	Maximum distance from home
`n_sps`	Bumber of significant places. All of the static bins are clusterd using DBSCAN algorithm. Each cluster represents a Signicant Place (SP) for a user
`n_rare`	Number of rarely visited (referred as outliers in DBSCAN)
`n_transitions`	Number of transitions between significant places
`n_top1`, `n_top2`, `n_top3`, `n_top4`, `n_top5`	: Number of bins in the top `N` cluster. In other words, `n_top1` shows the number of times the person has visited the most freqently visited place
`entropy`, `normalized_entropy`	: Entropy of time spent in clusters. Normalized entropy is the entropy divided by the number of clusters

Local time feature (niimpy.preprocessing.location.location_local_time):

Feature	Description
`time_local`	Local time at the location. This feature is calculated from the time index and the longitude of the location.

[8]:

import warnings
warnings.filterwarnings('ignore', category=RuntimeWarning)

# extract all the available features
all_features = nilo.extract_features_location(binned_data)
all_features

[8]:

	user	n_sps	max_dist_home	n_rare	n_transitions	n_static	n_top3	normalized_entropy	entropy	n_top1	...	n_moving	n_top4	dist_total	n_bins	speed_max	log_variance	speed_average	speed_variance	variance	timezone
2013-03-31 00:00:00+02:00	gps_u00	5.0	2.074186e+04	3.0	48.0	280.0	34.0	3.163631	5.091668	106.0	...	8.0	20.0	4.132581e+05	288.0	1.750000	-5.761688	0.033496	0.044885	0.003146	America/New_York
2013-04-30 00:00:00+03:00	gps_u00	10.0	2.914790e+05	45.0	194.0	1966.0	135.0	3.163793	7.284903	1016.0	...	66.0	45.0	2.179693e+06	2032.0	33.250000	-1.439133	0.269932	6.129277	0.237133	America/New_York
2013-05-31 00:00:00+03:00	gps_u00	12.0	1.041741e+06	86.0	107.0	1827.0	86.0	2.696752	6.701177	1030.0	...	76.0	65.0	6.986551e+06	1903.0	34.000000	2.114892	0.351280	7.590639	8.288687	America/New_York
2013-06-30 00:00:00+03:00	gps_u00	1.0	2.035837e+04	15.0	10.0	22.0	0.0	0.000000	0.000000	15.0	...	2.0	0.0	2.252893e+05	24.0	0.559017	-4.200287	0.044126	0.021490	0.014991	America/New_York
2013-03-31 00:00:00+02:00	gps_u01	2.0	6.975303e+02	0.0	8.0	307.0	0.0	4.392317	3.044522	286.0	...	18.0	0.0	1.328713e+04	325.0	2.692582	-12.520989	0.056290	0.073370	0.000004	America/New_York
2013-04-30 00:00:00+03:00	gps_u01	1.0	1.156568e+04	1.0	2.0	1999.0	0.0	0.000000	0.000000	1998.0	...	71.0	0.0	1.238429e+05	2070.0	32.750000	-10.510017	0.066961	0.629393	0.000027	America/New_York
2013-05-31 00:00:00+03:00	gps_u01	1.0	3.957650e+03	1.0	2.0	3079.0	0.0	0.000000	0.000000	3078.0	...	34.0	0.0	1.228235e+05	3113.0	20.250000	-11.364454	0.026392	0.261978	0.000012	America/New_York

7 rows × 23 columns

[9]:

# extract only distance related features
distance_features = nilo.location_distance_features(binned_data)
distance_features

[9]:

	dist_total	n_bins	speed_max	user	log_variance	speed_average	speed_variance	variance
2013-03-31 00:00:00+02:00	4.132581e+05	288.0	1.750000	gps_u00	-5.761688	0.033496	0.044885	0.003146
2013-04-30 00:00:00+03:00	2.179693e+06	2032.0	33.250000	gps_u00	-1.439133	0.269932	6.129277	0.237133
2013-05-31 00:00:00+03:00	6.986551e+06	1903.0	34.000000	gps_u00	2.114892	0.351280	7.590639	8.288687
2013-06-30 00:00:00+03:00	2.252893e+05	24.0	0.559017	gps_u00	-4.200287	0.044126	0.021490	0.014991
2013-03-31 00:00:00+02:00	1.328713e+04	325.0	2.692582	gps_u01	-12.520989	0.056290	0.073370	0.000004
2013-04-30 00:00:00+03:00	1.238429e+05	2070.0	32.750000	gps_u01	-10.510017	0.066961	0.629393	0.000027
2013-05-31 00:00:00+03:00	1.228235e+05	3113.0	20.250000	gps_u01	-11.364454	0.026392	0.261978	0.000012

The 2 rows correspond to the 2 users present in the dataset. Each column represents a feature. For example user gps_u00 has higher variance in speeds (speed_variance) and location variance (variance) compared to the user gps_u01.

Implementing your own features

If you want to implement a customized feature you can do so with defining a function that accepts a dataframe and returns a dataframe or a series. The returned object should be indexed by user. Then, when calling extract_features_location function, you add the newly implemented function to the feature_functions argument. The default feature functions implemented in niimpy are in this variable:

[10]:

nilo.ALL_FEATURES

[10]:

{<function niimpy.preprocessing.location.location_significant_place_features(df, latitude_column='latitude', longitude_column='latitude', speed_column='speed', speed_threshold=0.277, resample_args={'rule': '1ME'}, **kwargs)>: {},
 <function niimpy.preprocessing.location.location_distance_features(df, latitude_column='latitude', longitude_column='latitude', speed_column='speed', resample_args={'rule': '1ME'}, **kwargs)>: {},
 <function niimpy.preprocessing.location.location_local_time(df, longitude_column='longitude', latitude_column='latitude', resample_args={'rule': '1ME'})>: {}}

[11]:

binned_data.head()

[11]:

	user	time	latitude	longitude
2013-03-27 06:00:00+02:00	gps_u00	2013-03-27 06:00:00+02:00	43.759135	-72.329240
2013-03-27 06:20:00+02:00	gps_u00	2013-03-27 06:20:00+02:00	43.759503	-72.329018
2013-03-27 06:40:00+02:00	gps_u00	2013-03-27 06:40:00+02:00	43.759134	-72.329238
2013-03-27 07:00:00+02:00	gps_u00	2013-03-27 07:00:00+02:00	43.759135	-72.329240
2013-03-27 07:20:00+02:00	gps_u00	2013-03-27 07:20:00+02:00	43.759135	-72.329240

You can add your new function to the nilo.ALL_FEATURES dictionary and call extract_features_location function. Or if you are interested in only extracting your desired feature you can pass a dictionary containing just that function, like here:

[12]:

# customized function
def max_speed(df):
    grouped = df.groupby('user')
    df = grouped['speed'].max().reset_index('user')
    return df

customized_features = nilo.extract_features_location(
    binned_data,
    features={max_speed: {}}
)
customized_features

[12]:

	user	speed
0	gps_u00	34.00
1	gps_u01	32.75