Location Data

Introduction

GPS location data contain rich information about people’s behavioral and mobility patterns. However, working with such data is challenging: the recordings are noisy and often have gaps, and designing features that capture a subject’s mobility patterns requires care. To address these problems, niimpy provides three main functions to clean, downsample, and extract features from GPS location data:

  • niimpy.preprocessing.location.filter_location: removes low-quality location data points

  • niimpy.util.aggregate: downsamples data points to reduce noise

  • niimpy.preprocessing.location.extract_features_location: feature extraction from location data

In the following, we go through analysing a subset of location data from the StudentLife dataset.

Read data

[1]:
import niimpy
from niimpy import config
import niimpy.preprocessing.location as nilo
import warnings
warnings.filterwarnings("ignore")
[2]:
data = niimpy.read_csv(config.GPS_PATH, tz='Europe/Helsinki')
data.shape
[2]:
(9857, 6)

There are 9857 location datapoints with 6 columns in the dataset. Let us have a quick look at the data:

[3]:
data.head()
[3]:
time double_latitude double_longitude double_speed user datetime
2013-03-27 06:03:29+02:00 1364357009 43.706667 -72.289097 0.00 gps_u01 2013-03-27 06:03:29+02:00
2013-03-27 06:23:29+02:00 1364358209 43.706637 -72.289066 0.00 gps_u01 2013-03-27 06:23:29+02:00
2013-03-27 06:43:25+02:00 1364359405 43.706678 -72.289018 0.25 gps_u01 2013-03-27 06:43:25+02:00
2013-03-27 07:03:29+02:00 1364360609 43.706665 -72.289087 0.00 gps_u01 2013-03-27 07:03:29+02:00
2013-03-27 07:23:25+02:00 1364361805 43.706808 -72.289370 0.00 gps_u01 2013-03-27 07:23:25+02:00

The necessary columns for further analysis are double_latitude, double_longitude, double_speed, and user. user refers to a unique identifier for a subject.

Filter data

Three different methods for filtering low-quality data points are implemented in niimpy:

  • remove_disabled: removes data points whose disabled column is True.

  • remove_network: removes data points whose provider column is network. This method keeps only gps-derived data points.

  • remove_zeros: removes data points close to the point <lat=0, lon=0>.

[4]:
data = nilo.filter_location(data, remove_disabled=False, remove_network=False, remove_zeros=True)
data.shape
[4]:
(9857, 6)

There are no such data points in this dataset; therefore the dataset does not change in this step and the number of datapoints stays the same.

Downsample

Because GPS records are not always accurate and contain random errors, it is good practice to downsample, or aggregate, data points recorded within close time windows: all records falling in the same window are aggregated into one GPS record associated with that window. A few parameters adjust the aggregation:

  • freq: the length of the time window. This parameter follows the format of the pandas time offset aliases; for example, ‘5min’ means 5-minute intervals.

  • method_numerical: specifies how numerical columns should be aggregated. Options are ‘mean’, ‘median’, ‘sum’.

  • method_categorical: specifies how categorical columns should be aggregated. Options are ‘first’, ‘mode’ (most frequent), ‘last’.

The aggregation is performed for each user (subject) separately.
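As an illustration, this per-user, per-window binning can be sketched in plain pandas. This is only a sketch of the idea, not niimpy’s actual implementation:

```python
# Sketch of per-user time binning in plain pandas; niimpy.util.aggregate
# wraps similar logic and adds more options.
import pandas as pd

idx = pd.to_datetime(["2013-03-27 06:01", "2013-03-27 06:03",
                      "2013-03-27 06:07"])
df = pd.DataFrame({"user": ["u00", "u00", "u00"],
                   "double_latitude": [43.7591, 43.7593, 43.7592]},
                  index=idx)

# Median of a numerical column within 5-minute windows, per user
binned = (df.groupby("user")["double_latitude"]
            .resample("5min")
            .median())
```

Here the records at 06:01 and 06:03 fall into the 06:00 window and are reduced to their median, while the 06:07 record forms its own 06:05 window.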

[5]:
binned_data = niimpy.util.aggregate(data, freq='5min', method_numerical='median')
binned_data = binned_data.reset_index(0).dropna()
binned_data.shape
[5]:
(9755, 5)
[6]:
binned_data
[6]:
user time double_latitude double_longitude double_speed
2013-03-27 06:00:00+02:00 gps_u00 1.364357e+09 43.759135 -72.329240 0.0
2013-03-27 06:20:00+02:00 gps_u00 1.364358e+09 43.759503 -72.329018 0.0
2013-03-27 06:40:00+02:00 gps_u00 1.364359e+09 43.759134 -72.329238 0.0
2013-03-27 07:00:00+02:00 gps_u00 1.364361e+09 43.759135 -72.329240 0.0
2013-03-27 07:20:00+02:00 gps_u00 1.364362e+09 43.759135 -72.329240 0.0
... ... ... ... ... ...
2013-05-29 16:10:00+03:00 gps_u01 1.369833e+09 43.706711 -72.289205 0.0
2013-05-29 16:20:00+03:00 gps_u01 1.369834e+09 43.706708 -72.289162 0.0
2013-05-29 16:30:00+03:00 gps_u01 1.369834e+09 43.706725 -72.289149 0.0
2013-05-29 16:40:00+03:00 gps_u01 1.369835e+09 43.706697 -72.289165 0.0
2013-05-29 16:50:00+03:00 gps_u01 1.369836e+09 43.706713 -72.289191 0.0

9755 rows × 5 columns

After binning, the number of datapoints (bins) reduces to 9755.

Feature extraction

Here is the list of features niimpy extracts from location data:

  1. Distance based features (niimpy.preprocessing.location.location_distance_features):

  • dist_total: total distance a person traveled, in meters

  • variance, log_variance: variance is defined as the sum of the variances of latitudes and longitudes

  • speed_average, speed_variance, speed_max: statistics of speed (m/s). Speed, if not given, can be calculated by dividing the distance between two consecutive bins by their time difference

  • n_bins: number of location bins that a user recorded in the dataset
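When speed is not recorded, dividing the distance between two consecutive bins by their time difference gives an estimate. A minimal sketch using the haversine great-circle distance (niimpy’s internal implementation may differ in its details):

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000  # mean Earth radius in meters
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Speed estimate between two consecutive bins recorded 300 seconds apart
d = haversine_m(43.70667, -72.28910, 43.70680, -72.28900)
speed = d / 300  # m/s
```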

  2. Significant place related features (niimpy.preprocessing.location.location_significant_place_features):

  • n_static: number of static points. Static points are defined as bins whose speed is lower than a threshold

  • n_moving: number of moving points. Equivalent to n_bins - n_static

  • n_home: number of static bins close to the person’s home. Home is defined as the place most visited during nights. More formally, all the locations recorded between 12 AM and 6 AM are clustered and the center of the largest cluster is assumed to be home

  • max_dist_home: maximum distance from home

  • n_sps: number of significant places. All of the static bins are clustered using the DBSCAN algorithm. Each cluster represents a Significant Place (SP) for a user

  • n_rare: number of rarely visited places (referred to as outliers in DBSCAN)

  • n_transitions: number of transitions between significant places

  • n_top1, n_top2, n_top3, n_top4, n_top5: number of bins in the top N cluster. In other words, n_top1 shows the number of times the person has visited the most frequently visited place

  • entropy, normalized_entropy: entropy of time spent in clusters. Normalized entropy is the entropy divided by the number of clusters
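To illustrate how DBSCAN separates significant places from rare visits, here is a minimal sketch with scikit-learn. The eps and min_samples values are illustrative choices made for this toy example, not niimpy’s defaults:

```python
# Sketch of significant-place detection with DBSCAN (scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN

# Static bins as (lat, lon): two tight groups plus one stray point
points = np.array([
    [43.7067, -72.2891], [43.7068, -72.2890], [43.7066, -72.2892],
    [43.7591, -72.3292], [43.7592, -72.3290],
    [43.9000, -72.1000],            # rarely visited outlier
])

# eps is in degrees here (~100 m near the equator); illustrative only
labels = DBSCAN(eps=0.001, min_samples=2).fit_predict(points)
n_sps = len(set(labels) - {-1})     # clusters = significant places
n_rare = int(np.sum(labels == -1))  # DBSCAN outliers = rare visits
```

The two tight groups become two significant places, and the stray point is labeled -1, i.e. a rare visit.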

[7]:
import warnings
warnings.filterwarnings('ignore', category=RuntimeWarning)

# extract all the available features
all_features = nilo.extract_features_location(binned_data)
all_features
[7]:
user n_significant_places n_sps n_static n_moving n_rare n_home max_dist_home n_transitions n_top1 ... n_top5 entropy normalized_entropy dist_total n_bins speed_average speed_variance speed_max variance log_variance
2013-03-31 00:00:00+02:00 gps_u00 6 5.0 280.0 8.0 3.0 106.0 2.074186e+04 48.0 106.0 ... 18.0 5.091668 3.163631 4.132581e+05 288.0 0.033496 0.044885 1.750000 0.003146 -5.761688
2013-04-30 00:00:00+03:00 gps_u00 10 10.0 1966.0 66.0 45.0 1010.0 2.914790e+05 194.0 1016.0 ... 38.0 7.284903 3.163793 2.179693e+06 2032.0 0.269932 6.129277 33.250000 0.237133 -1.439133
2013-05-31 00:00:00+03:00 gps_u00 15 12.0 1827.0 76.0 86.0 1028.0 1.041741e+06 107.0 1030.0 ... 46.0 6.701177 2.696752 6.986551e+06 1903.0 0.351280 7.590639 34.000000 8.288687 2.114892
2013-06-30 00:00:00+03:00 gps_u00 1 1.0 22.0 2.0 15.0 0.0 2.035837e+04 10.0 15.0 ... 0.0 0.000000 0.000000 2.252893e+05 24.0 0.044126 0.021490 0.559017 0.014991 -4.200287
2013-03-31 00:00:00+02:00 gps_u01 4 2.0 307.0 18.0 0.0 260.0 6.975303e+02 8.0 286.0 ... 0.0 3.044522 4.392317 1.328713e+04 325.0 0.056290 0.073370 2.692582 0.000004 -12.520989
2013-04-30 00:00:00+03:00 gps_u01 4 1.0 1999.0 71.0 1.0 1500.0 1.156568e+04 2.0 1998.0 ... 0.0 0.000000 0.000000 1.238429e+05 2070.0 0.066961 0.629393 32.750000 0.000027 -10.510017
2013-05-31 00:00:00+03:00 gps_u01 2 1.0 3079.0 34.0 1.0 45.0 3.957650e+03 2.0 3078.0 ... 0.0 0.000000 0.000000 1.228235e+05 3113.0 0.026392 0.261978 20.250000 0.000012 -11.364454

7 rows × 23 columns

[8]:
# extract only distance related features
features = {
    nilo.location_distance_features: {} # arguments
}
distance_features = nilo.extract_features_location(
    binned_data,
    features=features)
distance_features
[8]:
user dist_total n_bins speed_average speed_variance speed_max variance log_variance
2013-03-31 00:00:00+02:00 gps_u00 4.132581e+05 288.0 0.033496 0.044885 1.750000 0.003146 -5.761688
2013-04-30 00:00:00+03:00 gps_u00 2.179693e+06 2032.0 0.269932 6.129277 33.250000 0.237133 -1.439133
2013-05-31 00:00:00+03:00 gps_u00 6.986551e+06 1903.0 0.351280 7.590639 34.000000 8.288687 2.114892
2013-06-30 00:00:00+03:00 gps_u00 2.252893e+05 24.0 0.044126 0.021490 0.559017 0.014991 -4.200287
2013-03-31 00:00:00+02:00 gps_u01 1.328713e+04 325.0 0.056290 0.073370 2.692582 0.000004 -12.520989
2013-04-30 00:00:00+03:00 gps_u01 1.238429e+05 2070.0 0.066961 0.629393 32.750000 0.000027 -10.510017
2013-05-31 00:00:00+03:00 gps_u01 1.228235e+05 3113.0 0.026392 0.261978 20.250000 0.000012 -11.364454

Each row corresponds to one user and one month (features are resampled to monthly windows by default), and each column represents a feature. For example, user gps_u00 has higher variance in speeds (speed_variance) and location variance (variance) compared to user gps_u01.

Implementing your own features

If you want to implement a customized feature, you can do so by defining a function that accepts a dataframe and returns a dataframe or a series. The returned object should be indexed by user. Then, when calling the extract_features_location function, add the newly implemented function to the features argument. The default feature functions implemented in niimpy are stored in this variable:

[9]:
nilo.ALL_FEATURES
[9]:
{<function niimpy.preprocessing.location.location_number_of_significant_places(df, config={})>: {'resample_args': {'rule': '1M'}},
 <function niimpy.preprocessing.location.location_significant_place_features(df, config={})>: {'resample_args': {'rule': '1M'}},
 <function niimpy.preprocessing.location.location_distance_features(df, config={})>: {'resample_args': {'rule': '1M'}}}
[10]:
binned_data.head()
[10]:
user time double_latitude double_longitude double_speed
2013-03-27 06:00:00+02:00 gps_u00 1.364357e+09 43.759135 -72.329240 0.0
2013-03-27 06:20:00+02:00 gps_u00 1.364358e+09 43.759503 -72.329018 0.0
2013-03-27 06:40:00+02:00 gps_u00 1.364359e+09 43.759134 -72.329238 0.0
2013-03-27 07:00:00+02:00 gps_u00 1.364361e+09 43.759135 -72.329240 0.0
2013-03-27 07:20:00+02:00 gps_u00 1.364362e+09 43.759135 -72.329240 0.0

You can add your new function to the nilo.ALL_FEATURES dictionary and call extract_features_location. Or, if you are interested in extracting only your desired feature, you can pass a dictionary containing just that function, as here:

[11]:
# customized function
def max_speed(df, feature_arg):
    grouped = df.groupby('user')
    df = grouped['double_speed'].max().reset_index('user')
    return df

customized_features = nilo.extract_features_location(
    binned_data,
    features={max_speed: {}}
)
customized_features
[11]:
user double_speed
0 gps_u00 34.00
1 gps_u01 32.75