niimpy.preprocessing.location module

niimpy.preprocessing.location.cluster_locations(lats, lons, min_samples=5, eps=200)

Performs DBSCAN clustering on the locations.

Parameters:
lats : pd.DataFrame

Latitudes

lons : pd.DataFrame

Longitudes

min_samples : int

Minimum number of samples to form a cluster. Default is 5.

eps : float

Epsilon parameter in DBSCAN. The maximum distance between two neighbouring samples. Default is 200.

Returns:
clusters : array

Array of cluster labels. -1 indicates an outlier.
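
The roles of min_samples and eps can be illustrated with a minimal, dependency-free DBSCAN sketch. It is not the library's implementation. The names dbscan_labels and haversine_m are hypothetical, and the sketch assumes that eps is a great-circle distance in metres and that a point counts as its own neighbour. In practice a library implementation such as scikit-learn's DBSCAN would normally be used.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres (assumed mean Earth radius)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def dbscan_labels(lats, lons, min_samples=5, eps=200.0):
    """Return one cluster label per point; -1 marks an outlier."""
    n = len(lats)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n)
         if haversine_m(lats[i], lons[i], lats[j], lons[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # outlier reachable from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep growing
    return labels


# Three points within about 22 m of each other, plus one point far away.
print(dbscan_labels([60.0, 60.0001, 60.0002, 61.0],
                    [24.0, 24.0, 24.0, 24.0],
                    min_samples=3, eps=200.0))  # [0, 0, 0, -1]
```

Raising min_samples or lowering eps makes clusters harder to form, so more points end up labelled -1.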

niimpy.preprocessing.location.compute_nbin_maxdist_home(lats, lons, latlon_home, home_radius=50)

Computes the number of bins at home and the maximum distance from home

Parameters:
lats : pd.DataFrame

Latitudes

lons : pd.DataFrame

Longitudes

latlon_home : array

A tuple (lat, lon) giving the coordinates of home

Returns:
(n_home, max_dist_home) : tuple

n_home: number of bins in which the person has been near home
max_dist_home: maximum distance that the person has been from home
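
As a rough illustration of the two returned values, the following sketch counts the bins within home_radius of home and takes the largest distance from home. It is written under assumptions that the text above does not confirm: distances are great-circle distances in metres, "near" means a distance of at most home_radius, and the helper names are hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres (assumed mean Earth radius)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def nbin_maxdist_home_sketch(lats, lons, latlon_home, home_radius=50):
    """Return (n_home, max_dist_home) for the given bins."""
    home_lat, home_lon = latlon_home
    dists = [haversine_m(lat, lon, home_lat, home_lon)
             for lat, lon in zip(lats, lons)]
    n_home = sum(d <= home_radius for d in dists)  # bins near home
    max_dist_home = max(dists) if dists else float("nan")
    return n_home, max_dist_home


n_home, max_dist = nbin_maxdist_home_sketch(
    [60.0, 60.0, 60.01], [24.0, 24.0002, 24.0], (60.0, 24.0))
print(n_home, round(max_dist))
```

Here the first two bins lie within about 11 m of home, while the third is roughly 1.1 km away.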

niimpy.preprocessing.location.distance_matrix(lats, lons)

Computes the distance matrix using the great-circle distance formula

https://en.wikipedia.org/wiki/Great-circle_distance#Formulae

Parameters:
lats : array

Latitudes

lons : array

Longitudes

Returns:
dists : matrix

Entry (i, j) is the great-circle distance between points i and j, i.e. the distance between (lats[i], lons[i]) and (lats[j], lons[j]).
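
A minimal sketch of such a matrix, using the spherical law of cosines from the linked formulae. It is not the library's code. The function name, the Earth radius constant, and the choice of metres as the unit are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres


def great_circle_matrix(lats, lons):
    """Return an n x n nested list of great-circle distances in metres."""
    if len(lats) != len(lons):
        raise ValueError("lats and lons must have the same length")
    lat_r = [math.radians(v) for v in lats]
    lon_r = [math.radians(v) for v in lons]
    n = len(lat_r)
    dists = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Spherical law of cosines; the clamp guards against rounding
            # pushing the cosine slightly outside [-1, 1].
            c = (math.sin(lat_r[i]) * math.sin(lat_r[j])
                 + math.cos(lat_r[i]) * math.cos(lat_r[j])
                 * math.cos(lon_r[i] - lon_r[j]))
            d = EARTH_RADIUS_M * math.acos(max(-1.0, min(1.0, c)))
            dists[i][j] = dists[j][i] = d  # the matrix is symmetric
    return dists


# Two points on the equator, a quarter of the way around the globe.
m = great_circle_matrix([0.0, 0.0], [0.0, 90.0])
print(round(m[0][1] / 1000), "km")
```

The diagonal is zero and the matrix is symmetric, so only the upper triangle needs to be computed.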

niimpy.preprocessing.location.extract_features_location(df, features=None)

Calculates location features

Parameters:
df : pd.DataFrame

DataFrame of location data. It must contain these columns: double_latitude, double_longitude, user, group. double_speed is optional; if it is not provided, speed is computed from the location data.

speed_threshold : float

Bins whose speed is lower than speed_threshold are considered static, and the rest are considered moving.

features : map (dictionary) of functions that compute features

A map of maps, where the keys of the outer map are the names of functions that compute features and each nested map contains the keyword arguments to that function. If a function takes no arguments, use an empty map. Default is None, in which case all available functions are used. Those functions are in the dict location.ALL_FEATURES. You can implement your own function and use it instead, or add it to that map.

Returns:
features : pd.DataFrame

DataFrame of computed features, where the index is users and the columns are the features.
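
The "map of maps" pattern can be sketched without the library. In this illustration every name (mean_latitude, n_samples, run_features) is hypothetical, plain lists of dicts stand in for a DataFrame, and the outer keys are assumed to be the callables themselves. If the library expects function names as strings instead, the lookup step would differ.

```python
def mean_latitude(rows, latitude_column="double_latitude"):
    """Hypothetical feature: mean of the latitude column."""
    values = [row[latitude_column] for row in rows]
    return {"mean_latitude": sum(values) / len(values)}


def n_samples(rows):
    """Hypothetical feature that takes no keyword arguments."""
    return {"n_samples": len(rows)}


def run_features(rows, features):
    """Call each feature function with its own keyword arguments."""
    result = {}
    for func, kwargs in features.items():
        result.update(func(rows, **kwargs))
    return result


# Outer map: function -> nested map of keyword arguments.
# An empty nested map means "call with defaults".
features = {
    mean_latitude: {"latitude_column": "double_latitude"},
    n_samples: {},
}
rows = [{"double_latitude": 60.0}, {"double_latitude": 62.0}]
print(run_features(rows, features))  # {'mean_latitude': 61.0, 'n_samples': 2}
```

Adding a custom feature then amounts to adding one more function and its argument map to the dictionary.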

niimpy.preprocessing.location.filter_location(location, remove_disabled=True, remove_zeros=True, remove_network=False, latitude_column='double_latitude', longitude_column='double_longitude', label_column='label', provider_column='provider')

Removes low-quality or implausible location samples

Parameters:
location : pd.DataFrame

DataFrame of locations

remove_disabled : bool

Remove locations whose label is disabled

remove_zeros : bool

Remove locations whose latitude and longitude are close to 0

remove_network : bool

Keep only locations whose provider is gps

Returns:
location : pd.DataFrame

niimpy.preprocessing.location.find_home(lats, lons, times)

Find coordinates of the home of a person

Home is defined as the place most visited between 12 am and 6 am. Locations recorded within this period are first clustered, and the center of the largest cluster is taken as the home.

Parameters:
lats : array-like

Latitudes

lons : array-like

Longitudes

times : array-like

Times of the recorded coordinates

Returns:
(lat_home, lon_home) : tuple of floats

Coordinates of the home
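
The home-finding idea (keep night-time points, group them, take the center of the largest group) can be sketched as follows. This is a simplified stand-in, not the library's method: instead of clustering, it groups points into fixed grid cells of cell_deg degrees, and it takes the "center" to be the mean coordinate. The function name and the cell_deg parameter are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def find_home_sketch(lats, lons, times, cell_deg=0.001):
    """Return (lat_home, lon_home) from points recorded between 12 am and 6 am."""
    cells = defaultdict(list)
    for lat, lon, t in zip(lats, lons, times):
        if 0 <= t.hour < 6:  # keep only night-time samples
            key = (round(lat / cell_deg), round(lon / cell_deg))
            cells[key].append((lat, lon))
    if not cells:
        return float("nan"), float("nan")  # no night-time data at all
    largest = max(cells.values(), key=len)  # most visited cell
    lat_home = sum(p[0] for p in largest) / len(largest)
    lon_home = sum(p[1] for p in largest) / len(largest)
    return lat_home, lon_home


times = [datetime(2024, 1, 1, 1), datetime(2024, 1, 1, 2),
         datetime(2024, 1, 1, 3), datetime(2024, 1, 1, 14),
         datetime(2024, 1, 2, 4)]
lats = [60.0001, 60.0002, 60.0003, 61.0, 62.0]
lons = [24.0, 24.0, 24.0, 24.0, 24.0]
print(find_home_sketch(lats, lons, times))
```

The afternoon sample is ignored, and the single night-time sample at latitude 62.0 loses to the cell that holds three samples.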

niimpy.preprocessing.location.get_speeds_totaldist(lats, lons, times)

Computes the speed of bins by dividing the distance between them by their time difference
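
A minimal sketch of this computation, assuming that speed is taken between consecutive bins, that distances are great-circle distances in metres, and that times are datetime objects, which gives speeds in metres per second. The function names are hypothetical, and the handling of a zero time difference (NaN) is a choice made for this sketch.

```python
import math
from datetime import datetime, timedelta


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres (assumed mean Earth radius)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def speeds_and_total_distance(lats, lons, times):
    """Return (speeds, total_distance) over consecutive bins."""
    speeds = []
    total = 0.0
    for i in range(1, len(lats)):
        d = haversine_m(lats[i - 1], lons[i - 1], lats[i], lons[i])
        dt = (times[i] - times[i - 1]).total_seconds()
        total += d
        # Avoid dividing by zero when two bins share a timestamp.
        speeds.append(d / dt if dt > 0 else float("nan"))
    return speeds, total


t0 = datetime(2024, 1, 1, 12, 0, 0)
times = [t0, t0 + timedelta(seconds=100), t0 + timedelta(seconds=200)]
speeds, total = speeds_and_total_distance(
    [60.0, 60.01, 60.01], [24.0, 24.0, 24.0], times)
print(speeds, total)
```

With n bins this yields n - 1 speeds, one per pair of consecutive bins.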

Parameters:
lats : array-like

Array of latitudes

lons : array-like

Array of longitudes

times : array-like

Array of times associated with bins

Returns:
(speeds, total_distances) : tuple

Speeds (array) and total distance travelled (float)

niimpy.preprocessing.location.group_data(df)

Group the dataframe by a standard set of columns listed in group_by_columns.

niimpy.preprocessing.location.location_distance_features(df, config={})

Calculates features related to distance and speed.

Parameters:
df: dataframe with date index
config: A dictionary of optional arguments

Optional arguments in config:

longitude_column: The name of the column with longitude data in a floating point format. Defaults to 'double_longitude'.
latitude_column: The name of the column with latitude data in a floating point format. Defaults to 'double_latitude'.
speed_column: The name of the column with speed data in a floating point format. Defaults to 'double_speed'.
resample_args: A dictionary of arguments for the pandas resample function. For example, to resample by hour, pass {"rule": "1h"}.
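
A config dictionary built from the options listed above might look like the following. Only keys named in this section are used. The call itself is left as a comment because it needs the library and a DataFrame df with a date index.

```python
# Every key is optional; the first three values repeat the documented defaults.
config = {
    "longitude_column": "double_longitude",
    "latitude_column": "double_latitude",
    "speed_column": "double_speed",
    "resample_args": {"rule": "1h"},  # resample by hour
}

# With niimpy installed and df prepared, the call would be:
# features = niimpy.preprocessing.location.location_distance_features(df, config)
```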

niimpy.preprocessing.location.location_number_of_significant_places(df, config={})

Computes number of significant places

niimpy.preprocessing.location.location_significant_place_features(df, config={})

Calculates features related to Significant Places.

Parameters:
df: dataframe with date index
config: A dictionary of optional arguments

Optional arguments in config:

longitude_column: The name of the column with longitude data in a floating point format. Defaults to 'double_longitude'.
latitude_column: The name of the column with latitude data in a floating point format. Defaults to 'double_latitude'.
speed_column: The name of the column with speed data in a floating point format. Defaults to 'double_speed'.
resample_args: A dictionary of arguments for the pandas resample function. For example, to resample by hour, pass {"rule": "1h"}.

niimpy.preprocessing.location.number_of_significant_places(lats, lons, times)

Computes number of significant places.

The number of significant places is computed by first clustering the locations in each month and then taking the median of the number of clusters across months.

It is assumed that lats and lons are the coordinates of static points.
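
The per-month median can be sketched as below. The grouping and median steps follow the description above, but the cluster count is a stand-in: count_grid_cells simply counts distinct grid cells rather than running a clustering algorithm, and all names here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def count_grid_cells(points, cell_deg=0.001):
    """Stand-in for clustering: count distinct grid cells of cell_deg degrees."""
    return len({(round(lat / cell_deg), round(lon / cell_deg))
                for lat, lon in points})


def median_monthly_places(lats, lons, times, count_places=count_grid_cells):
    """Median over months of the number of places found in each month."""
    by_month = defaultdict(list)
    for lat, lon, t in zip(lats, lons, times):
        by_month[(t.year, t.month)].append((lat, lon))
    counts = [count_places(points) for points in by_month.values()]
    return median(counts)  # raises StatisticsError if there is no data


# January visits 2 places, February 4, March 3.
lats = [60.0, 61.0, 60.0, 61.0, 62.0, 63.0, 60.0, 61.0, 62.0]
lons = [24.0] * 9
times = ([datetime(2024, 1, 5)] * 2 + [datetime(2024, 2, 5)] * 4
         + [datetime(2024, 3, 5)] * 3)
print(median_monthly_places(lats, lons, times))  # 3
```

Taking the median rather than the mean keeps a single unusual month, such as a long trip, from dominating the result.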

Parameters:
lats : pd.DataFrame

Latitudes

lons : pd.DataFrame

Longitudes

times : array

Array of times

Returns:
The number of significant places discovered

niimpy.preprocessing.location.reset_groups(df)

Group the dataframe by a standard set of columns listed in group_by_columns.