{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "47b843fb", "metadata": {}, "source": [ "# Location Data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7834ba9c", "metadata": {}, "source": [ "## Introduction\n", "\n", "GPS location data contains rich information about people's behavioral and mobility patterns. However, working with such data is challenging because it is noisy and often has missing values. Designing features that capture the mobility patterns of subjects is also a crucial task.\n", "\n", "Location data is expected to have the following columns (column names can differ, but in that case they must be passed as parameters):\n", "- `user`: Subject ID\n", "- `device`: Device ID\n", "- `latitude`: Latitude as a floating point number\n", "- `longitude`: Longitude as a floating point number\n", "\n", "Optional columns include:\n", "- `speed`: Speed measured at the location\n", "\n", "\n", "`Niimpy` provides these main functions to clean, downsample, and extract features from GPS location data:\n", "\n", "- `niimpy.preprocessing.location.filter_location`: removes low-quality location data points\n", "- `niimpy.util.aggregate`: downsamples data points to reduce noise\n", "- `niimpy.preprocessing.location.extract_features_location`: extracts features from location data\n", "\n", "In the following, we analyse a subset of the location data provided in the [StudentLife](https://studentlife.cs.dartmouth.edu/dataset.html) dataset." ] }, { "attachments": {}, "cell_type": "markdown", "id": "acf896ca", "metadata": {}, "source": [ "## Read data" ] }, { "cell_type": "code", "execution_count": 1, "id": "48d34157", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/u/24/rantahj1/unix/miniconda3/envs/niimpy/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import niimpy\n", "from niimpy import config\n", "import niimpy.preprocessing.location as nilo\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "2aaabde5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9857, 6)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = niimpy.read_csv(config.GPS_PATH, tz='Europe/Helsinki')\n", "data.shape" ] }, { "attachments": {}, "cell_type": "markdown", "id": "521bb82a", "metadata": {}, "source": [ "There are 9857 location datapoints with 6 columns in the dataset. Let us have a quick look at the data:" ] }, { "cell_type": "code", "execution_count": 3, "id": "146f5533", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " time double_latitude double_longitude \\\n", "2013-03-27 06:03:29+02:00 1364357009 43.706667 -72.289097 \n", "2013-03-27 06:23:29+02:00 1364358209 43.706637 -72.289066 \n", "2013-03-27 06:43:25+02:00 1364359405 43.706678 -72.289018 \n", "2013-03-27 07:03:29+02:00 1364360609 43.706665 -72.289087 \n", "2013-03-27 07:23:25+02:00 1364361805 43.706808 -72.289370 \n", "\n", " double_speed user datetime \n", "2013-03-27 06:03:29+02:00 0.00 gps_u01 2013-03-27 06:03:29+02:00 \n", "2013-03-27 06:23:29+02:00 0.00 gps_u01 2013-03-27 06:23:29+02:00 \n", "2013-03-27 06:43:25+02:00 0.25 gps_u01 2013-03-27 06:43:25+02:00 \n", "2013-03-27 07:03:29+02:00 0.00 gps_u01 2013-03-27 07:03:29+02:00 \n", "2013-03-27 07:23:25+02:00 0.00 gps_u01 2013-03-27 07:23:25+02:00 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "86e7396c", "metadata": {}, "source": [ "For further analysis we need `latitude`, `longitude`, `speed`, and `user` columns. `user` refers to a unique identifier for a subject.\n", "\n", "These columns exist in the data, but some of them have different names. We could pass these column names as arguments, but it is easier to rename them here:" ] }, { "cell_type": "code", "execution_count": 4, "id": "df3a2a0a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " time latitude longitude speed user \\\n", "2013-03-27 06:03:29+02:00 1364357009 43.706667 -72.289097 0.00 gps_u01 \n", "2013-03-27 06:23:29+02:00 1364358209 43.706637 -72.289066 0.00 gps_u01 \n", "2013-03-27 06:43:25+02:00 1364359405 43.706678 -72.289018 0.25 gps_u01 \n", "2013-03-27 07:03:29+02:00 1364360609 43.706665 -72.289087 0.00 gps_u01 \n", "2013-03-27 07:23:25+02:00 1364361805 43.706808 -72.289370 0.00 gps_u01 \n", "\n", " datetime \n", "2013-03-27 06:03:29+02:00 2013-03-27 06:03:29+02:00 \n", "2013-03-27 06:23:29+02:00 2013-03-27 06:23:29+02:00 \n", "2013-03-27 06:43:25+02:00 2013-03-27 06:43:25+02:00 \n", "2013-03-27 07:03:29+02:00 2013-03-27 07:03:29+02:00 \n", "2013-03-27 07:23:25+02:00 2013-03-27 07:23:25+02:00 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data.rename(columns={\"double_latitude\": \"latitude\", \"double_longitude\": \"longitude\", \"double_speed\": \"speed\"})\n", "data.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fe601920", "metadata": {}, "source": [ "## Filter data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4b7e09e0", "metadata": {}, "source": [ "Three different methods for filtering low-quality data points are implemented in `niimpy`:\n", "\n", "- `remove_disabled`: removes data points whose `disabled` column is `True`.\n", "- `remove_network`: removes data points whose `provider` column is `network`. This method keeps only `gps`-derived data points.\n", "- `remove_zeros`: removes data points close to the point (lat=0, lon=0)." 
] }, { "cell_type": "code", "execution_count": 5, "id": "a96bdaa6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9857, 6)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = nilo.filter_location(data, remove_disabled=False, remove_network=False, remove_zeros=True)\n", "data.shape" ] }, { "attachments": {}, "cell_type": "markdown", "id": "224df8a5", "metadata": {}, "source": [ "No data points in this dataset match the filter, so the dataset does not change after this step and the number of datapoints remains the same." ] }, { "attachments": {}, "cell_type": "markdown", "id": "c09e1ddc", "metadata": {}, "source": [ "## Downsample" ] }, { "attachments": {}, "cell_type": "markdown", "id": "dfce64a6", "metadata": {}, "source": [ "Because GPS records are not always accurate and contain random errors, it is good practice to downsample or aggregate data points that are recorded within a short time window. In other words, all the records in the same time window are aggregated to form one GPS record associated with that time window. A few parameters adjust the aggregation:\n", "\n", "- `freq`: the length of the time window. This parameter follows the format of the pandas [time offset aliases](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases). For example, '5min' means 5-minute intervals.\n", "- `method_numerical`: specifies how numerical columns should be aggregated. Options are 'mean', 'median', 'sum'.\n", "- `method_categorical`: specifies how categorical columns should be aggregated. Options are 'first', 'mode' (most frequent), 'last'.\n", "\n", "The aggregation is performed for each `user` (subject) separately." 
] }, { "cell_type": "code", "execution_count": 6, "id": "01aefd90", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9755, 5)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "binned_data = niimpy.util.aggregate(data, freq='5min', method_numerical='median')\n", "binned_data = binned_data.dropna()\n", "binned_data.shape" ] }, { "cell_type": "code", "execution_count": 7, "id": "d7027bec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user time latitude longitude speed\n", "2013-03-27 06:00:00+02:00 gps_u00 1.364357e+09 43.759135 -72.329240 0.0\n", "2013-03-27 06:20:00+02:00 gps_u00 1.364358e+09 43.759503 -72.329018 0.0\n", "2013-03-27 06:40:00+02:00 gps_u00 1.364359e+09 43.759134 -72.329238 0.0\n", "2013-03-27 07:00:00+02:00 gps_u00 1.364361e+09 43.759135 -72.329240 0.0\n", "2013-03-27 07:20:00+02:00 gps_u00 1.364362e+09 43.759135 -72.329240 0.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "binned_data.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "968b2039", "metadata": {}, "source": [ "After binning, the number of datapoints (bins) reduces to 9755." ] }, { "attachments": {}, "cell_type": "markdown", "id": "99e948dd", "metadata": {}, "source": [ "## Feature extraction" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c66b48e1", "metadata": {}, "source": [ "Here is the list of features `niimpy` extracts from location data:\n", "\n", "1. Distance-based features (`niimpy.preprocessing.location.location_distance_features`):\n", "\n", "| Feature | Description |\n", "|--------------|-----|\n", "| `dist_total` | Total distance a person traveled in meters |\n", "| `variance`, `log_variance` | Variance is defined as the sum of the variances of latitude and longitude |\n", "|`speed_average`, `speed_variance`, and `speed_max`| Statistics of speed (m/s). Speed, if not given, can be calculated by dividing the distance between two consecutive bins by their time difference|\n", "|`n_bins`|Number of location bins that a user recorded in the dataset|\n", "\n", "2. Significant place related features (`niimpy.preprocessing.location.location_significant_place_features`):\n", "\n", "| Feature | Description |\n", "|--------------|-----|\n", "|`n_static`| Number of static points. Static points are defined as bins whose speed is lower than a threshold|\n", "|`n_moving`| Number of moving points. Equivalent to `n_bins - n_static`|\n", "|`n_home`| Number of static bins which are close to the person's home. Home is defined as the place most visited at night. More formally, all the locations recorded between 12 AM and 6 AM are clustered and the center of the largest cluster is assumed to be home|\n", "|`max_dist_home`| Maximum distance from home|\n", "|`n_sps`| Number of significant places. All of the static bins are clustered using the DBSCAN algorithm. Each cluster represents a Significant Place (SP) for a user|\n", "|`n_rare`| Number of rarely visited points (referred to as outliers in DBSCAN)|\n", "|`n_transitions`| Number of transitions between significant places|\n", "|`n_top1`, `n_top2`, `n_top3`, `n_top4`, `n_top5`| Number of bins in the top `N` cluster. In other words, `n_top1` shows the number of times the person has visited the most frequently visited place|\n", "|`entropy`, `normalized_entropy`| Entropy of time spent in clusters. Normalized entropy is the entropy divided by the number of clusters|\n", "\n", "3. Local time feature (`niimpy.preprocessing.location.location_local_time`):\n", "\n", "| Feature | Description |\n", "|--------------|-----|\n", "|`time_local`| Local time at the location. This feature is calculated from the time index and the longitude of the location. |\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "5bf0185c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user n_sps max_dist_home n_rare \\\n", "2013-03-31 00:00:00+02:00 gps_u00 5.0 2.074186e+04 3.0 \n", "2013-04-30 00:00:00+03:00 gps_u00 10.0 2.914790e+05 45.0 \n", "2013-05-31 00:00:00+03:00 gps_u00 12.0 1.041741e+06 86.0 \n", "2013-06-30 00:00:00+03:00 gps_u00 1.0 2.035837e+04 15.0 \n", "2013-03-31 00:00:00+02:00 gps_u01 2.0 6.975303e+02 0.0 \n", "2013-04-30 00:00:00+03:00 gps_u01 1.0 1.156568e+04 1.0 \n", "2013-05-31 00:00:00+03:00 gps_u01 1.0 3.957650e+03 1.0 \n", "\n", " n_transitions n_static n_top3 \\\n", "2013-03-31 00:00:00+02:00 48.0 280.0 34.0 \n", "2013-04-30 00:00:00+03:00 194.0 1966.0 135.0 \n", "2013-05-31 00:00:00+03:00 107.0 1827.0 86.0 \n", "2013-06-30 00:00:00+03:00 10.0 22.0 0.0 \n", "2013-03-31 00:00:00+02:00 8.0 307.0 0.0 \n", "2013-04-30 00:00:00+03:00 2.0 1999.0 0.0 \n", "2013-05-31 00:00:00+03:00 2.0 3079.0 0.0 \n", "\n", " normalized_entropy entropy n_top1 ... \\\n", "2013-03-31 00:00:00+02:00 3.163631 5.091668 106.0 ... \n", "2013-04-30 00:00:00+03:00 3.163793 7.284903 1016.0 ... \n", "2013-05-31 00:00:00+03:00 2.696752 6.701177 1030.0 ... \n", "2013-06-30 00:00:00+03:00 0.000000 0.000000 15.0 ... \n", "2013-03-31 00:00:00+02:00 4.392317 3.044522 286.0 ... \n", "2013-04-30 00:00:00+03:00 0.000000 0.000000 1998.0 ... \n", "2013-05-31 00:00:00+03:00 0.000000 0.000000 3078.0 ... 
\n", "\n", " n_moving n_top4 dist_total n_bins speed_max \\\n", "2013-03-31 00:00:00+02:00 8.0 20.0 4.132581e+05 288.0 1.750000 \n", "2013-04-30 00:00:00+03:00 66.0 45.0 2.179693e+06 2032.0 33.250000 \n", "2013-05-31 00:00:00+03:00 76.0 65.0 6.986551e+06 1903.0 34.000000 \n", "2013-06-30 00:00:00+03:00 2.0 0.0 2.252893e+05 24.0 0.559017 \n", "2013-03-31 00:00:00+02:00 18.0 0.0 1.328713e+04 325.0 2.692582 \n", "2013-04-30 00:00:00+03:00 71.0 0.0 1.238429e+05 2070.0 32.750000 \n", "2013-05-31 00:00:00+03:00 34.0 0.0 1.228235e+05 3113.0 20.250000 \n", "\n", " log_variance speed_average speed_variance \\\n", "2013-03-31 00:00:00+02:00 -5.761688 0.033496 0.044885 \n", "2013-04-30 00:00:00+03:00 -1.439133 0.269932 6.129277 \n", "2013-05-31 00:00:00+03:00 2.114892 0.351280 7.590639 \n", "2013-06-30 00:00:00+03:00 -4.200287 0.044126 0.021490 \n", "2013-03-31 00:00:00+02:00 -12.520989 0.056290 0.073370 \n", "2013-04-30 00:00:00+03:00 -10.510017 0.066961 0.629393 \n", "2013-05-31 00:00:00+03:00 -11.364454 0.026392 0.261978 \n", "\n", " variance timezone \n", "2013-03-31 00:00:00+02:00 0.003146 America/New_York \n", "2013-04-30 00:00:00+03:00 0.237133 America/New_York \n", "2013-05-31 00:00:00+03:00 8.288687 America/New_York \n", "2013-06-30 00:00:00+03:00 0.014991 America/New_York \n", "2013-03-31 00:00:00+02:00 0.000004 America/New_York \n", "2013-04-30 00:00:00+03:00 0.000027 America/New_York \n", "2013-05-31 00:00:00+03:00 0.000012 America/New_York \n", "\n", "[7 rows x 23 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import warnings\n", "warnings.filterwarnings('ignore', category=RuntimeWarning)\n", "\n", "# extract all the available features\n", "all_features = nilo.extract_features_location(binned_data)\n", "all_features" ] }, { "cell_type": "code", "execution_count": 9, "id": "d2b2e06b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " dist_total n_bins speed_max user \\\n", "2013-03-31 00:00:00+02:00 4.132581e+05 288.0 1.750000 gps_u00 \n", "2013-04-30 00:00:00+03:00 2.179693e+06 2032.0 33.250000 gps_u00 \n", "2013-05-31 00:00:00+03:00 6.986551e+06 1903.0 34.000000 gps_u00 \n", "2013-06-30 00:00:00+03:00 2.252893e+05 24.0 0.559017 gps_u00 \n", "2013-03-31 00:00:00+02:00 1.328713e+04 325.0 2.692582 gps_u01 \n", "2013-04-30 00:00:00+03:00 1.238429e+05 2070.0 32.750000 gps_u01 \n", "2013-05-31 00:00:00+03:00 1.228235e+05 3113.0 20.250000 gps_u01 \n", "\n", " log_variance speed_average speed_variance \\\n", "2013-03-31 00:00:00+02:00 -5.761688 0.033496 0.044885 \n", "2013-04-30 00:00:00+03:00 -1.439133 0.269932 6.129277 \n", "2013-05-31 00:00:00+03:00 2.114892 0.351280 7.590639 \n", "2013-06-30 00:00:00+03:00 -4.200287 0.044126 0.021490 \n", "2013-03-31 00:00:00+02:00 -12.520989 0.056290 0.073370 \n", "2013-04-30 00:00:00+03:00 -10.510017 0.066961 0.629393 \n", "2013-05-31 00:00:00+03:00 -11.364454 0.026392 0.261978 \n", "\n", " variance \n", "2013-03-31 00:00:00+02:00 0.003146 \n", "2013-04-30 00:00:00+03:00 0.237133 \n", "2013-05-31 00:00:00+03:00 8.288687 \n", "2013-06-30 00:00:00+03:00 0.014991 \n", "2013-03-31 00:00:00+02:00 0.000004 \n", "2013-04-30 00:00:00+03:00 0.000027 \n", "2013-05-31 00:00:00+03:00 0.000012 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# extract only distance related features\n", "distance_features = nilo.location_distance_features(binned_data)\n", "distance_features" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b5af300b", "metadata": {}, "source": [ "Each row corresponds to one user and one month of data; the 7 rows cover the 2 users present in the dataset. Each column represents a feature. For example, in April and May user `gps_u00` has a higher speed variance (`speed_variance`) than user `gps_u01`, and a higher location variance (`variance`) in every month recorded for both users." 
] }, { "attachments": {}, "cell_type": "markdown", "id": "8a8c4ab6", "metadata": {}, "source": [ "## Implementing your own features" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c97375ae", "metadata": {}, "source": [ "To implement a customized feature, define a function that accepts a dataframe and returns a dataframe or a series. The returned object should be indexed by `user`. Then, when calling the `extract_features_location` function, add the newly implemented function to the `features` argument. The default feature functions implemented in `niimpy` are in this variable:" ] }, { "cell_type": "code", "execution_count": 10, "id": "e602497e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{: {},\n", " : {},\n", " : {}}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nilo.ALL_FEATURES" ] }, { "cell_type": "code", "execution_count": 11, "id": "6a042466", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user time latitude \\\n", "2013-03-27 06:00:00+02:00 gps_u00 2013-03-27 06:00:00+02:00 43.759135 \n", "2013-03-27 06:20:00+02:00 gps_u00 2013-03-27 06:20:00+02:00 43.759503 \n", "2013-03-27 06:40:00+02:00 gps_u00 2013-03-27 06:40:00+02:00 43.759134 \n", "2013-03-27 07:00:00+02:00 gps_u00 2013-03-27 07:00:00+02:00 43.759135 \n", "2013-03-27 07:20:00+02:00 gps_u00 2013-03-27 07:20:00+02:00 43.759135 \n", "\n", " longitude speed \n", "2013-03-27 06:00:00+02:00 -72.329240 0.0 \n", "2013-03-27 06:20:00+02:00 -72.329018 0.0 \n", "2013-03-27 06:40:00+02:00 -72.329238 0.0 \n", "2013-03-27 07:00:00+02:00 -72.329240 0.0 \n", "2013-03-27 07:20:00+02:00 -72.329240 0.0 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "binned_data.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a2f91dd5", "metadata": {}, "source": [ "You can add your new function to the `nilo.ALL_FEATURES` dictionary and call the `extract_features_location` function. Alternatively, if you only want to extract your own feature, you can pass a dictionary containing just that function, as shown here:" ] }, { "cell_type": "code", "execution_count": 12, "id": "81d9d0d1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user speed\n", "0 gps_u00 34.00\n", "1 gps_u01 32.75" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# customized feature: maximum recorded speed per user\n", "def max_speed(df):\n", "    # group the binned data by subject\n", "    grouped = df.groupby('user')\n", "    # take the per-user maximum and turn `user` back into a column\n", "    df = grouped['speed'].max().reset_index('user')\n", "    return df\n", "\n", "customized_features = nilo.extract_features_location(\n", "    binned_data,\n", "    features={max_speed: {}}\n", ")\n", "customized_features" ] } ], "metadata": { "kernelspec": { "display_name": "niimpy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 5 }