Survey Data

Survey data consists of the following columns:

* id: the question identifier
* answer: the answer to the question
* q: the text of the question presented to the user (optional)

As usual, the DataFrame index is the timestamp of the answer. By convention, all responses in a single survey instance share the same timestamp, and this is used to link the responses together.

The raw on-disk format is “long” (one row per answer), which is “tidy data”. This is the most flexible format, but you often need to transform it further before analysis.
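As a rough illustration of one such transformation, the sketch below pivots a small long-format table into a wide one (one row per survey instance, one column per question). It uses plain pandas rather than any Niimpy helper, and the frame `long_df` and the index name `timestamp` are made up for this example.

```python
import pandas as pd

# Hypothetical long-format data: one row per answer, shared timestamp per survey
ts = pd.Timestamp("2020-01-01", tz="Europe/Helsinki")
long_df = pd.DataFrame(
    {
        "user": [1, 1, 2, 2],
        "id": ["PHQ2_1", "PHQ2_2", "PHQ2_1", "PHQ2_2"],
        "answer": [1, 2, 2, 2],
    },
    index=pd.DatetimeIndex([ts] * 4, name="timestamp"),
)

# Pivot to wide format: one row per (timestamp, user), one column per question id
wide_df = (
    long_df.reset_index()
    .pivot(index=["timestamp", "user"], columns="id", values="answer")
    .reset_index(level="user")
)
print(wide_df)
```

The shared timestamp is what lets the pivot group the answers of one survey instance into a single row.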

Load data

[1]:
# Artificial example survey data
import pandas as pd  # used below as pd.Timestamp and pd.DataFrame
import niimpy
from niimpy import config
import niimpy.preprocessing.survey as survey
from niimpy.preprocessing.survey import *
import warnings
warnings.filterwarnings("ignore")
[2]:
df = niimpy.read_csv(config.SURVEY_PATH, tz='Europe/Helsinki')
df.head()
[2]:
user age gender Little interest or pleasure in doing things. Feeling down; depressed or hopeless. Feeling nervous; anxious or on edge. Not being able to stop or control worrying. In the last month; how often have you felt that you were unable to control the important things in your life? In the last month; how often have you felt confident about your ability to handle your personal problems? In the last month; how often have you felt that things were going your way? In the last month; how often have you been able to control irritations in your life? In the last month; how often have you felt that you were on top of things? In the last month; how often have you been angered because of things that were outside of your control? In the last month; how often have you felt difficulties were piling up so high that you could not overcome them?
0 1 20 Male several-days more-than-half-the-days not-at-all nearly-every-day almost-never sometimes fairly-often never sometimes very-often fairly-often
1 2 32 Male more-than-half-the-days more-than-half-the-days not-at-all several-days never never very-often sometimes never fairly-often never
2 3 15 Male more-than-half-the-days not-at-all several-days not-at-all never very-often very-often fairly-often never never almost-never
3 4 35 Female not-at-all nearly-every-day not-at-all several-days very-often fairly-often very-often never sometimes never fairly-often
4 5 23 Male more-than-half-the-days not-at-all more-than-half-the-days several-days almost-never very-often almost-never sometimes sometimes very-often never

Preprocessing

Currently the dataframe columns are raw questions and answers from the survey. We will use Niimpy to convert them to a numerical format, but first the dataframe should follow the general Niimpy Schema. The rows should be indexed by a datetime index, rather than a number.

Since the data does not contain a timestamp, we must assume that each user has only completed the survey once. If the surveys were completed on January 1st 2020, for example, we would replace the index with this date.

[3]:
# Assign the same time index to all survey responses
df.index = [pd.Timestamp("2020-01-01", tz='Europe/Helsinki')] * df.shape[0]

Next we will convert the questions to a standard identifier format that Niimpy understands. The questions come from the PHQ2, GAD2 and PSS10 standard surveys. Niimpy provides mappings from raw question text to question ids for these surveys. Each identifier is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI etc.) followed by the question number (1, 2, 3). You can define your own identifiers or use the ones provided by Niimpy.

Before applying the mapping, the column names should be cleaned using the clean_survey_column_names function. This removes punctuation from the question text.

[4]:
# For example, the mapping dictionary for PHQ2 is
PHQ2_MAP
[4]:
{'Little interest or pleasure in doing things': 'PHQ2_1',
 'Feeling down depressed or hopeless': 'PHQ2_2'}
[5]:
# Convert column name to id, based on provided mappers from niimpy
column_map = {**PHQ2_MAP, **PSS10_MAP, **GAD2_MAP}
df = survey.clean_survey_column_names(df)
df = df.rename(column_map, axis = 1)
df.head()
[5]:
user age gender PHQ2_1 PHQ2_2 GAD2_1 GAD2_2 PSS10_2 PSS10_4 PSS10_5 PSS10_6 PSS10_7 PSS10_8 PSS10_9
2020-01-01 00:00:00+02:00 1 20 Male several-days more-than-half-the-days not-at-all nearly-every-day almost-never sometimes fairly-often never sometimes very-often fairly-often
2020-01-01 00:00:00+02:00 2 32 Male more-than-half-the-days more-than-half-the-days not-at-all several-days never never very-often sometimes never fairly-often never
2020-01-01 00:00:00+02:00 3 15 Male more-than-half-the-days not-at-all several-days not-at-all never very-often very-often fairly-often never never almost-never
2020-01-01 00:00:00+02:00 4 35 Female not-at-all nearly-every-day not-at-all several-days very-often fairly-often very-often never sometimes never fairly-often
2020-01-01 00:00:00+02:00 5 23 Male more-than-half-the-days not-at-all more-than-half-the-days several-days almost-never very-often almost-never sometimes sometimes very-often never
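To make the clean-then-rename step more concrete, here is a minimal plain-pandas sketch of the same idea. It is not Niimpy's implementation: `strip_punctuation`, `raw` and `question_map` are hypothetical names, and the assumption is simply that cleaning removes ASCII punctuation so that the question text matches the mapping keys.

```python
import string
import pandas as pd

# Hypothetical one-row frame with a raw question text as the column name
raw = pd.DataFrame({
    "user": [1],
    "Feeling down; depressed or hopeless.": ["more-than-half-the-days"],
})

# Question text -> question id, in the same shape as the PHQ2 mapping
question_map = {"Feeling down depressed or hopeless": "PHQ2_2"}

def strip_punctuation(name):
    """Drop ASCII punctuation and surrounding whitespace from a column name."""
    return name.translate(str.maketrans("", "", string.punctuation)).strip()

cleaned = raw.rename(columns=strip_punctuation)    # remove ";" and "."
renamed = cleaned.rename(columns=question_map)     # question text -> id
print(list(renamed.columns))
```

Columns without an entry in the mapping, such as user, are left unchanged by rename.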

Now the dataframe follows the Niimpy standard schema. Next we will use Niimpy to convert the raw answers to numerical values for further analysis. For this, we need a mapping {raw_answer: numerical_answer}, which Niimpy provides within the survey module. You can also use your own mapping.

Based on the question’s id, Niimpy maps the raw answers to their numerical representation.

[6]:
# The mapping dictionary included in Niimpy is
ID_MAP_PREFIX
[6]:
{'PSS': {'never': 0,
  'almost never': 1,
  'sometimes': 2,
  'fairly often': 3,
  'very often': 4},
 'PHQ2': {'not at all': 0,
  'several days': 1,
  'more than half the days': 2,
  'nearly every day': 3},
 'GAD2': {'not at all': 0,
  'several days': 1,
  'more than half the days': 2,
  'nearly every day': 3}}
[7]:
# Transform raw answers to numerical values
transformed_df = survey.convert_survey_to_numerical_answer(
    df, id_map=ID_MAP_PREFIX, use_prefix=True
)
transformed_df.head()
[7]:
user age gender PHQ2_1 PHQ2_2 GAD2_1 GAD2_2 PSS10_2 PSS10_4 PSS10_5 PSS10_6 PSS10_7 PSS10_8 PSS10_9
2020-01-01 00:00:00+02:00 1 20 Male 1 2 0 3 1 2 3 0 2 4 3
2020-01-01 00:00:00+02:00 2 32 Male 2 2 0 1 0 0 4 2 0 3 0
2020-01-01 00:00:00+02:00 3 15 Male 2 0 1 0 0 4 4 3 0 0 1
2020-01-01 00:00:00+02:00 4 35 Female 0 3 0 1 4 3 4 0 2 0 3
2020-01-01 00:00:00+02:00 5 23 Male 2 0 2 1 1 4 1 2 2 4 0
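The prefix-based lookup can be sketched in plain pandas as follows. This is only an illustration of the idea, not Niimpy's code: `to_numeric_by_prefix` is a hypothetical helper, and replacing hyphens with spaces before the lookup is an assumption made here so that answers such as several-days match keys such as 'several days'.

```python
import pandas as pd

# Subset of a prefix -> {raw answer: number} mapping
answer_map = {
    "PHQ2": {"not at all": 0, "several days": 1,
             "more than half the days": 2, "nearly every day": 3},
    "PSS": {"never": 0, "almost never": 1, "sometimes": 2,
            "fairly often": 3, "very often": 4},
}

raw = pd.DataFrame({
    "user": [1, 2],
    "PHQ2_1": ["several-days", "more-than-half-the-days"],
    "PSS10_2": ["almost-never", "never"],
})

def to_numeric_by_prefix(df, prefix_map):
    """Map answers in every column whose name starts with a known prefix."""
    out = df.copy()
    for column in out.columns:
        for prefix, mapping in prefix_map.items():
            if column.startswith(prefix):
                # Assumption: hyphens in raw answers stand for spaces
                normalized = out[column].str.replace("-", " ", regex=False)
                out[column] = normalized.map(mapping)
                break
    return out

numeric = to_numeric_by_prefix(raw, answer_map)
print(numeric)
```

Columns that match no prefix (here user) pass through untouched, and answers missing from a mapping would become NaN.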

Survey score sums

Next we can calculate the sum of each survey using the survey ID in the column name.

[8]:
sum_df = sum_survey_scores(transformed_df, ["PHQ2", "PSS10", "GAD2"])
sum_df.head()
[8]:
user PHQ2 PSS10 GAD2
2020-01-01 00:00:00+02:00 1 3 15 3
2020-01-01 00:00:00+02:00 2 4 9 1
2020-01-01 00:00:00+02:00 3 2 12 1
2020-01-01 00:00:00+02:00 4 3 16 1
2020-01-01 00:00:00+02:00 5 2 14 3
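The summing step amounts to selecting the columns that share a survey prefix and adding them row by row. A minimal pandas sketch of that idea, where `sum_by_prefix` is a hypothetical helper and not the Niimpy function:

```python
import pandas as pd

# Numerical answers for two users
scores = pd.DataFrame({
    "user": [1, 2],
    "PHQ2_1": [1, 2], "PHQ2_2": [2, 2],
    "GAD2_1": [0, 0], "GAD2_2": [3, 1],
})

def sum_by_prefix(df, prefixes):
    """Sum, per row, all question columns that start with each survey prefix."""
    sums = df[["user"]].copy()
    for prefix in prefixes:
        question_columns = [c for c in df.columns if c.startswith(prefix + "_")]
        sums[prefix] = df[question_columns].sum(axis=1)
    return sums

sums = sum_by_prefix(scores, ["PHQ2", "GAD2"])
print(sums)
```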

Survey statistics

Another common preprocessing step is to resample results to reduce noise or simplify the data. The survey.survey_statistic function splits the results into time intervals and returns relevant statistics of each survey sum or each question column over each interval.

Note that since the example data contains a single response time for each participant, the standard deviation is NaN and the mean, minimum and maximum all equal the single observed value.

[9]:
survey.survey_statistic(sum_df, {
    "columns": ["PHQ2", "PSS10", "GAD2"]
})
[9]:
user PHQ2_mean PHQ2_min PHQ2_max PHQ2_std PSS10_mean PSS10_min PSS10_max PSS10_std GAD2_mean GAD2_min GAD2_max GAD2_std
2020-01-01 00:00:00+02:00 1 3.0 3.0 3.0 NaN 15.0 15.0 15.0 NaN 3.0 3.0 3.0 NaN
2020-01-01 00:00:00+02:00 2 4.0 4.0 4.0 NaN 9.0 9.0 9.0 NaN 1.0 1.0 1.0 NaN
2020-01-01 00:00:00+02:00 3 2.0 2.0 2.0 NaN 12.0 12.0 12.0 NaN 1.0 1.0 1.0 NaN
2020-01-01 00:00:00+02:00 4 3.0 3.0 3.0 NaN 16.0 16.0 16.0 NaN 1.0 1.0 1.0 NaN
2020-01-01 00:00:00+02:00 5 2.0 2.0 2.0 NaN 14.0 14.0 14.0 NaN 3.0 3.0 3.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-01-01 00:00:00+02:00 996 3.0 3.0 3.0 NaN 17.0 17.0 17.0 NaN 2.0 2.0 2.0 NaN
2020-01-01 00:00:00+02:00 997 0.0 0.0 0.0 NaN 13.0 13.0 13.0 NaN 1.0 1.0 1.0 NaN
2020-01-01 00:00:00+02:00 998 2.0 2.0 2.0 NaN 13.0 13.0 13.0 NaN 2.0 2.0 2.0 NaN
2020-01-01 00:00:00+02:00 999 4.0 4.0 4.0 NaN 21.0 21.0 21.0 NaN 5.0 5.0 5.0 NaN
2020-01-01 00:00:00+02:00 1000 4.0 4.0 4.0 NaN 14.0 14.0 14.0 NaN 2.0 2.0 2.0 NaN

1000 rows × 13 columns
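For intuition about where the NaN values come from, a comparable per-user, per-interval aggregation can be written directly in pandas. This is a sketch under assumptions, not the Niimpy implementation: the one-day interval and the frame `sum_df_small` are chosen only for illustration.

```python
import pandas as pd

ts = pd.Timestamp("2020-01-01", tz="Europe/Helsinki")
sum_df_small = pd.DataFrame(
    {"user": [1, 2], "PHQ2": [3, 4]},
    index=pd.DatetimeIndex([ts, ts]),
)

# Group by user and by one-day bins of the timestamp index, then aggregate
stats = (
    sum_df_small
    .groupby(["user", pd.Grouper(freq="1D")])["PHQ2"]
    .agg(["mean", "min", "max", "std"])
)
print(stats)
```

Each group holds exactly one observation, so the sample standard deviation is undefined and pandas reports it as NaN.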

survey_statistic also works for individual questions. You can select the questionnaire you want statistics for by passing a value to the prefix parameter, or pass a list of questions as the columns parameter.

[10]:
d = survey.survey_statistic(transformed_df, {
    "prefix":'PHQ',
})
pd.DataFrame(d)
[10]:
user PHQ2_1_mean PHQ2_1_min PHQ2_1_max PHQ2_1_std PHQ2_2_mean PHQ2_2_min PHQ2_2_max PHQ2_2_std
2020-01-01 00:00:00+02:00 1 1.0 1.0 1.0 NaN 2.0 2.0 2.0 NaN
2020-01-01 00:00:00+02:00 2 2.0 2.0 2.0 NaN 2.0 2.0 2.0 NaN
2020-01-01 00:00:00+02:00 3 2.0 2.0 2.0 NaN 0.0 0.0 0.0 NaN
2020-01-01 00:00:00+02:00 4 0.0 0.0 0.0 NaN 3.0 3.0 3.0 NaN
2020-01-01 00:00:00+02:00 5 2.0 2.0 2.0 NaN 0.0 0.0 0.0 NaN
... ... ... ... ... ... ... ... ... ...
2020-01-01 00:00:00+02:00 996 0.0 0.0 0.0 NaN 3.0 3.0 3.0 NaN
2020-01-01 00:00:00+02:00 997 0.0 0.0 0.0 NaN 0.0 0.0 0.0 NaN
2020-01-01 00:00:00+02:00 998 1.0 1.0 1.0 NaN 1.0 1.0 1.0 NaN
2020-01-01 00:00:00+02:00 999 2.0 2.0 2.0 NaN 2.0 2.0 2.0 NaN
2020-01-01 00:00:00+02:00 1000 2.0 2.0 2.0 NaN 2.0 2.0 2.0 NaN

1000 rows × 9 columns