Survey Data
Surveys consist of the columns:
* id for the question identifier
* answer for the answer to the question
* q for the text of the question presented to the user (optional)
As usual, the DataFrame index is the timestamp of the answer. By convention, all responses in a single survey instance share the same timestamp, and this is used to link surveys together.
The raw on-disk format is “long”, that is, one row per answer (“tidy data”). This is the most flexible format, but you will often need further transformations, such as pivoting to a wide format with one column per question, as sketched below.
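The following is a minimal pandas sketch, not a niimpy function: the question ids, answers and timestamp are made up, and it only illustrates the long format and one possible wide-format transformation.

import pandas as pd

# A made-up long-format survey: one row per answer, indexed by the
# timestamp of the survey instance (all rows share the same timestamp).
long_df = pd.DataFrame(
    {
        "id": ["PHQ2_1", "PHQ2_2"],
        "answer": ["several-days", "not-at-all"],
        "q": ["Little interest or pleasure in doing things.",
              "Feeling down; depressed or hopeless."],
    },
    index=pd.to_datetime(["2020-01-01", "2020-01-01"]),
)

# Pivot to wide format: one column per question id, one row per survey instance.
wide_df = long_df.pivot(columns="id", values="answer")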
Load data
[1]:
# Artificial example survey data
import pandas as pd  # used later to build the timestamp index

import niimpy
from niimpy import config
import niimpy.preprocessing.survey as survey
# Also brings in the mapping dictionaries and helper functions used below
from niimpy.preprocessing.survey import *

import warnings
warnings.filterwarnings("ignore")
[2]:
df = niimpy.read_csv(config.SURVEY_PATH, tz='Europe/Helsinki')
df.head()
[2]:
user | age | gender | Little interest or pleasure in doing things. | Feeling down; depressed or hopeless. | Feeling nervous; anxious or on edge. | Not being able to stop or control worrying. | In the last month; how often have you felt that you were unable to control the important things in your life? | In the last month; how often have you felt confident about your ability to handle your personal problems? | In the last month; how often have you felt that things were going your way? | In the last month; how often have you been able to control irritations in your life? | In the last month; how often have you felt that you were on top of things? | In the last month; how often have you been angered because of things that were outside of your control? | In the last month; how often have you felt difficulties were piling up so high that you could not overcome them? | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 20 | Male | several-days | more-than-half-the-days | not-at-all | nearly-every-day | almost-never | sometimes | fairly-often | never | sometimes | very-often | fairly-often |
1 | 2 | 32 | Male | more-than-half-the-days | more-than-half-the-days | not-at-all | several-days | never | never | very-often | sometimes | never | fairly-often | never |
2 | 3 | 15 | Male | more-than-half-the-days | not-at-all | several-days | not-at-all | never | very-often | very-often | fairly-often | never | never | almost-never |
3 | 4 | 35 | Female | not-at-all | nearly-every-day | not-at-all | several-days | very-often | fairly-often | very-often | never | sometimes | never | fairly-often |
4 | 5 | 23 | Male | more-than-half-the-days | not-at-all | more-than-half-the-days | several-days | almost-never | very-often | almost-never | sometimes | sometimes | very-often | never |
Preprocessing
Currently the dataframe columns are the raw questions and answers from the survey. We will use Niimpy to convert them to a numerical format, but first the dataframe should follow the general Niimpy schema: the rows should be indexed by a datetime index rather than a number.
Since the data does not contain a timestamp, we must assume that each user has completed the survey only once. If the surveys were completed on January 1st 2020, for example, we would replace the index with this date.
[3]:
# Assign the same time index to all survey responses
df.index = [pd.Timestamp("1.1.2020", tz='Europe/Helsinki')]*df.shape[0]
Next we will convert the questions to a standard identifier format that Niimpy understands. The questions are from the standard PHQ2, GAD2 and PSS10 surveys, and Niimpy provides mappings from the raw question text to question ids for these surveys. Each identifier is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI etc.) followed by the question number (1, 2, 3, …). You can define your own identifiers or use the ones provided by Niimpy; a hypothetical custom entry is sketched below.
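A self-defined identifier is simply another entry in the column mapping; the question text and the "STRESS_1" id below are made up for illustration.

# Hypothetical custom question text mapped to a self-defined identifier
CUSTOM_MAP = {"How stressed do you feel right now": "STRESS_1"}
# It could be merged with the provided mappers, e.g. {**PHQ2_MAP, **CUSTOM_MAP}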
Before applying the mapping, the column names should be cleaned using the clean_survey_column_names function. This removes punctuation from the question text so that it matches the keys of the mapping dictionaries.
[4]:
# For example, the mapping dictionary for PHQ2 is
PHQ2_MAP
[4]:
{'Little interest or pleasure in doing things': 'PHQ2_1',
'Feeling down depressed or hopeless': 'PHQ2_2'}
[5]:
# Convert column names to ids, based on the mappers provided by niimpy
column_map = {**PHQ2_MAP, **PSS10_MAP, **GAD2_MAP}
df = survey.clean_survey_column_names(df)
df = df.rename(column_map, axis=1)
df.head()
[5]:
user | age | gender | PHQ2_1 | PHQ2_2 | GAD2_1 | GAD2_2 | PSS10_2 | PSS10_4 | PSS10_5 | PSS10_6 | PSS10_7 | PSS10_8 | PSS10_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-01-01 00:00:00+02:00 | 1 | 20 | Male | several-days | more-than-half-the-days | not-at-all | nearly-every-day | almost-never | sometimes | fairly-often | never | sometimes | very-often | fairly-often |
2020-01-01 00:00:00+02:00 | 2 | 32 | Male | more-than-half-the-days | more-than-half-the-days | not-at-all | several-days | never | never | very-often | sometimes | never | fairly-often | never |
2020-01-01 00:00:00+02:00 | 3 | 15 | Male | more-than-half-the-days | not-at-all | several-days | not-at-all | never | very-often | very-often | fairly-often | never | never | almost-never |
2020-01-01 00:00:00+02:00 | 4 | 35 | Female | not-at-all | nearly-every-day | not-at-all | several-days | very-often | fairly-often | very-often | never | sometimes | never | fairly-often |
2020-01-01 00:00:00+02:00 | 5 | 23 | Male | more-than-half-the-days | not-at-all | more-than-half-the-days | several-days | almost-never | very-often | almost-never | sometimes | sometimes | very-often | never |
Now the dataframe follows the Niimpy standard schema. Next we will use niimpy to convert the raw answers to numerical values for further analysis. For this, we need a mapping {raw_answer: numerical_answer}, which niimpy provides within the survey module; you can also use your own mapping. Based on the question’s id, niimpy maps the raw answers to their numerical representation. A hypothetical custom mapping is sketched after the built-in one below.
[6]:
# The mapping dictionary included in Niimpy is
ID_MAP_PREFIX
[6]:
{'PSS': {'never': 0,
'almost never': 1,
'sometimes': 2,
'fairly often': 3,
'very often': 4},
'PHQ2': {'not at all': 0,
'several days': 1,
'more than half the days': 2,
'nearly every day': 3},
'GAD2': {'not at all': 0,
'several days': 1,
'more than half the days': 2,
'nearly every day': 3}}
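A custom answer mapping follows the same {prefix: {raw_answer: numerical_answer}} structure; the "STRESS" prefix and its answer scale below are hypothetical and only illustrate the expected shape.

# Hypothetical mapping, keyed by questionnaire prefix like ID_MAP_PREFIX
CUSTOM_ID_MAP = {
    "STRESS": {"not at all": 0, "somewhat": 1, "very much": 2},
}
# It would be passed in the same way as the built-in mapping, e.g.
# survey.convert_survey_to_numerical_answer(df, id_map=CUSTOM_ID_MAP, use_prefix=True)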
[7]:
# Transform raw answers to numerical values
transformed_df = survey.convert_survey_to_numerical_answer(
df, id_map=ID_MAP_PREFIX, use_prefix=True
)
transformed_df.head()
[7]:
user | age | gender | PHQ2_1 | PHQ2_2 | GAD2_1 | GAD2_2 | PSS10_2 | PSS10_4 | PSS10_5 | PSS10_6 | PSS10_7 | PSS10_8 | PSS10_9 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-01-01 00:00:00+02:00 | 1 | 20 | Male | 1 | 2 | 0 | 3 | 1 | 2 | 3 | 0 | 2 | 4 | 3 |
2020-01-01 00:00:00+02:00 | 2 | 32 | Male | 2 | 2 | 0 | 1 | 0 | 0 | 4 | 2 | 0 | 3 | 0 |
2020-01-01 00:00:00+02:00 | 3 | 15 | Male | 2 | 0 | 1 | 0 | 0 | 4 | 4 | 3 | 0 | 0 | 1 |
2020-01-01 00:00:00+02:00 | 4 | 35 | Female | 0 | 3 | 0 | 1 | 4 | 3 | 4 | 0 | 2 | 0 | 3 |
2020-01-01 00:00:00+02:00 | 5 | 23 | Male | 2 | 0 | 2 | 1 | 1 | 4 | 1 | 2 | 2 | 4 | 0 |
Survey score sums
Next we can calculate the score sum of each survey, using the survey ID in the column names.
[8]:
sum_df = sum_survey_scores(transformed_df, ["PHQ2", "PSS10", "GAD2"])
sum_df.head()
[8]:
user | PHQ2 | PSS10 | GAD2 | |
---|---|---|---|---|
2020-01-01 00:00:00+02:00 | 1 | 3 | 15 | 3 |
2020-01-01 00:00:00+02:00 | 2 | 4 | 9 | 1 |
2020-01-01 00:00:00+02:00 | 3 | 2 | 12 | 1 |
2020-01-01 00:00:00+02:00 | 4 | 3 | 16 | 1 |
2020-01-01 00:00:00+02:00 | 5 | 2 | 14 | 3 |
Survey statistics
Another common preprocessing step is to resample the results to reduce noise or simplify the data. The survey.survey_statistic function splits the results by time interval and returns relevant statistics for each survey sum or each question column over that interval.
Note that since the example data contains a single time point per participant, the standard deviation is NaN and the other statistics are predictable.
[9]:
survey.survey_statistic(sum_df, {
"columns": ["PHQ2", "PSS10", "GAD2"]
})
[9]:
user | PHQ2_mean | PHQ2_min | PHQ2_max | PHQ2_std | PSS10_mean | PSS10_min | PSS10_max | PSS10_std | GAD2_mean | GAD2_min | GAD2_max | GAD2_std | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-01-01 00:00:00+02:00 | 1 | 3.0 | 3.0 | 3.0 | NaN | 15.0 | 15.0 | 15.0 | NaN | 3.0 | 3.0 | 3.0 | NaN |
2020-01-01 00:00:00+02:00 | 2 | 4.0 | 4.0 | 4.0 | NaN | 9.0 | 9.0 | 9.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
2020-01-01 00:00:00+02:00 | 3 | 2.0 | 2.0 | 2.0 | NaN | 12.0 | 12.0 | 12.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
2020-01-01 00:00:00+02:00 | 4 | 3.0 | 3.0 | 3.0 | NaN | 16.0 | 16.0 | 16.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
2020-01-01 00:00:00+02:00 | 5 | 2.0 | 2.0 | 2.0 | NaN | 14.0 | 14.0 | 14.0 | NaN | 3.0 | 3.0 | 3.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-01-01 00:00:00+02:00 | 996 | 3.0 | 3.0 | 3.0 | NaN | 17.0 | 17.0 | 17.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
2020-01-01 00:00:00+02:00 | 997 | 0.0 | 0.0 | 0.0 | NaN | 13.0 | 13.0 | 13.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
2020-01-01 00:00:00+02:00 | 998 | 2.0 | 2.0 | 2.0 | NaN | 13.0 | 13.0 | 13.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
2020-01-01 00:00:00+02:00 | 999 | 4.0 | 4.0 | 4.0 | NaN | 21.0 | 21.0 | 21.0 | NaN | 5.0 | 5.0 | 5.0 | NaN |
2020-01-01 00:00:00+02:00 | 1000 | 4.0 | 4.0 | 4.0 | NaN | 14.0 | 14.0 | 14.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
1000 rows × 13 columns
survey_statistic also works for individual questions. You can specify the questionnaire you want statistics for by passing a value to the prefix parameter, or pass a list of questions as the columns parameter (a columns example is sketched after the output below).
[10]:
d = survey.survey_statistic(transformed_df, {
"prefix":'PHQ',
})
pd.DataFrame(d)
[10]:
user | PHQ2_1_mean | PHQ2_1_min | PHQ2_1_max | PHQ2_1_std | PHQ2_2_mean | PHQ2_2_min | PHQ2_2_max | PHQ2_2_std | |
---|---|---|---|---|---|---|---|---|---|
2020-01-01 00:00:00+02:00 | 1 | 1.0 | 1.0 | 1.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
2020-01-01 00:00:00+02:00 | 2 | 2.0 | 2.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
2020-01-01 00:00:00+02:00 | 3 | 2.0 | 2.0 | 2.0 | NaN | 0.0 | 0.0 | 0.0 | NaN |
2020-01-01 00:00:00+02:00 | 4 | 0.0 | 0.0 | 0.0 | NaN | 3.0 | 3.0 | 3.0 | NaN |
2020-01-01 00:00:00+02:00 | 5 | 2.0 | 2.0 | 2.0 | NaN | 0.0 | 0.0 | 0.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-01-01 00:00:00+02:00 | 996 | 0.0 | 0.0 | 0.0 | NaN | 3.0 | 3.0 | 3.0 | NaN |
2020-01-01 00:00:00+02:00 | 997 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | NaN |
2020-01-01 00:00:00+02:00 | 998 | 1.0 | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
2020-01-01 00:00:00+02:00 | 999 | 2.0 | 2.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
2020-01-01 00:00:00+02:00 | 1000 | 2.0 | 2.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | NaN |
1000 rows × 9 columns
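For the columns variant, a short sketch using the GAD2 question columns shown earlier (same calling convention as above):

# Statistics for explicitly selected question columns
d = survey.survey_statistic(transformed_df, {
    "columns": ["GAD2_1", "GAD2_2"],
})
pd.DataFrame(d)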