Project

1. Project Description

Requirements

  • Choose one topic that interests you.

  • Explore the data, analyze it, and draw conclusions from what you find.

  • Write a report in the corresponding format (see the Report Format section below).

Submission Instructions

  • Submit all your code in a zip archive, together with your PDF report and a brief README file explaining how to set up the environment (if needed) and run your code.

  • Submit the zip archive to the corresponding MyCourses submission box under the Project section.

  • Note that the use of AI tools (e.g., ChatGPT) for generating text or computer code is strictly forbidden and is considered cheating/plagiarism.

  • You are strongly advised to focus on the quality of your analysis rather than on the word count.

2. Report

Report Format

  • Minimum 10 pages (excluding references). The last page(s) should contain references.

  • Reference pages do NOT count towards the total number of pages.

  • Include a minimum of 2 plots (they must be meaningful and properly sized).

  • Font size 12, single column, written in English.

  • The report must be in PDF format.

  • The report should be anonymous. DO NOT include any information that might reveal your identity to your peers.

  • DO NOT include any code/pseudocode/screenshots of your code in the report.

Report Outline

  • Introduction (setting the context, related work or literature review, etc.)

  • Problem Formulation (describe the issue that you want to solve)

  • Dataset Description (data types, missing data, observations, etc.)

  • Methods (your methods)

  • Results

  • Conclusion & Discussion

  • References (you can use any reference style, but use only one style in your report)

Guide on how to format references in APA style: https://libguides.murdoch.edu.au/APA

3. Rubric

The submissions will be graded based on the rubric provided in the Project Grading page.

4. Topics

Here you are given a list of options for your final project. You can either explore one of these or a dataset of your choice. If you choose your own dataset, you must get it approved by the course staff within the first two weeks. The tasks and analyses under each topic are suggestions to make your work easier. They might not be enough for a complete project. If you want to get full points, please either conduct your own analysis in addition to the suggested tasks or give an in-depth study of the given tasks.

Topic 1: Daily Activity Analysis with Fitbit Tracker Data

Data: https://www.kaggle.com/arashnic/fitbit

You are given data that was collected using the Fitbit wristwatch. It describes the daily activities of 30 volunteers for 31 consecutive days. The dailyActivity_merged.csv file contains an overview of each user's daily activities. The other files whose names start with “daily” contain daily overviews of the tracked data.

The “hourly” and “minute” files show how the tracked activities are distributed throughout the day. weightLogInfo_merged.csv provides background information about the weight of the user.

Suggested Tasks:

  1. Clean the data and remove N/A values and outliers, if there are any.

  2. Analyse the data and plot interesting observations:

    • Minimum one observation on the group level

    • Minimum one observation on the subject level

  3. Explore one (or more) of these topics:

    • Draw conclusions about the subject’s and the community’s lifestyle. Compare with the calorie intake, sleep length, and step count recommended by the WHO.

    • Draw conclusions about tendencies in the individuals’ daily activities (in which situations an individual eats more, sleeps more, or exercises more). Explore the patterns.

    • Your own idea is welcome.
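
As an illustration of the cleaning step in task 1, here is a minimal sketch that drops rows with missing step counts and then removes outliers with the 1.5 × IQR rule. It is only one possible approach, not a required method. The column names Id and TotalSteps are assumed to match dailyActivity_merged.csv (check the file header), the sample values are made up, and the cutoff k = 1.5 is a conventional choice rather than part of the assignment.

```python
import csv
import io
import statistics

# Strings treated as "missing" (an assumption; adapt to what the files contain).
MISSING = {"", "NA", "N/A", "NAN"}


def load_daily_steps(csv_text):
    """Return (user id, steps) pairs, skipping rows whose step count is missing."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get("TotalSteps") or "").strip()
        if raw.upper() in MISSING:
            continue  # drop rows without a usable value
        rows.append((row["Id"], float(raw)))
    return rows


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, k=1.5):
    """Keep only rows whose step count lies inside the IQR fences."""
    low, high = iqr_bounds([steps for _, steps in rows], k)
    return [(uid, steps) for uid, steps in rows if low <= steps <= high]


# Tiny made-up sample, not real Fitbit data.
SAMPLE = """Id,TotalSteps
1,7000
1,8000
1,8500
1,N/A
2,9000
2,9500
2,10000
3,10500
3,11000
3,
3,12000
3,250000
"""

rows = load_daily_steps(SAMPLE)
cleaned = drop_outliers(rows)
print(len(rows), "rows after removing missing values")
print(len(cleaned), "rows after removing outliers")
```

Whether an extreme value is an error or a genuine observation depends on the context, so any removal rule should be justified in the report.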

Inspiration: https://www.freecodecamp.org/news/how-i-analyzed-the-data-from-my-fitbit-to-improve-my-overall-health-a2e36426d8f9/

Topic 2: Depression Severity Analysis Using Reported Health Data

Data: https://zenodo.org/records/5085146

In this project, you will explore the data collected from a 1-year longitudinal study called the DiSCover Project. The dataset contains person-generated health data (PGHD) focused on tracking individual changes in depression status over time. The dataset comprises >150 columns, including aggregated features derived from wearable data (e.g., sleep and steps), lifestyle surveys, and Patient Health Questionnaire-9 (PHQ-9) monthly questionnaire scores. You are encouraged to choose an appropriate subset of features that align with your analysis goals.

Suggested Tasks:

  1. Clean the dataset, address issues such as missing PHQ-9 values, and aggregate the existing data for further analysis.

  2. Analyze the data to explore key trends, such as:

    • Investigating patterns in depression scores across time intervals.

    • Examining relationships between wearable data (steps, sleep) and changes in PHQ-9 scores.

  3. Explore one (or more) of these topics:

    • Conduct a time-series analysis of depression status changes over 3-month intervals to see how wearable PGHD and lifestyle changes correlate with PHQ-9 score variations.

    • Use machine learning models to predict depression severity based on the combination of wearable data and self-reported survey responses (see the inspiration link below).

    • Apply clustering techniques to group participants based on behavioral and lifestyle data, and investigate how they relate to depression severity scores.

    • Investigate the correlation between demographic factors, medication changes, and lifestyle modifications with depression status, identifying potential markers of mental health outcomes.

    • Your own interesting ideas are welcome. Get creative!
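
When working with PHQ-9 totals, it can help to map each score to a severity band and to look at changes between consecutive observations. The sketch below uses the commonly cited PHQ-9 cutoffs (0-4, 5-9, 10-14, 15-19, 20-27). The function names and the sample scores are hypothetical, and representing a missing month as None is an assumption about how you load the data.

```python
# Upper bound of each band, paired with its usual name.
PHQ9_BANDS = [
    (4, "minimal"),
    (9, "mild"),
    (14, "moderate"),
    (19, "moderately severe"),
    (27, "severe"),
]


def phq9_severity(score):
    """Map a PHQ-9 total (0-27) to a severity band; None stays None."""
    if score is None:
        return None
    if not 0 <= score <= 27:
        raise ValueError(f"PHQ-9 total out of range: {score}")
    for upper, name in PHQ9_BANDS:
        if score <= upper:
            return name


def score_changes(scores):
    """Differences between consecutive observed scores (missing ones skipped)."""
    observed = [s for s in scores if s is not None]
    return [later - earlier for earlier, later in zip(observed, observed[1:])]


# Made-up monthly scores for one participant; None marks a missing month.
monthly = [12, None, 9, 4, 15]
severities = [phq9_severity(s) for s in monthly]
changes = score_changes(monthly)
print(severities)
print(changes)
```

Note that skipping missing months means a single change may span more than one month. Whether that is acceptable, or whether imputation is more appropriate, is a decision to justify in your analysis.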

Reference: DiSCover Project: https://clinicaltrials.gov/ct2/show/NCT03421223

Inspiration: https://dl.acm.org/doi/10.1145/3469266.3469878

Topic 3: Sleep and Activity Analysis

Data: https://physionet.org/content/mmash/1.0.0/

You are given data from the Multilevel Monitoring of Activity and Sleep in Healthy people (MMASH) dataset, which links wearable activity, sleep quality, physiological signals, and psychological self-reports. The data was collected over two consecutive days for 22 healthy participants. The dataset includes heart rate, accelerometer, sleep, activity, saliva biomarker, and psychological measures, allowing analysis of the relationships between physical activity, sleep quality, and psychological characteristics.

Suggested Tasks:

  1. Loop through each participant folder and combine all participant data.

  2. Clean the data and remove N/A values, if there are any, or replace the missing values with predictions.

  3. Analyse the data and plot interesting observations:

    • Minimum one observation on the group level

    • Minimum one observation on the subject level

  4. Explore one (or more) of these topics:

    • Conduct a time-series analysis to examine how daytime activity patterns relate to sleep quality.

    • Use regression or machine learning models to test whether HRV or movement predicts sleep fragmentation across participants.

    • Investigate relationships between hormone levels and subjective and objective sleep measures.

    • Investigate associations between self-reported stress or mood scores and objective sleep quality.

    • Your own ideas are welcome.
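
Task 1 above (looping through participant folders) could look roughly like the following sketch. It assumes one sub-folder per participant and the same CSV file name in each folder. The folder names user_1 and user_2, the file name sleep.csv, and its columns are made up for the demo, so check the actual layout of the downloaded dataset.

```python
import csv
import tempfile
from pathlib import Path


def combine_participants(root, filename):
    """Read `filename` from every participant folder under `root` into one list.

    Each row is a dict of column name -> string value, with an extra
    "participant" key holding the folder name.
    """
    combined = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        path = folder / filename
        if not path.is_file():
            continue  # this participant has no such file
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["participant"] = folder.name
                combined.append(row)
    return combined


# Demo on a throwaway directory with two made-up participants.
with tempfile.TemporaryDirectory() as tmp:
    for name, hours in [("user_1", ["7.5", "6.0"]), ("user_2", ["8.0"])]:
        folder = Path(tmp) / name
        folder.mkdir()
        lines = ["night,hours_slept"]
        lines += [f"{i},{h}" for i, h in enumerate(hours, start=1)]
        (folder / "sleep.csv").write_text("\n".join(lines) + "\n")
    demo_rows = combine_participants(tmp, "sleep.csv")

participants = {row["participant"] for row in demo_rows}
print(len(demo_rows), "rows from", len(participants), "participants")
```

Keeping the participant identifier as a column makes it straightforward to produce both group-level and subject-level observations from the same combined table.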

Reference: Rossi, A., Da Pozzo, E., Menicagli, D., Tremolanti, C., Priami, C., Sirbu, A., Clifton, D., Martini, C., & Morelli, D. (2020). Multilevel Monitoring of Activity and Sleep in Healthy People (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/cerq-fc86

Topic 4: Interaction Analysis with Apple Watch Data

Data: https://physionet.org/content/sleep-accel/1.0.0/

You are given data consisting of motion (acceleration), heart rate, and steps collected using an Apple Watch, together with sleep labels recorded from polysomnography (PSG) (0-5: wake = 0, N1 = 1, N2 = 2, N3 = 3, REM = 5), for 31 participants over a period of 7 to 14 days. Time is recorded in seconds since PSG start.

Suggested Tasks:

  1. Clean the data and remove N/A values, if there are any, or replace the missing values with predictions.

  2. Analyse the data and plot interesting observations:

    • At least one observation on the group level

    • At least one observation on the subject level

  3. Explore one (or more) of these topics:

    • Build a model that predicts whether the individual is asleep or awake.

    • Build a model that predicts the individual’s sleep cycle.

    • Your own ideas are welcome.

  4. Draw conclusions about which conditions support a longer deep sleep phase and thus better sleep quality.

Note! Please don’t use the labeled sleep values as input features in your analysis; treat them only as labels.
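
To make the note above concrete, the following sketch keeps the PSG stage out of the feature vectors and uses it only as the target. The stage codes follow the description above. The record keys heart_rate, motion, and stage, the sample values, and the choice to drop records with any other stage code are assumptions for illustration.

```python
# Stage codes as listed in the topic description.
STAGE_NAMES = {0: "wake", 1: "N1", 2: "N2", 3: "N3", 5: "REM"}


def asleep_label(stage):
    """Binary target: 0 = awake, 1 = asleep (any stage other than wake)."""
    return 0 if stage == 0 else 1


def split_features_and_labels(records):
    """Separate input features from stage labels.

    Records with a stage code outside STAGE_NAMES are dropped (an assumption).
    The stage never appears in the feature vectors.
    """
    features, stages = [], []
    for record in records:
        if record["stage"] not in STAGE_NAMES:
            continue
        features.append([record["heart_rate"], record["motion"]])
        stages.append(record["stage"])
    return features, stages


def deep_sleep_fraction(stages):
    """Share of sleep epochs spent in N3 (deep sleep)."""
    asleep = [s for s in stages if s != 0]
    if not asleep:
        return 0.0
    return sum(1 for s in asleep if s == 3) / len(asleep)


# Made-up epochs for one participant.
records = [
    {"heart_rate": 72, "motion": 0.30, "stage": 0},
    {"heart_rate": 61, "motion": 0.02, "stage": 2},
    {"heart_rate": 55, "motion": 0.01, "stage": 3},
    {"heart_rate": 54, "motion": 0.01, "stage": 3},
    {"heart_rate": 63, "motion": 0.03, "stage": 5},
    {"heart_rate": 70, "motion": 0.20, "stage": -1},
]
features, stages = split_features_and_labels(records)
binary_labels = [asleep_label(s) for s in stages]
print(binary_labels, deep_sleep_fraction(stages))
```

The binary labels suit the asleep/awake task, while the raw stage codes can serve as a multi-class target.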

Topic 5: Network Study

Data: https://figshare.com/articles/dataset/The_Copenhagen_Networks_Study_interaction_data/7267433/1

You are given network data that was gathered using smartphones. The subjects interact via phone calls, SMS messages, social media (whether or not they are Facebook friends), and in person (bt_symmetric.csv). The data also provides the timestamp of each communication and the gender of the subjects.

Suggested Tasks:

  1. Clean the data and remove N/A values, if there are any, or replace the missing values with predictions.

  2. Analyse the data and plot interesting observations:

    • At least one observation on the community level

    • At least one observation on the subject level

    • Note! Use the background information

  3. Explore one (or more) of these topics:

    • Draw conclusions about how a virus would spread. Find the individuals who could cause the largest spread of the virus (central nodes).

    • Explore the communication times and patterns between individuals.

    • Your own idea is welcome.
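
One simple way to look for central nodes is degree centrality: the fraction of other nodes that a node is directly connected to. The sketch below computes it from a list of undirected edges using only the standard library. The edge list is made up, and how you turn the interaction files into pairs of subject IDs is left open. Other centrality measures (for example betweenness) may suit a spreading analysis better.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality for an undirected graph given as (a, b) pairs.

    Repeated contacts between the same pair count once; self-loops are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


def top_central(edges, k=3):
    """Return the k nodes with the highest degree centrality (ties by node id)."""
    scores = degree_centrality(edges)
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


# Made-up contacts between five subjects; (1, 2) appears twice on purpose.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5), (2, 1)]
scores = degree_centrality(edges)
print(top_central(edges, k=2), scores[1])
```

Collapsing repeated contacts into a single edge discards how often two people met, so a weighted variant may be worth considering when frequency matters.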

Inspiration: https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python

Topic 6: Stress Detection from Social Media Posts

Data: https://github.com/SenticNet/stress-detection

You are given four high-quality datasets collected from Reddit and Twitter. Each post/tweet is labeled as either Stress Positive (1) or Stress Negative (0) based on an automated DNN-based annotation strategy.

Suggested Tasks:

  1. Clean the datasets: remove duplicates, handle missing/empty texts, normalize formatting (e.g., whitespace, URLs, hashtags, emojis).

  2. Explore the data and plot interesting observations:

    • At least one observation on the platform level (Reddit and Twitter).

    • At least one observation on the text level (stress vs. non-stress posts).

  3. Explore one (or more) of these topics:

    • Train baseline models (e.g., Logistic Regression, SVM) and compare with transformer-based models (e.g., BERT, RoBERTa) for stress classification (you can even use local LLMs with in-context learning or fine-tuning).

    • Conduct cross-platform analysis: train on Reddit datasets and test on Twitter datasets (and vice versa) to assess generalization.

    • Investigate the impact of dataset variations (e.g., Twitter Full vs. Non-Advert, Reddit Title vs. Combi) on classification accuracy.

    • Conduct a linguistic analysis using LIWC (Linguistic Inquiry and Word Count), NRC emotion lexicon, or other tools to identify patterns of stress-related language.
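
The cleaning step in task 1 could be sketched as follows, using regular expressions from the standard library. The patterns, the decision to keep the hashtag word while dropping the # sign, and lowercasing are illustrative assumptions. Emoji handling is not covered here and would need a separate decision (remove, keep, or convert to text).

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_text(text):
    """Remove URLs, unwrap hashtags, collapse whitespace, and lowercase."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # "#exams" -> "exams"
    text = WHITESPACE_RE.sub(" ", text).strip()
    return text.lower()


def clean_posts(posts):
    """Normalize (text, label) pairs; drop empty texts and duplicates.

    The first occurrence of each normalized text is kept, in input order.
    """
    seen = set()
    cleaned = []
    for text, label in posts:
        normalized = normalize_text(text or "")
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append((normalized, label))
    return cleaned


# Made-up posts, not taken from the datasets.
posts = [
    ("So  stressed #exams https://t.co/abc ", 1),
    ("so stressed exams", 1),
    ("   ", 0),
    (None, 0),
    ("Lovely walk today", 0),
]
cleaned_posts = clean_posts(posts)
print(cleaned_posts)
```

Aggressive normalization can remove cues that transformer-based models rely on, so it may be worth comparing results with lighter preprocessing.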

Reference: Rastogi, A., Liu, Q., & Cambria, E. (2022, July). Stress detection from social media articles: New dataset benchmark and analytical study. In 2022 international joint conference on neural networks (IJCNN) (pp. 1-8). IEEE.

Topic 7: Reddit Data vs. Mental Health in the COVID-19 Pandemic

Data: https://zenodo.org/records/3941387

This dataset enables a comprehensive look into how mental health support groups on numerous subreddits were impacted by the onset of COVID-19. We can gain insights into shifts in discourse and user behavior during the pandemic by analyzing posts from Jan to Apr 2020 and comparing them to baseline posts from earlier timeframes (2018-2019).

Suggested Tasks:

  1. The dataset is large (~3GB) and diverse; therefore, implement a justified strategy to select a manageable subset of the dataset for analysis (based on criteria such as specific subreddits, post frequency, etc.). Ensure the selected subset is representative of the larger dataset.

  2. Clean the dataset and address issues such as handling missing values, removing irrelevant columns, and preprocessing text data for analysis.

  3. Analyze COVID-19 impact by comparing posts in the appropriate time frames. Explore shifts in topics, sentiment, and the way mental health concerns (e.g., anxiety, depression) were discussed pre- and post-COVID-19 outbreak.

  4. Explore one (or more) of these topics:

    • Conduct a linguistic analysis or a sentiment analysis of posts before and during the pandemic.

    • Perform unsupervised clustering to identify conversation clusters (such as Seeking Advice, Suicidality, and Medication) and how their representation varied across mental health subreddits.

    • Investigate user behavior trends, such as changes in post frequency, length, and engagement during the peak COVID-19 months.

    • Your own cool, challenging idea is welcome!
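
For the pre/during-pandemic comparison, each post first needs to be assigned to a time window. The sketch below does this and compares the mean post length (in words) per window. The exact window boundaries are assumptions based on the description above (2018-2019 as baseline, January to April 2020 as the pandemic period). The input format of (ISO date string, text) pairs and the sample posts are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Assumed window boundaries; adjust to the time frames you justify in the report.
BASELINE_START, BASELINE_END = date(2018, 1, 1), date(2019, 12, 31)
PANDEMIC_START, PANDEMIC_END = date(2020, 1, 1), date(2020, 4, 30)


def period_of(day):
    """Return "baseline", "pandemic", or None if the date is in neither window."""
    if BASELINE_START <= day <= BASELINE_END:
        return "baseline"
    if PANDEMIC_START <= day <= PANDEMIC_END:
        return "pandemic"
    return None


def mean_length_by_period(posts):
    """Mean number of words per post in each window.

    `posts` is an iterable of (ISO date string, text) pairs.
    """
    lengths = defaultdict(list)
    for iso_date, text in posts:
        period = period_of(date.fromisoformat(iso_date))
        if period is not None:
            lengths[period].append(len(text.split()))
    return {period: mean(values) for period, values in lengths.items()}


# Made-up posts, not taken from the dataset.
posts = [
    ("2019-03-01", "cannot sleep again"),
    ("2019-03-05", "feeling a bit better today"),
    ("2020-02-10", "so anxious"),
    ("2021-01-01", "outside both windows"),
]
length_by_period = mean_length_by_period(posts)
print(length_by_period)
```

The same window assignment can be reused for post frequency, sentiment, or topic comparisons.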

Inspiration: https://osf.io/preprints/psyarxiv/xvwcy

Reference: textblob: https://textblob.readthedocs.io/en/dev/