Project

1. Project Description

Requirements

  • Choose one topic that interests you.

  • Explore the data, make an analysis and draw conclusions about what you find.

  • Write a report in the corresponding format. (see the Report Format section below)

Submission Instruction

  • Attach all the codes in a zip archive, including your pdf report, and a brief README file on how to set the environment (if needed) and run your code.

  • Submit the zip archive to the corresponding MyCourses submission box under the Project section.

2. Report

Report Format

  • Minimum 10 pages (without references). The last page(s) should contain references.

  • Reference page(s) does NOT count towards the total number of pages.

  • Including minimum 2 plots (must be meaningful and with proper size).

  • Font size 12, single column. written in English.

  • Report must be in pdf format.

  • The report should be anonymous, DO NOT keep any information that might tell your peers who you are.

  • Do not include any code/pseudocode/screenshots of your code in the report.

Report Outline

  • Introduction (context setting up, related work or literature review, etc.)

  • Problem Formulation (describe what is the issue that your want to solve)

  • Dataset Description (data types, missing data, observations, etc.)

  • Methods (your methods)

  • Results

  • Conclusion & Discussion

  • References (you can use any reference styles, but only use one style in your report)

Guide of how to make proper references with APA style: https://libguides.murdoch.edu.au/APA

3. Topics

Here you are given a list of options for your final project. You can either explore one of these or a dataset of your choice. If you decide to go with your own choice of the dataset, you must get it approved by the course staff within the first two weeks. The tasks and analyses under each topic are suggestions to make your work easier. They might not be enough for a complete project. If you want to get full points, please either conduct your analysis in addition to the suggested tasks or give a in-depth study of the given tasks.

Topic 1: Daily Activity Analysis with Fitbit Tracker Data

Data: https://www.kaggle.com/arashnic/fitbit

You are given the data that was collected using the fitbit wristwatch. It describes the daily activities of 30 volunteers for 31 consecutive days. In the dailyActivity_merged.csv file you can find the overview information of the daily activities of the user. In other files which names start with “daily” the daily overview of the tracked data is presented.

In the “hourly” and “minute” files with the tracked data you can find how these actions are distributed around the day. weightLogInfo_merged.csv provides the background information about the weight of the user.

Suggested Tasks:

  1. Clean the data, remove N/A values and outliers if there are such

  2. Analyse the data and plot interesting observations: - Minimum one observation on the group level - Minimum one observation on the subject level

  3. Explore one (or more) of these topics: - Make conclusion about the subject’s and community’s lifestyle. Compare with the calories intake, sleep length and step count recommended by WHO. - Make conclusion about the tendencies in daily activities of the individuals (in which situations the individual eats more, sleeps more or exercises more). Explore the patterns. - Your own idea is welcome.

Inspiration: https://www.freecodecamp.org/news/how-i-analyzed-the-data-from-my-fitbit-to-improve-my-overall-health-a2e36426d8f9/

Topic 2: Depression Analysis

Data: https://www.kaggle.com/arashnic/the-depression-dataset

You are given the data that consists of the observations of the activity of the patients with depression (condition group) and without it (control group). The observations include the activity measurements from the actigraph watch recorded every minute during approximately 5-20 days.

The scores.csv provides information about the background data of the participants, such as gender (1 or 2 for female or male), age group, length of education, marital status, employment status, type (afftype (1: bipolar II, 2: unipolar depressive, 3: bipolar I), melanch (1: melancholia, 2: no melancholia), inpatient (1: inpatient, 2: outpatient)) and severity (madrs1 (MADRS score when measurement started), madrs2 (MADRS when measurement stopped)) of the depression (only for the condition group).

Suggested Tasks:

  1. Clean the data, remove N/A values if there are such, or replace the missing values with predictions

  2. Analyse the data and plot interesting observations:

    • Minimum one observation on the group level

    • Minimum one observation on the subject level

    • Note! Use the background information

  3. Explore one (or more) of these topics:

    • Make a comparison between control and condition groups.

    • Make a comparison between different types of depression.

    • Your own idea is welcome.

Topic 3: Sleep Analysis

Data: https://www.kaggle.com/danagerous/sleep-data

You are given the data that was collected through the Sleep Cycle app from Northcube on iOS between 2014-2018 for one user. In the dataset you can find the sleep quality, time spent in bed, general mood on waking up, notes about the events that could influence the sleep cycle and heartbeat.

Suggested Tasks:

  1. Clean the data, remove N/A values if there are such, or replace the missing values with predictions

  2. Analyse the data and plot interesting observations: - At least two observations on the subject level

  3. Explore associations or any other topics with techniques such as clustering, dimension reduction, kernel methods, etc. The more methods you use, the higher chance your will get full points.

  4. Conclude your results.

Topic 4: Interaction Analysis with Apple Watch Data

Data: https://physionet.org/content/sleep-accel/1.0.0/

You are given the data that consists of motion (acceleration), heart rate and steps was collected using Apple Watch and labeled sleep recorded from polysomnography (0-5, wake = 0, N1 = 1, N2 = 2, N3 = 3, REM = 5) for 31 participants during 7 to 14 days period. Date is recorded in seconds since PSG start.

Suggested Tasks:

  1. Clean the data, remove N/A values if there are such, or replace the missing values with predictions

  2. Analyse the data and plot interesting observations: - At least one observation on the group level - At least one observation on the subject level

  3. Explore one (or more) of these topics: - Create a model prediction if the individual is asleep or awake. - Create a model prediction if the individual’s sleep cycle. - Your own idea is welcome.

  4. Make a conclusion about what are the conditions which support longer deep sleep phase, thus influence better quality sleep.

Note! Please don’t use labeled sleep values as parameters in your analysis, but rather treat them as a label.

Topic 5: Network Study

Data: https://figshare.com/articles/dataset/The_Copenhagen_Networks_Study_interaction_data/7267433/1

You are given the network data that was gathered using smartphones. The subjects are interacting via phone calls, SMS messages, social media (they are or are not Facebook friends) and in person (bt_symmetric.csv). The data also provides the timestamp of when the communication has occurred and the gender of the subjects.

Suggested Tasks:

  1. Clean the data, remove N/A values if there are such, or replace the missing values with predictions

  2. Analyse the data and plot interesting observations:

    • At least one observation on the community level

    • At least one observation on the subject level

    • Note! Use the background information

  3. Explore one (or more) of these topics:

    • Make conclusions about how the virus would spread. Find the individuals who could cause the largest spread of the virus (central nodes).

    • Explore the communication times and patterns between individuals.

    • Your own idea is welcome.

Inspiration: https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python

Topic 6: Social Media Analysis on Vaccines Discourse

Data: https://www.kaggle.com/datasets/komalkhetlani/tweets-about-covid19-all-over-the-world

You are given the Twitter posts data that contains tweets from the all over Twitter accounts of the world that are talking about Covid-19. The dataset provide features such as original tweet, language and created time.

Suggested Tasks:

  1. Clean the dataset, removing any URLs, mentions, special characters, and N/A values if present.

  2. Explore the data and plot interesting observations

    • At least one observation on the post-activity (e.g. daily, weekly or monthly post frequency)

    • At least one observation on the text (wordcloud, most frequent words, most frequent hashtags, most frequent mentions, etc.)

  3. Explore one (or more) of these topics:

Reference: textblob: https://textblob.readthedocs.io/en/dev/

Topic 7: Social Media Analysis on Mental Health Discourse

Data: https://drive.google.com/drive/folders/1lL6DDzwWCTOF6SJI6fIcxrr2CfK3vvXS?usp=drive_link

Social media platforms like Twitter provide valuable insights into the mental well-being of users. In this project, you will delve deep into two datasets containing tweets collected by phrases indicating depression and anxiety.

Suggested Tasks:

  1. Clean the dataset and remove ULR, N/A values if there are such, mentions, and special characters.

  2. Analyze the data and plot interesting observations: - At least two observations surrounding depression versus anxiety, and explore any noteworthy differences or similarities.

  3. Explore one (or more) of these topics:

Reference: textblob: https://textblob.readthedocs.io/en/dev/