Bellabeat is the go-to wellness brand for women with an ecosystem of products and services that focuses on women’s health. They empower women with knowledge about their own health and habits by collecting data on activity, sleep, stress, and reproductive health. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
The key stakeholders for this project include:
These three questions will guide our analysis:
Analyze smart device fitness data to gain insight into consumers and identify trends that can inform Bellabeat App marketing strategy.
We are using R for our data analysis because it’s a powerful, all-in-one tool. We can explore all the datasets quickly, provide code, share visualizations, and keep everything organized all within this html file. This html file will serve as a record of the data cleaning process, data analysis and data visualization generation that preceded our stakeholder presentation.
The first thing we’ll do is install the needed packages:
install.packages()
Then we’ll load the libraries for these packages:
library()
We’ll be using public data that explores smart device users’ daily habits:
Important note: The FitBit dataset does have some limitations. Since the data is collected from thirty Fitbit users , it is not representative of the entire fitness tracker industry and thus suffers some sampling bias. We will address this limitation by using physical activity data from the World Health Organization (WHO) as a reference point for activity levels on a global scale. We’ll look at open data on global trends in physical activity levels. This data will be pulled from the WHO website and included in the final presentation. It won’t be used in this notebook, but we are including here for reference:
Back to the FitBit data: we’re going to explore data across fourteen CSV files. The data is organized in rows and columns, making it structured data. The majority of datasets are in long format, meaning each row is one time point per session ID. Each session ID has data in multiple rows. Individual reports can be parsed by export ID or time-stamp. These datasets fall into three different groups based on the time increments they were tracked by:
By analyzing the daily, hourly, and minute level data provided, we can begin to explore trends and search for insights that can be applied to Bellabeat customers.
Let’s look at the file sizes prior to importing. We don’t want to overload our RAM in Rstudio Cloud:
dailyActivity_merged.csv = 89437 bytes
dailyCalories_merged.csv = 25125 bytes
dailyIntensities_merged.csv = 70581 bytes
dailySteps_merged.csv = 25175 bytes
sleepDay_merged.csv = 18100 bytes
weightLogInfo_merged.csv = 6725 bytes
hourlyCalories_merged.csv = 801486 bytes
hourlyIntensities_merged.csv = 898778 bytes
hourlySteps_merged.csv = 796562 bytes
minuteCaloriesNarrow_merged.csv = 66443854 bytes
minuteIntensitiesNarrow_merged.csv = 46358789 bytes
minuteMETsNarrow_merged.csv = 47686683 bytes
minuteSleep_merged.csv = 8848370 bytes
minuteStepsNarrow_merged.csv = 46531957 bytes
Based on file sizes, it looks like the minute level data could cause issues with RAM usage in RStudio Cloud. Let’s compare the size of the smallest minute level dataset to the total combined size of all nine daily/hourly datasets:
minuteSleep_merged.csv
= 8,848,370 bytes
dailyActivity_merged.csv
,
dailyCalories_merged.csv
,
dailyIntensities_merged.csv
,
dailySteps_merged.csv
,
sleepDay_merged.csv
,
weightLogInfo_merged.csv
,
hourlyCalories_merged.csv
,
hourlyIntensities_merged.csv
,
hourlySteps_merged.csv
= 2,753,820 bytes
(2.75382^{6})
Even if we combine the bytes for all the daily and hourly datasets, it still doesn’t even come close in size to the smallest minute level dataset!
Let’s start with loading our daily and hourly datasets. We’ll create dataframes (df) for these now:
daily_activity <- read_csv("fitbit_data/dailyActivity_merged.csv")
daily_calories <- read_csv("fitbit_data/dailyCalories_merged.csv")
daily_intensities <- read_csv("fitbit_data/dailyIntensities_merged.csv")
daily_steps <- read_csv("fitbit_data/dailySteps_merged.csv")
daily_sleep <- read_csv("fitbit_data/sleepDay_merged.csv")
weight_log <- read_csv("fitbit_data/weightLogInfo_merged.csv")
hourly_calories <- read_csv("fitbit_data/hourlyCalories_merged.csv")
hourly_intensities <- read_csv("fitbit_data/hourlyIntensities_merged.csv")
hourly_steps <- read_csv("fitbit_data/hourlySteps_merged.csv")
Since the minute data is huge, we’re only going to import the smallest dataset:
minute_sleep <- read_csv("fitbit_data/minuteSleep_merged.csv")
Now that we’ve identified and loaded the ten datasets we’re going to use, we need to ensure that the data within each dataframe is clean and ready to analyze.
Let’s take a closer look at the different dataframes. We’ll start by using the head()
function to view the first few rows of each one:
`head(daily_activity)`
```
# A tibble: 6 × 15
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1.50e9 4/12/16 13162 8.5 8.5 0
2 1.50e9 4/13/16 10735 6.97 6.97 0
3 1.50e9 4/14/16 10460 6.74 6.74 0
4 1.50e9 4/15/16 9762 6.28 6.28 0
5 1.50e9 4/16/16 12669 8.16 8.16 0
6 1.50e9 4/17/16 9705 6.48 6.48 0
# … with 9 more variables: VeryActiveDistance <dbl>,
# ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
# SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
# FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
# SedentaryMinutes <dbl>, Calories <dbl>
```
`head(daily_calories)`
```
# A tibble: 6 × 3
Id ActivityDay Calories
<dbl> <chr> <dbl>
1 1503960366 4/12/2016 1985
2 1503960366 4/13/2016 1797
3 1503960366 4/14/2016 1776
4 1503960366 4/15/2016 1745
5 1503960366 4/16/2016 1863
6 1503960366 4/17/2016 1728
```
`head(daily_intensities)`
```
# A tibble: 6 × 10
Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinu…
<dbl> <chr> <dbl> <dbl> <dbl>
1 1503960366 4/12/2016 728 328 13
2 1503960366 4/13/2016 776 217 19
3 1503960366 4/14/2016 1218 181 11
4 1503960366 4/15/2016 726 209 34
5 1503960366 4/16/2016 773 221 10
6 1503960366 4/17/2016 539 164 20
# … with 5 more variables: VeryActiveMinutes <dbl>,
# SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
# ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>
```
`head(daily_steps)`
```
# A tibble: 6 × 3
Id ActivityDay StepTotal
<dbl> <chr> <dbl>
1 1503960366 4/12/2016 13162
2 1503960366 4/13/2016 10735
3 1503960366 4/14/2016 10460
4 1503960366 4/15/2016 9762
5 1503960366 4/16/2016 12669
6 1503960366 4/17/2016 9705
```
`head(daily_sleep)`
```
# A tibble: 6 × 5
Id SleepDay TotalSleepRecor… TotalMinutesAsle… TotalTimeInBed
<dbl> <chr> <dbl> <dbl> <dbl>
1 1503960366 4/12/2016 12:00:… 1 327 346
2 1503960366 4/13/2016 12:00:… 2 384 407
3 1503960366 4/15/2016 12:00:… 1 412 442
4 1503960366 4/16/2016 12:00:… 2 340 367
5 1503960366 4/17/2016 12:00:… 1 700 712
6 1503960366 4/19/2016 12:00:… 1 304 320
```
`head(weight_log)`
```
# A tibble: 6 × 8
Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
1 1503960366 5/2/2016… 52.6 116. 22 22.6 TRUE 1.46e12
2 1503960366 5/3/2016… 52.6 116. NA 22.6 TRUE 1.46e12
3 1927972279 4/13/201… 134. 294. NA 47.5 FALSE 1.46e12
4 2873212765 4/21/201… 56.7 125. NA 21.5 TRUE 1.46e12
5 2873212765 5/12/201… 57.3 126. NA 21.7 TRUE 1.46e12
6 4319703577 4/17/201… 72.4 160. 25 27.5 TRUE 1.46e12
```
`head(hourly_calories)`
```
# A tibble: 6 × 3
Id ActivityHour Calories
<dbl> <chr> <dbl>
1 1503960366 4/12/2016 12:00:00 AM 81
2 1503960366 4/12/2016 1:00:00 AM 61
3 1503960366 4/12/2016 2:00:00 AM 59
4 1503960366 4/12/2016 3:00:00 AM 47
5 1503960366 4/12/2016 4:00:00 AM 48
6 1503960366 4/12/2016 5:00:00 AM 48
```
`head(hourly_intensities`
```
# A tibble: 6 × 4
Id ActivityHour TotalIntensity AverageIntensity
<dbl> <chr> <dbl> <dbl>
1 1503960366 4/12/2016 12:00:00 AM 20 0.333
2 1503960366 4/12/2016 1:00:00 AM 8 0.133
3 1503960366 4/12/2016 2:00:00 AM 7 0.117
4 1503960366 4/12/2016 3:00:00 AM 0 0
5 1503960366 4/12/2016 4:00:00 AM 0 0
6 1503960366 4/12/2016 5:00:00 AM 0 0
```
`head(hourly_steps)`
```
# A tibble: 6 × 3
Id ActivityHour StepTotal
<dbl> <chr> <dbl>
1 1503960366 4/12/2016 12:00:00 AM 373
2 1503960366 4/12/2016 1:00:00 AM 160
3 1503960366 4/12/2016 2:00:00 AM 151
4 1503960366 4/12/2016 3:00:00 AM 0
5 1503960366 4/12/2016 4:00:00 AM 0
6 1503960366 4/12/2016 5:00:00 AM 0
```
`head(minute_sleep)`
```
# A tibble: 6 × 4
Id date value logId
<dbl> <chr> <dbl> <dbl>
1 1503960366 4/12/2016 2:47:30 AM 3 11380564589
2 1503960366 4/12/2016 2:48:30 AM 2 11380564589
3 1503960366 4/12/2016 2:49:30 AM 1 11380564589
4 1503960366 4/12/2016 2:50:30 AM 1 11380564589
5 1503960366 4/12/2016 2:51:30 AM 1 11380564589
6 1503960366 4/12/2016 2:52:30 AM 1 11380564589
```
Now we’ll use the colnames()
function to view the column names for each dataframe:
`colnames(daily_activity)`
```
[1] "Id" "ActivityDate"
[3] "TotalSteps" "TotalDistance"
[5] "TrackerDistance" "LoggedActivitiesDistance"
[7] "VeryActiveDistance" "ModeratelyActiveDistance"
[9] "LightActiveDistance" "SedentaryActiveDistance"
[11] "VeryActiveMinutes" "FairlyActiveMinutes"
[13] "LightlyActiveMinutes" "SedentaryMinutes"
[15] "Calories"
```
`colnames(daily_calories)`
```
[1] "Id" "ActivityDay" "Calories"
```
`colnames(daily_intensities)`
```
[1] "Id" "ActivityDay"
[3] "SedentaryMinutes" "LightlyActiveMinutes"
[5] "FairlyActiveMinutes" "VeryActiveMinutes"
[7] "SedentaryActiveDistance" "LightActiveDistance"
[9] "ModeratelyActiveDistance" "VeryActiveDistance"
```
`colnames(daily_steps)`
```
[1] "Id" "ActivityDay" "StepTotal"
```
`colnames(daily_sleep)`
```
[1] "Id" "SleepDay" "TotalSleepRecords"
[4] "TotalMinutesAsleep" "TotalTimeInBed"
```
`colnames(weight_log)`
```
[1] "Id" "Date" "WeightKg" "WeightPounds"
[5] "Fat" "BMI" "IsManualReport" "LogId"
```
`colnames(hourly_calories)`
```
[1] "Id" "ActivityHour" "Calories"
```
`colnames(hourly_intensities)`
```
[1] "Id" "ActivityHour" "TotalIntensity" "AverageIntensity"
```
`colnames(hourly_steps)`
```
[1] "Id" "ActivityHour" "StepTotal"
```
`colnames(minute_sleep)`
```
[1] "Id" "date" "value" "logId"
```
All these dataframes have the Id
field, which can be used to merge them into one another to create overall dataframes (i.e. daily df, hourly df). However, the activity date/time variable formatting looks to be different across some of the dataframes. Let’s select()
those variables to take a closer look.
```r
daily_activity %>%
select(Id, ActivityDate)
```
```
# A tibble: 940 × 2
Id ActivityDate
<dbl> <chr>
1 1503960366 4/12/16
2 1503960366 4/13/16
3 1503960366 4/14/16
4 1503960366 4/15/16
5 1503960366 4/16/16
6 1503960366 4/17/16
7 1503960366 4/18/16
8 1503960366 4/19/16
9 1503960366 4/20/16
10 1503960366 4/21/16
# … with 930 more rows
```
```r
daily_calories %>%
select(Id, ActivityDay)
```
```
# A tibble: 940 × 2
Id ActivityDay
<dbl> <chr>
1 1503960366 4/12/2016
2 1503960366 4/13/2016
3 1503960366 4/14/2016
4 1503960366 4/15/2016
5 1503960366 4/16/2016
6 1503960366 4/17/2016
7 1503960366 4/18/2016
8 1503960366 4/19/2016
9 1503960366 4/20/2016
10 1503960366 4/21/2016
# … with 930 more rows
```
```r
daily_intensities %>%
select(Id, ActivityDay)
```
```
# A tibble: 940 × 2
Id ActivityDay
<dbl> <chr>
1 1503960366 4/12/2016
2 1503960366 4/13/2016
3 1503960366 4/14/2016
4 1503960366 4/15/2016
5 1503960366 4/16/2016
6 1503960366 4/17/2016
7 1503960366 4/18/2016
8 1503960366 4/19/2016
9 1503960366 4/20/2016
10 1503960366 4/21/2016
# … with 930 more rows
```
```r
daily_steps %>%
select(Id, ActivityDay)
```
```
# A tibble: 940 × 2
Id ActivityDay
<dbl> <chr>
1 1503960366 4/12/2016
2 1503960366 4/13/2016
3 1503960366 4/14/2016
4 1503960366 4/15/2016
5 1503960366 4/16/2016
6 1503960366 4/17/2016
7 1503960366 4/18/2016
8 1503960366 4/19/2016
9 1503960366 4/20/2016
10 1503960366 4/21/2016
# … with 930 more rows
```
```r
daily_sleep %>%
select(Id, SleepDay)
```
```
# A tibble: 413 × 2
Id SleepDay
<dbl> <chr>
1 1503960366 4/12/2016 12:00:00 AM
2 1503960366 4/13/2016 12:00:00 AM
3 1503960366 4/15/2016 12:00:00 AM
4 1503960366 4/16/2016 12:00:00 AM
5 1503960366 4/17/2016 12:00:00 AM
6 1503960366 4/19/2016 12:00:00 AM
7 1503960366 4/20/2016 12:00:00 AM
8 1503960366 4/21/2016 12:00:00 AM
9 1503960366 4/23/2016 12:00:00 AM
10 1503960366 4/24/2016 12:00:00 AM
# … with 403 more rows
```
```r
weight_log %>%
select(Id, Date)
```
```
# A tibble: 67 × 2
Id Date
<dbl> <chr>
1 1503960366 5/2/2016 11:59:59 PM
2 1503960366 5/3/2016 11:59:59 PM
3 1927972279 4/13/2016 1:08:52 AM
4 2873212765 4/21/2016 11:59:59 PM
5 2873212765 5/12/2016 11:59:59 PM
6 4319703577 4/17/2016 11:59:59 PM
7 4319703577 5/4/2016 11:59:59 PM
8 4558609924 4/18/2016 11:59:59 PM
9 4558609924 4/25/2016 11:59:59 PM
10 4558609924 5/1/2016 11:59:59 PM
# … with 57 more rows
```
```r
hourly_calories %>%
select(Id, ActivityHour)
```
```
# A tibble: 22,099 × 2
Id ActivityHour
<dbl> <chr>
1 1503960366 4/12/2016 12:00:00 AM
2 1503960366 4/12/2016 1:00:00 AM
3 1503960366 4/12/2016 2:00:00 AM
4 1503960366 4/12/2016 3:00:00 AM
5 1503960366 4/12/2016 4:00:00 AM
6 1503960366 4/12/2016 5:00:00 AM
7 1503960366 4/12/2016 6:00:00 AM
8 1503960366 4/12/2016 7:00:00 AM
9 1503960366 4/12/2016 8:00:00 AM
10 1503960366 4/12/2016 9:00:00 AM
# … with 22,089 more rows
```
```r
hourly_intensities %>%
select(Id, ActivityHour)
```
```
# A tibble: 22,099 × 2
Id ActivityHour
<dbl> <chr>
1 1503960366 4/12/2016 12:00:00 AM
2 1503960366 4/12/2016 1:00:00 AM
3 1503960366 4/12/2016 2:00:00 AM
4 1503960366 4/12/2016 3:00:00 AM
5 1503960366 4/12/2016 4:00:00 AM
6 1503960366 4/12/2016 5:00:00 AM
7 1503960366 4/12/2016 6:00:00 AM
8 1503960366 4/12/2016 7:00:00 AM
9 1503960366 4/12/2016 8:00:00 AM
10 1503960366 4/12/2016 9:00:00 AM
# … with 22,089 more rows
```
```r
hourly_steps %>%
select(Id, ActivityHour)
```
```
# A tibble: 22,099 × 2
Id ActivityHour
<dbl> <chr>
1 1503960366 4/12/2016 12:00:00 AM
2 1503960366 4/12/2016 1:00:00 AM
3 1503960366 4/12/2016 2:00:00 AM
4 1503960366 4/12/2016 3:00:00 AM
5 1503960366 4/12/2016 4:00:00 AM
6 1503960366 4/12/2016 5:00:00 AM
7 1503960366 4/12/2016 6:00:00 AM
8 1503960366 4/12/2016 7:00:00 AM
9 1503960366 4/12/2016 8:00:00 AM
10 1503960366 4/12/2016 9:00:00 AM
# … with 22,089 more rows
```
```r
minute_sleep %>%
select(Id, date)
```
```
# A tibble: 188,521 × 2
Id date
<dbl> <chr>
1 1503960366 4/12/2016 2:47:30 AM
2 1503960366 4/12/2016 2:48:30 AM
3 1503960366 4/12/2016 2:49:30 AM
4 1503960366 4/12/2016 2:50:30 AM
5 1503960366 4/12/2016 2:51:30 AM
6 1503960366 4/12/2016 2:52:30 AM
7 1503960366 4/12/2016 2:53:30 AM
8 1503960366 4/12/2016 2:54:30 AM
9 1503960366 4/12/2016 2:55:30 AM
10 1503960366 4/12/2016 2:56:30 AM
# … with 188,511 more rows
```
Now that we’ve familiarized ourselves with the data, we can identify a few data cleaning tasks to move forward with. Here are the main areas we’ll address:
Naming conventions: we’ll adjust the variable names according to the tidyverse style guide. All variable names will use only lowercase letters and underscores _
(known as snake case) to separate words within a name. This will make our code easier to read.
Fix formatting issues: we’ll make the date-time format consistent across each dataframe. This will prevent problems down the line during analysis, as well as when we begin merging the data together.
Create new variables: we’ll create a few new columns within each dataframe.
Create new dataframe: in order to better analyze the minute_sleep
data, we’ll first do some thorough transformation by reformatting and creating new variables. Then we’ll aggregate all the sleep data we have into a more manageable structure.
# daily_activity ---------------------------
daily_activity <-
daily_activity %>%
rename(
activity_date = ActivityDate,
total_steps = TotalSteps,
total_distance = TotalDistance,
tracker_distance = TrackerDistance,
logged_activities_d = LoggedActivitiesDistance,
very_active_d = VeryActiveDistance,
moderately_active_d = ModeratelyActiveDistance,
light_active_d = LightActiveDistance,
sedentary_d = SedentaryActiveDistance,
very_active_m = VeryActiveMinutes,
fairly_active_m = FairlyActiveMinutes,
lightly_active_m = LightlyActiveMinutes,
sedentary_m = SedentaryMinutes,
calories = Calories
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
# reformat variable as POSIXct to represent date and time
activity_date = parse_date_time(activity_date, "%m/%d/%y"),
# create new variable and format as date only
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
# create new variables for day of week and time of week
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
)
# daily_calories ---------------------------
daily_calories <-
daily_calories %>%
rename(
activity_date = ActivityDay,
calories = Calories
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_date = parse_date_time(activity_date, "%m/%d/%Y"),
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
)
# daily_intensities ---------------------------
daily_intensities <-
daily_intensities %>%
rename(
activity_date = ActivityDay,
very_active_d = VeryActiveDistance,
moderately_active_d = ModeratelyActiveDistance,
light_active_d = LightActiveDistance,
sedentary_d = SedentaryActiveDistance,
very_active_m = VeryActiveMinutes,
fairly_active_m = FairlyActiveMinutes,
lightly_active_m = LightlyActiveMinutes,
sedentary_m = SedentaryMinutes,
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_date = parse_date_time(activity_date, "%m/%d/%Y"),
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
)
# daily_sleep ---------------------------
daily_sleep <-
daily_sleep %>%
rename(
activity_date = SleepDay,
total_sleep_records = TotalSleepRecords,
total_minutes_asleep = TotalMinutesAsleep,
total_time_in_bed = TotalTimeInBed
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_date = parse_date_time(activity_date, "%m/%d/%Y %I:%M:%S %p"),
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
)
# daily_steps ---------------------------
daily_steps <-
daily_steps %>%
rename(
activity_date = ActivityDay,
step_total = StepTotal
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_date = parse_date_time(activity_date, "%m/%d/%Y"),
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
)
# weight_log ---------------------------
weight_log <-
weight_log %>%
rename(
activity_date = Date,
weight_kg = WeightKg,
weight_lb = WeightPounds,
fat = Fat,
bmi = BMI,
manual_report = IsManualReport,
log_id = LogId
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_date = parse_date_time(activity_date, "%m/%d/%Y %I:%M:%S %p"),
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
activity_time = format(activity_date, format = "%I:%M:%S %p"),
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
hour_of_day = as.POSIXct(activity_date, format = "%I:%M:%S %p")
)
# hourly_calories ---------------------------
hourly_calories <-
hourly_calories %>%
rename(
activity_hour = ActivityHour,
calories = Calories
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_hour = parse_date_time(activity_hour, "%m/%d/%Y %I:%M:%S %p"),
activity_date_ymd = as.Date(activity_hour, "%Y/%m/%d"),
activity_time = format(activity_hour, format = "%I:%M:%S %p"),
day_of_week = weekdays(as.Date(activity_hour)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p")
)
# hourly_intensities ---------------------------
hourly_intensities <-
hourly_intensities %>%
rename(
activity_hour = ActivityHour,
total_intensity = TotalIntensity,
average_intensity = AverageIntensity
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_hour = parse_date_time(activity_hour, "%m/%d/%Y %I:%M:%S %p"),
activity_date_ymd = as.Date(activity_hour, "%Y/%m/%d"),
activity_time = format(activity_hour, format = "%I:%M:%S %p"),
day_of_week = weekdays(as.Date(activity_hour)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p")
)
# hourly_steps ---------------------------
hourly_steps <-
hourly_steps %>%
rename(
activity_hour = ActivityHour,
step_total = StepTotal
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_hour = parse_date_time(activity_hour, "%m/%d/%Y %I:%M:%S %p"),
activity_date_ymd = as.Date(activity_hour, "%Y/%m/%d"),
activity_time = format(activity_hour, format = "%I:%M:%S %p"),
day_of_week = weekdays(as.Date(activity_hour)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p")
)
# minute_sleep ---------------------------
minute_sleep <-
minute_sleep %>%
rename(
activity_date = date,
sleep_value = value,
log_id = logId
) %>%
rename_with(
tolower, starts_with("Id")
) %>%
mutate(
activity_date = parse_date_time(activity_date, "%m/%d/%Y %I:%M:%S %p"),
activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
activity_time = format(activity_date, format = "%I:%M:00 %p"),
day_of_week = weekdays(as.Date(activity_date)),
time_of_week = case_when(
day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p"),
# sleep_id will be used generate totals per sleep log
sleep_id = str_c(id, "-", log_id),
# Create new variables for sleep values - asleep, restless, and awake
asleep = ifelse(sleep_value == 1, 1, 0),
restless = ifelse(sleep_value == 2, 1, 0),
awake = ifelse(sleep_value == 3, 1, 0)
)
# Create sleep_summary_0 df ---------------------------
# that shows totals for 3 sleep values per sleep log
sleep_summary_0 <-
minute_sleep %>%
# id_date will be used to generate a daily total
mutate(
id_date = str_c(id, "-", activity_date_ymd)
) %>%
group_by(sleep_id, activity_date_ymd, id_date, id) %>%
dplyr::summarize(
total_asleep = sum(sleep_value == "1"),
total_restless = sum(sleep_value == "2"),
total_awake = sum(sleep_value == "3")
)
# Create sleep_summary df ---------------------------
# that combines totals for each day per id
sleep_summary <-
sleep_summary_0 %>%
# activity_date will be used to merge with daily_sleep df
mutate(
activity_date = parse_date_time(activity_date_ymd, "%Y/%m/%d")
) %>%
group_by(id_date, activity_date, id) %>%
dplyr::summarize(
total_asleep_merged = sum(total_asleep),
total_restless_merged = sum(total_restless),
total_awake_merged = sum(total_awake)
)
# Merge these two daily sleep dfs into one ---------------------------
sleep_data <- merge(x = daily_sleep, y = sleep_summary, by = c("id", "activity_date"), all = TRUE)
Now that we’ve done some deeper cleaning and formatting, let’s get to know this data from a big-picture perspective. We can do this by obtaining some high-level summary statistics that can show us how these dataframes relate to one another, as well as what trends might exist within each:
How many unique participants are there in each dataframe?
# There are 33 users (one user per unique id) in the daily activity df
n_distinct(daily_activity$id)
[1] 33
# There are 24 users (one user per unique id) in the sleep dfs
n_distinct(daily_sleep$id)
[1] 24
n_distinct(sleep_data$id)
[1] 24
# There are 8 users (one user per unique id) in the weight df
n_distinct(weight_log$id)
[1] 8
# There are 33 users (one user per unique id) in the hourly dfs
n_distinct(hourly_calories$id)
[1] 33
n_distinct(hourly_intensities$id)
[1] 33
n_distinct(hourly_steps$id)
[1] 33
# There are 24 users (one user per unique id) in the minute df
n_distinct(minute_sleep$id)
[1] 24
How many observations are there in each dataframe?
# The observations vary across the daily dfs
nrow(daily_activity)
[1] 940
nrow(daily_sleep)
[1] 413
nrow(sleep_data)
[1] 452
nrow(weight_log)
[1] 67
# There are 22099 observations in each hourly dataframe
nrow(hourly_calories)
[1] 22099
nrow(hourly_intensities)
[1] 22099
nrow(hourly_steps)
[1] 22099
# There are 188521 observations in the minute dataframe
nrow(minute_sleep)
[1] 188521
What are some quick summary statistics we’d want to know about each dataframe?
# Daily totals for steps, distance, calories ---------------------------
daily_activity %>%
select(
total_steps,
total_distance,
calories
) %>%
summary()
total_steps total_distance calories
Min. : 0 Min. : 0.000 Min. : 0
1st Qu.: 3790 1st Qu.: 2.620 1st Qu.:1828
Median : 7406 Median : 5.245 Median :2134
Mean : 7638 Mean : 5.490 Mean :2304
3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:2793
Max. :36019 Max. :28.030 Max. :4900
# Active minute levels per category ---------------------------
daily_activity %>%
select(
very_active_m,
fairly_active_m,
lightly_active_m,
sedentary_m
) %>%
summary()
very_active_m fairly_active_m lightly_active_m sedentary_m
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
# Sleep totals ---------------------------
# for records, minutes asleep, and time in bed
daily_sleep %>%
select(
total_sleep_records,
total_minutes_asleep,
total_time_in_bed
) %>%
summary()
total_sleep_records total_minutes_asleep total_time_in_bed
Min. :1.000 Min. : 58.0 Min. : 61.0
1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
Median :1.000 Median :433.0 Median :463.0
Mean :1.119 Mean :419.5 Mean :458.6
3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
Max. :3.000 Max. :796.0 Max. :961.0
# Sleep totals per category ---------------------------
sleep_data %>%
select(
total_minutes_asleep,
total_time_in_bed,
total_asleep_merged,
total_restless_merged,
total_awake_merged
) %>%
summary()
total_minutes_asleep total_time_in_bed total_asleep_merged
Min. : 58.0 Min. : 61.0 Min. : 0.0
1st Qu.:361.0 1st Qu.:403.0 1st Qu.:325.0
Median :433.0 Median :463.0 Median :407.5
Mean :419.5 Mean :458.6 Mean :384.8
3rd Qu.:490.0 3rd Qu.:526.0 3rd Qu.:477.0
Max. :796.0 Max. :961.0 Max. :778.0
NA's :39 NA's :39
total_restless_merged total_awake_merged
Min. : 0.00 Min. : 0.000
1st Qu.: 11.00 1st Qu.: 0.000
Median : 20.00 Median : 3.000
Mean : 31.17 Mean : 4.469
3rd Qu.: 31.00 3rd Qu.: 5.000
Max. :227.00 Max. :129.000
# Weight totals ---------------------------
weight_log %>%
select(
weight_lb,
fat,
bmi
) %>%
summary()
weight_lb fat bmi
Min. :116.0 Min. :22.00 Min. :21.45
1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96
Median :137.8 Median :23.50 Median :24.39
Mean :158.8 Mean :23.50 Mean :25.19
3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
Max. :294.3 Max. :25.00 Max. :47.54
NA's :65
# Hourly calories summary ---------------------------
hourly_calories %>%
select(calories) %>%
summary()
calories
Min. : 42.00
1st Qu.: 63.00
Median : 83.00
Mean : 97.39
3rd Qu.:108.00
Max. :948.00
# Hourly intensities summary ---------------------------
hourly_intensities %>%
select(
total_intensity,
average_intensity
) %>%
summary()
total_intensity average_intensity
Min. : 0.00 Min. :0.0000
1st Qu.: 0.00 1st Qu.:0.0000
Median : 3.00 Median :0.0500
Mean : 12.04 Mean :0.2006
3rd Qu.: 16.00 3rd Qu.:0.2667
Max. :180.00 Max. :3.0000
# Hourly steps summary ---------------------------
hourly_steps %>%
select(step_total) %>%
summary()
step_total
Min. : 0.0
1st Qu.: 0.0
Median : 40.0
Mean : 320.2
3rd Qu.: 357.0
Max. :10554.0
# Minute sleep summary ---------------------------
minute_sleep %>%
select(
sleep_value,
asleep,
restless,
awake
) %>%
summary()
sleep_value asleep restless awake
Min. :1.000 Min. :0.0000 Min. :0.00000 Min. :0.0000
1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.0000
Median :1.000 Median :1.0000 Median :0.00000 Median :0.0000
Mean :1.096 Mean :0.9149 Mean :0.07438 Mean :0.0107
3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :3.000 Max. :1.0000 Max. :1.00000 Max. :1.0000
What does this tell us about this sample of people’s activities? A few interesting notes:
The goal of 10,000 steps per day has been widely proposed across the smart device industry. But how important of a health metric is daily steps really? For example, what’s the relationship between total steps and calories burned?
# Total steps vs calories plot ---------------------------
ggplot(data = daily_activity, aes(x = total_steps, y = calories)) +
geom_point() +
geom_smooth() + # Trend line with a shadow representing 95% confidence interval
labs(title = "Total steps vs calories") +
ylab("Calories") +
xlab("Total Steps") +
theme_minimal()
cor.test(daily_activity$total_steps, daily_activity$calories, method = "pearson")
Pearson's product-moment correlation
data: daily_activity$total_steps and daily_activity$calories
t = 22.472, df = 938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5483688 0.6316184
sample estimates:
cor
0.5915681
As expected, we see a positive correlation here between total_steps
and calories
. This makes sense because the more active someone is, the more calories they will burn. But what about the relationship between very_active_m
and calories
?
# Very active minutes vs calories plot ---------------------------
ggplot(data = daily_activity, aes(x = very_active_m, y = calories)) +
geom_point() +
geom_smooth() + # Trend line with a shadow representing 95% confidence interval
labs(title = "Very active minutes vs calories") +
ylab("Calories") +
xlab("Very active minutes") +
theme_minimal()
cor.test(daily_activity$very_active_m, daily_activity$calories, method = "pearson")
Pearson's product-moment correlation
data: daily_activity$very_active_m and daily_activity$calories
t = 23.939, df = 938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5745197 0.6540260
sample estimates:
cor
0.6158383
Unsurprisingly, we see another positive correlation. This time between very_active_m
and calories
. Very active people will burn more calories. But we know from earlier that the average very_active_m
per day of 21.16
is much lower than then average lightly_active_m
per day of 192.80
. So what’s the relationship between lightly_active_m
and calories
?
# Lightly active minutes vs calories plot ---------------------------
ggplot(data = daily_activity, aes(x = lightly_active_m, y = calories)) +
geom_point() +
geom_smooth() + # Trend line with a shadow representing 95% confidence interval
labs(title = "Lightly active minutes vs calories") +
ylab("Calories") +
xlab("Lightly active minutes") +
theme_minimal()
cor.test(daily_activity$lightly_active_m, daily_activity$calories, method = "pearson")
Pearson's product-moment correlation
data: daily_activity$lightly_active_m and daily_activity$calories
t = 9.1661, df = 938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2269361 0.3443464
sample estimates:
cor
0.2867175
The relationship between lightly_active_m
and calories
is still positive, albeit a lot weaker. This could mean that lightly_active_m
has less of an impact on calories burned throughout a day.
What’s the relationship between calories and day of the week?
# Create labels and limits for plot ---------------------------
labels_weekdays <- c(
"Monday" = "Monday", "Tuesday" = "Tuesday",
"Wednesday" = "Wednesday", "Thursday" = "Thursday",
"Friday" = "Friday", "Saturday" = "Saturday",
"Sunday" = "Sunday"
)
limits_weekdays <- c(
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"
)
# Calories vs day of week plot ---------------------------
ggplot(data = daily_activity) +
geom_point(
aes(x = day_of_week, y = calories, color = as.factor(time_of_week)),
position = "jitter",
alpha = 0.3
) +
labs(title = "Daily calories throughout the week", color = "Time of week") +
ylab("Total Calories") +
scale_x_discrete(
"Day",
labels = labels_weekdays,
limits = limits_weekdays,
guide = guide_axis(angle = 45)
) +
stat_summary(
aes(x = day_of_week, y = calories),
fun = mean,
geom = "point",
color = "red",
size = 2,
alpha = 0.7
) +
theme_minimal()
There doesn’t look to be a large difference in the average calories burned per day. These averages are represented by the red dots on each day.
What’s the relationship between sleep quality and time of week? You might expect people to get better sleep over the weekends vs weekdays - are there any unexpected trends?
# Total minutes asleep vs time of week ---------------------------
ggplot(data = sleep_data) +
geom_point(
aes(
x = weekdays.POSIXt(activity_date),
y = total_asleep_merged,
color = as.factor(time_of_week)
),
position = "jitter",
alpha = 0.3
) +
labs(title = "Total minutes asleep throughout week") +
guides(color = "none") +
ylab("Minutes asleep") +
scale_x_discrete(
"Day",
labels = labels_weekdays,
limits = limits_weekdays,
guide = guide_axis(angle = 45)
) +
stat_summary(
aes(x = weekdays.POSIXt(activity_date), y = total_asleep_merged),
fun = mean,
geom = "point",
color = "red",
size = 2,
alpha = 0.7
) +
theme_minimal()
# Total minutes restless vs time of week ---------------------------
ggplot(data = sleep_data) +
geom_point(
aes(
x = weekdays.POSIXt(activity_date),
y = total_restless_merged,
color = as.factor(time_of_week)
),
position = "jitter",
alpha = 0.3
) +
labs(title = "Total minutes restless throughout week") +
guides(color = "none") +
ylab("Minutes restless") +
scale_x_discrete(
"Day",
labels = labels_weekdays,
limits = limits_weekdays,
guide = guide_axis(angle = 45)
) +
stat_summary(
aes(x = weekdays.POSIXt(activity_date), y = total_restless_merged),
fun = mean,
geom = "point",
color = "red",
size = 2,
alpha = 0.7
) +
theme_minimal()
# Total minutes awake vs time of week ---------------------------
ggplot(data = sleep_data) +
geom_point(
aes(
x = weekdays.POSIXt(activity_date),
y = total_awake_merged,
color = as.factor(time_of_week)
),
position = "jitter",
alpha = 0.3
) +
labs(title = "Total minutes awake throughout week") +
guides(color = "none") +
ylab("Minutes awake") +
scale_x_discrete(
"Day",
labels = labels_weekdays,
limits = limits_weekdays,
guide = guide_axis(angle = 45)
) +
stat_summary(
aes(x = weekdays.POSIXt(activity_date), y = total_awake_merged),
fun = mean,
geom = "point",
color = "red",
size = 2,
alpha = 0.7
) +
theme_minimal()
Based on the averages for each day on these graphs (red dots), we can see that:
People do seem to get sleep more on the weekends. However, it doesn’t look like the quality of sleep is greater based minutes awake and restless.
Based on these initial explorations, the impact of active minutes seems like it could reveal deeper customer insights than total steps, calories or sleep. It would benefit us to merge these dataframes and explore that metric further.
Let’s start with combining daily_activity
, sleep_data
, and weight_log
into one daily_data
dataframe. We will disregard the daily_calories
, daily_intensities
, and daily_steps
since their data is already aggregated within the the daily_activity
dataframe:
# Merging daily data into one df ---------------------------
daily_data_0 <-
merge(
x = daily_activity, y = sleep_data,
by = c("id", "activity_date_ymd", "day_of_week", "time_of_week"),
all.x = TRUE
)
daily_data <-
merge(
x = daily_data_0, y = weight_log,
by = c("id", "activity_date_ymd", "day_of_week", "time_of_week"), all.x = TRUE
)
# Add two new active minute variables to daily data ---------------------------
daily_data <-
daily_data %>%
mutate(
daily_mod_vig_m = fairly_active_m + very_active_m,
daily_light_sed_m = lightly_active_m + sedentary_m
)
Now we’ll combine hourly_calories
, hourly_intensities
, and hourly_steps
into one hourly_data
dataframe:
# Merging hourly data into one df ---------------------------
hourly_data_0 <-
merge(
x = hourly_calories, y = hourly_intensities,
by = c(
"id", "activity_hour", "activity_date_ymd", "activity_time", "hour_of_day",
"day_of_week", "time_of_week"
),
all.x = TRUE, no.dups = TRUE
)
hourly_data <-
merge(
x = hourly_data_0, y = hourly_steps,
by = c(
"id", "activity_hour", "activity_date_ymd", "activity_time", "hour_of_day",
"day_of_week", "time_of_week"
),
all.x = TRUE
)
Finally, we’ll create a weekly_data
dataframe using data from daily data:
# Create new weekly_data_00 df ---------------------------
weekly_data_0 <-
daily_data %>%
select(
id, activity_date_ymd, day_of_week, activity_date, total_steps, total_distance,
very_active_d, moderately_active_d, light_active_d,
sedentary_d, very_active_m, fairly_active_m,
lightly_active_m, sedentary_m, calories
) %>%
mutate(
week_number = paste0(
"Week ", (as.numeric(format(activity_date_ymd, "%U"))), ", ",
format(activity_date_ymd, "%b %Y")
)
)
# Summarize totals by week in new weekly_data df ---------------------------
weekly_data <-
weekly_data_0 %>%
group_by(id, week_number) %>%
dplyr::summarize(
weekly_total_steps = sum(total_steps),
weekly_total_distance = sum(total_distance),
weekly_vigorous_d = sum(very_active_d),
weekly_moderate_d = sum(moderately_active_d),
weekly_light_d = sum(light_active_d),
weekly_sedentary_d = sum(sedentary_d),
weekly_vigorous_m = sum(very_active_m),
weekly_moderate_m = sum(fairly_active_m),
weekly_mod_vig_m = weekly_vigorous_m + weekly_moderate_m,
weekly_light_m = sum(lightly_active_m),
weekly_sedentary_m = sum(sedentary_m),
weekly_light_sed_m = weekly_light_m + weekly_sedentary_m,
weekly_calories = sum(calories)
)
Before continuing with our analysis on these merged dataframes, let’s do some more data cleaning.
Let’s free up some RAM space by deleting old dataframes that we’ll no longer need or aren’t using moving forward with our analysis:
# List all objects in current R workspace ---------------------------
ls()
[1] "daily_activity" "daily_calories" "daily_data"
[4] "daily_data_0" "daily_intensities" "daily_sleep"
[7] "daily_steps" "encoding" "hourly_calories"
[10] "hourly_data" "hourly_data_0" "hourly_intensities"
[13] "hourly_steps" "input_file" "labels_weekdays"
[16] "limits_weekdays" "minute_sleep" "out_dir"
[19] "sleep_data" "sleep_summary" "sleep_summary_0"
[22] "weekly_data" "weekly_data_0" "weight_log"
# Remove old objects ---------------------------
# we'll no longer need or use moving forward
rm("daily_calories", "daily_intensities", "daily_steps", "daily_sleep",
"weight_log")
rm("hourly_calories", "hourly_intensities", "hourly_steps", "minute_sleep")
rm("daily_data_0", "hourly_data_0", "sleep_summary_0", "weekly_data_0")
# List all objects again in current R workspace ---------------------------
ls()
[1] "daily_activity" "daily_data" "encoding" "hourly_data"
[5] "input_file" "labels_weekdays" "limits_weekdays" "out_dir"
[9] "sleep_data" "sleep_summary" "weekly_data"
In order to create plots that are faceted by day of week or time of day in correct order, we’ll need to add a new variable to each of the merged dataframes:
# Create day_list for ordered facet grid/wrapping ---------------------------
hourly_data$day_list <-
factor(hourly_data$day_of_week, levels = c(
"Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"
))
daily_data$day_list <-
factor(daily_data$day_of_week, levels = c(
"Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"
))
# Create time_list for ordered facet grid/wrapping ---------------------------
hourly_data$time_list <-
factor(hourly_data$activity_time,
levels = c(
"12:00:00 AM", "01:00:00 AM", "02:00:00 AM", "03:00:00 AM",
"04:00:00 AM", "05:00:00 AM", "06:00:00 AM", "07:00:00 AM",
"08:00:00 AM", "09:00:00 AM", "10:00:00 AM", "11:00:00 AM",
"12:00:00 PM", "01:00:00 PM", "02:00:00 PM", "03:00:00 PM",
"04:00:00 PM", "05:00:00 PM", "06:00:00 PM", "07:00:00 PM",
"08:00:00 PM", "09:00:00 PM", "10:00:00 PM", "11:00:00 PM"
)
)
Before we launch into the new summary stats, we need to answer an important question:
Well, the answer to this question depends on how strongly the outliers are influencing the mean. We can visualize with histograms of each new dataframe:
# Daily histogram plot ---------------------------
daily_hplot <- ggplot(data = daily_data)
# Histogram plot for daily_mod_vig_m ---------------------------
(bw <- nclass.FD(daily_data$daily_mod_vig_m)) # Optimize binwidth for each plot
[1] 24
daily_hplot +
geom_histogram(aes(x = daily_mod_vig_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Daily moderate/vigorous minutes",
caption = "Median: 21.00 , Mean: 34.87"
)
# Histogram plot for daily_light_sed_m ---------------------------
(bw <- nclass.FD(daily_data$daily_light_sed_m))
[1] 15
daily_hplot +
geom_histogram(aes(x = daily_light_sed_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Daily light/sedentary minutes",
caption = "Median: 1333 , Mean: 1183"
)
# Histogram plot for total_steps ---------------------------
(bw <- nclass.FD(daily_data$total_steps))
[1] 26
daily_hplot +
geom_histogram(aes(x = total_steps), bins = bw) +
labs(
x = "Total Steps",
y = "Count",
title = "Daily total steps",
caption = "Median: 7439 , Mean: 7652"
)
# Histogram plot for calories ---------------------------
(bw <- nclass.FD(daily_data$calories))
[1] 25
daily_hplot +
geom_histogram(aes(x = calories), bins = bw) +
labs(
x = "Calories",
y = "Count",
title = "Daily total calories",
caption = "Median: 2140 , Mean: 2308"
)
# Histogram plot for total_minutes_asleep ---------------------------
daily_hplot +
geom_histogram(aes(x = total_minutes_asleep)) +
labs(
x = "Total Minutes Asleep",
y = "Count",
title = "Daily minutes asleep",
caption = "Median: 433.0 , Mean: 419.5"
)
Warning: Removed 530 rows containing non-finite values (stat_bin).
# Histogram plot for total_time_in_bed ---------------------------
daily_hplot +
geom_histogram(aes(x = total_time_in_bed)) +
labs(
x = "Total Time in Bed",
y = "Count",
title = "Daily total time in bed",
caption = "Median: 463.0 , Mean: 458.6"
)
Warning: Removed 530 rows containing non-finite values (stat_bin).
# Remove bindwith object ---------------------------
rm(bw)
Since a few outliers are greatly affecting the mean for each variable, we should use median for all these daily summary statistics: daily moderate/vigorous minutes, daily light/sedentary minutes, and daily total steps. However, we can still use mean for the sleep variables and calories.
# Hourly histogram plots ---------------------------
hourly_hplot <- ggplot(data = hourly_data)
# Histogram plot for calories ---------------------------
(bw <- nclass.FD(hourly_data$calories)) # Optimize binwidth for each plot
[1] 283
hourly_hplot +
geom_histogram(aes(x = calories), bins = bw) +
labs(
title = "Hourly calories",
caption = "Median: 83.00 , Mean: 97.39",
x = "Calories",
y = "Count"
)
# Histogram plot for total_intensity ---------------------------
(bw <- nclass.FD(hourly_data$total_intensity))
[1] 158
hourly_hplot +
geom_histogram(aes(x = total_intensity), bins = bw) +
labs(
title = "Hourly total intensity",
caption = "Median: 3.00, Mean: 12.04",
x = "Total Intensity",
y = "Count"
)
# Histogram plot for average_intensity ---------------------------
(bw <- nclass.FD(hourly_data$average_intensity))
[1] 158
hourly_hplot +
geom_histogram(aes(x = average_intensity), bins = bw) +
labs(
title = "Hourly average intensity",
caption = "Median: 0.0500, Mean: 0.2006",
x = "Average Intensity",
y = "Count"
)
# Histogram plot for step_total ---------------------------
(bw <- nclass.FD(hourly_data$step_total))
[1] 415
hourly_hplot +
geom_histogram(aes(x = step_total), bins = bw) +
labs(
x = "Step Total",
y = "Count",
title = "Hourly step total",
caption = "Median: 40.0 , Mean: 320.2"
)
# Remove bindwith object ---------------------------
rm(bw)
It looks like we should use the median for all the hourly summary statistics as well: hourly calories, hourly total intensity, hourly average intensity and hourly step total.
# Weekly histogram plots ---------------------------
weekly_hplot <- ggplot(data = weekly_data)
# Histogram plot for weekly_vigorous_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_vigorous_m)) # Optimize binwidth for each plot
[1] 12
weekly_hplot +
geom_histogram(aes(x = weekly_vigorous_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Weekly vigorous minutes",
caption = "Median: 68.50 , Mean: 130.10"
)
# Histogram plot for weekly_moderate_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_moderate_m))
[1] 15
weekly_hplot +
geom_histogram(aes(x = weekly_moderate_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Weekly moderate minutes",
caption = "Median: 65.50 , Mean: 83.45"
)
# Histogram plot for weekly_light_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_light_m))
[1] 10
weekly_hplot +
geom_histogram(aes(x = weekly_light_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Weekly light minutes",
caption = "Median: 1188 , Mean: 1182"
)
# Histogram plot for weekly_sedentary_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_sedentary_m))
[1] 9
weekly_hplot +
geom_histogram(aes(x = weekly_sedentary_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Weekly sedentary minutes",
caption = "Median: 5754 , Mean: 6064"
)
# Histogram plot for weekly_mod_vig_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_mod_vig_m))
[1] 11
weekly_hplot +
geom_histogram(aes(x = weekly_mod_vig_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Weekly moderate/vigorous minutes",
caption = "Median: 182.00 , Mean: 213.51"
)
# Histogram plot for weekly_light_sed_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_light_sed_m))
[1] 7
weekly_hplot +
geom_histogram(aes(x = weekly_light_sed_m), bins = bw) +
labs(
x = "Minutes",
y = "Count",
title = "Weekly light/sedentary minutes",
caption = "Median: 6956 , Mean: 7246"
)
# Histogram plot for weekly_total_steps ---------------------------
(bw <- nclass.FD(weekly_data$weekly_total_steps))
[1] 10
weekly_hplot +
geom_histogram(aes(x = weekly_total_steps), bins = bw) +
labs(
x = "Total Steps",
y = "Count",
title = "Weekly total steps",
caption = "Median: 44903 , Mean: 46857"
)
# Histogram plot for weekly_calories ---------------------------
(bw <- nclass.FD(weekly_data$weekly_calories))
[1] 10
weekly_hplot +
geom_histogram(aes(x = weekly_calories), bins = bw) +
labs(
x = "Calories",
y = "Count",
title = "Weekly calories",
caption = "Median: 13615 , Mean: 14130"
)
# Remove bindwith object ---------------------------
rm(bw)
It looks like we should use median for three of the weekly summary statistics: weekly vigorous minutes, weekly moderate minutes, and weekly moderate/vigorous minutes. We should use mean for the remaining five: weekly sedentary minutes, weekly light minutes, weekly light/sedentary minutes, weekly total steps and weekly calories.
Let’s look at the quick summary statistics we’d want to know about each new dataframe:
# Daily steps, distance and calories ---------------------------
daily_data %>%
select(
total_steps,
total_distance,
calories
) %>%
summary()
total_steps total_distance calories
Min. : 0 Min. : 0.000 Min. : 0
1st Qu.: 3795 1st Qu.: 2.620 1st Qu.:1830
Median : 7439 Median : 5.260 Median :2140
Mean : 7652 Mean : 5.503 Mean :2308
3rd Qu.:10734 3rd Qu.: 7.720 3rd Qu.:2796
Max. :36019 Max. :28.030 Max. :4900
# Active minute levels per category ---------------------------
daily_data %>%
select(
very_active_m,
fairly_active_m,
lightly_active_m,
sedentary_m,
daily_mod_vig_m,
daily_light_sed_m
) %>%
summary()
very_active_m fairly_active_m lightly_active_m sedentary_m
Min. : 0.00 Min. : 0.00 Min. : 0 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127 1st Qu.: 729.0
Median : 4.00 Median : 7.00 Median :199 Median :1057.0
Mean : 21.24 Mean : 13.63 Mean :193 Mean : 990.4
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264 3rd Qu.:1229.0
Max. :210.00 Max. :143.00 Max. :518 Max. :1440.0
daily_mod_vig_m daily_light_sed_m
Min. : 0.00 Min. : 2
1st Qu.: 0.00 1st Qu.: 947
Median : 21.00 Median :1333
Mean : 34.87 Mean :1183
3rd Qu.: 57.00 3rd Qu.:1432
Max. :275.00 Max. :1440
# Daily overall sleep ---------------------------
daily_data %>%
select(
total_sleep_records,
total_minutes_asleep,
total_time_in_bed
) %>%
summary()
total_sleep_records total_minutes_asleep total_time_in_bed
Min. :1.000 Min. : 58.0 Min. : 61.0
1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
Median :1.000 Median :433.0 Median :463.0
Mean :1.119 Mean :419.5 Mean :458.6
3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
Max. :3.000 Max. :796.0 Max. :961.0
NA's :530 NA's :530 NA's :530
# Daily sleep by category ---------------------------
daily_data %>%
select(
total_asleep_merged,
total_restless_merged,
total_awake_merged
) %>%
summary()
total_asleep_merged total_restless_merged total_awake_merged
Min. : 35.0 Min. : 0.00 Min. : 0.000
1st Qu.:353.0 1st Qu.: 13.00 1st Qu.: 1.000
Median :420.0 Median : 21.00 Median : 3.000
Mean :413.2 Mean : 32.76 Mean : 4.208
3rd Qu.:485.0 3rd Qu.: 32.00 3rd Qu.: 5.000
Max. :778.0 Max. :227.00 Max. :110.000
NA's :530 NA's :530 NA's :530
# Weight logs ---------------------------
daily_data %>%
select(
weight_lb,
fat,
bmi
) %>%
summary()
weight_lb fat bmi
Min. :116.0 Min. :22.00 Min. :21.45
1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96
Median :137.8 Median :23.50 Median :24.39
Mean :158.8 Mean :23.50 Mean :25.19
3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
Max. :294.3 Max. :25.00 Max. :47.54
NA's :876 NA's :941 NA's :876
# Hourly calories, intensities, step totals ---------------------------
hourly_data %>%
select(
calories,
total_intensity,
average_intensity,
step_total
) %>%
summary()
calories total_intensity average_intensity step_total
Min. : 42.00 Min. : 0.00 Min. :0.0000 Min. : 0.0
1st Qu.: 63.00 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 0.0
Median : 83.00 Median : 3.00 Median :0.0500 Median : 40.0
Mean : 97.39 Mean : 12.04 Mean :0.2006 Mean : 320.2
3rd Qu.:108.00 3rd Qu.: 16.00 3rd Qu.:0.2667 3rd Qu.: 357.0
Max. :948.00 Max. :180.00 Max. :3.0000 Max. :10554.0
# Weekly data for original active minute totals ---------------------------
# vigorous, moderate, light and sedentary
weekly_data %>%
ungroup() %>%
select(
weekly_vigorous_m,
weekly_moderate_m,
weekly_light_m,
weekly_sedentary_m
) %>%
summary()
weekly_vigorous_m weekly_moderate_m weekly_light_m weekly_sedentary_m
Min. : 0.0 Min. : 0.00 Min. : 0 Min. : 642
1st Qu.: 8.0 1st Qu.: 17.00 1st Qu.: 802 1st Qu.: 4778
Median : 68.5 Median : 65.50 Median :1188 Median : 5754
Mean :130.1 Mean : 83.45 Mean :1182 Mean : 6064
3rd Qu.:182.5 3rd Qu.:122.00 3rd Qu.:1554 3rd Qu.: 7749
Max. :775.0 Max. :549.00 Max. :2590 Max. :10080
# Weekly data for new combo active minute totals ---------------------------
# moderate/vigorous and light/sedentary
weekly_data %>%
ungroup() %>%
select(
weekly_mod_vig_m,
weekly_light_sed_m
) %>%
summary()
weekly_mod_vig_m weekly_light_sed_m
Min. : 0.00 Min. : 1014
1st Qu.: 36.25 1st Qu.: 6029
Median : 182.00 Median : 6956
Mean : 213.51 Mean : 7246
3rd Qu.: 303.00 3rd Qu.: 9543
Max. :1058.00 Max. :10080
# Weekly data for total steps and calories ---------------------------
weekly_data %>%
ungroup() %>%
select(
weekly_total_steps,
weekly_calories
) %>%
summary()
weekly_total_steps weekly_calories
Min. : 0 Min. : 1237
1st Qu.: 30671 1st Qu.:10607
Median : 44903 Median :13615
Mean : 46857 Mean :14130
3rd Qu.: 63219 3rd Qu.:17529
Max. :116758 Max. :26101
What does this tell us about this sample of people’s weekly activities? A few interesting notes:
Let’s look at some possible relationships between hour of day and different variables. We’ll create labels_hour_of_day
to use for these plots:
# Create labels for hour of day ---------------------------
# Note that we are assigning times to the current date of analysis to ensure
# that all times are within the same 24 hour day
labels_hour_of_day <- c(
"2022-01-18 00:00:00" = "00:00",
"2022-01-18 01:00:00" = "01:00",
"2022-01-18 02:00:00" = "02:00",
"2022-01-18 03:00:00" = "03:00",
"2022-01-18 04:00:00" = "04:00",
"2022-01-18 05:00:00" = "05:00",
"2022-01-18 06:00:00" = "06:00",
"2022-01-18 07:00:00" = "07:00",
"2022-01-18 08:00:00" = "08:00",
"2022-01-18 09:00:00" = "09:00",
"2022-01-18 10:00:00" = "10:00",
"2022-01-18 11:00:00" = "11:00",
"2022-01-18 12:00:00" = "12:00",
"2022-01-18 13:00:00" = "13:00",
"2022-01-18 14:00:00" = "14:00",
"2022-01-18 15:00:00" = "15:00",
"2022-01-18 16:00:00" = "16:00",
"2022-01-18 17:00:00" = "17:00",
"2022-01-18 18:00:00" = "18:00",
"2022-01-18 19:00:00" = "19:00",
"2022-01-18 20:00:00" = "20:00",
"2022-01-18 21:00:00" = "21:00",
"2022-01-18 22:00:00" = "22:00",
"2022-01-18 23:00:00" = "23:00",
"2022-01-18 24:00:00" = "24:00"
)
What hours of the day have the highest observations per data point? Are there certain times where people are most active?
# step_total observations per hour ---------------------------
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
geom_bar(aes(weight = step_total), stat = "count") +
labs(title = "'step_total' observations throughout a 24 hour day") +
scale_y_continuous(name = "Count", labels = scales::comma) +
scale_x_discrete(
"Hour of day",
labels = labels_hour_of_day,
guide = guide_axis(angle = 45)
) +
theme_minimal()
# calories observations per hour ---------------------------
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
geom_bar(aes(weight = calories), stat = "count") +
labs(title = "'calories' observations throughout a 24 hour day") +
scale_y_continuous(name = "Count", labels = scales::comma) +
scale_x_discrete(
"Hour of day",
labels = labels_hour_of_day,
guide = guide_axis(angle = 45)
) +
theme_minimal()
# average_intensity observations per hour ---------------------------
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
geom_bar(aes(weight = average_intensity), stat = "count") +
labs(title = "'average_intensity' observations throughout a 24 hour day") +
scale_y_continuous(name = "Count", labels = scales::comma) +
scale_x_discrete(
"Hour of day",
labels = labels_hour_of_day,
guide = guide_axis(angle = 45)
) +
theme_minimal()
All three variables — step_total
, calories
, average_intensity
— have similar shapes. The counts are lowest from 12:00-4:00 AM (00:00-04:00), which is when most people are asleep. The two time windows with highest counts of observations are the same across all three graphs:
What’s the relationship between time and average intensity? For example, are people’s activities more intense in the morning vs. the evening? How does this info compare with the observation counts from the plots above?
# Average intensity throughout the day ---------------------------
average_intensity_p <-
ggplot(
data = hourly_data,
aes(x = time_list, y = average_intensity, color = average_intensity)
)
average_intensity_p +
geom_jitter(width = 0.3, size = 0.2) +
labs(
x = "Hour",
y = "Average intensity",
color = "Average Intensity",
title = "Average intensity throughout the day"
) +
stat_summary(
aes(x = time_list, y = average_intensity),
fun = mean,
geom = "point",
color = "white",
size = 1,
alpha = 0.7
) +
scale_color_viridis_c() +
theme_solarized(light = FALSE) +
theme(
axis.title.x = element_text(
margin = margin(5, 0, 5, 0),
size = 12
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
plot.title = element_text(
margin = margin(10, 0, 10, 0),
size = 14
),
axis.text.x = element_text(
angle = 45,
vjust = 1,
hjust = 1,
size = 8
)
)
As you might expect, the average intensity begins to rise around 5:00 AM and stays relatively high throughout the workday before lowering around 8:00 PM. This is evident by looking at the mean for each hour, which is represented by the white dots in each column. The peak times look to be the same as what we saw earlier for observation counts:
How about day-to-day?
# Average intensity throughout the week ---------------------------
average_intensity_p <-
ggplot(
data = hourly_data,
aes(
x = reorder(day_list, desc(day_list)),
y = average_intensity,
color = average_intensity
)
)
average_intensity_p +
geom_jitter(width = 0.3, size = 0.1) +
labs(
x = "Day",
y = "Average intensity",
color = "Average Intensity",
title = "Average intensity throughout the week"
) +
stat_summary(
aes(
x = reorder(day_list, desc(day_list)),
y = average_intensity
),
fun = mean,
geom = "point",
color = "white",
size = 1,
alpha = 0.7
) +
scale_color_viridis_c() +
theme_solarized(light = FALSE) +
theme(
axis.title.x = element_text(
margin = margin(5, 0, 5, 0),
size = 12
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
plot.title = element_text(
margin = margin(10, 0, 10, 0),
size = 14
),
axis.text.x = element_text(
size = 12
)
) +
coord_flip()
Unlike the hour-to-hour observations, there isn’t as clear a pattern day-to-day.
This is once again evident my looking at the mean for each day, which is represented by the white dots in each column. The difference in position for each dot is minuscule here.
Let’s go back to intensity and hour of day. Are the trends similar when we look at the weekend vs. week?
# average_intensity throughout a 24 hour day ---------------------------
hd_A <- dplyr::select(hourly_data, -time_of_week)
ggplot(
data = hourly_data,
aes(x = as.factor(hour_of_day), y = average_intensity)
) +
geom_jitter(data = hd_A, color = "grey70", width = 0.4, alpha = 0.6, size = 2) +
geom_jitter(aes(color = time_of_week), width = 0.4, alpha = 0.6, size = 2) +
scale_color_viridis_d(begin = 0, end = .5) +
labs(
title = "Average intensity throughout a 24 hour day",
x = "Hour"
) +
ylab("Average intensity") +
scale_x_discrete(
"Hour of day",
expand = c(0.0002, 0.0002),
labels = labels_hour_of_day,
guide = guide_axis(angle = 45)
) +
stat_summary(
aes(x = as.factor(hour_of_day), y = average_intensity),
fun = mean,
geom = "point",
color = "#fde725",
size = 4
) +
theme_pander() +
theme(
legend.position = "none",
axis.title.x = element_text(
margin = margin(15, 0, 10, 0),
size = 16
),
axis.title.y = element_text(
margin = margin(0, 10, 0, 10),
size = 16
),
plot.title = element_text(
margin = margin(10, 0, 10, 0),
size = 20
),
strip.text.x = element_text(
margin = margin(10, 0, 10, 0),
size = 14
),
axis.text.x = element_text(
size = 12
),
axis.text.y = element_text(
size = 12
)
) +
facet_wrap(vars(time_of_week), nrow = 2)
# total_intensity throughout a 24 hour day ---------------------------
jitter_pos <-
position_jitter(width = 0.25, height = 0, seed = 42)
set.seed(42)
hd_A <-
dplyr::select(hourly_data, -time_of_week)
set.seed(42)
hd_total_intensity_p <-
ggplot(
data = hourly_data,
aes(x = as.factor(hour_of_day), y = total_intensity, color = time_of_week)
)
# total_intensity throughout a 24 hour day ---------------------------
hd_total_intensity_p +
geom_point(
data = transform(hourly_data, time_of_week = NULL),
color = "grey70",
position = position_jitter(width = 0.4, height = 0, seed = 42),
size = 2
) +
geom_point(
position = position_jitter(width = 0.4, height = 0, seed = 42),
size = 2,
alpha = 0.6
) +
labs(
title = "Total intensity throughout a 24 hour day",
x = "Hour"
) +
ylab("Total intensity") +
scale_color_viridis_d(begin = 0, end = .5, guide = "legend") +
scale_x_discrete(
"Hour of day",
expand = c(0.0002, 0.0002),
labels = labels_hour_of_day,
guide = guide_axis(angle = 45)
) +
stat_summary(
aes(x = as.factor(hour_of_day), y = total_intensity),
fun = mean,
geom = "point",
color = "#fde725",
size = 4
) +
theme_pander() +
theme(
legend.position = "none",
axis.title.x = element_text(
margin = margin(15, 0, 10, 0),
size = 16
),
axis.title.y = element_text(
margin = margin(0, 10, 0, 10),
size = 16
),
plot.title = element_text(
margin = margin(10, 0, 10, 0),
size = 20
),
strip.text.x = element_text(
margin = margin(10, 0, 10, 0),
size = 14
),
axis.text.x = element_text(
size = 12
),
axis.text.y = element_text(
size = 12
)
) +
facet_wrap(vars(time_of_week), nrow = 2)
How about active minutes vs. sedentary minutes throughout the week?
# Create theme for plot formatting ---------------------------
theme_margin <-
theme(
axis.title.x = element_text(margin = margin(15, 0, 15, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 10)),
plot.title = element_text(margin = margin(10, 0, 10, 0)),
strip.text.x = element_text(margin = margin(10, 0, 10, 0)),
)
# Very active minutes vs sedentary minutes ---------------------------
dd02_A <- dplyr::select(daily_data, -time_of_week)
ggplot(daily_data, aes(x = very_active_m, y = sedentary_m)) +
geom_jitter(data = dd02_A, color = "grey70") +
geom_jitter(
aes(x = very_active_m, y = sedentary_m),
color = "#112E51",
width = 0.4,
size = 1
) +
labs(title = "Very active minutes vs sedentary minutes throughout the week") +
ylab("Sedentary minutes") +
xlab("Very active minutes") +
geom_smooth(
aes(x = very_active_m, y = sedentary_m),
color = "#00CFFF",
linetype = "longdash",
alpha = 0.4,
se = FALSE
) +
theme_minimal() +
theme_margin +
facet_grid(~time_of_week ~ .)
# Fairly active minutes vs sedentary minutes ---------------------------
dd02_A <- dplyr::select(daily_data, -time_of_week)
ggplot(daily_data, aes(x = fairly_active_m, y = sedentary_m)) +
geom_jitter(data = dd02_A, color = "grey70") +
geom_jitter(
aes(x = fairly_active_m, y = sedentary_m),
color = "#112E51",
width = 0.4,
size = 1
) +
labs(
title = "Fairly active minutes vs sedentary minutes throughout the week"
) +
ylab("Sedentary minutes") +
xlab("Fairly active minutes") +
geom_smooth(
aes(x = fairly_active_m, y = sedentary_m),
color = "#00CFFF",
linetype = "longdash",
alpha = 0.6,
se = FALSE
) +
theme_minimal() +
theme_margin +
facet_grid(~time_of_week ~ .)
# Lightly active minutes vs sedentary minutes ---------------------------
dd02_A <- dplyr::select(daily_data, -time_of_week)
ggplot(daily_data, aes(x = lightly_active_m, y = sedentary_m)) +
geom_jitter(data = dd02_A, color = "grey70") +
geom_jitter(
aes(x = lightly_active_m, y = sedentary_m),
color = "#112E51",
width = 0.4,
size = 1
) +
labs(
title = "Lightly active minutes vs sedentary minutes throughout the week"
) +
ylab("Sedentary minutes") +
xlab("Lightly active minutes") +
geom_smooth(
aes(x = lightly_active_m, y = sedentary_m),
color = "#00CFFF",
linetype = "longdash",
alpha = 0.6,
se = FALSE
) +
theme_minimal() +
theme_margin +
facet_grid(~time_of_week ~ .)
Of all the metrics we’ve explored thus far, active minutes is the one we’re going to focus on moving forward. We’ll use this section to generate our supporting visualizations and key findings. Then, we’ll provide our top high-level content recommendations based on this analysis to our key stakeholders.
Experts recommend adults get at least 150 to 300 minutes of moderate to vigorous aerobic activity per week (see: Every move counts towards better health – says WHO). From the summary analysis above, we can see that the median amount of moderate/vigorous minutes per week these participants are getting is 182 minutes. How many minutes per week are participants actually meeting this criteria? How can we encourage this more?
Let’s look at the 154 individual weekly activity totals to gauge activity levels across the participant group. We’ll start with weekly_moderate_m
totals:
# Create labels for week-month-year ---------------------------
labels_wmy <- c(
"Week 15\nApr-16", "Week 16\nApr-16", "Week 17\nApr-16", "Week 18\nMay-16",
"Week 19\nMay-16"
)
# create df to use in next plot ---------------------------
weekly_moderate_150_300 <- as.data.frame(
weekly_data %>%
group_by(
weekly_moderate_m < 150,
(weekly_moderate_m >= 150 & weekly_moderate_m < 300),
weekly_moderate_m >= 300
) %>%
tally()
)
weekly_moderate_150_300_table <-
cbind(c("300+ ", "150 - 300 ", "0 - 150 "), weekly_moderate_150_300[, 4])
# Create weekly_moderate_p ---------------------------
weekly_moderate_p <-
ggplot(weekly_data, aes(x = week_number, y = weekly_moderate_m))
# Add layers to weekly_moderate_p ---------------------------
weekly_moderate_p +
geom_point(
data = weekly_data[which(weekly_data$weekly_moderate_m >= 300), ],
aes(color = "300+ "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_moderate_m >= 150 & weekly_data$weekly_moderate_m < 300), ],
aes(color = "150 - 300 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_moderate_m < 150), ],
aes(color = "0 - 150 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
scale_x_discrete(name = NULL, labels = labels_wmy) +
scale_y_continuous(
breaks = c(0, 150, 300, 450, 600, 750, 900, 1050),
limits = c(0, 1060)) +
scale_color_manual(
name = "Minutes",
labels = c("300+ ", "150 - 300 ", "0 - 150 "),
values = c(
"300+ " = "#225ea8",
"150 - 300 " = "#41b6c4",
"0 - 150 " = "#707070"
)
) +
labs(
title = "Moderate activity",
subtitle = "(minutes per week)"
) +
geom_hline(yintercept = 150, linetype = "dashed") +
geom_hline(yintercept = 300, linetype = "dashed") +
ylab("Activity minutes") +
theme_classic() +
theme(
plot.subtitle = element_text(
margin = margin(1, 5, 5, 0),
size = 10,
hjust = 1,
face = "italic"
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
plot.title = element_text(
margin = margin(5, 5, 5, 0),
size = 14,
hjust = 1,
face = "bold"
),
legend.position = "bottom"
) +
guides(color = guide_legend(override.aes = list(size = 3))) +
# Add summary table to weekly_moderate_p ---------------------------
coord_cartesian(clip = "off") +
theme(
plot.margin = margin(1, 25, 1, 1)
) +
annotation_custom(
grob = tableGrob(weekly_moderate_150_300_table, theme = ttheme_minimal(
base_size = 9,
base_color = "black",
base_family = "",
parse = FALSE,
padding = unit(c(4, 4), "mm"),
core = list(
bg_params = list(fill = c("#225ea8", "#41b6c4", "#707070"),
alpha = .3,
col = NA),
fg_params = list(hjust = 0, x = 0.15, fontface = c(rep("plain", 3))),
colhead = list(fg_params = list(hjust = 0, x = 0.15))
),
vp = NULL
)),
xmin = 4.5, xmax = 6, ymin = 600, ymax = 1050
)
Now let’s look at weekly_vigorous_m
totals:
# Create df to use in next plot ---------------------------
weekly_vigorous_75_150 <-
as.data.frame(
weekly_data %>%
group_by(
weekly_vigorous_m < 75,
(weekly_vigorous_m >= 75 & weekly_vigorous_m < 150),
weekly_vigorous_m >= 150
) %>%
tally()
)
weekly_vigorous_75_150_table <-
cbind(c("150+ ", "75 - 150 ", "0 - 75 "), weekly_vigorous_75_150[, 4])
# Create weekly_vigorous_p ---------------------------
weekly_vigorous_p <-
ggplot(weekly_data, aes(x = week_number, y = weekly_vigorous_m))
# Add layers to weekly_vigorous_p ---------------------------
weekly_vigorous_p +
geom_point(
data = weekly_data[which(weekly_data$weekly_vigorous_m >= 150), ],
aes(color = "150+ "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.4
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_vigorous_m >= 75 & weekly_data$weekly_vigorous_m < 150), ],
aes(color = "75 - 150 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.4
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_vigorous_m < 75), ],
aes(color = "0 - 75 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.4
) +
scale_x_discrete(name = NULL, labels = labels_wmy) +
scale_y_continuous(breaks = c(0, 75, 150, 300, 450, 600, 750, 900, 1050),
limits = c(0, 1060)) +
scale_color_manual(
name = "Minutes",
labels = c("150+ ", "75 - 150 ", "0 - 75 "),
values = c(
"150+ " = "#225ea8",
"75 - 150 " = "#41b6c4",
"0 - 75 " = "#707070"
)
) +
labs(
title = "Vigorous activity",
subtitle = "(minutes per week)"
) +
geom_hline(yintercept = 75, linetype = "dashed") +
geom_hline(yintercept = 150, linetype = "dashed") +
ylab("Activity minutes") +
theme_classic() +
theme(
plot.subtitle = element_text(
margin = margin(1, 5, 5, 0),
size = 10,
hjust = 1,
face = "italic"
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
plot.title = element_text(
margin = margin(5, 5, 5, 0),
size = 14,
hjust = 1,
face = "bold"
),
legend.position = "bottom"
) +
guides(color = guide_legend(override.aes = list(size = 3))) +
# Add summary table to weekly_vigorous_p ---------------------------
coord_cartesian(clip = "off") +
theme(
plot.margin = margin(1, 25, 1, 1)
) +
annotation_custom(
grob = tableGrob(
weekly_vigorous_75_150_table,
theme = ttheme_minimal(
base_size = 9,
base_color = "black",
base_family = "",
parse = FALSE,
padding = unit(c(4, 4), "mm"),
core = list(
bg_params = list(
fill = c("#225ea8", "#41b6c4", "#707070"),
alpha = .3,
col = NA),
fg_params = list(hjust = 0, x = 0.15, fontface = c(rep("plain", 3))),
colhead = list(fg_params = list(hjust = 0, x = 0.15))
),
vp = NULL
)
),
xmin = 4.5,
xmax = 6,
ymin = 600,
ymax = 1050
)
While the above graphs for moderate minutes and vigorous minutes are interesting, it would be more beneficial to combine these metrics together when considering the 150 to 300 minute goal for moderate to vigorous physical activity per week:
# Create df to use in next plot ---------------------------
weekly_moderate_vigorous_150_300 <-
as.data.frame(
weekly_data %>%
group_by(
weekly_mod_vig_m < 150,
(weekly_mod_vig_m >= 150 & weekly_mod_vig_m < 300),
weekly_mod_vig_m >= 300
) %>%
tally()
)
weekly_moderate_vigorous_150_300_table <-
cbind(c("300+ ", "150 - 300 ", "0 - 150 "), weekly_moderate_vigorous_150_300[, 4])
# Create weekly_moderate_vigorous_p ---------------------------
weekly_moderate_vigorous_p <-
ggplot(weekly_data, aes(x = week_number, y = weekly_mod_vig_m))
# Add layers to weekly_moderate_vigorous_p ---------------------------
weekly_moderate_vigorous_p +
geom_point(
data = weekly_data[which(weekly_data$weekly_mod_vig_m >= 300), ],
aes(color = "300+ "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_mod_vig_m >= 150 & weekly_data$weekly_mod_vig_m < 300), ],
aes(color = "150 - 300 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_mod_vig_m < 150), ],
aes(color = "0 - 150 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
scale_x_discrete(name = NULL, labels = labels_wmy) +
scale_y_continuous(breaks = c(0, 150, 300, 450, 600, 750, 900, 1050), limits = c(0, 1060)) +
scale_color_manual(
name = "Minutes",
labels = c("300+ ", "150 - 300 ", "0 - 150 "),
values = c(
"300+ " = "#225ea8",
"150 - 300 " = "#41b6c4",
"0 - 150 " = "#707070"
)
) +
labs(
title = "Combined moderate/vigorous activity",
subtitle = "(minutes per week)"
) +
geom_hline(yintercept = 150, linetype = "dashed") +
geom_hline(yintercept = 300, linetype = "dashed") +
ylab("Activity minutes") +
theme_classic() +
theme(
plot.subtitle = element_text(
margin = margin(1, 5, 5, 0),
size = 10,
hjust = 1,
face = "italic"
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
plot.title = element_text(
margin = margin(5, 5, 5, 0),
size = 14,
hjust = 1,
face = "bold"
),
legend.position = "bottom"
) +
guides(color = guide_legend(override.aes = list(size = 3))) +
# Add summary table to weekly_moderate_vigorous_p ---------------------------
coord_cartesian(clip = "off") +
theme(
plot.margin = margin(1, 25, 1, 1)
) +
annotation_custom(
grob = tableGrob(
weekly_moderate_vigorous_150_300_table,
theme = ttheme_minimal(
base_size = 9,
base_color = "black",
base_family = "",
parse = FALSE,
padding = unit(c(4, 4), "mm"),
core = list(
bg_params = list(
fill = c("#225ea8", "#41b6c4", "#707070"),
alpha = .3,
col = NA),
fg_params = list(hjust = 0,
x = 0.15,
fontface = c(rep("plain", 3))),
colhead = list(fg_params = list(hjust = 0, x = 0.15))
),
vp = NULL
)
),
xmin = 4.5,
xmax = 6,
ymin = 600,
ymax = 1050
)
ggsave(filename = "Moderate_vigorous_activity_per_week_color.png", path = "images")
# Create basic graph showing 154 totals in grey ---------------------------
weekly_moderate_vigorous_p <-
ggplot(weekly_data, aes(x = week_number, y = weekly_mod_vig_m))
# Add layers ---------------------------
weekly_moderate_vigorous_p +
geom_point(
data = weekly_data[which(weekly_data$weekly_mod_vig_m >= 300), ],
aes(color = "300+ "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_mod_vig_m >= 150 & weekly_data$weekly_mod_vig_m < 300), ],
aes(color = "150 - 300 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
geom_point(
data = weekly_data[which(weekly_data$weekly_mod_vig_m < 150), ],
aes(color = "0 - 150 "),
size = 2,
position = position_jitter(width = .25, height = 0),
alpha = 0.5
) +
scale_x_discrete(name = NULL, labels = labels_wmy) +
scale_y_continuous(breaks = c(0, 150, 300, 450, 600, 750, 900, 1050),
limits = c(0, 1060)) +
scale_color_manual(
name = "Minutes",
labels = c("300+ ", "150 - 300 ", "0 - 150 "),
values = c(
"300+ " = "#707070",
"150 - 300 " = "#707070",
"0 - 150 " = "#707070"
)
) +
labs(
title = "Combined moderate/vigorous activity",
subtitle = "(minutes per week)"
) +
ylab("Activity minutes") +
theme_classic() +
theme(
plot.subtitle = element_text(
margin = margin(1, 5, 5, 0),
size = 10,
hjust = 1,
face = "italic"
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
plot.title = element_text(
margin = margin(5, 5, 5, 0),
size = 14,
hjust = 1,
face = "bold"
),
legend.position = "none"
) +
guides(color = guide_legend(override.aes = list(size = 3)))
ggsave(filename = "Moderate_vigorous_activity_per_week_gs.png", path = "images")
There are some interesting discoveries here:
So, the largest percentage (45%) of these 154 weekly activity totals are less than 150 minutes. This trend in the physical activity of smart device users is fascinating. However, plotting the 154 weekly totals like this can only take us so far in learning about these users at an individual level. For further understanding, we need to delve deeper…
# Create labels and limits for week numbers ---------------------------
labels_week_n <- c("W\n1", "W\n2", "W\n3", "W\n4", "W\n5")
limits_week_n <- c(
"Week 15, Apr 2016", "Week 16, Apr 2016", "Week 17, Apr 2016",
"Week 18, May 2016", "Week 19, May 2016"
)
# Weekly_moderate_vigorous plot ---------------------------
# for all 33 participants
weekly_moderate_vigorous_p <-
weekly_data %>%
{
ggplot(.) +
geom_line(
aes(
x = as.factor(week_number),
y = weekly_mod_vig_m,
group = id,
color = as.factor(id)
),
linejoin = "mitre",
lineend = "butt",
show.legend = FALSE,
size = .5
) +
geom_point(
aes(
x = as.factor(week_number),
y = weekly_mod_vig_m,
group = id,
color = as.factor(id)
),
show.legend = FALSE,
size = 1) +
labs(
title = "Moderate/vigourous activity",
subtitle = "Weekly totals for 33 participants",
y = "Minutes"
) +
scale_x_discrete(
name = NULL,
limits = limits_week_n,
labels = labels_week_n,
expand = c(0, 0)
) +
scale_y_continuous(
breaks = c(0, 150, 300, 450, 600, 750, 900),
limits = c(0, 1060)
) +
scale_color_discrete() +
geom_hline(yintercept = 150, linetype = "dashed") +
geom_hline(yintercept = 300, linetype = "dashed") +
theme_minimal() +
facet_wrap(~id, drop = TRUE, nrow = 3, ncol = 11, scales = "fixed") +
theme(
plot.title = element_text(
margin = margin(5, 5, 5, 0),
size = 14,
hjust = 1,
face = "bold"
),
plot.subtitle = element_text(
margin = margin(1, 5, 5, 0),
size = 10,
hjust = 1,
face = "italic"
),
axis.title.y = element_text(
margin = margin(0, 5, 0, 5),
size = 12
),
panel.spacing.x = unit(.75, "lines")
)
}
ggsave(file = "weekly_activity_33.png", path = "images")
weekly_moderate_vigorous_p
The main takeaway for the above plot is to show we have 33 graphs for the 33 users. With this data, we can learn more about each user’s active minute totals week-to-week, which brings us to what I really want to discuss: Achievement Rate
Achievement Rate = percentage of weeks a participant is successful in achieving ≥ 150 moderate/vigorous activity minutes
How do these smart device users group across the below three achievement rates?
Group A: high weekly achievement rate
Group B: regular achievement rate
Group C: low weekly achievement rate
Let’s use personograph
to create icon charts that visualize these groups. We’ll apply their respective percentages to 100:
# weekly_achievement_groups plots ---------------------------
weekly_achievement_groups <-
c("Group A" = .455, "Group B" = .06, "Group C" = .485)
personograph(
data = weekly_achievement_groups,
fig.title = NULL,
fig.cap = NULL,
draw.legend = F,
icon = NULL,
icon.dim = NULL,
icon.style = 6,
n.icons = 100,
plot.width = 0.75,
dimensions = ceiling(sqrt(c(100, 100))),
fudge = 0.0075,
legend.show.zeros = TRUE,
force.fill = "ignore",
colors = c("Group A" = "#052049", "Group B" = "#007cbe","Group C" = "#058488")
)
png(file = "achievement_groups.png")
dev.off()
png
2
# Group a ---------------------------
weekly_achievement_group_a <- c("Group B and C" = .545, "Group A" = .455)
personograph(
data = weekly_achievement_group_a,
fig.title = "Group A: high weekly achievement rate (80-100%)",
fig.cap = NULL,
draw.legend = F,
icon = NULL,
icon.dim = NULL,
icon.style = 6,
n.icons = 100,
plot.width = 0.75,
dimensions = ceiling(sqrt(c(100, 100))),
fudge = 0.0075,
legend.show.zeros = TRUE,
force.fill = "ignore",
colors = c("Group B and C" = "grey", "Group A" = "#052049")
)
# Group b ---------------------------
weekly_achievement_group_b <- c("Group A and C" = .94, "Group B" = .06)
personograph(
data = weekly_achievement_group_b,
fig.title = "Group B: regular weekly achievement rate (60-79%)",
fig.cap = NULL,
draw.legend = F,
icon = NULL,
icon.dim = NULL,
icon.style = 6,
n.icons = 100,
plot.width = 0.75,
dimensions = ceiling(sqrt(c(100, 100))),
fudge = 0.0075,
legend.show.zeros = TRUE,
force.fill = "ignore",
colors = c("Group A and C" = "grey", "Group B" = "#007cbe")
)
# Group c ---------------------------
weekly_achievement_group_c <- c("Group A and B" = .515, "Group C" = .485)
personograph(
data = weekly_achievement_group_c,
fig.title = "Group C: low weekly achievement rate (0-59%)",
fig.cap = NULL,
draw.legend = F,
icon = NULL,
icon.dim = NULL,
icon.style = 6,
n.icons = 100,
plot.width = 0.75,
dimensions = ceiling(sqrt(c(100, 100))),
fudge = 0.0075,
legend.show.zeros = TRUE,
force.fill = "ignore",
colors = c("Group A and B" = "grey", "Group C" = "#058488")
)