Ask


About

Bellabeat is the go-to wellness brand for women with an ecosystem of products and services that focuses on women’s health. They empower women with knowledge about their own health and habits by collecting data on activity, sleep, stress, and reproductive health. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Key stakeholders

The key stakeholders for this project include:

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy

Guiding questions

These three questions will guide our analysis:

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

Business task

Analyze smart device fitness data to gain insight into consumers and identify trends that can inform Bellabeat App marketing strategy.

Prepare


Why R?

We are using R for our data analysis because it’s a powerful, all-in-one tool. We can explore all the datasets quickly, provide code, share visualizations, and keep everything organized all within this html file. This html file will serve as a record of the data cleaning process, data analysis and data visualization generation that preceded our stakeholder presentation.

Install and load packages

The first thing we’ll do is install the needed packages:

  • Install R packages
    • install.packages()
    • Packages: tidyverse, ggplot2, here , hms, janitor, lubridate, skimr,
      gridExtra, paletteer, crayon, stringr, patchwork, ggthemes, ggforce,
      personograph, ggbeeswarm, knitr

Then we’ll load the libraries for these packages:

  • Load R libraries
    • library()
    • Packages: tidyverse, ggplot2, here , hms, janitor, lubridate, skimr,
      gridExtra, paletteer, crayon, stringr, patchwork, ggthemes, ggforce,
      personograph, ggbeeswarm, knitr

Data sources

We’ll be using public data that explores smart device users’ daily habits:

  • FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal tracker data from thirty Fitbit users. These Fitbit users consented to submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

Important note: The FitBit dataset does have some limitations. Since the data is collected from thirty Fitbit users , it is not representative of the entire fitness tracker industry and thus suffers some sampling bias. We will address this limitation by using physical activity data from the World Health Organization (WHO) as a reference point for activity levels on a global scale. We’ll look at open data on global trends in physical activity levels. This data will be pulled from the WHO website and included in the final presentation. It won’t be used in this notebook, but we are including here for reference:

  • GHO’s data on prevalence of insufficient physical activity (CC BY-NC-SA 3.0 IGO: dataset made available through GHO): This WHO data set contains data on the prevalence of insufficient physical activity among adults aged 18+ years. The estimates are based on self-reported physical activity captured using questionnaires covering activity at work/in the household, for transport, and during leisure time. Since the WHO data is self-reported, it suffers from all the biases of self-reported data (i.e.  response bias, social-desirability bias).

Importing data

Back to the FitBit data: we’re going to explore data across fourteen CSV files. The data is organized in rows and columns, making it structured data. The majority of datasets are in long format, meaning each row is one time point per session ID. Each session ID has data in multiple rows. Individual reports can be parsed by export ID or time-stamp. These datasets fall into three different groups based on the time increments they were tracked by:

  1. Daily data
  2. Hourly data
  3. Minute data

By analyzing the daily, hourly, and minute level data provided, we can begin to explore trends and search for insights that can be applied to Bellabeat customers.

Let’s look at the file sizes prior to importing. We don’t want to overload our RAM in Rstudio Cloud:

dailyActivity_merged.csv = 89437 bytes
dailyCalories_merged.csv = 25125 bytes
dailyIntensities_merged.csv = 70581 bytes
dailySteps_merged.csv = 25175 bytes
sleepDay_merged.csv = 18100 bytes
weightLogInfo_merged.csv = 6725 bytes
hourlyCalories_merged.csv = 801486 bytes
hourlyIntensities_merged.csv = 898778 bytes
hourlySteps_merged.csv = 796562 bytes
minuteCaloriesNarrow_merged.csv = 66443854 bytes
minuteIntensitiesNarrow_merged.csv = 46358789 bytes
minuteMETsNarrow_merged.csv = 47686683 bytes
minuteSleep_merged.csv = 8848370 bytes
minuteStepsNarrow_merged.csv = 46531957 bytes

Based on file sizes, it looks like the minute level data could cause issues with RAM usage in RStudio Cloud. Let’s compare the size of the smallest minute level dataset to the total combined size of all nine daily/hourly datasets:

  • minuteSleep_merged.csv = 8,848,370 bytes

  • dailyActivity_merged.csv,
    dailyCalories_merged.csv,
    dailyIntensities_merged.csv,
    dailySteps_merged.csv,
    sleepDay_merged.csv,
    weightLogInfo_merged.csv,
    hourlyCalories_merged.csv,
    hourlyIntensities_merged.csv,
    hourlySteps_merged.csv = 2,753,820 bytes (2.75382^{6})

Even if we combine the bytes for all the daily and hourly datasets, it still doesn’t even come close in size to the smallest minute level dataset!

Let’s start with loading our daily and hourly datasets. We’ll create dataframes (df) for these now:

daily_activity <- read_csv("fitbit_data/dailyActivity_merged.csv")
daily_calories <- read_csv("fitbit_data/dailyCalories_merged.csv")
daily_intensities <- read_csv("fitbit_data/dailyIntensities_merged.csv")
daily_steps <- read_csv("fitbit_data/dailySteps_merged.csv")
daily_sleep <- read_csv("fitbit_data/sleepDay_merged.csv")
weight_log <- read_csv("fitbit_data/weightLogInfo_merged.csv")
hourly_calories <- read_csv("fitbit_data/hourlyCalories_merged.csv")
hourly_intensities <- read_csv("fitbit_data/hourlyIntensities_merged.csv")
hourly_steps <- read_csv("fitbit_data/hourlySteps_merged.csv")


Since the minute data is huge, we’re only going to import the smallest dataset:

minute_sleep <- read_csv("fitbit_data/minuteSleep_merged.csv")

Process


Cleaning data

Now that we’ve identified and loaded the ten datasets we’re going to use, we need to ensure that the data within each dataframe is clean and ready to analyze.

Create tibbles

Let’s take a closer look at the different dataframes. We’ll start by using the head() function to view the first few rows of each one:

Daily dataframes
`head(daily_activity)`  

```
# A tibble: 6 × 15
       Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
    <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
1  1.50e9 4/12/16           13162          8.5             8.5                 0
2  1.50e9 4/13/16           10735          6.97            6.97                0
3  1.50e9 4/14/16           10460          6.74            6.74                0
4  1.50e9 4/15/16            9762          6.28            6.28                0
5  1.50e9 4/16/16           12669          8.16            8.16                0
6  1.50e9 4/17/16            9705          6.48            6.48                0
# … with 9 more variables: VeryActiveDistance <dbl>,
#   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
#   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
#   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
#   SedentaryMinutes <dbl>, Calories <dbl>
```

`head(daily_calories)`  

```
# A tibble: 6 × 3
          Id ActivityDay Calories
       <dbl> <chr>          <dbl>
1 1503960366 4/12/2016       1985
2 1503960366 4/13/2016       1797
3 1503960366 4/14/2016       1776
4 1503960366 4/15/2016       1745
5 1503960366 4/16/2016       1863
6 1503960366 4/17/2016       1728
```

`head(daily_intensities)`  

```
# A tibble: 6 × 10
          Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinu…
       <dbl> <chr>                  <dbl>                <dbl>             <dbl>
1 1503960366 4/12/2016                728                  328                13
2 1503960366 4/13/2016                776                  217                19
3 1503960366 4/14/2016               1218                  181                11
4 1503960366 4/15/2016                726                  209                34
5 1503960366 4/16/2016                773                  221                10
6 1503960366 4/17/2016                539                  164                20
# … with 5 more variables: VeryActiveMinutes <dbl>,
#   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
#   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>
```

`head(daily_steps)`  

```
# A tibble: 6 × 3
          Id ActivityDay StepTotal
       <dbl> <chr>           <dbl>
1 1503960366 4/12/2016       13162
2 1503960366 4/13/2016       10735
3 1503960366 4/14/2016       10460
4 1503960366 4/15/2016        9762
5 1503960366 4/16/2016       12669
6 1503960366 4/17/2016        9705
```

`head(daily_sleep)`  

```
# A tibble: 6 × 5
          Id SleepDay          TotalSleepRecor… TotalMinutesAsle… TotalTimeInBed
       <dbl> <chr>                        <dbl>             <dbl>          <dbl>
1 1503960366 4/12/2016 12:00:…                1               327            346
2 1503960366 4/13/2016 12:00:…                2               384            407
3 1503960366 4/15/2016 12:00:…                1               412            442
4 1503960366 4/16/2016 12:00:…                2               340            367
5 1503960366 4/17/2016 12:00:…                1               700            712
6 1503960366 4/19/2016 12:00:…                1               304            320
```

`head(weight_log)`  

```
# A tibble: 6 × 8
          Id Date      WeightKg WeightPounds   Fat   BMI IsManualReport    LogId
       <dbl> <chr>        <dbl>        <dbl> <dbl> <dbl> <lgl>             <dbl>
1 1503960366 5/2/2016…     52.6         116.    22  22.6 TRUE            1.46e12
2 1503960366 5/3/2016…     52.6         116.    NA  22.6 TRUE            1.46e12
3 1927972279 4/13/201…    134.          294.    NA  47.5 FALSE           1.46e12
4 2873212765 4/21/201…     56.7         125.    NA  21.5 TRUE            1.46e12
5 2873212765 5/12/201…     57.3         126.    NA  21.7 TRUE            1.46e12
6 4319703577 4/17/201…     72.4         160.    25  27.5 TRUE            1.46e12
```
Hourly dataframes
`head(hourly_calories)`  

```
# A tibble: 6 × 3
          Id ActivityHour          Calories
       <dbl> <chr>                    <dbl>
1 1503960366 4/12/2016 12:00:00 AM       81
2 1503960366 4/12/2016 1:00:00 AM        61
3 1503960366 4/12/2016 2:00:00 AM        59
4 1503960366 4/12/2016 3:00:00 AM        47
5 1503960366 4/12/2016 4:00:00 AM        48
6 1503960366 4/12/2016 5:00:00 AM        48
```

`head(hourly_intensities`  

```
# A tibble: 6 × 4
          Id ActivityHour          TotalIntensity AverageIntensity
       <dbl> <chr>                          <dbl>            <dbl>
1 1503960366 4/12/2016 12:00:00 AM             20            0.333
2 1503960366 4/12/2016 1:00:00 AM               8            0.133
3 1503960366 4/12/2016 2:00:00 AM               7            0.117
4 1503960366 4/12/2016 3:00:00 AM               0            0    
5 1503960366 4/12/2016 4:00:00 AM               0            0    
6 1503960366 4/12/2016 5:00:00 AM               0            0    
```

`head(hourly_steps)`  

```
# A tibble: 6 × 3
          Id ActivityHour          StepTotal
       <dbl> <chr>                     <dbl>
1 1503960366 4/12/2016 12:00:00 AM       373
2 1503960366 4/12/2016 1:00:00 AM        160
3 1503960366 4/12/2016 2:00:00 AM        151
4 1503960366 4/12/2016 3:00:00 AM          0
5 1503960366 4/12/2016 4:00:00 AM          0
6 1503960366 4/12/2016 5:00:00 AM          0
```
Minute dataframe
`head(minute_sleep)`  

```
# A tibble: 6 × 4
          Id date                 value       logId
       <dbl> <chr>                <dbl>       <dbl>
1 1503960366 4/12/2016 2:47:30 AM     3 11380564589
2 1503960366 4/12/2016 2:48:30 AM     2 11380564589
3 1503960366 4/12/2016 2:49:30 AM     1 11380564589
4 1503960366 4/12/2016 2:50:30 AM     1 11380564589
5 1503960366 4/12/2016 2:51:30 AM     1 11380564589
6 1503960366 4/12/2016 2:52:30 AM     1 11380564589
```

View column names

Now we’ll use the colnames() function to view the column names for each dataframe:

Daily dataframes
`colnames(daily_activity)`

```
 [1] "Id"                       "ActivityDate"            
 [3] "TotalSteps"               "TotalDistance"           
 [5] "TrackerDistance"          "LoggedActivitiesDistance"
 [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
 [9] "LightActiveDistance"      "SedentaryActiveDistance" 
[11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
[13] "LightlyActiveMinutes"     "SedentaryMinutes"        
[15] "Calories"                
```

`colnames(daily_calories)`  

```
[1] "Id"          "ActivityDay" "Calories"   
```

`colnames(daily_intensities)`

```
 [1] "Id"                       "ActivityDay"             
 [3] "SedentaryMinutes"         "LightlyActiveMinutes"    
 [5] "FairlyActiveMinutes"      "VeryActiveMinutes"       
 [7] "SedentaryActiveDistance"  "LightActiveDistance"     
 [9] "ModeratelyActiveDistance" "VeryActiveDistance"      
```

`colnames(daily_steps)`

```
[1] "Id"          "ActivityDay" "StepTotal"  
```

`colnames(daily_sleep)`

```
[1] "Id"                 "SleepDay"           "TotalSleepRecords" 
[4] "TotalMinutesAsleep" "TotalTimeInBed"    
```

`colnames(weight_log)`

```
[1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
[5] "Fat"            "BMI"            "IsManualReport" "LogId"         
```
Hourly dataframes
`colnames(hourly_calories)`  

```
[1] "Id"           "ActivityHour" "Calories"    
```

`colnames(hourly_intensities)`   

```
[1] "Id"               "ActivityHour"     "TotalIntensity"   "AverageIntensity"
```

`colnames(hourly_steps)`  

```
[1] "Id"           "ActivityHour" "StepTotal"   
```
Minute dataframe
`colnames(minute_sleep)`

```
[1] "Id"    "date"  "value" "logId"
```

View variables

All these dataframes have the Id field, which can be used to merge them into one another to create overall dataframes (i.e. daily df, hourly df). However, the activity date/time variable formatting looks to be different across some of the dataframes. Let’s select() those variables to take a closer look.

Daily dataframes
```r
daily_activity %>%
  select(Id, ActivityDate)
```

```
# A tibble: 940 × 2
           Id ActivityDate
        <dbl> <chr>       
 1 1503960366 4/12/16     
 2 1503960366 4/13/16     
 3 1503960366 4/14/16     
 4 1503960366 4/15/16     
 5 1503960366 4/16/16     
 6 1503960366 4/17/16     
 7 1503960366 4/18/16     
 8 1503960366 4/19/16     
 9 1503960366 4/20/16     
10 1503960366 4/21/16     
# … with 930 more rows
```

```r
daily_calories %>%
  select(Id, ActivityDay)
```

```
# A tibble: 940 × 2
           Id ActivityDay
        <dbl> <chr>      
 1 1503960366 4/12/2016  
 2 1503960366 4/13/2016  
 3 1503960366 4/14/2016  
 4 1503960366 4/15/2016  
 5 1503960366 4/16/2016  
 6 1503960366 4/17/2016  
 7 1503960366 4/18/2016  
 8 1503960366 4/19/2016  
 9 1503960366 4/20/2016  
10 1503960366 4/21/2016  
# … with 930 more rows
```

```r
daily_intensities %>%
  select(Id, ActivityDay)
```

```
# A tibble: 940 × 2
           Id ActivityDay
        <dbl> <chr>      
 1 1503960366 4/12/2016  
 2 1503960366 4/13/2016  
 3 1503960366 4/14/2016  
 4 1503960366 4/15/2016  
 5 1503960366 4/16/2016  
 6 1503960366 4/17/2016  
 7 1503960366 4/18/2016  
 8 1503960366 4/19/2016  
 9 1503960366 4/20/2016  
10 1503960366 4/21/2016  
# … with 930 more rows
```

```r
daily_steps %>%
  select(Id, ActivityDay)
```

```
# A tibble: 940 × 2
           Id ActivityDay
        <dbl> <chr>      
 1 1503960366 4/12/2016  
 2 1503960366 4/13/2016  
 3 1503960366 4/14/2016  
 4 1503960366 4/15/2016  
 5 1503960366 4/16/2016  
 6 1503960366 4/17/2016  
 7 1503960366 4/18/2016  
 8 1503960366 4/19/2016  
 9 1503960366 4/20/2016  
10 1503960366 4/21/2016  
# … with 930 more rows
```

```r
daily_sleep %>%
  select(Id, SleepDay)
```

```
# A tibble: 413 × 2
           Id SleepDay             
        <dbl> <chr>                
 1 1503960366 4/12/2016 12:00:00 AM
 2 1503960366 4/13/2016 12:00:00 AM
 3 1503960366 4/15/2016 12:00:00 AM
 4 1503960366 4/16/2016 12:00:00 AM
 5 1503960366 4/17/2016 12:00:00 AM
 6 1503960366 4/19/2016 12:00:00 AM
 7 1503960366 4/20/2016 12:00:00 AM
 8 1503960366 4/21/2016 12:00:00 AM
 9 1503960366 4/23/2016 12:00:00 AM
10 1503960366 4/24/2016 12:00:00 AM
# … with 403 more rows
```

```r
weight_log %>%
  select(Id, Date)
```

```
# A tibble: 67 × 2
           Id Date                 
        <dbl> <chr>                
 1 1503960366 5/2/2016 11:59:59 PM 
 2 1503960366 5/3/2016 11:59:59 PM 
 3 1927972279 4/13/2016 1:08:52 AM 
 4 2873212765 4/21/2016 11:59:59 PM
 5 2873212765 5/12/2016 11:59:59 PM
 6 4319703577 4/17/2016 11:59:59 PM
 7 4319703577 5/4/2016 11:59:59 PM 
 8 4558609924 4/18/2016 11:59:59 PM
 9 4558609924 4/25/2016 11:59:59 PM
10 4558609924 5/1/2016 11:59:59 PM 
# … with 57 more rows
```
Hourly dataframes
```r
hourly_calories %>%
  select(Id, ActivityHour)
```

```
# A tibble: 22,099 × 2
           Id ActivityHour         
        <dbl> <chr>                
 1 1503960366 4/12/2016 12:00:00 AM
 2 1503960366 4/12/2016 1:00:00 AM 
 3 1503960366 4/12/2016 2:00:00 AM 
 4 1503960366 4/12/2016 3:00:00 AM 
 5 1503960366 4/12/2016 4:00:00 AM 
 6 1503960366 4/12/2016 5:00:00 AM 
 7 1503960366 4/12/2016 6:00:00 AM 
 8 1503960366 4/12/2016 7:00:00 AM 
 9 1503960366 4/12/2016 8:00:00 AM 
10 1503960366 4/12/2016 9:00:00 AM 
# … with 22,089 more rows
```

```r
hourly_intensities %>%
  select(Id, ActivityHour)
```

```
# A tibble: 22,099 × 2
           Id ActivityHour         
        <dbl> <chr>                
 1 1503960366 4/12/2016 12:00:00 AM
 2 1503960366 4/12/2016 1:00:00 AM 
 3 1503960366 4/12/2016 2:00:00 AM 
 4 1503960366 4/12/2016 3:00:00 AM 
 5 1503960366 4/12/2016 4:00:00 AM 
 6 1503960366 4/12/2016 5:00:00 AM 
 7 1503960366 4/12/2016 6:00:00 AM 
 8 1503960366 4/12/2016 7:00:00 AM 
 9 1503960366 4/12/2016 8:00:00 AM 
10 1503960366 4/12/2016 9:00:00 AM 
# … with 22,089 more rows
```

```r
hourly_steps %>%
  select(Id, ActivityHour)
```

```
# A tibble: 22,099 × 2
           Id ActivityHour         
        <dbl> <chr>                
 1 1503960366 4/12/2016 12:00:00 AM
 2 1503960366 4/12/2016 1:00:00 AM 
 3 1503960366 4/12/2016 2:00:00 AM 
 4 1503960366 4/12/2016 3:00:00 AM 
 5 1503960366 4/12/2016 4:00:00 AM 
 6 1503960366 4/12/2016 5:00:00 AM 
 7 1503960366 4/12/2016 6:00:00 AM 
 8 1503960366 4/12/2016 7:00:00 AM 
 9 1503960366 4/12/2016 8:00:00 AM 
10 1503960366 4/12/2016 9:00:00 AM 
# … with 22,089 more rows
```
Minute dataframe
```r
minute_sleep %>%
  select(Id, date)
```

```
# A tibble: 188,521 × 2
           Id date                
        <dbl> <chr>               
 1 1503960366 4/12/2016 2:47:30 AM
 2 1503960366 4/12/2016 2:48:30 AM
 3 1503960366 4/12/2016 2:49:30 AM
 4 1503960366 4/12/2016 2:50:30 AM
 5 1503960366 4/12/2016 2:51:30 AM
 6 1503960366 4/12/2016 2:52:30 AM
 7 1503960366 4/12/2016 2:53:30 AM
 8 1503960366 4/12/2016 2:54:30 AM
 9 1503960366 4/12/2016 2:55:30 AM
10 1503960366 4/12/2016 2:56:30 AM
# … with 188,511 more rows
```

Data cleaning

Now that we’ve familiarized ourselves with the data, we can identify a few data cleaning tasks to move forward with. Here are the main areas we’ll address:

  1. Naming conventions: we’ll adjust the variable names according to the tidyverse style guide. All variable names will use only lowercase letters and underscores _ (known as snake case) to separate words within a name. This will make our code easier to read.

  2. Fix formatting issues: we’ll make the date-time format consistent across each dataframe. This will prevent problems down the line during analysis, as well as when we begin merging the data together.

  3. Create new variables: we’ll create a few new columns within each dataframe.

  4. Create new dataframe: in order to better analyze the minute_sleep data, we’ll first do some thorough transformation by reformatting and creating new variables. Then we’ll aggregate all the sleep data we have into a more manageable structure.

Daily dataframes
# daily_activity ---------------------------
daily_activity <-
  daily_activity %>% 
  rename(
    activity_date = ActivityDate, 
    total_steps = TotalSteps, 
    total_distance = TotalDistance,
    tracker_distance = TrackerDistance,
    logged_activities_d = LoggedActivitiesDistance,
    very_active_d = VeryActiveDistance, 
    moderately_active_d = ModeratelyActiveDistance, 
    light_active_d = LightActiveDistance, 
    sedentary_d = SedentaryActiveDistance, 
    very_active_m = VeryActiveMinutes, 
    fairly_active_m = FairlyActiveMinutes, 
    lightly_active_m = LightlyActiveMinutes, 
    sedentary_m = SedentaryMinutes, 
    calories = Calories
    ) %>% 
  rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    # reformat variable as POSIXct to represent date and time
    activity_date = parse_date_time(activity_date, "%m/%d/%y"),
    # create new variable and format as date only
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    # create new variables for day of week and time of week
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
    )
# daily_calories ---------------------------
daily_calories <-
  daily_calories %>% 
  rename(
    activity_date = ActivityDay,
    calories = Calories
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_date = parse_date_time(activity_date, "%m/%d/%Y"),
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
  )
# daily_intensities ---------------------------
daily_intensities <-
  daily_intensities %>% 
  rename(
    activity_date = ActivityDay,
    very_active_d = VeryActiveDistance, 
    moderately_active_d = ModeratelyActiveDistance, 
    light_active_d = LightActiveDistance, 
    sedentary_d = SedentaryActiveDistance, 
    very_active_m = VeryActiveMinutes, 
    fairly_active_m = FairlyActiveMinutes, 
    lightly_active_m = LightlyActiveMinutes, 
    sedentary_m = SedentaryMinutes, 
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_date = parse_date_time(activity_date, "%m/%d/%Y"),
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
  )
# daily_sleep ---------------------------
daily_sleep <-
  daily_sleep %>% 
  rename(
    activity_date = SleepDay,
    total_sleep_records = TotalSleepRecords,
    total_minutes_asleep = TotalMinutesAsleep,
    total_time_in_bed = TotalTimeInBed
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_date = parse_date_time(activity_date, "%m/%d/%Y %I:%M:%S %p"),
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
    )
# daily_steps ---------------------------
daily_steps <-
  daily_steps %>% 
  rename(
    activity_date = ActivityDay,
    step_total = StepTotal
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_date = parse_date_time(activity_date, "%m/%d/%Y"),
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
  )
# weight_log ---------------------------
weight_log <-
  weight_log %>% 
  rename(
    activity_date = Date,
    weight_kg = WeightKg,
    weight_lb = WeightPounds,
    fat = Fat,
    bmi = BMI,
    manual_report = IsManualReport,
    log_id = LogId
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_date = parse_date_time(activity_date, "%m/%d/%Y %I:%M:%S %p"),
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    activity_time = format(activity_date, format = "%I:%M:%S %p"), 
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(activity_date, format = "%I:%M:%S %p")
  )
Hourly dataframes
# hourly_calories ---------------------------
hourly_calories <-
  hourly_calories %>% 
  rename(
    activity_hour = ActivityHour,
    calories = Calories
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_hour = parse_date_time(activity_hour, "%m/%d/%Y %I:%M:%S %p"),
    activity_date_ymd = as.Date(activity_hour, "%Y/%m/%d"), 
    activity_time = format(activity_hour, format = "%I:%M:%S %p"),
    day_of_week = weekdays(as.Date(activity_hour)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p")
  )
# hourly_intensities ---------------------------
hourly_intensities <- 
  hourly_intensities %>% 
  rename(
    activity_hour = ActivityHour,
    total_intensity = TotalIntensity,
    average_intensity = AverageIntensity
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_hour = parse_date_time(activity_hour, "%m/%d/%Y %I:%M:%S %p"),
    activity_date_ymd = as.Date(activity_hour, "%Y/%m/%d"), 
    activity_time = format(activity_hour, format = "%I:%M:%S %p"),
    day_of_week = weekdays(as.Date(activity_hour)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p")
  )
# hourly_steps --------------------------- 
hourly_steps <- 
  hourly_steps %>% 
  rename(
    activity_hour = ActivityHour,
    step_total = StepTotal
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_hour = parse_date_time(activity_hour, "%m/%d/%Y %I:%M:%S %p"),
    activity_date_ymd = as.Date(activity_hour, "%Y/%m/%d"), 
    activity_time = format(activity_hour, format = "%I:%M:%S %p"),
    day_of_week = weekdays(as.Date(activity_hour)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p")
  )
Minute dataframes
# minute_sleep ---------------------------
minute_sleep <-
  minute_sleep %>% 
  rename(
    activity_date = date,
    sleep_value = value,
    log_id = logId
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    activity_date = parse_date_time(activity_date, "%m/%d/%Y %I:%M:%S %p"),
    activity_date_ymd = as.Date(activity_date, "%Y/%m/%d"),
    activity_time = format(activity_date, format = "%I:%M:00 %p"), 
    day_of_week = weekdays(as.Date(activity_date)),
    time_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(activity_time, format = "%I:%M:%S %p"),
    # sleep_id will be used generate totals per sleep log
    sleep_id = str_c(id, "-", log_id), 
    # Create new variables for sleep values - asleep, restless, and awake
    asleep = ifelse(sleep_value == 1, 1, 0),
    restless = ifelse(sleep_value == 2, 1, 0),
    awake = ifelse(sleep_value == 3, 1, 0)
  )
# Create sleep_summary_0 df ---------------------------
# that shows totals for 3 sleep values per sleep log
sleep_summary_0 <-
  minute_sleep %>%
  # id_date will be used to generate a daily total
  mutate(
    id_date = str_c(id, "-", activity_date_ymd)
  ) %>% 
  group_by(sleep_id, activity_date_ymd, id_date, id) %>% 
  dplyr::summarize(
    total_asleep = sum(sleep_value == "1"),
    total_restless = sum(sleep_value == "2"),
    total_awake = sum(sleep_value == "3")
  )
# Create sleep_summary df ---------------------------
# that combines totals for each day per id
sleep_summary <-
  sleep_summary_0 %>%
  # activity_date will be used to merge with daily_sleep df
  mutate(
    activity_date = parse_date_time(activity_date_ymd, "%Y/%m/%d")
  ) %>%
  group_by(id_date, activity_date, id) %>%
  dplyr::summarize(
    total_asleep_merged = sum(total_asleep),
    total_restless_merged = sum(total_restless),
    total_awake_merged = sum(total_awake)
  )
# Merge these two daily sleep dfs into one ---------------------------
sleep_data <- merge(x = daily_sleep, y = sleep_summary, by = c("id", "activity_date"), all = TRUE)

Analyze


Summary statistics

Now that we’ve done some deeper cleaning and formatting, let’s get to know this data from a big-picture perspective. We can do this by obtaining some high-level summary statistics that can show us how these dataframes relate to one another, as well as what trends might exist within each:

A. Unique participants

How many unique participants are there in each dataframe?

Daily dataframes

# There are 33 users (one user per unique id) in the daily activity df
n_distinct(daily_activity$id)
[1] 33
# There are 24 users (one user per unique id) in the sleep dfs
n_distinct(daily_sleep$id)
[1] 24
n_distinct(sleep_data$id)
[1] 24
# There are 8 users (one user per unique id) in the weight df
n_distinct(weight_log$id)
[1] 8

Hourly dataframes

# There are 33 users (one user per unique id) in the hourly dfs
n_distinct(hourly_calories$id)
[1] 33
n_distinct(hourly_intensities$id)
[1] 33
n_distinct(hourly_steps$id)
[1] 33

Minute dataframe

# There are 24 users (one user per unique id) in the minute df
n_distinct(minute_sleep$id)
[1] 24

B. Total observations

How many observations are there in each dataframe?

Daily dataframes

# The observations vary across the daily dfs
nrow(daily_activity)
[1] 940
nrow(daily_sleep)
[1] 413
nrow(sleep_data)
[1] 452
nrow(weight_log)
[1] 67

Hourly dataframes

# There are 22099 observations in each hourly dataframe
nrow(hourly_calories)
[1] 22099
nrow(hourly_intensities)
[1] 22099
nrow(hourly_steps)
[1] 22099

Minute dataframe

# There are 188521 observations in the minute dataframe
nrow(minute_sleep)
[1] 188521

C. Quick overview

What are some quick summary statistics we’d want to know about each dataframe?

Daily dataframes

# Daily totals for steps, distance, calories ---------------------------
daily_activity %>%
  select(
    total_steps,
    total_distance,
    calories
  ) %>%
  summary()
  total_steps    total_distance      calories   
 Min.   :    0   Min.   : 0.000   Min.   :   0  
 1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828  
 Median : 7406   Median : 5.245   Median :2134  
 Mean   : 7638   Mean   : 5.490   Mean   :2304  
 3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793  
 Max.   :36019   Max.   :28.030   Max.   :4900  
# Active minute levels per category ---------------------------
daily_activity %>%
  select(
    very_active_m,
    fairly_active_m,
    lightly_active_m,
    sedentary_m
  ) %>%
  summary()
 very_active_m    fairly_active_m  lightly_active_m  sedentary_m    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0.0    Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:127.0    1st Qu.: 729.8  
 Median :  4.00   Median :  6.00   Median :199.0    Median :1057.5  
 Mean   : 21.16   Mean   : 13.56   Mean   :192.8    Mean   : 991.2  
 3rd Qu.: 32.00   3rd Qu.: 19.00   3rd Qu.:264.0    3rd Qu.:1229.5  
 Max.   :210.00   Max.   :143.00   Max.   :518.0    Max.   :1440.0  
# Sleep totals ---------------------------
# for records, minutes asleep, and time in bed
daily_sleep %>%
  select(
    total_sleep_records,
    total_minutes_asleep,
    total_time_in_bed
  ) %>%
  summary()
 total_sleep_records total_minutes_asleep total_time_in_bed
 Min.   :1.000       Min.   : 58.0        Min.   : 61.0    
 1st Qu.:1.000       1st Qu.:361.0        1st Qu.:403.0    
 Median :1.000       Median :433.0        Median :463.0    
 Mean   :1.119       Mean   :419.5        Mean   :458.6    
 3rd Qu.:1.000       3rd Qu.:490.0        3rd Qu.:526.0    
 Max.   :3.000       Max.   :796.0        Max.   :961.0    
# Sleep totals per category ---------------------------
sleep_data %>%
  select(
    total_minutes_asleep, 
    total_time_in_bed,
    total_asleep_merged,
    total_restless_merged,
    total_awake_merged
  ) %>%
  summary()
 total_minutes_asleep total_time_in_bed total_asleep_merged
 Min.   : 58.0        Min.   : 61.0     Min.   :  0.0      
 1st Qu.:361.0        1st Qu.:403.0     1st Qu.:325.0      
 Median :433.0        Median :463.0     Median :407.5      
 Mean   :419.5        Mean   :458.6     Mean   :384.8      
 3rd Qu.:490.0        3rd Qu.:526.0     3rd Qu.:477.0      
 Max.   :796.0        Max.   :961.0     Max.   :778.0      
 NA's   :39           NA's   :39                           
 total_restless_merged total_awake_merged
 Min.   :  0.00        Min.   :  0.000   
 1st Qu.: 11.00        1st Qu.:  0.000   
 Median : 20.00        Median :  3.000   
 Mean   : 31.17        Mean   :  4.469   
 3rd Qu.: 31.00        3rd Qu.:  5.000   
 Max.   :227.00        Max.   :129.000   
                                         
# Weight totals ---------------------------
weight_log %>%
  select(
    weight_lb,
    fat,
    bmi
  ) %>%
  summary()
   weight_lb          fat             bmi       
 Min.   :116.0   Min.   :22.00   Min.   :21.45  
 1st Qu.:135.4   1st Qu.:22.75   1st Qu.:23.96  
 Median :137.8   Median :23.50   Median :24.39  
 Mean   :158.8   Mean   :23.50   Mean   :25.19  
 3rd Qu.:187.5   3rd Qu.:24.25   3rd Qu.:25.56  
 Max.   :294.3   Max.   :25.00   Max.   :47.54  
                 NA's   :65                     

Hourly dataframes

# Hourly calories summary ---------------------------
hourly_calories %>%
  select(calories) %>%
  summary()
    calories     
 Min.   : 42.00  
 1st Qu.: 63.00  
 Median : 83.00  
 Mean   : 97.39  
 3rd Qu.:108.00  
 Max.   :948.00  
# Hourly intensities summary ---------------------------
hourly_intensities %>%
  select(
    total_intensity,
    average_intensity
  ) %>%
  summary()
 total_intensity  average_intensity
 Min.   :  0.00   Min.   :0.0000   
 1st Qu.:  0.00   1st Qu.:0.0000   
 Median :  3.00   Median :0.0500   
 Mean   : 12.04   Mean   :0.2006   
 3rd Qu.: 16.00   3rd Qu.:0.2667   
 Max.   :180.00   Max.   :3.0000   
# Hourly steps summary ---------------------------
hourly_steps %>%
  select(step_total) %>%
  summary()
   step_total     
 Min.   :    0.0  
 1st Qu.:    0.0  
 Median :   40.0  
 Mean   :  320.2  
 3rd Qu.:  357.0  
 Max.   :10554.0  

Minute dataframe

# Minute sleep summary ---------------------------
minute_sleep %>%
  select(
    sleep_value,
    asleep,
    restless,
    awake
  ) %>%
  summary()
  sleep_value        asleep          restless           awake       
 Min.   :1.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :1.000   Median :1.0000   Median :0.00000   Median :0.0000  
 Mean   :1.096   Mean   :0.9149   Mean   :0.07438   Mean   :0.0107  
 3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
 Max.   :3.000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  

D. Interesting notes

What does this tell us about this sample of people’s activities? A few interesting notes:

  • 7,638 - average total steps per day, which is below the generally recommended 10,000 goal.
  • 991.2 - average “Sedentary” minutes per day, which is the highest average of the tracked activity minutes per day. This is just over 16.5 hours per day.
  • 192.8- average “Lightly Active” minutes per day, which the highest among the non-sedentary activity minutes per day (21.16 - Very Active, 13.56- Fairly Active).
  • 34.72 - combined daily average of Very Active (21.16) and Fairly Active (13.56) minutes per day. This indicates that, on average, these individuals may mostly be getting the WHO recommended 150 minutes of moderate-intensity physical activity per week Global recommendations on physical activity for health
  • 91.5% - average percentage of time asleep while in bed. This is calculated by dividing “Total Minutes Asleep” (419.5) by “Total Time In Bed” (458.6).
  • 97.39 - average calories burned per hour.

Plotting a few explorations

Steps and active minutes

The goal of 10,000 steps per day has been widely proposed across the smart device industry. But how important of a health metric is daily steps really? For example, what’s the relationship between total steps and calories burned?

# Total steps vs calories plot ---------------------------
ggplot(data = daily_activity, aes(x = total_steps, y = calories)) +
  geom_point() +
  geom_smooth() + # Trend line with a shadow representing  95% confidence interval
  labs(title = "Total steps vs calories") +
  ylab("Calories") +
  xlab("Total Steps") +
  theme_minimal()

cor.test(daily_activity$total_steps, daily_activity$calories, method = "pearson")

  Pearson's product-moment correlation

data:  daily_activity$total_steps and daily_activity$calories
t = 22.472, df = 938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5483688 0.6316184
sample estimates:
      cor 
0.5915681 

As expected, we see a positive correlation here between total_steps and calories. This makes sense because the more active someone is, the more calories they will burn. But what about the relationship between very_active_m and calories?

# Very active minutes vs calories plot ---------------------------
ggplot(data = daily_activity, aes(x = very_active_m, y = calories)) +
  geom_point() +
  geom_smooth() + # Trend line with a shadow representing  95% confidence interval
  labs(title = "Very active minutes vs calories") +
  ylab("Calories") +
  xlab("Very active minutes") +
  theme_minimal()

cor.test(daily_activity$very_active_m, daily_activity$calories, method = "pearson")

  Pearson's product-moment correlation

data:  daily_activity$very_active_m and daily_activity$calories
t = 23.939, df = 938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5745197 0.6540260
sample estimates:
      cor 
0.6158383 

Unsurprisingly, we see another positive correlation. This time between very_active_m and calories. Very active people will burn more calories. But we know from earlier that the average very_active_m per day of 21.16 is much lower than then average lightly_active_m per day of 192.80. So what’s the relationship between lightly_active_m and calories?

# Lightly active minutes vs calories plot ---------------------------
ggplot(data = daily_activity, aes(x = lightly_active_m, y = calories)) +
  geom_point() +
  geom_smooth() + # Trend line with a shadow representing  95% confidence interval
  labs(title = "Lightly active minutes vs calories") +
  ylab("Calories") +
  xlab("Lightly active minutes") +
  theme_minimal()

cor.test(daily_activity$lightly_active_m, daily_activity$calories, method = "pearson")

  Pearson's product-moment correlation

data:  daily_activity$lightly_active_m and daily_activity$calories
t = 9.1661, df = 938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2269361 0.3443464
sample estimates:
      cor 
0.2867175 

The relationship between lightly_active_m and calories is still positive, albeit a lot weaker. This could mean that lightly_active_m has less of an impact on calories burned throughout a day.

Daily calories

What’s the relationship between calories and day of the week?

# Create labels and limits for plot ---------------------------
labels_weekdays <- c(
  "Monday" = "Monday", "Tuesday" = "Tuesday",
  "Wednesday" = "Wednesday", "Thursday" = "Thursday",
  "Friday" = "Friday", "Saturday" = "Saturday",
  "Sunday" = "Sunday"
  )
limits_weekdays <- c(
  "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
  "Saturday", "Sunday"
  )
# Calories vs day of week plot ---------------------------
ggplot(data = daily_activity) +
  geom_point(
    aes(x = day_of_week, y = calories, color = as.factor(time_of_week)),
    position = "jitter", 
    alpha = 0.3
    ) +
  labs(title = "Daily calories throughout the week", color = "Time of week") +
  ylab("Total Calories") +
  scale_x_discrete(
    "Day",
    labels = labels_weekdays, 
    limits = limits_weekdays,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = day_of_week, y = calories),
    fun = mean, 
    geom = "point",
    color = "red", 
    size = 2, 
    alpha = 0.7
    ) +
  theme_minimal()

There doesn’t look to be a large difference in the average calories burned per day. These averages are represented by the red dots on each day.

Daily sleep

What’s the relationship between sleep quality and time of week? You might expect people to get better sleep over the weekends vs weekdays - are there any unexpected trends?

# Total minutes asleep vs time of week ---------------------------
ggplot(data = sleep_data) +
  geom_point(
    aes(
    x = weekdays.POSIXt(activity_date), 
    y = total_asleep_merged,
    color = as.factor(time_of_week)
    ),
  position = "jitter",
  alpha = 0.3
  ) +
  labs(title = "Total minutes asleep throughout week") +
  guides(color = "none") +
  ylab("Minutes asleep") +
  scale_x_discrete(
    "Day",
    labels = labels_weekdays, 
    limits = limits_weekdays,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = weekdays.POSIXt(activity_date), y = total_asleep_merged),
    fun = mean, 
    geom = "point", 
    color = "red", 
    size = 2,
    alpha = 0.7
    ) +
  theme_minimal()

# Total minutes restless vs time of week ---------------------------
ggplot(data = sleep_data) +
  geom_point(
    aes(
    x = weekdays.POSIXt(activity_date), 
    y = total_restless_merged,
    color = as.factor(time_of_week)
    ),
  position = "jitter",
  alpha = 0.3
  ) +
  labs(title = "Total minutes restless throughout week") +
  guides(color = "none") +
  ylab("Minutes restless") +
  scale_x_discrete(
    "Day",
    labels = labels_weekdays, 
    limits = limits_weekdays,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = weekdays.POSIXt(activity_date), y = total_restless_merged),
    fun = mean, 
    geom = "point", 
    color = "red", 
    size = 2,
    alpha = 0.7
    ) +
  theme_minimal()

# Total minutes awake vs time of week ---------------------------
ggplot(data = sleep_data) +
  geom_point(
    aes(
      x = weekdays.POSIXt(activity_date), 
      y = total_awake_merged,
      color = as.factor(time_of_week)
      ),
    position = "jitter",
    alpha = 0.3
    ) +
  labs(title = "Total minutes awake throughout week") +
  guides(color = "none") +
  ylab("Minutes awake") +
  scale_x_discrete(
    "Day",
    labels = labels_weekdays, 
    limits = limits_weekdays,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = weekdays.POSIXt(activity_date), y = total_awake_merged),
    fun = mean, 
    geom = "point", 
    color = "red", 
    size = 2,
    alpha = 0.7
    ) +
  theme_minimal()

Based on the averages for each day on these graphs (red dots), we can see that:

  • The average total minutes sleep were highest on Sunday, Wednesday and Saturday
  • The average for minutes awake and minutes restless were highest on Sunday and Saturday

People do seem to get sleep more on the weekends. However, it doesn’t look like the quality of sleep is greater based minutes awake and restless.

Merging data

Based on these initial explorations, the impact of active minutes seems like it could reveal deeper customer insights than total steps, calories or sleep. It would benefit us to merge these dataframes and explore that metric further.

daily_data

Let’s start with combining daily_activity, sleep_data, and weight_log into one daily_data dataframe. We will disregard the daily_calories, daily_intensities, and daily_steps since their data is already aggregated within the the daily_activity dataframe:

# Merging daily data into one df ---------------------------
daily_data_0 <-
  merge(
    x = daily_activity, y = sleep_data,
    by = c("id", "activity_date_ymd", "day_of_week", "time_of_week"), 
    all.x = TRUE
  )
daily_data <-
  merge(
    x = daily_data_0, y = weight_log,
    by = c("id", "activity_date_ymd", "day_of_week", "time_of_week"), all.x = TRUE
  )
# Add two new active minute variables to daily data ---------------------------
daily_data <-
  daily_data %>%
  mutate(
    daily_mod_vig_m = fairly_active_m + very_active_m, 
    daily_light_sed_m = lightly_active_m + sedentary_m
  )

hourly_data

Now we’ll combine hourly_calories, hourly_intensities, and hourly_steps into one hourly_data dataframe:

# Merging hourly data into one df  ---------------------------
hourly_data_0 <-
  merge(
    x = hourly_calories, y = hourly_intensities,
    by = c(
      "id", "activity_hour", "activity_date_ymd", "activity_time", "hour_of_day", 
      "day_of_week", "time_of_week"
    ),
    all.x = TRUE, no.dups = TRUE
  )
hourly_data <-
  merge(
    x = hourly_data_0, y = hourly_steps,
    by = c(
      "id", "activity_hour", "activity_date_ymd", "activity_time", "hour_of_day", 
      "day_of_week", "time_of_week"
    ),
    all.x = TRUE
  )

weekly_data

Finally, we’ll create a weekly_data dataframe using data from daily data:

# Create new weekly_data_00 df  ---------------------------
weekly_data_0 <-
  daily_data %>%
  select(
    id, activity_date_ymd, day_of_week, activity_date, total_steps, total_distance,
    very_active_d, moderately_active_d, light_active_d,
    sedentary_d, very_active_m, fairly_active_m,
    lightly_active_m, sedentary_m, calories
  ) %>%
  mutate(
    week_number = paste0(
      "Week ", (as.numeric(format(activity_date_ymd, "%U"))), ", ", 
      format(activity_date_ymd, "%b %Y")
      )
  )
# Summarize totals by week in new weekly_data df  ---------------------------
weekly_data <-
  weekly_data_0 %>%
  group_by(id, week_number) %>%
  dplyr::summarize(
    weekly_total_steps = sum(total_steps),
    weekly_total_distance = sum(total_distance),
    weekly_vigorous_d = sum(very_active_d),
    weekly_moderate_d = sum(moderately_active_d),
    weekly_light_d = sum(light_active_d),
    weekly_sedentary_d = sum(sedentary_d),
    weekly_vigorous_m = sum(very_active_m),
    weekly_moderate_m = sum(fairly_active_m),
    weekly_mod_vig_m = weekly_vigorous_m + weekly_moderate_m,
    weekly_light_m = sum(lightly_active_m),
    weekly_sedentary_m = sum(sedentary_m),
    weekly_light_sed_m = weekly_light_m + weekly_sedentary_m,
    weekly_calories = sum(calories)
  )

Data cleaning [revisited]

Before continuing with our analysis on these merged dataframes, let’s do some more data cleaning.

Delete old dataframes

Let’s free up some RAM space by deleting old dataframes that we’ll no longer need or aren’t using moving forward with our analysis:

# List all objects in current R workspace ---------------------------
ls()
 [1] "daily_activity"     "daily_calories"     "daily_data"        
 [4] "daily_data_0"       "daily_intensities"  "daily_sleep"       
 [7] "daily_steps"        "encoding"           "hourly_calories"   
[10] "hourly_data"        "hourly_data_0"      "hourly_intensities"
[13] "hourly_steps"       "input_file"         "labels_weekdays"   
[16] "limits_weekdays"    "minute_sleep"       "out_dir"           
[19] "sleep_data"         "sleep_summary"      "sleep_summary_0"   
[22] "weekly_data"        "weekly_data_0"      "weight_log"        
# Remove old objects ---------------------------
# we'll no longer need or use moving forward
rm("daily_calories", "daily_intensities", "daily_steps", "daily_sleep", 
   "weight_log")
rm("hourly_calories", "hourly_intensities", "hourly_steps", "minute_sleep")
rm("daily_data_0", "hourly_data_0", "sleep_summary_0", "weekly_data_0")
# List all objects again in current R workspace  ---------------------------
ls()
 [1] "daily_activity"  "daily_data"      "encoding"        "hourly_data"    
 [5] "input_file"      "labels_weekdays" "limits_weekdays" "out_dir"        
 [9] "sleep_data"      "sleep_summary"   "weekly_data"    

Create new variables for faceting

In order to create plots that are faceted by day of week or time of day in correct order, we’ll need to add a new variable to each of the merged dataframes:

# Create day_list for ordered facet grid/wrapping  ---------------------------
hourly_data$day_list <-
  factor(hourly_data$day_of_week, levels = c(
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
  ))
daily_data$day_list <-
  factor(daily_data$day_of_week, levels = c(
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
  ))
# Create time_list for ordered facet grid/wrapping ---------------------------
hourly_data$time_list <-
  factor(hourly_data$activity_time,
    levels = c(
      "12:00:00 AM", "01:00:00 AM", "02:00:00 AM", "03:00:00 AM",
      "04:00:00 AM", "05:00:00 AM", "06:00:00 AM", "07:00:00 AM",
      "08:00:00 AM", "09:00:00 AM", "10:00:00 AM", "11:00:00 AM",
      "12:00:00 PM", "01:00:00 PM", "02:00:00 PM", "03:00:00 PM",
      "04:00:00 PM", "05:00:00 PM", "06:00:00 PM", "07:00:00 PM",
      "08:00:00 PM", "09:00:00 PM", "10:00:00 PM", "11:00:00 PM"
    )
  )

Mean or median?

Before we launch into the new summary stats, we need to answer an important question:

  • Should we use mean or median when presenting these summary statistics?

Well, the answer to this question depends on how strongly the outliers are influencing the mean. We can visualize with histograms of each new dataframe:

daily_data

# Daily histogram plot ---------------------------
daily_hplot <- ggplot(data = daily_data)
# Histogram plot for daily_mod_vig_m ---------------------------
(bw <- nclass.FD(daily_data$daily_mod_vig_m)) # Optimize binwidth for each plot
[1] 24
daily_hplot +
  geom_histogram(aes(x = daily_mod_vig_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Daily moderate/vigorous minutes",
    caption = "Median: 21.00  ,  Mean: 34.87"
  ) 

# Histogram plot for daily_light_sed_m ---------------------------
(bw <- nclass.FD(daily_data$daily_light_sed_m))
[1] 15
daily_hplot +
  geom_histogram(aes(x = daily_light_sed_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Daily light/sedentary minutes",
    caption = "Median: 1333  ,  Mean: 1183"
  )

# Histogram plot for total_steps ---------------------------
(bw <- nclass.FD(daily_data$total_steps))
[1] 26
daily_hplot +
  geom_histogram(aes(x = total_steps), bins = bw) +
  labs(
    x = "Total Steps",
    y = "Count",
    title = "Daily total steps",
    caption = "Median: 7439  ,  Mean: 7652"
  )

# Histogram plot for calories ---------------------------
(bw <- nclass.FD(daily_data$calories))
[1] 25
daily_hplot +
  geom_histogram(aes(x = calories), bins = bw) +
  labs(
    x = "Calories",
    y = "Count",
    title = "Daily total calories",
    caption = "Median: 2140  ,  Mean: 2308"
  )

# Histogram plot for total_minutes_asleep ---------------------------
daily_hplot +
  geom_histogram(aes(x = total_minutes_asleep)) +
  labs(
    x = "Total Minutes Asleep",
    y = "Count",
    title = "Daily minutes asleep",
    caption = "Median: 433.0  ,  Mean: 419.5"
  )
Warning: Removed 530 rows containing non-finite values (stat_bin).

# Histogram plot for total_time_in_bed ---------------------------
daily_hplot +
  geom_histogram(aes(x = total_time_in_bed)) +
  labs(
    x = "Total Time in Bed",
    y = "Count",
    title = "Daily total time in bed",
    caption = "Median: 463.0  ,  Mean: 458.6"
  )
Warning: Removed 530 rows containing non-finite values (stat_bin).

# Remove bindwith object ---------------------------
rm(bw)

Since a few outliers are greatly affecting the mean for each variable, we should use median for all these daily summary statistics: daily moderate/vigorous minutes, daily light/sedentary minutes, and daily total steps. However, we can still use mean for the sleep variables and calories.

hourly_data

# Hourly histogram plots ---------------------------
hourly_hplot <- ggplot(data = hourly_data)
# Histogram plot for calories ---------------------------
(bw <- nclass.FD(hourly_data$calories)) # Optimize binwidth for each plot
[1] 283
hourly_hplot +
  geom_histogram(aes(x = calories), bins = bw) +
  labs(
    title = "Hourly calories",
    caption = "Median: 83.00  ,  Mean: 97.39",
    x = "Calories",
    y = "Count"
  )

# Histogram plot for total_intensity ---------------------------
(bw <- nclass.FD(hourly_data$total_intensity))
[1] 158
hourly_hplot +
  geom_histogram(aes(x = total_intensity), bins = bw) +
  labs(
    title = "Hourly total intensity",
    caption = "Median: 3.00,  Mean: 12.04",
    x = "Total Intensity",
    y = "Count"
  )

# Histogram plot for average_intensity ---------------------------
(bw <- nclass.FD(hourly_data$average_intensity))
[1] 158
hourly_hplot +
  geom_histogram(aes(x = average_intensity), bins = bw) +
  labs(
    title = "Hourly average intensity",
    caption = "Median: 0.0500,  Mean: 0.2006",
    x = "Average Intensity",
    y = "Count"
  )

# Histogram plot for step_total ---------------------------
(bw <- nclass.FD(hourly_data$step_total))
[1] 415
hourly_hplot +
  geom_histogram(aes(x = step_total), bins = bw) +
  labs(
    x = "Step Total",
    y = "Count",
    title = "Hourly step total",
    caption = "Median: 40.0  ,  Mean: 320.2"
  ) 

# Remove bindwith object ---------------------------
rm(bw)

It looks like we should use the median for all the hourly summary statistics as well: hourly calories, hourly total intensity, hourly average intensity and hourly step total.

weekly_data

# Weekly histogram plots ---------------------------
weekly_hplot <- ggplot(data = weekly_data)
# Histogram plot for weekly_vigorous_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_vigorous_m)) # Optimize binwidth for each plot
[1] 12
weekly_hplot +
  geom_histogram(aes(x = weekly_vigorous_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly vigorous minutes",
    caption = "Median: 68.50  ,  Mean: 130.10"
  )

# Histogram plot for weekly_moderate_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_moderate_m))
[1] 15
weekly_hplot +
  geom_histogram(aes(x = weekly_moderate_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly moderate minutes",
    caption = "Median: 65.50  ,  Mean: 83.45"
  )

# Histogram plot for weekly_light_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_light_m))
[1] 10
weekly_hplot +
  geom_histogram(aes(x = weekly_light_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly light minutes",
    caption = "Median: 1188  ,  Mean: 1182"
  )

# Histogram plot for weekly_sedentary_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_sedentary_m))
[1] 9
weekly_hplot +
  geom_histogram(aes(x = weekly_sedentary_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly sedentary minutes",
    caption = "Median: 5754  ,  Mean: 6064"
  )

# Histogram plot for weekly_mod_vig_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_mod_vig_m))
[1] 11
weekly_hplot +
  geom_histogram(aes(x = weekly_mod_vig_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly moderate/vigorous minutes",
    caption = "Median: 182.00  ,  Mean: 213.51"
  )

# Histogram plot for weekly_light_sed_m ---------------------------
(bw <- nclass.FD(weekly_data$weekly_light_sed_m))
[1] 7
weekly_hplot +
  geom_histogram(aes(x = weekly_light_sed_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly light/sedentary minutes",
    caption = "Median: 6956  ,  Mean: 7246"
  )

# Histogram plot for weekly_total_steps ---------------------------
(bw <- nclass.FD(weekly_data$weekly_total_steps))
[1] 10
weekly_hplot +
  geom_histogram(aes(x = weekly_total_steps), bins = bw) +
  labs(
    x = "Total Steps",
    y = "Count",
    title = "Weekly total steps",
    caption = "Median: 44903  ,  Mean: 46857"
  )

# Histogram plot for weekly_calories ---------------------------
(bw <- nclass.FD(weekly_data$weekly_calories))
[1] 10
weekly_hplot +
  geom_histogram(aes(x = weekly_calories), bins = bw) +
  labs(
    x = "Calories",
    y = "Count",
    title = "Weekly calories",
    caption = "Median: 13615  ,  Mean: 14130"
  )

# Remove bindwith object ---------------------------
rm(bw)

It looks like we should use median for three of the weekly summary statistics: weekly vigorous minutes, weekly moderate minutes, and weekly moderate/vigorous minutes. We should use mean for the remaining five: weekly sedentary minutes, weekly light minutes, weekly light/sedentary minutes, weekly total steps and weekly calories.

Summary stats [revisited]

New summary stats

Let’s look at the quick summary statistics we’d want to know about each new dataframe:

daily_data

# Daily steps, distance and calories ---------------------------
daily_data %>%
  select(
    total_steps,
    total_distance,
    calories
  ) %>%
  summary()
  total_steps    total_distance      calories   
 Min.   :    0   Min.   : 0.000   Min.   :   0  
 1st Qu.: 3795   1st Qu.: 2.620   1st Qu.:1830  
 Median : 7439   Median : 5.260   Median :2140  
 Mean   : 7652   Mean   : 5.503   Mean   :2308  
 3rd Qu.:10734   3rd Qu.: 7.720   3rd Qu.:2796  
 Max.   :36019   Max.   :28.030   Max.   :4900  
# Active minute levels per category ---------------------------
daily_data %>%
  select(
    very_active_m,
    fairly_active_m,
    lightly_active_m,
    sedentary_m,
    daily_mod_vig_m,
    daily_light_sed_m
  ) %>%
  summary()
 very_active_m    fairly_active_m  lightly_active_m  sedentary_m    
 Min.   :  0.00   Min.   :  0.00   Min.   :  0      Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:127      1st Qu.: 729.0  
 Median :  4.00   Median :  7.00   Median :199      Median :1057.0  
 Mean   : 21.24   Mean   : 13.63   Mean   :193      Mean   : 990.4  
 3rd Qu.: 32.00   3rd Qu.: 19.00   3rd Qu.:264      3rd Qu.:1229.0  
 Max.   :210.00   Max.   :143.00   Max.   :518      Max.   :1440.0  
 daily_mod_vig_m  daily_light_sed_m
 Min.   :  0.00   Min.   :   2     
 1st Qu.:  0.00   1st Qu.: 947     
 Median : 21.00   Median :1333     
 Mean   : 34.87   Mean   :1183     
 3rd Qu.: 57.00   3rd Qu.:1432     
 Max.   :275.00   Max.   :1440     
# Daily overall sleep ---------------------------
daily_data %>%
  select(
    total_sleep_records,
    total_minutes_asleep,
    total_time_in_bed
  ) %>%
  summary()
 total_sleep_records total_minutes_asleep total_time_in_bed
 Min.   :1.000       Min.   : 58.0        Min.   : 61.0    
 1st Qu.:1.000       1st Qu.:361.0        1st Qu.:403.0    
 Median :1.000       Median :433.0        Median :463.0    
 Mean   :1.119       Mean   :419.5        Mean   :458.6    
 3rd Qu.:1.000       3rd Qu.:490.0        3rd Qu.:526.0    
 Max.   :3.000       Max.   :796.0        Max.   :961.0    
 NA's   :530         NA's   :530          NA's   :530      
# Daily sleep by category ---------------------------
daily_data %>%
  select(
    total_asleep_merged,
    total_restless_merged,
    total_awake_merged
  ) %>%
  summary()
 total_asleep_merged total_restless_merged total_awake_merged
 Min.   : 35.0       Min.   :  0.00        Min.   :  0.000   
 1st Qu.:353.0       1st Qu.: 13.00        1st Qu.:  1.000   
 Median :420.0       Median : 21.00        Median :  3.000   
 Mean   :413.2       Mean   : 32.76        Mean   :  4.208   
 3rd Qu.:485.0       3rd Qu.: 32.00        3rd Qu.:  5.000   
 Max.   :778.0       Max.   :227.00        Max.   :110.000   
 NA's   :530         NA's   :530           NA's   :530       
# Weight logs ---------------------------
daily_data %>%
  select(
    weight_lb,
    fat,
    bmi
  ) %>%
  summary()
   weight_lb          fat             bmi       
 Min.   :116.0   Min.   :22.00   Min.   :21.45  
 1st Qu.:135.4   1st Qu.:22.75   1st Qu.:23.96  
 Median :137.8   Median :23.50   Median :24.39  
 Mean   :158.8   Mean   :23.50   Mean   :25.19  
 3rd Qu.:187.5   3rd Qu.:24.25   3rd Qu.:25.56  
 Max.   :294.3   Max.   :25.00   Max.   :47.54  
 NA's   :876     NA's   :941     NA's   :876    

hourly_data

# Hourly calories, intensities, step totals ---------------------------
hourly_data %>%
  select(
    calories,
    total_intensity,
    average_intensity,
    step_total
  ) %>%
  summary()
    calories      total_intensity  average_intensity   step_total     
 Min.   : 42.00   Min.   :  0.00   Min.   :0.0000    Min.   :    0.0  
 1st Qu.: 63.00   1st Qu.:  0.00   1st Qu.:0.0000    1st Qu.:    0.0  
 Median : 83.00   Median :  3.00   Median :0.0500    Median :   40.0  
 Mean   : 97.39   Mean   : 12.04   Mean   :0.2006    Mean   :  320.2  
 3rd Qu.:108.00   3rd Qu.: 16.00   3rd Qu.:0.2667    3rd Qu.:  357.0  
 Max.   :948.00   Max.   :180.00   Max.   :3.0000    Max.   :10554.0  

weekly_data

# Weekly data for original active minute totals ---------------------------
# vigorous, moderate, light and sedentary
weekly_data %>%
  ungroup() %>%
  select(
    weekly_vigorous_m,
    weekly_moderate_m,
    weekly_light_m,
    weekly_sedentary_m
  ) %>%
  summary()
 weekly_vigorous_m weekly_moderate_m weekly_light_m weekly_sedentary_m
 Min.   :  0.0     Min.   :  0.00    Min.   :   0   Min.   :  642     
 1st Qu.:  8.0     1st Qu.: 17.00    1st Qu.: 802   1st Qu.: 4778     
 Median : 68.5     Median : 65.50    Median :1188   Median : 5754     
 Mean   :130.1     Mean   : 83.45    Mean   :1182   Mean   : 6064     
 3rd Qu.:182.5     3rd Qu.:122.00    3rd Qu.:1554   3rd Qu.: 7749     
 Max.   :775.0     Max.   :549.00    Max.   :2590   Max.   :10080     
# Weekly data for new combo active minute totals ---------------------------
# moderate/vigorous and light/sedentary
weekly_data %>%
  ungroup() %>%
  select(
    weekly_mod_vig_m,
    weekly_light_sed_m
  ) %>%
  summary()
 weekly_mod_vig_m  weekly_light_sed_m
 Min.   :   0.00   Min.   : 1014     
 1st Qu.:  36.25   1st Qu.: 6029     
 Median : 182.00   Median : 6956     
 Mean   : 213.51   Mean   : 7246     
 3rd Qu.: 303.00   3rd Qu.: 9543     
 Max.   :1058.00   Max.   :10080     
# Weekly data for total steps and calories ---------------------------
weekly_data %>%
  ungroup() %>%
  select(
    weekly_total_steps,
    weekly_calories
  ) %>%
  summary()
 weekly_total_steps weekly_calories
 Min.   :     0     Min.   : 1237  
 1st Qu.: 30671     1st Qu.:10607  
 Median : 44903     Median :13615  
 Mean   : 46857     Mean   :14130  
 3rd Qu.: 63219     3rd Qu.:17529  
 Max.   :116758     Max.   :26101  

New interesting notes

What does this tell us about this sample of people’s weekly activities? A few interesting notes:

  • 46,857 steps - average total steps per week, which falls below the recommended weekly goal of 70,000 (10,000 goal per day x 7 days).
  • 6,064 min. - weekly average of sedentary minutes, which is the highest average of the tracked activity minutes. This is just over 4.2 days per week!
  • 1,182 min. - weekly average of light-intensity (LightlyActive) minutes, which the highest among the non-sedentary activity minutes per week.
  • 182.00 min. - weekly median of combined moderate- and vigorous-intensity activity minutes.
  • 65.50 min. - weekly median of moderate-intensity (FairlyActive) activity minutes.
  • 68.5 min. - weekly median of vigorous-intensity (VeryActive) activity minutes. It’s surprising that this number is higher than the moderate-intensity average!
  • 91.5% - average percentage of time asleep while in bed. This is calculated by dividing “Total Minutes Asleep” (419.5) by “Total Time In Bed” (458.6).
  • 83.00 - median calories burned per hour.

New exploration

Hour of day

Let’s look at some possible relationships between hour of day and different variables. We’ll create labels_hour_of_day to use for these plots:

# Create labels for hour of day ---------------------------
# Note that we are assigning times to the current date of  analysis to ensure
# that all times are within the same 24 hour day
labels_hour_of_day <- c(
  "2022-01-18 00:00:00" = "00:00",
  "2022-01-18 01:00:00" = "01:00",
  "2022-01-18 02:00:00" = "02:00",
  "2022-01-18 03:00:00" = "03:00",
  "2022-01-18 04:00:00" = "04:00",
  "2022-01-18 05:00:00" = "05:00",
  "2022-01-18 06:00:00" = "06:00",
  "2022-01-18 07:00:00" = "07:00",
  "2022-01-18 08:00:00" = "08:00",
  "2022-01-18 09:00:00" = "09:00",
  "2022-01-18 10:00:00" = "10:00",
  "2022-01-18 11:00:00" = "11:00",
  "2022-01-18 12:00:00" = "12:00",
  "2022-01-18 13:00:00" = "13:00",
  "2022-01-18 14:00:00" = "14:00",
  "2022-01-18 15:00:00" = "15:00",
  "2022-01-18 16:00:00" = "16:00",
  "2022-01-18 17:00:00" = "17:00",
  "2022-01-18 18:00:00" = "18:00",
  "2022-01-18 19:00:00" = "19:00",
  "2022-01-18 20:00:00" = "20:00",
  "2022-01-18 21:00:00" = "21:00",
  "2022-01-18 22:00:00" = "22:00",
  "2022-01-18 23:00:00" = "23:00",
  "2022-01-18 24:00:00" = "24:00"
)

What hours of the day have the highest observations per data point? Are there certain times where people are most active?

# step_total observations per hour ---------------------------
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
  geom_bar(aes(weight = step_total), stat = "count") +
  labs(title = "'step_total' observations throughout a 24 hour day") +
  scale_y_continuous(name = "Count", labels = scales::comma) +
  scale_x_discrete(
    "Hour of day", 
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  theme_minimal()

# calories observations per hour ---------------------------
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
  geom_bar(aes(weight = calories), stat = "count") +
  labs(title = "'calories' observations throughout a 24 hour day") +
  scale_y_continuous(name = "Count", labels = scales::comma) +
  scale_x_discrete(
    "Hour of day",
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  theme_minimal()

# average_intensity observations per hour ---------------------------
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
  geom_bar(aes(weight = average_intensity), stat = "count") +
  labs(title = "'average_intensity' observations throughout a 24 hour day") +
  scale_y_continuous(name = "Count", labels = scales::comma) +
  scale_x_discrete(
    "Hour of day", 
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  theme_minimal()

All three variables — step_total, calories, average_intensity — have similar shapes. The counts are lowest from 12:00-4:00 AM (00:00-04:00), which is when most people are asleep. The two time windows with highest counts of observations are the same across all three graphs:

  • 12:00 - 2:00 PM (12:00 - 14:00)
  • 5:00 - 7:00 PM (17:00 - 19:00)

Average intensity vs time

What’s the relationship between time and average intensity? For example, are people’s activities more intense in the morning vs. the evening? How does this info compare with the observation counts from the plots above?

# Average intensity throughout the day ---------------------------
average_intensity_p <- 
  ggplot(
    data = hourly_data, 
    aes(x = time_list, y = average_intensity, color = average_intensity)
         )
average_intensity_p +
  geom_jitter(width = 0.3, size = 0.2) +
  labs(
    x = "Hour", 
    y = "Average intensity",
    color = "Average Intensity",
    title = "Average intensity throughout the day"
  ) + 
  stat_summary(
    aes(x = time_list, y = average_intensity),
    fun = mean, 
    geom = "point", 
    color = "white", 
    size = 1,
    alpha = 0.7
    ) +
  scale_color_viridis_c() +
  theme_solarized(light = FALSE) +
  theme(
    axis.title.x = element_text(
      margin = margin(5, 0, 5, 0),
      size = 12
    ),
    axis.title.y = element_text(
      margin = margin(0, 5, 0, 5),
      size = 12
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ),
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1,
      size = 8
    )
  )

As you might expect, the average intensity begins to rise around 5:00 AM and stays relatively high throughout the workday before lowering around 8:00 PM. This is evident by looking at the mean for each hour, which is represented by the white dots in each column. The peak times look to be the same as what we saw earlier for observation counts:

  • 12:00 - 2:00 PM (12:00 - 14:00)
  • 5:00 - 7:00 PM (17:00 - 19:00)

How about day-to-day?

# Average intensity throughout the week ---------------------------
average_intensity_p <- 
  ggplot(
    data = hourly_data,
    aes(
      x = reorder(day_list, desc(day_list)),
      y = average_intensity, 
      color = average_intensity
      )
    )
average_intensity_p +
  geom_jitter(width = 0.3, size = 0.1) +
  labs(
    x = "Day", 
    y = "Average intensity",
    color = "Average Intensity",
    title = "Average intensity throughout the week"
  ) + 
  stat_summary(
    aes(
      x = reorder(day_list, desc(day_list)),
      y = average_intensity
      ),
    fun = mean, 
    geom = "point", 
    color = "white", 
    size = 1,
    alpha = 0.7
    ) +
  scale_color_viridis_c() +
  theme_solarized(light = FALSE) +
  theme(
    axis.title.x = element_text(
      margin = margin(5, 0, 5, 0),
      size = 12
    ),
    axis.title.y = element_text(
      margin = margin(0, 5, 0, 5),
      size = 12
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ),
    axis.text.x = element_text(
      size = 12
    )
  ) +
  coord_flip()

Unlike the hour-to-hour observations, there isn’t as clear a pattern day-to-day.
This is once again evident my looking at the mean for each day, which is represented by the white dots in each column. The difference in position for each dot is minuscule here.

Let’s go back to intensity and hour of day. Are the trends similar when we look at the weekend vs. week?

# average_intensity throughout a 24 hour day ---------------------------
hd_A <- dplyr::select(hourly_data, -time_of_week)
ggplot(
  data = hourly_data,
  aes(x = as.factor(hour_of_day), y = average_intensity)
  ) +
  geom_jitter(data = hd_A, color = "grey70", width = 0.4, alpha = 0.6, size = 2) +
  geom_jitter(aes(color = time_of_week), width = 0.4, alpha = 0.6, size = 2) +
  scale_color_viridis_d(begin = 0, end = .5) +
  labs(
    title = "Average intensity throughout a 24 hour day",
    x = "Hour"
    ) +
  ylab("Average intensity") +
  scale_x_discrete(
    "Hour of day",
    expand = c(0.0002, 0.0002),
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = as.factor(hour_of_day), y = average_intensity),
    fun = mean, 
    geom = "point", 
    color = "#fde725", 
    size = 4
    ) +
  theme_pander() + 
  theme(
    legend.position = "none",
    axis.title.x = element_text(
      margin = margin(15, 0, 10, 0),
      size = 16
    ),
    axis.title.y = element_text(
      margin = margin(0, 10, 0, 10),
      size = 16
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 20
    ),
    strip.text.x = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ), 
    axis.text.x = element_text(
      size = 12
    ),
    axis.text.y = element_text(
      size = 12
    )
  ) +
  facet_wrap(vars(time_of_week), nrow = 2)

# total_intensity throughout a 24 hour day ---------------------------
jitter_pos <- 
  position_jitter(width = 0.25, height = 0, seed = 42)
set.seed(42)
hd_A <- 
  dplyr::select(hourly_data, -time_of_week)
set.seed(42)
hd_total_intensity_p <- 
  ggplot(
    data = hourly_data,
    aes(x = as.factor(hour_of_day), y = total_intensity, color = time_of_week)
    )
# total_intensity throughout a 24 hour day ---------------------------
hd_total_intensity_p +
  geom_point(
    data = transform(hourly_data, time_of_week = NULL),
    color = "grey70",
    position = position_jitter(width = 0.4, height = 0, seed = 42),
    size = 2
  ) +
  geom_point(
    position = position_jitter(width = 0.4, height = 0, seed = 42),
    size = 2,
    alpha = 0.6
  ) +
  labs(
    title = "Total intensity throughout a 24 hour day", 
    x = "Hour"
    ) +
  ylab("Total intensity") +
  scale_color_viridis_d(begin = 0, end = .5, guide = "legend") +
  scale_x_discrete(
    "Hour of day",
    expand = c(0.0002, 0.0002),
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = as.factor(hour_of_day), y = total_intensity),
    fun = mean,
    geom = "point", 
    color = "#fde725",
    size = 4
    ) +
  theme_pander() + 
  theme(
    legend.position = "none",
    axis.title.x = element_text(
      margin = margin(15, 0, 10, 0),
      size = 16
    ),
    axis.title.y = element_text(
      margin = margin(0, 10, 0, 10),
      size = 16
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 20
    ),
    strip.text.x = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ),
    axis.text.x = element_text(
      size = 12
    ),
    axis.text.y = element_text(
      size = 12
    )
  ) +
  facet_wrap(vars(time_of_week), nrow = 2)

Active vs sedentary

How about active minutes vs. sedentary minutes throughout the week?

# Create theme for plot formatting ---------------------------
theme_margin <- 
  theme(
    axis.title.x = element_text(margin = margin(15, 0, 15, 0)),
    axis.title.y = element_text(margin = margin(0, 10, 0, 10)),
    plot.title = element_text(margin = margin(10, 0, 10, 0)),
    strip.text.x = element_text(margin = margin(10, 0, 10, 0)),
    )
# Very active minutes vs sedentary minutes ---------------------------
dd02_A <- dplyr::select(daily_data, -time_of_week)
ggplot(daily_data, aes(x = very_active_m, y = sedentary_m)) +
  geom_jitter(data = dd02_A, color = "grey70") +
  geom_jitter(
    aes(x = very_active_m, y = sedentary_m),
    color = "#112E51", 
    width = 0.4, 
    size = 1
    ) +
  labs(title = "Very active minutes vs sedentary minutes throughout the week") +
  ylab("Sedentary minutes") +
  xlab("Very active minutes") +
  geom_smooth(
    aes(x = very_active_m, y = sedentary_m),
    color = "#00CFFF", 
    linetype = "longdash", 
    alpha = 0.4,
    se = FALSE
    ) +
  theme_minimal() + 
  theme_margin +
  facet_grid(~time_of_week ~ .)

# Fairly active minutes vs sedentary minutes ---------------------------
dd02_A <- dplyr::select(daily_data, -time_of_week)
ggplot(daily_data, aes(x = fairly_active_m, y = sedentary_m)) +
  geom_jitter(data = dd02_A, color = "grey70") +
  geom_jitter(
    aes(x = fairly_active_m, y = sedentary_m),
    color = "#112E51", 
    width = 0.4, 
    size = 1
    ) +
  labs(
    title = "Fairly active minutes vs sedentary minutes throughout the week"
    ) +
  ylab("Sedentary minutes") +
  xlab("Fairly active minutes") +
  geom_smooth(
    aes(x = fairly_active_m, y = sedentary_m),
    color = "#00CFFF", 
    linetype = "longdash", 
    alpha = 0.6,
    se = FALSE
    ) +
  theme_minimal() + 
  theme_margin +
  facet_grid(~time_of_week ~ .)

# Lightly active minutes vs sedentary minutes ---------------------------
dd02_A <- dplyr::select(daily_data, -time_of_week)
ggplot(daily_data, aes(x = lightly_active_m, y = sedentary_m)) +
  geom_jitter(data = dd02_A, color = "grey70") +
  geom_jitter(
    aes(x = lightly_active_m, y = sedentary_m),
    color = "#112E51", 
    width = 0.4, 
    size = 1
    ) +
  labs(
    title = "Lightly active minutes vs sedentary minutes throughout the week"
    ) +
  ylab("Sedentary minutes") +
  xlab("Lightly active minutes") +
  geom_smooth(
    aes(x = lightly_active_m, y = sedentary_m),
    color = "#00CFFF", 
    linetype = "longdash", 
    alpha = 0.6,
    se = FALSE
    ) +
  theme_minimal() + 
  theme_margin +
  facet_grid(~time_of_week ~ .)

Active minutes

A key health metric

Of all the metrics we’ve explored thus far, active minutes is the one we’re going to focus on moving forward. We’ll use this section to generate our supporting visualizations and key findings. Then, we’ll provide our top high-level content recommendations based on this analysis to our key stakeholders.

Share


Stakeholder presentation

I’ve provided a link below for my presentation, which includes the following:

  • A summary of my analysis
  • Supporting visualizations and key findings
  • Recommendations for marketing strategy within the Bellabeat App

HTML (Google Slides; requires online access)

Session info

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] knitr_1.37         personograph_0.1.3 grImport_0.9-5     XML_3.99-0.8      
 [5] ggbeeswarm_0.6.0   ggforce_0.3.3      ggthemes_4.2.4     patchwork_1.1.1   
 [9] crayon_1.4.2       paletteer_1.4.0    gridExtra_2.3      skimr_2.1.3       
[13] lubridate_1.8.0    janitor_2.1.0      hms_1.1.1          here_1.0.1        
[17] forcats_0.5.1      stringr_1.4.0      dplyr_1.0.7        purrr_0.3.4       
[21] readr_2.1.1        tidyr_1.1.4        tibble_3.1.6       ggplot2_3.3.5     
[25] tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] nlme_3.1-153      fs_1.5.2          bit64_4.0.5       httr_1.4.2       
 [5] rprojroot_2.0.2   repr_1.1.4        tools_4.1.2       backports_1.4.1  
 [9] utf8_1.2.2        R6_2.5.1          vipor_0.4.5       DBI_1.1.2        
[13] mgcv_1.8-38       colorspace_2.0-2  withr_2.4.3       tidyselect_1.1.1 
[17] bit_4.0.4         compiler_4.1.2    cli_3.1.0         rvest_1.0.2      
[21] xml2_1.3.3        labeling_0.4.2    scales_1.1.1      digest_0.6.29    
[25] rmarkdown_2.11    base64enc_0.1-3   pkgconfig_2.0.3   htmltools_0.5.2  
[29] highr_0.9         dbplyr_2.1.1      fastmap_1.1.0     rlang_0.4.12     
[33] readxl_1.3.1      rstudioapi_0.13   jquerylib_0.1.4   farver_2.1.0     
[37] generics_0.1.1    jsonlite_1.7.2    vroom_1.5.7       magrittr_2.0.1   
[41] Matrix_1.3-4      Rcpp_1.0.7        munsell_0.5.0     fansi_1.0.0      
[45] lifecycle_1.0.1   stringi_1.7.6     yaml_2.2.1        snakecase_0.11.0 
[49] MASS_7.3-54       parallel_4.1.2    lattice_0.20-45   haven_2.4.3      
[53] splines_4.1.2     pillar_1.6.4      reprex_2.0.1      glue_1.6.0       
[57] evaluate_0.14     modelr_0.1.8      vctrs_0.3.8       tzdb_0.2.0       
[61] tweenr_1.0.2      cellranger_1.1.0  gtable_0.3.0      polyclip_1.10-0  
[65] rematch2_2.1.2    assertthat_0.2.1  xfun_0.29         broom_0.7.11     
[69] viridisLite_0.4.0 beeswarm_0.4.0    ellipsis_0.3.2   

References