Background
Bellabeat manufactures health-related products for women that collect data on activity, sleep, stress, and reproductive health. While they have found success as a small company, it may be possible to increase their share of the global smart device market. Urška Sršen, the cofounder and CCO of Bellabeat, feels that collecting and analyzing data from their smart fitness devices can yield insights into new growth opportunities for the company.
Business Task
Both Urška Sršen (cofounder and CCO) and Sando Mur (cofounder) are stakeholders for this study. Some questions guiding the analysis are:
- Are there any trends in smart device usage? If so, what are they?
- In what ways do these trends relate to Bellabeat customers?
- How can Bellabeat adjust its marketing strategy such that these trends are addressed?
Inspect the Data
Data from two months of device usage, stored in CSV files, were used for this analysis. This includes data related to activity, heart rate, sleep monitoring, and so on.
The data are not recent (they are from 2016), which limits their usefulness. The number of users was also limited: for example, the activity data list only 35 unique users, and the weight data only 8. The data are fairly comprehensive, however, covering a wide range of activities. The data were collected through Amazon Mechanical Turk, and it is not clear whether they have been cited.
Load the Data
The data were uploaded into RStudio. First, set the working directory and load the required packages. In this case, loading tidyverse is enough, as it pulls in readr, ggplot2, lubridate, and other useful libraries.
> setwd("~/capstone")
> library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
Next, load the CSV files. Since the data are split across two files (going by the folder names, one covering March 12 to April 11 and one covering April 12 to May 12), we can combine them into a two-month span using rbind.
> activity <- read_csv("Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv")
Rows: 457 Columns: 15
── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistan...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> activity <- rbind(activity, read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv"))
Rows: 940 Columns: 15
── Column specification ─────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistanc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To verify that this worked as intended, check the dimensions of the dataframe.
> dim(activity)
[1] 1397   15
This shows that the 940 rows from the second CSV have been appended to the 457 rows from the first one (457 + 940 = 1397).
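A matching row count confirms that the append worked, but it does not reveal whether the same user and day appear in both exports. A minimal sketch of that check follows, using a small made-up data frame in place of `activity` (the values are illustrative, not taken from the dataset):

```r
# Toy stand-in for the combined `activity` data (values are made up).
toy_activity <- data.frame(
  Id           = c(1, 1, 2, 2, 2),
  ActivityDate = c("4/11/2016", "4/12/2016", "4/11/2016", "4/12/2016", "4/12/2016"),
  TotalSteps   = c(5000, 7000, 3000, 4100, 4100)
)

# Flag rows whose (Id, ActivityDate) pair has already appeared earlier.
key_cols  <- c("Id", "ActivityDate")
is_repeat <- duplicated(toy_activity[, key_cols])

# Number of repeated user-days; 0 would mean the two exports do not overlap.
n_repeats <- sum(is_repeat)

# Keep only the first record for each user-day.
deduped <- toy_activity[!is_repeat, ]
```

If repeats do turn up in the real data, whether to keep the first record, the last, or the one with the larger step count is a judgment call worth stating in the report.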
The other CSV files (e.g. intensities, weight, and so on) were loaded in the same way.
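Since the same read-and-append step is repeated for each file, it could be wrapped in a small helper. The sketch below is one possible approach, not the code used in this analysis: `load_both_periods` is a hypothetical name, it uses dplyr's `bind_rows` in place of `rbind`, and the demo writes two tiny temporary files so that it runs on its own.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read the same-named CSV from each export folder
# and stack the results into one tibble.
load_both_periods <- function(file_name, dirs) {
  paths <- file.path(dirs, file_name)
  bind_rows(lapply(paths, read_csv, show_col_types = FALSE))
}

# Self-contained demo: two temporary folders stand in for the real exports.
demo_dirs <- file.path(tempdir(), c("period_1", "period_2"))
for (d in demo_dirs) dir.create(d, showWarnings = FALSE)
write_csv(data.frame(Id = c(1, 2), Value = c(10, 20)),
          file.path(demo_dirs[1], "demo_merged.csv"))
write_csv(data.frame(Id = 3, Value = 30),
          file.path(demo_dirs[2], "demo_merged.csv"))

demo <- load_both_periods("demo_merged.csv", demo_dirs)
dim(demo)
```

With the real data, `dirs` would be the two folder names used above ("Fitabase Data 3.12.16-4.11.16" and "Fitabase Data 4.12.16-5.12.16") and `file_name` would be, for example, "dailyActivity_merged.csv".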
Clean the Data
Examining the data shows that some minor adjustments need to be made.
> head(activity)
# A tibble: 6 × 15
      Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDist…¹ VeryActiveDistance
   <dbl> <chr>             <dbl>         <dbl>           <dbl>                  <dbl>              <dbl>
1 1.50e9 3/25/2016         11004          7.11            7.11                      0               2.57
2 1.50e9 3/26/2016         17609          11.6            11.6                      0               6.92
3 1.50e9 3/27/2016         12736          8.53            8.53                      0               4.66
4 1.50e9 3/28/2016         13231          8.93            8.93                      0               3.19
5 1.50e9 3/29/2016         12041          7.85            7.85                      0               2.16
6 1.50e9 3/30/2016         10970          7.16            7.16                      0               2.36
# ℹ abbreviated name: ¹LoggedActivitiesDistance
# ℹ 8 more variables: ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
#   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
#   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
The dates are stored as character data, but it would be easier to analyze things if they were proper dates. We can use mdy from lubridate to convert them.
> activity$ActivityDate <- mdy(activity$ActivityDate)
> head(activity)
# A tibble: 6 × 15
      Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDist…¹ VeryActiveDistance
   <dbl> <date>            <dbl>         <dbl>           <dbl>                  <dbl>              <dbl>
1 1.50e9 2016-03-25        11004          7.11            7.11                      0               2.57
2 1.50e9 2016-03-26        17609          11.6            11.6                      0               6.92
3 1.50e9 2016-03-27        12736          8.53            8.53                      0               4.66
4 1.50e9 2016-03-28        13231          8.93            8.93                      0               3.19
5 1.50e9 2016-03-29        12041          7.85            7.85                      0               2.16
6 1.50e9 2016-03-30        10970          7.16            7.16                      0               2.36
# ℹ abbreviated name: ¹LoggedActivitiesDistance
# ℹ 8 more variables: ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
#   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
#   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
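One way to confirm that the conversion did not silently lose values is to count the strings that failed to parse, since mdy returns NA for those. A minimal sketch on a few sample strings (the malformed entry is made up to show the failure case):

```r
library(lubridate)

# Sample strings in the same month/day/year layout as ActivityDate;
# the last one is deliberately malformed.
raw_dates <- c("3/25/2016", "4/12/2016", "unknown")

# quiet = TRUE suppresses the parse warning so failures can be counted explicitly.
parsed <- mdy(raw_dates, quiet = TRUE)

# Strings that failed to parse come back as NA.
n_failed <- sum(is.na(parsed))
raw_dates[is.na(parsed)]
```

On the real data, the same count could be taken by parsing the original character column into a new variable before overwriting it.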
Examine the Data in Detail
First, determine the number of unique identifiers in the data.
> n_distinct(activity$Id)
[1] 35
There are 35 unique users, which is not a large sample. Next, obtain some basic statistics about the data.
> activity %>%
+   select(TotalSteps, TotalDistance, Calories) %>%
+   summary()
   TotalSteps    TotalDistance       Calories
 Min.   :    0   Min.   : 0.000   Min.   :   0
 1st Qu.: 3146   1st Qu.: 2.170   1st Qu.:1799
 Median : 6999   Median : 4.950   Median :2114
 Mean   : 7281   Mean   : 5.219   Mean   :2266
 3rd Qu.:10544   3rd Qu.: 7.500   3rd Qu.:2770
 Max.   :36019   Max.   :28.030   Max.   :4900
We can see that the mean number of steps per day is 7281, and the mean distance is 5.219. The unit is most likely kilometers rather than miles, since that works out to roughly 1,400 steps per unit of distance. The mean number of calories burned is 2266. There are also entries that record no steps whatsoever, and what appear to be some outliers in the data, such as 36019 steps.
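Entries with zero steps may be days on which the device was not worn (an assumption, since the data do not say), and they pull the mean down. A minimal sketch of how to count them and compare the mean with and without them, using a made-up data frame in place of `activity`:

```r
# Toy stand-in for `activity` (values are made up for illustration).
toy_activity <- data.frame(
  Id         = c(1, 1, 2, 2, 3),
  TotalSteps = c(0, 8000, 12000, 0, 10000)
)

# Count days with no recorded steps.
n_zero_days <- sum(toy_activity$TotalSteps == 0)

# Compare the mean with and without those days.
mean_all     <- mean(toy_activity$TotalSteps)
mean_nonzero <- mean(toy_activity$TotalSteps[toy_activity$TotalSteps > 0])
```

Reporting both figures side by side makes it clear how much the zero-step days affect the average.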
As before, the other CSV files were examined using similar methods. Some of these statistics are discussed in the Insights section of this report.
Analyze the Data
Using RStudio, it was possible to create visuals showing relationships between variables. For example, plotting the number of daily steps vs. the calories burned throughout the day produces the following scatter plot.
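The code for this plot is not listed in this report. A minimal sketch of one way to draw it with ggplot2 follows; the data frame is a made-up stand-in for `activity`, and the trend line and plot title are assumptions rather than a record of the original figure.

```r
library(ggplot2)

# Toy stand-in for `activity` (values are made up for illustration).
toy_activity <- data.frame(
  TotalSteps = c(1000, 4000, 7000, 10000, 13000),
  Calories   = c(1500, 1900, 2100, 2600, 3000)
)

# Scatter plot of daily steps against calories, with a linear trend line.
steps_plot <- ggplot(toy_activity, aes(x = TotalSteps, y = Calories)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Daily Steps vs. Calories Burned")

# Pearson correlation as a numeric summary of the same relationship.
steps_cor <- cor(toy_activity$TotalSteps, toy_activity$Calories)
```

Printing `steps_plot` draws the figure, and `steps_cor` puts a number on the strength of the relationship that the plot shows visually.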
This shows that there is a positive correlation between the two variables: participants who are more active burn more calories. It is also useful to visualize proportions as pie charts. We can examine the percentage of tracked time spent very active, fairly active, lightly active, or sedentary using the following:
> VeryActiveMinutes <- sum(activity$VeryActiveMinutes)
> FairlyActiveMinutes <- sum(activity$FairlyActiveMinutes)
> LightlyActiveMinutes <- sum(activity$LightlyActiveMinutes)
> SedentaryMinutes <- sum(activity$SedentaryMinutes)
> sectors <- c(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes)
> label <- c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")
> percentages <- round(sectors/sum(sectors)*100)
> label <- paste(label, percentages, sep = " ")
> # paste0() takes no sep argument, so only the percent sign is appended
> label <- paste0(label, "%")
> pie(sectors, labels = label, col = c("blue", "green", "yellow", "red"),
+     main = "Intensity Level Percentage",
+     sub = "Percentage of tracked time spent very active, fairly active, lightly active, or sedentary")
This produces the following visual.
From this we can see that the overwhelming majority of tracked time per day is sedentary, with only a small share spent at the fairly active or very active levels.
Another relationship that is interesting to note is that of time spent in bed and time asleep.
> ggplot(data=sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
+   geom_point() +
+   labs(title="Minutes Asleep vs. Time in Bed")
This graph shows a positive correlation between the two variables, but there are some outliers in the data. That is, some participants who spent a large amount of time in bed did not necessarily get a lot of sleep.
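The outliers can be made concrete by computing the share of time in bed that was spent asleep. The sketch below uses a made-up data frame in place of `sleep`; the `Efficiency` column and the 0.75 cutoff are hypothetical choices for illustration, not values from this analysis.

```r
# Toy stand-in for `sleep` (values are made up for illustration).
toy_sleep <- data.frame(
  TotalMinutesAsleep = c(420, 400, 300, 450),
  TotalTimeInBed     = c(450, 430, 520, 470)
)

# Share of time in bed actually spent asleep.
toy_sleep$Efficiency <- toy_sleep$TotalMinutesAsleep / toy_sleep$TotalTimeInBed

# Hypothetical cutoff for "a lot of time in bed, little sleep".
efficiency_cutoff <- 0.75
low_efficiency <- toy_sleep[toy_sleep$Efficiency < efficiency_cutoff, ]
```

Rows in `low_efficiency` correspond to the points that sit well above the main diagonal band in the scatter plot.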
Insights
The average total steps per day is 7281, which is below the “10000 steps per day” target that many activity tracker companies recommend and that is often attributed to the CDC. The average sedentary time is 991 minutes per day (about 16.5 hours), which could definitely stand to be lowered. If Bellabeat offered an incentive system (e.g. points, scores) and notifications suggesting that participants be more active, they might see an uptick in activity.
The majority of the participants are lightly active, and sleep for approximately 7 hours per day, while the recommended daily amount is 8 hours. Sleep time is generally influenced by the amount of time spent in bed, but not always. Bellabeat might integrate a “sleep coach” feature that helps users get an appropriate amount of sleep.
If participants are trying to lose weight, Bellabeat could suggest ideas for low-calorie meals.