Case Study (R): bike-sharing service data analysis

How does a bike-sharing service design a marketing strategy to convert occasional cyclists into annual subscribers?


Scenario

You are a data analyst working in the marketing analyst team of Cyclistic, a bike-sharing company in Chicago. The marketing director believes that the company’s future success depends on maximising the number of annual subscriptions. Therefore, your team wants to understand how occasional and annual cyclists use Cyclistic bikes differently. Based on this information, the team will design a new marketing strategy to convert occasional cyclists into annual subscribers. But first, Cyclistic managers must evaluate your recommendations, which must be supported by convincing data and professional visualisations.

Personnel involved:
  • Cyclistic: A bike-sharing program with more than 5,800 bicycles and 600 docking stations. Cyclistic offers recumbent bikes, hand tricycles, and cargo bikes, making bike-sharing more inclusive for people with disabilities and cyclists who cannot use standard two-wheelers. Most cyclists opt for traditional bicycles; about 8 per cent use the assistive options. Bike-sharing users are more likely to ride for recreation, but about 30 per cent use the bikes to commute to work every day.
  • Lily Moreno: the company’s marketing director and your manager. Moreno is responsible for developing campaigns and initiatives to promote the bike-share program. These can include email, social media and other channels.
  • Cyclistic marketing analytics team: a team of data analysts responsible for collecting, analyzing, and reporting data that help drive Cyclistic’s marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals and how you, as a junior data analyst, can help Cyclistic achieve them.
  • Cyclistic executive team: the traditionally detail-oriented executive team will decide whether to approve the recommended marketing program.
Guidelines:
  • How does annual subscribers’ and occasional cyclists’ use of Cyclistic bicycles differ?
  • Why should casual cyclists buy Cyclistic’s annual subscription?
  • How can Cyclistic use digital media to influence casual cyclists to make an annual subscription?


[1-6] Ask

Questions:
  • What is the problem you are trying to solve?
    Create a profile of the two types of customers so that all their most critical behavioural characteristics can be identified.
  • How can your insights guide Cyclistic’s business decisions?
    My work can help the marketing team devise a strategy to convert as many casual cyclists as possible into subscribers.
Main tasks to be performed:
  • Identify the task at hand
  • Identify the main parties involved in the project (stakeholders)
Objectives:
  • Identify the task at hand
  • Identify the main parties involved in the project (stakeholders)

[2-6] Prepare

Introduction:

Historical data regarding customer rides for the past 12 months collected directly from Cyclistic are used to analyse and identify behaviours.

Questions:
  • Where can the data be found?
    The data are found grouped on a page accessible through a public link. The datasets are named differently because Cyclistic is a fictional company.
  • How are the data organized?
    Data are available in individual .csv files broken down by month.
  • Are there problems of bias or reliability in these data? Is your data ROCCC?
    I did not identify any problems of bias or reliability at the preparation stage, since the data were collected directly from the company and the population is the entire customer base. My data are reliable, original, comprehensive, current and cited (ROCCC).
  • How are issues of user license, privacy, security, and accessibility of processed data handled?
    Regarding privacy, the data do not include sensitive information (e.g., credit cards, phone numbers, etc.), making it impossible to trace the identity of an individual cyclist.
    The data were made available by Motivate International Inc. with this license document. Use is reserved for non-commercial purposes only. This public data can be used to explore how customers use Cyclistic bicycles. However, data privacy issues prohibit the use of cyclists’ personal information. This means it will be impossible to link pass purchases to credit card numbers to determine whether occasional cyclists live in Cyclistic’s service area or have purchased multiple individual passes.
    For this case study, the data sets are appropriate and allow the assigned questions to be answered.
  • How did you verify the integrity of the data?
    Each dataset has easy-to-identify labelled columns, and the data are populated correctly according to the specific type.
  • How does the procedure performed help you in carrying out the analysis?
    The procedure followed during the preparation phase will allow answering the client’s central question, thus giving an accurate idea of the behaviour pattern of the cyclist using Cyclistic’s services.
  • Have any issues been identified with the data received?
    Cells with empty or null values were identified.
Main tasks to be performed:
  • Download the data and store them appropriately.
  • Identify how the data are organized.
  • Sort and filter the data.
  • Determine the credibility of the data.
Objectives:
  • Description of all data sources used.


[3-6] Process

Introduction:

Data for the past 12 months will be loaded, and some new columns labelled with easy-to-understand vocabulary, such as “ride_length” and “day_of_the_week,” will be created.

  • The prefix “ds_” will be used for datasets;
  • “Members” will refer to annual subscribers;
  • “Casual” will refer to occasional users who rent from time to time;
  • The “occasional” bicycle rental operation is assumed to be done on the company’s website, via a mobile application, or directly at stations.
Questions:
  • Which tools to choose and why?
    To sort and organize the data, I chose the R language with RStudio: it is well suited to all the tasks required by the case study and lets every operation be performed in one place, so the work can easily be revised if changes or additions are needed.
  • Have you ensured data integrity?
    Yes, the data are consistent in all columns.
  • What measures are taken to ensure data cleansing?
    First, the columns were formatted with the correct data types; then NA values and duplicates were removed.
  • How can you verify that the data are clean and ready to be analyzed?
    You can verify it through this R markdown file.
  • Have you documented your cleaning process so that you can review and share the results?
    Yes, I confirm this has been detailed in this R markdown file.
Main tasks to be performed:
  • Check the data for errors;
  • Choose the most suitable tools;
  • Transform data so that it can be used effectively;
  • Document the cleaning process.
Objectives:
  • Documentation of any data cleaning or manipulation activities

Libraries Setup

Before loading the libraries, all related packages must have been previously installed.
If not, run the first code chunk below; otherwise, go directly to loading the libraries.
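As an alternative to running the install chunk by hand, the install-then-load step can be sketched as a conditional install that fetches only the missing packages (a common convenience pattern, not part of the original script):

```r
# A minimal sketch: install only the packages that are missing, then load them all
pkgs <- c("tidyverse", "lubridate", "ggplot2", "janitor", "dplyr", "skimr", "scales")
to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(to_install) > 0) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))
```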

install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("janitor")
install.packages("dplyr")
install.packages("skimr")
install.packages("scales")
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle data attributes
library(ggplot2) #helps visualize data
library(janitor) # simple tools for examining and cleaning dirty data
library(dplyr) # data manipulations
library(skimr) # compact and flexible summaries of data
library(scales) # scale functions for visualization
getwd() #your working directory
Step 1 of 5: Data collection

Load the data sets into R:

# File names below follow the public Divvy naming convention and are assumptions;
# adjust them to match the files you actually downloaded
ds_2021_011 <- read_csv("202111-divvy-tripdata.csv")
ds_2021_012 <- read_csv("202112-divvy-tripdata.csv")
ds_2022_001 <- read_csv("202201-divvy-tripdata.csv")
ds_2022_002 <- read_csv("202202-divvy-tripdata.csv")
ds_2022_003 <- read_csv("202203-divvy-tripdata.csv")
ds_2022_004 <- read_csv("202204-divvy-tripdata.csv")
ds_2022_005 <- read_csv("202205-divvy-tripdata.csv")
ds_2022_006 <- read_csv("202206-divvy-tripdata.csv")
ds_2022_007 <- read_csv("202207-divvy-tripdata.csv")
ds_2022_008 <- read_csv("202208-divvy-tripdata.csv")
ds_2022_009 <- read_csv("202209-divvy-tripdata.csv")
ds_2022_010 <- read_csv("202210-divvy-tripdata.csv")
Step 2 of 5: Process the data and combine them into one file

Check the matching of fields between data sets and combine them.
Use the column names of the most recently loaded dataset as a reference.

colnames(ds_2021_011)
colnames(ds_2021_012)
colnames(ds_2022_001)
colnames(ds_2022_002)
colnames(ds_2022_003)
colnames(ds_2022_004)
colnames(ds_2022_005)
colnames(ds_2022_006)
colnames(ds_2022_007)
colnames(ds_2022_008)
colnames(ds_2022_009)
colnames(ds_2022_010)

Make sure the columns are of the same type:

compare_df_cols(ds_2021_011,ds_2021_012,ds_2022_001,ds_2022_002,ds_2022_003,ds_2022_004,ds_2022_005,ds_2022_006,ds_2022_007,ds_2022_008,ds_2022_009,ds_2022_010, return = "mismatch")

Combine individual data sets into a single data frame and remove empty rows and columns, if any. When finished, delete all previous individual data sets as they are no longer needed:

ds_all_trips <- bind_rows(ds_2021_011, ds_2021_012, ds_2022_001, ds_2022_002, ds_2022_003, ds_2022_004, ds_2022_005, ds_2022_006, ds_2022_007, ds_2022_008, ds_2022_009, ds_2022_010) %>%
 remove_empty(which = c("rows", "cols")) # janitor: drop fully empty rows/columns
rm(ds_2021_011, ds_2021_012, ds_2022_001, ds_2022_002, ds_2022_003, ds_2022_004, ds_2022_005, ds_2022_006, ds_2022_007, ds_2022_008, ds_2022_009, ds_2022_010) # individual data sets no longer needed

Summary of data structure:

summary(ds_all_trips)
Step 3 of 5: Clean up and add data to prepare for analysis

Examine the newly created dataset

List of column names:

colnames(ds_all_trips)

Number of rows present:

nrow(ds_all_trips)

Dimensions of the data structure:

dim(ds_all_trips)

See the first six rows of the data frame and verify that the dates match the start of the time interval required for the analysis:

head(ds_all_trips)

See the last six rows of the data frame and verify that the dates match the end of the time interval required for the analysis:

tail(ds_all_trips)

See the list of columns and data types (numeric, characters, etc.):

str(ds_all_trips)

Statistical summary of data.
In particular, check all numerical data for anomalies:

summary(ds_all_trips)

There are some problems to be solved.
  1. PROBLEM 1
    (PRESENT ONLY IF you are using pre-2020 data sets)
    In the “member_casual” column, members are shown in two different manners (“member” and “Subscriber”), same for casual riders (“Customer” and “casual”). We will have to consolidate these labels from four to two.
  2. PROBLEM 2
    Data can only be aggregated at the ride level, which is too granular. Additional columns such as day, month, and year should be added to allow other opportunities for data aggregation.
  3. PROBLEM 3
    A field should be added to calculate the duration of each ride. We will add “ride_length” to the entire data frame, split into hours, minutes and seconds.
  4. PROBLEM 4
    There are some rides with a negative duration, as well as rides where Divvy took the bikes out of circulation for quality control (identifiable by “HQ QR” in “start_station_name”). These rides should be eliminated if present in the time interval taken for analysis.
Problem 1 of 4

(PRESENT ONLY IF you are using pre-2020 data sets)

In the “member_casual” column, replace “Subscriber” with “member” and “Customer” with “casual.”

Before 2020, Divvy used different labels for these two types of clientele; we need to make our data frame consistent with the current/recent nomenclature.

1 – Begin by seeing how many observations fall into each client type:

table(ds_all_trips$member_casual)

2 – Run the following code chunk only if the previous step reported values other than the correct “casual” and “member”, recoding them to the official nomenclature (the labels used since 2020: “member” and “casual”).

ds_all_trips <- ds_all_trips %>%
mutate(member_casual = recode(member_casual ,"Subscriber" = "member" ,"Customer" = "casual"))

3 – Once the previous (optional) step has been performed, ensure the correct number of observations has been reassigned.

table(ds_all_trips$member_casual)
Problem 2 of 4

1 – Datetime processing (via lubridate library):

ds_all_trips$started_at <- ymd_hms(ds_all_trips$started_at) # parse as date-times (assumes "YYYY-MM-DD HH:MM:SS" strings)
ds_all_trips$ended_at <- ymd_hms(ds_all_trips$ended_at)

2 – Creation of two new fields for the start and end hour of each ride:

ds_all_trips$start_hour <- hour(ds_all_trips$started_at) # hour of day the ride started
ds_all_trips$end_hour <- hour(ds_all_trips$ended_at)     # hour of day the ride ended

3 – Creation of two new fields containing the initial letter and the number of the day of the week, respectively.
The “lubridate.week.start” option is used to prevent the system from treating Sunday as the first day of the week, setting it instead according to the local convention (in my case, the Italian one, where the week starts on Monday):

ds_all_trips$day_of_week_letter <- substr(weekdays(ds_all_trips$started_at), 1, 1) # initial letter of the day name
ds_all_trips$day_of_week_number <- wday(ds_all_trips$started_at, week_start = getOption("lubridate.week.start", 1)) # day number

4 – Creation of new columns with valid values for subsequent measurements.
Add columns with each ride’s date, month, day (both numerical and textual) and year.
This will allow us to aggregate ride data by month, day, or year; without these columns, it would only be possible to aggregate at the individual ride level:

ds_all_trips$date <- as.Date(ds_all_trips$started_at)
ds_all_trips$month <- format(ds_all_trips$date, "%m")
ds_all_trips$day <- format(ds_all_trips$date, "%d")
ds_all_trips$day_of_week <- weekdays(ds_all_trips$date) # textual day name
ds_all_trips$year <- format(ds_all_trips$date, "%Y")

5 – Creation of a new field joining YEAR-MONTH (year_month):

ds_all_trips$year_month <- format(ds_all_trips$date, "%Y-%m")

6 – Addition of a new field with the day of the week, helpful in determining usage patterns during the week (weekday):

ds_all_trips$weekday <- wday(ds_all_trips$started_at, label = TRUE, week_start = getOption("lubridate.week.start", 1), locale = Sys.getlocale("LC_TIME"))
Problem 3 of 4

1 – Creation of the ride duration fields in hours, minutes, and seconds:

ds_all_trips$ride_length_hours <- difftime(ds_all_trips$ended_at, ds_all_trips$started_at, units = "hours")
ds_all_trips$ride_length_mins <- difftime(ds_all_trips$ended_at, ds_all_trips$started_at, units = "mins")
ds_all_trips$ride_length_secs <- difftime(ds_all_trips$ended_at, ds_all_trips$started_at, units = "secs")

2 – Convert the values of “ride_length_…” from factor to numeric so that calculations can be performed correctly for subsequent displays:

is.factor(ds_all_trips$ride_length_hours)
is.factor(ds_all_trips$ride_length_mins)
is.factor(ds_all_trips$ride_length_secs)
ds_all_trips$ride_length_hours <- as.numeric(as.character(ds_all_trips$ride_length_hours))
ds_all_trips$ride_length_mins <- as.numeric(as.character(ds_all_trips$ride_length_mins))
ds_all_trips$ride_length_secs <- as.numeric(as.character(ds_all_trips$ride_length_secs))

3 – Summary of data:

summary(ds_all_trips)
Problem 4 of 4

1 – Removal of NA values:

ds_all_trips_clean <- drop_na(ds_all_trips) # drop rows containing NA values

2 – Removal of duplicates:

ds_all_trips_clean_no_dups <- distinct(ds_all_trips_clean) # keep only unique rows
rm(ds_all_trips_clean) # removal of previous dataset

3 – Removal of rows with a negative or zero ride length (generated when bicycles are taken out of the docks and checked by Divvy for quality control), saving the result to a new data frame (ds_all_trips_clean_length_correct) and deleting the previous one:

ds_all_trips_clean_length_correct <- ds_all_trips_clean_no_dups %>% filter(ride_length_secs>0)
print(paste("Removed", nrow(ds_all_trips_clean_no_dups) - nrow(ds_all_trips_clean_length_correct), "rows with a non-positive ride length"))
rm(ds_all_trips_clean_no_dups) #removal of previous dataset

4 – Data summary of the new data frame:

summary(ds_all_trips_clean_length_correct)


[4-6] Analyze

Introduction:

At this stage, we will proceed with constructing a profile of the two types of customers (occasional and subscribers) and how they differ.

New variables helpful in identifying specific features will be created.

Questions:
  • How to organize the data to analyze it?
    I made sure that all columns are consistent with the data type they contain.
  • Has the data been formatted correctly?
    All data were formatted correctly.
  • What surprises did you discover by analyzing the data?
    I found no particular surprises, probably because the company has steadily improved its handling of these datasets (e.g., comparing a pre-2020 dataset, one finds many more discrepancies).
  • What trends or relationships did you find in the data?
    I found that the number of rides seems affected by weather conditions and that the gap between annual subscribers and occasional cyclists narrows in “warm” months.
  • How do these insights help answer the pivotal questions of this case study?
    These insights help us understand how Cyclistic’s customers use the bicycle rental service.
Main tasks to be performed:
  • Aggregate data so that they are helpful and accessible;
  • Organize and format data;
  • Perform the calculations;
  • Identify trends and relationships.
Objectives:
  • A summary of the analysis conducted
Step 4 of 5: Conduct a descriptive analysis

Analysis

1 – Descriptive analysis of ride duration (values are expressed in hours, minutes and seconds).

# MEAN
mean(ds_all_trips_clean_length_correct$ride_length_hours) #hours straight average (total ride length hours / rides)
mean(ds_all_trips_clean_length_correct$ride_length_mins) #mins straight average (total ride length minutes / rides)
mean(ds_all_trips_clean_length_correct$ride_length_secs) #secs straight average (total ride length seconds / rides)
# MEDIAN
median(ds_all_trips_clean_length_correct$ride_length_hours) #midpoint number in the ascending array of hour ride lengths
median(ds_all_trips_clean_length_correct$ride_length_mins) #midpoint number in the ascending array of mins ride lengths
median(ds_all_trips_clean_length_correct$ride_length_secs) #midpoint number in the ascending array of secs ride lengths
# MAX
max(ds_all_trips_clean_length_correct$ride_length_hours) #hours longest ride
max(ds_all_trips_clean_length_correct$ride_length_mins) #mins longest ride
max(ds_all_trips_clean_length_correct$ride_length_secs) #secs longest ride
# MIN
min(ds_all_trips_clean_length_correct$ride_length_hours) #hours shortest ride
min(ds_all_trips_clean_length_correct$ride_length_mins) #mins shortest ride
min(ds_all_trips_clean_length_correct$ride_length_secs) #secs shortest ride

2 – You can obtain the previous statistics for “hours,” “mins,” and “secs” at once using summary():

Instructions:
summary(ds_all_trips_clean_length_correct$ride_length_hours)
summary(ds_all_trips_clean_length_correct$ride_length_mins)
summary(ds_all_trips_clean_length_correct$ride_length_secs)

3 – Comparison of subscribers and casual users by ride duration in hours, minutes, and seconds.

Introduction

N.B. Between the mean and the median, the median is preferable here.
The mean is usually the most appropriate measure of central tendency because it takes every value in the dataset into account. However, outliers can pull the mean away from the bulk of the data, so that it no longer represents a typical value (as happens in the dataset used in this case study). The median is more robust because outliers do not affect it.
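A toy example (illustrative values only, not taken from the dataset) shows how a single outlier distorts the mean but not the median:

```r
# Five ride durations in minutes; the last one is an extreme outlier
durations <- c(5, 6, 7, 8, 200)
mean(durations)   # 45.2 - pulled far above a typical ride by the outlier
median(durations) # 7 - still representative of a typical ride
```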

3.1 – “mean”: calculation of the average ride duration
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_hours ~ ds_all_trips_clean_length_correct$member_casual, FUN = mean), c("customers", "mean hours"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_mins ~ ds_all_trips_clean_length_correct$member_casual, FUN = mean), c("customers", "mean mins"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_secs ~ ds_all_trips_clean_length_correct$member_casual, FUN = mean), c("customers", "mean secs"))
3.2 – “median”: calculation of the median (the central value of the ride durations)
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_hours ~ ds_all_trips_clean_length_correct$member_casual, FUN = median), c("customers", "median hours"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_mins ~ ds_all_trips_clean_length_correct$member_casual, FUN = median), c("customers", "median mins"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_secs ~ ds_all_trips_clean_length_correct$member_casual, FUN = median), c("customers", "median secs"))
3.3 – “max”: calculation of the maximum ride duration
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_hours ~ ds_all_trips_clean_length_correct$member_casual, FUN = max), c("customers", "max hours"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_mins ~ ds_all_trips_clean_length_correct$member_casual, FUN = max), c("customers", "max mins"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_secs ~ ds_all_trips_clean_length_correct$member_casual, FUN = max), c("customers", "max secs"))
3.4 – “min”: calculation of the minimum ride duration
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_hours ~ ds_all_trips_clean_length_correct$member_casual, FUN = min), c("customers", "min hours"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_mins ~ ds_all_trips_clean_length_correct$member_casual, FUN = min), c("customers", "min mins"))
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_secs ~ ds_all_trips_clean_length_correct$member_casual, FUN = min), c("customers", "min secs"))

4 – See the average travel time for each day between subscribers and occasional users:

Instructions
setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_secs ~ ds_all_trips_clean_length_correct$member_casual + ds_all_trips_clean_length_correct$day_of_week, FUN = mean), c("customers", "day week", "duration in seconds"))

4.1 – Note that the days of the week are not in order; let’s fix that.
Use the day names in Italian if the operating system is in that language (provided you are processing this work in an application installed on your computer, e.g., RStudio Desktop); otherwise, use the relevant translation (on online platforms, leave the day names in English):

ds_all_trips_clean_length_correct$day_of_week <- ordered(ds_all_trips_clean_length_correct$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) # replace with the day names of your locale if needed

4.2 – Let us analyze the average travel time for each day among subscribers and occasional users again:

setNames(aggregate(ds_all_trips_clean_length_correct$ride_length_secs ~ ds_all_trips_clean_length_correct$member_casual + ds_all_trips_clean_length_correct$day_of_week, FUN = mean), c("customers", "day week", "duration in seconds"))

5 – Analyze ride data by type and day of the week.

Instructions

To avoid the `summarise()` message about grouped output by `member_casual`, we can suppress it by setting the `dplyr.summarise.inform` option to `FALSE`.

options(dplyr.summarise.inform = FALSE)

CAUTION: In all graphs, lubridate’s `wday()` is used to sort the days of the week according to the local convention, which in my case starts the week on Monday (otherwise, the list would start on Sunday). If you need the original ordering, remove the single instruction that uses lubridate and re-enable the next one, which is disabled with the # symbol.

ds_all_trips_clean_length_correct %>% 
 mutate(weekday = lubridate::wday(ds_all_trips_clean_length_correct$started_at, label = TRUE, week_start = getOption("lubridate.week.start", 1),locale = Sys.getlocale("LC_TIME"))) %>% #creates weekday field using wday()
 #mutate(weekday = wday(started_at, label = TRUE)) %>%
 group_by(member_casual, weekday) %>% #groups by usertype and weekday
 dplyr::summarise(number_of_rides = n() #calculates the number of rides and average duration
 ,average_duration_hours = mean(ride_length_hours), average_duration_mins = mean(ride_length_mins), average_duration_secs = mean(ride_length_secs)) %>% #calculates the average duration
 arrange(member_casual, weekday) #sorts

6 – We display the number of rides by rider type

Instructions
# This function helps to resize the plots (an assumed implementation for
# notebook front-ends that honour the "repr" options; adjust as needed)
function_help_resize_plots <- function(width = 16, height = 8) {
 options(repr.plot.width = width, repr.plot.height = height)
}
ds_all_trips_clean_length_correct %>% 
 mutate(weekday = lubridate::wday(ds_all_trips_clean_length_correct$started_at, label = TRUE, week_start = getOption("lubridate.week.start", 1),locale = Sys.getlocale("LC_TIME"))) %>%
 #mutate(weekday = wday(started_at, label = TRUE)) %>% 
 group_by(member_casual, weekday) %>% 
 dplyr::summarise(number_of_rides = n(), average_duration = mean(ride_length_secs)) %>% 
 arrange(member_casual, weekday) %>% 
 ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_x_discrete(name = "day of the week") +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 geom_col(position = "dodge")
Number of rides per day of the week
Considerations on what has been elaborated up to this point

Some considerations can be drawn from the graphs just compiled:

  • during the weekend (Saturday and Sunday), occasional users ride significantly more than on any single weekday;
  • annual subscribers make more trips during the work week than at the weekend, which suggests use of the subscription for commuting (home/work).

7 – We create a visualization for the average duration

7.1 – Sorting into hours:
ds_all_trips_clean_length_correct %>% 
 mutate(weekday = lubridate::wday(ds_all_trips_clean_length_correct$started_at, label = TRUE, week_start = getOption("lubridate.week.start", 1),locale = Sys.getlocale("LC_TIME"))) %>%
 #mutate(weekday = wday(started_at, label = TRUE)) %>% 
 group_by(member_casual, weekday) %>% 
 dplyr::summarise(number_of_rides = n(), average_duration = mean(ride_length_hours)) %>% 
 arrange(member_casual, weekday) %>% 
 ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("average duration in hours") +
 scale_x_discrete(name = "day of the week") +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 geom_col(position = "dodge")
Average ride duration in hours
7.2 – Sorting in minutes:
ds_all_trips_clean_length_correct %>% 
 mutate(weekday = lubridate::wday(ds_all_trips_clean_length_correct$started_at, label = TRUE, week_start = getOption("lubridate.week.start", 1),locale = Sys.getlocale("LC_TIME"))) %>%
 #mutate(weekday = wday(started_at, label = TRUE)) %>% 
 group_by(member_casual, weekday) %>% 
 dplyr::summarise(number_of_rides = n(), average_duration = mean(ride_length_mins)) %>% 
 arrange(member_casual, weekday) %>% 
 ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("average duration in minutes") +
 scale_x_discrete(name = "day of the week") +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 geom_col(position = "dodge")
Average ride duration in minutes
7.3 – Sorting in seconds:
ds_all_trips_clean_length_correct %>% 
 mutate(weekday = lubridate::wday(ds_all_trips_clean_length_correct$started_at, label = TRUE, week_start = getOption("lubridate.week.start", 1),locale = Sys.getlocale("LC_TIME"))) %>%
 #mutate(weekday = wday(started_at, label = TRUE)) %>% 
 group_by(member_casual, weekday) %>% 
 dplyr::summarise(number_of_rides = n(), average_duration = mean(ride_length_secs)) %>% 
 arrange(member_casual, weekday) %>% 
 ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("average duration in seconds") +
 scale_x_discrete(name = "day of the week") +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 geom_col(position = "dodge")
Average ride duration in seconds
Considerations:

Some considerations can be drawn from these graphs:

  • occasional users’ average ride duration is roughly twice that of annual subscribers;
  • annual subscribers’ ride time is evenly distributed throughout the week;
  • occasional users’ ride time is significantly higher (compared to the other days of the week) on weekends and Mondays.


Data distribution

At this stage, we want to try to answer the most basic questions about data distribution.

Occasional vs. Subscribers

How much of the data concerns annual subscribers (member), and how much concerns occasional cyclists (casual)?

ds_all_trips_clean_length_correct %>% 
 group_by(member_casual) %>% 
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100)

Data visualization with a bar graph:

function_help_resize_plots(16,8)
ggplot(ds_all_trips_clean_length_correct, aes(member_casual, fill=member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 geom_bar() +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_x_discrete(name = "Casuals vs Members") +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 labs(title="Chart 1 - Casuals vs Members")
Chart 1 - Occasional vs. Subscribers

Visualization on a pie chart:

ds_all_trips_clean_length_correct %>% 
 group_by(member_casual) %>% 
 summarise(count = length(ride_id),
 percent_cat_users = round((length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100, digits = 2)) %>% 
 #arrange(percent_cat_users) %>%
 #mutate(percent_cat_users_labels = scales::percent(percent_cat_users, accuracy = 0.01)) %>% 
 ggplot(aes(x = "", y = count, fill = member_casual)) +
 geom_col(color = "black") +
 geom_label(aes(label = paste0(percent_cat_users, "%")), position = position_stack(vjust = 0.5), show.legend = FALSE) +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 labs(title="Casuals vs Members (in %)",
 subtitle = "November 2021 - October 2022",
 x="",
 y="") + 
 #scale_fill_viridis_d() +
 coord_polar(theta = "y") +
 theme_void() # remove default theme
Occasional vs. Subscribers

As seen in the “Occasional vs. Subscribers” table, subscribers account for the larger share of the dataset, with ~60% of rides, about 20 percentage points more than occasional cyclists.

Month

How are the data distributed by month?

ds_all_trips_clean_length_correct %>%
 group_by(year_month) %>%
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casual_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'dif_members_casuals' = members_p - casual_p)
ds_all_trips_clean_length_correct %>%
 ggplot(aes(year_month, fill=member_casual)) +
 geom_bar() +
 labs(x="Month", title="Chart 2 - Distribution of trips per month") +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 theme(axis.text.x=element_text(angle=45,hjust=1))
Chart 2 - Distribution by month

Some considerations can be drawn from this graph:

  • the number of rides seems to be affected by weather conditions;
  • the month with the largest number of data points was July 2022, with ~14.5% of the dataset;
  • in all months, subscribers take more rides than occasional cyclists;
  • the difference in proportion between annual subscribers and occasional cyclists is smaller in “warm” months;
  • the distribution appears cyclical.
Check correlation with climatic conditions.

To test the correlation with climatic conditions, we compare the number of runs with the climatic data for Chicago during the case study period.

Source: Climate-Data.org (Daily average °C, November 2021 — October 2022).
Please note: if the monthly reference interval changes, the temperature values in the “ds_chicago_temperature” dataset below must be changed manually.

ds_stats_n_races <- ds_all_trips_clean_length_correct %>%
 group_by(year_month) %>%
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casual_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'dif_members_casuals' = members_p - casual_p)
ds_chicago_temperature <- data.frame(
 year_month = ds_stats_n_races$year_month,
 temp_mean = c() # fill in the 12 monthly mean temperatures (Nov 2021 - Oct 2022) from Climate-Data.org
)
ds_chicago_temperature_races <- merge(ds_stats_n_races, ds_chicago_temperature, by = "year_month") # join temperatures to the monthly ride statistics
ggplot(ds_chicago_temperature_races, aes(x = year_month, y = count)) + # bar plot
 geom_col(linewidth = 1, color = "darkblue", fill = "white") +
 scale_x_discrete(name = "year - month") +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 theme(axis.text.x=element_text(angle=45,hjust=1))
Number of rides
ggplot(ds_chicago_temperature_races, aes(x = year_month, y = temp_mean)) + # line plot
 geom_line(linewidth = 1.5, color="red", group = 1) +
 scale_x_discrete(name = "year - month") +
 scale_y_continuous(name = "mean temperature") +
 theme(axis.text.x=element_text(angle=45,hjust=1))
Average temperature
ggplot(ds_chicago_temperature_races, aes(x = year_month)) + 
 geom_col(aes(y = count), linewidth = 1, color = "darkblue", fill = "white") +
 #scale_y_continuous("number of rides", labels = scales::comma) +
 geom_line(aes(y = 30000*temp_mean), linewidth = 1.5, color="red", group = 1) + 
 scale_x_discrete(name = "year - month") +
 scale_y_continuous(sec.axis = sec_axis(~./30000, name = "temp mean")) + # invert the 30000x scaling applied to the line
 theme(axis.text.x=element_text(angle=45,hjust=1),axis.text.y=element_blank(),axis.ticks.y=element_blank()) +
 labs(title = "Comparison of ride count and temperature", subtitle = "the red line indicates the average temperature in the corresponding month")
Comparison of stroke number and temperature
rm(ds_stats_n_races) # remove temporary dataset used to sync temperatures

The main result is:

  • temperature affects the monthly ride volume, particularly in the cold winter months;
  • when the temperature drops below freezing, occasional customers avoid using the bicycles. Under the same conditions, annual subscribers maintain a baseline frequency of use, probably tied to a fixed commute (home/work).
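As a quick numeric check of this relationship, one could compute the correlation between monthly mean temperature and ride counts. This is a sketch, assuming `ds_chicago_temperature_races` contains the `temp_mean` and `count` columns used in the charts above:

```r
# Sketch: linear correlation between monthly mean temperature and ride volume.
# Assumes ds_chicago_temperature_races has the temp_mean and count columns
# used in the charts above.
cor(ds_chicago_temperature_races$temp_mean,
    ds_chicago_temperature_races$count)  # Pearson's r

# A value close to +1 would support the visual impression that
# warmer months drive more rides.
```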
Days of the week

How are the data distributed over the days of the week?

ds_all_trips_clean_length_correct %>%
 group_by(weekday) %>% 
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casuals_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'dif_perc_members_casuals' = members_p - casuals_p)

Let’s visualize the data via a graph (enlarge it for better reading)

ggplot(ds_all_trips_clean_length_correct, aes(weekday, fill=member_casual)) +
 geom_bar(position='dodge2') +
 #geom_text(stat = "count", aes(label = ..count..)) +
 geom_label(stat='count',
 aes(label=after_stat(count)),
 position=position_dodge2(width=0.5),
 size=3,
 show.legend = FALSE # removes the letter within the icon
  ) +
 labs(x="days of the week", title="Chart 3.1 - Distribution by day of the week") +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_fill_discrete(labels = c("Casuals", "Members"))
 #theme(legend.position = "none")
Chart 3.1 - Distribution by day of the week.

Same data but displayed differently:

ggplot(ds_all_trips_clean_length_correct, aes(weekday, fill=member_casual)) +
 geom_bar() +
 #geom_text(stat = "count", aes(label = ..count..)) +
 geom_text(stat = "count", 
 aes(label= after_stat(count)), 
 position = position_stack(vjust = 0.5),
 size = 3
  ) +
 labs(x="days of the week", title="Chart 3.2 - Distribution by day of the week") +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_fill_discrete(labels = c("Casuals", "Members"))
Chart 3.2 - Distribution by day of the week.

It is interesting to see:

  • ride volume is distributed essentially evenly across the days of the week, except for Saturday;
  • Saturday records the highest number of rides;
  • annual subscribers account for the highest ride volume on every day except Saturday, when casual riders take the lead;
  • weekends (starting on Friday) see the highest volume of casual rides, with an increase of up to 20 per cent.
Time of day
ds_all_trips_clean_length_correct %>%
 group_by(start_hour) %>% 
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casuals_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'dif_perc_members_casuals' = members_p - casuals_p)
ds_all_trips_clean_length_correct %>%
 ggplot(aes(start_hour, fill=member_casual)) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 labs(x="Time of day", title="Chart 4.1 - Distribution by time of day") +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 geom_bar()
Chart 4.1 - Distribution by time of day.

The same results, displayed differently (enlarged graph):

ggplot(ds_all_trips_clean_length_correct, aes(start_hour, fill=member_casual)) +
 geom_bar(position='dodge2') +
 #geom_text(stat = "count", aes(label = ..count..)) +
 geom_label(stat='count',
 aes(label=after_stat(count)),
 position=position_dodge2(width=0.5),
 size=3,
 show.legend = FALSE # removes the letter within the icon
  ) +
 labs(x="time of day", title="Chart 4.2 - Distribution by time of day") +
 guides(fill = guide_legend(title = "User category")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 theme(legend.position = c(0.2, 0.8))
 # coord_flip()
 #theme(legend.position = "none")
Chart 4.2 - Distribution by time of day.

From this graph, we can see:

  • there is a larger influx of cyclists in the afternoon;
  • the number of rides by annual subscribers is significantly higher than that of casual riders in the 6-10 a.m. and 3-7 p.m. time slots;
  • casual riders outnumber subscribers between midnight and 4 a.m.

This graph can be expanded by dividing it by each day of the week.

ds_all_trips_clean_length_correct %>%
 ggplot(aes(start_hour, fill=member_casual)) +
 geom_bar() +
 labs(x="Time of day", y="number of rides", title="Chart 5 - Distribution by time of day and day of the week") +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 facet_wrap(~ weekday)
Chart 5 - Distribution by time of day and day of the week.

There is a clear difference between weekdays and weekends.

We generate graphs for these two weekly splits.

ds_all_trips_clean_length_correct %>%
 mutate(type_of_weekday = ifelse(weekday == '6 - Sat' | weekday == '7 - Sun', 'weekend', 'midweek')) %>%
 ggplot(aes(start_hour, fill=member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 labs(x="Time of day", title="Chart 6 - Distribution by time of day, midweek vs weekend") +
 geom_bar() +
 facet_wrap(~ type_of_weekday)
Chart 6 - Distribution by time of day, midweek vs weekend

The two plots differ in some key aspects:

  • while the weekend shows a steady flow of rides, midweek days show a steeper, more irregular pattern;
  • comparing raw counts between the two panels is not very meaningful, since each panel aggregates a different number of days;
  • on midweek days there is a significant increase in rides between 6 and 8 a.m. (especially among subscribers), followed by a decline over the next two hours (9-10 a.m.) and then a steady rise from 11 a.m. to 6 p.m.;
  • another significant increase occurs between 4 p.m. and 6 p.m., particularly among subscribers;
  • during the weekend, there is a greater flow of casual users between 11 a.m. and 6 p.m.


Summarizing the above:
It is crucial to distinguish the two types of cyclists (subscribers and casual riders), who use the bicycles at different times of the day. One plausible explanation is that subscribers use the bicycles as part of their daily routine, such as riding to work (between 6 and 8 a.m. on weekdays) and returning home (between 4 and 6 p.m. midweek).
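The commuting hypothesis can be quantified with a quick check: the share of each group's midweek rides that start in the classic commute windows. This is a sketch, assuming `start_hour` is numeric (0-23) and using the `weekday` and `member_casual` columns from the dataset above; the specific commute windows are an assumption chosen for illustration:

```r
# Sketch: share of rides starting in commute windows (6-8 a.m. and 4-6 p.m.)
# on midweek days, by user category. Assumes start_hour is numeric (0-23).
ds_all_trips_clean_length_correct %>%
  filter(!weekday %in% c('6 - Sat', '7 - Sun')) %>%
  group_by(member_casual) %>%
  summarise(commute_share = mean(start_hour %in% c(6:8, 16:18)) * 100)

# A markedly higher commute_share for "member" than for "casual" would
# support the home/work commuting interpretation.
```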



Type of bicycle
ds_all_trips_clean_length_correct %>%
 group_by(rideable_type) %>%
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casual_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'member_casual_perc_difer' = members_p - casual_p)
ggplot(ds_all_trips_clean_length_correct, aes(rideable_type, fill=member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 scale_y_continuous("number of rides", labels = scales::comma) +
 labs(x="Bicycle type", title="Chart 7 - Distribution by type of bicycle") +
 geom_bar()
Chart 7 - Distribution by bicycle type.

It is important to note that:

  • “docked” bicycles have a low ride volume and are used only by casual riders;
  • subscribers show a stronger preference for classic bicycles: 66 per cent of those rides versus 34 per cent by casual riders;
  • for e-bikes, too, subscribers account for the larger share (57 per cent) compared to casual riders (43 per cent).


Stations
Introduction

Knowing which stations are the most popular will be helpful for marketing efforts. To find out which ones they are, we use the departure and arrival station names and count the number of trips that start or end at them.

Departure Stations

Departure Stations —> sorted by the highest number of rides:

ds_all_trips_clean_length_correct %>% 
 group_by(start_station_name) %>%
 summarise(
 count = length(ride_id), 
 #ride_id = n(),
 ) %>%
 slice_max(count, n = 10)

Departure Stations —> those most used by subscribers:

ds_all_trips_clean_length_correct %>% 
 filter(member_casual=='member') %>%
 group_by(start_station_name) %>%
 summarise(count = length(ride_id)) %>%
 arrange(desc(count), desc(start_station_name)) %>% 
 slice_max(count, n = 10)

Departure Stations —> those most used by casual riders:

ds_all_trips_clean_length_correct %>% 
 filter(member_casual=='casual') %>%
 group_by(start_station_name) %>%
 summarise(count = length(ride_id)) %>%
 arrange(desc(count), desc(start_station_name)) %>% 
 slice_max(count, n = 10)

Departure Stations —> we now compare departure-station usage between the two user types (subscribers and casual riders) to check the usage percentages and the busiest times, and we save the result in a dedicated dataset so we can reuse it for all filtering and visualization operations:

colnames(ds_all_trips_clean_length_correct)
ds_all_trips_clean_length_correct_station_start <- ds_all_trips_clean_length_correct %>%
 group_by(start_station_name) %>% 
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casual_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'member_casual_perc_difer' = members_p - casual_p,
 'members' = (sum(member_casual == "member")),
 'casual' = (sum(member_casual == "casual")),
 'member_casual_difer' = members - casual,
 'hours_first_race' = min(start_hour, na.rm = TRUE),
 'hours_last_race' = max(start_hour, na.rm = TRUE),
 'data_ds_start_at' = min(started_at, na.rm = TRUE),
 'data_ds_end_at' = max(ended_at, na.rm = TRUE),
 'hours_mean' = mean(start_hour, na.rm = TRUE),
 'hours_median' = median(start_hour, na.rm = TRUE)) %>% 
 arrange(desc(count), desc(start_station_name))
 #slice(1:10) 
ds_all_trips_clean_length_correct_station_start
Arrival stations

Arrival stations —> sorted by the highest number of rides:

ds_all_trips_clean_length_correct %>% 
 group_by(end_station_name) %>%
 summarise(
 count = length(ride_id), 
 #ride_id = n(),
 ) %>%
 slice_max(count, n = 10)

Arrival stations —> those most used by subscribers:

ds_all_trips_clean_length_correct %>% 
 filter(member_casual=='member') %>%
 group_by(end_station_name) %>%
 summarise(count = length(ride_id)) %>%
 arrange(desc(count), desc(end_station_name)) %>% 
 slice_max(count, n = 10)

Arrival stations —> those most used by casual riders:

ds_all_trips_clean_length_correct %>% 
 filter(member_casual=='casual') %>%
 group_by(end_station_name) %>%
 summarise(count = length(ride_id)) %>%
 arrange(desc(count), desc(end_station_name)) %>% 
 slice_max(count, n = 10)

Arrival stations —> we now compare arrival-station usage between the two user types (subscribers and casual riders) to check the usage percentages and the busiest times, and we save the result in a dedicated dataset so we can reuse it for all filtering and visualization operations:

ds_all_trips_clean_length_correct_station_end <- ds_all_trips_clean_length_correct %>%
 group_by(end_station_name) %>%
 summarise(count = length(ride_id),
 '%' = (length(ride_id) / nrow(ds_all_trips_clean_length_correct)) * 100,
 'members_p' = (sum(member_casual == "member") / length(ride_id)) * 100,
 'casual_p' = (sum(member_casual == "casual") / length(ride_id)) * 100,
 'member_casual_perc_difer' = members_p - casual_p,
 'members' = (sum(member_casual == "member")),
 'casual' = (sum(member_casual == "casual")),
 'member_casual_difer' = members - casual,
 'hours_first_race' = min(start_hour, na.rm = TRUE),
 'hours_last_race' = max(start_hour, na.rm = TRUE),
 'data_ds_start_at' = min(started_at, na.rm = TRUE),
 'data_ds_end_at' = max(ended_at, na.rm = TRUE),
 'hours_mean' = mean(start_hour, na.rm = TRUE),
 'hours_median' = median(start_hour, na.rm = TRUE)) %>% 
 arrange(desc(count), desc(end_station_name))
 #slice(1:10) 
ds_all_trips_clean_length_correct_station_end

Departure/arrival station considerations:

The most frequently used stations differ by user type. This is probably related to the purpose of the ride: casual use is more leisure-oriented, so those stations tend to be near leisure destinations, while subscribers use the bicycles mainly to commute to work.

The heaviest traffic, both departing and arriving, is concentrated at the “Streeter Dr & Grand Ave” station, where, however, 77 per cent of departing and 80 per cent of arriving users belong to the casual category, with an average weekly peak hour around 2 p.m.

Traffic by annual subscribers, on the other hand, is concentrated at the “Kingsbury St & Kinzie St” station, where subscribers account for about 75 per cent of departures and 77 per cent of arrivals compared to the casual riders who frequent it. At this station, the average weekly peak time is around 1 p.m.

The stations most used by subscribers are:
“Kingsbury St & Kinzie St”, “Clark St & Elm St”, and “Wells St & Concord Ln”.

The stations most used by casual users are:
“Streeter Dr & Grand Ave”, “DuSable Lake Shore Dr & Monroe St”, and “Millennium Park”.
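The per-station figures quoted above can be read directly from the dedicated dataset built earlier. A sketch, assuming the column names defined in that summarise() call:

```r
# Sketch: look up the usage split and average start hour for a given
# departure station, using the columns created in the
# ds_all_trips_clean_length_correct_station_start dataset above.
ds_all_trips_clean_length_correct_station_start %>%
  filter(start_station_name == "Streeter Dr & Grand Ave") %>%
  select(count, members_p, casual_p, hours_mean, hours_median)
```

The same lookup against `ds_all_trips_clean_length_correct_station_end` (filtering on `end_station_name`) gives the arrival-side figures.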



Analysis without outlier values (step 1 of 2)

Now let’s take a look at some variables in the data set.

We first display some summary statistics from the data set.

summary(ds_all_trips_clean_length_correct$ride_length_mins)

The minimum and maximum values can be a problem when plotting some graphs. Why do some rides have such a high maximum travel time? Perhaps some stations malfunction and record wrong timestamps.

We first divide the population into twenty equal parts (ventiles) to check in which range the anomaly lies:

ventiles <- quantile(ds_all_trips_clean_length_correct$ride_length_mins, seq(0, 1, by=0.05))
ventiles

We can see that:

  • the difference between the 0% and 100% ventiles is 34354.07 minutes (over 572 hours, almost 24 days);
  • the difference between the 0% and 95% ventiles is 47.5 minutes;
  • values below the 5% ventile are too short to be considered real rides (less than 3 minutes).
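These spreads can be computed directly from the `ventiles` vector created above, using the percentage labels that `quantile()` attaches to its result:

```r
# Sketch: spreads between selected ventiles, matching the bullet points above.
as.numeric(ventiles['100%'] - ventiles['0%'])  # full range, in minutes
as.numeric(ventiles['95%'] - ventiles['0%'])   # range once the top 5% is excluded
as.numeric(ventiles['5%'])                     # rides below this are suspiciously short
```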

Let’s look at the rides with the highest travel times, to see whether any pattern offers useful insight into the reason for this anomalous usage:

ds_all_trips_clean_length_correct %>% arrange(desc(ride_length_mins))

And here are the rides sorted from the shortest duration:

ds_all_trips_clean_length_correct %>% arrange(ride_length_mins)

Checking the various fields (e.g., departure and arrival stations) reveals no apparent problem; however, we note that all the outliers with very long ride times are confined to “docked” bikes.
More information about this data will be requested from the relevant stakeholders, to find out whether the company considers it normal.

We also check by filtering for the range above the 95% ventile:

ds_all_trips_clean_length_correct_outliners_max <- ds_all_trips_clean_length_correct %>%
 filter(ride_length_mins > as.numeric(ventiles['95%'])) %>%
 arrange(desc(ride_length_mins))
ds_all_trips_clean_length_correct_outliners_max

Once the appropriate checks have been made, we delete the temporary dataset just created (…_outliners_max):

rm(ds_all_trips_clean_length_correct_outliners_max)
Analysis without outlier values (step 2 of 2)

Attention: the analysis continues with a dataset without outliers (“…_outliners_without“), a subset that excludes the minimum/maximum extremes identified earlier and keeps the optimal range between the 5th and 95th percentiles.
This choice is also dictated by the need to perform checks whose graphs would be difficult to interpret in the presence of these outliers.

ds_all_trips_clean_length_correct_outliners_without <- ds_all_trips_clean_length_correct %>% 
 filter(ride_length_mins > as.numeric(ventiles['5%'])) %>%
 filter(ride_length_mins < as.numeric(ventiles['95%']))
print(paste("Removed", nrow(ds_all_trips_clean_length_correct) - nrow(ds_all_trips_clean_length_correct_outliners_without), "rows as outliers"))

As a first look at the relationship between user category and ride time, we draw a box plot grouped by the member_casual column, preceded by the corresponding summary statistics.

ds_all_trips_clean_length_correct_outliners_without %>% 
 group_by(member_casual) %>% 
 summarise(mean = mean(ride_length_mins),
 'first_quarter' = as.numeric(quantile(ride_length_mins, .25)),
 'median' = median(ride_length_mins),
 'third_quarter' = as.numeric(quantile(ride_length_mins, .75)),
 'IQR' = third_quarter - first_quarter)
ggplot(ds_all_trips_clean_length_correct_outliners_without, aes(x=member_casual, y=ride_length_mins, fill=member_casual)) +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 labs(x="Members vs Casuals", y="Travelling time", title="Chart 8 - Distribution of travel time for Casuals/Members") +
 geom_boxplot()
Chart 8 - Distribution of travel time for Casuals/Members.

It is important to note that:

  • casual riders take longer rides than subscribers;
  • the mean and the IQR are also larger for casual customers.

Let’s see whether processing by day of the week reveals other helpful information.

ggplot(ds_all_trips_clean_length_correct_outliners_without, aes(x=weekday, y=ride_length_mins, fill=member_casual)) +
 geom_boxplot() +
 facet_wrap(~ member_casual) +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 labs(x="Day", y="Travelling time", title="Chart 9 var. a - Distribution of rental time by day of the week") +
 theme(axis.text.x=element_text(angle=45,hjust=1))
Chart 9 var. a - Distribution of rental time by day of the week

Some considerations:

  • pedalling time for subscribers remains essentially unchanged midweek, while it increases at the weekend;
  • casual riders follow a more bell-shaped distribution, peaking on Saturday/Sunday with the lowest values on Tuesday/Wednesday/Thursday.

Finally, we break the data down by bicycle type.

ggplot(ds_all_trips_clean_length_correct_outliners_without, aes(x=rideable_type, y=ride_length_mins, fill=member_casual)) +
 geom_boxplot() +
 facet_wrap(~ member_casual) +
 guides(fill = guide_legend(title = "User category")) +
 scale_fill_discrete(labels = c("Casuals", "Members")) +
 labs(x="Bicycle type", y="Travelling time", title="Chart 10 - Distribution of rental time by type of bicycle")
 #coord_flip()
Chart 10 - Distribution of rental time by bicycle type.

Some considerations:

  • electric bikes have shorter ride times than the other bikes, for both subscribers and casual riders;
  • “docked” bicycles have the longest travel times (values that would be off the scale of this chart if we used the dataset including the > 95% outliers removed above) and are used only by casual customers.

Step 5-5: Export summary files for further analysis

Create CSV files for use in Excel, Tableau, SQL, or other analysis/presentation software.

We save the final data frames with and without outliers (in addition to those used for weather conditions and for the departure and arrival stations) to CSV files with the following code chunk:

write_csv(ds_all_trips_clean_length_correct,"divvy_tripdata.csv")
write_csv(ds_all_trips_clean_length_correct_outliners_without,"divvy_tripdata_without_outliners.csv")
write_csv(ds_chicago_temperature_races,"divvy_tripdata_chicago_temperature_races.csv")
write_csv(ds_all_trips_clean_length_correct_station_start,"divvy_tripdata_station_start.csv")
write_csv(ds_all_trips_clean_length_correct_station_end,"divvy_tripdata_station_end.csv")


[5-6] Share

Questions
  • Were you able to answer how subscribers and occasional cyclists use Cyclistic bicycles differently?
    Yes, I was able to identify the behavioural patterns of the two customer types.
  • What story do your data tell?
    Casual users and subscribers use the Cyclistic service differently, but there are opportunities to convert many current customers to subscribers.
  • How do the results obtained relate to the original question?
    My results relate to the initial question by building a profile of the two types of customers, finding the critical differences between occasional users and subscribers, and why each user group uses bicycles.
  • Who are the stakeholders for this case study? What is the best way to communicate with them?
    The main stakeholders are the Cyclistic executive team.
  • Can data visualizations help you share your results?
    Visualizations are crucial to presenting complex data clearly, intuitively, and immediately.
  • Is the presentation accessible to the public?
    The presentation of results is accessible to all.
Tasks to be performed
  • Determine the best way to share the results;
  • Create effective data visualizations;
  • Present the results;
  • Make sure the work is accessible.
Goals
  • Visualizations support the analysis and indicate the results obtained.


[6-6] Act

Questions
  • What is the conclusion of your analysis?
    Subscribers and occasional cyclists have different usage profiles. For more details, refer to my considerations detailed throughout the analysis’s various steps.
  • How might your team and your company apply your insights?
    The team and the company could apply the guidelines arising from this analysis by developing a marketing campaign to turn casual cyclists into subscribers.
  • What next steps would you or your stakeholders take based on your findings?
    Delve further into the various analysis metrics to find additional helpful information that the marketing team could use to improve the company’s digital campaign.
  • Is there any other data you could use to elaborate on your results?
    It might be helpful to constantly update the data in this case study with more recent data to show the trend of any actions taken, checking the evolution of the behaviour of subscribers and occasional cyclists.
Final Recommendations
  • The three main recommendations based on your analysis.
    I would recommend that the marketing team create a digital campaign, targeted at the locations most frequented by occasional cyclists, that highlights the potential and advantages of bike rental over car use, showing the positive impact on the environment, physical exercise, and the reduction of traffic congestion.
    In addition:
    • highlight the availability of bicycles in each time slot of the day;
    • create subscriptions linked to the number of rides to be “consumed” within 12 months of their purchase;
    • offer a discount on colder days to encourage rides by casual customers.
Tasks to be performed

Draft a summary report of the case study.

The case study summary report will aim to present the analysis results in a clear, concise, and accessible way to a non-technical audience, such as executives and members of Cyclistic’s marketing team. The report will follow a logical structure that will include the following key elements:

  • Introduction: This section will introduce the case study scenario, the objective of the analysis, and the methodology adopted. It will also briefly present the background of the Cyclistic company and its position in the bike-sharing market.
  • Methodology: In this section, we will describe the method based on the six steps “Ask, Prepare, Process, Analyze, Share and Act”. We will illustrate how these steps were applied to the case study and how they contributed to the decision-making process.
  • Data analysis and results: Here, we will present the results of the data analysis, highlighting the differences between occasional cyclists and annual subscribers in terms of behaviours and preferences. We will include clear and understandable data visualizations to support our conclusions and recommendations.
  • Recommendations: Based on the analysis results, we will propose the three main proposals for Cyclistic’s marketing team, explaining how these strategies can help convert casual cyclists into annual subscribers.
  • Implementation and monitoring: In this section, we will suggest how the marketing team can implement the proposed recommendations and discuss the importance of monitoring success metrics to evaluate the strategies’ effectiveness.
  • Conclusion: We will conclude the report by summarizing the main findings of the analysis, the proposed recommendations, and the added value that the application of the methodology used and the use of the R and RStudio tools brought to the case study.

Finally, the report will be written in clear and accessible language, avoiding excessive technicalities and focusing on essential information to enable business decision-makers to quickly understand the analysis results and make informed decisions based on the proposed recommendations.

Conclusion

In this case study, we examined how a bike-sharing service can design a marketing strategy to convert occasional cyclists into annual subscribers. The three main recommendations from the analysis include: launching a targeted digital campaign, creating flexible subscriptions tied to the number of rides, and offering discounts on colder days to encourage bicycle use.

To elaborate on the case study, I applied the methodology learned in the graduate course “Google Data Analytics Professional Certificate“. This methodology follows six key steps: 1. Ask, 2. Prepare, 3. Process, 4. Analyze, 5. Share and 6. Act. This structured approach made it possible to formulate relevant questions, prepare and analyze data, create professional visualizations, and propose solid data recommendations.

The case study was developed using the R programming language and the integrated development environment (IDE) offered by RStudio Desktop. These tools provided the basis for efficient and thorough data analysis.

In conclusion, by applying the methodology learned in the “Google Data Analytics Professional Certificate” specialization course and using the R and RStudio tools, we were able to come up with concrete, data-driven recommendations to help a company providing bike-sharing services maximize the number of annual subscriptions and develop an effective marketing strategy.

If you are interested in learning more about this topic or have specific questions, please get in touch with me using the references on my contact page. I will happily answer your questions and provide more information about my work as a Data Analyst. Thank you for visiting my site and for your interest in my work.
