Last week, I was tasked with doing a cohort analysis for a friend’s company on BigQuery. I used dbt to transform the data, run tests, and create documentation, then built a dashboard in Looker with LookML. I found the process so insightful that I decided to write a blog post about it.
For this post, I won’t be using BigQuery, since I’m working with a small dataset of about 400k records. Instead, I’ll be using DBeaver (connected to PostgreSQL).
Picture this: it’s a Thursday night, and your favorite TV show is about to begin. The drama, the suspense, the intricate storylines that weave together a tapestry of human experience. Each character, each twist and turn, they all play a part in the grand narrative that keeps you hooked, episode after episode. Now, imagine if we could unravel these complex narratives, understand the dynamics at play, and predict how the story unfolds. Welcome to the world of cohort analysis.
Just like in any TV show, where each character has a backstory, a journey, and an evolution, in cohort analysis, we group individuals who share a common characteristic within a defined time period. We follow these groups, or cohorts, through the twists and turns of their journey, analyzing their behavior over time.
Think of it as following the dwarves in The Hobbit. They all set out at the same time, fresh-faced and eager, and we track their progress through their trials and tribulations: who excels, who falters, who forms lasting bonds, and who makes dramatic exits. In the realm of data, these cohorts might be new users who signed up in the same month, customers who made their first purchase in a specific quarter, or patients who started a treatment in the same year.
By analyzing these cohorts, we uncover patterns and insights that are hidden in plain sight. We understand retention rates, identify trends, and make informed predictions. It’s like knowing which characters are likely to make it through to the season finale and which plotlines might need a bit of a rewrite.
Cohort analysis helps us see beyond the surface-level metrics. Instead of a single snapshot in time, we get a moving picture, a storyline that evolves. We see how different groups behave differently, how their needs change, and how we can better support their journeys.
So, let’s dive into the world of cohort analysis. Let’s unravel the stories hidden in our data, understand the journeys of our users, and predict the next big plot twist. Because in this world of data, every detail matters, every journey is unique, and every insight brings us closer to a blockbuster success.
I recently did a cohort analysis on this data: it contains all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based online retail store.
Data dictionary

| Variable Name | Role | Type | Description | Units | Missing Values |
|---|---|---|---|---|---|
| InvoiceNo | ID | Categorical | a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation | | no |
| StockCode | ID | Categorical | a 5-digit integral number uniquely assigned to each distinct product | | no |
| Description | Feature | Categorical | product name | | no |
| Quantity | Feature | Integer | the quantities of each product (item) per transaction | | no |
| InvoiceDate | Feature | Date | the day and time when each transaction was generated | | no |
| UnitPrice | Feature | Continuous | product price per unit | sterling | no |
| CustomerID | Feature | Categorical | a 5-digit integral number uniquely assigned to each customer | | no |
| Country | Feature | Categorical | the name of the country where each customer resides | | no |
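Before any of the analysis, the raw CSV needs to sit in a database table. The queries in this post use SQL Server-style temp tables (#...) and PIVOT, so here is a minimal sketch of a staging table in the same dialect. The table name matches the FROM clause used below; the column types are my own assumption rather than anything prescribed by the dataset.

---Minimal sketch of the staging table the queries below assume (types are assumptions)
create table online_retail (
    InvoiceNo    varchar(10),    ---6-digit code; starts with 'c' for cancellations
    StockCode    varchar(20),    ---product code
    Description  varchar(255),   ---product name
    Quantity     int,            ---items per invoice line
    InvoiceDate  datetime,       ---transaction timestamp
    UnitPrice    decimal(10, 2), ---price per unit in sterling
    CustomerID   int,            ---customer identifier (0 where missing after import)
    Country      varchar(50)     ---customer country
)

With the data loaded, the cleaning steps below remove rows with no CustomerID, rows with non-positive quantities or prices, and duplicate invoice lines.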
with base_data as
(
    ---Exclude rows without a valid CustomerID (stored as 0)
    select
        InvoiceNo,
        StockCode,
        Description,
        Quantity,
        InvoiceDate,
        UnitPrice,
        CustomerID,
        Country
    from online_retail
    where CustomerID != 0
)
, quantity_unit_price as
(
    -- Cleaning the data
    ---397882 records with Quantity and UnitPrice greater than zero
    select *
    from base_data
    where Quantity > 0 and UnitPrice > 0
)
, dup_check as
(
    ---deduplication of rows using key columns
    select *,
        ROW_NUMBER() over (partition by InvoiceNo, StockCode, Quantity order by InvoiceDate) as dup_flag
    from quantity_unit_price
)
---I got 397667 clean records and 5215 duplicates were removed
---Storing the cleaned data into a temp table
select *
into #online_retail_main
from dup_check
where dup_flag = 1
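As a quick sanity check (my own addition, not part of the original workflow), you can compare the cleaned temp table against the pre-deduplication count quoted above:

---Illustrative sanity checks on the cleaned data
select count(*) as clean_rows from #online_retail_main
---rows that pass the CustomerID/Quantity/UnitPrice filters, before deduplication
select count(*) as pre_dedup_rows
from online_retail
where CustomerID != 0 and Quantity > 0 and UnitPrice > 0
---the difference between pre_dedup_rows and clean_rows is the number of duplicate rows dropped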
----BEGIN COHORT ANALYSIS
select * from #online_retail_main
--Unique Identifier (CustomerID)
--Initial Start Date (First Invoice Date)
--Revenue Data
select
    CustomerID,
    min(InvoiceDate) as first_purchase_date,
    DATEFROMPARTS(year(min(InvoiceDate)), month(min(InvoiceDate)), 1) as Cohort_Date
into #cohort
from #online_retail_main
group by CustomerID
select *
from #cohort
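To make the Cohort_Date logic concrete: DATEFROMPARTS snaps each customer’s first purchase down to the first day of that calendar month, so every customer whose first purchase falls in the same month lands in the same cohort. A tiny illustrative example (the date is made up):

---Illustrative only: a customer whose earliest InvoiceDate is 2011-03-18 belongs to the March 2011 cohort
select DATEFROMPARTS(2011, 3, 1) as example_cohort_date   ---returns 2011-03-01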
---Create Cohort Index
select
mmm.*,
cohort_index = year_diff * 12 + month_diff + 1
into #cohort_retention
from
(
select
mm.*,
year_diff = invoice_year - cohort_year,
month_diff = invoice_month - cohort_month
from
(
select
m.*,
c.Cohort_Date,
year(m.InvoiceDate) invoice_year,
month(m.InvoiceDate) invoice_month,
year(c.Cohort_Date) cohort_year,
month(c.Cohort_Date) cohort_month
from #online_retail_main m
left join #cohort c
on m.CustomerID = c.CustomerID
)mm
)mmm
--where CustomerID = 14733
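The cohort_index is the customer’s lifecycle month, counted from 1. A quick worked example: a customer in the December 2010 cohort (cohort_year = 2010, cohort_month = 12) who purchases again in March 2011 (invoice_year = 2011, invoice_month = 3) gets year_diff = 1 and month_diff = 3 - 12 = -9, so cohort_index = 1 * 12 + (-9) + 1 = 4, meaning that purchase falls in their fourth month as a customer (the first month is always index 1). You can check the arithmetic directly:

---Worked example of the index formula (values are illustrative)
select (2011 - 2010) * 12 + (3 - 12) + 1 as example_cohort_index   ---returns 4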
---Pivot Data to see the cohort table
select *
into #cohort_pivot
from (
    select distinct
        CustomerID,
        Cohort_Date,
        cohort_index
    from #cohort_retention
) tbl
pivot (
    count(CustomerID)
    for cohort_index in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13])
) as pivot_table
select *
from #cohort_pivot
order by Cohort_Date
---Retention rate: each cohort-index column as a percentage of the cohort size ([1])
select Cohort_Date,
(1.0 * [1]/[1] * 100) as [1],
1.0 * [2]/[1] * 100 as [2],
1.0 * [3]/[1] * 100 as [3],
1.0 * [4]/[1] * 100 as [4],
1.0 * [5]/[1] * 100 as [5],
1.0 * [6]/[1] * 100 as [6],
1.0 * [7]/[1] * 100 as [7],
1.0 * [8]/[1] * 100 as [8],
1.0 * [9]/[1] * 100 as [9],
1.0 * [10]/[1] * 100 as [10],
1.0 * [11]/[1] * 100 as [11],
1.0 * [12]/[1] * 100 as [12],
1.0 * [13]/[1] * 100 as [13]
from #cohort_pivot
order by Cohort_Date
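A final caveat: PIVOT and # temp tables are SQL Server features, so if you are following along in DBeaver against PostgreSQL as mentioned at the start, the pivot step needs a different shape. Below is a rough conditional-aggregation sketch of the same cohort table; it assumes the retention data lives in an ordinary table called cohort_retention with customerid, cohort_date and cohort_index columns (the names are my assumption, mirroring the temp tables above).

---PostgreSQL-flavored sketch of the cohort pivot using FILTER instead of PIVOT
select
    cohort_date,
    count(distinct customerid) filter (where cohort_index = 1)  as month_1,
    count(distinct customerid) filter (where cohort_index = 2)  as month_2,
    count(distinct customerid) filter (where cohort_index = 3)  as month_3,
    ---...repeat the same pattern up to month_13...
    count(distinct customerid) filter (where cohort_index = 13) as month_13
from cohort_retention
group by cohort_date
order by cohort_date

The retention percentages then follow exactly as in the final query above: divide each month_n by month_1 and multiply by 100.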