Cohort Analysis with SQL

Last week, I was tasked to do a cohort analysis for a friend’s company on BigQuery. I used dbt to transform the data, carry out tests and create a documentation. Then created a dashboard on Looker with LookML. I found it very insightful and I thought to make a blog post about it.

For this project, I would not be using Bigquery as I’m working with a small data set with 400k records. I will be using dbeaver (connected to Postgresql).

Picture this: it’s a Thursday night, and your favorite TV show is about to begin. The drama, the suspense, the intricate storylines that weave together a tapestry of human experience. Each character, each twist and turn, they all play a part in the grand narrative that keeps you hooked, episode after episode. Now, imagine if we could unravel these complex narratives, understand the dynamics at play, and predict how the story unfolds. Welcome to the world of cohort analysis.

Just like in any TV show, where each character has a backstory, a journey, and an evolution, in cohort analysis, we group individuals who share a common characteristic within a defined time period. We follow these groups, or cohorts, through the twists and turns of their journey, analyzing their behavior over time.

Think of it as following the dwarves in Lord of the Rings. They all start at the same time, fresh-faced and eager, and we track their progress through their trials and tribulations. Who excels, who falters, who forms lasting bonds, and who makes dramatic exits. In the realm of data, these cohorts might be new users who signed up in the same month, customers who made their first purchase in a specific quarter, or patients who started a treatment in the same year.

By analyzing these cohorts, we uncover patterns and insights that are hidden in plain sight. We understand retention rates, identify trends, and make informed predictions. It’s like knowing which characters are likely to make it through to the season finale and which plotlines might need a bit of a rewrite.

Cohort analysis helps us see beyond the surface-level metrics. Instead of a single snapshot in time, we get a moving picture, a storyline that evolves. We see how different groups behave differently, how their needs change, and how we can better support their journeys.

So, let’s dive into the world of cohort analysis. Let’s unravel the stories hidden in our data, understand the journeys of our users, and predict the next big plot twist. Because in this world of data, every detail matters, every journey is unique, and every insight brings us closer to a blockbuster success.

I recently did a cohort analysis for this data – it contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based online retail store.

Data dictionary

Variable NameRoleTypeDescriptionUnitsMissing Values
InvoiceNoIDCategoricala 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellationno
StockCodeIDCategoricala 5-digit integral number uniquely assigned to each distinct productno
DescriptionFeatureCategoricalproduct nameno
QuantityFeatureIntegerthe quantities of each product (item) per transactionno
InvoiceDateFeatureDatethe day and time when each transaction was generatedno
UnitPriceFeatureContinuousproduct price per unitsterlingno
CustomerIDFeatureCategoricala 5-digit integral number uniquely assigned to each customerno
CountryFeatureCategoricalthe name of the country where each customer residesno
TLDR: To see the full SQL query used for this project

Leave a comment