Project Details

Customer Persona Generator

Automating Market Insights with Clustering and NLP

Introduction

The Challenge

Creating customer personas by hand is time-consuming for marketing, sales, and product teams. This project automates the process: unsupervised machine learning groups customers into segments, and a rich, descriptive persona is then generated for each segment.

The Solution

The solution is a tool that applies clustering and text summarization to generate data-driven customer personas automatically. It reduces manual effort by up to 80% and gives teams quick, actionable insights into their customer base.

Project Objectives and Workflow

Data & Preprocessing

Data Source

The project utilized the Customer Segmentation dataset from Kaggle. It contains rich data on grocery store customers, including:

  • Demographics: Age, marital status, education, income.
  • Household: Number of children and teens.
  • Spending: Amount spent across 7 product categories.

Feature Engineering

To prepare the data for modeling, several new, more insightful features were created:

  • edu_level: Simplified education into numeric categories.
  • num_family_member: Calculated total household size.
  • total_spent: Aggregated spending across all categories.
  • Product Spend Ratios: Calculated the percentage of total spend for each product category.
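
The features above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names (`education`, `marital_status`, `num_children`, `num_teens`, `spend_*`), the education mapping, and the rule for counting adults are all assumptions, and only three of the spending categories are shown for brevity.

```python
import pandas as pd

# Hypothetical column names; the real dataset's headers may differ.
SPEND_COLS = ["spend_wine", "spend_fruit", "spend_meat"]
# Assumed ordinal mapping for education levels.
EDU_MAP = {"Basic": 0, "Graduation": 1, "Master": 2, "PhD": 3}


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # edu_level: education simplified into numeric categories
    out["edu_level"] = out["education"].map(EDU_MAP)
    # Assumption: two adults if partnered, otherwise one
    adults = out["marital_status"].isin(["Married", "Together"]).astype(int) + 1
    out["num_family_member"] = adults + out["num_children"] + out["num_teens"]
    # total_spent: spending summed across all product categories
    out["total_spent"] = out[SPEND_COLS].sum(axis=1)
    # Spend ratios: share of total spend per category (0 when nothing was spent)
    safe_total = out["total_spent"].where(out["total_spent"] > 0)
    for col in SPEND_COLS:
        out[f"{col}_ratio"] = (out[col] / safe_total).fillna(0.0)
    return out


sample = pd.DataFrame({
    "education": ["PhD", "Basic"],
    "marital_status": ["Married", "Single"],
    "num_children": [1, 0],
    "num_teens": [1, 0],
    "spend_wine": [60, 0],
    "spend_fruit": [20, 0],
    "spend_meat": [20, 0],
})
features = engineer_features(sample)
print(features[["edu_level", "num_family_member", "total_spent", "spend_wine_ratio"]])
```

Guarding the division keeps customers with zero total spend from producing NaN ratios, which would otherwise break the scaling and clustering steps downstream.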

Modeling – Customer Segmentation

The core of the segmentation was achieved using the KMeans clustering algorithm.

  1. Scaling: All numerical features were standardized using `StandardScaler` so that features with large ranges (such as income) do not dominate the distance calculations KMeans relies on.
  2. Dimensionality Reduction: Principal Component Analysis (PCA) was applied to reduce the number of features while retaining most of the data's variance, which cuts noise and redundancy among correlated features before clustering.
  3. Finding Optimal Clusters: The "Elbow Method" (based on inertia) and the Silhouette Score were evaluated over a range of cluster counts, and k = 4 was determined to be the optimal number of clusters.
  4. Segmentation: Customers were grouped into four distinct segments: Cluster A, B, C, and D.
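
The four steps above could look roughly like this with scikit-learn. It is a sketch under stated assumptions: the data is synthetic (not the Kaggle dataset), and the 90% variance threshold, the range of k values tried, and the `n_init` and `random_state` settings are illustrative choices rather than the project's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix (not the project's data)
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 6))
X = np.vstack([c + rng.normal(size=(100, 6)) for c in centers])

# 1. Scaling: zero mean and unit variance per feature
X_scaled = StandardScaler().fit_transform(X)

# 2. PCA: keep enough components to explain ~90% of variance (assumed threshold)
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# 3. Record inertia (for the elbow plot) and silhouette score for each candidate k
scores = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    scores[k] = {
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X_reduced, model.labels_),
    }
    print(k, round(scores[k]["inertia"], 1), round(scores[k]["silhouette"], 3))

# 4. Final segmentation with the k reported in the write-up
best_k = 4
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_reduced)
segments = np.array(list("ABCD"))[labels]  # letter assignment is arbitrary
```

The elbow is read off the inertia values by eye (where the curve flattens), while the silhouette score gives a second, numeric check; a higher silhouette indicates tighter, better-separated clusters.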

Fig 3: Visualization of the four customer clusters after PCA.

The Persona Generator

The final output is an interactive tool built with Streamlit. It takes the insights from the four customer clusters and dynamically generates detailed, fictional personas that represent the real-world characteristics of each segment.

Key Components

  • Name & Gender: Randomly assigned from real-world census data.
  • Avatar: Uniquely generated using the `python-avatars` library.
  • Profile Details: Age, income, household size, and spending levels are generated from the statistical profile of the persona's cluster, using values between the 25th and 75th percentiles.
  • Product Preferences: Highlights the product categories each persona is most likely to spend on compared to the average customer.
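
As a rough illustration of the Profile Details and Product Preferences components, the sketch below draws each attribute from within a cluster's 25th to 75th percentile range and flags categories where the cluster's spend share exceeds the overall average. The toy data, column names, uniform sampling, and the "ratio above 1" preference rule are all assumptions made for this example; the project's own logic may differ.

```python
import numpy as np
import pandas as pd

# Toy clustered data; column names and values are illustrative only
customers = pd.DataFrame({
    "cluster": ["A"] * 4 + ["B"] * 4,
    "age": [30, 34, 38, 42, 55, 60, 65, 70],
    "income": [40_000, 45_000, 50_000, 55_000, 80_000, 85_000, 90_000, 95_000],
    "wine_ratio": [0.10, 0.15, 0.20, 0.15, 0.50, 0.55, 0.60, 0.55],
    "meat_ratio": [0.60, 0.55, 0.50, 0.55, 0.20, 0.15, 0.10, 0.15],
})
PROFILE_COLS = ["age", "income"]
RATIO_COLS = ["wine_ratio", "meat_ratio"]


def generate_persona(df, cluster, rng):
    members = df[df["cluster"] == cluster]
    persona = {"cluster": cluster}
    for col in PROFILE_COLS:
        # Draw uniformly within the cluster's interquartile range
        low, high = members[col].quantile([0.25, 0.75])
        persona[col] = int(round(rng.uniform(low, high)))
    # Categories where the cluster's mean share exceeds the overall mean share
    lift = members[RATIO_COLS].mean() / df[RATIO_COLS].mean()
    persona["prefers"] = lift[lift > 1].sort_values(ascending=False).index.tolist()
    return persona


rng = np.random.default_rng(42)
persona_a = generate_persona(customers, "A", rng)
persona_b = generate_persona(customers, "B", rng)
print(persona_a)
print(persona_b)
```

Sampling inside the interquartile range keeps each generated persona typical of its segment while still producing a different profile on every draw.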

Fig 4: Dummy interface of the deployed Streamlit application.

Tools & Technologies

Python, Pandas, Scikit-learn, Streamlit, KMeans Clustering, PCA, python-avatars, Excalidraw