Choose waffles not pies when visualizing parts of a whole

Which region in New Zealand has the highest share of Baby Boomers?

Apr 28, 2024

Autonomous Econ is a newsletter that empowers economists and policy analysts by equipping them with Python and data science skills. The content is designed to boost their productivity through automation and transform them into savvier analysts.

Upcoming posts will roughly be split between practical guides (80 percent) and data journalism pieces where I demonstrate the tools (20 percent).

This is the first in a series of posts (Non-Normie Plots) on how to build less common but effective plots using Python, so that your visualizations can really stand out. The types of plots I aim to cover are used extensively by publications like The Economist and The New York Times, but can be tricky to build with traditional tools like Excel. I will include a notebook template with each post so that you can easily apply it to your own data.

In this first edition, I will illustrate how to create waffle plots with a Python package called pywaffle. I will dive into some demographic data for New Zealand to visualize the generational makeup of the country. By the end, I will produce waffle plots for the districts with the highest concentrations of Baby Boomers (those born between 1946 and 1964), Millennials (1981-1996), and Gen Zers (1997-2012).

A waffle chart representing the population distribution across different generations in New Zealand.

Why use waffle plots?

Waffle plots typically use a regular grid of square cells (hence the name), where each cell represents a fixed value, to show how the parts make up a whole. They are easier to read than pie charts, where categories with similar or very small proportions are notoriously difficult to tell apart. Waffle plots also let you use icons instead of squares, which can make a plot more visually engaging and easier to read.

An infographic on "Affordable Housing" displaying two sections: "Total Owners Percentage" with icons of houses depicting homeownership rates by country and price variation markers, and "Housing Tenure Distribution" with a bar chart comparing housing types across countries. Example waffle plot by Filippo Mastroianni.

The basic template and customization options

For this demonstration, I will use demographic data from Statistics New Zealand. The dataset is available from this GitHub repository along with the Python notebook where you can follow along with this tutorial.

For the basic plot of the generation data for the whole country, we can prepare our dataset with the generation categories in the index and the regions on the column axis of the DataFrame. Note that the populations of other generations have been excluded from this analysis.

A data frame titled 'df_waffle_nz' with two columns: "Generation" and "Total, New Zealand." It lists population figures for different generations in New Zealand: Baby Boomer (969010), Gen X (966300), Gen Z (1010650), Millennial (1102490), and Silent (206330).
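As a minimal sketch, the table above can be rebuilt by hand. The figures are copied from the table; the names population_nz and to_frame are my own for illustration, and pandas is only needed if you call the helper.

```python
# Population by generation, copied from the df_waffle_nz table above.
population_nz = {
    "Baby Boomer": 969010,
    "Gen X": 966300,
    "Gen Z": 1010650,
    "Millennial": 1102490,
    "Silent": 206330,
}


def to_frame(data, col_name="Total, New Zealand"):
    """Return a DataFrame with generations in the index (hypothetical helper)."""
    import pandas as pd  # imported lazily so the dict above works without pandas

    df = pd.DataFrame({col_name: data})
    df.index.name = "Generation"
    return df
```

Calling to_frame(population_nz) should give a one-column DataFrame in the same layout as the table, ready to be passed to the plotting code.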

We can use the Waffle class from pywaffle when initializing our figure object. Here’s how to set up the key parameters:

values: Pass the 'Total, New Zealand' column (stored as col_name) to plot the values for this category.
labels: Use the DataFrame index for labeling.
rounding_rule: Set 'ceil' so that every category rounds up to at least 1 block, even if it has fewer than 100k people. The other options are 'nearest' and 'floor'.
rows: Set the number of rows, which determines the height of the waffle plot.
vertical: Set this to True so that the blocks are stacked vertically.

Note that when setting the number of rows, the number of columns will automatically adjust based on our scale factor and the rounding rule. In this case, since we have specified 10 rows, and require 46 blocks to represent all categories, 5 columns are necessary to properly display the data.
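The arithmetic behind those numbers can be checked by hand. The snippet below is only a sketch of the calculation described above, using the figures from the table; it is not pywaffle's internal code.

```python
import math

# Population by generation, from the df_waffle_nz table.
values = [969010, 966300, 1010650, 1102490, 206330]
scale_factor = 100_000  # one block stands for 100k people
rows = 10

# 'ceil' rounds every category up, so even a small group gets a block.
blocks = [math.ceil(v / scale_factor) for v in values]
total_blocks = sum(blocks)

# With the number of rows fixed, the columns must cover all blocks.
columns = math.ceil(total_blocks / rows)

print(blocks)        # [10, 10, 11, 12, 3]
print(total_blocks)  # 46
print(columns)       # 5
```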

A screenshot of Python code for generating a waffle chart, including parameters for plot title, labels, legend, and styling.

I also include the scale_factor parameter since we are dealing with values in the 200k-1 million range. This will scale each block to 100k to make them easier to interpret.
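Putting the parameters together, a basic figure might look like the sketch below. I am assuming here that the scaling is done by dividing the values by a plain scale_factor variable before passing them in; the code in the screenshot may do this differently. The title text and the names waffle_kwargs and draw_waffle are hypothetical.

```python
# Population by generation, from the df_waffle_nz table.
population_nz = {
    "Baby Boomer": 969010,
    "Gen X": 966300,
    "Gen Z": 1010650,
    "Millennial": 1102490,
    "Silent": 206330,
}
scale_factor = 100_000  # assumption: a plain variable, one block = 100k people

waffle_kwargs = {
    "values": [v / scale_factor for v in population_nz.values()],
    "labels": list(population_nz),
    "rows": 10,               # columns are left for pywaffle to work out
    "rounding_rule": "ceil",  # round each category up to a whole block
    "vertical": True,         # stack the blocks vertically
    "legend": {"loc": "upper left", "bbox_to_anchor": (1.05, 1), "frameon": False},
    "title": {"label": "Generations in New Zealand", "loc": "left"},  # hypothetical title
}


def draw_waffle(kwargs):
    """Build the waffle figure; needs matplotlib and pywaffle installed."""
    import matplotlib.pyplot as plt
    from pywaffle import Waffle

    return plt.figure(FigureClass=Waffle, **kwargs)
```

To render it, call fig = draw_waffle(waffle_kwargs) followed by plt.show() in a notebook or script.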

To further customize our figure, the scale factor can be displayed as a label on the plot. Additionally, we can set the background color to grey to improve the contrast with the lighter colored blocks.
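One possible way to do this is sketched below. The helper style_figure and the wording of the label are my own assumptions; it relies only on the standard matplotlib figure methods set_facecolor and supxlabel, and the original figure may place the label differently.

```python
def style_figure(fig, scale_factor, facecolor="#EEEDE7"):
    """Add a light grey background and a scale note to a matplotlib figure.

    Hypothetical helper: `fig` is any matplotlib Figure, including one
    created with FigureClass=Waffle.
    """
    fig.set_facecolor(facecolor)
    # Put the scale note in the bottom-right corner of the figure.
    fig.supxlabel(
        f"1 block = {scale_factor:,} people", fontsize=8, x=1, ha="right"
    )
    return fig
```

For example, style_figure(fig, 100_000) would add the note "1 block = 100,000 people" to an existing figure.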

If we want to explicitly show the proportions, we can do so in the legend by looping over the values when building the labels. This involves dividing each value by the total population, calculated as total_population = df_plot['Total, New Zealand'].sum().
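A minimal sketch of that loop, using a plain dict in place of the DataFrame so that it runs on its own (the name population_nz is mine):

```python
# Population by generation, from the df_waffle_nz table.
population_nz = {
    "Baby Boomer": 969010,
    "Gen X": 966300,
    "Gen Z": 1010650,
    "Millennial": 1102490,
    "Silent": 206330,
}

total_population = sum(population_nz.values())

# One legend label per generation, with its share of the total.
labels = [
    f"{generation}: {value / total_population:.1%}"
    for generation, value in population_nz.items()
]

print(labels[0])   # Baby Boomer: 22.8%
print(labels[-1])  # Silent: 4.8%
```

The resulting list can be passed as the labels argument in place of the bare index.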

If we want to explicitly set the rows and columns, for example, 20 rows by 5 columns, the blocks will automatically be scaled to maintain the ratio of each category across all blocks. The rounding rule should be set to 'floor' to ensure that the sum of the scaled values does not exceed the total chart size of 100 blocks.
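To see why 'floor' is the safe choice, here is a small sketch of the scaling arithmetic for a 20 by 5 grid. It illustrates the calculation described above with the figures from the table and is not taken from pywaffle's source.

```python
import math

# Population by generation, from the df_waffle_nz table.
values = [969010, 966300, 1010650, 1102490, 206330]
rows, columns = 20, 5
chart_size = rows * columns  # 100 blocks in total

# Scale each category to its share of the 100 blocks.
total = sum(values)
scaled = [v * chart_size / total for v in values]

floor_blocks = [math.floor(s) for s in scaled]
ceil_blocks = [math.ceil(s) for s in scaled]

print(floor_blocks, sum(floor_blocks))  # [22, 22, 23, 25, 4] 96
print(ceil_blocks, sum(ceil_blocks))    # [23, 23, 24, 26, 5] 101
```

With 'floor' the blocks sum to 96 and fit inside the grid, while 'ceil' would need 101 blocks, one more than the chart can hold.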

To stack the plot horizontally, we can invert the rows and columns (rows=5, columns=20) and set the 'vertical' parameter to False.

The blocks can be replaced with different icons using the 'icons' parameter. You can find a whole library of icons to choose from at Font Awesome.
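The horizontal layout and the icons can be combined in one set of arguments, sketched below. The 'person' icon and the icon size match the full code at the end of this post; icon_legend and the names icon_kwargs and draw_icon_waffle are additions of mine for illustration.

```python
icon_kwargs = {
    # Population by generation, from the df_waffle_nz table.
    "values": [969010, 966300, 1010650, 1102490, 206330],
    "labels": ["Baby Boomer", "Gen X", "Gen Z", "Millennial", "Silent"],
    "rows": 5,                 # rows and columns swapped for a wide layout
    "columns": 20,
    "vertical": False,         # stack the blocks horizontally
    "rounding_rule": "floor",  # keep the total within the 100 blocks
    "icons": "person",         # Font Awesome icon name
    "icon_size": 16,
    "icon_legend": True,       # use the icon in the legend as well
}


def draw_icon_waffle(kwargs):
    """Build the icon waffle figure; needs matplotlib and pywaffle installed."""
    import matplotlib.pyplot as plt
    from pywaffle import Waffle

    return plt.figure(FigureClass=Waffle, **kwargs)
```

Swapping "person" for another Font Awesome icon name is the only change needed to try a different symbol.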

Multiple waffle plots by region

We can also plot more than one waffle plot, for example, to compare data over time or across regions. Since I want to show the districts with the highest proportion of Millennials, Baby Boomers, and Gen Zers, I have added the districts in my DataFrame as follows.

A table displaying the population count by generation for three regions in New Zealand: Queenstown-Lakes district, Thames-Coromandel district, and Wellington city. Rows represent generations including Baby Boomer, Gen X, Gen Z, Millennial, and Silent, with corresponding population numbers for each district.

We can create a function from the basic plot template that combines everything we've seen so far, and call on it to create a plot object for each region.

I just need to define region_1, region_2, and region_3 for each plot object and add them to the fig object.

Below is the final plot, incorporating all the customization elements I've discussed throughout. The complete code is provided at the end of this article.

A waffle chart showing population demographics by selected districts in New Zealand. It includes Thames-Coromandel district, Queenstown-Lakes district, and Wellington city, with color-coded icons representing the proportion of Baby Boomers, Gen X, Gen Z, Millennials, and Silent generations.

The Thames-Coromandel district has the largest share of Baby Boomers, along with a large share of residents from the Silent Generation (1928-1945). No surprises here, as the Coromandel region is relatively warm and not too far from the main center of Auckland. The vibrant city of Wellington has attracted the largest share of Gen Zers. Queenstown has the highest concentration of Millennials, perhaps attracting those who want an active lifestyle and can also afford the high living costs there.

If you’re interested in the generational makeup for a specific district, you can check out this detailed heatmap. I’m also working on a future post integrating the same demographic data in the context of the housing market, so stay tuned.

The design possibilities from the basic template I provided are endless, so feel free to experiment and replicate others that you have seen. The Python code to produce waffle plots is fairly simple, so don’t miss out on the opportunity to make your visualizations more engaging. If you have yet to try Python, I hope I have convinced a few of you to start including it in your data analysis toolkit.


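Full Python code for multiple waffle plots

import matplotlib.pyplot as plt
from pywaffle import Waffle

# df_waffle_region is the DataFrame shown earlier: generations in the index,
# one column per district.
# Districts to compare
region_1 = "Thames-Coromandel district"
region_2 = "Queenstown-Lakes district"
region_3 = "Wellington city"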


# Function to create a plot dict for a given region
def create_plot_data(region):
    df_plot = df_waffle_region[region]
    total_population = df_plot.sum()

    values = df_plot.tolist()
    labels = [
        f"{index}: {value/total_population:.1%}"
        for index, value in zip(df_plot.index, df_plot)
    ]

    plot_data = {
        "values": values,
        "labels": labels,
        "legend": {
            "loc": "upper left",
            "bbox_to_anchor": (1.05, 1),
            "fontsize": 9,
            "frameon": False,
        },
        "title": {"label": region, "loc": "left", "fontsize": 12},
        "rounding_rule": "floor",
        "icons": "person",
        "icon_size": 16,
    }
    return plot_data


plot1 = create_plot_data(region_1)
plot2 = create_plot_data(region_2)
plot3 = create_plot_data(region_3)

fig = plt.figure(
    FigureClass=Waffle,
    plots={
        311: plot1,
        312: plot2,
        313: plot3,
    },
    rows=5,  # Common parameter applied to all subplots.
    columns=20,
    cmap_name="Accent",
    figsize=(8, 6),
)

# Common figure settings
fig.suptitle(
    "Population demographics by selected districts",
    fontsize=16,
    fontweight="bold",
    va="top",
    ha="left",
    x=0.01,
    y=1.01,
)
fig.supxlabel(
    "Source: Statistics NZ\nautonomousecon.substack.com", fontsize=8, x=1, ha="right"
)
fig.set_facecolor("#EEEDE7")

plt.show()
