Font Subsetting Strategies: Content-Based vs Alphabetical

Written by Paul Hebert on July 7, 2022

A grid of tiles with the letters A-Z on them. A number of other tiles showing non-Latin glyphs are broken off from the grid

Font subsetting allows you to split a font’s characters (letters, numbers, symbols, etc.) into separate files so your visitors only download what they need. There are two main subsetting strategies that have different advantages depending on the type of site you’re building.

On a recent project, I did a deep dive into optimizing web font loading. I wanted to switch from Google Fonts to hosting the fonts ourselves in order to speed up load times and reduce external dependencies. We were using the excellent, open source font Inter. I downloaded the font files and added them to the project. But, I quickly noticed that the size of the font files I was downloading had increased drastically since switching from Google Fonts.

What’s going on?

Modern fonts often support text in a multitude of different languages and alphabets. Inter supports 51 languages and has 2,504 characters. This is awesome, but I don’t need all of that for the site I’m building, and I don’t want my site visitors to download 317kb of font files.

Inter This is a TrueType variable font with 2504 characters. It has 2 axes and 18 named instances. It has 35 layout features. Characters: 2504 Glyphs: 2548 Layout features: 35 Languages: 51 Filename: Inter.var.woff2 Size: 317 KB Format: WOFF2 Designed by: Rasmus Andersson Manufactured by: rsms Copyright: Copyright © 2020 The Inter Project Authors Version: Version 3.019;git-0a5106e0b License This Font Software is licensed under the SIL Open Font License, Version 1.1. This license is available with a FAQ at: http://scripts.sil.org/OFL Language support 51 languages: Afrikaans, Albanian, Azerbaijani, Basque, Belarusian, Bosnian, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Filipino, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Indonesian, Irish, Italian, Kazakh, Kyrgyz, Latvian, Lithuanian, Macedonian, Malay, Mongolian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tongan, Turkish, Ukrainian, Uzbek, Vietnamese, Welsh, Zulu

Screenshot taken from the excellent tool Wakamai Fondue.

Google’s font files were smaller because they were subset. I wanted to do the same thing for the locally hosted version of the font. I found an excellent open source tool called Glyphhanger 1 which would allow me to split a font into smaller pieces, but I wasn’t sure what characters I should include in my subsets. I did some research and found two main strategies…

Heads up! Some font licenses do not allow you to legally modify the font you’re using. Therefore, it could be illegal for you to subset the font. You may want to review your font license, speak with the font’s creators or consult a lawyer before subsetting a font.

Content-Based Subsetting

If you know your site content ahead of time, you can create a super-targeted subset that only contains the characters used on your site. For example, if the only text on your site was “Hello World”, you could create a subset that only contained those characters: “H”, “e”, “l”, “o”, “W”, “r”, and “d”.

Your site probably has more text than “Hello World,” but it probably has a lot less than 2,504 characters. For example, all of Alice in Wonderland only contains 75 unique characters!

!()*,-.03:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzù—‘’“”

All of the characters used in Alice in Wonderland

After subsetting Inter to the 75 characters in Alice in Wonderland, the file size drops from 317kb to 33kb. That’s almost one tenth the size of the original, and reduces the size of fonts loaded by 284kb!

How to Determine Your Site’s Content

If you want to subset based on content, you first need to determine what content is on your site. Luckily, there are a couple of prebuilt tools that can help:

Glyphhanger (mentioned above) can scan your website and find the specific characters used by each font.
Unicode Range Interchange allows you to paste in your site content and see a list of unique characters.

I won’t go too in-depth on this since you can view the links above for more instructions, but here’s an example of how to subset a font based on site content with Glyphhanger:

glyphhanger ./my-site.html --subset=Inter.var.woff2 --family='Inter, sans-serif'

Pros and Cons

Subsetting based on content allows you to make the smallest subset possible! But there’s a major downside: if your site has dynamic content there’s no good way to subset ahead of time. Even if your site is static, every time you make content changes you’ll need to re-subset your font. You can set up a build task to automate this for static sites2, but for dynamic sites you’ll need to look for other options.

Luckily, there’s another approach for font subsetting that works well with dynamic content…

Making Multiple Subsets

The strategy outlined above creates a single subset that’s as small as possible. For dynamic sites, it often makes sense to create a number of small subsets and let the browser pick and choose the ones it needs. That way, you’re providing the entire font but visitors only download the chunks that are used on the page.

This is possible because of the awesome unicode-range CSS feature. This allows you to specify that a certain font file contains certain characters. Then, the browser can check whether the page contains those characters and determine whether to download the file or not.

Subsetting By Alphabet

It’s very common to subset fonts by alphabet. For example, you could have one font file that contains the Latin alphabet, another that contains the Cyrillic alphabet, and another that contains the Greek alphabet.

Do you really need all those alphabets?

You may think that your site only contains Latin characters. And you might be right! But once you’re creating a single subset, it’s not much more work to create the others.

Skipping this step now may set you up for problems in the future. If your site allows user accounts or has users fill out forms, you should definitely provide these additional alphabets. Your site visitors’ names or form input may require these characters.

Base and Extended Alphabets

We can take this strategy a step further and split large alphabets into even smaller subsets. For example, a base subset could contain commonly used letters from that alphabet and an extended subset could contain rarer letters that are less likely to be used on a page. On most pages, the browser is likely to only download the base subset, but the extended subset is there if it’s needed.

This is how Google Fonts subsets Inter. If you view Google’s CSS file for Inter, you’ll see that it contains 7 different alphabet-based subsets: Cyrillic, Cyrillic Extended, Greek, Greek Extended, Latin, Latin Extended, and Vietnamese.3

Implementing Alphabet-Based Subsets

First, we’ll need to figure out the appropriate unicode ranges for different alphabets. A unicode range tells the browser which characters a font file includes. I based my subsets off of Google Fonts but there are a number of other resources you can consult.4

Once you’ve determined your alphabets and their unicode ranges, you can use Glyphhanger to generate your subsets. I put together a Node script to automate this process.

First, we’ll define our alphabets up front:

const alphabets = [
  {
    name: "cyrillic-ext",
    unicodeRange:
      "U+0460-052F, U+1C80-1C88, U+20B4, U+2DE0-2DFF, U+A640-A69F, U+FE2E-FE2F",
  },
  {
    name: "cyrillic",
    unicodeRange: "U+0400-045F, U+0490-0491, U+04B0-04B1, U+2116",
  },
  { name: "greek-ext", unicodeRange: "U+1F00-1FFF" },
  { name: "greek", unicodeRange: "U+0370-03FF" },
  {
    name: "vietnamese",
    unicodeRange:
      "U+0102-0103, U+0110-0111, U+0128-0129, U+0168-0169, U+01A0-01A1, U+01AF-01B0, U+1EA0-1EF9, U+20AB",
  },
  {
    name: "latin-ext",
    unicodeRange:
      "U+0100-024F, U+0259, U+1E00-1EFF, U+2020, U+20A0-20AB, U+20AD-20CF, U+2113, U+2C60-2C7F, U+A720-A7FF",
  },
  {
    name: "latin",
    unicodeRange:
      "U+0000-00FF, U+0131, U+0152-0153, U+02BB-02BC, U+02C6, U+02DA, U+02DC, U+2000-206F, U+2074, U+20AC, U+2122, U+2191, U+2193, U+2212, U+2215, U+FEFF, U+FFFD",
  },
];

Then, we can iterate over our alphabets and create subsets. This script will create new subsets in a subsets subdirectory. You’ll likely need to tweak it to make it work with your font files.

import { spawn } from "child_process";
import path from "path";
import { renameSync } from "fs";
import { fileURLToPath } from "url";

// Set up dirname to make it easier to place our files
// @see https://leveluptutorials.com/posts/how-to-use-__dirname-with-esm
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

for (const alphabet of alphabets) {
  // We want to wait for each subset to finish before starting
  // the next one (this is required for some file renaming
  // done below)
  await new Promise((resolve) => {
    const glyphhangerProcess = spawn(
      "glyphhanger",
      [
        // Spaces in the whitelist breaks the CLI script
        `--whitelist=${alphabet.unicodeRange.replaceAll(" ", "")}`,
        `--subset=Inter-var.woff2`,
        "--output=subsets",
        "--formats=woff2",
      ],
      { stdio: "inherit" }
    );

    glyphhangerProcess.on("close", () => {
      // I couldn't figure out a way to change the output
      // file name in glyphhanger. Instead we manually change
      // it after the file is generated
      const oldPath = path.join(
        __dirname,
        `subsets/Inter-${font.name}-subset.woff2`
      );
      const newPath = path.join(
        __dirname,
        `subsets/Inter-${font.name}-${subset.name}.woff2`
      );
      renameSync(oldPath, newPath);

      resolve();
    });
  });
}

Finally, we can create a CSS file that loads the subsets with the correct unicode ranges. Again, you’ll probably want to tweak this.

import path from "path";
import { writeFile } from "fs";
import { fileURLToPath } from "url";

const cssCode = "";

// Set up dirname to make it easier to place our files
// @see https://leveluptutorials.com/posts/how-to-use-__dirname-with-esm
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Create our font face rules
// This would need to be modified for other weights and styles
alphabets.forEach((alphabet) => {
  cssCode += `
    @font-face {
      font-display: swap;
      font-family: 'Inter';
      font-style: normal;
      font-weight: 400;
      src: url('subsets/Inter-var-${alphabet.name}.woff2') format('woff2');
      unicode-range: ${alphabet.unicodeRange};
    }
  `;
});

// Determine where to save our file
const cssPath = path.join(__dirname, "inter.css");

// Write our CSS file
writeFile(
  cssPath,
  `
/* Stop: this is a generated file. Do not edit directly. */
${cssCode}
  `.trim(),
  {},
  () => {
    console.log("CSS file written to:", cssPath);
  }
);

As you can see, this can be trickier than creating a single subset, but it allows us to serve the entire font, support dynamic content, and only have site visitors download the parts they need.

Optimizing Fonts in Your Project

Subsetting can have a big impact on font loading performance, but the ideal strategy varies depending on the type of site you’re building.

Subsetting is just one technique to improve the performance of web fonts. I highly recommend Zach Leatherman’s definitive article on the subject if you’re looking for other ways to improve your font loading performance.

Glyphhanger requires some Python dependencies and can be a little tricky to install and get running. I recommend this excellent article by Sara Soueidan if you get stuck.
If you add a subsetting build step, you’ll probably only want to run it for production builds, and serve the whole font file locally during development.
The Vietnamese alphabet is really cool! According to wikipedia it “produces words that have no silent letters, with letters and words consistent in how they are read and spoken, with rare exceptions. The elaborate use of diacritics produces a highly accurate sound transcription for tonal languages.”
unicode.org splits the unicode range into a number of alphabet-based ranges. unicode-table.com splits the alphabets slightly differently and has a nice scrolling visualizer to help understand the ranges. JRX provides another option for splitting up the alphabets.

Font Subsetting Strategies: Content-Based vs Alphabetical

Font Subsetting Strategies: Content-Based vs Alphabetical

What’s going on?

Content-Based Subsetting

How to Determine Your Site’s Content

Pros and Cons

Making Multiple Subsets

Subsetting By Alphabet

Do you really need all those alphabets?

Base and Extended Alphabets

Implementing Alphabet-Based Subsets

Optimizing Fonts in Your Project

Recommend

上海数字经济 “十四五” | 原生信仰者听见的历史潮流轰鸣声

Unicode characters & text tools for iPhone, iPad and Mac

Apple’s 14-inch MacBook Pro with the speedy M1 Pro CPU is $250 off

换个姿势做运维！GOPS 2022 · 深圳站精彩内容抢先看

Comic-Con Is Back! Will It Live Up to Its Former Hype?

Ninad Desai (@ninaddesai) / FAUN

诺基亚勒令一个开源 Linux 手机项目 "NOTKIA" 改名字 | 程序师 - 程序员、...

One Man’s Lonely, Lonely Fight to Ban Private Jets

John Dwell (@user22) / FAUN

开源大佬从谷歌离职：在Go语言项目上停滞不前，要去更小的企业寻求变革

About Joyk