
Building 15 Web Scrapers for Font Foundry Specimen Images

How I built a Puppeteer-based orchestrator pattern for scraping font specimen images from 15 different type foundry websites.

Mladen Ruzicic
8 min

FontAlternatives needs specimen images for every premium font. These are the high-quality images showing fonts in use that foundries create for marketing.

There’s no universal API for this. Each foundry has their own website structure. So I built 15 foundry-specific scrapers with an orchestrator that picks the right one.

The problem

I need specimen images for 300+ premium fonts. Manually downloading images would take hours. And when I add new fonts, I’d need to do it again.

Options I considered:

  1. Manual download: Time-consuming, doesn’t scale
  2. MyFonts API: No public API for images
  3. Google Images: Unreliable, wrong images, copyright issues
  4. Web scraping: Works, but each foundry is different

Web scraping won. But it meant building separate scrapers for each foundry.

The orchestrator pattern

The orchestrator is a simple priority system:

import { getFoundryScraper, scrapeMyFonts, scrapeGeneric } from './scrapers';

async function downloadFontImages(fontSlug: string): Promise<void> {
  const font = await getFontData(fontSlug);

  // Try foundry-specific scraper first
  const foundryScraper = getFoundryScraper(font.foundry);
  if (foundryScraper) {
    try {
      const images = await foundryScraper(font);
      if (images.length > 0) {
        await saveImages(fontSlug, images);
        return;
      }
    } catch (error) {
      console.warn(`Foundry scraper failed: ${font.foundry}`, error);
    }
  }

  // Fallback to MyFonts
  try {
    const images = await scrapeMyFonts(font.name);
    if (images.length > 0) {
      await saveImages(fontSlug, images);
      return;
    }
  } catch (error) {
    console.warn('MyFonts scraper failed', error);
  }

  // Generic fallback
  try {
    const images = await scrapeGeneric(font);
    await saveImages(fontSlug, images);
  } catch (error) {
    console.error('All scrapers failed', error);
    // Create placeholder, flag for manual upload
    await createPlaceholder(fontSlug);
  }
}

Foundry-specific scrapers get the best images. MyFonts is the reliable fallback. Generic scraper is last resort.
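The `getFoundryScraper` lookup is just a map from normalized foundry name to scraper function. A minimal sketch (the map shape and placeholder entries are illustrative, not the real module):

```typescript
// Sketch of the scraper registry; the real module wires in all 15 scrapers.
type Scraper = (font: { name: string; foundry: string }) => Promise<string[]>;

// Illustrative entries only — each value would be the real scraper function.
const FOUNDRY_SCRAPERS: Record<string, Scraper> = {
  klim: async () => [],
  pangram: async () => [],
  'commercial-type': async () => [],
};

function getFoundryScraper(foundry: string): Scraper | undefined {
  // Normalize so "Commercial Type" and "commercial-type" both resolve
  return FOUNDRY_SCRAPERS[foundry.toLowerCase().trim().replace(/\s+/g, '-')];
}
```

An unknown foundry returns `undefined`, which is what lets the orchestrator fall through to MyFonts.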

Foundry-specific scrapers

Each foundry structures their site differently. Here’s how I handle a few of them:
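One piece they all share: each scraper derives the URL slug from the font name the same way, so it's worth a single helper. A sketch (the real version may need per-foundry quirks):

```typescript
// Convert a display name like "Söhne Breit" into a URL slug.
// Sketch only — individual foundries may have their own slug rules.
function toSlug(name: string): string {
  return name
    .toLowerCase()
    .normalize('NFD')                 // split accented chars into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // strip the combining marks
    .replace(/[^a-z0-9]+/g, '-')      // runs of non-alphanumerics become hyphens
    .replace(/^-+|-+$/g, '');         // trim leading/trailing hyphens
}
```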

Klim Type Foundry

Klim uses a clean structure with specimen images in predictable locations:

async function scrapeKlim(font: Font): Promise<string[]> {
  const slug = font.name.toLowerCase().replace(/\s+/g, '-');
  const url = `https://klim.co.nz/retail-fonts/${slug}/`;

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Klim uses data-src for lazy-loaded images
  const images = await page.$$eval(
    '.specimen-image img',
    (imgs) => imgs.map((img) =>
      img.getAttribute('data-src') || img.getAttribute('src')
    ).filter(Boolean)
  );

  await page.close();
  return images;
}

Pangram Pangram

Pangram uses full-bleed specimen images with consistent class names:

async function scrapePangram(font: Font): Promise<string[]> {
  const slug = font.name.toLowerCase().replace(/\s+/g, '-');
  const url = `https://pangrampangram.com/products/${slug}`;

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Scroll to trigger lazy loading
  await page.mouse.wheel({ deltaY: 5000 });
  await new Promise((resolve) => setTimeout(resolve, 1000)); // waitForTimeout was removed in newer Puppeteer

  const images = await page.$$eval(
    'img.specimen-full',
    (imgs) => imgs.map((img) => img.src)
  );

  await page.close();
  return images;
}

Commercial Type

Commercial Type has a gallery section with high-res specimens:

async function scrapeCommercialType(font: Font): Promise<string[]> {
  const slug = font.name.toLowerCase().replace(/\s+/g, '-');
  const url = `https://commercialtype.com/catalog/${slug}`;

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Find the gallery section
  const images = await page.$$eval(
    '[data-gallery] img, .specimen-gallery img',
    (imgs) => imgs.map((img) => {
      // Get highest resolution version
      const srcset = img.getAttribute('srcset');
      if (srcset) {
        const sources = srcset.split(',').map(s => s.trim().split(/\s+/));
        const highest = sources.sort((a, b) =>
          parseInt(b[1] || '0', 10) - parseInt(a[1] || '0', 10)
        )[0];
        return highest[0];
      }
      return img.src;
    })
  );

  await page.close();
  return images;
}

Hoefler&Co

Hoefler uses JavaScript-rendered content, requiring full page wait:

async function scrapeHoefler(font: Font): Promise<string[]> {
  const slug = font.name.toLowerCase().replace(/\s+/g, '-');
  const url = `https://www.typography.com/fonts/${slug}`;

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Wait for dynamic content
  await page.waitForSelector('.font-specimen', { timeout: 10000 });

  const images = await page.$$eval(
    '.font-specimen img, .gallery-item img',
    (imgs) => imgs.map((img) => img.src)
  );

  await page.close();
  return images;
}

Handling srcset and responsive images

Modern foundry sites use responsive images. I extract the highest resolution:

function extractBestImage(img: Element): string | null {
  // Try srcset first
  const srcset = img.getAttribute('srcset');
  if (srcset) {
    const sources = srcset
      .split(',')
      .map((s) => {
        const parts = s.trim().split(/\s+/);
        return {
          url: parts[0],
          width: parseInt(parts[1]?.replace('w', '') || '0'),
        };
      })
      .sort((a, b) => b.width - a.width);

    if (sources[0]?.url) {
      return sources[0].url;
    }
  }

  // Fallback to data-src (lazy loading)
  const dataSrc = img.getAttribute('data-src');
  if (dataSrc) return dataSrc;

  // Finally, regular src
  return img.getAttribute('src');
}
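Because `extractBestImage` only calls `getAttribute`, it can be exercised outside the browser with a structural stand-in for `Element` (a hypothetical test double, not Puppeteer API):

```typescript
// Structural stand-in for the one part of Element the helper uses.
interface ImgLike {
  getAttribute(name: string): string | null;
}

function extractBestImage(img: ImgLike): string | null {
  const srcset = img.getAttribute('srcset');
  if (srcset) {
    const sources = srcset
      .split(',')
      .map((s) => {
        const parts = s.trim().split(/\s+/);
        return { url: parts[0], width: parseInt(parts[1]?.replace('w', '') || '0', 10) };
      })
      .sort((a, b) => b.width - a.width);
    if (sources[0]?.url) return sources[0].url;
  }
  return img.getAttribute('data-src') ?? img.getAttribute('src');
}

// A fake <img> with a three-candidate srcset; the 1600w URL wins.
const fake: ImgLike = {
  getAttribute: (name) =>
    name === 'srcset' ? 'a.jpg 400w, b.jpg 1600w, c.jpg 800w' : null,
};
```

This is also how I'd unit-test the helper without spinning up a browser.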

The MyFonts fallback

When foundry-specific scrapers fail or don’t exist, MyFonts usually has the font:

async function scrapeMyFonts(fontName: string): Promise<string[]> {
  const searchUrl = `https://www.myfonts.com/search?query=${encodeURIComponent(fontName)}`;

  const page = await browser.newPage();
  await page.goto(searchUrl, { waitUntil: 'networkidle0' });

  // Click first result
  const firstResult = await page.$('.search-result-item a');
  if (!firstResult) {
    await page.close();
    return [];
  }

  await firstResult.click();
  await page.waitForNavigation({ waitUntil: 'networkidle0' });

  // Get specimen images from font page
  const images = await page.$$eval(
    '.specimen-image img, .font-preview img',
    (imgs) => imgs.map((img) => img.src)
  );

  await page.close();
  return images;
}

MyFonts images are lower quality than foundry originals, but they’re consistent and cover almost every commercial font.

Image processing pipeline

Raw scraped images need processing:

  1. Format conversion: source PNG/JPEG to both WebP and AVIF
  2. Resizing: Create thumbnail (400px width)
  3. Optimization: Strip metadata, compress

import sharp from 'sharp';
import { mkdir } from 'fs/promises';

async function processImage(
  buffer: Buffer,
  fontSlug: string,
  index: number
): Promise<void> {
  const basePath = `.cache/assets/previews/${fontSlug}`;
  await mkdir(basePath, { recursive: true }); // sharp won't create missing directories

  // Full size WebP
  await sharp(buffer)
    .webp({ quality: 85 })
    .toFile(`${basePath}/specimen-${index}.webp`);

  // Full size AVIF
  await sharp(buffer)
    .avif({ quality: 80 })
    .toFile(`${basePath}/specimen-${index}.avif`);

  // Thumbnail
  await sharp(buffer)
    .resize(400, null, { withoutEnlargement: true })
    .webp({ quality: 80 })
    .toFile(`${basePath}/thumb-${index}.webp`);
}

Manifest tracking

I track which images exist for each font:

{
  "avenir": {
    "specimens": ["specimen-0.webp", "specimen-1.webp"],
    "thumbnails": ["thumb-0.webp"],
    "lastUpdated": "2024-01-15T10:30:00Z",
    "source": "lineto"
  }
}

The manifest tells me:

  • Which fonts have images
  • How many specimens each font has
  • When images were last scraped
  • Which scraper was used (for debugging)
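Updating the manifest is a pure transformation over the JSON shape above. A sketch (field names follow the example; the helper itself is hypothetical):

```typescript
interface ManifestEntry {
  specimens: string[];
  thumbnails: string[];
  lastUpdated: string;
  source: string;
}
type Manifest = Record<string, ManifestEntry>;

// Record a scrape result for one font, preserving all other entries.
function updateManifest(
  manifest: Manifest,
  fontSlug: string,
  files: string[],
  source: string,
  now: Date = new Date()
): Manifest {
  return {
    ...manifest,
    [fontSlug]: {
      specimens: files.filter((f) => f.startsWith('specimen-')),
      thumbnails: files.filter((f) => f.startsWith('thumb-')),
      lastUpdated: now.toISOString(),
      source,
    },
  };
}
```

Returning a new object instead of mutating makes it trivial to diff the manifest before and after a batch run.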

Rate limiting and politeness

Scrapers can hammer servers. I add delays between requests:

const RATE_LIMITS: Record<string, number> = {
  klim: 2000,       // 2 seconds between requests
  pangram: 1500,
  commercial: 2000,
  myfonts: 3000,    // MyFonts is stricter
  default: 1000,
};

async function delay(foundry: string): Promise<void> {
  const ms = RATE_LIMITS[foundry] || RATE_LIMITS.default;
  await new Promise((resolve) => setTimeout(resolve, ms));
}

I also set a realistic user agent and respect robots.txt (mostly; specimen pages aren't usually blocked).
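The robots.txt check is deliberately simple. A naive sketch that only reads `Disallow` rules under `User-agent: *` (a real implementation would also handle `Allow`, wildcards, and specific agent matching):

```typescript
// Very small robots.txt check: is `path` blocked for User-agent: *?
// Naive sketch only — ignores Allow rules, wildcards, and specific agents.
function isDisallowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments and whitespace
    if (!line) continue;
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (key.toLowerCase() === 'user-agent') {
      inStarGroup = value === '*';
    } else if (inStarGroup && key.toLowerCase() === 'disallow' && value) {
      if (path.startsWith(value)) return true;
    }
  }
  return false;
}
```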

Error handling and manual fallback

Scrapers fail. Sites change. When automation fails, I need a manual path:

async function handleScraperFailure(fontSlug: string): Promise<void> {
  // Create placeholder image
  await createPlaceholder(fontSlug);

  // Create GitHub issue for manual upload
  if (process.env.GITHUB_TOKEN) {
    await createGitHubIssue({
      title: `Manual image needed: ${fontSlug}`,
      body: `Automated scraping failed for ${fontSlug}. Please manually upload specimen images.`,
      labels: ['manual-upload', 'images'],
    });
  }
}

The placeholder is a simple gray box with the font name. It’s better than broken images.
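The placeholder can be as simple as a generated SVG with the font name. A sketch (the real `createPlaceholder` may rasterize something like this with sharp; the dimensions are illustrative):

```typescript
// Build a gray 1200x630 SVG placeholder with the font name centered.
function placeholderSvg(fontName: string): string {
  // Escape characters that would break the XML.
  const safe = fontName
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  return `<svg xmlns="http://www.w3.org/2000/svg" width="1200" height="630">
  <rect width="100%" height="100%" fill="#e5e5e5"/>
  <text x="50%" y="50%" text-anchor="middle" dominant-baseline="middle"
        font-family="sans-serif" font-size="48" fill="#666">${safe}</text>
</svg>`;
}
```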

The 15 foundries

Current scrapers:

| Foundry | URL Pattern | Notes |
| --- | --- | --- |
| Klim | klim.co.nz/retail-fonts/{slug}/ | Clean structure |
| Pangram | pangrampangram.com/products/{slug} | Lazy images |
| Commercial Type | commercialtype.com/catalog/{slug} | Has gallery |
| Hoefler&Co | typography.com/fonts/{slug} | JS rendered |
| Lineto | lineto.com/typefaces/{slug} | Simple selectors |
| Dinamo | abcdinamo.com/typefaces/{slug} | Modern structure |
| Grilli Type | grillitype.com/typeface/{slug} | Grid layout |
| Colophon | colophon-foundry.org/typefaces/{slug} | Minimal |
| Sharp Type | sharptype.co/typefaces/{slug} | Good quality |
| Fontsmith | fontsmith.com/fonts/{slug} | Mixed quality |
| Fontshare | fontshare.com/fonts/{slug} | Free fonts |
| Google Fonts | fonts.google.com/specimen/{slug} | API available |
| Adobe Fonts | fonts.adobe.com/fonts/{slug} | Requires auth |
| Type Network | typenetwork.com/fonts/{slug} | Federation |
| MyFonts | myfonts.com/ (search) | Fallback |

Tradeoffs

What I gained:

  • Automated image acquisition for 300+ fonts
  • Consistent image quality through processing
  • Scalable (adding fonts doesn’t require manual work)

What I lost:

  • Maintenance burden (site changes break scrapers)
  • Rate limiting means slow batch processing
  • Some fonts still need manual upload

The brittle reality: Scrapers break. On average, 1-2 foundries change their HTML structure each month. When tests fail, I check which scraper broke and update the selectors. It’s tedious but manageable.

Running the pipeline

# Single font
npx tsx scripts/download-font-images.ts --slug avenir

# Batch (respects rate limits)
npx tsx scripts/download-font-images.ts --batch tier1

# Update manifest
npx tsx scripts/update-image-manifest.ts

The batch mode processes fonts in order of their tier (Tier 1 first, most important fonts). It runs in CI but can also run locally for testing. These images feed into the automated content pipeline that creates new font pages.
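The tier ordering itself is just a sort before the sequential, rate-limited loop. A sketch (`FontRecord` and the `tier` field are illustrative):

```typescript
interface FontRecord {
  slug: string;
  foundry: string;
  tier: number; // 1 = most important
}

// Order a batch so Tier 1 fonts are scraped first.
function batchOrder(fonts: FontRecord[]): FontRecord[] {
  return [...fonts].sort((a, b) => a.tier - b.tier);
}

// The driver processes one font at a time so per-foundry delays apply.
async function runBatch(
  fonts: FontRecord[],
  processFont: (f: FontRecord) => Promise<void>
): Promise<void> {
  for (const font of batchOrder(fonts)) {
    await processFont(font);
    // a delay(font.foundry) call would go here to respect rate limits
  }
}
```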

What I’d do differently

If starting over:

  1. Foundry partnerships: Some foundries might provide images directly if asked
  2. CDN integration: Store images on R2 from the start, not local cache
  3. Visual regression: Detect when scraped images change unexpectedly
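Even a plain content hash stored alongside the manifest would catch silent changes. A sketch using Node's crypto (the stored-hash field is hypothetical):

```typescript
import { createHash } from 'node:crypto';

// Hash a downloaded image buffer so re-scrapes can be compared.
function imageHash(buffer: Buffer): string {
  return createHash('sha256').update(buffer).digest('hex');
}

// True when a previously recorded hash exists and no longer matches.
function imageChanged(previousHash: string | undefined, buffer: Buffer): boolean {
  return previousHash !== undefined && previousHash !== imageHash(buffer);
}
```

A byte-level hash flags any re-encode as a change; true visual regression would need a perceptual hash, but this is enough to know when to look.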

The scraper approach works, but it’s duct tape. A proper solution would involve foundry cooperation. For a side project, duct tape is fine.
