sitemap

Parse XML sitemaps from websites with automatic discovery via robots.txt

Maintainer(s): onnimonni

Installing and Loading

INSTALL sitemap FROM community;
LOAD sitemap;

Example

-- Get all URLs from a sitemap (auto-prepends https://)
SELECT * FROM sitemap_urls('example.com');

-- Direct sitemap URL (skips discovery)
SELECT * FROM sitemap_urls('https://example.com/sitemap.xml');

-- Get URLs from multiple sites
SELECT * FROM sitemap_urls(['example.com', 'google.com']);

-- Filter specific URLs
SELECT * FROM sitemap_urls('example.com')
WHERE url LIKE '%/blog/%';

-- Ignore errors for invalid domains
SELECT * FROM sitemap_urls(['valid.com', 'invalid.com'], ignore_errors := true);

-- Count URLs by type
SELECT
    CASE
        WHEN url LIKE '%/product/%' THEN 'product'
        WHEN url LIKE '%/blog/%' THEN 'blog'
        ELSE 'other'
    END AS type,
    count(*) AS count
FROM sitemap_urls('https://example.com')
GROUP BY type;

About sitemap

The sitemap extension provides a table function for parsing XML sitemaps from websites. It automatically discovers sitemaps via robots.txt and supports recursive sitemap index traversal.
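
Discovery and recursion can both be tuned through the options documented below. As a minimal sketch, capping recursion keeps the function from descending into nested sitemap indexes (the exact depth semantics are assumed from the max_depth option described under Options):

-- Read the top-level sitemap only; assumes max_depth := 1 stops index recursion
SELECT * FROM sitemap_urls('https://example.com', max_depth := 1);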

Features:

  • Multi-fallback sitemap discovery (robots.txt, /sitemap.xml, /sitemap_index.xml, HTML meta tags)
  • Session caching for discovered sitemap locations
  • Direct sitemap URL support (skips discovery)
  • Sitemap index support (nested sitemaps)
  • Array support for processing multiple domains
  • Retry logic with exponential backoff
  • Respects Retry-After header on 429 responses
  • Gzip decompression for .xml.gz files
  • Multiple namespace support (standard + Google schemas)
  • SQL filtering with WHERE clauses
  • Optional error ignoring for batch processing (see the sketch after this list)
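
Several of these features map directly to the options documented below. A hedged sketch for batch processing: skip failing domains and soften the retry behavior (the second domain name is a placeholder):

-- Process several domains; skip failures instead of throwing, retry gently
SELECT * FROM sitemap_urls(
    ['example.com', 'flaky-host.example'],
    ignore_errors := true,
    max_retries := 3,
    backoff_ms := 500
);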

The function returns a table with columns:

  • url: Page URL (VARCHAR)
  • lastmod: Last modification date (VARCHAR, optional)
  • changefreq: Change frequency hint (VARCHAR, optional)
  • priority: Priority hint 0.0-1.0 (VARCHAR, optional)
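
All four columns come back as VARCHAR, so date or numeric comparisons need a cast. A sketch listing the most recently modified pages, assuming the site's lastmod values parse as timestamps:

-- lastmod is VARCHAR; try_cast returns NULL for unparseable values
SELECT url, lastmod
FROM sitemap_urls('https://example.com')
WHERE try_cast(lastmod AS TIMESTAMP) IS NOT NULL
ORDER BY try_cast(lastmod AS TIMESTAMP) DESC
LIMIT 20;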

Options:

  • follow_robots (BOOLEAN): Parse robots.txt first (default: true)
  • max_depth (INTEGER): Max sitemap index nesting (default: 3)
  • max_retries (INTEGER): Max retry attempts (default: 5)
  • backoff_ms (INTEGER): Initial backoff in ms (default: 100)
  • max_backoff_ms (INTEGER): Max backoff cap in ms (default: 30000)
  • ignore_errors (BOOLEAN): Don't throw on failed fetches (default: false)

Example with options:

SELECT * FROM sitemap_urls(
    'https://example.com',
    follow_robots := true,
    max_depth := 5,
    max_retries := 10
) WHERE url LIKE '%/product/%';

Compose with http_get (from the http_client community extension) to fetch page content:

SELECT s.url, h.body
FROM sitemap_urls('https://example.com') s
JOIN LATERAL (SELECT * FROM http_get(s.url)) h ON true
WHERE s.url LIKE '%/product/%'
LIMIT 10;
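
Since the result is an ordinary table, it also composes with the rest of DuckDB; for example, the crawl list can be persisted with a plain COPY (the output filename is illustrative):

-- Save discovered URLs to Parquet for later crawling
COPY (SELECT * FROM sitemap_urls('https://example.com'))
TO 'example_sitemap.parquet' (FORMAT parquet);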