sitemap

Parse XML sitemaps from websites with automatic discovery via robots.txt

Maintainer(s): onnimonni

Installing and Loading

INSTALL sitemap FROM community;
LOAD sitemap;

Example

-- Get all URLs from a sitemap (auto-prepends https://)
SELECT * FROM sitemap_urls('example.com');

-- Direct sitemap URL (skips discovery)
SELECT * FROM sitemap_urls('https://example.com/sitemap.xml');

-- Get URLs from multiple sites
SELECT * FROM sitemap_urls(['example.com', 'google.com']);

-- Filter specific URLs
SELECT * FROM sitemap_urls('example.com')
WHERE url LIKE '%/blog/%';

-- Bruteforce sitemap discovery (tries 587+ patterns)
SELECT bruteforce_find_sitemap('example.com') as sitemap_url;

-- Ignore errors for invalid domains
SELECT * FROM sitemap_urls(['valid.com', 'invalid.com'], ignore_errors := true);

-- Set custom user agent
SET sitemap_user_agent = 'MyBot/1.0';

-- Count URLs by type
SELECT
    CASE
        WHEN url LIKE '%/product/%' THEN 'product'
        WHEN url LIKE '%/blog/%' THEN 'blog'
        ELSE 'other'
    END as type,
    count(*) as count
FROM sitemap_urls('https://example.com')
GROUP BY type;

About sitemap

The sitemap extension provides a table function for parsing XML sitemaps from websites. It automatically discovers sitemaps via multiple fallback methods and supports recursive sitemap index traversal.
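
For example, when a sitemap index points at further sitemaps, traversal is recursive and can be bounded with the max_depth option (documented below); a minimal sketch, using an illustrative index URL:

-- Follow nested sitemap indexes at most two levels deep
SELECT * FROM sitemap_urls('https://example.com/sitemap_index.xml', max_depth := 2);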

Features:

  • Multi-fallback sitemap discovery (robots.txt, /sitemap.xml, /sitemap_index.xml, HTML meta tags)
  • Session caching for discovered sitemap locations
  • Bruteforce finder (bruteforce_find_sitemap) that tries 587+ common sitemap URL patterns
  • Custom user agent via SET sitemap_user_agent
  • Direct sitemap URL support (skips discovery)
  • Sitemap index support (nested sitemaps)
  • Array support for processing multiple domains
  • Retry logic with exponential backoff
  • Respects Retry-After header on 429 responses
  • Gzip decompression for .xml.gz files (see the sketch after this list)
  • Multiple namespace support (standard + Google schemas)
  • SQL filtering with WHERE clauses
  • Optional error ignoring for batch processing
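
As a sketch of the gzip and direct-URL features above (the exact .xml.gz path is hypothetical), a compressed sitemap can be queried directly and is decompressed transparently:

-- Direct .xml.gz URL: discovery is skipped, payload is gunzipped
SELECT * FROM sitemap_urls('https://example.com/sitemap.xml.gz');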

The function returns a table with the following columns (queried in the sketch after this list):

  • url: Page URL (VARCHAR)
  • lastmod: Last modification date (VARCHAR, optional)
  • changefreq: Change frequency hint (VARCHAR, optional)
  • priority: Priority hint 0.0-1.0 (VARCHAR, optional)
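
All four columns come back as VARCHAR, so for uniform W3C datetime values lastmod can simply be sorted as text; a minimal sketch listing the most recently changed pages:

-- lastmod is VARCHAR; W3C datetime strings sort correctly as text
SELECT url, lastmod
FROM sitemap_urls('https://example.com')
WHERE lastmod IS NOT NULL
ORDER BY lastmod DESC
LIMIT 20;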

Options:

  • follow_robots (BOOLEAN): Parse robots.txt first (default: true)
  • max_depth (INTEGER): Max sitemap index nesting (default: 3)
  • max_retries (INTEGER): Max retry attempts (default: 5)
  • backoff_ms (INTEGER): Initial backoff in ms (default: 100)
  • max_backoff_ms (INTEGER): Max backoff cap in ms (default: 30000)
  • ignore_errors (BOOLEAN): Don't throw on failed fetches (default: false)

Example with options:

SELECT * FROM sitemap_urls(
    'https://example.com',
    follow_robots := true,
    max_depth := 5,
    max_retries := 10
) WHERE url LIKE '%/product/%';

Bruteforce sitemap discovery (scalar function):

-- Find sitemap by trying 587+ common patterns
SELECT bruteforce_find_sitemap('https://example.com') as sitemap_url;
-- Returns the first working sitemap URL, or NULL if none is found

-- Patterns tested include:
-- /sitemap.xml, /sitemap_index.xml
-- /sitemap/sitemap.xml, /sitemaps/sitemap-index.xml
-- /en/sitemap.xml, /de/sitemap.xml
-- /pub/media/sitemap.xml
-- And 580+ more variations
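
Since it is a scalar function, bruteforce_find_sitemap can also be applied per row; a sketch over a hypothetical list of candidate domains, keeping only the hits:

SELECT domain, sitemap_url
FROM (
    SELECT domain, bruteforce_find_sitemap(domain) as sitemap_url
    FROM (VALUES ('example.com'), ('example.org')) t(domain)
) candidates
WHERE sitemap_url IS NOT NULL; -- NULL means no pattern matched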

Compose with http_get to fetch page content:

SELECT s.url, h.body
FROM sitemap_urls('https://example.com') s
JOIN LATERAL (SELECT * FROM http_get(s.url)) h ON true
WHERE s.url LIKE '%/product/%'
LIMIT 10;
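
Note that http_get is not part of this extension; the query above assumes it is provided by the separate http_client community extension, which must be installed and loaded first:

INSTALL http_client FROM community;
LOAD http_client;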