Search Shortcut cmd + k | ctrl + k
sitemap

Parse XML sitemaps from websites with automatic discovery via robots.txt

Maintainer(s): onnimonni

Installing and Loading

INSTALL sitemap FROM community;
LOAD sitemap;

Example

-- Get all URLs from a sitemap (auto-prepends https://)
SELECT * FROM sitemap_urls('example.com');

-- Direct sitemap URL (skips discovery)
SELECT * FROM sitemap_urls('https://example.com/sitemap.xml');

-- Get URLs from multiple sites
SELECT * FROM sitemap_urls(['example.com', 'google.com']);

-- Filter specific URLs
SELECT * FROM sitemap_urls('example.com')
WHERE url LIKE '%/blog/%';

-- Bruteforce find sitemap (tries 587+ patterns)
SELECT bruteforce_find_sitemap('example.com') as sitemap_url;

-- Ignore errors for invalid domains
SELECT * FROM sitemap_urls(['valid.com', 'invalid.com'], ignore_errors := true);

-- Set custom user agent
SET sitemap_user_agent = 'MyBot/1.0';

-- Count URLs by type
SELECT
    CASE
        WHEN url LIKE '%/product/%' THEN 'product'
        WHEN url LIKE '%/blog/%' THEN 'blog'
        ELSE 'other'
    END as type,
    count(*) as count
FROM sitemap_urls('https://example.com')
GROUP BY type;

About sitemap

The sitemap extension provides a table function for parsing XML sitemaps from websites. It automatically discovers sitemaps via multiple fallback methods and supports recursive sitemap index traversal.

Features:

  • Multi-fallback sitemap discovery (robots.txt, /sitemap.xml, /sitemap_index.xml, HTML meta tags)
  • Session caching for discovered sitemap locations
  • Bruteforce finder (bruteforce_find_sitemap) - tries 587+ common sitemap URL patterns
  • Custom user agent via SET sitemap_user_agent
  • Direct sitemap URL support (skips discovery)
  • Sitemap index support (nested sitemaps)
  • Array support for processing multiple domains
  • Retry logic with exponential backoff
  • Respects Retry-After header on 429 responses
  • Gzip decompression for .xml.gz files
  • Multiple namespace support (standard + Google schemas)
  • SQL filtering with WHERE clauses
  • Optional error ignoring for batch processing

The function returns a table with columns:

  • url: Page URL (VARCHAR)
  • lastmod: Last modification date (VARCHAR, optional)
  • changefreq: Change frequency hint (VARCHAR, optional)
  • priority: Priority hint 0.0-1.0 (VARCHAR, optional)

Options:

  • follow_robots (BOOLEAN): Parse robots.txt first (default: true)
  • max_depth (INTEGER): Max sitemap index nesting (default: 3)
  • max_retries (INTEGER): Max retry attempts (default: 5)
  • backoff_ms (INTEGER): Initial backoff in ms (default: 100)
  • max_backoff_ms (INTEGER): Max backoff cap in ms (default: 30000)
  • ignore_errors (BOOLEAN): Don't throw on failed fetches (default: false)

Example with options:

SELECT * FROM sitemap_urls(
    'https://example.com',
    follow_robots := true,
    max_depth := 5,
    max_retries := 10
) WHERE url LIKE '%/product/%';

Bruteforce sitemap discovery (scalar function):

-- Find sitemap by trying 587+ common patterns
SELECT bruteforce_find_sitemap('https://example.com') as sitemap_url;
-- Returns first working sitemap URL or NULL

-- Patterns tested include:
-- /sitemap.xml, /sitemap_index.xml
-- /sitemap/sitemap.xml, /sitemaps/sitemap-index.xml
-- /en/sitemap.xml, /de/sitemap.xml
-- /pub/media/sitemap.xml
-- And 580+ more variations

Compose with http_request to fetch page content:

SELECT s.url, h.body
FROM sitemap_urls('https://example.com') s
JOIN LATERAL (SELECT * FROM http_get(s.url)) h ON true
WHERE s.url LIKE '%/product/%'
LIMIT 10;

Added Functions

This extension does not add any functions.

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.