Parse XML sitemaps from websites with automatic discovery via robots.txt
Maintainer(s):
onnimonni
Installing and Loading
INSTALL sitemap FROM community;
LOAD sitemap;
Example
-- Get all URLs from a sitemap (auto-prepends https://)
SELECT * FROM sitemap_urls('example.com');
-- Direct sitemap URL (skips discovery)
SELECT * FROM sitemap_urls('https://example.com/sitemap.xml');
-- Get URLs from multiple sites
SELECT * FROM sitemap_urls(['example.com', 'google.com']);
-- Filter specific URLs
SELECT * FROM sitemap_urls('example.com')
WHERE url LIKE '%/blog/%';
-- Ignore errors for invalid domains
SELECT * FROM sitemap_urls(['valid.com', 'invalid.com'], ignore_errors := true);
-- Count URLs by type
SELECT
    CASE
        WHEN url LIKE '%/product/%' THEN 'product'
        WHEN url LIKE '%/blog/%' THEN 'blog'
        ELSE 'other'
    END AS type,
    count(*) AS count
FROM sitemap_urls('https://example.com')
GROUP BY type;
About sitemap
The sitemap extension provides a table function for parsing XML sitemaps from websites. It automatically discovers sitemaps via robots.txt and supports recursive sitemap index traversal.
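Because sitemap_urls is an ordinary table function, its output composes with the rest of SQL. A minimal sketch that materializes the results once for repeated querying (the pages table name is just an illustration):
-- Fetch once, query many times
CREATE TABLE pages AS
SELECT * FROM sitemap_urls('example.com');
SELECT count(*) FROM pages;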
Features:
- Multi-fallback sitemap discovery (robots.txt, /sitemap.xml, /sitemap_index.xml, HTML meta tags)
- Session caching for discovered sitemap locations
- Direct sitemap URL support (skips discovery)
- Sitemap index support (nested sitemaps)
- Array support for processing multiple domains
- Retry logic with exponential backoff
- Respects Retry-After header on 429 responses
- Gzip decompression for .xml.gz files (see the sketch after this list)
- Multiple namespace support (standard + Google schemas)
- SQL filtering with WHERE clauses
- Optional error ignoring for batch processing
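For instance, a compressed sitemap can be passed in directly and is decompressed transparently. The .xml.gz URL below is hypothetical and assumes the server actually serves a gzipped sitemap:
-- Gzipped sitemaps work like plain XML ones
SELECT * FROM sitemap_urls('https://example.com/sitemap.xml.gz');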
The function returns a table with the following columns (a casting sketch follows the list):
- url: Page URL (VARCHAR)
- lastmod: Last modification date (VARCHAR, optional)
- changefreq: Change frequency hint (VARCHAR, optional)
- priority: Priority hint 0.0-1.0 (VARCHAR, optional)
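Since lastmod arrives as VARCHAR, cast it before sorting or doing date arithmetic. A sketch assuming ISO 8601 lastmod values, which most sitemaps use but no site is required to provide:
-- Most recently modified pages first; rows without a parseable lastmod sort last
SELECT url, try_cast(lastmod AS TIMESTAMP) AS modified_at
FROM sitemap_urls('example.com')
ORDER BY modified_at DESC NULLS LAST
LIMIT 20;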
Options:
- follow_robots (BOOLEAN): Parse robots.txt first (default: true)
- max_depth (INTEGER): Max sitemap index nesting (default: 3)
- max_retries (INTEGER): Max retry attempts (default: 5)
- backoff_ms (INTEGER): Initial backoff in ms (default: 100)
- max_backoff_ms (INTEGER): Max backoff cap in ms (default: 30000)
- ignore_errors (BOOLEAN): Don't throw on failed fetches (default: false)
Example with options:
SELECT * FROM sitemap_urls(
    'https://example.com',
    follow_robots := true,
    max_depth := 5,
    max_retries := 10
) WHERE url LIKE '%/product/%';
Compose with the http_client extension's http_get function to fetch page content:
SELECT s.url, h.body
FROM sitemap_urls('https://example.com') s
JOIN LATERAL (SELECT * FROM http_get(s.url)) h ON true
WHERE s.url LIKE '%/product/%'
LIMIT 10;