Sitemapper is a Node.js module that makes it easy to parse XML sitemaps. It supports single sitemaps, sitemap indexes with multiple sitemaps, and various sitemap formats including image and video sitemaps.
# Using npm
npm install sitemapper --save
# Using yarn
yarn add sitemapper
# Using pnpm
pnpm add sitemapper
import Sitemapper from 'sitemapper';
const sitemap = new Sitemapper({
timeout: 10000, // 10 second timeout
});
sitemap
.fetch('https://gosla.sh/sitemap.xml')
.then(({ url, sites }) => {
console.log('Sites: ', sites);
})
.catch((error) => console.error(error));
You can also use Sitemapper directly from the command line:
# Using npx
npx sitemapper https://gosla.sh/sitemap.xml
import Sitemapper from 'sitemapper';
const sitemap = new Sitemapper();
sitemap
.fetch('https://wp.seantburke.com/sitemap.xml')
.then(({ url, sites }) => {
console.log(`Sitemap URL: ${url}`);
console.log(`Found ${sites.length} URLs`);
console.log(sites);
})
.catch((error) => console.error(error));
import Sitemapper from 'sitemapper';
async function parseSitemap() {
const Google = new Sitemapper({
url: 'https://www.google.com/work/sitemap.xml',
timeout: 15000, // 15 seconds
concurrency: 10,
});
try {
const { sites } = await Google.fetch();
console.log(`Found ${sites.length} URLs in the sitemap`);
console.log(sites);
} catch (error) {
console.error('Error fetching sitemap:', error);
}
}
parseSitemap();
import Sitemapper from 'sitemapper';
import { HttpsProxyAgent } from 'hpagent';
const sitemapper = new Sitemapper({
url: 'https://gosla.sh/sitemap.xml',
timeout: 30000,
concurrency: 5,
retries: 2,
debug: true,
proxyAgent: new HttpsProxyAgent({
proxy: 'http://localhost:8080',
}),
requestHeaders: {
'User-Agent': 'Mozilla/5.0 (compatible; SitemapperBot/1.0)',
},
fields: {
loc: true,
lastmod: true,
sitemap: true,
},
});
sitemapper
.fetch()
.then(({ sites }) => console.log(sites))
.catch((error) => console.error(error));
Sitemapper can be customized with the following options:
Option | Type | Default | Description |
---|---|---|---|
`url` | String | undefined | The URL of the sitemap to parse |
`timeout` | Number | 15000 | Maximum timeout in milliseconds for each request |
`concurrency` | Number | 10 | Maximum number of concurrent requests when crawling multiple sitemaps |
`retries` | Number | 0 | Number of retry attempts for failed requests |
`debug` | Boolean | false | Enable debug logging |
`rejectUnauthorized` | Boolean | true | Reject invalid SSL certificates (like self-signed or expired) |
`requestHeaders` | Object | {} | Additional HTTP headers to include with requests |
`lastmod` | Number | undefined | Only return URLs with lastmod timestamp newer than this value |
`proxyAgent` | HttpProxyAgent \| HttpsProxyAgent | undefined | Instance of hpagent for proxy support |
`exclusions` | Array<RegExp> | [] | Array of regex patterns to exclude URLs from results |
`fields` | Object | undefined | Specify which fields to include in the results (see below) |

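
The `exclusions` option takes an array of regular expressions. Before passing patterns to Sitemapper, it can help to preview which URLs they would drop. The sketch below is only an illustration of the documented intent (a URL is excluded when any pattern matches it); `previewExclusions`, the sample patterns, and the `example.com` URLs are hypothetical and are not part of Sitemapper or its internal implementation.

```javascript
// Hypothetical helper: preview which URLs a set of exclusion patterns would drop.
// This mirrors the documented intent of `exclusions`; it is not Sitemapper's own code.
function previewExclusions(urls, exclusions) {
  const kept = [];
  const dropped = [];
  for (const url of urls) {
    // A URL counts as excluded if any pattern matches it.
    const excluded = exclusions.some((pattern) => pattern.test(url));
    (excluded ? dropped : kept).push(url);
  }
  return { kept, dropped };
}

// Illustrative patterns only. Avoid the `g` flag here: RegExp.prototype.test
// is stateful for global regexes and can give inconsistent matches.
const exclusions = [/\/tag\//, /\.pdf$/];

const sampleUrls = [
  'https://example.com/blog/post-1',
  'https://example.com/tag/news/',
  'https://example.com/files/report.pdf',
];

const preview = previewExclusions(sampleUrls, exclusions);
console.log('Kept:', preview.kept);
console.log('Dropped:', preview.dropped);

// The same array can then be passed as an option, for example:
// new Sitemapper({ url: 'https://example.com/sitemap.xml', exclusions });
```
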
Important: When using the `fields` option, the return format changes from an array of URL strings to an array of objects containing your selected fields.

For the `fields` option, specify which fields to include by setting them to `true`:

Field | Description |
---|---|
`loc` | URL location of the page |
`sitemap` | URL of the sitemap containing this URL (useful for sitemap indexes) |
`lastmod` | Date of last modification |
`changefreq` | How frequently the page is likely to change |
`priority` | Priority of this URL relative to other URLs |
`image:loc` | URL location of the image (for image sitemaps) |
`image:title` | Title of the image (for image sitemaps) |
`image:caption` | Caption of the image (for image sitemaps) |
`video:title` | Title of the video (for video sitemaps) |
`video:description` | Description of the video (for video sitemaps) |
`video:thumbnail_loc` | Thumbnail URL of the video (for video sitemaps) |

// Returns an array of URL strings
[
'https://wp.seantburke.com/?p=234',
'https://wp.seantburke.com/?p=231',
'https://wp.seantburke.com/?p=185',
];
// Returns an array of objects
[
{
loc: 'https://wp.seantburke.com/?p=234',
lastmod: '2015-07-03T02:05:55+00:00',
priority: 0.8,
},
{
loc: 'https://wp.seantburke.com/?p=231',
lastmod: '2015-07-03T01:47:29+00:00',
priority: 0.8,
},
];
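
Because the shape of `sites` depends on whether `fields` is set, code that consumes the results may want to handle both forms. A minimal sketch, assuming the two shapes shown above: the helper name `toUrlList` and the `example.com` URLs are hypothetical and not part of Sitemapper.

```javascript
// Hypothetical helper: return plain URL strings regardless of result shape.
// - Without `fields`, each entry is assumed to be a URL string.
// - With `fields`, each entry is assumed to be an object whose `loc` holds the URL.
function toUrlList(sites) {
  return sites
    .map((site) => (typeof site === 'string' ? site : site.loc))
    // Entries without a usable `loc` (e.g. `loc` was not requested) are skipped.
    .filter((url) => typeof url === 'string' && url.length > 0);
}

const withoutFields = ['https://example.com/a', 'https://example.com/b'];
const withFields = [
  { loc: 'https://example.com/a', lastmod: '2015-07-03T02:05:55+00:00' },
  { lastmod: '2015-07-03T01:47:29+00:00' }, // no `loc`, so it is dropped
];

const fromStrings = toUrlList(withoutFields);
const fromObjects = toUrlList(withFields);
console.log(fromStrings);
console.log(fromObjects);
```

If you rely on a helper like this, include `loc: true` in `fields`, since objects without `loc` are silently skipped in this sketch.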
Sitemapper includes a simple CLI tool for basic sitemap parsing directly from the command line:
npx sitemapper <sitemap-url>
npx sitemapper https://gosla.sh/sitemap.xml
The CLI will display the sitemap URL and list all URLs found in the sitemap:
Sitemap URL: https://gosla.sh/sitemap.xml
Found URLs:
1. https://gosla.sh/page1
2. https://gosla.sh/page2
3. https://gosla.sh/page3
...
Currently, the CLI supports the `--timeout` parameter to set the request timeout in milliseconds:
npx sitemapper https://gosla.sh/sitemap.xml --timeout=5000
Note: The CLI implementation is basic and does not yet support all options available in the JavaScript API. More advanced features like fields filtering, concurrency control, and different output formats require using the JavaScript API directly.
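
As the note says, other output formats need the JavaScript API. One way to do this is to post-process the `sites` array yourself. The sketch below turns object-shaped results (that is, results fetched with the `fields` option) into CSV text; `toCsv`, the column list, and the sample rows are hypothetical and are not something Sitemapper provides.

```javascript
// Hypothetical helper: serialize object-shaped results to CSV text.
// Assumes each entry in `sites` is an object, as returned when `fields` is set.
function toCsv(sites, columns) {
  const escapeCell = (value) => {
    const text = value === undefined || value === null ? '' : String(value);
    // Quote cells containing commas, quotes, or newlines; double embedded quotes.
    return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const header = columns.join(',');
  const rows = sites.map((site) =>
    columns.map((column) => escapeCell(site[column])).join(',')
  );
  return [header, ...rows].join('\n');
}

// Sample data standing in for `sites` from `await sitemapper.fetch()`.
const sampleSites = [
  { loc: 'https://example.com/a', lastmod: '2024-01-01' },
  { loc: 'https://example.com/b?x=1,2' }, // missing lastmod becomes an empty cell
];

const csv = toCsv(sampleSites, ['loc', 'lastmod']);
console.log(csv);
```
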
Contributions from experienced engineers are highly valued. When contributing, please consider:
- Maintain backward compatibility where possible
- Consider performance implications, particularly for large sitemaps
- Add TypeScript types
- Add tests for your change
- Update documentation and examples
- Check for typos
- Code should pass ESLint, Prettier, Spell Check and TypeScript checks
- Try not to bloat the main dependencies with new packages; dev dependencies are fine
- If adding packages, make sure to run `npm install` with the latest npm version to update package-lock.json
- PRs should be focused on a single concern/feature
- Include sufficient context in the PR description
- Reference any relevant issues
- Run `npm test` locally to verify your changes pass the tests
  - Tests sometimes fail because they reference real-world sitemaps. If that happens, try running them again.
- PRs will not run GitHub Actions by default; they need to be run manually by @seantomburke
For substantial changes, consider opening an issue for discussion before implementation.
Note: The CI pipeline enforces TypeScript type checking, linting rules, formatting standards, and test coverage thresholds.
This project is licensed under the MIT License - see the LICENSE file for details.