susm
recursively crawls a website, following HTML links, scripts, stylesheets and sitemaps.
For each file it encounters that references or includes a source map
(e.g., JavaScript bundles, CSS files),
it attempts to locate and download that map.
susm
then attempts to extract any source code files and write them to disk,
preserving their relative paths as defined in the map.
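For context, compiled assets usually point to their map with a sourceMappingURL comment on the final line of the file. As an illustrative check (the host and file names below are placeholders), you can reveal such a reference yourself:
curl -s https://example.com/assets/app.min.js | tail -n 1
For a bundle susm can unpack, this would print something like //# sourceMappingURL=app.min.js.map.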
First, clone the repository:
git clone https://github.com/dixslyf/susm.git
cd susm
To build the scraper, run:
cargo build --release
The compiled binary will be available at target/release/susm
(assuming Cargo's default target directory).
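To check that the build succeeded, you can invoke the binary directly, for example:
./target/release/susm --help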
This repository provides a Nix flake.
To build the scraper with Nix, run:
nix build github:dixslyf/susm
To run the scraper:
nix run github:dixslyf/susm
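When running through Nix, arguments intended for susm go after the -- separator that nix run uses to end its own options. For example (the URL is a placeholder):
nix run github:dixslyf/susm -- site https://example.com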
susm has two primary modes of operation:
- Crawl a website and unpack discovered source maps:
  susm site <URL> [OPTIONS]
- Unpack a single local source map file:
  susm file <PATH> [OPTIONS]
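For example, to crawl a site, or to unpack a map that is already on disk (the URL and path are placeholders):
susm site https://example.com
susm file ./app.min.js.map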
For additional options, run:
susm --help
susm
applies a polite crawling policy by default.
Requests are rate-limited per host to avoid overloading servers.
By default, susm
waits 500 milliseconds between requests with a slight random jitter.
The request interval can be adjusted with the --request-interval (-i) flag.
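For example, to slow the crawler to roughly one request per second (assuming the interval is given in milliseconds, matching the 500 ms default):
susm site https://example.com --request-interval 1000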
susm
also respects the robots.txt
exclusion standard.
Before crawling, it retrieves and parses the site’s robots.txt
file (if present)
and skips any paths disallowed for its user agent.
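For instance, given a robots.txt along these lines (illustrative only):
User-agent: *
Disallow: /admin/
susm would skip every URL under /admin/ and crawl the rest of the site as usual.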