A TypeScript port of Google's official C++ robots.txt parser, fully compliant with RFC 9309 (Robots Exclusion Protocol).
- RFC 9309 Compliant: Implements the official Robots Exclusion Protocol specification
- Google-Compatible: Matches Google's crawler behavior, including handling of edge cases and typos
- Zero Dependencies: Pure TypeScript implementation with no runtime dependencies
- Type-Safe: Full TypeScript support with comprehensive type definitions
- Pattern Matching: Supports wildcards (`*`) and end anchors (`$`) in patterns
- Typo Tolerance: Accepts common typos like `disalow`, `useragent`, `site-map`
- Bulk Checking: Parse once, check many URLs efficiently with `ParsedRobots`
```bash
# Using npm
npm install robotstxt-parser

# Using bun
bun add robotstxt-parser

# Using pnpm
pnpm add robotstxt-parser
```

```typescript
import { RobotsMatcher } from "robotstxt-parser";

const robotsTxt = `
User-agent: *
Disallow: /private/
Allow: /public/
User-agent: Googlebot
Allow: /
`;
const matcher = new RobotsMatcher();
// Check if a URL is allowed for a specific user agent
const isAllowed = matcher.oneAgentAllowedByRobots(
  robotsTxt,
  "MyBot",
  "https://example.com/public/page.html",
);
console.log(isAllowed); // true
// Check with multiple user agents
const allowed = matcher.allowedByRobots(
  robotsTxt,
  ["Googlebot", "MyBot"],
  "https://example.com/private/secret.html",
);
console.log(allowed); // true (Googlebot is allowed everywhere)
```

For checking many URLs against the same robots.txt, use `ParsedRobots` to avoid re-parsing:

```typescript
import { ParsedRobots } from "robotstxt-parser";
const robotsTxt = `
User-agent: *
Disallow: /private/
Allow: /public/
`;
// Parse once
const parsed = ParsedRobots.parse(robotsTxt);
// Check many URLs efficiently
const urls = [
"https://example.com/public/page1.html",
"https://example.com/private/secret.html",
"https://example.com/about",
];
const results = parsed.checkUrls("MyBot", urls);
for (const result of results) {
  console.log(`${result.url}: ${result.allowed ? "allowed" : "blocked"}`);
}
// Output:
// https://example.com/public/page1.html: allowed
// https://example.com/private/secret.html: blocked
// https://example.com/about: allowed
```

`RobotsMatcher` is the main class for checking URL access against robots.txt rules.

```typescript
import { RobotsMatcher } from "robotstxt-parser";
const matcher = new RobotsMatcher();
```

| Method | Description |
|---|---|
| `oneAgentAllowedByRobots(robotsTxt, userAgent, url)` | Check if URL is allowed for a single user agent |
| `allowedByRobots(robotsTxt, userAgents[], url)` | Check if URL is allowed for any of the user agents |
| `disallow()` | Returns true if URL is disallowed (after calling `allowedByRobots`) |
| `disallowIgnoreGlobal()` | Same as `disallow()` but ignores `*` rules |
| `everSeenSpecificAgent()` | Returns true if robots.txt contained rules for the specified agent |
| `matchingLine()` | Returns the line number that matched, or 0 |
| `static isValidUserAgentToObey(userAgent)` | Validates user agent format (only `[a-zA-Z_-]` allowed) |
| `static parse(robotsTxt)` | Returns a `ParsedRobots` instance for bulk URL checking |
| `static batchCheck(robotsTxt, userAgent, urls[])` | Convenience method for bulk checking (parses + checks) |
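The per-call accessors (`disallow()`, `everSeenSpecificAgent()`, `matchingLine()`) are only meaningful after a call to `allowedByRobots` or `oneAgentAllowedByRobots`. A minimal sketch of how they might be combined, with illustrative robots.txt content and bot name:

```typescript
import { RobotsMatcher } from "robotstxt-parser";

const robotsTxt = `
User-agent: *
Disallow: /private/
`;

const matcher = new RobotsMatcher();

// Only obey user agents whose names use the allowed character set.
if (RobotsMatcher.isValidUserAgentToObey("MyBot")) {
  const allowed = matcher.allowedByRobots(
    robotsTxt,
    ["MyBot"],
    "https://example.com/private/page.html",
  );

  console.log(allowed);                         // expected: false
  console.log(matcher.disallow());              // expected: true - the URL was disallowed
  console.log(matcher.everSeenSpecificAgent()); // expected: false - only the * group is present
  console.log(matcher.matchingLine());          // expected: the line of "Disallow: /private/"
}
```

For one-off bulk checks, `RobotsMatcher.batchCheck(robotsTxt, userAgent, urls)` wraps the parse-then-check flow described below.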
`ParsedRobots` separates parsing from matching for efficient bulk URL checking: parse once, check many URLs.

```typescript
import { ParsedRobots } from "robotstxt-parser";
const parsed = ParsedRobots.parse(robotsTxt);
// Check multiple URLs
const results = parsed.checkUrls("Googlebot", urls);
// Check a single URL
const result = parsed.checkUrl("Googlebot", "https://example.com/page");| Method | Description |
|---|---|
static parse(robotsTxt) |
Parse robots.txt and return a ParsedRobots instance |
checkUrls(userAgent, urls[]) |
Check multiple URLs, returns UrlCheckResult[] |
checkUrl(userAgent, url) |
Check a single URL, returns UrlCheckResult |
hasSpecificAgent(userAgent) |
Returns true if robots.txt has rules for this agent |
getExplicitAgents() |
Returns array of user-agents explicitly mentioned |
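A short sketch of the agent-introspection helpers and the fields on a single `checkUrl` result (the robots.txt content is illustrative; the fields follow the `UrlCheckResult` interface shown next):

```typescript
import { ParsedRobots } from "robotstxt-parser";

const parsed = ParsedRobots.parse(`
User-agent: Googlebot
Disallow: /nogoogle/

User-agent: *
Disallow: /private/
`);

// Which agents have their own rule groups?
console.log(parsed.getExplicitAgents());           // expected to include "Googlebot"
console.log(parsed.hasSpecificAgent("Googlebot")); // expected: true
console.log(parsed.hasSpecificAgent("MyBot"));     // expected: false - only matched by *

// A single check returns the full result object.
const result = parsed.checkUrl("Googlebot", "https://example.com/nogoogle/page");
console.log(result.allowed);         // expected: false
console.log(result.matchedRuleType); // expected: "disallow"
console.log(result.matchedPattern);  // expected: "/nogoogle/"
```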
```typescript
interface UrlCheckResult {
  url: string; // The URL that was checked
  allowed: boolean; // Whether crawling is allowed
  matchingLine: number; // Line number of matching rule (0 if none)
  matchedPattern: string; // The pattern that matched
  matchedRuleType: "allow" | "disallow" | "none";
}
```

`parseRobotsTxt` is a low-level parsing function for custom handling.

```typescript
import { parseRobotsTxt, RobotsParseHandler } from "robotstxt-parser";
class MyHandler extends RobotsParseHandler {
  handleRobotsStart(): void {
    /* ... */
  }
  handleRobotsEnd(): void {
    /* ... */
  }
  handleUserAgent(lineNum: number, value: string): void {
    /* ... */
  }
  handleAllow(lineNum: number, value: string): void {
    /* ... */
  }
  handleDisallow(lineNum: number, value: string): void {
    /* ... */
  }
  handleSitemap(lineNum: number, value: string): void {
    /* ... */
  }
  handleUnknownAction(lineNum: number, action: string, value: string): void {
    /* ... */
  }
}
parseRobotsTxt(robotsTxtContent, new MyHandler());
```

`RobotsParsingReporter` is a parse handler that collects detailed information about each line.

```typescript
import {
  parseRobotsTxt,
  RobotsParsingReporter,
  RobotsTagName,
} from "robotstxt-parser";
const reporter = new RobotsParsingReporter();
parseRobotsTxt(robotsTxt, reporter);
console.log(reporter.validDirectives()); // Count of valid directives
console.log(reporter.unusedDirectives()); // Count of unrecognized tags
console.log(reporter.lastLineSeen()); // Last line number parsed
console.log(reporter.parseResults()); // Array of RobotsParsedLine objects
```

`RobotsMatchStrategy` is the interface for implementing custom matching strategies.

```typescript
import {
  RobotsMatchStrategy,
  LongestMatchRobotsMatchStrategy,
  matches,
} from "robotstxt-parser";
// Default implementation uses longest-match strategy
const strategy = new LongestMatchRobotsMatchStrategy();
// Custom implementation
class MyStrategy implements RobotsMatchStrategy {
  matchAllow(path: string, pattern: string): number {
    // Return priority: pattern length on match, -1 on no match
    // (one possible implementation, using the exported matches() helper)
    return matches(path, pattern) ? pattern.length : -1;
  }
  matchDisallow(path: string, pattern: string): number {
    // Return priority: pattern length on match, -1 on no match
    return matches(path, pattern) ? pattern.length : -1;
  }
}
```

The following types and enums are also exported:

```typescript
import {
  KeyType, // Enum: USER_AGENT, SITEMAP, ALLOW, DISALLOW, UNKNOWN
  RobotsTagName, // Enum: Unknown, UserAgent, Allow, Disallow, Sitemap, Unused
  LineMetadata, // Interface for line parsing metadata
  RobotsParsedLine, // Interface for complete parsed line info
} from "robotstxt-parser";import {
  getPathParamsQuery, // Extract path from URL
  maybeEscapePattern, // Normalize percent-encoding
  matches, // Check if path matches pattern
} from "robotstxt-parser";
// Extract path from URL
getPathParamsQuery("https://example.com/path?query=1"); // '/path?query=1'
// Check pattern matching
matches("/foo/bar", "/foo/*"); // true
matches("/foo/bar", "/baz"); // falseThe parser supports standard robots.txt pattern syntax:
| Pattern | Matches |
|---|---|
| `/path` | Any URL starting with `/path` |
| `/path*` | Same as `/path` (implicit) |
| `*.php` | Any URL containing `.php` |
| `/path$` | Exactly `/path` (end anchor) |
| `/fish*.php` | `/fish.php`, `/fish123.php`, etc. |
Priority: When both Allow and Disallow match, the longer pattern wins.
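A quick sketch of the longest-match rule in practice (the URL and rules are illustrative):

```typescript
import { RobotsMatcher } from "robotstxt-parser";

// Both rules match /folder/page; the Allow pattern is longer, so it wins.
const robotsTxt = `
User-agent: *
Disallow: /folder/
Allow: /folder/page
`;

const matcher = new RobotsMatcher();
console.log(
  matcher.oneAgentAllowedByRobots(robotsTxt, "MyBot", "https://example.com/folder/page"),
); // true - "/folder/page" (12 chars) beats "/folder/" (8 chars)
```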
This library is designed for correctness and RFC 9309 compliance. When using it in production environments that fetch robots.txt from untrusted sources, consider these safeguards:
The library does not enforce a file size limit. Both RFC 9309 and Google require crawlers to parse at least 500 KiB of robots.txt content, so implement a size check before parsing:

```typescript
const MAX_ROBOTS_SIZE = 500 * 1024; // 500 KiB (per RFC 9309)
async function fetchAndParse(url: string) {
  const response = await fetch(url);
  const contentLength = response.headers.get('content-length');
  if (contentLength && parseInt(contentLength) > MAX_ROBOTS_SIZE) {
    throw new Error('robots.txt too large');
  }
  const text = await response.text();
  if (text.length > MAX_ROBOTS_SIZE) {
    throw new Error('robots.txt too large');
  }
  return ParsedRobots.parse(text);
}
```

Implement timeouts when fetching robots.txt to prevent hanging requests.
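One way to do this, assuming a runtime that provides `fetch` and `AbortSignal.timeout` (the helper name and the 5-second value are illustrative):

```typescript
// Abort the robots.txt request if it takes longer than 5 seconds.
async function fetchRobotsTxt(url: string, timeoutMs = 5000): Promise<string> {
  const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  if (!response.ok) {
    throw new Error(`Failed to fetch robots.txt: ${response.status}`);
  }
  return response.text();
}
```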
This library is a port of Google's C++ parser and includes several behaviors that are Google-specific extensions beyond RFC 9309:
| Behavior | This library | RFC 9309 |
|---|---|---|
| Line length limit | Truncates at 16,664 bytes | No limit specified |
| Typo tolerance | Accepts "disalow", "useragent", etc. | "MAY be lenient" (unspecified) |
| `index.html` normalization | `Allow: /path/index.html` also allows `/path/` | Not specified |
| User-agent `*` with trailing text | `* foo` treated as global agent | Not specified |
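For example, the typo tolerance in the table means misspelled directives are still honored; a sketch (the robots.txt content and bot name are illustrative):

```typescript
import { RobotsMatcher } from "robotstxt-parser";

// "useragent" and "Disalow" are misspellings that the parser still recognizes.
const robotsTxt = `
useragent: *
Disalow: /private/
`;

const matcher = new RobotsMatcher();
const allowed = matcher.oneAgentAllowedByRobots(
  robotsTxt,
  "MyBot",
  "https://example.com/private/page.html",
);

console.log(allowed); // expected: false - the misspelled directives still apply
```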
The core matching behavior (longest-match-wins, case-insensitive user-agent matching, UTF-8 encoding) follows RFC 9309.
Note: This library only handles parsing and matching. HTTP behaviors like redirect following, caching, and status code handling are your responsibility to implement.
```
src/
├── index.ts # Main entry point, re-exports public API
├── matcher.ts # RobotsMatcher class - URL matching logic
├── parsed-robots.ts # ParsedRobots class - bulk URL checking
├── parser.ts # robots.txt parsing engine
├── pattern-matcher.ts # Wildcard pattern matching algorithm
├── match-strategy.ts # Match priority strategy interface
├── parsed-key.ts # Directive key recognition (with typo support)
├── reporter.ts # RobotsParsingReporter for analysis
├── url-utils.ts # URL path extraction and encoding
├── types.ts # TypeScript interfaces and enums
└── constants.ts # Configuration constants
tests/
├── matcher.test.ts # URL matching tests
├── bulk-check.test.ts # Bulk URL checking tests
├── reporter.test.ts # Parser reporting tests
└── url-utils.test.ts # URL utility tests
```

```bash
# Install dependencies
bun install
# Run tests
bun test
# Build for distribution
bun run build
```

Licensed under Apache-2.0.
This is a TypeScript port of Google's robots.txt parser.