API Reference¶

Spider¶

Main orchestrator for multi-page crawling.

Constructor¶

Spider(
    start_url: str,
    max_depth: int = 3,
    debug: bool = False,
    log_name: str = None,
    ssl_verify: bool | str = True,
    verify_hostname: bool = True,
    request_timeout: int = 30,
    cache_dir: str = None,
    max_retries: int = 3,
    traversal_strategy: str = "bfs",
    show_progress: bool = False,
    on_page_crawled: Callable[[Document], Any] = None,
    on_error: Callable[[str, Exception], None] = None,
    on_crawl_complete: Callable[[], None] = None,
    accumulate_results: bool = False,
    request_delay: float = 0.0,
    user_agent: str = "linktrace/0.1.0",
    respect_robots_txt: bool = True
)

Parameters¶

Parameter	Type	Default	Description
`start_url`	str	—	URL to start crawling from (required)
`max_depth`	int	3	Maximum depth to follow links (0 = start_url only)
`debug`	bool	False	Enable debug logging (deprecated, use log_name)
`log_name`	str	None	Logger name for filtering logs
`ssl_verify`	bool | str	True	SSL verification: True (system CA), False (skip), or path to CA cert
`verify_hostname`	bool	True	Verify certificate hostname matches domain
`request_timeout`	int	30	Timeout per request in seconds
`cache_dir`	str	None	Directory for disk cache (None = disabled)
`max_retries`	int	3	Retry transient errors up to N times
`traversal_strategy`	str	"bfs"	"bfs" (breadth-first) or "dfs" (depth-first)
`show_progress`	bool	False	Show tqdm progress bar with visited/pending counts
`on_page_crawled`	Callable	None	Callback after each page crawl. Supports sync and async. Return value accumulated if `accumulate_results=True`
`on_error`	Callable	None	Callback on crawl failure. Receives (url, exception)
`on_crawl_complete`	Callable	None	Callback when crawl finishes. Supports async for cleanup
`accumulate_results`	bool	False	If True, accumulate callback return values in results list
`request_delay`	float	0.0	Minimum seconds between requests to same domain (0 = no forced delay)
`user_agent`	str	"linktrace/0.1.0"	User-Agent header for requests (affects robots.txt rules)
`respect_robots_txt`	bool	True	Parse and respect robots.txt Crawl-delay directives
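The `request_delay` row describes a minimum gap between requests to the same domain. A minimal sketch of that idea follows; `DomainThrottle` and its methods are hypothetical names used for illustration and are not part of linktrace. The sketch only computes how long a caller would need to wait, so it does not sleep.

```python
from urllib.parse import urlparse


class DomainThrottle:
    """Hypothetical helper: tracks when each host was last requested."""

    def __init__(self, request_delay: float = 0.0) -> None:
        self.request_delay = request_delay
        self._last_request: dict[str, float] = {}

    def wait_needed(self, url: str, now: float) -> float:
        """Return how many seconds to wait before requesting `url` at time `now`."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is None:
            return 0.0  # first request to this host, no delay needed
        return max(0.0, self.request_delay - (now - last))

    def record(self, url: str, now: float) -> None:
        """Remember that the host of `url` was requested at time `now`."""
        self._last_request[urlparse(url).netloc] = now


throttle = DomainThrottle(request_delay=2.0)
throttle.record("https://example.com/a", now=100.0)
same_host_wait = throttle.wait_needed("https://example.com/b", now=100.5)
other_host_wait = throttle.wait_needed("https://example.org/", now=100.5)
```

With `request_delay=0.0` the computed wait is always zero, which matches the "no forced delay" default.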

Methods¶

`async run_async() -> List[Document]`¶

Crawl asynchronously. Preferred method.

spider = Spider(start_url="https://example.com")
documents = await spider.run_async()

Returns: List of Document objects

Raises: ValueError if traversal_strategy is invalid (not "bfs" or "dfs")

`run() -> List[Document]`¶

Crawl synchronously (blocks until the crawl completes). From code that is already async, await run_async() instead.

spider = Spider(start_url="https://example.com")
documents = spider.run()  # Blocks until complete

Returns: List of Document objects

Attributes¶

Attribute	Type	Description
`start_url`	str	Starting URL
`max_depth`	int	Max crawl depth
`visited`	set	Set of visited URLs
`to_visit`	list	Queue of (url, depth) tuples
`documents`	list	Results (accumulated Documents)
`visited_count`	int	Number of pages successfully fetched
`traversal_strategy`	str	"bfs" or "dfs"
`on_page_crawled`	Callable | None	Callback for each crawled page
`on_error`	Callable | None	Callback for crawl errors
`on_crawl_complete`	Callable | None	Callback on crawl completion
`accumulate_results`	bool	Whether to accumulate callback results
`accumulated_results`	list	Accumulated callback return values
`request_delay`	float	Minimum delay between requests to same domain
`user_agent`	str	User-Agent header value
`respect_robots_txt`	bool	Whether to respect robots.txt rules
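To illustrate how `traversal_strategy`, `max_depth`, `visited`, and `to_visit` could interact, here is a minimal sketch over an in-memory link graph. The `LINKS` graph and the `crawl_order` function are assumptions made for this example, and a `deque` is used instead of a list for cheap pops from the front; the library's internal loop may differ.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/b1"],
    "/a1": [],
    "/b1": [],
}


def crawl_order(start: str, max_depth: int, strategy: str = "bfs") -> list[str]:
    """Return the order in which pages would be visited."""
    if strategy not in ("bfs", "dfs"):
        raise ValueError(f"Invalid traversal_strategy: {strategy!r}")
    to_visit = deque([(start, 0)])  # (url, depth) tuples
    visited: set[str] = set()
    order: list[str] = []
    while to_visit:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        url, depth = to_visit.popleft() if strategy == "bfs" else to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:  # only follow links while below the depth limit
            for link in LINKS.get(url, []):
                if link not in visited:
                    to_visit.append((link, depth + 1))
    return order


bfs_order = crawl_order("/", max_depth=2, strategy="bfs")
dfs_order = crawl_order("/", max_depth=2, strategy="dfs")
depth_zero = crawl_order("/", max_depth=0)
```

In this sketch BFS visits `/`, `/a`, `/b`, `/a1`, `/b1`, while DFS visits `/`, `/b`, `/b1`, `/a`, `/a1`. With `max_depth=0` only the start page is visited.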

Callback Signature¶

# Sync callback (simple file I/O, aggregation)
def on_page_crawled(doc: Document) -> Any:
    # Process document, return data to accumulate or None
    pass

# Async callback (database, HTTP, etc.)
async def on_page_crawled(doc: Document) -> Any:
    await db.insert(doc.url)
    return doc.url

# Error callback
def on_error(url: str, exception: Exception) -> None:
    logger.error(f"Failed: {url}: {exception}")

# Completion callback (async supported)
async def on_crawl_complete() -> None:
    await db.close()
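Since callbacks may be sync or async, a caller has to handle both kinds. One common way to do that, sketched here with a hypothetical `invoke_callback` helper (not taken from linktrace), is to call the function and await the result only when it is awaitable:

```python
import asyncio
import inspect
from typing import Any, Callable


async def invoke_callback(callback: Callable[..., Any], *args: Any) -> Any:
    """Call `callback` and await its result if it returned an awaitable."""
    result = callback(*args)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_callback(url: str) -> str:
    return url.upper()


async def async_callback(url: str) -> int:
    await asyncio.sleep(0)  # stands in for real async I/O
    return len(url)


async def main() -> list:
    return [
        await invoke_callback(sync_callback, "https://example.com"),
        await invoke_callback(async_callback, "https://example.com"),
    ]


results = asyncio.run(main())
```

Plain strings stand in for Document objects here so that the example runs without the library installed.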

Return Behavior¶

Callback	`accumulate_results`	Return Value
None	Any	List of all Documents
Provided	False	Empty list []
Provided	True	List of accumulated callback returns
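The table above can be read as a small decision function. The sketch below models it with a hypothetical `resolve_return` helper; it assumes every callback return value is accumulated (including None), which the table does not specify.

```python
from typing import Any, Callable, Optional


def resolve_return(
    documents: list,
    callback: Optional[Callable[[Any], Any]],
    accumulate_results: bool,
) -> list:
    """Model of the table above: decide what a crawl would return."""
    if callback is None:
        return list(documents)  # no callback: all documents are returned
    returned = [callback(doc) for doc in documents]
    return returned if accumulate_results else []


docs = ["doc-a", "doc-b"]  # plain strings stand in for Document objects
no_callback = resolve_return(docs, None, accumulate_results=True)
discarded = resolve_return(docs, len, accumulate_results=False)
accumulated = resolve_return(docs, len, accumulate_results=True)
```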

Crawler¶

Low-level HTTP engine for fetching and parsing individual documents.

Constructor¶

Crawler(
    log_level: int = logging.DEBUG,
    log_name: str = None,
    ssl_verify: bool | str = True,
    verify_hostname: bool = True,
    request_timeout: int = 30,
    cache_dir: str = None,
    max_retries: int = 3,
    backoff_factor: int = 2
)

Parameters¶

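The shared parameters (log_name, ssl_verify, verify_hostname, request_timeout, cache_dir, max_retries) have the same meaning as in Spider. Additional parameters: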

Parameter	Type	Default	Description
`log_level`	int	logging.DEBUG	Logging level
`backoff_factor`	int	2	Exponential backoff multiplier (wait_time = 2^attempt * factor)
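As a worked example of the formula `wait_time = 2^attempt * factor`, the sketch below lists the wait before each retry. It assumes attempts are numbered from 0, which the table does not state; `backoff_delays` is a hypothetical helper name.

```python
def backoff_delays(max_retries: int = 3, backoff_factor: int = 2) -> list[int]:
    """Wait time in seconds before each retry, with attempts numbered from 0 (an assumption)."""
    return [(2 ** attempt) * backoff_factor for attempt in range(max_retries)]


delays = backoff_delays()
```

With the default values this gives waits of 2, 4, and 8 seconds. If attempts were numbered from 1 instead, each wait would double.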

Methods¶

`async __aenter__() -> Crawler`¶

Enter async context manager. Creates persistent aiohttp session.

async with Crawler(...) as crawler:
    doc = await crawler.crawl_document_async("https://example.com")

`async __aexit__(exc_type, exc_val, exc_tb)`¶

Exit async context manager. Closes session.

`async crawl_document_async(url: str) -> Document`¶

Fetch and parse a single document.

async with Crawler() as crawler:
    doc = await crawler.crawl_document_async("https://example.com/page")
    print(doc.title)
    print(doc.internal_links)

Parameters:
- url (str): URL to fetch

Returns: Document object

Notes:
- Checks the cache first if cache_dir is configured
- Retries transient errors up to max_retries times
- Returns a Document even on 4xx/5xx responses (check status_code)
- Handles cookies automatically

Attributes¶

Attribute	Type	Description
`session`	aiohttp.ClientSession	HTTP session (None until context entered)
`cache`	ResponseCache	Cache object (None if cache_dir not set)
`ssl_verify`	bool | str	SSL verification setting
`verify_hostname`	bool	Hostname verification setting
`request_timeout`	int	Request timeout in seconds
`max_retries`	int	Max retry attempts
`backoff_factor`	int	Exponential backoff multiplier
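The `ssl_verify` and `verify_hostname` settings map naturally onto a standard-library `ssl.SSLContext`. The following is a sketch of one possible mapping, not the library's actual code; `build_ssl_context` is a hypothetical helper name.

```python
import ssl
from typing import Union


def build_ssl_context(
    ssl_verify: Union[bool, str] = True,
    verify_hostname: bool = True,
) -> ssl.SSLContext:
    """Translate the two settings into a standard-library SSLContext."""
    if ssl_verify is False:
        context = ssl.create_default_context()
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        return context
    if ssl_verify is True:
        context = ssl.create_default_context()  # system CA store
    else:
        context = ssl.create_default_context(cafile=ssl_verify)  # custom CA file
    context.check_hostname = verify_hostname
    return context


strict = build_ssl_context()
no_hostname = build_ssl_context(verify_hostname=False)
insecure = build_ssl_context(ssl_verify=False)
```

A context built this way can be handed to aiohttp through the `ssl` argument of a request or connector.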

Document¶

Represents a crawled webpage.

Constructor¶

Document(url: str, source: str = None)

Parameters:
- url (str): Page URL
- source (str): Raw HTML (optional)

Attributes¶

Attribute	Type	Description
`url`	str	Page URL (absolute)
`source`	str	Raw HTML response
`title`	str	HTML `<title>` tag content
`status_code`	int	HTTP status code
`response_headers`	dict	HTTP response headers
`domain`	str	Domain extracted from URL (read-only property)
`internal_links`	List[HtmlLink]	Links to same domain (successful crawl)
`external_links`	List[HtmlLink]	Links to other domains (successful crawl)
`links`	List[HtmlLink]	All links (internal + external)
`broken_internal_links`	List[BrokenLink]	Internal links with HTTP error (4xx, 5xx)
`broken_external_links`	List[BrokenLink]	External links with HTTP error (4xx, 5xx)

Properties¶

`domain: str` (read-only)¶

Domain name extracted from the URL using tldextract, without the subdomain or public suffix.

doc = Document(url="https://www.example.com/page")
doc.domain  # "example"

HtmlLink¶

Represents a link found in HTML.

Constructor¶

HtmlLink(url: str, text: str)

Parameters:
- url (str): Link URL (absolute)
- text (str): Anchor text

Attributes¶

Attribute	Type	Description
`url`	str	Link destination URL
`text`	str	Anchor text (visible text in `<a>` tag)

Properties¶

`schema: str` (read-only)¶

URL scheme (http, https, ftp, etc).

`description: str` (read-only)¶

Alias for text.

Methods¶

Supports standard Python comparisons:
- ==, !=: compare by URL
- <, >: sort by URL
- hash(): use in sets/dicts
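The practical effect of URL-based comparison is that links deduplicate and sort by destination, regardless of anchor text. The sketch below uses a hypothetical `Link` class as a stand-in for HtmlLink so that it runs without the library; the real class may be implemented differently.

```python
from functools import total_ordering


@total_ordering
class Link:
    """Hypothetical stand-in for HtmlLink: identity and ordering depend only on the URL."""

    def __init__(self, url: str, text: str) -> None:
        self.url = url
        self.text = text

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Link):
            return NotImplemented
        return self.url == other.url

    def __lt__(self, other: "Link") -> bool:
        if not isinstance(other, Link):
            return NotImplemented
        return self.url < other.url

    def __hash__(self) -> int:
        # Hash by URL so that equal links land in the same set/dict slot.
        return hash(self.url)


links = [
    Link("https://example.com/b", "B"),
    Link("https://example.com/a", "A"),
    Link("https://example.com/a", "A again"),
]
unique_urls = sorted(link.url for link in set(links))  # duplicates collapse
sorted_texts = [link.text for link in sorted(links)]   # stable sort by URL
```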

BrokenLink¶

Represents a link that returned an HTTP error status (4xx, 5xx).

Constructor¶

BrokenLink(url: str, status: int)

Parameters:
- url (str): Link URL
- status (int): HTTP status code (e.g., 404, 500)

Attributes¶

Attribute	Type	Description
`url`	str	Link destination URL
`status_code`	int	HTTP error status code
`text`	str	String representation of status code

Note: BrokenLink inherits from HtmlLink, so it supports the same comparison operations.

Usage¶

spider = Spider(start_url="https://example.com")
documents = await spider.run_async()

for doc in documents:
    if doc.broken_internal_links:
        print(f"Found {len(doc.broken_internal_links)} broken internal links:")
        for broken in doc.broken_internal_links:
            print(f"  {broken.url} - HTTP {broken.status_code}")

Serializers¶

Export documents to multiple formats.

Constructor¶

Serializers(documents: List[Document])

Parameters:
- documents: List of Document objects

Methods¶

`to_json(output_path: str, include_html: bool = False) -> None`¶

Export to JSON file with nested structure.

serializer = Serializers(documents)
serializer.to_json("output.json", include_html=False)

Parameters:
- output_path (str): Path to write JSON file
- include_html (bool): Include raw HTML in output

Example output:

[
  {
    "url": "https://example.com",
    "title": "Example",
    "status_code": 200,
    "domain": "example",
    "response_headers": {...},
    "internal_links": [{"url": "...", "text": "..."}],
    "external_links": [...]
  }
]

`to_pandas(include_html: bool = False) -> pd.DataFrame`¶

Export to pandas DataFrame with flattened links (one row per link).

df = serializer.to_pandas()
print(df[["url", "title", "link_url", "link_type"]].head())

Returns: pandas DataFrame

Columns:
- url, title, status_code, domain
- link_url, link_text, link_type (internal/external/None)
- html (if include_html=True)

Notes:
- One row per link
- Documents without links have one row with NULL link fields
- Requires pip install pandas
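The "one row per link" layout can be sketched with plain dictionaries. The `flatten` function and the dict-based documents below are assumptions for illustration; the actual serializer works on Document objects and may include more columns.

```python
def flatten(documents: list[dict]) -> list[dict]:
    """Produce one row per link; documents without links get one row with None link fields."""
    rows = []
    for doc in documents:
        base = {"url": doc["url"], "title": doc["title"]}
        links = [(link, "internal") for link in doc.get("internal_links", [])]
        links += [(link, "external") for link in doc.get("external_links", [])]
        if not links:
            rows.append({**base, "link_url": None, "link_text": None, "link_type": None})
        for link, link_type in links:
            rows.append({
                **base,
                "link_url": link["url"],
                "link_text": link["text"],
                "link_type": link_type,
            })
    return rows


documents = [
    {
        "url": "https://example.com",
        "title": "Example",
        "internal_links": [{"url": "https://example.com/about", "text": "About"}],
        "external_links": [{"url": "https://example.org", "text": "Other"}],
    },
    {"url": "https://example.com/empty", "title": "Empty"},
]
rows = flatten(documents)
```

A list of row dictionaries like this can be passed directly to `pd.DataFrame(rows)` when pandas is installed.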

`to_polars(include_html: bool = False) -> pl.DataFrame`¶

Export to polars DataFrame (same schema as pandas, often faster).

Returns: polars DataFrame

Requires: pip install polars

`to_arrow(include_html: bool = False) -> pa.Table`¶

Export to PyArrow Table for data pipelines.

Returns: pyarrow Table

Requires: pip install pyarrow

ResponseCache¶

Disk-based response caching (used internally by Crawler).

Constructor¶

ResponseCache(cache_dir: str, ttl_seconds: int = 86400)

Parameters:
- cache_dir (str): Directory to store cache files
- ttl_seconds (int): Time-to-live for entries (default: 1 day)

Methods¶

`async get(url: str) -> CachedResponse | None`¶

Retrieve cached response if not expired.

`async set(url: str, status_code: int, headers: dict, content: str) -> None`¶

Store response in cache.

`async clear() -> None`¶

Clear all cache files.
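To make the TTL behaviour concrete, here is a minimal sketch of a disk cache with expiry. `TinyDiskCache` is a hypothetical stand-in, not the library's ResponseCache: its methods are synchronous for brevity, and the file layout (one JSON file per hashed URL) is an assumption.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class TinyDiskCache:
    """Hypothetical TTL cache; not the library's ResponseCache."""

    def __init__(self, cache_dir: str, ttl_seconds: int = 86400) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, url: str) -> Path:
        # Hash the URL so that any URL maps to a safe file name.
        return self.cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".json")

    def set(self, url: str, status_code: int, headers: dict, content: str) -> None:
        entry = {"stored_at": time.time(), "status_code": status_code,
                 "headers": headers, "content": content}
        self._path(url).write_text(json.dumps(entry))

    def get(self, url: str, now: Optional[float] = None) -> Optional[dict]:
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        now = time.time() if now is None else now
        if now - entry["stored_at"] > self.ttl_seconds:
            return None  # entry is older than the TTL
        return entry

    def clear(self) -> None:
        for path in self.cache_dir.glob("*.json"):
            path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    cache = TinyDiskCache(tmp, ttl_seconds=60)
    cache.set("https://example.com", 200, {"Content-Type": "text/html"}, "<html></html>")
    fresh = cache.get("https://example.com")
    expired = cache.get("https://example.com", now=time.time() + 120)
    cache.clear()
    after_clear = cache.get("https://example.com")
```

The optional `now` argument exists only so the expiry path can be exercised without waiting for the TTL to elapse.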

Exceptions¶

`ValueError`¶

Raised when:
- traversal_strategy is invalid (not "bfs" or "dfs")
- Required parameters are missing
- The SSL certificate file is not found

`CrawlException`¶

Raised for crawl-specific errors.

class CrawlException(Exception):
    url: str        # URL that caused error
    message: str    # Error message

Logging¶

linktrace uses Python's standard logging module.

Configure Logging¶

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

# Or specific logger
logger = logging.getLogger("linktrace.Spider")
logger.setLevel(logging.DEBUG)

Log Levels¶

DEBUG: Detailed traversal, cache hits/misses, retries
INFO: Pages visited, crawl progress
WARNING: SSL verification disabled, expired cache entries
ERROR: Failed requests after retries, parse errors
