API Reference

Spider

Main orchestrator for multi-page crawling.

Constructor

Spider(
    start_url: str,
    max_depth: int = 3,
    debug: bool = False,
    log_name: str = None,
    ssl_verify: bool | str = True,
    verify_hostname: bool = True,
    request_timeout: int = 30,
    cache_dir: str = None,
    max_retries: int = 3,
    traversal_strategy: str = "bfs",
    show_progress: bool = False,
    on_page_crawled: Callable[[Document], Any] = None,
    on_error: Callable[[str, Exception], None] = None,
    on_crawl_complete: Callable[[], None] = None,
    accumulate_results: bool = False,
    request_delay: float = 0.0,
    user_agent: str = "linktrace/0.1.0",
    respect_robots_txt: bool = True
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| start_url | str | (required) | URL to start crawling from |
| max_depth | int | 3 | Maximum depth to follow links (0 = start_url only) |
| debug | bool | False | Enable debug logging (deprecated, use log_name) |
| log_name | str | None | Logger name for filtering logs |
| ssl_verify | bool \| str | True | SSL verification: True (system CA), False (skip), or path to CA cert |
| verify_hostname | bool | True | Verify certificate hostname matches domain |
| request_timeout | int | 30 | Timeout per request in seconds |
| cache_dir | str | None | Directory for disk cache (None = disabled) |
| max_retries | int | 3 | Retry transient errors up to N times |
| traversal_strategy | str | "bfs" | "bfs" (breadth-first) or "dfs" (depth-first) |
| show_progress | bool | False | Show tqdm progress bar with visited/pending counts |
| on_page_crawled | Callable | None | Callback after each page crawl. Supports sync and async. Return value accumulated if accumulate_results=True |
| on_error | Callable | None | Callback on crawl failure. Receives (url, exception) |
| on_crawl_complete | Callable | None | Callback when crawl finishes. Supports async for cleanup |
| accumulate_results | bool | False | If True, accumulate callback return values in results list |
| request_delay | float | 0.0 | Minimum seconds between requests to same domain (0 = no forced delay) |
| user_agent | str | "linktrace/0.1.0" | User-Agent header for requests (affects robots.txt rules) |
| respect_robots_txt | bool | True | Parse and respect robots.txt Crawl-delay directives |
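
The interplay of request_delay, user_agent, and respect_robots_txt can be illustrated with the standard library's urllib.robotparser. This is only a sketch: the helper name effective_delay and the rule of taking the larger of the two delays are assumptions made for illustration, not confirmed linktrace behavior.

```python
from urllib.robotparser import RobotFileParser


def effective_delay(robots_lines, user_agent, request_delay, respect_robots_txt=True):
    """Return the delay in seconds a polite crawler might wait between requests.

    Hypothetical helper: assumes the larger of request_delay and the
    robots.txt Crawl-delay wins.
    """
    if not respect_robots_txt:
        return request_delay
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse in-memory lines; no network access
    crawl_delay = parser.crawl_delay(user_agent)  # None if no directive applies
    if crawl_delay is None:
        return request_delay
    return max(request_delay, float(crawl_delay))


robots = ["User-agent: *", "Crawl-delay: 2"]
print(effective_delay(robots, "linktrace/0.1.0", request_delay=0.5))
print(effective_delay(robots, "linktrace/0.1.0", request_delay=0.5, respect_robots_txt=False))
```

Because robots.txt rules are matched against the User-Agent string, changing user_agent can change which Crawl-delay applies.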

Methods

async run_async() -> List[Document]

Crawl asynchronously. Preferred method.

spider = Spider(start_url="https://example.com")
documents = await spider.run_async()

Returns: List of Document objects

Raises: ValueError if traversal_strategy is invalid

run() -> List[Document]

Crawl synchronously (blocking). Inside an async context, await run_async() instead.

spider = Spider(start_url="https://example.com")
documents = spider.run()  # Blocks until complete

Returns: List of Document objects

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| start_url | str | Starting URL |
| max_depth | int | Max crawl depth |
| visited | set | Set of visited URLs |
| to_visit | list | Queue of (url, depth) tuples |
| documents | list | Results (accumulated Documents) |
| visited_count | int | Number of pages successfully fetched |
| traversal_strategy | str | "bfs" or "dfs" |
| on_page_crawled | Callable \| None | Callback for each crawled page |
| on_error | Callable \| None | Callback for crawl errors |
| on_crawl_complete | Callable \| None | Callback on crawl completion |
| accumulate_results | bool | Whether to accumulate callback results |
| accumulated_results | list | Accumulated callback return values |
| request_delay | float | Minimum delay between requests to same domain |
| user_agent | str | User-Agent header value |
| respect_robots_txt | bool | Whether to respect robots.txt rules |
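
To make the roles of to_visit, visited, max_depth, and traversal_strategy concrete, here is a minimal sketch over a toy in-memory link graph. The function crawl_order and the graph dictionary are hypothetical; a real crawl fetches pages over HTTP and may order links differently.

```python
from collections import deque


def crawl_order(graph, start_url, max_depth=3, traversal_strategy="bfs"):
    """Return URLs in visiting order for a toy link graph (dict: url -> list of urls)."""
    if traversal_strategy not in ("bfs", "dfs"):
        raise ValueError(f"Invalid traversal_strategy: {traversal_strategy!r}")
    to_visit = deque([(start_url, 0)])  # frontier of (url, depth) tuples
    visited = set()
    order = []
    while to_visit:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        if traversal_strategy == "bfs":
            url, depth = to_visit.popleft()
        else:
            url, depth = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:  # max_depth=0 visits start_url only
            for link in graph.get(url, []):
                if link not in visited:
                    to_visit.append((link, depth + 1))
    return order


graph = {"/": ["/a", "/b"], "/a": ["/a1"], "/b": ["/b1"]}
print(crawl_order(graph, "/", traversal_strategy="bfs"))  # ['/', '/a', '/b', '/a1', '/b1']
print(crawl_order(graph, "/", traversal_strategy="dfs"))  # ['/', '/b', '/b1', '/a', '/a1']
```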

Callback Signature

# Sync callback (simple file I/O, aggregation)
def on_page_crawled(doc: Document) -> Any:
    # Process document, return data to accumulate or None
    pass

# Async callback (database, HTTP, etc.)
async def on_page_crawled(doc: Document) -> Any:
    await db.insert(doc.url)
    return doc.url

# Error callback
def on_error(url: str, exception: Exception) -> None:
    logger.error(f"Failed: {url}: {exception}")

# Completion callback (async supported)
async def on_crawl_complete() -> None:
    await db.close()
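
Since callbacks may be either sync or async, a crawler has to decide at call time whether to await the result. One common way to do this, shown here as a sketch with a hypothetical invoke_callback helper (not necessarily how linktrace dispatches callbacks), is to call the function and await the result only if it is awaitable.

```python
import asyncio
import inspect


async def invoke_callback(callback, *args):
    """Call a sync or async callback and return its result (hypothetical helper)."""
    if callback is None:
        return None
    result = callback(*args)
    if inspect.isawaitable(result):  # async def callbacks return a coroutine
        result = await result
    return result


def sync_cb(url):
    return f"sync:{url}"


async def async_cb(url):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return f"async:{url}"


async def main():
    return [
        await invoke_callback(sync_cb, "https://example.com"),
        await invoke_callback(async_cb, "https://example.com"),
    ]


results = asyncio.run(main())
print(results)  # ['sync:https://example.com', 'async:https://example.com']
```

With a dispatcher of this shape, a sync callback runs on the event loop thread, so long blocking work in it would pause other tasks on that loop.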

Return Behavior

| Callback | accumulate_results | Return Value |
| --- | --- | --- |
| None | Any | List of all Documents |
| Provided | False | Empty list [] |
| Provided | True | List of accumulated callback returns |
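
The three cases above can be written as a small decision function. The name select_return_value and its arguments are hypothetical; the sketch only restates the return rules and is not linktrace's own code.

```python
def select_return_value(documents, accumulated_results, callback=None, accumulate_results=False):
    """Pick what a crawl returns, following the three documented cases."""
    if callback is None:
        return documents  # no callback: all Documents, whatever accumulate_results is
    if accumulate_results:
        return accumulated_results  # callback provided and accumulation on
    return []  # callback provided, accumulation off


docs = ["doc1", "doc2"]
collected = ["https://example.com", "https://example.com/about"]
print(select_return_value(docs, collected))
print(select_return_value(docs, collected, callback=print))
print(select_return_value(docs, collected, callback=print, accumulate_results=True))
```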

Crawler

Low-level HTTP engine for fetching and parsing individual documents.

Constructor

Crawler(
    log_level: int = logging.DEBUG,
    log_name: str = None,
    ssl_verify: bool | str = True,
    verify_hostname: bool = True,
    request_timeout: int = 30,
    cache_dir: str = None,
    max_retries: int = 3,
    backoff_factor: int = 2
)

Parameters

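The shared parameters (log_name, ssl_verify, verify_hostname, request_timeout, cache_dir, max_retries) have the same meaning as in Spider. Crawler also accepts: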

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| log_level | int | logging.DEBUG | Logging level |
| backoff_factor | int | 2 | Exponential backoff multiplier (wait_time = 2^attempt * factor) |
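
As a worked example of the backoff formula, the sketch below lists the wait before each retry. It assumes attempts are numbered from 0, which the formula does not state; the helper name backoff_delays is hypothetical.

```python
def backoff_delays(max_retries=3, backoff_factor=2):
    """Wait times in seconds before each retry: 2**attempt * backoff_factor."""
    # Assumption: the first retry is attempt 0.
    return [(2 ** attempt) * backoff_factor for attempt in range(max_retries)]


print(backoff_delays())  # [2, 4, 8]
print(backoff_delays(max_retries=4, backoff_factor=1))  # [1, 2, 4, 8]
```

If attempts were instead counted from 1, every value would double.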

Methods

async __aenter__() -> Crawler

Enter async context manager. Creates persistent aiohttp session.

async with Crawler(...) as crawler:
    doc = await crawler.crawl_document_async("https://example.com")

async __aexit__(exc_type, exc_val, exc_tb)

Exit async context manager. Closes session.

async crawl_document_async(url: str) -> Document

Fetch and parse a single document.

async with Crawler() as crawler:
    doc = await crawler.crawl_document_async("https://example.com/page")
    print(doc.title)
    print(doc.internal_links)

Parameters:

  • url (str): URL to fetch

Returns: Document object

Notes:

  • Checks cache first if cache_dir configured
  • Retries transient errors up to max_retries times
  • Returns Document even on 4xx/5xx (check status_code)
  • Automatic cookie handling

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| session | aiohttp.ClientSession | HTTP session (None until context entered) |
| cache | ResponseCache | Cache object (None if cache_dir not set) |
| ssl_verify | bool \| str | SSL verification setting |
| verify_hostname | bool | Hostname verification setting |
| request_timeout | int | Request timeout in seconds |
| max_retries | int | Max retry attempts |
| backoff_factor | int | Exponential backoff multiplier |
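
The three forms of ssl_verify, combined with verify_hostname, map naturally onto a standard-library ssl.SSLContext. The sketch below shows one plausible mapping; build_ssl_context is a hypothetical helper, and how Crawler actually configures its session is not documented here.

```python
import ssl


def build_ssl_context(ssl_verify=True, verify_hostname=True):
    """Map ssl_verify / verify_hostname onto an ssl.SSLContext (hypothetical helper)."""
    if ssl_verify is False:
        context = ssl.create_default_context()
        context.check_hostname = False  # must be disabled before CERT_NONE
        context.verify_mode = ssl.CERT_NONE
        return context
    if isinstance(ssl_verify, str):
        # Treat the string as a path to a CA bundle; fails if the file is missing.
        context = ssl.create_default_context(cafile=ssl_verify)
    else:
        context = ssl.create_default_context()  # system CA store
    context.check_hostname = verify_hostname
    return context


strict = build_ssl_context()
no_hostname = build_ssl_context(verify_hostname=False)
insecure = build_ssl_context(ssl_verify=False)
print(strict.check_hostname, no_hostname.check_hostname, insecure.verify_mode)
```

Disabling verification removes protection against man-in-the-middle attacks, so ssl_verify=False is best kept to testing.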

Document

Represents a crawled webpage.

Constructor

Document(url: str, source: str = None)

Parameters:

  • url (str): Page URL
  • source (str): Raw HTML (optional)

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| url | str | Page URL (absolute) |
| source | str | Raw HTML response |
| title | str | HTML <title> tag content |
| status_code | int | HTTP status code |
| response_headers | dict | HTTP response headers |
| domain | str | Domain extracted from URL (read-only property) |
| internal_links | List[HtmlLink] | Links to same domain (successful crawl) |
| external_links | List[HtmlLink] | Links to other domains (successful crawl) |
| links | List[HtmlLink] | All links (internal + external) |
| broken_internal_links | List[BrokenLink] | Internal links with HTTP error (4xx, 5xx) |
| broken_external_links | List[BrokenLink] | External links with HTTP error (4xx, 5xx) |

Properties

domain: str (read-only)

Domain extracted from URL using tldextract.

doc.url = "https://www.example.com/page"
doc.domain  # "example"

HtmlLink

Represents a link found in HTML.

Constructor

HtmlLink(url: str, text: str)

Parameters:

  • url (str): Link URL (absolute)
  • text (str): Anchor text

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| url | str | Link destination URL |
| text | str | Anchor text (visible text in <a> tag) |

Properties

schema: str (read-only)

URL scheme (http, https, ftp, etc.).

description: str (read-only)

Alias for text.

Methods

Supports standard Python comparisons:

  • ==, !=: Compare by URL
  • <, >: Sort by URL
  • hash(): Use in sets/dicts
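
URL-based comparison and hashing mean that links can be de-duplicated with a set and sorted directly. The class below is a hypothetical stand-in named LinkSketch, written only to show that behavior; it is not the HtmlLink implementation.

```python
from functools import total_ordering
from urllib.parse import urlparse


@total_ordering
class LinkSketch:
    """Hypothetical stand-in for HtmlLink: equality, ordering, and hashing use the URL only."""

    def __init__(self, url, text):
        self.url = url
        self.text = text

    @property
    def schema(self):
        # URL scheme, e.g. "https"
        return urlparse(self.url).scheme

    def __eq__(self, other):
        if not isinstance(other, LinkSketch):
            return NotImplemented
        return self.url == other.url

    def __lt__(self, other):
        if not isinstance(other, LinkSketch):
            return NotImplemented
        return self.url < other.url

    def __hash__(self):
        return hash(self.url)


links = [
    LinkSketch("https://example.com/b", "B"),
    LinkSketch("https://example.com/a", "A"),
    LinkSketch("https://example.com/a", "A again"),  # same URL, different text
]
unique_sorted = sorted(set(links))
print([link.url for link in unique_sorted])  # ['https://example.com/a', 'https://example.com/b']
```

Note that two links with the same URL but different anchor text count as equal under this scheme.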


BrokenLink

Represents a link that returned an HTTP error status (4xx, 5xx).

Constructor

BrokenLink(url: str, status: int)

Parameters:

  • url (str): Link URL
  • status (int): HTTP status code (e.g., 404, 500)

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| url | str | Link destination URL |
| status_code | int | HTTP error status code |
| text | str | String representation of status code |

Note: BrokenLink inherits from HtmlLink, so it supports the same comparison operations.

Usage

spider = Spider(start_url="https://example.com")
documents = await spider.run_async()

for doc in documents:
    if doc.broken_internal_links:
        print(f"Found {len(doc.broken_internal_links)} broken internal links:")
        for broken in doc.broken_internal_links:
            print(f"  {broken.url} - HTTP {broken.status_code}")

Serializers

Export documents to multiple formats.

Constructor

Serializers(documents: List[Document])

Parameters:

  • documents: List of Document objects

Methods

to_json(output_path: str, include_html: bool = False) -> None

Export to JSON file with nested structure.

serializer = Serializers(documents)
serializer.to_json("output.json", include_html=False)

Parameters:

  • output_path (str): Path to write JSON file
  • include_html (bool): Include raw HTML in output

Example output:

[
  {
    "url": "https://example.com",
    "title": "Example",
    "status_code": 200,
    "domain": "example",
    "response_headers": {...},
    "internal_links": [{"url": "...", "text": "..."}],
    "external_links": [...]
  }
]

to_pandas(include_html: bool = False) -> pd.DataFrame

Export to pandas DataFrame with flattened links (one row per link).

df = serializer.to_pandas()
print(df[["url", "title", "link_url", "link_type"]].head())

Returns: pandas DataFrame

Columns:

  • url, title, status_code, domain
  • link_url, link_text, link_type (internal/external/None)
  • html (if include_html=True)

Notes:

  • One row per link
  • Documents without links have one row with NULL link fields
  • Requires pip install pandas
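
The one-row-per-link layout can be sketched without any DataFrame library. The function flatten_documents below is hypothetical and works on plain dictionaries shaped like the JSON export; it illustrates the row layout only, not the Serializers code.

```python
def flatten_documents(documents):
    """Flatten documents (plain dicts) into one row per link."""
    rows = []
    for doc in documents:
        base = {key: doc.get(key) for key in ("url", "title", "status_code", "domain")}
        links = [(link, "internal") for link in doc.get("internal_links", [])]
        links += [(link, "external") for link in doc.get("external_links", [])]
        if not links:
            # A document without links still gets one row, with empty link fields.
            rows.append({**base, "link_url": None, "link_text": None, "link_type": None})
            continue
        for link, link_type in links:
            rows.append({**base, "link_url": link["url"],
                         "link_text": link["text"], "link_type": link_type})
    return rows


sample = [
    {"url": "https://example.com", "title": "Example", "status_code": 200, "domain": "example",
     "internal_links": [{"url": "https://example.com/about", "text": "About"}],
     "external_links": [{"url": "https://other.org", "text": "Other"}]},
    {"url": "https://example.com/empty", "title": "Empty", "status_code": 200,
     "domain": "example", "internal_links": [], "external_links": []},
]
rows = flatten_documents(sample)
print(len(rows))  # 3
```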

to_polars(include_html: bool = False) -> pl.DataFrame

Export to polars DataFrame (same schema as pandas, often faster).

Returns: polars DataFrame

Requires: pip install polars

to_arrow(include_html: bool = False) -> pa.Table

Export to PyArrow Table for data pipelines.

Returns: pyarrow Table

Requires: pip install pyarrow


ResponseCache

Disk-based response caching (used internally by Crawler).

Constructor

ResponseCache(cache_dir: str, ttl_seconds: int = 86400)

Parameters:

  • cache_dir (str): Directory to store cache files
  • ttl_seconds (int): Time-to-live for entries (default: 1 day)

Methods

async get(url: str) -> CachedResponse | None

Retrieve cached response if not expired.

async set(url: str, status_code: int, headers: dict, content: str) -> None

Store response in cache.

async clear() -> None

Clear all cache files.
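
The get/set/clear contract with a time-to-live can be sketched as follows. TTLCacheSketch is a hypothetical, synchronous stand-in: the file naming (SHA-256 of the URL), the JSON layout, and the dict returned by get are all assumptions, and the real ResponseCache methods are async.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


class TTLCacheSketch:
    """Synchronous, hypothetical stand-in for a disk cache with expiry."""

    def __init__(self, cache_dir, ttl_seconds=86400):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, url):
        # Hash the URL so any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def set(self, url, status_code, headers, content):
        entry = {"stored_at": time.time(), "status_code": status_code,
                 "headers": headers, "content": content}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")

    def get(self, url):
        path = self._path(url)
        if not path.exists():
            return None  # never cached
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["stored_at"] >= self.ttl_seconds:
            return None  # expired
        return entry

    def clear(self):
        for path in self.cache_dir.glob("*.json"):
            path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    cache = TTLCacheSketch(tmp, ttl_seconds=60)
    cache.set("https://example.com", 200, {"Content-Type": "text/html"}, "<html></html>")
    hit = cache.get("https://example.com")
    miss = cache.get("https://example.com/other")
    cache.clear()
    after_clear = cache.get("https://example.com")
```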


Exceptions

ValueError

Raised when:

  • Invalid traversal_strategy (not "bfs" or "dfs")
  • Missing required parameters
  • SSL certificate not found

CrawlException

Raised for crawl-specific errors.

class CrawlException(Exception):
    url: str        # URL that caused error
    message: str    # Error message

Logging

linktrace uses Python's standard logging module.

Configure Logging

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

# Or specific logger
logger = logging.getLogger("linktrace.Spider")
logger.setLevel(logging.DEBUG)

Log Levels

  • DEBUG: Detailed traversal, cache hits/misses, retries
  • INFO: Pages visited, crawl progress
  • WARNING: SSL verification disabled, expired cache entries
  • ERROR: Failed requests after retries, parse errors