Core Concepts¶
Architecture Overview¶
linktrace has three main components:
┌─────────────────────────────────────────────────┐
│ Spider (Orchestrator)                           │
│ - Manages crawl queue (to_visit)                │
│ - Tracks visited pages                          │
│ - Implements BFS or DFS traversal               │
│ - Returns collection of Documents               │
└──────────────────┬──────────────────────────────┘
                   │
                   │ creates one persistent
                   │
┌──────────────────▼──────────────────────────────┐
│ Crawler (HTTP Engine)                           │
│ - Persistent aiohttp session                    │
│ - Connection pooling                            │
│ - Retries with exponential backoff              │
│ - Caching (optional)                            │
│ - SSL verification                              │
│ - Cookie jar                                    │
└──────────────────┬──────────────────────────────┘
                   │
        ┌──────────┼──────────┐
        │          │          │
   ┌────▼───┐ ┌────▼───┐ ┌────▼───┐
   │aiohttp │ │  lxml  │ │ Cache  │
   │ (HTTP) │ │(parse) │ │ (disk) │
   └────────┘ └────────┘ └────────┘
Spider¶
The Spider is the main entry point. It orchestrates crawling a website.
from linktrace import Spider

spider = Spider(
    start_url="https://example.com",
    max_depth=3,
    traversal_strategy="bfs",  # or "dfs"
)
# Must be awaited from inside a coroutine
documents = await spider.run_async()
Attributes¶
- start_url: Starting URL for the crawl
- max_depth: How many levels deep to follow links
- traversal_strategy: "bfs" (breadth-first) or "dfs" (depth-first)
- visited: Set of visited URLs
- to_visit: Queue of (url, depth) tuples waiting to be crawled
- documents: List of Document objects (results)
Methods¶
- run_async(): Async crawl, returns a list of Documents
- run(): Sync crawl (blocking), returns a list of Documents
Internal Logic¶
- Pop URL from the to_visit queue (FIFO for BFS, LIFO for DFS)
- Skip if visited or exceeds max_depth
- Tell Crawler to fetch + parse the URL
- Add returned Document to results
- Extract internal links, add to queue
- Repeat until queue empty
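The loop above can be sketched with a `collections.deque`, assuming an in-memory link graph in place of real fetching and parsing. The names `SITE` and `crawl_order` are hypothetical and are not part of linktrace; the sketch only illustrates how FIFO and LIFO pops produce the two traversal orders.

```python
from collections import deque

# Hypothetical in-memory site: page -> internal links (stands in for fetch + parse).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def crawl_order(start, max_depth, strategy="bfs"):
    """Return the pages in the order they would be visited."""
    to_visit = deque([(start, 0)])
    visited = set()
    order = []
    while to_visit:
        # FIFO pop for BFS, LIFO pop for DFS.
        url, depth = to_visit.popleft() if strategy == "bfs" else to_visit.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        # Queue the page's internal links one level deeper.
        for link in SITE.get(url, []):
            to_visit.append((link, depth + 1))
    return order
```

With this toy graph, BFS visits both top-level pages before any of their children, while DFS finishes the `/b` branch before it returns to `/a`.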
Crawler¶
The Crawler handles individual document fetching and parsing. It manages:
- HTTP requests via aiohttp
- Connection pooling (10 concurrent, 10 per-host)
- Retries with exponential backoff
- Caching (optional)
- SSL verification
- Automatic cookie handling
Usage¶
from linktrace import Crawler
async with Crawler(ssl_verify=True, cache_dir=".cache") as crawler:
    doc = await crawler.crawl_document_async("https://example.com")
Retry Logic¶
Transient errors (timeouts, connection errors, 5xx responses) are retried with exponential backoff:
wait_time = 2^(attempt - 1) * backoff_factor   (attempts numbered from 1)
Attempt 1 fails: wait 2s
Attempt 2 fails: wait 4s
Attempt 3 fails: wait 8s
Attempt 4 fails: give up
Default: 3 retries, backoff_factor=2.
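The schedule can be reproduced in a few lines. This is a minimal sketch, assuming attempts are numbered from 1 and the exponent is `attempt - 1`, which is the reading under which the formula yields the listed 2s, 4s, 8s waits. `backoff_delays` is a hypothetical helper, not a linktrace function.

```python
def backoff_delays(max_retries=3, backoff_factor=2):
    """Seconds to wait after each failed attempt, before the next retry."""
    return [
        (2 ** (attempt - 1)) * backoff_factor
        for attempt in range(1, max_retries + 1)
    ]


print(backoff_delays())  # waits for the default settings
```

Under the defaults the total time spent waiting before the crawl gives up on a URL is the sum of these delays.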
SSL Verification¶
- Default (ssl_verify=True): Verify cert with system CA bundle (secure)
- Corporate proxy (ssl_verify="/path/to/ca.pem"): Verify cert with custom CA
- Self-signed (ssl_verify=False): Skip verification (⚠️ insecure, testing only)
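One plausible way to turn the three settings into something an aiohttp request accepts is shown below. This is a sketch under assumptions: `build_ssl_argument` is a hypothetical helper, and how linktrace maps `ssl_verify` internally is not stated here. It relies only on the standard library's `ssl.create_default_context`, whose result (or `False`) can be passed as aiohttp's `ssl` argument.

```python
import ssl


def build_ssl_argument(ssl_verify):
    """Map an ssl_verify setting to an SSLContext, or False to skip checks."""
    if ssl_verify is True:
        # Verify against the system CA bundle.
        return ssl.create_default_context()
    if ssl_verify is False:
        # Skip verification entirely (insecure, testing only).
        return False
    # Otherwise treat the value as a path to a custom CA file.
    return ssl.create_default_context(cafile=ssl_verify)
```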
Cookies¶
Cookies are handled automatically:
- Set-Cookie response headers are extracted
- Cookies are sent on subsequent requests to the same domain
- Cookies persist across all requests in a single crawl
- Cookies do NOT persist across separate Spider runs (in-memory only)
Caching¶
Optional disk-based cache:
crawler = Crawler(cache_dir=".webcrawler_cache")
- Responses stored as {cache_dir}/{url_hash}.json
- Default TTL: 1 day
- Expired entries auto-deleted on retrieval
- Highly effective for repeat crawls (2-50x speedup)
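The behaviour listed above (one JSON file per URL hash, a TTL, deletion of expired entries on read) can be sketched with the standard library. The hash function, the JSON field names, and the helpers `cache_put` and `cache_get` are assumptions made for illustration; the actual on-disk format is not documented here.

```python
import hashlib
import json
import time
from pathlib import Path

DEFAULT_TTL = 24 * 60 * 60  # 1 day, in seconds


def _cache_path(cache_dir, url):
    # Assumption: SHA-256 of the URL; the real hash function may differ.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{url_hash}.json"


def cache_put(cache_dir, url, body, now=None):
    """Store a response body together with the time it was cached."""
    path = _cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    stored_at = time.time() if now is None else now
    path.write_text(json.dumps({"url": url, "body": body, "stored_at": stored_at}))


def cache_get(cache_dir, url, ttl=DEFAULT_TTL, now=None):
    """Return the cached body, or None if it is missing or expired."""
    path = _cache_path(cache_dir, url)
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    current = time.time() if now is None else now
    if current - entry["stored_at"] > ttl:
        path.unlink()  # expired entries are deleted on retrieval
        return None
    return entry["body"]
```

The optional `now` argument exists only so that expiry can be exercised without waiting a day.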
Document¶
The Document object represents a crawled page.
class Document:
    url: str                    # Page URL
    title: str                  # HTML <title> tag
    source: str                 # Raw HTML
    status_code: int            # HTTP status (200, 404, etc)
    response_headers: dict      # HTTP response headers
    domain: str                 # Domain extracted via tldextract
    internal_links: List[Link]  # Links to same domain
    external_links: List[Link]  # Links to other domains
    links: List[Link]           # All links (internal + external)
HtmlLink¶
class HtmlLink:
    url: str     # Link URL (absolute)
    text: str    # Link text (anchor text)
    schema: str  # Protocol (http, https, etc)
Example¶
for doc in documents:
    print(f"Title: {doc.title}")
    print(f"Status: {doc.status_code}")
    print(f"Domain: {doc.domain}")
    for link in doc.internal_links:
        print(f"  Internal: {link.url} → '{link.text}'")
    for link in doc.external_links:
        print(f"  External: {link.url} → '{link.text}'")
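The link fields and the internal/external split could be derived roughly as follows. This is a simplified sketch, not linktrace's implementation: `LinkSketch`, `make_link`, and `is_internal` are hypothetical names, and the comparison uses plain hostnames from `urllib.parse` rather than the tldextract-based domain mentioned above, so subdomains are treated as external here.

```python
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse


@dataclass
class LinkSketch:
    url: str     # absolute URL
    text: str    # anchor text
    schema: str  # protocol, e.g. "https"


def make_link(page_url, href, text):
    """Resolve an href against the page URL and record its protocol."""
    absolute = urljoin(page_url, href)
    return LinkSketch(url=absolute, text=text, schema=urlparse(absolute).scheme)


def is_internal(page_url, link):
    # Simplification: exact hostname match instead of registered-domain match.
    return urlparse(page_url).hostname == urlparse(link.url).hostname
```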
Traversal Strategies¶
BFS (Breadth-First Search) — Default¶
Level 0: start_url
↓
Level 1: [all links from start_url]
↓
Level 2: [all links from level 1 pages]
↓
Level 3: [all links from level 2 pages]
Pros:
- Natural depth-limiting (discovers shallow links first)
- Balanced memory use
- Good for broad site exploration
Cons:
- Queue can grow large for wide sites
- Slower for deep hierarchies
Use when: Exploring general site structure, or when shallow links matter more
DFS (Depth-First Search)¶
Start at url1 → follows links deep
→ when blocked, backtracks
→ explores url2 → deep again
Pros:
- Memory-efficient for wide/shallow sites
- Explores complete subtrees
- Good for hierarchical sites
Cons:
- Can hit slow/hanging pages on deep branches
- May take longer to find diverse links
Use when: Crawling documentation, nested directories, or exploring single paths
Connection Pooling¶
Both BFS and DFS benefit from persistent sessions:
# GOOD: One session, connection reuse (10-100x faster)
spider = Spider(start_url="https://example.com")
await spider.run_async()
# vs
# BAD: New session per request (default in many libraries)
# Slow handshake/TLS negotiation for each request
linktrace reuses one persistent session across all requests in a single crawl, so same-domain requests skip the repeated TCP and TLS handshakes and complete much faster.
Error Handling¶
Transient Errors (retried)¶
- Timeout
- Connection reset
- 5xx responses
Non-transient Errors (not retried)¶
- 4xx responses (invalid URL, etc)
- DNS failures
- SSL errors
Failed requests are logged but do not crash the crawl. A Document is still returned, with its status_code set and empty link lists.
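The two categories above can be expressed as a small predicate. This is a sketch that uses standard-library exception types as stand-ins; `is_transient` is a hypothetical helper, and the exception classes linktrace actually inspects (for example aiohttp's own) are not specified here.

```python
import asyncio
import socket
import ssl


def is_transient(status=None, exc=None):
    """Return True if the failure belongs to the retried category."""
    # Not retried: SSL errors and DNS failures.
    if isinstance(exc, (ssl.SSLError, socket.gaierror)):
        return False
    # Retried: timeouts and connection resets or drops.
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError, ConnectionError)):
        return True
    # Retried: 5xx responses. 4xx and everything else is not retried.
    if status is not None:
        return 500 <= status <= 599
    return False
```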
Memory Model¶
Memory usage scales with:
- Number of pages (each Document ~100KB with HTML)
- Queue size (larger for BFS on wide sites)
- Cache size (if enabled)
Typical: ~10MB per 100 pages when the HTML is kept (100 pages × ~100KB).
To reduce memory:
- Disable HTML in Document: don't store doc.source
- Use DFS instead of BFS (smaller queue)
- Disable caching
- Lower max_depth
Concurrency Model¶
Spider creates tasks concurrently but processes results sequentially:
# Pseudocode
while to_visit:
    # Take the next batch off the front of the queue
    batch, to_visit = to_visit[:batch_size], to_visit[batch_size:]
    # Create tasks for the batch (concurrent HTTP)
    tasks = [crawler.fetch(url) for url, depth in batch]
    # Wait for all tasks (concurrent)
    results = await gather(*tasks)
    # Process results (sequential - add new links to queue)
    for doc in results:
        add_internal_links_to_queue(doc)
This means: 10 HTTP requests happen concurrently, but link discovery is sequential per batch.
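A runnable version of this batching pattern, with a fake fetch function in place of real HTTP, might look like the following. `SITE`, `fake_fetch`, and `crawl_in_batches` are hypothetical names chosen for illustration; the sketch shows only the shape of the loop (concurrent fetch per batch, sequential link discovery), not linktrace's actual scheduler.

```python
import asyncio

# Hypothetical in-memory site used in place of real HTTP requests.
SITE = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a/1"],
    "/b": ["/"],
    "/c": [],
    "/a/1": [],
}


async def fake_fetch(url):
    await asyncio.sleep(0)  # yield control, as a network call would
    return url, SITE.get(url, [])


async def crawl_in_batches(start, batch_size=10):
    """Return the list of batches in the order they were fetched."""
    to_visit = [start]
    visited = set()
    batches = []
    while to_visit:
        # Take up to batch_size unvisited URLs off the front of the queue.
        batch = []
        while to_visit and len(batch) < batch_size:
            url = to_visit.pop(0)
            if url not in visited:
                visited.add(url)
                batch.append(url)
        if not batch:
            break
        batches.append(batch)
        # Fetch the whole batch concurrently.
        results = await asyncio.gather(*(fake_fetch(u) for u in batch))
        # Process results sequentially: queue newly discovered links.
        for _, links in results:
            to_visit.extend(link for link in links if link not in visited)
    return batches
```

Calling `asyncio.run(crawl_in_batches("/", batch_size=2))` groups the toy site into batches of at most two pages; links found in one batch only become eligible in a later batch.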
For fine-grained concurrency control, use lower max_retries or adjust timeouts (these affect the aiohttp connector settings indirectly).