
Getting Started

Installation

From PyPI

pip install linktrace

With Optional Export Formats

# All three: pandas, polars, pyarrow
# (quotes keep shells such as zsh from expanding the brackets)
pip install "linktrace[serializers]"

# Individual formats
pip install "linktrace[pandas]"
pip install "linktrace[polars]"
pip install "linktrace[pyarrow]"

Development Installation

git clone https://github.com/JayBaywatch/linktrace
cd linktrace
pip install -e .
pip install -e ".[serializers]"  # Optional formats

Your First Crawl

1. Basic Example

import asyncio
from linktrace import Spider

async def main():
    spider = Spider(start_url="https://example.com", max_depth=1)
    documents = await spider.run_async()
    print(f"Crawled {len(documents)} pages")

asyncio.run(main())

2. Inspect Documents

for doc in documents:
    print(f"URL: {doc.url}")
    print(f"Title: {doc.title}")
    print(f"Status: {doc.status_code}")
    print(f"Internal links: {len(doc.internal_links)}")
    print(f"External links: {len(doc.external_links)}")
    print()

3. Find External Domains

# Find all external domains
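from urllib.parse import urlparse

external_domains = set()
for doc in documents:
    for link in doc.external_links:
        # Parse the host instead of splitting on "/", which fails on URLs without "//"
        domain = urlparse(link.url).netloc
        if domain:
            external_domains.add(domain)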

print(f"Found {len(external_domains)} external domains")
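
Beyond the number of distinct domains, it can be useful to see which ones are linked most often. The following is a minimal sketch using only the standard library. The `external_urls` list and the `count_domains` helper are hypothetical; in practice the URLs would come from each `link.url` in `doc.external_links`.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-in for the URLs collected from doc.external_links
external_urls = [
    "https://www.iana.org/domains/example",
    "https://www.iana.org/help",
    "http://cdn.example.net:8080/app.js",
    "mailto:admin@example.org",
]

def count_domains(urls):
    """Count hostnames, skipping URLs that have none (e.g. mailto:)."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # lowercased, with any port removed
        if host:
            counts[host] += 1
    return counts

domain_counts = count_domains(external_urls)
for domain, n in domain_counts.most_common():
    print(f"{domain}: {n}")
```

Using `hostname` rather than `netloc` drops the port, so `cdn.example.net:8080` and `cdn.example.net` are counted as the same domain.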

4. Export Data

from linktrace import Serializers

serializer = Serializers(documents)

# JSON
serializer.to_json("output.json")

# Pandas DataFrame
df = serializer.to_pandas()
print(df[["url", "title", "link_type"]].head())

# Polars
df_polars = serializer.to_polars()

# PyArrow
table = serializer.to_arrow()

Common Patterns

Crawl with Caching

spider = Spider(
    start_url="https://example.com",
    max_depth=2,
    cache_dir=".webcrawler_cache"  # Enable disk caching
)

# First run: fetches from network
documents = await spider.run_async()

# Second run: uses cache (10-50x faster)
documents = await spider.run_async()
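
The speedup from caching depends on the site and the network, so it is worth measuring on your own crawl. Below is a small timing sketch. The `timed` helper and `fake_crawl` are hypothetical; `fake_crawl` stands in for `spider.run_async` so the snippet runs without network access.

```python
import asyncio
import time

async def timed(label, make_coro):
    """Await the coroutine returned by make_coro() and report elapsed time."""
    start = time.perf_counter()
    result = await make_coro()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

async def fake_crawl():
    # Hypothetical stand-in for spider.run_async()
    await asyncio.sleep(0.05)
    return ["page"]

async def main():
    _, cold = await timed("first run", fake_crawl)
    _, warm = await timed("second run", fake_crawl)
    return cold, warm

cold, warm = asyncio.run(main())
```

To time a real crawl, pass `spider.run_async` in place of `fake_crawl`.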

Deep Crawling (DFS)

spider = Spider(
    start_url="https://docs.example.com",
    max_depth=5,
    traversal_strategy="dfs"  # Depth-first
)
documents = await spider.run_async()
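
To see how depth-first order differs from breadth-first, here is a simplified sketch on a toy link graph. It illustrates the general idea only and is not linktrace's implementation; the `LINKS` graph and the `crawl_order` function are hypothetical.

```python
from collections import deque

# Hypothetical site structure: page -> pages it links to
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [], "/a/2": [], "/b/1": [],
}

def crawl_order(start, strategy, max_depth):
    """Return the visit order for 'bfs' or 'dfs', up to max_depth."""
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest entry, DFS the newest
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue
        links = LINKS.get(url, [])
        # Reverse for DFS so the first link on the page is followed first
        for link in (links if strategy == "bfs" else reversed(links)):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

bfs_order = crawl_order("/", "bfs", max_depth=2)
dfs_order = crawl_order("/", "dfs", max_depth=2)
print("bfs:", bfs_order)  # level by level: /, /a, /b, /a/1, /a/2, /b/1
print("dfs:", dfs_order)  # branch by branch: /, /a, /a/1, /a/2, /b, /b/1
```

Breadth-first finishes every page at one depth before going deeper, while depth-first follows one branch to `max_depth` before backtracking, which tends to suit deeply nested documentation sites.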

Custom Timeouts & Retries

spider = Spider(
    start_url="https://slow-api.example.com",
    request_timeout=60,   # 60 second timeout
    max_retries=5         # Retry 5 times
)
documents = await spider.run_async()
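
For intuition about how a per-request timeout and a retry limit interact, here is a generic `asyncio` sketch. It is not linktrace's internal logic: the `fetch_with_retries` helper, the exponential backoff, and the reading of `max_retries` as "retries after the first attempt" are all assumptions made for illustration.

```python
import asyncio

async def fetch_with_retries(fetch, timeout, max_retries, backoff=0.01):
    """Call fetch() with a per-attempt timeout, retrying on failure.

    Makes one initial attempt plus up to max_retries retries.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(fetch(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff between attempts (an assumed policy)
                await asyncio.sleep(backoff * 2 ** attempt)
    raise last_error

attempts = 0

async def flaky_fetch():
    # Hypothetical request that fails twice, then succeeds
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = asyncio.run(fetch_with_retries(flaky_fetch, timeout=1, max_retries=5))
print(result, "after", attempts, "attempts")
```

Under these assumptions the worst-case wait for a single URL is roughly `timeout * (max_retries + 1)` plus backoff, which is worth keeping in mind when raising both values together.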

Corporate Proxy with CA Certificate

spider = Spider(
    start_url="https://internal.company.com",
    ssl_verify="/etc/ssl/certs/company-ca.pem"  # Custom CA
)
documents = await spider.run_async()

Jupyter Notebook

See notebooks/crawl_cnn.ipynb for interactive examples with Jupyter:

jupyter notebook notebooks/crawl_cnn.ipynb

The notebook demonstrates:

- Basic crawling
- Analyzing link structure
- Exporting to JSON
- Pandas/Polars/PyArrow analysis

Next Steps