Troubleshooting¶

SSL/Certificate Issues¶

"SSL: CERTIFICATE_VERIFY_FAILED"¶

Problem: Certificate verification failed.

Solutions:

Self-signed certificates (testing only):

spider = Spider(
    start_url="https://self-signed.example.com",
    ssl_verify=False  # ⚠️ Insecure
)

Corporate proxy with CA bundle:

spider = Spider(
    start_url="https://internal.company.com",
    ssl_verify="/etc/ssl/certs/company-ca.pem"
)

Skip hostname verification only:

spider = Spider(
    start_url="https://example.com",
    ssl_verify=True,
    verify_hostname=False
)

"Can't connect to HTTPS URL because the SSL module is not available"¶

Problem: Python built without SSL support.

Solution: Rebuild Python with OpenSSL or use a prebuilt Python distribution.
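
To confirm the diagnosis before rebuilding, you can ask the interpreter whether the ssl module is present. This is a minimal standard-library sketch; the helper name ssl_support is hypothetical and not part of linktrace.

```python
import importlib


def ssl_support():
    """Return the linked OpenSSL version string, or None if the ssl module is missing."""
    try:
        ssl = importlib.import_module("ssl")
    except ImportError:
        return None
    return ssl.OPENSSL_VERSION


if __name__ == "__main__":
    version = ssl_support()
    print(version or "This Python build has no SSL support")
```

If this prints a version string, the interpreter has SSL support and the error probably comes from a different Python installation or virtual environment than the one you tested.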

Connection & Timeout Issues¶

"Connection refused" or "No route to host"¶

Problem: Cannot reach the target server.

Debugging:

- Check the URL spelling
- Verify that the server is running
- Test connectivity: ping example.com
- Try the URL in a browser first

# Verify URL is correct
spider = Spider(start_url="https://example.com")
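
Some hosts block ping, so a TCP-level check can be more telling than the connectivity test above. The sketch below uses only the standard library; can_connect is a hypothetical helper, not a linktrace function, and port 443 is assumed because the examples use HTTPS.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    print(can_connect("example.com"))
```

A False result here points to a network, DNS, or firewall problem rather than anything in the crawler configuration.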

"Request timeout" / "Timeout waiting for response"¶

Problem: Server too slow or network latency.

Solutions:

Increase timeout:

spider = Spider(
    start_url="https://slow-server.example.com",
    request_timeout=60  # 60 seconds instead of default 30
)

Increase retries:

spider = Spider(
    start_url="https://example.com",
    max_retries=5  # Retry 5 times
)

Use DFS for deep sites:

spider = Spider(
    start_url="https://example.com",
    traversal_strategy="dfs"  # Fewer concurrent requests
)

"Too many connections" / "Connection pool is full"¶

Problem: Too many concurrent requests.

Solution: Wait between crawls or reduce depth:

import asyncio

spider = Spider(start_url="https://example.com", max_depth=1)
documents = await spider.run_async()

# Wait before next crawl
await asyncio.sleep(5)

spider2 = Spider(start_url="https://example.com/section2", max_depth=1)
documents2 = await spider2.run_async()
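
With more than two crawls, the same wait-between-crawls pattern can be wrapped in a small helper. This is a generic asyncio sketch, not part of linktrace; run_sequentially and its pause parameter are hypothetical names.

```python
import asyncio


async def run_sequentially(jobs, pause=5.0):
    """Await each zero-argument coroutine factory in turn, sleeping between them."""
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            await asyncio.sleep(pause)  # give open connections time to close
        results.append(await job())
    return results


# Possible usage, assuming the Spider calls shown above:
#   spiders = [Spider(start_url=url, max_depth=1) for url in urls]
#   all_documents = await run_sequentially([s.run_async for s in spiders])
```

Passing the bound method (s.run_async, without parentheses) means each crawl starts only when the previous one has finished.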

Parsing & Document Issues¶

"No links found" / "Empty internal_links"¶

Problem: Spider isn't finding links on the page.

Debugging:

spider = Spider(start_url="https://example.com", max_depth=1)
documents = await spider.run_async()

doc = documents[0]
print(f"URL: {doc.url}")
print(f"Status: {doc.status_code}")
print(f"Title: {doc.title}")
print(f"HTML length: {len(doc.source)}")
print(f"Internal links: {len(doc.internal_links)}")
print(f"External links: {len(doc.external_links)}")

# Print first few links
for link in doc.internal_links[:3]:
    print(f"  - {link.url}")

Common causes:

- Relative links not resolved correctly
- JavaScript-generated links (linktrace doesn't execute JS)
- Links hidden behind onclick or data-href attributes
- Page returned an error status (check doc.status_code)

Solution for JS-heavy sites: Use Selenium or Playwright instead.
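
To tell whether links are missing from the raw HTML or were simply not picked up, you can count the anchor tags yourself. This sketch uses the standard-library html.parser; AnchorCounter and static_hrefs are hypothetical names, and the string you pass in would be doc.source from the debugging snippet above.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def static_hrefs(html):
    """Return every href found in anchor tags of the given HTML string."""
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

If static_hrefs(doc.source) is empty while the page shows links in a browser, the links are most likely generated by JavaScript.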

"Wrong internal/external classification"¶

Problem: Link marked as external when it should be internal (or vice versa).

Cause: Domain extraction issue.

doc.url = "https://example.com"
doc.domain  # "example"

# Subdomain issue?
doc.url = "https://api.example.com"
doc.domain  # "example" (correctly recognizes base domain)

# Link mismatch?
link.url = "https://www.example.com"  # "www" prefix
# Link is still internal because tldextract handles this

If still misclassified: Check link URL format:

for link in doc.links:
    print(f"{link.url} → {link.url.split('/')[2]}")
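
The split('/')[2] check above only works for absolute URLs. A slightly more robust sketch with urllib.parse resolves relative links first and strips ports. host_of is a hypothetical helper; it returns the full hostname, not the base domain that tldextract would report.

```python
from urllib.parse import urljoin, urlparse


def host_of(link_url, page_url):
    """Resolve link_url against page_url and return its lowercase hostname."""
    absolute = urljoin(page_url, link_url)
    # hostname is None for URLs without a network location, such as mailto: links
    return urlparse(absolute).hostname
```

Comparing host_of(link.url, doc.url) with the hostname of doc.url shows whether a mismatch comes from a subdomain, a port, or an unresolved relative path.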

Caching Issues¶

"Cache file corrupted" warning¶

Problem: Cache file is invalid JSON or incomplete.

Solution: Clear cache:

import shutil

# ignore_errors=True avoids an exception if the directory does not exist
shutil.rmtree(".webcrawler_cache", ignore_errors=True)

Cache is automatically cleaned up on corruption; the crawl continues.
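
If you would rather inspect the cache before deleting all of it, you can list the files that fail to parse. This sketch assumes each cache entry is a separate UTF-8 JSON file inside the cache directory; that layout is an assumption, and find_corrupt_cache_files is a hypothetical helper.

```python
import json
from pathlib import Path


def find_corrupt_cache_files(cache_dir):
    """Return paths of files under cache_dir that do not parse as JSON."""
    corrupt = []
    for path in sorted(Path(cache_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            corrupt.append(path)
    return corrupt


if __name__ == "__main__":
    if Path(".webcrawler_cache").is_dir():
        for bad_file in find_corrupt_cache_files(".webcrawler_cache"):
            print(f"Corrupt: {bad_file}")
```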

"Cache not being used" / "Slow second run"¶

Problem: The cache directory doesn't exist, or cache_dir points to the wrong path.

Debugging:

import os
from linktrace import Spider

cache_dir = ".webcrawler_cache"
print(f"Cache directory exists: {os.path.exists(cache_dir)}")
print(f"Cache files: {os.listdir(cache_dir) if os.path.exists(cache_dir) else 'N/A'}")

spider = Spider(
    start_url="https://example.com",
    cache_dir=cache_dir
)

Solutions:

Ensure cache_dir is set:

spider = Spider(
    start_url="https://example.com",
    cache_dir=".webcrawler_cache"  # Don't forget this!
)

Use absolute path:

import os
cache_dir = os.path.join(os.getcwd(), ".webcrawler_cache")
spider = Spider(start_url="...", cache_dir=cache_dir)

Performance Issues¶

"Crawl is slow"¶

Problem: Waiting for network or parsing.

Debugging:

import time
from linktrace import Spider

start = time.perf_counter()  # monotonic clock suited to measuring durations
spider = Spider(start_url="https://example.com", max_depth=1)
documents = await spider.run_async()
elapsed = time.perf_counter() - start

print(f"Pages: {len(documents)}")
print(f"Time: {elapsed:.2f}s")
if documents:  # avoid dividing by zero when nothing was crawled
    print(f"Per page: {elapsed / len(documents):.2f}s")

Solutions:

Increase max_depth cautiously (the number of pages, and so the crawl time, can grow exponentially with depth):

# Depth 1 = fast
# Depth 2 = slower
# Depth 3+ = can be very slow
spider = Spider(start_url="https://example.com", max_depth=2)
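
To see why depth matters so much, here is a rough upper bound on the number of pages. It assumes the start page counts as depth 0 and every page contributes the same number of new internal links; both are simplifying assumptions, and max_pages is a hypothetical helper.

```python
def max_pages(links_per_page, max_depth):
    """Upper bound on pages visited: 1 + b + b**2 + ... + b**max_depth."""
    return sum(links_per_page ** level for level in range(max_depth + 1))


if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, max_pages(10, depth))
```

With 10 new links per page the bound is 11 pages at depth 1, 111 at depth 2, and 1111 at depth 3. Real crawls usually visit fewer pages because many links point to pages already seen.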

Use caching:

spider = Spider(
    start_url="https://example.com",
    cache_dir=".webcrawler_cache"  # 2-50x faster on 2nd run
)

Try BFS instead of DFS (or vice versa):

# DFS might find slow branch first
spider = Spider(
    start_url="https://example.com",
    traversal_strategy="bfs"  # Try this instead
)

Export/Serialization Issues¶

"pandas not found" error¶

Problem: pandas not installed.

Solution:

pip install pandas
# or
pip install linktrace[pandas]

"DataFrame is huge / running out of memory"¶

Problem: Too many rows from large crawl.

Solutions:

Export only specific fields:

df = serializer.to_pandas()
df_small = df[["url", "title", "link_url", "link_type"]]
df_small.to_csv("output.csv")

Process in batches:

df = serializer.to_pandas()
for chunk in [df[i:i+1000] for i in range(0, len(df), 1000)]:
    process(chunk)
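
The list comprehension above builds every chunk before processing starts. A generator produces the slices lazily instead. This sketch works on anything that supports len() and slicing, which includes the DataFrame row slicing used above; iter_chunks is a hypothetical helper.

```python
def iter_chunks(rows, size=1000):
    """Yield consecutive slices of at most `size` rows, one at a time."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Possible usage with the DataFrame from above:
#   for chunk in iter_chunks(df, 1000):
#       process(chunk)
```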

Don't include HTML:

serializer = Serializers(documents)
df = serializer.to_pandas(include_html=False)  # Much smaller

Logging & Debugging¶

"No debug output" / "Silent failure"¶

Problem: Can't see what's happening.

Solution: Enable logging:

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

spider = Spider(start_url="https://example.com")
documents = await spider.run_async()

# Now you'll see:
# DEBUG: Spider initialized: strategy=BFS, max_depth=3
# DEBUG: Fetching https://example.com (attempt 1)
# INFO: Visited: https://example.com
# etc.

Filtering specific loggers¶

# Just Spider logs
logging.getLogger("linktrace.Spider").setLevel(logging.DEBUG)

# Just Crawler logs
logging.getLogger("linktrace.Crawler").setLevel(logging.DEBUG)

# Silence everything else
logging.getLogger().setLevel(logging.WARNING)
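
The calls above can be combined into one setup function. Loggers without an explicit level inherit from their parent, so setting the root logger to WARNING quiets everything that is not listed. This is a sketch; configure_logging is a hypothetical helper, and the logger names are the ones used above.

```python
import logging


def configure_logging(verbose_loggers=("linktrace.Spider",)):
    """Show DEBUG output for selected loggers and WARNING and above for the rest."""
    logging.basicConfig(
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    )
    logging.getLogger().setLevel(logging.WARNING)  # default for everything else
    for name in verbose_loggers:
        logging.getLogger(name).setLevel(logging.DEBUG)
```

Call configure_logging() once, before creating the Spider.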

Common Mistakes¶

❌ Not using async context manager¶

# Wrong
crawler = Crawler()
doc = await crawler.crawl_document_async(url)  # Session is None!

# Correct
async with Crawler() as crawler:
    doc = await crawler.crawl_document_async(url)

❌ Forgetting `await`¶

# Wrong
spider = Spider(start_url="https://example.com")
documents = spider.run_async()  # Returns coroutine, not documents!

# Correct
spider = Spider(start_url="https://example.com")
documents = await spider.run_async()
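
Note that await is only valid inside an async def function, or in an environment with top-level await such as a notebook. In a plain script, wrap the call in a coroutine and start it with asyncio.run. In the sketch below, run_crawl is a stand-in for the spider.run_async() call and returns placeholder values, so it runs without linktrace.

```python
import asyncio


async def run_crawl():
    """Stand-in for `await spider.run_async()`; returns placeholder documents."""
    await asyncio.sleep(0)
    return ["doc-1", "doc-2"]


def main():
    # asyncio.run creates an event loop, awaits the coroutine, and returns its result.
    return asyncio.run(run_crawl())


if __name__ == "__main__":
    print(main())
```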

❌ Invalid traversal_strategy¶

# Wrong
spider = Spider(start_url="...", traversal_strategy="dfs-fast")

# Correct
spider = Spider(start_url="...", traversal_strategy="dfs")  # or "bfs"

❌ Expecting JavaScript execution¶

# Won't work - linktrace doesn't execute JavaScript
spider = Spider(start_url="https://react-app.example.com")
documents = await spider.run_async()

# Solution: Use Playwright or Selenium for JS-heavy sites

Getting Help¶

- Check the logs: enable DEBUG logging
- Verify the network: test the URL in a browser
- Check the docs: see the API Reference and Examples
- File an issue: include logs, the error message, and a minimal reproduction
