Troubleshooting¶
SSL/Certificate Issues¶
"SSL: CERTIFICATE_VERIFY_FAILED"¶
Problem: Certificate verification failed.
Solutions:
- Self-signed certificates (testing only):
    spider = Spider(
        start_url="https://self-signed.example.com",
        ssl_verify=False,  # ⚠️ Insecure
    )
- Corporate proxy with CA bundle:
    spider = Spider(
        start_url="https://internal.company.com",
        ssl_verify="/etc/ssl/certs/company-ca.pem",
    )
- Skip hostname verification only:
    spider = Spider(
        start_url="https://example.com",
        ssl_verify=True,
        verify_hostname=False,
    )
"Can't connect to HTTPS URL because the SSL module is not available"¶
Problem: Python was built without SSL support.
Solution: Rebuild Python with OpenSSL or use a prebuilt Python distribution.
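To confirm whether the interpreter you are running actually has SSL support, a quick standard-library check can help. This is a minimal sketch; the helper name `ssl_support_info` is illustrative and not part of linktrace.

```python
import importlib


def ssl_support_info():
    """Return the linked OpenSSL version string, or None if the ssl module is missing."""
    try:
        ssl = importlib.import_module("ssl")
    except ImportError:
        return None
    return ssl.OPENSSL_VERSION


info = ssl_support_info()
if info is None:
    print("This Python was built without SSL support")
else:
    print(f"SSL available: {info}")
```

Run it with the same interpreter that runs your crawl, since a machine can have several Python installations with different build options.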
Connection & Timeout Issues¶
"Connection refused" or "No route to host"¶
Problem: Cannot reach the target server.
Debugging:
- Check URL spelling
- Verify server is running
- Test connectivity: ping example.com
- Try in browser first
# Verify URL is correct
spider = Spider(start_url="https://example.com")
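The connectivity checks above can also be scripted. The sketch below uses only the standard library to test whether a TCP connection can be opened; `can_connect` is a hypothetical helper, not a linktrace function.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers DNS failures, refused connections, unreachable hosts, and timeouts
        print(f"Cannot reach {host}:{port}: {exc}")
        return False
```

For example, `can_connect("example.com", 443)` checks that the HTTPS port is reachable. Because ping relies on ICMP, which some hosts block, a TCP check on the actual port is often more representative of what the crawler will experience.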
"Request timeout" / "Timeout waiting for response"¶
Problem: The server is slow to respond, or network latency is high.
Solutions:
- Increase timeout:
    spider = Spider(
        start_url="https://slow-server.example.com",
        request_timeout=60,  # 60 seconds instead of default 30
    )
- Increase retries:
    spider = Spider(
        start_url="https://example.com",
        max_retries=5,  # Retry 5 times
    )
- Use DFS for deep sites:
    spider = Spider(
        start_url="https://example.com",
        traversal_strategy="dfs",  # Fewer concurrent requests
    )
"Too many connections" / "Connection pool is full"¶
Problem: Too many concurrent requests.
Solution: Wait between crawls or reduce depth:
import asyncio
spider = Spider(start_url="https://example.com", max_depth=1)
documents = await spider.run_async()
# Wait before next crawl
await asyncio.sleep(5)
spider2 = Spider(start_url="https://example.com/section2", max_depth=1)
documents2 = await spider2.run_async()
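If you have more than two crawls to space out, the wait-between-crawls pattern above can be wrapped in a small helper. This is a sketch under assumptions: `run_with_pause` is a hypothetical name, and `fake_crawl` stands in for a call such as `Spider(...).run_async()` so the example runs without network access.

```python
import asyncio


async def run_with_pause(jobs, pause_seconds: float = 5.0):
    """Await each zero-argument coroutine function in turn, sleeping between them."""
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            # Give the previous crawl's connections time to close
            await asyncio.sleep(pause_seconds)
        results.append(await job())
    return results


async def fake_crawl():
    # Stand-in for something like `Spider(...).run_async()`
    await asyncio.sleep(0)
    return ["page"]


print(asyncio.run(run_with_pause([fake_crawl, fake_crawl], pause_seconds=0.01)))
```

Each job is passed as a function rather than an already-created coroutine so that no crawl starts before its turn.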
Parsing & Document Issues¶
"No links found" / "Empty internal_links"¶
Problem: Spider isn't finding links on the page.
Debugging:
spider = Spider(start_url="https://example.com", max_depth=1)
documents = await spider.run_async()
doc = documents[0]
print(f"URL: {doc.url}")
print(f"Status: {doc.status_code}")
print(f"Title: {doc.title}")
print(f"HTML length: {len(doc.source)}")
print(f"Internal links: {len(doc.internal_links)}")
print(f"External links: {len(doc.external_links)}")
# Print first few links
for link in doc.internal_links[:3]:
    print(f" - {link.url}")
Common causes:
- Relative links not resolved correctly
- JavaScript-generated links (linktrace doesn't execute JS)
- Links hidden behind onclick or data-href
- Page returned error status (check doc.status_code)
Solution for JS-heavy sites: Use Selenium or Playwright instead.
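To tell whether a page relies on script-driven navigation, you can count ordinary anchors against elements that carry `onclick` or `data-href` attributes. The sketch below uses only the standard library's `html.parser`; `LinkKindCounter` and `count_link_kinds` are illustrative names, not part of linktrace.

```python
from html.parser import HTMLParser


class LinkKindCounter(HTMLParser):
    """Count plain <a href> links versus script-style link attributes."""

    def __init__(self):
        super().__init__()
        self.counts = {"href": 0, "onclick": 0, "data-href": 0}

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if tag == "a" and "href" in names:
            self.counts["href"] += 1
        if "onclick" in names:
            self.counts["onclick"] += 1
        if "data-href" in names:
            self.counts["data-href"] += 1


def count_link_kinds(html: str) -> dict:
    parser = LinkKindCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = '<a href="/a">A</a><div onclick="go()">B</div><span data-href="/c">C</span>'
print(count_link_kinds(sample))
```

Calling `count_link_kinds(doc.source)` on a crawled document gives a rough picture: a low `href` count next to high `onclick` or `data-href` counts suggests the links are generated or handled by JavaScript.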
"Wrong internal/external classification"¶
Problem: Link marked as external when it should be internal (or vice versa).
Cause: Domain extraction issue.
doc.url = "https://example.com"
doc.domain # "example"
# Subdomain issue?
doc.url = "https://api.example.com"
doc.domain # "example" (correctly recognizes base domain)
# Link mismatch?
link.url = "https://www.example.com" # "www" prefix
# Link is still internal because tldextract handles this
If still misclassified: Check link URL format:
for link in doc.links:
    print(f"{link.url} → {link.url.split('/')[2]}")
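Splitting on `/` keeps ports and credentials in the result and gives misleading values for relative URLs. As a slightly more robust alternative for inspecting hosts, here is a sketch using the standard library's `urllib.parse`; `host_of` is a hypothetical helper, and it does not reproduce the base-domain logic that tldextract provides.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, without port or credentials."""
    return (urlsplit(url).hostname or "").lower()


for url in [
    "https://example.com/about",
    "https://www.example.com:8443/docs",
    "https://api.example.com/v1",
    "/relative/path",
]:
    print(f"{url!r} -> {host_of(url)!r}")
```

An empty result means the URL has no host part, which usually points to a relative link that was not resolved against the page URL.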
Caching Issues¶
"Cache file corrupted" warning¶
Problem: Cache file is invalid JSON or incomplete.
Solution: Clear cache:
import shutil
shutil.rmtree(".webcrawler_cache", ignore_errors=True)  # No error if the directory is already gone
Cache is automatically cleaned up on corruption; the crawl continues.
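If you would rather find the damaged files than delete the whole cache, a short scan can list them. This sketch assumes each file in the cache directory is a standalone UTF-8 JSON document, which the text above implies but does not spell out; `find_corrupt_cache_files` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_corrupt_cache_files(cache_dir: str = ".webcrawler_cache") -> list:
    """Return the cache files that cannot be read and parsed as JSON."""
    corrupt = []
    root = Path(cache_dir)
    if not root.is_dir():
        return corrupt
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (ValueError, OSError):
            # ValueError covers both invalid JSON and undecodable bytes
            corrupt.append(path)
    return corrupt
```

For example, `for path in find_corrupt_cache_files(): path.unlink()` removes only the unreadable entries and leaves the rest of the cache intact.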
"Cache not being used" / "Slow second run"¶
Problem: Cache directory doesn't exist or wrong path.
Debugging:
import os
from linktrace import Spider
cache_dir = ".webcrawler_cache"
print(f"Cache directory exists: {os.path.exists(cache_dir)}")
print(f"Cache files: {os.listdir(cache_dir) if os.path.exists(cache_dir) else 'N/A'}")
spider = Spider(
    start_url="https://example.com",
    cache_dir=cache_dir,
)
Solutions:
- Ensure cache_dir is set:
    spider = Spider(
        start_url="https://example.com",
        cache_dir=".webcrawler_cache",  # Don't forget this!
    )
- Use absolute path:
    import os
    cache_dir = os.path.join(os.getcwd(), ".webcrawler_cache")
    spider = Spider(start_url="...", cache_dir=cache_dir)
Performance Issues¶
"Crawl is slow"¶
Problem: The crawl spends most of its time waiting on the network or parsing pages.
Debugging:
import time
from linktrace import Spider
start = time.time()
spider = Spider(start_url="https://example.com", max_depth=1)
documents = await spider.run_async()
elapsed = time.time() - start
print(f"Pages: {len(documents)}")
print(f"Time: {elapsed:.2f}s")
print(f"Per page: {elapsed / max(len(documents), 1):.2f}s")  # Avoids dividing by zero on an empty crawl
Solutions:
- Increase max_depth cautiously (exponential time):
    # Depth 1 = fast
    # Depth 2 = slower
    # Depth 3+ = can be very slow
    spider = Spider(start_url="https://example.com", max_depth=2)
- Use caching:
    spider = Spider(
        start_url="https://example.com",
        cache_dir=".webcrawler_cache",  # 2-50x faster on 2nd run
    )
- Try BFS instead of DFS (or vice versa):
    # DFS might find slow branch first
    spider = Spider(
        start_url="https://example.com",
        traversal_strategy="bfs",  # Try this instead
    )
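To see why depth has such a large effect, a back-of-the-envelope estimate helps. The sketch below assumes the start page counts as depth 0 and that every page contributes the same number of new internal links with no duplicates; real sites overlap heavily, so treat the result as a rough upper bound. `estimate_pages` is an illustrative helper, not part of linktrace.

```python
def estimate_pages(links_per_page: int, max_depth: int) -> int:
    """Upper bound on pages visited: 1 + b + b^2 + ... + b^max_depth."""
    return sum(links_per_page ** level for level in range(max_depth + 1))


for depth in range(1, 4):
    print(f"max_depth={depth}: up to {estimate_pages(10, depth)} pages")
```

With 10 links per page the bound grows from 11 pages at depth 1 to 111 at depth 2 and 1111 at depth 3, so each extra level multiplies the work by roughly the number of links per page.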
Export/Serialization Issues¶
"pandas not found" error¶
Problem: pandas not installed.
Solution:
pip install pandas
# or
pip install linktrace[pandas]
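If the error persists after installing, pandas may have been installed for a different interpreter than the one running your script. A quick standard-library check, sketched here with a hypothetical `has_module` helper:

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


print(f"Interpreter: {sys.executable}")
print(f"pandas installed: {has_module('pandas')}")
```

If this prints `False`, installing with `python -m pip install pandas` (using the interpreter shown) ensures the package lands in the right environment.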
"DataFrame is huge / running out of memory"¶
Problem: Too many rows from large crawl.
Solutions:
- Export only specific fields:
    df = serializer.to_pandas()
    df_small = df[["url", "title", "link_url", "link_type"]]
    df_small.to_csv("output.csv")
- Process in batches:
    df = serializer.to_pandas()
    for chunk in [df[i:i+1000] for i in range(0, len(df), 1000)]:
        process(chunk)
- Don't include HTML:
    serializer = Serializers(documents)
    df = serializer.to_pandas(include_html=False)  # Much smaller
Logging & Debugging¶
"No debug output" / "Silent failure"¶
Problem: Can't see what's happening.
Solution: Enable logging:
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
spider = Spider(start_url="https://example.com")
documents = await spider.run_async()
# Now you'll see:
# DEBUG: Spider initialized: strategy=BFS, max_depth=3
# DEBUG: Fetching https://example.com (attempt 1)
# INFO: Visited: https://example.com
# etc.
Specific logger filtering:¶
# Just Spider logs
logging.getLogger("linktrace.Spider").setLevel(logging.DEBUG)
# Just Crawler logs
logging.getLogger("linktrace.Crawler").setLevel(logging.DEBUG)
# Silence everything else
logging.getLogger().setLevel(logging.WARNING)
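The filtering above works because a logger with its own level overrides the level inherited from the root logger. The following self-contained sketch demonstrates this with the standard `logging` module alone; it reuses the `linktrace.Spider` logger name from above, while `some.other.library` and `demo_logger_filtering` are made up for illustration.

```python
import io
import logging


def demo_logger_filtering() -> str:
    """Show that a DEBUG-level named logger gets through a WARNING-level root."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

    root = logging.getLogger()
    spider_log = logging.getLogger("linktrace.Spider")
    other_log = logging.getLogger("some.other.library")
    old_root_level, old_spider_level = root.level, spider_log.level

    root.addHandler(handler)
    root.setLevel(logging.WARNING)      # Silence everything else
    spider_log.setLevel(logging.DEBUG)  # Except this logger
    try:
        spider_log.debug("crawl details")         # emitted
        other_log.debug("hidden")                 # dropped: inherits WARNING
        other_log.warning("something went wrong")  # emitted
    finally:
        # Restore the previous logging configuration
        root.removeHandler(handler)
        root.setLevel(old_root_level)
        spider_log.setLevel(old_spider_level)
    return stream.getvalue()


print(demo_logger_filtering(), end="")
```

Only the Spider debug line and the warning from the other logger appear in the output; the other logger's debug message is filtered out.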
Common Mistakes¶
❌ Not using async context manager¶
# Wrong
crawler = Crawler()
doc = await crawler.crawl_document_async(url) # Session is None!
# Correct
async with Crawler() as crawler:
    doc = await crawler.crawl_document_async(url)
❌ Forgetting await¶
# Wrong
spider = Spider(start_url="https://example.com")
documents = spider.run_async() # Returns coroutine, not documents!
# Correct
spider = Spider(start_url="https://example.com")
documents = await spider.run_async()
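In a regular script, `await` is only valid inside an `async def` function, so the crawl usually lives in a `main()` coroutine started with `asyncio.run`. The sketch below shows the difference between calling and awaiting a coroutine function; `fake_run_async` is a stand-in for `spider.run_async()` so the example runs without linktrace or network access.

```python
import asyncio


async def fake_run_async():
    # Stand-in for `spider.run_async()`
    await asyncio.sleep(0)
    return ["doc1", "doc2"]


async def main():
    pending = fake_run_async()               # No await: this is a coroutine object
    is_coroutine = asyncio.iscoroutine(pending)
    documents = await pending                # Awaiting it produces the documents
    return is_coroutine, documents


print(asyncio.run(main()))
```

Interactive environments such as Jupyter or `python -m asyncio` allow top-level `await`, which is why snippets written that way work there but fail in a plain `.py` file.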
❌ Invalid traversal_strategy¶
# Wrong
spider = Spider(start_url="...", traversal_strategy="dfs-fast")
# Correct
spider = Spider(start_url="...", traversal_strategy="dfs") # or "bfs"
❌ Expecting JavaScript execution¶
# Won't work - linktrace doesn't execute JavaScript
spider = Spider(start_url="https://react-app.example.com")
documents = await spider.run_async()
# Solution: Use Playwright or Selenium for JS-heavy sites
Getting Help¶
- Check the logs — Enable DEBUG logging
- Verify network — Test URL in browser
- Check docs — See API Reference and Examples
- File an issue — Include logs, error message, minimal reproduction