doc-scraper

Go web crawler that scrapes documentation sites and converts content to clean Markdown for LLM ingestion (RAG, training data).

  • Go
  • BadgerDB
  • Open Source

Overview

doc-scraper is a resilient web crawler, written in Go, that extracts structured content from documentation websites and converts it to clean Markdown. The output is ready to feed into ingestion pipelines for RAG applications or model training.

Why I Built This

Documentation is scattered across countless sites, each with a different structure. LLM applications need clean, structured data, and this tool automates the extraction while preserving the semantic structure of the content.

Technical Approach

  • Go for performance and concurrency handling
  • BadgerDB for efficient key-value storage of crawled content
  • Robust error handling and retry logic for resilient scraping
  • Configurable depth and domain restrictions
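The last two points can be illustrated with a minimal sketch. It is not taken from doc-scraper's source: `CrawlConfig`, `InScope`, `WithRetry`, and `ErrTransient` are hypothetical names chosen for illustration, and the exponential backoff policy is an assumption about what "retry logic" might look like. It uses only the Go standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
	"time"
)

// ErrTransient marks a failure that is worth retrying (hypothetical sentinel).
var ErrTransient = errors.New("transient fetch error")

// CrawlConfig is a hypothetical configuration type, not doc-scraper's actual one.
type CrawlConfig struct {
	AllowedDomain string        // only this host and its subdomains are crawled
	MaxDepth      int           // maximum link depth from the start URL (0 = start page only)
	MaxRetries    int           // retries allowed after the first failed attempt
	BaseDelay     time.Duration // wait before the first retry; doubles on each retry
}

// InScope reports whether rawURL, found at the given link depth, should be crawled.
func (c CrawlConfig) InScope(rawURL string, depth int) bool {
	if depth > c.MaxDepth {
		return false
	}
	u, err := url.Parse(rawURL)
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return false
	}
	host := strings.ToLower(u.Hostname())
	domain := strings.ToLower(c.AllowedDomain)
	// Match the domain itself or any subdomain; the leading dot keeps
	// "notexample.com" from matching "example.com".
	return host == domain || strings.HasSuffix(host, "."+domain)
}

// WithRetry calls fetch until it succeeds, returns a non-transient error,
// or MaxRetries is exhausted, sleeping with exponential backoff in between.
func (c CrawlConfig) WithRetry(fetch func() error) error {
	delay := c.BaseDelay
	for attempt := 0; ; attempt++ {
		err := fetch()
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrTransient) || attempt >= c.MaxRetries {
			return fmt.Errorf("giving up after %d attempt(s): %w", attempt+1, err)
		}
		time.Sleep(delay)
		delay *= 2
	}
}
```

In a crawl loop under these assumptions, each discovered link would first be checked with `cfg.InScope(link, depth+1)`, and the page download would be wrapped in `cfg.WithRetry(...)` so that only errors wrapping `ErrTransient` (for example timeouts) are retried.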

Community Adoption

doc-scraper has been adopted by the open-source community on GitHub. Developers use it to streamline knowledge integration into ML workflows, build custom RAG systems, and prepare training datasets.