Chatmancer Docs

Knowledge Base & Training

Train your chatbot on your website, sitemaps, and PDFs.


The knowledge base lets your chatbot answer questions based on your own content — your website, documentation, product pages, or any PDF you upload. Without a knowledge base, the chatbot relies only on its system prompt and the AI model's general training.


Source types

Website URL

Provide a website URL and Chatmancer will crawl it — following internal links up to a configurable depth — and index the text content of every page it finds.

Good for: marketing sites, help centres, product documentation hosted on a website.

Tips for best results:

  • Make sure your content is server-rendered or statically generated. JavaScript-only pages (SPAs without SSR) may not be fully crawlable.
  • Use clear headings (<h1>, <h2>, <h3>) — the crawler uses these to understand document structure.
  • Avoid pages hidden behind login walls — the crawler is unauthenticated.
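
The crawl behaviour described above (follow internal links up to a depth limit, stop at a page cap) can be illustrated with a minimal breadth-first sketch. This is not Chatmancer's actual crawler: the `crawl` function, its `fetch` parameter, and the `max_depth` and `max_pages` names are hypothetical, and the in-memory `SITE` dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2, max_pages=500):
    """Breadth-first crawl of same-host links, up to max_depth hops from start_url.

    fetch is a hypothetical callable mapping a URL to its HTML (or None on failure).
    Returns a dict of {url: html} for every page fetched.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Toy site used instead of real network access (illustrative only).
SITE = {
    "https://example.com/": '<a href="/docs">Docs</a> <a href="https://other.example/">Ext</a>',
    "https://example.com/docs": '<a href="/docs/setup#install">Setup</a>',
    "https://example.com/docs/setup": "<h1>Setup</h1>",
}
pages = crawl("https://example.com/", SITE.get, max_depth=1)
```

With `max_depth=1` the sketch fetches the start page and the pages it links to directly, and it skips the external link because its host differs from the start URL's host.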

Sitemap XML

Provide a sitemap.xml URL. Chatmancer will fetch every URL listed in the sitemap and index those pages. This is more reliable than URL crawling for large sites because you control exactly which pages are included.

https://example.com/sitemap.xml
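
To check which pages a sitemap would contribute before adding it as a source, you can list its `<loc>` entries yourself. The sketch below uses Python's standard `xml.etree.ElementTree` and the standard sitemap namespace; the `sitemap_urls` helper and the sample document are illustrative, and the sketch does not handle sitemap index files that point to other sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

urls = sitemap_urls(example_sitemap)
```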

PDF upload

Upload PDF files directly from the knowledge base page. Chatmancer extracts the text content and indexes it alongside your other sources.

Good for: product manuals, whitepapers, internal policy documents, compliance docs.

Note: Scanned PDFs (image-only) are not currently supported. The PDF must contain selectable text.


How indexing works

When you add a source, Chatmancer:

  1. Fetches and parses the content (crawl, sitemap fetch, or PDF extraction)
  2. Splits the content into overlapping chunks (~500 tokens each)
  3. Generates a vector embedding for each chunk using OpenAI's embedding model
  4. Stores the chunks and embeddings in your RDS database
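
Step 2 can be pictured as a sliding window over the token sequence. The sketch below is an illustration only: the 50-token overlap is an assumed value (this page states only that chunks overlap and are roughly 500 tokens), `chunk_tokens` is a hypothetical helper, and a real pipeline would use the embedding model's tokenizer rather than a plain list.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Integers stand in for tokens here.
chunks = chunk_tokens(list(range(1200)), size=500, overlap=50)
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.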

When a visitor asks a question, the chatbot runs a semantic similarity search across all embeddings to find the most relevant chunks, then passes them as context to the language model.
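
The retrieval step can be sketched as a nearest-neighbour ranking over the stored embeddings. This example assumes cosine similarity, a common choice that this page does not confirm, and uses two-dimensional toy vectors in place of real embeddings. The `top_k` helper and its `k` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, indexed_chunks, k=3):
    """Return the texts of the k chunks most similar to the query.

    indexed_chunks is a list of (text, embedding) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _embedding in ranked[:k]]


# Toy index: real embeddings have hundreds or thousands of dimensions.
index = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns and refunds", [0.9, 0.1]),
]
context = top_k([1.0, 0.0], index, k=2)
```

The selected chunk texts are what would be passed to the language model as context alongside the visitor's question.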


Source status

Each source has a status indicator:

Status     Meaning
pending    Queued, not yet started
crawling   Actively fetching and indexing
ready      Indexed and available for search
failed     Crawl or indexing error (see logs)

For large websites, crawling can take several minutes to hours. The chatbot continues to use any previously indexed content while a re-crawl is in progress.
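
One way to reason about these statuses is as a small state machine. The transitions below are inferred from the descriptions on this page and are assumptions, not a documented contract. In particular, the sketch assumes a failed source can be refreshed the same way a ready one can. `SourceStatus`, `ALLOWED`, and `can_transition` are hypothetical names.

```python
from enum import Enum


class SourceStatus(Enum):
    PENDING = "pending"
    CRAWLING = "crawling"
    READY = "ready"
    FAILED = "failed"


# Assumed transitions, inferred from the status descriptions on this page.
ALLOWED = {
    SourceStatus.PENDING: {SourceStatus.CRAWLING},
    SourceStatus.CRAWLING: {SourceStatus.READY, SourceStatus.FAILED},
    SourceStatus.READY: {SourceStatus.PENDING},   # manual refresh
    SourceStatus.FAILED: {SourceStatus.PENDING},  # manual refresh (assumed)
}


def can_transition(current, new):
    """Return True if moving from `current` to `new` is allowed in this sketch."""
    return new in ALLOWED[current]
```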


Re-crawling and updating content

Sources are not automatically re-crawled on a schedule. To pick up changes to your website or documents:

  1. Go to Chatbots → [your chatbot] → Knowledge Base
  2. Click the refresh icon next to the source you want to update
  3. The source status resets to pending, and the source is re-indexed from scratch

For time-sensitive content (e.g. a changelog), consider re-crawling after each publish.


Deleting a source

To remove a source and all its indexed content:

  1. Go to Chatbots → [your chatbot] → Knowledge Base
  2. Click the delete (✕) button next to the source
  3. The source and all associated embeddings are permanently removed

The chatbot will no longer use that content in future conversations. Existing conversation history is not affected.


Limits

Item                      Limit
Pages per website crawl   500 (configurable)
PDF file size             50 MB per file
PDF pages                 500 pages per file
Sources per chatbot       Unlimited
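
If you upload PDFs in bulk, a pre-flight check against these limits can save a failed upload. The sketch below is a hypothetical client-side helper, not part of Chatmancer. It assumes 1 MB means 1024 × 1024 bytes and that you already know the file's page count.

```python
MAX_PDF_BYTES = 50 * 1024 * 1024  # 50 MB, assuming 1 MB = 1024 * 1024 bytes
MAX_PDF_PAGES = 500


def pdf_limit_errors(size_bytes, page_count):
    """Return a list of human-readable limit violations (empty if the file fits)."""
    errors = []
    if size_bytes > MAX_PDF_BYTES:
        errors.append(f"file is {size_bytes} bytes; the limit is {MAX_PDF_BYTES}")
    if page_count > MAX_PDF_PAGES:
        errors.append(f"file has {page_count} pages; the limit is {MAX_PDF_PAGES}")
    return errors


problems = pdf_limit_errors(size_bytes=60 * 1024 * 1024, page_count=120)
```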
