Knowledge Base & Training
Train your chatbot on your website, sitemaps, and PDFs.
The knowledge base lets your chatbot answer questions based on your own content — your website, documentation, product pages, or any PDF you upload. Without a knowledge base, the chatbot relies only on its system prompt and the AI model's general training.
Source types
Website URL
Provide a website URL and Chatmancer will crawl it — following internal links up to a configurable depth — and index the text content of every page it finds.
Good for: marketing sites, help centres, product documentation hosted on a website.
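As a rough illustration of what a depth-limited crawl of internal links involves, here is a minimal sketch. It is not Chatmancer's actual crawler: the `crawl` function, the `fetch` callable, and the `max_pages` parameter are hypothetical names chosen for this example, and "internal" is assumed to mean "same host as the start URL".

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str, max_depth: int, fetch: Callable[[str], str],
          max_pages: int = 500) -> Dict[str, str]:
    """Breadth-first crawl of same-host links, at most max_depth hops deep.

    `fetch` is a hypothetical callable that returns the HTML for a URL.
    Returns a mapping of URL to raw HTML for every page visited.
    """
    host = urlparse(start_url).netloc
    pages: Dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client. Because the extractor only sees links present in the fetched HTML, links injected by client-side JavaScript would be missed, which is why server-rendered content crawls more reliably.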
Tips for best results:
- Make sure your content is server-rendered or statically generated. JavaScript-only pages (SPAs without SSR) may not be fully crawlable.
- Use clear headings (<h1>, <h2>, <h3>). The crawler uses these to understand document structure.
- Avoid pages hidden behind login walls, because the crawler is unauthenticated.
Sitemap XML
Provide a sitemap.xml URL. Chatmancer will fetch every URL listed in the sitemap and index those pages. This is more reliable than URL crawling for large sites because you control exactly which pages are included.
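To see which pages a sitemap would contribute before adding it, you can list its entries yourself. The sketch below assumes a standard sitemaps.org `urlset` document; the `urls_from_sitemap` helper is a hypothetical name for this example and says nothing about how Chatmancer parses sitemaps internally.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


example = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>
""".strip()

for url in urls_from_sitemap(example):
    print(url)
```

This sketch handles a plain `urlset` only. Sitemap index files, which point to other sitemaps, would need an extra step.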
https://example.com/sitemap.xml
PDF upload
Upload PDF files directly from the knowledge base page. Chatmancer extracts the text content and indexes it alongside your other sources.
Good for: product manuals, whitepapers, internal policy documents, compliance docs.
Note: Scanned PDFs (image-only) are not currently supported. The PDF must contain selectable text.
How indexing works
When you add a source, Chatmancer:
- Fetches and parses the content (crawl, sitemap fetch, or PDF extraction)
- Splits the content into overlapping chunks (~500 tokens each)
- Generates a vector embedding for each chunk using OpenAI's embedding model
- Stores the chunks and embeddings in your RDS database
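The chunking step can be pictured with a short sketch. The overlap size is not documented here, so the 50-token overlap below is an assumption, as is the `chunk_tokens` name. Real tokenization is model-specific, so the example uses whitespace-separated words as a stand-in for tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 500,
                 overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens.

    Assumption: overlap=50 is illustrative, not a documented value.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Words stand in for tokens in this toy example.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, size=4, overlap=1):
    print(" ".join(chunk))
```

Overlapping windows mean a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.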
When a visitor asks a question, the chatbot runs a semantic similarity search across all embeddings to find the most relevant chunks, then passes them as context to the language model.
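The retrieval step amounts to ranking stored chunks by how close their embeddings are to the embedding of the question. The sketch below uses cosine similarity, which is a common choice but an assumption here, since the exact similarity measure is not specified. The `top_k` helper and the two-dimensional toy vectors are purely illustrative; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          indexed: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vec in ranked[:k]]


# Toy index: (chunk text, embedding) pairs with made-up vectors.
index = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns", [0.9, 0.1]),
]
print(top_k([1.0, 0.05], index, k=2))
```

The selected chunk texts are what would then be supplied to the language model as context alongside the visitor's question.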
Source status
Each source has a status indicator:
| Status | Meaning |
|---|---|
| pending | Queued, not yet started |
| crawling | Actively fetching and indexing |
| ready | Indexed and available for search |
| failed | Crawl or indexing error; see logs |
For large websites, crawling can take several minutes to hours. The chatbot continues to use any previously indexed content while a re-crawl is in progress.
Re-crawling and updating content
Sources are not automatically re-crawled on a schedule. To pick up changes to your website or documents:
- Go to Chatbots → [your chatbot] → Knowledge Base
- Click the refresh icon next to the source you want to update
- The source status resets to pending and the source is re-indexed from scratch
For time-sensitive content (e.g. a changelog), consider re-crawling after each publish.
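One way to reason about the status values and the refresh action is as a small state machine. The sketch below is a mental model only: the `SourceStatus` enum and `can_transition` helper are hypothetical names, and the transition from failed back to pending assumes that a failed source can be refreshed like any other.

```python
from enum import Enum


class SourceStatus(Enum):
    PENDING = "pending"
    CRAWLING = "crawling"
    READY = "ready"
    FAILED = "failed"


# Transitions implied by the status table and the refresh steps.
ALLOWED = {
    SourceStatus.PENDING: {SourceStatus.CRAWLING},
    SourceStatus.CRAWLING: {SourceStatus.READY, SourceStatus.FAILED},
    SourceStatus.READY: {SourceStatus.PENDING},   # refresh resets to pending
    SourceStatus.FAILED: {SourceStatus.PENDING},  # assumption: retry via refresh
}


def can_transition(current: SourceStatus, new: SourceStatus) -> bool:
    """Return True if moving from `current` to `new` fits the model above."""
    return new in ALLOWED[current]


print(can_transition(SourceStatus.READY, SourceStatus.PENDING))
```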
Deleting a source
To remove a source and all its indexed content:
- Go to Knowledge Base for the relevant chatbot
- Click the delete (✕) button next to the source
- The source and all associated embeddings are permanently removed
The chatbot will no longer use that content in future conversations. Existing conversation history is not affected.
Limits
| Item | Limit |
|---|---|
| Pages per website crawl | 500 (configurable) |
| PDF file size | 50 MB per file |
| PDF pages | 500 pages per file |
| Sources per chatbot | Unlimited |
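If you prepare PDFs in bulk, a quick local check against the limits above can catch oversized files before upload. This is a sketch under stated assumptions: the `pdf_limit_problems` helper is hypothetical, and "50 MB" is interpreted as 50 × 1024 × 1024 bytes, which may differ from how the service measures it.

```python
from typing import List

MAX_PDF_BYTES = 50 * 1024 * 1024  # assumption: 1 MB counted as 1024 * 1024 bytes
MAX_PDF_PAGES = 500


def pdf_limit_problems(size_bytes: int, page_count: int) -> List[str]:
    """Return human-readable reasons a PDF would exceed the limits, if any."""
    problems: List[str] = []
    if size_bytes > MAX_PDF_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_PDF_BYTES}")
    if page_count > MAX_PDF_PAGES:
        problems.append(f"file has {page_count} pages; limit is {MAX_PDF_PAGES}")
    return problems


# An empty list means the file fits within both limits.
print(pdf_limit_problems(size_bytes=10_000_000, page_count=120))
```

The file size can be read with `os.path.getsize`. The page count has to come from whichever PDF library you already use.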