Managing Documents
Upload, process, and manage documents in the ThinkFleet knowledge base.
This guide covers adding documents to your knowledge base, monitoring their processing, and maintaining document quality.
Uploading Documents
From the Dashboard
- Navigate to Knowledge Base in the sidebar
- Select your knowledge base
- Click Upload Documents
- Drag and drop files or click to browse
- Select one or more files (up to 50 MB per file; TXT, MD, and HTML files are limited to 10 MB)
- Click Upload
Supported Formats
| Format | Max Size | Notes |
|---|---|---|
| PDF | 50 MB | Text-based PDFs. Scanned PDFs are supported via OCR. |
| DOCX | 50 MB | Microsoft Word documents |
| TXT | 10 MB | Plain text files |
| MD | 10 MB | Markdown files (headers used as chunk boundaries) |
| HTML | 10 MB | HTML pages (tags stripped during extraction) |
| CSV | 50 MB | Each row becomes a searchable chunk |
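The limits in the table can be checked on the client before uploading. The following is a minimal sketch, not part of ThinkFleet: the function name `check_upload` is hypothetical, and it assumes that "MB" means 1024 × 1024 bytes and that formats are identified by file extension.

```python
from pathlib import PurePath
from typing import Optional

MB = 1024 * 1024  # assumption: limits are binary megabytes

# Per-format limits copied from the Supported Formats table.
MAX_SIZE_BYTES = {
    ".pdf": 50 * MB,
    ".docx": 50 * MB,
    ".txt": 10 * MB,
    ".md": 10 * MB,
    ".html": 10 * MB,
    ".csv": 50 * MB,
}


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return None if the file looks uploadable, else a reason string."""
    ext = PurePath(filename).suffix.lower()
    limit = MAX_SIZE_BYTES.get(ext)
    if limit is None:
        return f"unsupported format: {ext or '(no extension)'}"
    if size_bytes > limit:
        return f"{ext} files are limited to {limit // MB} MB"
    return None
```

For example, `check_upload("notes.txt", 11 * MB)` returns a reason string, while `check_upload("guide.pdf", 20 * MB)` returns `None`.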
Bulk Upload
Upload multiple files at once by selecting them in the file picker or dragging a batch onto the upload area. Each file is processed independently.
Via API
# Replace {projectId}, {kbId}, and {apiKey} with your own values.
curl -X POST \
  "https://your-instance.thinkfleet.com/api/v1/projects/{projectId}/knowledge-bases/{kbId}/documents" \
  -H "Authorization: Bearer {apiKey}" \
  -F "file=@/path/to/document.pdf" \
  -F "name=Product Guide v2"
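The same request can be built in Python with only the standard library. This is a sketch under assumptions: the helper name `build_upload_request` is hypothetical, the endpoint path, bearer header, and the `file` and `name` form fields are taken from the curl command, and the per-part `Content-Type` of `application/octet-stream` is a guess. The response format is not described in this guide, so the sketch only builds the request.

```python
import urllib.request
import uuid


def build_upload_request(base_url, project_id, kb_id, api_key,
                         filename, content, name):
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    url = (f"{base_url}/api/v1/projects/{project_id}"
           f"/knowledge-bases/{kb_id}/documents")

    # Two form parts, mirroring the two -F flags of the curl command.
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + content + b"\r\n"
    name_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="name"\r\n\r\n'
        f"{name}\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()

    return urllib.request.Request(
        url,
        data=file_part + name_part + closing,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`.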
Document Processing
After upload, documents go through an automated processing pipeline.
Processing Steps
1. Text Extraction
The parser extracts raw text from the uploaded file:
- PDF: Extracts text layer; falls back to OCR for scanned pages
- DOCX: Extracts text preserving paragraph structure
- HTML: Strips tags, extracts visible text content
- CSV: Converts rows to structured text entries
- TXT/MD: Used directly
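To make the HTML and CSV cases concrete, here is a rough illustration using Python's standard library. It is not ThinkFleet's parser. The function names are hypothetical, the whitespace handling is deliberately crude, and the "column: value" layout for CSV rows is an assumed format for "structured text entries".

```python
import csv
import io
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Strip tags and return the visible text, joined by single spaces."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def csv_to_entries(text: str) -> list:
    """Turn each CSV row into one 'column: value' text entry."""
    reader = csv.DictReader(io.StringIO(text))
    return ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in reader]
```

With this sketch, the CSV text `plan,price\nBasic,10\n` yields the single entry `plan: Basic; price: 10`.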
2. Chunking
The extracted text is split into chunks based on your knowledge base settings:
Document: "Product Guide" (5,000 tokens)
│
▼
Chunk 1: tokens 1-500 (Introduction)
Chunk 2: tokens 451-950 (overlap: 50)
Chunk 3: tokens 901-1400 (overlap: 50)
...
Chunk 11: tokens 4501-5000
The chunking engine respects natural boundaries:
- Paragraph breaks
- Markdown headers
- Sentence endings
- List items
This means chunks rarely split mid-sentence.
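The token ranges in the diagram follow from a fixed-size sliding window. The sketch below reproduces only that arithmetic, using 1-indexed inclusive ranges. The function name `chunk_ranges` is hypothetical, and the sketch does not snap to the natural boundaries listed above, which the real chunker does.

```python
def chunk_ranges(total_tokens: int, chunk_size: int, overlap: int):
    """Return 1-indexed, inclusive (start, end) token ranges."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # each chunk starts this far after the last
    ranges = []
    start = 1
    while start <= total_tokens:
        end = min(start + chunk_size - 1, total_tokens)
        ranges.append((start, end))
        if end == total_tokens:
            break
        start += stride
    return ranges
```

For a 5,000-token document with 500-token chunks and a 50-token overlap, each chunk starts 450 tokens after the previous one, so `chunk_ranges(5000, 500, 50)` returns 11 ranges, the last being `(4501, 5000)`.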
3. Embedding Generation
Each chunk is converted to a vector embedding — a numerical representation that captures its semantic meaning.
Chunk: "To reset your password, go to Settings > Security > Change Password"
│
▼
Embedding: [0.023, -0.156, 0.891, ..., 0.044] (1536 dimensions)
4. Storage
Chunks and their embeddings are stored in PostgreSQL using the pgvector extension, ready for similarity search.
Processing Status
Monitor processing in the document list:
| Status | Icon | Description |
|---|---|---|
| Uploading | Spinner | File is being uploaded |
| Processing | Spinner | Text extraction and chunking in progress |
| Embedding | Spinner | Generating vector embeddings |
| Ready | Check | Document is fully indexed and searchable |
| Error | Warning | Processing failed |
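A script that uploads documents may want to wait until a document reaches Ready or Error. This guide does not document a status endpoint, so the sketch below takes a caller-supplied `fetch_status` callable (hypothetical) that returns one of the status names from the table. The default timeout of 900 seconds is an assumption.

```python
import time

TERMINAL_STATUSES = {"Ready", "Error"}


def wait_until_done(fetch_status, poll_seconds=2.0, timeout_seconds=900.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status.

    fetch_status is supplied by the caller; how it obtains the status
    (dashboard scrape, API call, ...) is outside this sketch.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} s")
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays.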
Processing time depends on document size:
| Document Size | Approximate Time |
|---|---|
| < 10 pages | 5-15 seconds |
| 10-50 pages | 15-60 seconds |
| 50-200 pages | 1-5 minutes |
| 200+ pages | 5-15 minutes |
Viewing Documents
Document List
The knowledge base dashboard shows all documents:
| Column | Description |
|---|---|
| Name | Document display name |
| Status | Processing status |
| Chunks | Number of chunks generated |
| Size | Original file size |
| Uploaded | Upload timestamp |
Document Details
Click on a document to view:
- Metadata: Name, size, upload date, chunk count
- Chunks: Browse individual chunks with their text content
- Search Preview: Test queries against this specific document
Updating Documents
Replacing a Document
If a document's content changes (e.g., an updated product guide):
- Click on the document in the list
- Click Replace
- Upload the new version
- ThinkFleet re-processes the document, replacing all old chunks
The document ID remains the same, so any agent references are preserved.
Renaming
Click the document name to edit it. The name is included in chunk metadata, which can improve search relevance.
Deleting Documents
- Select one or more documents in the list
- Click Delete
- Confirm the deletion
Deleting a document removes:
- The original file
- All generated chunks
- All vector embeddings
This is permanent and cannot be undone.
Search Testing
Test search quality directly from the knowledge base dashboard:
- Click Search in the knowledge base view
- Enter a test query
- View the returned chunks ranked by relevance
Interpreting Results
Each result shows:
| Field | Description |
|---|---|
| Document | Which document the chunk came from |
| Chunk Text | The content of the matching chunk |
| Relevance Score | Cosine similarity (0.0 to 1.0) |
| Position | Where in the document the chunk is located |
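For reference, cosine similarity between a query embedding and a chunk embedding can be computed as below. This is the textbook formula in plain Python, not ThinkFleet's implementation. In general the value lies between -1.0 and 1.0, and how scores are mapped onto the 0.0 to 1.0 range shown in the dashboard is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1.0 regardless of their length, and orthogonal vectors score 0.0.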
Improving Search Results
If search results aren't satisfactory:
- Adjust chunk size — Smaller chunks give more precise matches
- Increase overlap — Ensures boundary content is captured
- Rewrite document sections — Use clear, specific language
- Add section headers — They help the chunker create meaningful segments
- Remove noise — Delete boilerplate, disclaimers, and repeated content
Best Practices
Document Preparation
- Clean up before uploading — Remove headers, footers, page numbers, and watermarks from PDFs
- Use structured formats — Markdown and well-formatted DOCX files produce better chunks than unstructured text
- One topic per document — "Billing FAQ" is better than "Everything About Our Company"
- Include context — Each section should make sense on its own, since chunks are retrieved independently
Knowledge Base Organization
- Create separate knowledge bases for different domains — "Product Docs", "Legal", "HR Policies"
- Keep documents current — Replace outdated versions promptly
- Monitor no-result queries — They reveal content gaps
- Test regularly — After adding new documents, run representative queries to verify search quality
Scaling
| Knowledge Base Size | Documents | Chunks | Performance |
|---|---|---|---|
| Small | 1-50 | < 5,000 | Instant |
| Medium | 50-500 | 5,000-50,000 | < 1 second |
| Large | 500-5,000 | 50,000-500,000 | 1-3 seconds |
| Very Large | 5,000+ | 500,000+ | Consider partitioning |
For very large knowledge bases, consider splitting into multiple focused knowledge bases and assigning them to relevant agents.
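As a rough aid for deciding when to split, the tiers in the table can be looked up by chunk count. The function name `size_tier` is hypothetical. The table's ranges share their endpoints, so this sketch assumes that each lower bound is inclusive.

```python
from bisect import bisect_right

# Lower bounds (in chunks) of the Medium, Large, and Very Large tiers.
_THRESHOLDS = [5_000, 50_000, 500_000]
_TIERS = ["Small", "Medium", "Large", "Very Large"]


def size_tier(chunk_count: int) -> str:
    """Map a chunk count to the tier name used in the Scaling table."""
    return _TIERS[bisect_right(_THRESHOLDS, chunk_count)]
```

A knowledge base that reports "Very Large" is a candidate for splitting into focused knowledge bases.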