Managing Documents

Upload, process, and manage documents in the ThinkFleet knowledge base.

This guide covers everything you need to know about adding documents to your knowledge base, monitoring processing, and maintaining document quality.

Uploading Documents

From the Dashboard

  1. Navigate to Knowledge Base in the sidebar
  2. Select your knowledge base
  3. Click Upload Documents
  4. Drag and drop files or click to browse
  5. Select one or more files (up to 50 MB per file)
  6. Click Upload

Supported Formats

| Format | Max Size | Notes |
|--------|----------|-------|
| PDF | 50 MB | Text-based PDFs; scanned PDFs are supported via OCR |
| DOCX | 50 MB | Microsoft Word documents |
| TXT | 10 MB | Plain text files |
| MD | 10 MB | Markdown files (headers used as chunk boundaries) |
| HTML | 10 MB | HTML pages (tags stripped during extraction) |
| CSV | 50 MB | Each row becomes a searchable chunk |

Bulk Upload

Upload multiple files at once by selecting them in the file picker or dragging a batch onto the upload area. Each file is processed independently.

Via API

curl -X POST \
  https://your-instance.thinkfleet.com/api/v1/projects/{projectId}/knowledge-bases/{kbId}/documents \
  -H "Authorization: Bearer {apiKey}" \
  -F "file=@/path/to/document.pdf" \
  -F "name=Product Guide v2"
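For scripted uploads, the same request can be issued from Python using only the standard library. This is an illustrative sketch, not an official SDK: the endpoint and the `file`/`name` form fields come from the curl example above, while `build_multipart` and `upload_document` are hypothetical helper names.

```python
import uuid
import urllib.request

def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Assemble a multipart/form-data body by hand; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{key}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
    )
    parts.append(data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def upload_document(base_url, project_id, kb_id, api_key, path, name):
    """POST a file to the documents endpoint shown in the curl example."""
    with open(path, "rb") as f:
        body, content_type = build_multipart({"name": name}, "file", path, f.read())
    req = urllib.request.Request(
        f"{base_url}/api/v1/projects/{project_id}/knowledge-bases/{kb_id}/documents",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

In practice a client library such as `requests` makes the multipart encoding a one-liner; the manual version is shown here so the sketch has no dependencies.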

Document Processing

After upload, documents go through an automated processing pipeline.

Processing Steps

1. Text Extraction

The parser extracts raw text from the uploaded file:

  • PDF: Extracts text layer; falls back to OCR for scanned pages
  • DOCX: Extracts text preserving paragraph structure
  • HTML: Strips tags, extracts visible text content
  • CSV: Converts rows to structured text entries
  • TXT/MD: Used directly
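The HTML step above can be sketched with the standard-library `HTMLParser`. This mirrors the described behavior (strip tags, keep visible text) and is not ThinkFleet's actual extractor:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text content, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(html: str) -> str:
    """Strip tags and collapse whitespace, keeping only visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())
```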

2. Chunking

The extracted text is split into chunks based on your knowledge base settings:

Document: "Product Guide" (5,000 tokens)
    │
    ▼
Chunk 1: tokens 1-500    (Introduction)
Chunk 2: tokens 451-950  (overlap: 50)
Chunk 3: tokens 901-1400 (overlap: 50)
...
Chunk 11: tokens 4501-5000

The chunking engine respects natural boundaries:

  • Paragraph breaks
  • Markdown headers
  • Sentence endings
  • List items

This means chunks rarely split mid-sentence.
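The sliding-window arithmetic in the diagram can be sketched as follows (500-token chunks, 50-token overlap). This shows only the windowing; the real chunker additionally snaps to the natural boundaries listed above. Note the diagram numbers tokens from 1, while this sketch uses 0-indexed, end-exclusive spans:

```python
def chunk_spans(total_tokens: int, chunk_size: int = 500, overlap: int = 50):
    """Return (start, end) token spans, 0-indexed with end exclusive.

    Consecutive spans share `overlap` tokens, so each new chunk
    advances by chunk_size - overlap.
    """
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < total_tokens:
        spans.append((start, min(start + chunk_size, total_tokens)))
        if start + chunk_size >= total_tokens:
            break  # this chunk already reaches the end of the document
        start += stride
    return spans
```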

3. Embedding Generation

Each chunk is converted to a vector embedding — a numerical representation that captures its semantic meaning.

Chunk: "To reset your password, go to Settings > Security > Change Password"
    │
    ▼
Embedding: [0.023, -0.156, 0.891, ..., 0.044]  (1536 dimensions)

4. Storage

Chunks and their embeddings are stored in pgvector, ready for similarity search.

Processing Status

Monitor processing in the document list:

| Status | Icon | Description |
|--------|------|-------------|
| Uploading | Spinner | File is being uploaded |
| Processing | Spinner | Text extraction and chunking in progress |
| Embedding | Spinner | Generating vector embeddings |
| Ready | Check | Document is fully indexed and searchable |
| Error | Warning | Processing failed |

Processing time depends on document size:

| Document Size | Approximate Time |
|---------------|------------------|
| < 10 pages | 5-15 seconds |
| 10-50 pages | 15-60 seconds |
| 50-200 pages | 1-5 minutes |
| 200+ pages | 5-15 minutes |
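A client can poll until the document reaches a terminal status. The status strings come from the table above; how you fetch them is an assumption (the docs show only the upload route), so the sketch takes any `fetch_status` callable, e.g. one that GETs the document resource:

```python
import time

TERMINAL_STATUSES = {"Ready", "Error"}

def wait_until_ready(fetch_status, interval_s: float = 5.0, timeout_s: float = 900.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns 'Ready' or 'Error', or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"document still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters are injectable only so the loop is easy to test; callers can ignore them.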

Viewing Documents

Document List

The knowledge base dashboard shows all documents:

| Column | Description |
|--------|-------------|
| Name | Document display name |
| Status | Processing status |
| Chunks | Number of chunks generated |
| Size | Original file size |
| Uploaded | Upload timestamp |

Document Details

Click on a document to view:

  • Metadata: Name, size, upload date, chunk count
  • Chunks: Browse individual chunks with their text content
  • Search Preview: Test queries against this specific document

Updating Documents

Replacing a Document

If a document's content changes (e.g., an updated product guide):

  1. Click on the document in the list
  2. Click Replace
  3. Upload the new version
  4. ThinkFleet re-processes the document, replacing all old chunks

The document ID remains the same, so any agent references are preserved.

Renaming

Click the document name to edit it. The name is included in chunk metadata, which can improve search relevance.

Deleting Documents

  1. Select one or more documents in the list
  2. Click Delete
  3. Confirm the deletion

Deleting a document removes:

  • The original file
  • All generated chunks
  • All vector embeddings

This is permanent and cannot be undone.

Search Testing

Test search quality directly from the knowledge base dashboard:

  1. Click Search in the knowledge base view
  2. Enter a test query
  3. View the returned chunks ranked by relevance

Interpreting Results

Each result shows:

| Field | Description |
|-------|-------------|
| Document | Which document the chunk came from |
| Chunk Text | The content of the matching chunk |
| Relevance Score | Cosine similarity (0.0 to 1.0) |
| Position | Where in the document the chunk is located |
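The relevance score is the cosine similarity between the query embedding and each chunk embedding. A minimal sketch of the computation (cosine similarity ranges from -1 to 1 in general; scores for embedding pairs typically fall in the 0.0-1.0 range shown in the dashboard):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

In production this comparison runs inside pgvector rather than in application code.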

Improving Search Results

If search results aren't satisfactory:

  1. Adjust chunk size — Smaller chunks give more precise matches
  2. Increase overlap — Ensures boundary content is captured
  3. Rewrite document sections — Use clear, specific language
  4. Add section headers — They help the chunker create meaningful segments
  5. Remove noise — Delete boilerplate, disclaimers, and repeated content

Best Practices

Document Preparation

  1. Clean up before uploading — Remove headers, footers, page numbers, and watermarks from PDFs
  2. Use structured formats — Markdown and well-formatted DOCX files produce better chunks than unstructured text
  3. One topic per document — "Billing FAQ" is better than "Everything About Our Company"
  4. Include context — Each section should make sense on its own, since chunks are retrieved independently

Knowledge Base Organization

  1. Create separate knowledge bases for different domains — "Product Docs", "Legal", "HR Policies"
  2. Keep documents current — Replace outdated versions promptly
  3. Monitor no-result queries — They reveal content gaps
  4. Test regularly — After adding new documents, run representative queries to verify search quality

Scaling

| Knowledge Base Size | Documents | Chunks | Performance |
|---------------------|-----------|--------|-------------|
| Small | 1-50 | < 5,000 | Instant |
| Medium | 50-500 | 5,000-50,000 | < 1 second |
| Large | 500-5,000 | 50,000-500,000 | 1-3 seconds |
| Very Large | 5,000+ | 500,000+ | Consider partitioning |

For very large knowledge bases, consider splitting into multiple focused knowledge bases and assigning them to relevant agents.

Next Steps