Managing Documents

Upload, process, and manage documents in the ThinkFleet knowledge base.

This guide covers everything you need to know about adding documents to your knowledge base, monitoring processing, and maintaining document quality.

Uploading Documents

From the Dashboard

  1. Navigate to Knowledge Base in the sidebar
  2. Select your knowledge base
  3. Click Upload Documents
  4. Drag and drop files or click to browse
  5. Select one or more files (up to 50 MB per file)
  6. Click Upload

Supported Formats

| Format | Max Size | Notes |
|--------|----------|-------|
| PDF | 50 MB | Text-based PDFs; scanned PDFs are supported via OCR |
| DOCX | 50 MB | Microsoft Word documents |
| TXT | 10 MB | Plain text files |
| MD | 10 MB | Markdown files (headers used as chunk boundaries) |
| HTML | 10 MB | HTML pages (tags stripped during extraction) |
| CSV | 50 MB | Each row becomes a searchable chunk |

Bulk Upload

Upload multiple files at once by selecting them in the file picker or dragging a batch onto the upload area. Each file is processed independently.

Via API

curl -X POST \
  https://your-instance.thinkfleet.com/api/v1/projects/{projectId}/knowledge-bases/{kbId}/documents \
  -H "Authorization: Bearer {apiKey}" \
  -F "file=@/path/to/document.pdf" \
  -F "name=Product Guide v2"
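For scripted uploads, the same request can be issued from Python using only the standard library. This is an illustrative sketch, not an official SDK: the endpoint and the `file`/`name` form fields come from the curl example above, while `build_multipart` and `upload_document` are hypothetical helper names.

```python
import uuid
import urllib.request

def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Assemble a multipart/form-data body by hand; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{key}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
    )
    parts.append(data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def upload_document(base_url, project_id, kb_id, api_key, path, name):
    """POST a file to the documents endpoint shown in the curl example."""
    with open(path, "rb") as f:
        body, content_type = build_multipart({"name": name}, "file", path, f.read())
    req = urllib.request.Request(
        f"{base_url}/api/v1/projects/{project_id}/knowledge-bases/{kb_id}/documents",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

In practice a client library such as `requests` makes the multipart encoding a one-liner; the manual version is shown here so the sketch has no dependencies.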

Document Processing

After upload, documents go through an automated processing pipeline.

Processing Steps

1. Text Extraction

The parser extracts raw text from the uploaded file:

  • PDF: Extracts text layer; falls back to OCR for scanned pages
  • DOCX: Extracts text preserving paragraph structure
  • HTML: Strips tags, extracts visible text content
  • CSV: Converts rows to structured text entries
  • TXT/MD: Used directly
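The HTML step above can be sketched with the standard-library `HTMLParser`. This mirrors the described behavior (strip tags, keep visible text) and is not ThinkFleet's actual extractor:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text content, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(html: str) -> str:
    """Strip tags and collapse whitespace, keeping only visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())
```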

2. Chunking

The extracted text is split into chunks based on your knowledge base settings:

Document: "Product Guide" (5,000 tokens)
    │
    ▼
Chunk 1: tokens 1-500    (Introduction)
Chunk 2: tokens 451-950  (overlap: 50)
Chunk 3: tokens 901-1400 (overlap: 50)
...
Chunk 11: tokens 4501-5000

The chunking engine respects natural boundaries:

  • Paragraph breaks
  • Markdown headers
  • Sentence endings
  • List items

This means chunks rarely split mid-sentence.
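The sliding-window arithmetic in the diagram can be sketched as follows (500-token chunks, 50-token overlap). This shows only the windowing; the real chunker additionally snaps to the natural boundaries listed above. Note the diagram numbers tokens from 1, while this sketch uses 0-indexed, end-exclusive spans:

```python
def chunk_spans(total_tokens: int, chunk_size: int = 500, overlap: int = 50):
    """Return (start, end) token spans, 0-indexed with end exclusive.

    Consecutive spans share `overlap` tokens, so each new chunk
    advances by chunk_size - overlap.
    """
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < total_tokens:
        spans.append((start, min(start + chunk_size, total_tokens)))
        if start + chunk_size >= total_tokens:
            break  # this chunk already reaches the end of the document
        start += stride
    return spans
```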

3. Embedding Generation

Each chunk is converted to a vector embedding — a numerical representation that captures its semantic meaning.

Chunk: "To reset your password, go to Settings > Security > Change Password"
    │
    ▼
Embedding: [0.023, -0.156, 0.891, ..., 0.044]  (1536 dimensions)

4. Storage

Chunks and their embeddings are stored in pgvector, ready for similarity search.

Processing Status

Monitor processing in the document list:

| Status | Icon | Description |
|--------|------|-------------|
| Uploading | Spinner | File is being uploaded |
| Processing | Spinner | Text extraction and chunking in progress |
| Embedding | Spinner | Generating vector embeddings |
| Ready | Check | Document is fully indexed and searchable |
| Error | Warning | Processing failed |

Processing time depends on document size:

| Document Size | Approximate Time |
|---------------|------------------|
| < 10 pages | 5-15 seconds |
| 10-50 pages | 15-60 seconds |
| 50-200 pages | 1-5 minutes |
| 200+ pages | 5-15 minutes |
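A client can poll until the document reaches a terminal status. The status strings come from the table above; how you fetch them is an assumption (the docs show only the upload route), so the sketch takes any `fetch_status` callable, e.g. one that GETs the document resource:

```python
import time

TERMINAL_STATUSES = {"Ready", "Error"}

def wait_until_ready(fetch_status, interval_s: float = 5.0, timeout_s: float = 900.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns 'Ready' or 'Error', or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"document still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters are injectable only so the loop is easy to test; callers can ignore them.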

Viewing Documents

Document List

The knowledge base dashboard shows all documents:

| Column | Description |
|--------|-------------|
| Name | Document display name |
| Status | Processing status |
| Chunks | Number of chunks generated |
| Size | Original file size |
| Uploaded | Upload timestamp |

Document Details

Click on a document to view:

  • Metadata: Name, size, upload date, chunk count
  • Chunks: Browse individual chunks with their text content
  • Search Preview: Test queries against this specific document

Updating Documents

Replacing a Document

If a document's content changes (e.g., an updated product guide):

  1. Click on the document in the list
  2. Click Replace
  3. Upload the new version
  4. ThinkFleet re-processes the document, replacing all old chunks

The document ID remains the same, so any agent references are preserved.

Renaming

Click the document name to edit it. The name is included in chunk metadata, which can improve search relevance.

Deleting Documents

  1. Select one or more documents in the list
  2. Click Delete
  3. Confirm the deletion

Deleting a document removes:

  • The original file
  • All generated chunks
  • All vector embeddings

This is permanent and cannot be undone.

Search Testing

Test search quality directly from the knowledge base dashboard:

  1. Click Search in the knowledge base view
  2. Enter a test query
  3. View the returned chunks ranked by relevance

Interpreting Results

Each result shows:

| Field | Description |
|-------|-------------|
| Document | Which document the chunk came from |
| Chunk Text | The content of the matching chunk |
| Relevance Score | Cosine similarity (0.0 to 1.0) |
| Position | Where in the document the chunk is located |
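The relevance score is the cosine similarity between the query embedding and each chunk embedding. A minimal sketch of the computation (cosine similarity ranges from -1 to 1 in general; scores for embedding pairs typically fall in the 0.0-1.0 range shown in the dashboard):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

In production this comparison runs inside pgvector rather than in application code.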

Improving Search Results

If search results aren't satisfactory:

  1. Adjust chunk size — Smaller chunks give more precise matches
  2. Increase overlap — Ensures boundary content is captured
  3. Rewrite document sections — Use clear, specific language
  4. Add section headers — They help the chunker create meaningful segments
  5. Remove noise — Delete boilerplate, disclaimers, and repeated content

Best Practices

Document Preparation

  1. Clean up before uploading — Remove headers, footers, page numbers, and watermarks from PDFs
  2. Use structured formats — Markdown and well-formatted DOCX files produce better chunks than unstructured text
  3. One topic per document — "Billing FAQ" is better than "Everything About Our Company"
  4. Include context — Each section should make sense on its own, since chunks are retrieved independently

Knowledge Base Organization

  1. Create separate knowledge bases for different domains — "Product Docs", "Legal", "HR Policies"
  2. Keep documents current — Replace outdated versions promptly
  3. Monitor no-result queries — They reveal content gaps
  4. Test regularly — After adding new documents, run representative queries to verify search quality

Scaling

| Knowledge Base Size | Documents | Chunks | Performance |
|---------------------|-----------|--------|-------------|
| Small | 1-50 | < 5,000 | Instant |
| Medium | 50-500 | 5,000-50,000 | < 1 second |
| Large | 500-5,000 | 50,000-500,000 | 1-3 seconds |
| Very Large | 5,000+ | 500,000+ | Consider partitioning |

For very large knowledge bases, consider splitting into multiple focused knowledge bases and assigning them to relevant agents.

Next Steps