# Knowledge & Data Sources

## 🧠 **Powering Your Agent with Domain Knowledge**

The Knowledge tab is where you transform your AI agent from a general assistant into a domain expert. By connecting relevant data sources, documents, and knowledge bases, you give your agent access to the specific information it needs to provide accurate, contextual responses.

***

## 🎯 **Knowledge Source Types**

### **📄 Document Upload**

Upload files directly to your agent's knowledge base for semantic search and retrieval.

#### **Supported File Types**

* **PDF Documents** (.pdf): Research papers, manuals, reports
* **Word Documents** (.docx): Policies, procedures, guides
* **Text Files** (.txt, .md): Documentation, notes, plain text
* **HTML Files** (.html): Web pages, formatted documentation
* **Spreadsheets** (.xlsx, .xls, .csv): Data tables, catalogs, structured data

#### **Document Processing Features**

* **Intelligent Chunking**: Configurable chunking strategies for optimal knowledge retrieval
* **Text Extraction**: Automatic text extraction from supported formats
* **Space Normalization**: Remove extra whitespace for cleaner text

#### **Best Practices for Document Upload**

```
✅ DO:
- Use clear, descriptive dataset names
- Keep documents focused on specific topics
- Ensure text is selectable (not just images)
- Configure chunking based on document structure

❌ AVOID:
- Uploading duplicate or conflicting information
- Using documents with poor formatting
- Including sensitive personal information without proper access controls
- Overwhelming the knowledge base with too many similar documents
```

### **📊 Table Upload**

Upload structured data in tabular format for precise lookups and semantic search.

#### **Supported Table Formats**

* **CSV Files**: Comma-separated values
* **Excel Files** (.xlsx, .xls): Spreadsheets with single or multiple sheets
* **Manual Entry**: Create tables directly in the interface

#### **Table Configuration**

* **Column Types**: TEXT, NUMBER, INTEGER, BOOLEAN, DATE
* **Semantic Columns**: Mark columns for semantic search indexing
* **Column Sequencing**: Define display order for columns
* **Schema Analysis**: Automatic type detection from uploaded files
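
To make the idea of automatic type detection concrete, the sketch below guesses one of the column types listed above from a column's raw cell strings. The function name `infer_column_type` and the detection rules (ISO-formatted dates only, `true`/`false` booleans) are assumptions for illustration and are not the platform's actual schema analysis.

```python
from datetime import date


def _parses(converter, value: str) -> bool:
    """Return True if `converter(value)` succeeds without a ValueError."""
    try:
        converter(value)
        return True
    except ValueError:
        return False


def infer_column_type(values: list[str]) -> str:
    """Guess a column type from raw cell strings (hypothetical rules)."""
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return "TEXT"  # nothing to inspect, fall back to the most general type
    if all(c.lower() in {"true", "false"} for c in cells):
        return "BOOLEAN"
    if all(_parses(int, c) for c in cells):
        return "INTEGER"
    if all(_parses(float, c) for c in cells):
        return "NUMBER"
    if all(_parses(date.fromisoformat, c) for c in cells):
        return "DATE"
    return "TEXT"


print(infer_column_type(["19.99", "5", "120.50"]))  # NUMBER
```

Checking the narrowest types first (BOOLEAN, INTEGER) before the broader ones keeps a column of whole numbers from being classified as NUMBER. It is still worth reviewing detected types after upload, since mixed columns fall back to TEXT.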

#### **Table Use Cases**

* Product catalogs with specifications and pricing
* Customer records and interaction history
* FAQ databases with questions and answers
* Knowledge articles with categorization
* Configuration and settings databases

### **🗄️ Database Schema (Manual)**

Create database-like schemas for structured knowledge organization.

#### **Database Format Features**

* **Custom Schema Design**: Define your own table structures
* **Column Type Support**: TEXT, NUMBER, INTEGER, BOOLEAN, DATE
* **Manual Data Entry**: Populate data through the interface
* **Structured Queries**: Enable precise data retrieval

***

## ⚙️ **Knowledge Processing Settings**

### **Chunking Strategy**

Control how documents are broken down for processing and retrieval.

#### **Chunking Configuration Options**

* **Chunk Type**: Strategy for dividing content
* **Max Tokens**: Maximum size per chunk (configurable)
* **Separator**: Custom separator for chunk boundaries
* **Remove Extra Spaces**: Clean up whitespace
* **Remove URLs/Emails**: Strip URLs and email addresses from the text
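
The options above can be pictured with a small sketch. It is not the platform's chunker: the function name `chunk_text`, the regular expressions, and the use of whitespace-delimited words as a rough stand-in for tokens are all assumptions made for illustration.

```python
import re

# Simplified patterns (assumptions); real URL/email detection is more involved.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def chunk_text(
    text: str,
    max_tokens: int = 500,
    separator: str = "\n\n",
    remove_extra_spaces: bool = True,
    remove_urls_emails: bool = False,
) -> list[str]:
    """Split on `separator`; cut any piece longer than `max_tokens` words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if remove_urls_emails:
        text = EMAIL_RE.sub("", URL_RE.sub("", text))

    chunks: list[str] = []
    for piece in text.split(separator):
        if remove_extra_spaces:
            piece = " ".join(piece.split())  # collapse runs of whitespace
        words = piece.split()
        if not words:
            continue  # skip empty pieces
        if len(words) <= max_tokens:
            chunks.append(piece)
        else:
            # Oversized piece: cut into windows of at most max_tokens words.
            for start in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


print(chunk_text("Intro   paragraph.\n\none two three four five", max_tokens=2))
```

Choosing a separator that matches the document's structure (for example a blank line between paragraphs) keeps most chunks aligned with natural boundaries, so the word-window fallback only applies to unusually long pieces.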

#### **Best Practices**

```
Document Length Guidelines:
- Short documents (< 5 pages): Larger chunks (1000+ tokens)
- Medium documents (5-50 pages): Medium chunks (500-1000 tokens)
- Long documents (50+ pages): Smaller chunks (200-500 tokens)
- Technical docs with code: Preserve code blocks intact
```
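
The length guidelines above can be written as a small lookup. This is only a sketch: the function name `suggested_chunk_tokens` is hypothetical, and because the guideline lists 50 pages under both the medium and long ranges, treating exactly 50 pages as "long" is an assumption.

```python
from typing import Optional


def suggested_chunk_tokens(page_count: int) -> tuple[int, Optional[int]]:
    """Return a (min, max) token range; max is None when open-ended."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count < 5:
        return (1000, None)   # short documents: larger chunks
    if page_count < 50:
        return (500, 1000)    # medium documents: medium chunks
    return (200, 500)         # long documents: smaller chunks


print(suggested_chunk_tokens(12))  # (500, 1000)
```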

***

## 🔍 **Agent Knowledge Configuration**

Configure how your agent retrieves and uses knowledge during conversations.

### **Retrieval Mode**

#### **Auto Retrieval** (Default: Off)

```
Automatic Knowledge Access:
- Agent automatically retrieves relevant knowledge for every query
- No explicit tool call needed
- Best for: Agents that always need domain knowledge
- Trade-off: May retrieve unnecessary information
```

#### **Manual Tool Call** (Default: On)

```
On-Demand Knowledge Access:
- Agent decides when to retrieve knowledge using available tools
- More control over when knowledge is accessed
- Best for: General-purpose agents that sometimes need knowledge
- Trade-off: Requires agent to recognize when knowledge is needed
```

### **Search Strategy**

#### **Hybrid Search** (Default - Recommended)

```
Combined Approach:
- Semantic search: Understanding context and meaning
- Full-text search: Exact keyword matching
- Best for: Most use cases, balances accuracy and coverage
- Returns: Semantically relevant + keyword-matched results
- Trade-off: Costs extra credits for re-ranking documents
```
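
One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below uses RRF purely as an illustration of how two result lists can be merged; this page does not state which fusion method the platform uses, and the function name and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so chunks ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["a", "b", "c"]   # hypothetical semantic search order
keyword_hits = ["c", "a", "d"]    # hypothetical full-text search order
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))  # ['a', 'c', 'b', 'd']
```

Chunks `a` and `c` appear in both lists and therefore outrank `b` and `d`, which each appear in only one. This is the behavior that makes a hybrid strategy a reasonable default.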

#### **Semantic Search Only**

```
Vector-Based Retrieval:
- Understanding context and intent
- Finding conceptually related information
- Handling synonyms and variations
- Best for: Natural language queries and conceptual searches
```

#### **Full-Text Search Only**

```
Keyword-Based Retrieval:
- Exact term matching
- Faster for specific lookups
- Good for technical terminology and precise queries
- Best for: Known keywords and exact phrase matching
```

### **Retrieval Parameters**

#### **Top K** (Default: 5, Range: 1-10)

```
Number of Knowledge Chunks to Retrieve:
- Higher values: More comprehensive context, higher cost
- Lower values: Focused context, lower cost
- Recommended: 3-5 for most use cases
- Adjust based on: Query complexity and knowledge base size
```

#### **Threshold** (Default: 0.5, Range: 0.0-1.0)

```
Relevance Score Threshold:
- Higher threshold (0.7-1.0): Only highly relevant results
- Medium threshold (0.4-0.7): Balanced relevance
- Lower threshold (0.0-0.4): Include more potential matches
- Recommended: Start at 0.5, adjust based on retrieval quality
```
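
The interaction between Top K and Threshold can be sketched as a filter followed by a cut. The function name `select_chunks`, the inclusive comparison (`score >= threshold`), and applying the threshold before the Top K cut are assumptions for illustration.

```python
def select_chunks(
    scored: list[tuple[str, float]],
    top_k: int = 5,
    threshold: float = 0.5,
) -> list[tuple[str, float]]:
    """Keep chunks scoring at least `threshold`, then return the best `top_k`."""
    if not 1 <= top_k <= 10:
        raise ValueError("top_k must be between 1 and 10")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    kept = [(chunk, score) for chunk, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


candidates = [("a", 0.9), ("b", 0.45), ("c", 0.7), ("d", 0.5)]
print(select_chunks(candidates, top_k=2))  # [('a', 0.9), ('c', 0.7)]
```

With the defaults, chunk `b` is dropped by the threshold regardless of Top K. Lowering the threshold admits more candidates, while Top K caps how many of them reach the agent.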

#### **Query Rewrite** (Default: On)

```
Query Optimization:
- Rewrites user query before knowledge retrieval
- Improves search relevance by clarifying intent
- Expands abbreviations and adds context
- Recommended: Enable for most use cases
```

#### **Rerank** (Default: Off)

```
Result Reranking:
- Post-processes retrieved results for better ordering
- Uses cross-encoder models for more accurate relevance
- Trade-off: Better results but additional latency
- Recommended: Enable for critical accuracy use cases
```
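
The defaults and ranges described in this section can be summarized in one place. The sketch below is not a platform API: the class name `RetrievalConfig`, its field names, and the strategy identifiers are hypothetical, chosen only to mirror the settings documented above.

```python
from dataclasses import dataclass

SEARCH_STRATEGIES = {"hybrid", "semantic", "full_text"}  # hypothetical identifiers


@dataclass(frozen=True)
class RetrievalConfig:
    """Documented defaults for an agent's knowledge retrieval settings."""
    auto_retrieval: bool = False      # Auto Retrieval: off (manual tool call is the default)
    search_strategy: str = "hybrid"   # Hybrid Search: default, recommended
    top_k: int = 5                    # range 1-10
    threshold: float = 0.5            # range 0.0-1.0
    query_rewrite: bool = True        # Query Rewrite: on
    rerank: bool = False              # Rerank: off

    def __post_init__(self) -> None:
        if self.search_strategy not in SEARCH_STRATEGIES:
            raise ValueError(f"unknown search strategy: {self.search_strategy!r}")
        if not 1 <= self.top_k <= 10:
            raise ValueError("top_k must be between 1 and 10")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")


# A stricter configuration for an accuracy-critical agent.
strict = RetrievalConfig(threshold=0.7, rerank=True)
print(strict)
```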

### **Connected Datasets**

#### **Multiple Dataset Support**

* Connect up to **100 datasets** per agent
* Each dataset appears as a searchable knowledge source
* Datasets maintain their own:
  * Name and ID
  * Source type (UPLOAD, MANUAL)
  * Format type (TEXT, TABLE, DATABASE)
  * Processing status

#### **Dataset Information Display**

For each connected dataset, the agent has access to:

* Dataset name (user-friendly identifier)
* Dataset ID (unique identifier)
* Source type (how data was added)
* Status (PENDING, SUCCESS, FAILURE)
* Format type (TEXT, TABLE, DATABASE)

***

## 📊 **Knowledge Analytics & Management**

### **Dataset Status Monitoring**

#### **Processing States**

* **PENDING**: Dataset creation or update in progress
* **SUCCESS**: Dataset ready for use
* **FAILURE**: Processing encountered errors
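
When several datasets are connected, it helps to separate those that are ready from those that still need attention. The sketch below models the three states as an enum. The names `DatasetStatus` and `split_by_readiness` are hypothetical, and the status values would come from wherever you read dataset status, which is not shown here.

```python
from enum import Enum


class DatasetStatus(Enum):
    PENDING = "PENDING"   # creation or update in progress
    SUCCESS = "SUCCESS"   # ready for use
    FAILURE = "FAILURE"   # processing encountered errors


def split_by_readiness(
    statuses: dict[str, DatasetStatus],
) -> tuple[list[str], list[str]]:
    """Return (ready, not_ready) dataset names, each sorted alphabetically."""
    ready = sorted(n for n, s in statuses.items() if s is DatasetStatus.SUCCESS)
    not_ready = sorted(n for n, s in statuses.items() if s is not DatasetStatus.SUCCESS)
    return ready, not_ready


ready, not_ready = split_by_readiness({
    "Product Features": DatasetStatus.SUCCESS,
    "Pricing & Plans": DatasetStatus.PENDING,
    "Common Issues": DatasetStatus.FAILURE,
})
print(ready)      # ['Product Features']
print(not_ready)  # ['Common Issues', 'Pricing & Plans']
```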

#### **Progress Tracking**

* Monitor document import progress
* Track embedding generation status
* View chunk processing metrics

### **Embedding Updates**

#### **Manual Embedding Refresh**

```
When to Update Embeddings:
- After modifying dataset content
- After bulk row updates in tables
- To incorporate new document versions
```

***

## 🔧 **Knowledge Configuration Best Practices**

### **Initial Setup Process**

1. **Audit Existing Information**: Catalog what knowledge you have
2. **Choose Dataset Format**: TEXT for documents, TABLE for structured data, DATABASE for manually defined schemas
3. **Configure Processing**: Set chunking and parsing options
4. **Select Embedding Model**: Choose based on language and domain
5. **Test Retrieval**: Verify agent responses with sample queries

### **Dataset Organization Strategies**

#### **By Topic**

```
Product Knowledge:
├── Product Features (TEXT dataset)
├── Technical Specifications (TABLE dataset)
├── Pricing & Plans (TABLE dataset)
└── Troubleshooting Guides (TEXT dataset)

Customer Support:
├── Common Issues (TEXT dataset)
├── Resolution Procedures (TEXT dataset)
└── Product Updates (TEXT dataset)
```

#### **By Source Type**

```
Documentation:
├── User Manual.pdf → TEXT dataset
├── API Reference.pdf → TEXT dataset
└── FAQ.csv → TABLE dataset

Internal Knowledge:
├── Training Materials → TEXT dataset
├── Process Documents → TEXT dataset
└── Policy Database → DATABASE dataset
```

### **Optimization Guidelines**

#### **Document Preparation**

```
Before Upload:
- Remove duplicate content across documents
- Ensure consistent formatting
- Split very large documents into logical sections
- Use descriptive filenames
- Remove or redact sensitive information
```

#### **Table Design**

```
Column Configuration:
- Mark relevant columns as "semantic" for search
- Use appropriate data types (TEXT, NUMBER, DATE, etc.)
- Include descriptive column names
- Maintain data consistency across rows
- Consider creating separate tables for different entity types
```

#### **Retrieval Tuning**

```
Adjustment Process:
1. Start with defaults (Hybrid, Top K=5, Threshold=0.5)
2. Test with representative queries
3. If too few results: Lower threshold, increase Top K
4. If too many irrelevant results: Raise threshold, enable rerank
5. If missing semantic matches: Switch to Semantic search
6. If missing exact matches: Switch to Full-text search
```
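
The adjustment process above can be read as a simple decision rule. In the sketch below, the function name `adjust_retrieval`, the symptom labels, the settings keys, and the step sizes (0.1 for the threshold, 1 for Top K) are all assumptions for illustration. The guide itself does not prescribe how far to move each setting.

```python
def adjust_retrieval(settings: dict, symptom: str) -> dict:
    """Return a copy of `settings` adjusted for one observed symptom."""
    new = dict(settings)  # leave the caller's settings untouched
    if symptom == "too_few_results":
        new["threshold"] = max(0.0, round(new["threshold"] - 0.1, 2))
        new["top_k"] = min(10, new["top_k"] + 1)
    elif symptom == "too_many_irrelevant":
        new["threshold"] = min(1.0, round(new["threshold"] + 0.1, 2))
        new["rerank"] = True
    elif symptom == "missing_semantic_matches":
        new["search_strategy"] = "semantic"
    elif symptom == "missing_exact_matches":
        new["search_strategy"] = "full_text"
    else:
        raise ValueError(f"unknown symptom: {symptom!r}")
    return new


defaults = {"search_strategy": "hybrid", "top_k": 5, "threshold": 0.5, "rerank": False}
print(adjust_retrieval(defaults, "too_few_results"))
```

Changing one group of settings at a time and re-running the same representative queries makes it easier to tell which adjustment actually improved retrieval.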

***

## 🚀 **Advanced Features**

### **Multi-Dataset Retrieval**

When connecting multiple datasets to an agent:

* The agent can search across all connected datasets
* Results are merged and ranked by relevance
* Each result includes source dataset information
* This is useful for comprehensive knowledge coverage
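
Merging results from several datasets can be sketched as follows. The `Hit` type and `merge_hits` function are hypothetical, and the sketch assumes relevance scores are directly comparable across datasets, which may not hold in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    dataset_id: str   # which connected dataset the chunk came from
    chunk: str
    score: float


def merge_hits(
    per_dataset: dict[str, list[tuple[str, float]]],
    top_k: int = 5,
) -> list[Hit]:
    """Flatten per-dataset results, rank by score, and keep the best `top_k`."""
    hits = [
        Hit(dataset_id, chunk, score)
        for dataset_id, results in per_dataset.items()
        for chunk, score in results
    ]
    hits.sort(key=lambda hit: hit.score, reverse=True)
    return hits[:top_k]


merged = merge_hits({
    "product-features": [("Supports SSO", 0.82), ("Dark mode", 0.41)],
    "pricing": [("Pro plan is billed monthly", 0.77)],
}, top_k=2)
for hit in merged:
    print(hit.dataset_id, hit.score)
```

Keeping the dataset ID on every hit is what lets the agent attribute each piece of retrieved knowledge to its source.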

### **Semantic Column Configuration**

For TABLE and DATABASE formats:

* Mark specific columns for semantic search indexing
* Non-semantic columns remain queryable but are not embedded
* Reduces embedding costs for large tables
* Improves search focus on relevant fields
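
To illustrate why marking only some columns as semantic reduces embedding cost, the sketch below builds the text that would be embedded for a row from its semantic columns alone. The function name `embedding_text` and the `column: value` layout are assumptions. How the platform actually serializes rows is not documented here.

```python
def embedding_text(row: dict[str, object], semantic_columns: list[str]) -> str:
    """Join the non-empty semantic columns of a row into one string."""
    parts = [
        f"{column}: {row[column]}"
        for column in semantic_columns
        if row.get(column) not in (None, "")
    ]
    return "\n".join(parts)


row = {
    "sku": "A-100",
    "name": "Trail Shoe",
    "description": "Lightweight running shoe",
    "price": 89.0,
}
# Only `name` and `description` are embedded; `sku` and `price` stay queryable.
print(embedding_text(row, ["name", "description"]))
```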

***

## 🎯 **Knowledge Integration Checklist**

Before activating your agent's knowledge base:

* [ ] **Dataset Created**: Upload documents or create tables
* [ ] **Processing Complete**: Verify dataset status is SUCCESS
* [ ] **Embeddings Generated**: Check embedding progress is 100%
* [ ] **Search Strategy Configured**: Choose Hybrid, Semantic, or Full-text
* [ ] **Retrieval Parameters Set**: Configure Top K, threshold, rerank
* [ ] **Retrieval Mode Selected**: Auto or manual tool call
* [ ] **Test Queries Run**: Verify knowledge retrieval with sample questions
* [ ] **Access Controls Set**: Ensure appropriate workspace permissions

***

Your agent's knowledge is its competitive advantage. Invest in building a comprehensive, well-organized knowledge base that enables intelligent, accurate responses.

