# Knowledge & Data Sources

## 🧠 **Powering Your Agent with Domain Knowledge**

The Knowledge tab is where you transform your AI agent from a general assistant into a domain expert. By connecting relevant data sources, documents, and knowledge bases, you give your agent access to the specific information it needs to provide accurate, contextual responses.

***

## 🎯 **Knowledge Source Types**

### **📄 Document Upload**

Upload files directly to your agent's knowledge base for semantic search and retrieval.

#### **Supported File Types**

* **PDF Documents**: (.pdf) Research papers, manuals, reports
* **Word Documents**: (.docx) Policies, procedures, guides
* **Text Files**: (.txt, .md) Documentation, notes, plain text
* **HTML Files**: (.html) Web pages, formatted documentation
* **Spreadsheets**: (.xlsx, .xls, .csv) Data tables, catalogs, structured data

#### **Document Processing Features**

* **Intelligent Chunking**: Configurable chunking strategies for optimal knowledge retrieval
* **Text Extraction**: Automatic text extraction from supported formats
* **Space Normalization**: Remove extra whitespace for cleaner text

#### **Best Practices for Document Upload**

```
✅ DO:
- Use clear, descriptive dataset names
- Keep documents focused on specific topics
- Ensure text is selectable (not just images)
- Configure chunking based on document structure

❌ AVOID:
- Uploading duplicate or conflicting information
- Using documents with poor formatting
- Including sensitive personal information without proper access controls
- Overwhelming the knowledge base with too many similar documents
```

### **📊 Table Upload**

Upload structured data in tabular format for precise lookups and semantic search.

#### **Supported Table Formats**

* **CSV Files**: Comma-separated values
* **Excel Files**: (.xlsx, .xls) Spreadsheets with single or multiple sheets
* **Manual Entry**: Create tables directly in the interface

#### **Table Configuration**

* **Column Types**: TEXT, NUMBER, INTEGER, BOOLEAN, DATE
* **Semantic Columns**: Mark columns for semantic search indexing
* **Column Sequencing**: Define display order for columns
* **Schema Analysis**: Automatic type detection from uploaded files
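
As a rough illustration of what automatic type detection can involve, the sketch below guesses one of the five column types from the string values of an uploaded column. The `infer_column_type` helper and its rules (which boolean literals count, ISO-formatted dates only) are assumptions made for this example, not the platform's actual detection logic.

```python
from datetime import date

BOOLEAN_LITERALS = {"true", "false"}  # assumed set of accepted boolean spellings


def _parses(parser, value: str) -> bool:
    """Return True if parser(value) succeeds without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_column_type(values: list[str]) -> str:
    """Guess a column type from its non-empty string values (hypothetical rules)."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return "TEXT"
    if all(v.lower() in BOOLEAN_LITERALS for v in cleaned):
        return "BOOLEAN"
    if all(_parses(int, v) for v in cleaned):
        return "INTEGER"
    if all(_parses(float, v) for v in cleaned):
        return "NUMBER"
    if all(_parses(date.fromisoformat, v) for v in cleaned):
        return "DATE"
    # Mixed or free-form content falls back to TEXT.
    return "TEXT"
```

Checking the narrowest type first (BOOLEAN, then INTEGER, then NUMBER) keeps a column of whole numbers from being widened unnecessarily. Review detected types after upload and correct any that do not match your data.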

#### **Table Use Cases**

* Product catalogs with specifications and pricing
* Customer records and interaction history
* FAQ databases with questions and answers
* Knowledge articles with categorization
* Configuration and settings databases

### **🗄️ Database Schema (Manual)**

Create database-like schemas for structured knowledge organization.

#### **Database Format Features**

* **Custom Schema Design**: Define your own table structures
* **Column Type Support**: TEXT, NUMBER, INTEGER, BOOLEAN, DATE
* **Manual Data Entry**: Populate data through the interface
* **Structured Queries**: Enable precise data retrieval

***

## ⚙️ **Knowledge Processing Settings**

### **Chunking Strategy**

Control how documents are broken down for processing and retrieval.

#### **Chunking Configuration Options**

* **Chunk Type**: Strategy for dividing content
* **Max Tokens**: Maximum size per chunk (configurable)
* **Separator**: Custom separator for chunk boundaries
* **Remove Extra Spaces**: Clean up whitespace
* **Remove URLs/Emails**: Strip URLs and email addresses from the text
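
To make the interaction between these options more concrete, here is a minimal sketch of a cleaning and chunking pass. It assumes tokens can be approximated by whitespace-separated words (real tokenizers count differently), and the `clean_text` and `chunk_text` helpers are hypothetical names, not part of the product.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def clean_text(text: str, remove_extra_spaces: bool = True,
               remove_urls_emails: bool = False) -> str:
    """Apply the optional cleanup steps before chunking."""
    if remove_urls_emails:
        text = URL_RE.sub("", text)
        text = EMAIL_RE.sub("", text)
    if remove_extra_spaces:
        text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
        text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text.strip()


def chunk_text(text: str, max_tokens: int = 500, separator: str = "\n\n") -> list[str]:
    """Split on the separator, then pack pieces into chunks of at most max_tokens words."""
    chunks: list[str] = []
    current: list[str] = []
    for piece in text.split(separator):
        words = piece.split()
        # A single piece longer than max_tokens is cut on word boundaries.
        while len(words) > max_tokens:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        # Start a new chunk when the next piece would overflow the current one.
        if len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on the separator first keeps paragraphs together where possible, so a chunk boundary rarely falls mid-sentence unless a single paragraph exceeds the token limit.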

#### **Best Practices**

```
Document Length Guidelines:
- Short documents (< 5 pages): Larger chunks (1000+ tokens)
- Medium documents (5-50 pages): Medium chunks (500-1000 tokens)
- Long documents (> 50 pages): Smaller chunks (200-500 tokens)
- Technical docs with code: Preserve code blocks intact
```
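
The length guidelines can be written as a small lookup. This is only a sketch: `suggest_chunk_range` is a hypothetical helper, and treating a document of exactly 50 pages as "medium" is an assumption.

```python
from typing import Optional, Tuple


def suggest_chunk_range(page_count: int) -> Tuple[int, Optional[int]]:
    """Return (min_tokens, max_tokens); None means no stated upper bound."""
    if page_count < 5:
        return (1000, None)   # short documents: larger chunks
    if page_count <= 50:
        return (500, 1000)    # medium documents: medium chunks
    return (200, 500)         # long documents: smaller chunks
```

Treat the result as a starting point and adjust it after testing retrieval on your own documents.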

***

## 🔍 **Agent Knowledge Configuration**

Configure how your agent retrieves and uses knowledge during conversations.

### **Retrieval Mode**

#### **Auto Retrieval** (Default: Off)

```
Automatic Knowledge Access:
- Agent automatically retrieves relevant knowledge for every query
- No explicit tool call needed
- Best for: Agents that always need domain knowledge
- Trade-off: May retrieve unnecessary information
```

#### **Manual Tool Call** (Default: On)

```
On-Demand Knowledge Access:
- Agent decides when to retrieve knowledge using available tools
- More control over when knowledge is accessed
- Best for: General-purpose agents that sometimes need knowledge
- Trade-off: Requires agent to recognize when knowledge is needed
```

### **Search Strategy**

#### **Hybrid Search** (Default - Recommended)

```
Combined Approach:
- Semantic search: Understanding context and meaning
- Full-text search: Exact keyword matching
- Best for: Most use cases, balances accuracy and coverage
- Returns: Semantically relevant + keyword-matched results
- Trade-off: Costs extra credits for re-ranking documents
```
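
One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below uses RRF purely as an illustration of the idea; this guide does not state which fusion method the platform applies, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute more to the fused score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


semantic_hits = ["c2", "c7", "c1"]  # ordered by embedding similarity
keyword_hits = ["c7", "c4", "c2"]   # ordered by full-text match score
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
# fused == ["c7", "c2", "c4", "c1"]
```

Chunks found by both searches (`c7`, `c2`) rise to the top, while chunks found by only one search still appear lower in the list. That is the balance of accuracy and coverage described above.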

#### **Semantic Search Only**

```
Vector-Based Retrieval:
- Understanding context and intent
- Finding conceptually related information
- Handling synonyms and variations
- Best for: Natural language queries and conceptual searches
```

#### **Full-Text Search Only**

```
Keyword-Based Retrieval:
- Exact term matching
- Faster for specific lookups
- Good for technical terminology and precise queries
- Best for: Known keywords and exact phrase matching
```

### **Retrieval Parameters**

#### **Top K** (Default: 5, Range: 1-10)

```
Number of Knowledge Chunks to Retrieve:
- Higher values: More comprehensive context, higher cost
- Lower values: Focused context, lower cost
- Recommended: 3-5 for most use cases
- Adjust based on: Query complexity and knowledge base size
```

#### **Threshold** (Default: 0.5, Range: 0.0-1.0)

```
Relevance Score Threshold:
- Higher threshold (0.7-1.0): Only highly relevant results
- Medium threshold (0.4-0.7): Balanced relevance
- Lower threshold (0.0-0.4): Include more potential matches
- Recommended: Start at 0.5, adjust based on retrieval quality
```
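
Top K and Threshold work together: the threshold removes weak matches, and Top K caps how many of the remaining chunks reach the agent. A minimal sketch follows, assuming scores are normalized to 0.0-1.0, that the threshold is inclusive, and that it is applied before the Top K cut; `select_chunks` is a hypothetical helper.

```python
def select_chunks(scored: list[tuple[str, float]], top_k: int = 5,
                  threshold: float = 0.5) -> list[tuple[str, float]]:
    """Keep chunks scoring at or above the threshold, then return the best top_k."""
    if not 1 <= top_k <= 10:
        raise ValueError("top_k must be between 1 and 10")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


scored = [("a", 0.91), ("b", 0.42), ("c", 0.67), ("d", 0.5)]
best = select_chunks(scored, top_k=2)
# best == [("a", 0.91), ("c", 0.67)]
```

With the defaults, chunk `b` is dropped by the threshold and chunk `d` by the Top K cut. Lowering the threshold to 0.4 and raising Top K to 4 would return all four chunks.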

#### **Query Rewrite** (Default: On)

```
Query Optimization:
- Rewrites user query before knowledge retrieval
- Improves search relevance by clarifying intent
- Expands abbreviations and adds context
- Recommended: Enable for most use cases
```

#### **Rerank** (Default: Off)

```
Result Reranking:
- Post-processes retrieved results for better ordering
- Uses cross-encoder models for more accurate relevance
- Trade-off: Better results but additional latency
- Recommended: Enable for critical accuracy use cases
```

### **Connected Datasets**

#### **Multiple Dataset Support**

* Connect up to **100 datasets** per agent
* Each dataset appears as a searchable knowledge source
* Datasets maintain their own:
  * Name and ID
  * Source type (UPLOAD, MANUAL)
  * Format type (TEXT, TABLE, DATABASE)
  * Processing status

#### **Dataset Information Display**

For each connected dataset, the agent has access to:

* Dataset name (user-friendly identifier)
* Dataset ID (unique identifier)
* Source type (how data was added)
* Status (PENDING, SUCCESS, FAILURE)
* Format type (TEXT, TABLE, DATABASE)

***

## 📊 **Knowledge Analytics & Management**

### **Dataset Status Monitoring**

#### **Processing States**

* **PENDING**: Dataset creation or update in progress
* **SUCCESS**: Dataset ready for use
* **FAILURE**: Processing encountered errors

#### **Progress Tracking**

* Monitor document import progress
* Track embedding generation status
* View chunk processing metrics

### **Embedding Updates**

#### **Manual Embedding Refresh**

```
When to Update Embeddings:
- After modifying dataset content
- After bulk row updates in tables
- To incorporate new document versions
```

***

## 🔧 **Knowledge Configuration Best Practices**

### **Initial Setup Process**

1. **Audit Existing Information**: Catalog what knowledge you have
2. **Choose Dataset Format**: TEXT for documents, TABLE for structured data
3. **Configure Processing**: Set chunking and parsing options
4. **Select Embedding Model**: Choose based on language and domain
5. **Test Retrieval**: Verify agent responses with sample queries

### **Dataset Organization Strategies**

#### **By Topic**

```
Product Knowledge:
├── Product Features (TEXT dataset)
├── Technical Specifications (TABLE dataset)
├── Pricing & Plans (TABLE dataset)
└── Troubleshooting Guides (TEXT dataset)

Customer Support:
├── Common Issues (TEXT dataset)
├── Resolution Procedures (TEXT dataset)
└── Product Updates (TEXT dataset)
```

#### **By Source Type**

```
Documentation:
├── User Manual.pdf → TEXT dataset
├── API Reference.pdf → TEXT dataset
└── FAQ.csv → TABLE dataset

Internal Knowledge:
├── Training Materials → TEXT dataset
├── Process Documents → TEXT dataset
└── Policy Database → DATABASE dataset
```

### **Optimization Guidelines**

#### **Document Preparation**

```
Before Upload:
- Remove duplicate content across documents
- Ensure consistent formatting
- Split very large documents into logical sections
- Use descriptive filenames
- Remove or redact sensitive information
```

#### **Table Design**

```
Column Configuration:
- Mark relevant columns as "semantic" for search
- Use appropriate data types (TEXT, NUMBER, DATE, etc.)
- Include descriptive column names
- Maintain data consistency across rows
- Consider creating separate tables for different entity types
```

#### **Retrieval Tuning**

```
Adjustment Process:
1. Start with defaults (Hybrid, Top K=5, Threshold=0.5)
2. Test with representative queries
3. If too few results: Lower threshold, increase Top K
4. If too many irrelevant results: Raise threshold, enable rerank
5. If missing semantic matches: Switch to Semantic search
6. If missing exact matches: Switch to Full-text search
```
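
The adjustment process can be read as a small decision table. The sketch below encodes it, with assumed step sizes (threshold moves by 0.1, Top K by 2) and hypothetical symptom names; only the direction of each change comes from the steps above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalSettings:
    strategy: str = "hybrid"
    top_k: int = 5
    threshold: float = 0.5
    rerank: bool = False


def adjust(settings: RetrievalSettings, symptom: str) -> RetrievalSettings:
    """Return new settings for one observed symptom, staying within documented ranges."""
    if symptom == "too_few_results":
        return replace(settings,
                       threshold=max(0.0, round(settings.threshold - 0.1, 2)),
                       top_k=min(10, settings.top_k + 2))
    if symptom == "too_many_irrelevant":
        return replace(settings,
                       threshold=min(1.0, round(settings.threshold + 0.1, 2)),
                       rerank=True)
    if symptom == "missing_semantic_matches":
        return replace(settings, strategy="semantic")
    if symptom == "missing_exact_matches":
        return replace(settings, strategy="full_text")
    raise ValueError(f"unknown symptom: {symptom}")
```

Change one thing at a time and rerun the same representative queries after each adjustment, so you can tell which change improved or degraded retrieval.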

***

## 🚀 **Advanced Features**

### **Multi-Dataset Retrieval**

When connecting multiple datasets to an agent:

* The agent can search across all connected datasets
* Results are merged and ranked by relevance
* Each result includes source dataset information
* Useful for comprehensive knowledge coverage
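
Conceptually, merging looks like the sketch below. It assumes relevance scores are comparable across datasets, and the `merge_results` helper and dataset names are hypothetical.

```python
def merge_results(per_dataset: dict[str, list[tuple[str, float]]],
                  top_k: int = 5) -> list[dict]:
    """Flatten per-dataset hits into one list ranked by score, tagging each with its source."""
    merged = [
        {"dataset": name, "chunk": chunk, "score": score}
        for name, hits in per_dataset.items()
        for chunk, score in hits
    ]
    merged.sort(key=lambda result: result["score"], reverse=True)
    return merged[:top_k]


hits = {
    "Product Features": [("Supports SSO login", 0.82), ("Offline mode", 0.55)],
    "Pricing & Plans": [("SSO is included in the Business plan", 0.78)],
}
top = merge_results(hits, top_k=2)
```

Here `top` holds one result from each dataset, and the `dataset` field tells the agent where each chunk came from.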

### **Semantic Column Configuration**

For TABLE and DATABASE formats:

* Mark specific columns for semantic search indexing
* Non-semantic columns remain queryable but are not embedded
* Reduces embedding costs for large tables
* Improves search focus on relevant fields
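
A rough sketch of the idea: only the columns marked as semantic contribute to the text that gets embedded for each row. The `row_to_embedding_text` helper and the `column: value` layout are assumptions for illustration.

```python
def row_to_embedding_text(row: dict, semantic_columns: list[str]) -> str:
    """Build the text to embed for one row from its semantic columns only."""
    parts = [
        f"{column}: {row[column]}"
        for column in semantic_columns
        if row.get(column) not in (None, "")  # skip missing or empty cells
    ]
    return "\n".join(parts)


row = {
    "sku": "A-100",
    "name": "Trail Shoe",
    "description": "Lightweight shoe for rocky terrain",
    "price": 89.0,
}
text = row_to_embedding_text(row, ["name", "description"])
```

In this example `sku` and `price` stay out of the embedded text, which keeps embeddings focused on descriptive fields while exact values remain available for structured lookups.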

***

## 🎯 **Knowledge Integration Checklist**

Before activating your agent's knowledge base:

* [ ] **Dataset Created**: Upload documents or create tables
* [ ] **Processing Complete**: Verify dataset status is SUCCESS
* [ ] **Embeddings Generated**: Check that embedding progress is 100%
* [ ] **Search Strategy Configured**: Choose Hybrid, Semantic, or Full-text
* [ ] **Retrieval Parameters Set**: Configure Top K, threshold, rerank
* [ ] **Retrieval Mode Selected**: Auto or manual tool call
* [ ] **Test Queries Run**: Verify knowledge retrieval with sample questions
* [ ] **Access Controls Set**: Ensure appropriate workspace permissions
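
If you track agent settings in your own tooling, the mechanical parts of this checklist can be expressed as a pre-flight check. The sketch below assumes a hypothetical `config` dictionary and field names. It is not a platform API, and it cannot verify test queries or workspace permissions for you.

```python
VALID_STRATEGIES = {"hybrid", "semantic", "full_text"}
VALID_MODES = {"auto", "manual"}


def preflight_problems(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: list[str] = []
    datasets = config.get("datasets", [])
    if not datasets:
        problems.append("no dataset connected")
    if len(datasets) > 100:
        problems.append("more than 100 datasets connected")
    for ds in datasets:
        if ds.get("status") != "SUCCESS":
            problems.append(f"{ds.get('name')}: status is {ds.get('status')}")
        if ds.get("embedding_progress") != 100:
            problems.append(f"{ds.get('name')}: embeddings not at 100%")
    if config.get("search_strategy") not in VALID_STRATEGIES:
        problems.append("search strategy not set")
    if not 1 <= config.get("top_k", 0) <= 10:
        problems.append("top_k outside 1-10")
    if not 0.0 <= config.get("threshold", -1.0) <= 1.0:
        problems.append("threshold outside 0.0-1.0")
    if config.get("retrieval_mode") not in VALID_MODES:
        problems.append("retrieval mode not set")
    return problems
```

An empty list from `preflight_problems` only covers the configuration items; run your sample questions and review access controls manually before going live.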

***

Your agent's knowledge is its competitive advantage. Invest in building a comprehensive, well-organized knowledge base that enables intelligent, accurate responses.
