Overview
Ollama allows you to run open-source AI models locally on your own infrastructure. Perfect for privacy-sensitive applications, development, and cost optimization.
Available Models
Llama 3.3 (70B) Local
2 credits • Powerful local model
- 128K context window
- Excellent reasoning for a self-hosted model
- Speed: Medium • Cost: Very Low (local)
- Best for: Powerful local performance
Llama 3.1 (405B) Local
4 credits • Massive local model
- 128K context window
- Exceptional capabilities for a self-hosted model
- Speed: Slow • Cost: Very Low (local)
- Best for: Maximum local intelligence
Llama 3.1 (70B) Local
2 credits • Balanced local model
- 128K context window
- Excellent for self-hosted apps
- Speed: Medium • Cost: Very Low (local)
- Best for: Strong local performance
Llama 3.1 (8B) Local
1 credit • Efficient local model
- 128K context window
- Good for basic local tasks
- Speed: Fast • Cost: Very Low (local)
- Best for: Light local processing
Setup
Prerequisites
- Ollama installed and running on your infrastructure (`ollama serve`)
- At least one supported model pulled locally (`ollama pull <model-name>`)
Configure BoostGPT
Ollama can be configured through any of the following:
- Dashboard Setup
- Core SDK
- Router SDK
The steps below cover the Dashboard setup.
1. Navigate to Integrations: Go to app.boostgpt.co and select Integrations.
2. Select Ollama: Find and click on the Ollama provider.
3. Configure Host: Enter your Ollama host URL (default: `http://localhost:11434`) and select which agents will use Ollama; a quick connectivity check is sketched after these steps.
4. Save Configuration: Click save to apply your Ollama configuration.
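Before saving, it can help to confirm that the host URL actually points at a running Ollama server. A minimal sketch using only the Python standard library; the host value is whatever you entered in step 3, and `/api/tags` is Ollama's endpoint for listing locally pulled models:

```python
import json
from urllib.request import urlopen

OLLAMA_HOST = "http://localhost:11434"  # the host URL you entered in step 3

def check_ollama(host: str = OLLAMA_HOST) -> None:
    """Confirm the Ollama server is reachable and print its local models."""
    try:
        with urlopen(f"{host}/api/tags", timeout=5) as resp:
            models = json.load(resp).get("models", [])
    except OSError as exc:  # connection refused, timeout, bad host, etc.
        raise SystemExit(f"Ollama is not reachable at {host}: {exc}")
    print(f"Ollama is up at {host} with {len(models)} model(s) pulled:")
    for model in models:
        print(" -", model["name"])

if __name__ == "__main__":
    check_ollama()
```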
Hardware Requirements
| Model | Min VRAM | Recommended RAM | CPU Cores | Best Hardware |
|---|---|---|---|---|
| Llama 3.1 (8B) | 8GB | 16GB | 4+ | Gaming PC, M1 Mac |
| Llama 3.1 (70B) | 40GB | 64GB | 8+ | Workstation, A100 |
| Llama 3.3 (70B) | 40GB | 64GB | 8+ | Workstation, A100 |
| Llama 3.1 (405B) | 200GB+ | 256GB+ | 16+ | Multi-GPU server |
Best Practices
Using provider_host for Ollama
When using the Core SDK chat method, specify the Ollama host with `provider_host`:
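The exact Core SDK client shape isn't shown on this page, so the snippet below is only a sketch: the `boostgpt` import, the client constructor, and the `chat()` signature are assumptions, while `provider_host` and the local model names are the parts this page actually describes.

```python
# Sketch only: the import path, client constructor, and chat() signature are
# assumptions, not the documented Core SDK surface. provider_host is the
# parameter described on this page.
from boostgpt import BoostGPT  # hypothetical import

client = BoostGPT(api_key="YOUR_API_KEY")

reply = client.chat(
    agent_id="YOUR_AGENT_ID",                 # an agent configured to use Ollama
    message="Summarize this document.",
    model="llama3.1:8b",                      # any model pulled on your Ollama host
    provider_host="http://localhost:11434",   # where your Ollama instance is running
)
print(reply)
```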
The `provider_host` parameter is required when using Ollama with the Core SDK chat method, as it tells BoostGPT where your Ollama instance is running.
Model Selection for Hardware
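As a rough guide to which model fits your machine, the minimums from the Hardware Requirements table above can be folded into a small helper. This is a hypothetical illustration, not part of any SDK, and the model tags are examples of how these models are commonly named in Ollama:

```python
def pick_local_model(vram_gb: float) -> str:
    """Return the largest local Llama model that fits the given VRAM,
    using the minimums from the Hardware Requirements table above."""
    if vram_gb >= 200:
        return "llama3.1:405b"
    if vram_gb >= 40:
        return "llama3.3:70b"   # llama3.1:70b has the same footprint
    if vram_gb >= 8:
        return "llama3.1:8b"
    raise ValueError("At least 8 GB of VRAM is recommended for the smallest model.")

print(pick_local_model(48))  # -> llama3.3:70b
```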
Production Deployment
1. Use Docker: Run Ollama in a container for consistent, reproducible deployments.
2. Configure Firewall: Ensure the Ollama port (11434) is accessible to your BoostGPT application.
3. Monitor Resources: Watch GPU/CPU usage and scale hardware as needed.
4. Set Up Load Balancing: For high volume, run multiple Ollama instances behind a load balancer (a health-check sketch follows this list).
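For the load-balancing step, a simple health check against each instance is often enough to decide which hosts should receive traffic. A minimal sketch; the instance URLs are placeholders, and `/api/tags` is the same model-listing endpoint used earlier:

```python
import json
from urllib.request import urlopen

# Placeholder instance URLs; replace with your own Ollama hosts.
OLLAMA_INSTANCES = [
    "http://10.0.0.11:11434",
    "http://10.0.0.12:11434",
]

def healthy_instances(hosts: list[str]) -> list[str]:
    """Return the Ollama hosts that answer /api/tags with valid JSON."""
    up = []
    for host in hosts:
        try:
            with urlopen(f"{host}/api/tags", timeout=3) as resp:
                json.load(resp)  # valid JSON means the server is serving
            up.append(host)
        except (OSError, ValueError):
            pass  # unreachable, timed out, or returned a bad response
    return up

print(healthy_instances(OLLAMA_INSTANCES))
```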
Troubleshooting
Connection refused
Cause: Ollama server not running or a firewall blocking the port
Solutions:
- Run `ollama serve` to start the server
- Check that your firewall allows port 11434
- Verify the host URL in your configuration
Model not found
Cause: Model not pulled locally
Solutions:
- Run `ollama pull <model-name>`
- Verify the model name matches exactly
- Check `ollama list` for available models
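Since the most common cause is a name that almost matches, it can also help to compare your configured model against exactly what the server reports. A small sketch using the same `/api/tags` endpoint; the model tag shown is only an example:

```python
import json
from urllib.request import urlopen

def model_is_available(name: str, host: str = "http://localhost:11434") -> bool:
    """Check whether a model tag is already pulled on the Ollama host."""
    with urlopen(f"{host}/api/tags", timeout=5) as resp:
        pulled = {m["name"] for m in json.load(resp).get("models", [])}
    return name in pulled

# The tag must match exactly, including any size suffix.
print(model_is_available("llama3.1:8b"))
```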
Slow responses
Cause: Insufficient hardware resources
Solutions:
- Use a smaller model (8B instead of 70B)
- Add more RAM/VRAM
- Reduce `max_reply_tokens`
- Close other GPU-intensive applications
Out of memory errors
Cause: Model too large for available VRAM
Solutions:
- Switch to smaller model
- Reduce context window
- Use CPU fallback (slower but works)
- Upgrade hardware