locallama-mcp
by: Heratiki
An MCP Server that works with Roo Code/Cline.Bot/Claude Desktop to optimize costs by intelligently routing coding tasks between local LLMs, free APIs, and paid APIs.
📌 Overview
Purpose: The LocaLLama MCP Server aims to optimize costs by efficiently routing coding tasks between local language models and paid APIs, minimizing token usage.
Overview: LocaLLama MCP Server intelligently decides whether to execute coding tasks using less-capable local instruct LLMs or to leverage paid APIs. Its architecture includes monitoring and decision-making modules that assess costs and performance.
Key Features:
- Cost & Token Monitoring Module: Continuously tracks API usage and costs, providing real-time data to inform routing decisions.
- Decision Engine: Compares costs and quality of local LLMs versus paid APIs, with customizable thresholds for efficient task allocation.
- API Integration & Configurability: Offers an easy configuration interface for local model endpoints and integrates with OpenRouter for accessing various models.
- Fallback & Error Handling: Ensures reliability through fallback mechanisms during API failures and includes thorough logging for error management.
- Benchmarking System: Enables performance comparison between local and paid models, offering insights into response time, success rates, and quality, thereby assisting users in making informed decisions.
LocaLLama MCP Server
An MCP Server that works with Roo Code or Cline.Bot (currently untested with Claude Desktop or CoPilot MCP VS Code Extension) to optimize costs by intelligently routing coding tasks between local LLMs and paid APIs.
Overview
LocaLLama MCP Server reduces token usage and costs by dynamically deciding whether to offload a coding task to a local, less capable instruct LLM (e.g., LM Studio, Ollama) or use a paid API.
Key Components
Cost & Token Monitoring Module
- Queries the current API service for context usage, cumulative costs, API token prices, and available credits.
- Gathers real-time data to inform the decision engine.
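As a rough illustration, the real-time data this module gathers might be summarized in a structure like the one below. The interface and field names are hypothetical, not the server's actual types:

```typescript
// Illustrative shape of the real-time data the monitoring module gathers.
// All names here are assumptions for the sake of the example.
interface UsageSnapshot {
  contextTokensUsed: number;        // tokens consumed in the current context
  cumulativeCostUsd: number;        // running total spent on the paid API
  promptTokenPriceUsd: number;      // current paid-API price per prompt token
  completionTokenPriceUsd: number;  // current paid-API price per completion token
  availableCreditsUsd: number;      // remaining credit on the paid API account
}
```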
Decision Engine
- Defines rules comparing the cost of using the paid API against the cost and potential quality trade-offs of offloading to a local LLM.
- Includes configurable thresholds for when to offload.
- Uses preemptive routing based on benchmark data to make faster decisions without API calls.
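As a minimal sketch of such a rule, assuming simplified inputs and the thresholds from the Configuration section (the real engine also weighs benchmark data and other factors):

```typescript
// Simplified, illustrative routing rule; names and exact logic are assumptions.
interface TaskEstimate {
  tokenCount: number;        // estimated tokens the task will require
  paidApiCostUsd: number;    // estimated cost if sent to the paid API
  localQualityScore: number; // expected quality (0-1) from the local model
}

const thresholds = {
  tokenThreshold: 1500,  // TOKEN_THRESHOLD
  costThreshold: 0.02,   // COST_THRESHOLD
  qualityThreshold: 0.7, // QUALITY_THRESHOLD
};

function routeTask(task: TaskEstimate): 'local' | 'paid' {
  // Below the quality threshold, use the paid API regardless of cost.
  if (task.localQualityScore < thresholds.qualityThreshold) return 'paid';
  // Offload large or expensive tasks to the local LLM.
  if (task.tokenCount >= thresholds.tokenThreshold) return 'local';
  if (task.paidApiCostUsd >= thresholds.costThreshold) return 'local';
  return 'paid';
}
```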
API Integration & Configurability
- Provides a configuration interface to specify endpoints for local instances (e.g., LM Studio, Ollama).
- Interacts with these endpoints via standardized API calls.
- Integrates with OpenRouter to access free and paid models from various providers.
- Features robust directory handling and caching mechanisms.
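For illustration only, the endpoint configuration read from the environment could be modeled along these lines (the constant and field names are assumptions, not the server's actual code):

```typescript
// Illustrative endpoint configuration built from the environment variables
// documented below; the actual configuration module may differ.
const providerConfig = {
  lmStudioEndpoint: process.env.LM_STUDIO_ENDPOINT ?? 'http://localhost:1234/v1',
  ollamaEndpoint: process.env.OLLAMA_ENDPOINT ?? 'http://localhost:11434/api',
  openRouterApiKey: process.env.OPENROUTER_API_KEY, // enables OpenRouter models
  defaultLocalModel: process.env.DEFAULT_LOCAL_MODEL ?? 'qwen2.5-coder-3b-instruct',
};
```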
Fallback & Error Handling
- Implements fallback strategies if the paid API's data or local service is unavailable.
- Includes robust logging and error handling.
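A minimal sketch of the fallback pattern, with hypothetical function parameters standing in for the actual API clients:

```typescript
// Illustrative fallback flow: try the paid API first and fall back to the
// local endpoint on failure, logging the error along the way.
async function completeWithFallback(
  prompt: string,
  callPaidApi: (p: string) => Promise<string>,
  callLocalLlm: (p: string) => Promise<string>,
): Promise<string> {
  try {
    return await callPaidApi(prompt);
  } catch (err) {
    console.error('Paid API unavailable, falling back to local LLM:', err);
    return callLocalLlm(prompt);
  }
}
```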
Benchmarking System
- Compares performance of local LLM models against paid API models.
- Measures response time, success rate, quality score, and token usage.
- Generates detailed reports for analysis.
- Includes tools for benchmarking free models and updating prompting strategies.
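For reference, a single benchmark record could be thought of as something like the following (field names are illustrative; the actual JSON reports may differ):

```typescript
// Illustrative shape of one benchmark measurement.
interface BenchmarkResult {
  model: string;          // local model name or OpenRouter model id
  taskId: string;         // which benchmark task was run
  responseTimeMs: number; // time taken to produce a response
  success: boolean;       // whether the task completed successfully
  qualityScore: number;   // quality assessment of the output (0-1)
  tokensUsed: number;     // total tokens consumed by the run
}
```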
Installation
# Clone the repository
git clone https://github.com/yourusername/locallama-mcp.git
cd locallama-mcp
# Install dependencies
npm install
# Build the project
npm run build
Configuration
Copy the .env.example file to create your own .env file:
cp .env.example .env
Edit the .env file with your specific configuration:
# Local LLM Endpoints
LM_STUDIO_ENDPOINT=http://localhost:1234/v1
OLLAMA_ENDPOINT=http://localhost:11434/api
# Configuration
DEFAULT_LOCAL_MODEL=qwen2.5-coder-3b-instruct
TOKEN_THRESHOLD=1500
COST_THRESHOLD=0.02
QUALITY_THRESHOLD=0.7
# Benchmark Configuration
BENCHMARK_RUNS_PER_TASK=3
BENCHMARK_PARALLEL=false
BENCHMARK_MAX_PARALLEL_TASKS=2
BENCHMARK_TASK_TIMEOUT=60000
BENCHMARK_SAVE_RESULTS=true
BENCHMARK_RESULTS_PATH=./benchmark-results
# API Keys (replace with your actual keys)
OPENROUTER_API_KEY=your_openrouter_api_key_here
# Logging
LOG_LEVEL=debug
Environment Variables Explained
- Local LLM Endpoints
  - LM_STUDIO_ENDPOINT: URL where your LM Studio instance is running.
  - OLLAMA_ENDPOINT: URL where your Ollama instance is running.
- Configuration
  - DEFAULT_LOCAL_MODEL: The local LLM model to use when offloading tasks.
  - TOKEN_THRESHOLD: Maximum token count before considering offloading to the local LLM.
  - COST_THRESHOLD: Cost threshold (in USD) that triggers local LLM usage.
  - QUALITY_THRESHOLD: Quality score below which to use paid APIs regardless of cost.
- API Keys
  - OPENROUTER_API_KEY: Your OpenRouter API key for accessing various LLM services.
- New Tools
  - clear_openrouter_tracking: Clears OpenRouter tracking data and forces an update.
  - benchmark_free_models: Benchmarks the performance of free models from OpenRouter.
Environment Variables for Cline.Bot and Roo Code
When integrating with Cline.Bot or Roo Code, pass environment variables directly:
- Use basic env variables in your MCP setup for simple configuration.
- Configure thresholds to fine-tune when local vs. cloud models are used.
- Specify local models to handle different types of requests.
Usage
Starting the Server
npm start
OpenRouter Integration
- Automatically retrieves and tracks free and paid models from OpenRouter.
- Maintains a local cache of available models to reduce API calls.
- Includes a clear_openrouter_tracking tool to force updates of models.
- Features robust directory handling and enhanced error logging.
To use OpenRouter integration:
- Set your OPENROUTER_API_KEY in the environment variables.
- The server retrieves available models on startup.
- Use the clear_openrouter_tracking tool if free models do not appear or to get the latest model information.
The current integration provides access to approximately 240 models, including more than 30 free models from providers such as Google, Meta, Mistral, and Microsoft.
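Conceptually, the caching behaviour amounts to reusing the stored model list until it is stale or a refresh is forced, as in this illustrative sketch (the 24-hour lifetime is an assumption, not the server's actual setting):

```typescript
// Illustrative cache check: reuse the cached OpenRouter model list unless it
// is stale or a forced refresh (e.g. clear_openrouter_tracking) is requested.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // assumed 24-hour cache lifetime

function shouldRefreshModels(lastFetchedMs: number | null, force = false): boolean {
  if (force || lastFetchedMs === null) return true;
  return Date.now() - lastFetchedMs > CACHE_TTL_MS;
}
```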
Using with Cline.Bot
Add this MCP Server to your Cline MCP settings:
{
"mcpServers": {
"locallama": {
"command": "node",
"args": ["/path/to/locallama-mcp"],
"env": {
"LM_STUDIO_ENDPOINT": "http://localhost:1234/v1",
"OLLAMA_ENDPOINT": "http://localhost:11434/api",
"DEFAULT_LOCAL_MODEL": "qwen2.5-coder-3b-instruct",
"TOKEN_THRESHOLD": "1500",
"COST_THRESHOLD": "0.02",
"QUALITY_THRESHOLD": "0.07",
"OPENROUTER_API_KEY": "your_openrouter_api_key_here"
},
"disabled": false
}
}
}
Available MCP tools in Cline.Bot:
- get_free_models: Retrieve free models from OpenRouter.
- clear_openrouter_tracking: Refresh OpenRouter model data.
- benchmark_free_models: Benchmark free models from OpenRouter.
Example usage in Cline.Bot:
/use_mcp_tool locallama clear_openrouter_tracking {}
Running Benchmarks
To benchmark local and paid API models:
# Run a simple benchmark
node run-benchmarks.js
# Run a comprehensive benchmark across multiple models
node run-benchmarks.js comprehensive
Benchmark results are saved in the benchmark-results directory and include:
- Individual task performance metrics (JSON).
- Summary reports (JSON and Markdown).
- Detailed analysis of model performance.
Benchmark Results
Benchmark results provide insights into:
- Model response times.
- Success rates.
- Quality scores.
- Token usage.
These help inform the decision engine and understand trade-offs between local LLMs and paid APIs.
Development
Running in Development Mode
npm run dev
Running Tests
npm test
Security Notes
- .gitignore prevents sensitive data from being committed.
- Store API keys and secrets in the .env file, which is excluded from version control.
- Benchmark results contain no sensitive information.
License
ISC