
locallama-mcp

by: Heratiki

An MCP Server that works with Roo Code/Cline.Bot/Claude Desktop to optimize costs by intelligently routing coding tasks between local LLMs, free APIs, and paid APIs.

Created: 25/02/2025
Tags: Optimization, Routing

📌 Overview

Purpose: The LocaLLama MCP Server aims to optimize costs by efficiently routing coding tasks between local language models and paid APIs, minimizing token usage.

Overview: LocaLLama MCP Server intelligently decides whether to execute coding tasks with less capable local instruct LLMs or to leverage paid APIs. Its architecture includes monitoring and decision-making modules that assess costs and performance.

Key Features:

  • Cost & Token Monitoring Module: Continuously tracks API usage and costs, providing real-time data to inform routing decisions.

  • Decision Engine: Compares costs and quality of local LLMs versus paid APIs, with customizable thresholds for efficient task allocation.

  • API Integration & Configurability: Offers an easy configuration interface for local model endpoints and integrates with OpenRouter for accessing various models.

  • Fallback & Error Handling: Ensures reliability through fallback mechanisms during API failures and includes thorough logging for error management.

  • Benchmarking System: Enables performance comparison between local and paid models, offering insights into response time, success rates, and quality, thereby assisting users in making informed decisions.


LocaLLama MCP Server

An MCP Server that works with Roo Code or Cline.Bot (currently untested with Claude Desktop or CoPilot MCP VS Code Extension) to optimize costs by intelligently routing coding tasks between local LLMs and paid APIs.

Overview

LocaLLama MCP Server reduces token usage and costs by dynamically deciding whether to offload a coding task to a local, less capable instruct LLM (e.g., LM Studio, Ollama) or use a paid API.

Key Components

Cost & Token Monitoring Module

  • Queries the current API service for context usage, cumulative costs, API token prices, and available credits.
  • Gathers real-time data to inform the decision engine.
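
As a rough sketch, the data this module gathers could be grouped into a snapshot like the one below (the interface name and fields are illustrative, not the server's actual types):

// Hypothetical snapshot of paid-API usage, assembled for the decision engine.
interface UsageSnapshot {
  contextTokensUsed: number;       // tokens consumed in the current context window
  cumulativeCostUsd: number;       // total spend on the paid API so far
  promptTokenPriceUsd: number;     // current price per prompt token
  completionTokenPriceUsd: number; // current price per completion token
  availableCredits: number;        // remaining credits on the paid API account
}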

Decision Engine

  • Defines rules comparing the cost of using the paid API against the cost and potential quality trade-offs of offloading to a local LLM.
  • Includes configurable thresholds for when to offload.
  • Uses preemptive routing based on benchmark data to make faster decisions without API calls.
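
A minimal sketch of what such a rule set could look like, assuming thresholds equivalent to the TOKEN_THRESHOLD, COST_THRESHOLD, and QUALITY_THRESHOLD settings described under Configuration (the actual decision engine is more involved and also uses benchmark data):

// Illustrative routing rule only; names and logic are simplified.
interface TaskEstimate {
  tokens: number;       // estimated token count for the task
  paidCostUsd: number;  // estimated cost if sent to the paid API
  localQuality: number; // expected quality score of the local model (0..1)
}

function chooseTarget(
  task: TaskEstimate,
  cfg = { tokenThreshold: 1500, costThreshold: 0.02, qualityThreshold: 0.7 },
): 'local' | 'paid' {
  // If the local model is unlikely to produce acceptable output, pay for quality.
  if (task.localQuality < cfg.qualityThreshold) return 'paid';
  // Offload when the task is large or expensive enough to matter.
  if (task.tokens > cfg.tokenThreshold || task.paidCostUsd > cfg.costThreshold) return 'local';
  // Small, cheap tasks stay on the paid API.
  return 'paid';
}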

API Integration & Configurability

  • Provides a configuration interface to specify endpoints for local instances (e.g., LM Studio, Ollama).
  • Interacts with these endpoints via standardized API calls.
  • Integrates with OpenRouter to access free and paid models from various providers.
  • Features robust directory handling and caching mechanisms.
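
For illustration, talking to a local LM Studio instance is an ordinary OpenAI-style chat completion request; the sketch below assumes the default endpoint and model from the Configuration section and is not the server's actual client code:

// Sketch of a standardized call to a local LM Studio endpoint.
// (LM Studio exposes an OpenAI-compatible API under /v1; Node 18+ provides fetch.)
const LM_STUDIO_ENDPOINT = process.env.LM_STUDIO_ENDPOINT ?? 'http://localhost:1234/v1';

async function callLocalModel(prompt: string): Promise<string> {
  const res = await fetch(`${LM_STUDIO_ENDPOINT}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen2.5-coder-3b-instruct',
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`Local endpoint returned ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}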

Fallback & Error Handling

  • Implements fallback strategies if the paid API's data or local service is unavailable.
  • Includes robust logging and error handling.
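
A minimal sketch of the fallback idea, with the two providers passed in as functions (the names are illustrative):

// Illustrative fallback wrapper: try the preferred provider, then the other one.
async function runWithFallback(
  prompt: string,
  primary: (p: string) => Promise<string>,   // e.g., a local LM Studio/Ollama call
  fallback: (p: string) => Promise<string>,  // e.g., a paid API call via OpenRouter
): Promise<string> {
  try {
    return await primary(prompt);
  } catch (err) {
    // Log the failure and retry on the other provider instead of failing the task.
    console.error('Primary provider failed, falling back:', err);
    return fallback(prompt);
  }
}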

Benchmarking System

  • Compares performance of local LLM models against paid API models.
  • Measures response time, success rate, quality score, and token usage.
  • Generates detailed reports for analysis.
  • Includes tools for benchmarking free models and updating prompting strategies.

Installation

# Clone the repository
git clone https://github.com/yourusername/locallama-mcp.git
cd locallama-mcp

# Install dependencies
npm install

# Build the project
npm run build

Configuration

Copy the .env.example file to create your own .env file:

cp .env.example .env

Edit the .env file with your specific configuration:

# Local LLM Endpoints
LM_STUDIO_ENDPOINT=http://localhost:1234/v1
OLLAMA_ENDPOINT=http://localhost:11434/api

# Configuration
DEFAULT_LOCAL_MODEL=qwen2.5-coder-3b-instruct
TOKEN_THRESHOLD=1500
COST_THRESHOLD=0.02
QUALITY_THRESHOLD=0.7

# Benchmark Configuration
BENCHMARK_RUNS_PER_TASK=3
BENCHMARK_PARALLEL=false
BENCHMARK_MAX_PARALLEL_TASKS=2
BENCHMARK_TASK_TIMEOUT=60000
BENCHMARK_SAVE_RESULTS=true
BENCHMARK_RESULTS_PATH=./benchmark-results

# API Keys (replace with your actual keys)
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Logging
LOG_LEVEL=debug

Environment Variables Explained

  • Local LLM Endpoints

    • LM_STUDIO_ENDPOINT: URL where your LM Studio instance is running.
    • OLLAMA_ENDPOINT: URL where your Ollama instance is running.
  • Configuration

    • DEFAULT_LOCAL_MODEL: The local LLM model to use when offloading tasks.
    • TOKEN_THRESHOLD: Maximum token count before considering offloading to local LLM.
    • COST_THRESHOLD: Cost threshold (in USD) that triggers local LLM usage.
    • QUALITY_THRESHOLD: Quality score below which to use paid APIs regardless of cost (see the worked example after this list).
  • API Keys

    • OPENROUTER_API_KEY: Your OpenRouter API key for accessing various LLM services.
  • New Tools

    • clear_openrouter_tracking: Clears OpenRouter tracking data and forces an update.
    • benchmark_free_models: Benchmarks the performance of free models from OpenRouter.
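
As a worked example of the three threshold variables above (assuming they act as simple cut-offs): a task estimated at 2,000 tokens exceeds TOKEN_THRESHOLD=1500 and therefore becomes a candidate for offloading, but it is still sent to the paid API if the local model's expected quality score for that task falls below QUALITY_THRESHOLD=0.7.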

Environment Variables for Cline.Bot and Roo Code

When integrating with Cline.Bot or Roo Code, pass environment variables directly:

  • Use basic env variables in your MCP setup for simple configuration.
  • Configure thresholds to fine-tune when local vs. cloud models are used.
  • Specify local models to handle different types of requests.

Usage

Starting the Server

npm start

OpenRouter Integration

  • Automatically retrieves and tracks free and paid models from OpenRouter.
  • Maintains a local cache of available models to reduce API calls.
  • Includes a clear_openrouter_tracking tool to force updates of models.
  • Features robust directory handling and enhanced error logging.

To use OpenRouter integration:

  1. Set your OPENROUTER_API_KEY in environment variables.
  2. The server retrieves available models on startup.
  3. Use the clear_openrouter_tracking tool if free models do not appear or to fetch the latest model information.

The current integration provides access to ~240 models, including 30+ free models from providers such as Google, Meta, Mistral, and Microsoft.

Using with Cline.Bot

Add this MCP Server to your Cline MCP settings:

{
  "mcpServers": {
    "locallama": {
      "command": "node",
      "args": ["/path/to/locallama-mcp"],
      "env": {
        "LM_STUDIO_ENDPOINT": "http://localhost:1234/v1",
        "OLLAMA_ENDPOINT": "http://localhost:11434/api",
        "DEFAULT_LOCAL_MODEL": "qwen2.5-coder-3b-instruct",
        "TOKEN_THRESHOLD": "1500",
        "COST_THRESHOLD": "0.02",
        "QUALITY_THRESHOLD": "0.07",
        "OPENROUTER_API_KEY": "your_openrouter_api_key_here"
      },
      "disabled": false
    }
  }
}

Available MCP tools in Cline.Bot:

  • get_free_models: Retrieve free models from OpenRouter.
  • clear_openrouter_tracking: Refresh OpenRouter model data.
  • benchmark_free_models: Benchmark free models from OpenRouter.

Example usage in Cline.Bot:

/use_mcp_tool locallama clear_openrouter_tracking {}
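
Assuming the other tools follow the same invocation pattern, a benchmark run would look like:

/use_mcp_tool locallama benchmark_free_models {}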

Running Benchmarks

To benchmark local and paid API models:

# Run a simple benchmark
node run-benchmarks.js

# Run a comprehensive benchmark across multiple models
node run-benchmarks.js comprehensive

Benchmark results are saved in the benchmark-results directory and include:

  • Individual task performance metrics (JSON).
  • Summary reports (JSON and Markdown).
  • Detailed analysis of model performance.

Benchmark Results

Benchmark results provide insights into:

  • Model response times.
  • Success rates.
  • Quality scores.
  • Token usage.

These metrics help inform the decision engine and clarify the trade-offs between local LLMs and paid APIs.

Development

Running in Development Mode

npm run dev

Running Tests

npm test

Security Notes

  • .gitignore prevents sensitive data from being committed.
  • Store API keys and secrets in the .env file, which is excluded from version control.
  • Benchmark results contain no sensitive information.

License

ISC