Evaluate

This is an example of simple model evaluation for a RAG-based prompt pipeline. The quality of each model response is evaluated by another LLM: a judge model compares the input prompt against the output response to determine the quality of the response. The resulting scores are then used to rank the models.
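
To make the judge idea concrete, here is a conceptual sketch of that loop, assuming cllm can be called with a system config name and a prompt (the exact cllm invocation and the system names l/llama and gpt/4 are assumptions here; task evaluate, shown below, wires the real pipeline up for you):

PROMPT="What is a RAG?"
# Candidate model answers the (RAG-augmented) prompt
ANSWER=$(cllm l/llama "${PROMPT}")
# Judge model scores the candidate answer on a 1-10 scale
cllm gpt/4 "Rate this answer to '${PROMPT}' from 1 to 10: ${ANSWER}"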

Requirements

  • cllm installed
  • An OpenAI API key, with OPENAI_API_KEY set in the environment (see the setup snippet after this list)
  • Ollama installed, models downloaded, and its chat completion API running
  • A Groq Cloud account, with GROQ_API_KEY set in the environment
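
A minimal setup sketch with placeholder values (adjust to your environment):

export OPENAI_API_KEY="..."   # placeholder OpenAI key
export GROQ_API_KEY="..."     # placeholder Groq Cloud key
ollama serve &                # start the Ollama chat completion API if it is not already running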

Examples

Evaluate a single model from the ./.cllm/systems directory against the prompting guide vector store.

RAG=promptingguide
PROMPT="What is a RAG?"
task evaluate rag=${RAG} rag_fetch=5 prompt="${PROMPT}" model=l/llama
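
Here rag= selects the vector store, rag_fetch=5 presumably controls how many chunks are retrieved from it as context, prompt= is the question to evaluate, and model=l/llama points at a single system config under ./.cllm/systems.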

Evaluate all the models in the ./.cllm/systems directory against the prompting guide vector store.

RAG=promptingguide
PROMPT="What is a RAG?"
task all rag=${RAG} rag_fetch=5 prompt="${PROMPT}"

task report rag=${RAG}
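
If you want to drive the runs yourself, roughly the same effect can be approximated with a shell loop over the system configs (a sketch, assuming each file under ./.cllm/systems is one model config named in the l/llama style):

RAG=promptingguide
PROMPT="What is a RAG?"
for cfg in ./.cllm/systems/*/*; do
  # Turn the file path into a model name like l/llama (strip the directory prefix and any extension)
  model=${cfg#./.cllm/systems/}
  task evaluate rag=${RAG} rag_fetch=5 prompt="${PROMPT}" model="${model%.*}"
done
task report rag=${RAG}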

Evaluate just the locally defined Ollama models using the cllm vector store.

RAG=cllm
PROMPT="What is cllm and how does it work?"
task exp-local-models rag=${RAG} rag_fetch=5 prompt="${PROMPT}"

task report rag=${RAG}
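
Note that task report is run with the same rag= value as the evaluation step, so the report is generated from the results collected for that particular vector store.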

Output

The output of the evaluation is a table of the models and their scores. The scores reflect the quality of each model's response to the prompt: the higher the score, the better the response.

Prompting Guide Eval Chart

Example output of evaluating the models against the cllm vector store. See the traces/cllm directory.
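
To browse the raw runs directly (exact layout may vary):

ls traces/cllm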
