Choosing the right base model matters, and at the moment I can’t find a comprehensive comparison of the popular models. As I experiment with them, I’ll update this page with my findings.

Defining the Benchmarks

Benchmarking is quite difficult to get right. As a humble beginner in the AI field, I’m quite confident my benchmarks will be flawed. I’ll do my best to make them interesting, fair, and informative.

Methodology 1.0

Prefer the 13B/15B-parameter variant and/or the Q5_1 quantization if available.

  • Ask a few standard questions (e.g. How many legs does a cat have?)
  • Ask a few questions specific to the model’s specialty (e.g. coding tasks, riddles, text summarization)
  • Compare the results and document each model’s performance
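The steps above can be sketched as a small harness. This is only a sketch: it assumes the models are served locally through the Ollama CLI (the Q5_1 mention suggests a llama.cpp-style setup), and the model tags, question lists, and `ask` helper are all illustrative placeholders, not a finished benchmark.

```python
# Minimal benchmark loop sketch. Assumes `ollama run <model> "<prompt>"`
# is available locally; model tags and questions are placeholders.
import subprocess

MODELS = ["starcoder2:15b", "deepseek-coder:6.7b"]  # hypothetical tags

STANDARD_QUESTIONS = [
    "How many legs does a cat have?",
]
MODEL_SPECIFIC_QUESTIONS = [
    "Write a Python function that reverses a string.",
]


def ask(model: str, prompt: str) -> str:
    """Send one prompt to a locally served model and return its reply."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_benchmark(ask_fn=ask):
    """Collect every model's answer to every question for side-by-side comparison."""
    results = {}
    for model in MODELS:
        answers = {}
        for question in STANDARD_QUESTIONS + MODEL_SPECIFIC_QUESTIONS:
            answers[question] = ask_fn(model, question)
        results[model] = answers
    return results
```

Taking `ask_fn` as a parameter keeps the comparison loop testable without any model running: swap in a stub function and the same code documents which questions were asked of which model.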

Models

  • StarCoder2: Coding LLM
    • “Supporting a context window of up to 16,384 tokens, StarCoder2 is the next generation of transparently trained open code LLMs.”
  • DeepSeek Coder: Coding LLM
    • “DeepSeek Coder is trained from scratch on both 87% code and 13% natural language in English and Chinese.”