Choosing the right base model matters, and at the moment I can’t find a comprehensive comparison of the popular models. As I experiment with them, I’ll update this page with my findings.

Defining the Benchmarks

Benchmarking is quite difficult to get right. As a humble beginner in the AI field, I’m quite confident my benchmarks will be flawed. I’ll do my best to make them interesting, fair, and informative.

Methodology 1.0

Prefer the 13B/15B-parameter variant and/or the Q5_1 quantization if available.

  • Ask a few standard questions (e.g. How many legs does a cat have?)
  • Ask a few questions specific to the model’s specialty (e.g. coding tasks, riddles, text summarization)
  • Compare the results and document each model’s performance
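The steps above can be sketched as a small harness. This is only a sketch: it assumes the models are served locally through the Ollama CLI (the Q5_1 mention suggests a llama.cpp-style setup), and the model tags, question lists, and `ask` helper are all illustrative placeholders, not a finished benchmark.

```python
# Minimal benchmark loop sketch. Assumes `ollama run <model> "<prompt>"`
# is available locally; model tags and questions are placeholders.
import subprocess

MODELS = ["starcoder2:15b", "deepseek-coder:6.7b"]  # hypothetical tags

STANDARD_QUESTIONS = [
    "How many legs does a cat have?",
]
MODEL_SPECIFIC_QUESTIONS = [
    "Write a Python function that reverses a string.",
]


def ask(model: str, prompt: str) -> str:
    """Send one prompt to a locally served model and return its reply."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_benchmark(ask_fn=ask):
    """Collect every model's answer to every question for side-by-side comparison."""
    results = {}
    for model in MODELS:
        answers = {}
        for question in STANDARD_QUESTIONS + MODEL_SPECIFIC_QUESTIONS:
            answers[question] = ask_fn(model, question)
        results[model] = answers
    return results
```

Taking `ask_fn` as a parameter keeps the comparison loop testable without any model running: swap in a stub function and the same code documents which questions were asked of which model.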

Models

  • StarCoder2: Coding LLM
    • “Supporting a context window of up to 16,384 tokens, StarCoder2 is the next generation of transparently trained open code LLMs.”
  • DeepSeek Coder: Coding LLM
    • “DeepSeek Coder is trained from scratch on both 87% code and 13% natural language in English and Chinese.”