What LLM Coding Benchmarks Actually Measure (and What They Don't)

AI coding assistants (Cursor, Claude Code, Antigravity, etc.) no longer need an introduction. Many of these tools let you choose the underlying LLM, and that choice definitely matters. However, with so many models on the market—and new ones coming out every few weeks—how do you make that choice?

Benchmarks are the obvious place to look. But how should you use them? And if you’ve followed this space even a little, you’ve probably heard plenty of criticism: benchmark gaming, training contamination, misaligned tasks, and so on. So are benchmarks useful at all?

A colleague recently asked me this, which triggered a longer train of thought that I wanted to write down.

Read more →

Diffusion models

In this article I want to tell you about diffusion models, an actively developing approach to image generation. Recent research shows that this paradigm can generate images whose quality matches or even exceeds that of the best GANs. Moreover, their design allows diffusion models to overcome two main weaknesses of GANs: mode collapse and sensitivity to hyperparameter choice. However, the same design that makes diffusion models so powerful also makes them considerably slower at inference.

Table taken from Aran Komatsuzaki’s blog post.

Read more →

Understanding positional encoding

The transformer model introduced in Attention Is All You Need uses positional encoding to enrich token embeddings with positional information. The authors note that there are several possible implementations of positional encoding, one of the most obvious being a trainable embedding layer. However, this approach has drawbacks, such as the model’s inability to handle sequences longer than those seen during training. Hence, the authors search for alternative methods and settle on the following:

$$ PE_{(pos,2i)} = \sin \left( \frac{1}{10000^{2i/d_{\text{model}}}} pos \right) = \sin(\omega_i \cdot pos) $$

$$ PE_{(pos,2i+1)} = \cos \left( \frac{1}{10000^{2i/d_{\text{model}}}} pos \right) = \cos(\omega_i \cdot pos) $$

Read more →
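These sinusoidal encodings are easy to compute directly. Here is a minimal NumPy sketch of the formulas above (the function name and shapes are my own, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    omega = 1.0 / (10000 ** (2 * i / d_model))   # frequencies ω_i
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(omega * pos)            # even dims: sin(ω_i · pos)
    pe[:, 1::2] = np.cos(omega * pos)            # odd dims:  cos(ω_i · pos)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Note that at position 0 the even dimensions are all sin(0) = 0 and the odd dimensions are all cos(0) = 1, which is a quick sanity check for any implementation.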

Normalizing flows in simple words

Suppose we have a sample of objects $X = \{x_i\}_{i=1}^n$ that come from an unknown distribution $p_x(x)$, and we want our model to learn this distribution. What do I mean by learning a distribution? There are many ways to define such a task, but data scientists mostly settle on two things:

  1. learning to score the objects’ probability, i.e. learning the probability density function $p_x(x)$, and/or
  2. learning to sample from this unknown distribution, which implies the ability to sample new, unseen objects.

Does this description ring a bell? Yes, I’m talking precisely about generative models!
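To make the two goals concrete, here is a toy illustration of my own (a closed-form Gaussian fit stands in for a real normalizing flow): fitting a Gaussian to 1D samples by maximum likelihood gives us both a density we can evaluate and a distribution we can sample from.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=1000)   # the "unknown" data distribution

# 1. Learn a density: Gaussian MLE has closed-form mean and std.
mu, sigma = X.mean(), X.std()

def p_x(x):
    """Learned probability density function."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# 2. Sample new, unseen objects from the learned model.
new_samples = mu + sigma * rng.standard_normal(500)

print(round(mu, 2), round(sigma, 2))  # close to the true 2.0 and 0.5
```

A normalizing flow achieves the same two capabilities for far more complex distributions by learning an invertible transformation from a simple base density.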

Read more →

Backpropagation through softmax layer

Have you ever wondered how to backpropagate the gradient through a softmax layer? If you google it, you will find lots of articles (such as this one, which helped me a lot), but most of them prove the formula for the softmax’s derivative and then jump straight to backpropagating the cross-entropy loss through the softmax layer. And while normalizing the network’s output before computing the classification loss is the most common use of softmax, those formulas describe backpropagation through the cross-entropy loss rather than through the softmax layer itself.
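As a preview, the general softmax backward pass can be written without materializing the full Jacobian, since $J = \operatorname{diag}(y) - y y^T$. Here is a minimal NumPy sketch of my own, checked against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def softmax_backward(y, grad_y):
    """Backprop an upstream gradient grad_y = dL/dy through softmax output y.

    Uses the Jacobian J = diag(y) - y y^T without materializing it:
    dL/dx = y * (grad_y - grad_y · y).
    """
    return y * (grad_y - np.dot(grad_y, y))

x = np.array([0.5, -1.2, 3.0])
y = softmax(x)
g = np.array([1.0, 0.0, 0.0])        # pretend dL/dy, i.e. L = y[0]

# Finite-difference approximation of dL/dx for L = y[0].
eps = 1e-6
numeric = np.array([
    (softmax(x + eps * np.eye(3)[i])[0] - softmax(x - eps * np.eye(3)[i])[0]) / (2 * eps)
    for i in range(3)
])
print(np.allclose(softmax_backward(y, g), numeric, atol=1e-5))  # True
```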

Read more →

Confusing detail about chain rule in linear layer backpropagation

Have you ever wondered why the gradients of a linear layer $Y = XW$ have these weird formulas?

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T,\ \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y} $$

I mean, the first one is easy: we just apply the chain rule, et voilà:

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial X} = \frac{\partial L}{\partial Y} W^T $$

Read more →
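Whatever the derivation, the two formulas are easy to verify numerically. A quick NumPy check of my own (shapes and the choice $L = \sum Y$ are arbitrary, picked so that $\partial L/\partial Y$ is a matrix of ones):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))

# Take L = sum(Y), so dL/dY is a matrix of ones.
Y = X @ W
dL_dY = np.ones_like(Y)

dL_dX = dL_dY @ W.T      # shape (4, 3), matches X
dL_dW = X.T @ dL_dY      # shape (3, 5), matches W

# Finite-difference check of one entry of dL/dX.
eps = 1e-6
Xp = X.copy(); Xp[0, 0] += eps
numeric = ((Xp @ W).sum() - Y.sum()) / eps
print(np.allclose(dL_dX[0, 0], numeric, atol=1e-4))  # True
```

Note how the formulas are the only way to make the shapes line up: $\partial L/\partial X$ must match $X$, and $\partial L/\partial W$ must match $W$.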

What is git and how to start working with it?

As a software/machine learning engineer, I have been using git for quite a while. In fact, I cannot even recall when I first started using it. As a result, I fell into the trap of thinking that every software engineer knows how to use git. That held until recently, when I started working on a side project with two of my colleagues. It turned out that there are people in the field who have no understanding whatsoever of what git is or how to work with it. Even more surprising, the person I am talking about was actually using git at his main job while being completely unaware of what it is and how it works! He had just been shown several buttons in an IDE and was pressing them, hoping for the best. I volunteered to help him out and put together a little presentation, which he found very useful. Today I want to share a slightly shortened and polished version of that presentation with you. I hope it helps at least some of you stop being afraid of git and start using it.
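For readers who want a taste of the command-line view behind those IDE buttons, here is a minimal first session (the repository name, identity, and file are made up for illustration):

```shell
# Create a folder and turn it into a git repository.
mkdir demo && cd demo
git init

# Tell git who you are; this identity is recorded in every commit.
git config user.name "Your Name"
git config user.email "you@example.com"

# Create a file, stage it, and record a snapshot (a commit).
echo "# My project" > README.md
git add README.md
git commit -m "Add README"

# Inspect the history and the current working-tree state.
git log --oneline
git status
```

Every graphical git client is ultimately issuing commands like these under the hood.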

Read more →