Visualising Linear Regression

I have been wanting to observe a linear model adapting to the data, in real time. A few days, ago, I was successfully able to implement it in C using the raylib library. The logic is actually pretty simple to follow and I hope to state it in as much ease as it took me to implement it.

The Math:

Suppose you have 'n' samples of ('dependent variable' Y; 'explanatory variables' $\widetilde{X}_{1 \times p}$) and you wish to establish a linear relationship between the explanatory variables and the dependent variable. The same can be represented in the following way -

\[ Y_{n \times 1} = X_{n \times (p+1)} \beta_{(p+1) \times 1} + \epsilon_{n \times 1} \]

where, $\epsilon_{n \times 1} \sim MVN(\widetilde{0},\mathcal{I}_{n \times n})$ *

note - This condition * is called as homoscedasticity and is an assumption in linear regression.

Even though, the optimal parameters i.e. $\widehat{\beta}$ can be given as follows -

\[ \widehat{\beta} = (X^{T} X)^{-1} X^{T} Y~ * \]

we are not going to use this as this gives us the optimal model right away but since we wish to see the changes dynamically, we shall use the method of gradient descent instead.

There are several ways to prove *, but the most elegant one in my opinion requires the use of elementary Linear Algebra.

Proof

We have to find: $$ \widehat{\beta} = \arg \limits_{\beta} \min \|Y_{real} - Y_{predicted}\|^2 $$ which is the same as: $$ \widehat{\beta} = \arg \limits_{\beta} \min \|Y - X\beta\|^2 $$ Pick $\widehat{\beta} \ni \widehat{Y} = X\widehat{\beta}~$ is the projection of Y on $\mathcal{C}(X)$ (the column space of X). Observe that minimizing the quantity $\|Y - X\beta\|$ literally means finding the vector $X\widehat{\beta}$ closest to $Y$, lying in $\mathcal{C}(X)$, and that happens when $Y - X\widehat{\beta} \perp \mathcal{C}(X)$. $\Rightarrow$ $$ \forall~ C \in \mathcal{C}(X),~ < Y-X\widehat{\beta} , C > = 0$$ Since $C \in \mathcal{C}(X), C = XC'$ for some $C'$.
$\Rightarrow$ $$ < Y-X\widehat{\beta} , XC' > = 0$$ $\Rightarrow$ $$ < Y,XC' > - < X\widehat{\beta},XC' >~ = 0$$ $\Rightarrow$ $$(XC')^{T}Y - (XC')^{T}X\widehat{\beta} = 0$$ $\Rightarrow$ $$(XC')^{T}Y = (XC')^{T}X\widehat{\beta}$$ $\Rightarrow$ $$C'^{T}X^{T}Y = C'^{T}X^{T}X\widehat{\beta}$$ $\Rightarrow$ $$(X^{T}X)^{-1}X^{T}Y = \widehat{\beta}$$ To prove that $\widehat{\beta}$ is indeed our optimal candidate, pick another $\beta$ and notice that: $$ Y - X\beta = Y - X\widehat{\beta} + X(\widehat{\beta} - \beta)$$ now notice that $Y - X\widehat{\beta} \perp X(\widehat{\beta} - \beta)$ therefore, we can apply the pythagoras theorem: $$ \|Y - X\beta\| ^2 = \|Y - X\widehat{\beta}\|^2 + \|X(\widehat{\beta} - \beta)\|^2$$ the minimum of which occurs when $Y - X\beta = Y - X\widehat{\beta}$;
$\Rightarrow$ $$ \beta = \widehat{\beta} $$ And we are done!

Gradient Descent:

Suppose you have a continuous and a differentiable function, $\mathcal{f}:\mathbb{R}^{n} \rightarrow \mathbb{R}$ and you wish to find the parameter set $\widetilde{X} = (x_1',...,x_n')$ such that under the transformation, \[ x_i \leftarrow x_i',~ \forall~ i \in (1,...,n) \]

the value of $\mathcal{f}(x_1,...,x_n)$ decreases.

In simple words, in an 'n' - dimensional vector space, $\mathbb{R}^{n}$, we must 'move' in such a way such that the value of the function decreases, i.e. we must move in the direction in which the function $\mathcal{f}(x_1,...,x_n)$ achieves a minimum.

A few minutes of mathematical thinking can easily allow one to observe that the following transformation:

\[ x_i \leftarrow x_i + k_i,~ \forall~ i \in (1,...,n) ,\]

taking $k_i = - c \frac{\partial\mathcal{f}(x_1,...,x_n)}{ \partial x_i}$, where c > 0 is any constant, works!.*

The Intuition behind * is actually pretty simple and beautiful. We know that if $\frac{\partial\mathcal{f}(x_1,...,x_n)}{ \partial x_i} < 0$, then $\mathcal{f}(x_1,...,x_i + \delta,...,x_n) < \mathcal{f}(x_1,...,x_i,...,x_n)$ for some finite $\delta > 0$. That is, the funcion is decreasing in the direction of $x_i$, therefore, if we were to add a positive quantity to $x_i$, we would essentially be decreasing the value of the function, which is what we want. How do we find a suitable quantity? fortunately, instead of using random small numbers, we can just subtract $c \frac{\partial\mathcal{f}(x_1,...,x_n)}{ \partial x_i}$ from $x_i$, as $\frac{\partial\mathcal{f}(x_1,...,x_n)}{ \partial x_i} < 0$, we are basically just adding a positive quantity to $x_i$. A similar argument holds when $\frac{\partial\mathcal{f}(x_1,...,x_n)}{ \partial x_i} > 0$.

note - I am going to implement a simple linear regression model (2-dimensional, a single dependent variable and a single explanatory variable) as its easy to visualise. Basically, we shall be working with the model, \[ y = ax + b \] (y,x) as defined previously and (a,b) as the parameters that we need to adjust accordingly.

Now that we have gotten the math right!, let's try to actually implement it in C.

The C Code:

For graphical visualisation, I shall be making use of the Raylib library. The logic for implementing the graph and for projecting the data on the screen is pretty computational and boring and therefore, I am not going to bother about it. Check out the github repository to see its implementation.

Cost Function -

The cost function is nothing but the RSS, the Residual sum of squares (basically $\sum_{i=1}^{n} (Y_{i_{real}} - Y_{i_{predicted}})^2$), i.e. $(Y - X\widehat{\beta})^{T} (Y - X\widehat{\beta})$, in our case its just, $\sum_{i=1}^{n} (Y_i - a x_i - b)^2$.

Why do we need it? So that we can find the function using which we will adjust the parameters, a,b, accordingly using gradient descent. Observe that the whole point of keeping track of the cost function is to minimize the error or rather, the square of the errors which is exactly what we want to do to get to the closest approximation of the real value by the predicted value.

We are going to denote the cost function by '$f_c$'.

float cost_function(data dat,model parameters) {

  float cf = 0;

  for (int i = 0; i < data_size; i++) {
    float y_real = dat.points[i].y;
    float y_predicted = (float) parameters.a * dat.points[i].x + parameters.b;

    cf += (y_real - y_predicted) * (y_real - y_predicted);
  }

  return 0.5 * cf / data_size;
}

Gradient Descent / Updating the Model:

Updating model is basically just computing the gradient descent and applying the transformation, \[ x_i \leftarrow x_i - c \frac{\partial\mathcal{f_c}(x_1,...,x_n)}{ \partial x_i} \]

The constant c used in this optimisation method is something called as a "learning rate", often denoted by $\eta~$. So the transformation is essentially, \[ x_i \leftarrow x_i - \eta \frac{\partial\mathcal{f_c}(x_1,...,x_n)}{ \partial x_i} \]

void update_model(model *parameters, data dat) {

  model temp = {0};
  temp.a = parameters->a;
  temp.b = parameters->b;

  float cf = cost_function(dat, *parameters);
  temp.a += eps;
  parameters->a =
    parameters->a - learning_rate * (cost_function(dat, temp) - cf) / eps;
  temp.a -= eps;
  temp.b += eps;
  parameters->b =
    parameters->b - learning_rate * (cost_function(dat, temp) - cf) / eps;

}

What we are doing here is using the first principle to calculate the derivative of the cost function with respect to each of its parameters. The mathematics involved is fairly elementary:

\[ \frac{\partial\mathcal{f_c}(x_1,...,x_n)}{\partial x_i} = \lim_{\epsilon \to 0} \frac{\mathcal{f_c}(x_1,..,x_i + \epsilon,..,x_n) - \mathcal{f_c}(x_1,...,x_n)}{\epsilon} \]

In our case, \[ a \leftarrow a - \eta \frac{\mathcal{f_c}(a + \epsilon,b) - \mathcal{f_c}(a,b)}{\epsilon} \] \[ b \leftarrow b - \eta \frac{\mathcal{f_c}(a,b + \epsilon) - \mathcal{f_c}(a,b)}{\epsilon} \] and we have taken the value of $\epsilon$ as 0.00001 and $\eta$ as 0.01 (the learning rate is used to accelerate or decelerate the optimisation process, as necessary).

Everything else is just programming gymnastics.

Draw the Line!

Finally, we draw the line with the updated parameters -

void draw_line(model parameters) {

  /* y = ax + b; */

  Vector2 above_extreme_point = {0};
  Vector2 lower_extreme_point = {0};

  if (fabsf(parameters.a) > (float)screen_height / screen_width) {

    above_extreme_point = (Vector2){
      (float)(y_partitions - parameters.b) / parameters.a, y_partitions};
    transform_point(&above_extreme_point);

    lower_extreme_point = (Vector2){
      (float)(-y_partitions - parameters.b) / parameters.a, -y_partitions};
    transform_point(&lower_extreme_point);
  } else {

    above_extreme_point = (Vector2){
      x_partitions, (float)parameters.a * x_partitions + parameters.b};
    transform_point(&above_extreme_point);

    lower_extreme_point = (Vector2){
      -x_partitions, (float)parameters.a * (-x_partitions) + parameters.b};
    transform_point(&lower_extreme_point);
  }
  DrawLineV(above_extreme_point,lower_extreme_point,RED);
}

All the fuss that can be seen in the above code snippet owe its existence thanks to my messy method of transforming sample points to screen points and projecting data on the screen, not to mention the troubling convention of raylib itself in which the y-axis is apparently inverted.

This gist is that this function simply draws a line with a slope 'a' and a y-intercept 'b' whenever the parameters are updated, which happens once every frame.

Again, the complete code can be found in this file.

Results:

For the simulation, for the sample points, I am drawing out random numbers in such a way so that they form a good linear relationship with each other for the sake of visualisation and visualisation alone, the program works for any sample having a good enough correlation index. The initial parameters are set at a = -1.1 and b = -80 (far away from the optimal model so that the visualisation is more apparent, I have also deliberately kept the FPS low for the same reason).

for (int i = 0; i < data_size; i++) {

    float x, y;

    x = random_value(40) - 20;
    y = x - random_value(10) + 5;

    dat.points[i] = (Vector2){x, y};
  }
model parameters;
parameters.a = -1.1;
parameters.b = -80;

for (int i = 0; i < data_size; i++) {

    float x, y;

    x = random_value(40) - 20;

    if (x >= 0) y = x - random_value(10) + 5;
    else y = random_value(40) - 20;

    if (x >= 0) y = -y;

    dat.points[i] = (Vector2){x, y};
   }
model parameters;
parameters.a = -1.1;
parameters.b = -80;