# Gradient Descent

*Taha Azzaoui - 2017.12.18*

### Note

The following is not meant to be a rigorous proof of the ideas behind Gradient Descent. Rather, its aim is to use fundamental ideas from calculus to provide an intuitive understanding of what the algorithm claims, and why it makes the assumptions that it does.

### Introduction

The goal of an optimization problem can be simple to state, yet sometimes nontrivial to achieve. Given a cost
function *f* : *A* → ℝ, we generally seek some value
*x*^{*} ∈ *A* which satisfies the constraint *f*(*x*^{*}) ≤ *f*(*x*) for all *x* ∈ *A*. Note that we limit the conversation to cost functions
which are strictly convex in order to defer the problem of local minima.

### Intuition

There are many algorithms for optimizing the value of a function, perhaps the most intuitive of which is the
iterative method of Gradient Descent. The idea is simple, and can be illustrated by the following scenario. Suppose
you are standing on some mountainous convex region well above sea level. Furthermore, suppose you wish to
*descend* upon the lowest point in said region in the shortest amount of time. How would you go about
accomplishing such a goal? One way is to turn a full 360 degrees and ask yourself the following question:
*“Which direction points the most downwards?”* Then take a step in that direction, and repeat until you find
yourself at the lowest point in the region.

This is where the gradient comes in. Given a vector *v⃗* and a
real-valued function *F* of several variables (differentiable around *v⃗*), the direction of steepest decrease in *F*
is the one opposite to the gradient of *F* evaluated at *v⃗*, namely − ∇*F*(*v⃗*). In other words,
the largest local decrease in *F* is achieved by moving from *v⃗* some small scalar step *α* along
− ∇*F*(*v⃗*).
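To make this concrete, here is a small numerical sketch. The convex function *f*(*x*, *y*) = *x*² + *y*², the starting point, and the step size are all my own illustrative choices, not from the text: among a fan of unit directions, the one that lowers *f* the most after a small step lines up with the normalized negative gradient.

```python
import numpy as np

# Assumed convex function for illustration: f(x, y) = x^2 + y^2.
def f(v):
    return v[0]**2 + v[1]**2

def grad_f(v):
    return 2 * v

v = np.array([1.0, 0.5])
step = 0.01

# Sample 360 unit directions and find the one that decreases f
# the most after a small step.
thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
best = min(directions, key=lambda d: f(v + step * d))

# It should (approximately) match the normalized negative gradient.
steepest = -grad_f(v) / np.linalg.norm(grad_f(v))
print(np.allclose(best, steepest, atol=0.02))  # True
```

The tolerance only accounts for the 1° spacing between sampled directions; refining the grid tightens the match.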

### Why does this work?

Recall that in two-dimensional Euclidean space, optimizing a differentiable function *f*(*x*) amounts to searching the set $\{\alpha \in
\mathbb{R} : \left.\frac{df}{dx}\right|_{\alpha} = 0\}$, where we define the derivative of a function as

$$\frac{df}{dx} = \lim_{\Delta x\to 0} \frac{f(x + \Delta x) - f(x)}{\Delta
x}$$

This definition works well in two dimensions, where there are only two possible directions from which *Δx* can approach 0, namely from the left and
from the right (recall that the function must approach the same value from both sides for the limit to exist). This idea
of a limit holds in multiple dimensions as well; however, we face a much stronger set of conditions. Not only must
we verify that the function approaches the same value from the left and from the right, we must go a step further and
verify that it approaches the same value *irrespective* of the direction of approach. This presents a
problem, as there exist an infinite number of directions from which to approach a point in multi-dimensional
space.

As such, the concept of a “derivative” becomes much less strictly defined. We can specify a direction and define
the **directional derivative** of a function *g* : *A* → ℝ (with *A* ⊆ ℝ^{n}) in the direction of some unit vector
*v⃗* ∈ *A* as the value ∇_{v⃗}*g* = ∇*g* ⋅ *v⃗*. Intuitively, the directional
derivative measures the rate of change of *g* along the direction of
*v⃗*. Now we can rephrase our initial question: *“Which direction points
the most downwards?”* translates to *“For what value of v⃗ is the
directional derivative minimized?”* In other words, we seek a vector

*v⃗* ∈ *A* that satisfies the constraint ∇_{v⃗}*g* ≤ ∇_{x⃗}*g* for all *x⃗* ∈ *A*.

By the definitions of the directional derivative and the dot product, we have:

∇_{v⃗}*g* = ∇*g* ⋅ *v⃗* = |∇*g*||*v⃗*| cos *θ* = |∇*g*| cos *θ*

(since *v⃗* is a unit vector). Since the gradient term is fixed,
minimizing the value of cos *θ* is sufficient for
minimizing the expression itself. This occurs when *θ* = *π*, which implies that *v⃗* must
point in the direction exactly opposite ∇*g*.

**Therefore *v⃗* = − ∇*g*/|∇*g*| satisfies our constraint.**
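This conclusion can be checked numerically. The sketch below uses an assumed example of my own, *g*(*x*, *y*) = *x*² + *y*² (not from the text): it evaluates the directional derivative ∇*g* ⋅ *u⃗* over many unit vectors *u⃗* and confirms that the minimizer is the negative normalized gradient.

```python
import numpy as np

# Assumed example for illustration: g(x, y) = x^2 + y^2, so grad g = (2x, 2y).
def grad_g(v):
    return 2 * v

v0 = np.array([3.0, 4.0])
g = grad_g(v0)                       # gradient at v0: (6, 8), |grad g| = 10

# Directional derivative along a unit vector u is grad g . u = |grad g| cos(theta).
thetas = np.linspace(0, 2 * np.pi, 720, endpoint=False)
units = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
dir_derivs = units @ g

# The minimizing unit vector should be the negative normalized gradient.
u_min = units[np.argmin(dir_derivs)]
expected = -g / np.linalg.norm(g)
print(np.allclose(u_min, expected, atol=0.01))  # True
```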

### Convergence

Gradient Descent uses this concept iteratively: start with an arbitrary initial value for the parameters of the cost function and incrementally update each parameter along the direction of steepest descent. In doing so, Gradient Descent aims to get closer to a minimum after every iteration until it eventually converges.
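The loop can be sketched in a few lines. The convex cost *f*(*x*, *y*) = (*x* − 3)² + (*y* + 1)², the learning rate, and the iteration count below are all hand-picked assumptions for illustration, not the text's:

```python
import numpy as np

# Assumed convex cost f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1).
def grad_f(v):
    return np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])

alpha = 0.1              # learning rate (step size), picked by hand
x = np.zeros(2)          # arbitrary initial parameters

for _ in range(1000):
    x = x - alpha * grad_f(x)   # step along the direction of steepest descent

print(np.round(x, 4))    # approaches the minimizer (3, -1)
```

Each iteration moves the parameters a step of size *α* along −∇*f*, exactly the direction derived above.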

Convergence here can be application-specific. In general, we can run the algorithm for a pre-determined number
of iterations (on the order of 10^{3}), or we can declare convergence once the magnitude of
the incremental update for a given iteration falls below some threshold, e.g. *ε* = 10^{ − 3}.
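As a sketch of the second criterion, the loop below stops once the norm of the incremental update drops below *ε* = 10⁻³. The cost *f*(*v⃗*) = |*v⃗*|², the starting point, and the step size are again my own assumed examples:

```python
import numpy as np

# Assumed example cost f(v) = |v|^2, so grad f(v) = 2v (not from the text).
def grad_f(v):
    return 2 * v

alpha, eps = 0.1, 1e-3   # hand-picked step size and tolerance
x = np.array([5.0, -2.0])
iters = 0

while True:
    update = alpha * grad_f(x)
    x = x - update
    iters += 1
    if np.linalg.norm(update) < eps:  # incremental update is negligibly small
        break

print(iters, np.round(x, 3))
```

For this cost, each update shrinks the distance to the minimum by a constant factor, so the stopping condition fires after a few dozen iterations.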