Gradient Descent

Taha Azzaoui - 2017.12.18

Note

The following is not meant to be a rigorous proof of the ideas behind Gradient Descent. Rather, its aim is to use fundamental ideas from calculus to provide an intuitive understanding of what the algorithm claims and why it makes the assumptions that it does.

Introduction

The goal of an optimization problem can be simple to state, yet nontrivial to achieve. Given a cost function f : A → ℝ, we seek some value x* ∈ A which satisfies the constraint f(x*) ≤ f(x) for all x ∈ A. Note that we limit the conversation to cost functions which are strictly convex in order to defer the problem of local minima.

Intuition

There are many algorithms for optimizing the value of a function, perhaps the most intuitive of which is the iterative method of Gradient Descent. The idea is simple and can be illustrated by the following scenario. Suppose you are standing on some mountainous convex region well above sea level, and you wish to reach the lowest point in that region in the shortest amount of time. How would you go about accomplishing such a goal? One way is to turn 360 degrees and ask yourself the following question: “Which direction points the most downwards?” Then take a step in that direction, and repeat until you find yourself at the lowest point in the region.

This is where the gradient comes in. Given a vector v⃗ and a scalar-valued function F (differentiable around v⃗), the direction of maximum decrease in F is the one opposite to the gradient of F evaluated at v⃗, namely −∇F(v⃗). In other words, starting from v⃗, the largest decrease in F is achieved by stepping some scalar multiple α along −∇F(v⃗).
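To make this concrete, here is a minimal sketch in Python (the bowl-shaped function F(x, y) = x² + y² and the chosen point are illustrative assumptions, not from the text):

```python
# Sketch: the direction of maximum decrease of F(x, y) = x^2 + y^2
# at a point is the negated gradient, which here points back toward
# the minimum at the origin.

def grad_F(x, y):
    return (2 * x, 2 * y)          # gradient of F: (dF/dx, dF/dy)

point = (3.0, 4.0)
gx, gy = grad_F(*point)            # gradient at the point: (6.0, 8.0)
descent_direction = (-gx, -gy)     # -grad F: (-6.0, -8.0)
print(descent_direction)
```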

Why does this work?

Recall that in two-dimensional Euclidean space, optimizing a differentiable function f(x) amounts to searching the set $\{\alpha \in \mathbb{R} : \frac{df}{dx}\big|_{\alpha} = 0\}$, where we define the derivative of a function as
$$\frac{df}{dx} = \lim_{\Delta x\to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
This definition works well in two dimensions, where there are only two possible directions from which Δx can approach zero, namely from the left and from the right (recall that the function must approach the same value from both sides for the limit to exist). This idea of a limit holds in multiple dimensions as well; however, we face a much stronger set of conditions. Not only must we verify that the function approaches the same value from the left and from the right, we must go a step further and verify that it approaches the same value irrespective of the direction of approach. This presents a problem, as there exist an infinite number of directions from which to approach a point in multi-dimensional space.
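To see the limit at work numerically, here is a small sketch (the function f(x) = x² and the point x = 2 are stand-ins, not from the text); shrinking Δx drives the difference quotient toward the true derivative 2x = 4:

```python
# Finite-difference approximation of df/dx at x = 2 for f(x) = x^2.
# As dx shrinks, the quotient approaches the true derivative, 4.

f = lambda x: x ** 2
x = 2.0
for dx in (1.0, 0.1, 0.01, 0.001):
    quotient = (f(x + dx) - f(x)) / dx
    print(f"dx = {dx:<6} quotient = {quotient:.4f}")
```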

As such, the concept of a “derivative” becomes much less strictly defined. We can instead specify a direction and define the directional derivative of a function g : A → ℝ (with A ⊆ ℝⁿ) in the direction of some unit vector v⃗ ∈ A as the value ∇v⃗g = ∇g ⋅ v⃗. Intuitively, the directional derivative measures the rate of change of g along the direction of v⃗. Now we can rephrase our initial question: “Which direction points the most downwards?” translates to “For what value of v⃗ is the directional derivative minimized?” In other words, we seek a unit vector v⃗ ∈ A satisfying ∇v⃗g ≤ ∇x⃗g for all unit vectors x⃗ ∈ A.
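As a quick sketch of this definition (again with the stand-in g(x, y) = x² + y² and a hypothetical direction, neither from the text), the directional derivative is just a dot product:

```python
import math

# Directional derivative of g(x, y) = x^2 + y^2 at p = (1, 2),
# along the unit vector v pointing 45 degrees above the x-axis.

def grad_g(x, y):
    return (2 * x, 2 * y)

p = (1.0, 2.0)
v = (1 / math.sqrt(2), 1 / math.sqrt(2))   # |v| = 1

gx, gy = grad_g(*p)
directional = gx * v[0] + gy * v[1]        # grad g(p) . v
print(directional)                          # ~4.2426
```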

By the definitions of the directional derivative and the dot product, we have:


$$\nabla_{\vec{v}} g = \nabla g \cdot \vec{v} = |\nabla g||\vec{v}|\cos\theta = |\nabla g|\cos\theta$$

(since v⃗ is a unit vector). Since the gradient term is fixed, minimizing cosθ suffices to minimize the expression itself. This occurs when θ = π, which implies that v⃗ must point in the direction opposite the gradient ∇g.

Therefore v⃗ = −∇g/|∇g|, the unit vector along −∇g, satisfies our constraint.
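This can also be checked numerically: sweeping unit vectors around the circle (for the same stand-in gradient as above), the dot product ∇g ⋅ v⃗ bottoms out at −|∇g|, at the angle opposite the gradient:

```python
import math

# For a fixed gradient (2, 4), sweep unit vectors v(t) = (cos t, sin t)
# and find the angle t minimizing grad . v. The minimum value is
# -|grad| = -sqrt(20) ~ -4.472, attained opposite the gradient's angle.

gx, gy = 2.0, 4.0
value, t = min(
    (gx * math.cos(t) + gy * math.sin(t), t)
    for t in (2 * math.pi * k / 360 for k in range(360))
)
print(value, t)   # ~ -4.472, with t ~ atan2(4, 2) + pi
```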

Convergence

Gradient Descent uses this concept iteratively: starting from an arbitrary initial value for the parameters of the cost function, it incrementally updates each parameter along the direction of steepest descent. In doing so, Gradient Descent aims to get closer to the minimum with every iteration, eventually converging after a finite number of steps.
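In symbols, each iteration applies the familiar update, with a step size (learning rate) α > 0 chosen by the practitioner:

$$\vec{x}_{k+1} = \vec{x}_k - \alpha \nabla f(\vec{x}_k)$$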

Convergence here can be application specific. In general, we can run the algorithm for a predetermined number of iterations (on the order of 10³), or we can declare convergence once the incremental update for a given iteration is smaller than some threshold, e.g. ε = 10⁻³.
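A minimal sketch of the full loop, combining both stopping criteria (the quadratic cost, starting point, and constants below are illustrative assumptions, not from the text):

```python
import math

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-3, max_iters=1000):
    """Repeat x <- x - alpha * grad_f(x) until the update's norm
    drops below eps, or max_iters iterations have run."""
    x = list(x0)
    for _ in range(max_iters):
        step = [alpha * g for g in grad_f(x)]
        x = [xi - si for xi, si in zip(x, step)]
        if math.sqrt(sum(si ** 2 for si in step)) < eps:
            break                      # declare convergence
    return x

# Example: minimize f(x, y) = x^2 + y^2, whose gradient is (2x, 2y);
# the iterates shrink geometrically toward the minimum at (0, 0).
print(gradient_descent(lambda x: [2 * xi for xi in x], [3.0, 4.0]))
```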