Search and Optimization
Numerical Optimization
Second-order Taylor Expansion
$$f(x+p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2}\, p^T \nabla^2 f(x+tp)\, p, \quad t \in (0,1)$$
Newton Direction
$$p = -\left(\nabla^2 f(x)\right)^{-1} \nabla f(x)$$
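A minimal NumPy sketch of the resulting Newton iteration (the test function, tolerance, and iteration cap below are illustrative choices, not part of the notes):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Step along the Newton direction p = -(∇²f(x))⁻¹ ∇f(x) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(hess(x), g)   # Newton direction
        x = x + p                          # take the full Newton step
    return x

# Example: f(x) = x0^4 + x1^2 has its minimum at the origin.
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, [1.0, 1.0]))
```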
Stochastic Search
Simulated Annealing
Repeat: Sample a step $p \sim P$. If $f(x+p) \ge f(x)$, set $x \leftarrow x+p$ with probability $\exp\left(\frac{f(x)-f(x+p)}{T}\right)$; otherwise always set $x \leftarrow x+p$.
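A minimal Python sketch of this loop for minimization; the cooling schedule, Gaussian proposal, and test function are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, x0, temperature, step, n_iters=10_000):
    """Accept worse moves with probability exp((f(x) - f(x+p)) / T); better moves always."""
    x, fx = x0, f(x0)
    for t in range(1, n_iters + 1):
        T = temperature(t)
        p = step()                      # sample a step p ~ P
        y, fy = x + p, f(x + p)
        if fy < fx or random.random() < math.exp((fx - fy) / T):
            x, fx = y, fy               # accept the move
    return x, fx

# Illustrative 1-D multimodal objective with a Gaussian proposal and logarithmic cooling.
f = lambda x: x**2 + 10 * math.sin(3 * x)
print(simulated_annealing(f, x0=5.0,
                          temperature=lambda t: 1.0 / math.log(t + 1),
                          step=lambda: random.gauss(0.0, 0.5)))
```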
Cross Entropy Methods
Repeat: Collect a set $A$ of samples drawn from $p(x)$. Select the top-$k$ elite samples $E \subseteq A$. Update $p(x)$ to best fit $E$.
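A minimal sketch with a diagonal Gaussian as $p(x)$, maximizing $f$; the sample sizes and the quadratic test objective are illustrative:

```python
import numpy as np

def cross_entropy_method(f, mu, sigma, n_samples=100, n_elite=10, n_iters=50):
    """Fit a diagonal Gaussian p(x) = N(mu, diag(sigma^2)) to the elite samples each round."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(n_iters):
        A = mu + sigma * np.random.randn(n_samples, mu.size)  # sample set A ~ p(x)
        scores = np.array([f(x) for x in A])
        E = A[np.argsort(scores)[-n_elite:]]                  # top-k elite set E
        mu, sigma = E.mean(axis=0), E.std(axis=0) + 1e-8      # refit p(x) to E
    return mu

# Example: maximize a concave quadratic with optimum at (1, -2).
f = lambda x: -np.sum((x - np.array([1.0, -2.0]))**2)
print(cross_entropy_method(f, mu=np.zeros(2), sigma=np.ones(2)))
```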
Search Gradient
Repeat: Draw $\lambda$ samples $\mathbf{z}_k \sim \pi(\cdot \mid \theta)$. Evaluate the fitnesses $f(\mathbf{z}_k)$. Calculate the log-derivatives $\nabla_\theta \log \pi(\mathbf{z}_k \mid \theta)$. Set $\nabla_\theta J \leftarrow \frac{1}{\lambda}\sum_{k=1}^{\lambda} \nabla_\theta \log \pi(\mathbf{z}_k \mid \theta) \cdot f(\mathbf{z}_k)$, then update $\theta \leftarrow \theta + \eta \cdot \nabla_\theta J$.
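A minimal sketch with a fixed-variance Gaussian search distribution, where only the mean is treated as $\theta$, so $\nabla_\theta \log \pi(\mathbf{z} \mid \theta) = (\mathbf{z}-\mu)/\sigma^2$; all hyperparameters below are illustrative:

```python
import numpy as np

def search_gradient(f, mu, sigma=0.5, lam=50, eta=0.05, n_iters=200):
    """Vanilla search gradient with π(z|θ) = N(μ, σ²I) and θ = μ."""
    mu = np.asarray(mu, float)
    for _ in range(n_iters):
        z = mu + sigma * np.random.randn(lam, mu.size)   # λ samples z_k ~ π(·|θ)
        fitness = np.array([f(zk) for zk in z])          # evaluate fitnesses f(z_k)
        grad_log = (z - mu) / sigma**2                   # ∇_θ log π(z_k|θ)
        grad_J = (grad_log * fitness[:, None]).mean(axis=0)
        mu = mu + eta * grad_J                           # θ ← θ + η·∇_θ J
    return mu

# Example: maximize f(z) = -||z - (2, 3)||².
f = lambda z: -np.sum((z - np.array([2.0, 3.0]))**2)
print(search_gradient(f, mu=np.zeros(2)))
```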
Reinforcement Learning
Values
$$V^\pi(s_0) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty}\gamma^i R(s_i)\right]$$
Bellman Expectation Equation
$$V^\pi(s_0) = R(s_0) + \gamma\,\mathbb{E}_\pi\left[V^\pi(s_1)\right] = R(s_0) + \gamma\sum_i P\left(s_1^i \mid s_0, \pi(s_0)\right)\cdot V^\pi\left(s_1^i\right)$$
Bellman Optimality Equation
$$V^*(s_0) = \max_{\pi}V^\pi(s_0) = R(s_0) + \gamma\max_{a\in A(s_0)}\sum_i P\left(s_1^i \mid s_0, a\right)\cdot V^*\left(s_1^i\right)$$
Value Iteration
Repeat until $\delta < \epsilon(1-\gamma)/\gamma$: set $U \leftarrow U'$, $\delta \leftarrow 0$. For each state $s \in S$, $U'[s] \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U[s']$; if $\left|U'[s] - U[s]\right| > \delta$, then $\delta \leftarrow \left|U'[s] - U[s]\right|$.
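A minimal sketch of this loop, assuming transitions are given as `P[s][a] = [(prob, s'), ...]`; the two-state example MDP is purely illustrative:

```python
def value_iteration(S, A, P, R, gamma, eps=1e-6):
    """Iterate the Bellman optimality backup until δ < ε(1-γ)/γ, then return U."""
    U1 = {s: 0.0 for s in S}
    while True:
        U, delta = dict(U1), 0.0
        for s in S:
            U1[s] = R(s) + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in A(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < eps * (1 - gamma) / gamma:
            return U

# Two-state example: being in state "b" pays 1, and "go" switches states.
S = ["a", "b"]
A = lambda s: ["stay", "go"]
P = {"a": {"stay": [(1.0, "a")], "go": [(1.0, "b")]},
     "b": {"stay": [(1.0, "b")], "go": [(1.0, "a")]}}
R = lambda s: 1.0 if s == "b" else 0.0
print(value_iteration(S, A, P, R, gamma=0.9))
```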
Monte Carlo Policy Evaluation
$$V^\pi(s_0) = \mathbb{E}_\pi[G(s_0)] = \mathbb{E}_\pi\left[\sum_{i=0}^{T}\gamma^{i}R(s_i)\right]$$
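A minimal first-visit Monte Carlo sketch; the episode sampler is assumed to return $(s_i, R(s_i))$ pairs generated under $\pi$, and the random-walk chain below is illustrative:

```python
import random
from collections import defaultdict

def mc_policy_evaluation(sample_episode, gamma, n_episodes=5000):
    """Estimate V^π(s) as the average of the observed returns G(s)."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = sample_episode()           # list of (s_i, R(s_i)) pairs under π
        G, G_at = 0.0, {}
        for s, r in reversed(episode):       # accumulate returns backwards
            G = r + gamma * G
            G_at[s] = G                      # last write kept = first visit's return
        for s, G_s in G_at.items():
            returns[s].append(G_s)
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Toy chain: start at 0, stay or move right with equal probability, stop at 3 (reward 1).
def sample_episode():
    s, episode = 0, []
    while s != 3:
        episode.append((s, 0.0))
        s += random.choice([0, 1])
    episode.append((3, 1.0))
    return episode

print(mc_policy_evaluation(sample_episode, gamma=0.9))
```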
Temporal-Difference (TD) Prediction
$$V^\pi(s_0) \leftarrow V^\pi(s_0) + \alpha\left(R(s_0) + \gamma V^\pi(s_1) - V^\pi(s_0)\right)$$
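A minimal TD(0) sketch on the same kind of toy chain (the chain, reward function, and step size are illustrative); each observed transition nudges $V(s)$ toward the bootstrapped target $R(s) + \gamma V(s')$:

```python
import random
from collections import defaultdict

def td0_prediction(gamma=0.9, alpha=0.05, n_episodes=5000):
    """TD(0): V(s) ← V(s) + α(R(s) + γ V(s') − V(s)) after every transition."""
    V = defaultdict(float)
    R = lambda s: 1.0 if s == 3 else 0.0      # illustrative reward on states
    for _ in range(n_episodes):
        s = 0
        while s != 3:                         # random walk: stay or move right
            s2 = s + random.choice([0, 1])
            V[s] += alpha * (R(s) + gamma * V[s2] - V[s])
            s = s2
        V[3] += alpha * (R(3) - V[3])         # terminal state has no successor term
    return dict(V)

print(td0_prediction())
```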
Tabular Q-Learning
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(R(s) + \gamma\max_{a'}Q(s',a') - Q(s,a)\right)$$
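A minimal tabular Q-learning sketch with ε-greedy exploration; the Gym-like environment interface and the toy corridor below are assumptions for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=2000):
    """Update Q(s,a) toward R(s) + γ max_a' Q(s',a') after each step."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            a = random.choice(acts) if random.random() < epsilon \
                else max(acts, key=lambda a: Q[(s, a)])            # ε-greedy action
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])              # TD update on Q
            s = s2
    return Q

class Corridor:
    """Toy chain 0..4: step left/right, reward 1 for reaching state 4."""
    def reset(self):
        self.s = 0
        return self.s
    def actions(self, s):
        return [-1, +1]
    def step(self, a):
        self.s = min(4, max(0, self.s + a))
        return self.s, (1.0 if self.s == 4 else 0.0), self.s == 4

Q = q_learning(Corridor())
print(max([-1, +1], key=lambda a: Q[(0, a)]))   # greedy action at the start
```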
Deep Q-Learning
$$\theta \leftarrow \theta - \alpha\nabla_\theta\left[R(s) + \gamma\max_{a'}Q_\theta(s',a') - Q_\theta(s,a)\right]^2$$
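A minimal PyTorch sketch of this update (PyTorch and the network sizes are assumptions; a real DQN also uses a replay buffer and a target network, both omitted here). The bootstrapped target is detached so the semi-gradient only flows through $Q_\theta(s,a)$:

```python
import torch
import torch.nn as nn

# Q-network mapping a 4-dim state to one Q-value per action (sizes are illustrative).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s2, done, gamma=0.99):
    """One gradient step on [R(s) + γ max_a' Q_θ(s',a') − Q_θ(s,a)]²."""
    q_sa = q_net(s)[a]
    with torch.no_grad():                             # treat the target as a constant
        target = r + (0.0 if done else gamma * q_net(s2).max())
    loss = (target - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Single illustrative transition (random tensors stand in for real experience).
s, s2 = torch.randn(4), torch.randn(4)
print(dqn_update(s, a=1, r=1.0, s2=s2, done=False))
```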
Policy Gradient
$$\theta \leftarrow \theta + \alpha R(\tau)\sum_{i=0}^{|\tau|-1}\nabla\log\left(\pi_\theta(a_i \mid s_i)\right)$$
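A minimal REINFORCE sketch with a tabular softmax policy, for which $\nabla_{\theta[s,\cdot]}\log\pi_\theta(a\mid s) = \mathrm{onehot}(a) - \pi_\theta(\cdot\mid s)$; the one-state, two-action toy task is an illustrative assumption:

```python
import numpy as np

def reinforce(n_states, n_actions, sample_episode, alpha=0.1, n_iters=2000):
    """Tabular softmax policy π_θ(a|s) ∝ exp(θ[s,a]), updated by θ ← θ + α R(τ) Σ ∇log π."""
    theta = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
        traj, R = sample_episode(pi)          # trajectory [(s_i, a_i), ...] and return R(τ)
        grad = np.zeros_like(theta)
        for s, a in traj:
            grad[s] -= pi[s]                  # ∇ log softmax: onehot(a) − π(·|s)
            grad[s, a] += 1.0
        theta += alpha * R * grad             # θ ← θ + α R(τ) Σ_i ∇ log π_θ(a_i|s_i)
    return theta

# Toy task: one state, two actions; action 1 pays 1.0 and action 0 pays 0.0.
def sample_episode(pi):
    a = np.random.choice(2, p=pi[0])
    return [(0, a)], float(a == 1)

print(reinforce(n_states=1, n_actions=2, sample_episode=sample_episode))
```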
Bandits and MCTS
Regret
$$R_\sigma(t) = \mathbb{E}_{c_i\sim\sigma}\left[\sum_{i=1}^{t}X_{c^*} - \sum_{i=1}^{t}X_{c_i}\right] = \mu^* t - \mathbb{E}_{c_i\sim\sigma}\left[\sum_{i=1}^{t}\mu(c_i)\right]$$
Concentration Bounds
$$P\left(|\bar{X}-\mu|\geq\varepsilon\right) \leq 2e^{-2N\varepsilon^2} \quad \text{with i.i.d. } X_i \sim P_\mu \in [0,1]$$
$$P\left(|\bar{X}-\mu|\geq\varepsilon\right) \leq \frac{2}{T^{2c}} \quad \text{where } \varepsilon = \sqrt{\frac{c\log T}{N}}$$
Explore-Then-Commit
First, play each coin $N = c' \cdot T^{2/3} (\log T)^{1/3}$ times and compare the average rewards $\bar{\mu}_1$ and $\bar{\mu}_2$. Afterwards, play the coin with the higher $\bar{\mu}_i$.
$$R(T) \leq N + c\sqrt{\frac{\log T}{N}}\cdot T, \quad R(T) \sim O\left(T^{\frac{2}{3}}(\log T)^{\frac{1}{3}}\right)$$
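A minimal two-armed explore-then-commit sketch; the Bernoulli arms, horizon, and constant $c'$ are illustrative:

```python
import numpy as np

def explore_then_commit(pull, T, c_prime=1.0):
    """Pull each arm N = c'·T^(2/3)·(log T)^(1/3) times, then commit to the better one."""
    N = int(c_prime * T ** (2 / 3) * np.log(T) ** (1 / 3))
    rewards = [[pull(i) for _ in range(N)] for i in range(2)]     # exploration phase
    best = int(np.mean(rewards[1]) > np.mean(rewards[0]))         # arm with higher μ̄
    total = sum(map(sum, rewards)) + sum(pull(best) for _ in range(T - 2 * N))
    return best, total

means = [0.5, 0.6]                                   # illustrative Bernoulli arms
pull = lambda i: float(np.random.rand() < means[i])
print(explore_then_commit(pull, T=10_000))
```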
Epsilon-Greedy
Compare the average rewards $\bar{\mu}_1$ and $\bar{\mu}_2$. Choose $\varepsilon = c' \cdot t^{-1/3} (\log t)^{1/3}$. With probability $1-\varepsilon$, play the arm with the better empirical mean; with probability $\varepsilon$, play the other arm.
$$R(t) \leq \varepsilon t + c\sqrt{\frac{\log t}{\varepsilon t}}\cdot t, \quad R(t) \sim O\left(t^{\frac{2}{3}}(\log t)^{\frac{1}{3}}\right)$$
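A minimal two-armed ε-greedy sketch with the decaying schedule above ($\log(t+1)$ is used only to avoid $\log 1 = 0$ at the first step); the arms and constant are illustrative:

```python
import numpy as np

def epsilon_greedy(pull, T, c_prime=1.0):
    """Play the arm with the better empirical mean w.p. 1−ε_t, the other w.p. ε_t."""
    counts, sums = np.zeros(2), np.zeros(2)
    for t in range(1, T + 1):
        eps = min(1.0, c_prime * (np.log(t + 1) / t) ** (1 / 3))  # ε_t ≈ c'·t^(-1/3)(log t)^(1/3)
        greedy = int(np.argmax(sums / np.maximum(counts, 1)))
        arm = greedy if np.random.rand() > eps else 1 - greedy
        sums[arm] += pull(arm)
        counts[arm] += 1
    return sums / np.maximum(counts, 1)                           # empirical means

means = [0.5, 0.6]                                   # illustrative Bernoulli arms
pull = lambda i: float(np.random.rand() < means[i])
print(epsilon_greedy(pull, T=10_000))
```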
Upper Confidence Bound
Compute the UCB value of each arm $i \in [K]$ and play the arm attaining $\max_i B(i, t, n_i(t))$.
$$B(i,t,n_i(t)) = \bar{X}_i + \sqrt{\frac{2\log t}{n_i(t)}}$$
$$R_{UCB}(t) = \sum_{i=1}^{K}(\mu_{i^*}-\mu_i)\cdot\mathbb{E}[n_i(t)] \leq \sum_{i=1}^{K}\frac{8\log t}{\mu_{i^*}-\mu_i} + O(1)$$
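A minimal UCB1 sketch; the Bernoulli arms and horizon are illustrative:

```python
import numpy as np

def ucb(pull, K, T):
    """After one pull per arm, always play argmax_i  X̄_i + sqrt(2 log t / n_i(t))."""
    counts, sums = np.zeros(K), np.zeros(K)
    for i in range(K):                        # initialization: pull each arm once
        counts[i], sums[i] = 1, pull(i)
    for t in range(K + 1, T + 1):
        bonus = np.sqrt(2 * np.log(t) / counts)
        i = int(np.argmax(sums / counts + bonus))
        sums[i] += pull(i)
        counts[i] += 1
    return counts                             # pull counts concentrate on the best arm

means = [0.4, 0.5, 0.6]                       # illustrative Bernoulli arms
pull = lambda i: float(np.random.rand() < means[i])
print(ucb(pull, K=3, T=10_000))
```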
— Mar 13, 2025