These notes cover basic knowledge in linear algebra, statistics and probability, and machine learning, based on the CS229: Machine Learning course. While preparing for my PhD interview, I compiled these notes, deliberately omitting certain topics that seemed unlikely to be covered in the interview (which turned out to be true). Given my limited experience, there may be some errors in these notes. If you notice any mistakes or have suggestions for improvement, please feel free to email me. My contact information is available on my homepage.

References:

A special thanks to everyone involved in creating and maintaining this course, for making it open and freely accessible.

Authored by Bingkui Tong
December 19, 2024


Linear Algebra

- **transpose**: $(AB)^T = B^T A^T$; $(A+B)^T = A^T + B^T$
- **column vector**: a generic vector is usually taken to be a matrix with a single column
- **row vector**: usually written as the transpose of a column vector
- **product**: e.g. the result of multiplying two matrices
- **inner / dot product**: the result of multiplying two vectors (a scalar)
- **coefficient**
- **associative**: $(AB)C = A(BC)$
- **distributive**: $A(B+C) = AB + AC$
- **commutative**: $AB = BA$ (does not hold for matrix multiplication in general)
- **identity matrix**: ones on the diagonal, zeros everywhere else
- **diagonal matrix**: all entries off the diagonal are zero
- **symmetric** (only for square matrices): $A = A^T$. For any $A \in \mathbb{R}^{n\times n}$, $A + A^T$ is symmetric
- **anti-symmetric** (only for square matrices): $A = -A^T$. For any $A \in \mathbb{R}^{n\times n}$, $A - A^T$ is anti-symmetric
  - Any $A \in \mathbb{R}^{n\times n}$ can be written as $A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)$; the first term is symmetric and the second is anti-symmetric
- **trace** (only for square matrices): the sum of the diagonal entries, $\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}$
  - For square matrices, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$; more generally, the trace of a product is invariant under cyclic permutations of the factors
- **square matrix**
- **norm** (for vectors): the Euclidean (or $\ell_2$) norm is $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^T x}$
  - The four properties a norm must satisfy: non-negativity, definiteness, homogeneity, and the triangle inequality
  - The $\ell_p$ norm is $\left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$
- **linearly independent**: a set of vectors $\{x_1, x_2, ..., x_n\}$ is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors
  - Linearly dependent: $x_n = \sum_{i=1}^{n-1} \alpha_i x_i$ for some scalars $\alpha_i$
- **rank**: $\mathrm{rank}(A)$. Row rank equals column rank, and the common value is simply called the rank
- **row rank**: the size of the largest subset of rows of $A$ that constitutes a linearly independent set, i.e. the number of linearly independent rows
- **column rank**: the size of the largest subset of columns of $A$ that constitutes a linearly independent set, i.e. the number of linearly independent columns
  - For $A \in \mathbb{R}^{m\times n}$, $\mathrm{rank}(A) \le \min(m, n)$. If $\mathrm{rank}(A) = \min(m, n)$, we call the matrix full rank
  - $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$; $\mathrm{rank}(A+B) \le \mathrm{rank}(A) + \mathrm{rank}(B)$
- **inverse** (only for square matrices): the inverse of $A$ is $A^{-1}$, and $AA^{-1} = I = A^{-1}A$. Not all matrices have inverses
- **invertible / non-singular**: $A$ must be full rank to have an inverse
- **non-invertible / singular**: $(AB)^{-1} = B^{-1}A^{-1}$; $(A^{-1})^T = (A^T)^{-1}$
- **orthogonal matrix**: vectors $x$ and $y$ are orthogonal if $x^T y = 0$. A square matrix $U$ is orthogonal if all of its columns are orthogonal to each other and normalized (the columns are then referred to as orthonormal)
- **normalized**: $U^T U = I = U U^T$; in other words, the inverse of an orthogonal matrix is its transpose. Usually "orthogonal" is used to describe square matrices
  - $\|Ux\|_2 = \|x\|_2$: multiplying a vector by an orthogonal matrix does not change its 2-norm
- **span**: the span of a set of vectors $\{x_1, x_2, ..., x_n\}$ is the set of all vectors that can be expressed as a linear combination of $\{x_1, x_2, ..., x_n\}$
  - If $\{x_1, ..., x_n\}$ are $n$ linearly independent vectors with $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{x_1, ..., x_n\}) = \mathbb{R}^n$
- **projection**: the projection of a vector $y$ onto the span of $\{x_1, ..., x_n\}$ is the vector in that span that minimizes the Euclidean distance to $y$
- **range / column space**: $R(A) = \mathrm{span}(\{a_1, ..., a_n\})$, where $a_1, ..., a_n$ are the columns of $A$
- **nullspace**: the set of vectors that give 0 when multiplied by $A$: $N(A) = \{x \in \mathbb{R}^n : Ax = 0\}$
  - For $A \in \mathbb{R}^{m\times n}$: $R(A^T) \cap N(A) = \{0\}$, and $R(A^T)$ and $N(A)$ together span $\mathbb{R}^n$; they are orthogonal complements, written $R(A^T) = N(A)^{\perp}$
- **determinant**: $\det A$ or $|A|$ (geometric intuition: my understanding is that it is the factor by which the volume, or area, of the region spanned by a set of orthogonal basis vectors gets scaled)
  - Some properties: $|A| = |A^T|$; $|AB| = |A||B|$; $|A^{-1}| = 1/|A|$
  - For a square matrix, the determinant is non-zero if and only if the matrix is full rank
  - Laplace expansion of the determinant: $|A| = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} |M_{ij}|$, where $M_{ij}$ is $A$ with the $i$-th row and $j$-th column removed. The expansion can be taken along any row or any column
- **classical adjoint**: $\mathrm{adj}(A) \in \mathbb{R}^{n\times n}$, $(\mathrm{adj}(A))_{ij} = (-1)^{i+j} |A_{\backslash j, \backslash i}|$ (note that the $i$ and $j$ indices are switched in $A_{\backslash j, \backslash i}$)
  - $A^{-1} = \frac{1}{|A|}\mathrm{adj}(A)$
- **quadratic form**: the scalar value $x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} x_i x_j$
  - We usually explicitly assume that the matrix appearing in a quadratic form is symmetric
  - Writing $A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)$: since $x^T A x$ is a scalar, $x^T A^T x = x^T A x$, so only the symmetric part contributes to the quadratic form; the anti-symmetric part contributes nothing
- **positive definite (PD)**: a symmetric matrix $A \in \mathbb{S}^n$ is positive definite if $x^T A x > 0$ for all non-zero vectors $x \in \mathbb{R}^n$. Usually denoted $A \succ 0$ (or just $A > 0$); the set of all positive definite matrices is denoted $\mathbb{S}^n_{++}$
- **positive semidefinite (PSD)**: $x^T A x \ge 0$ for all vectors $x$ (not just non-zero ones); denoted $A \succeq 0$; the set is $\mathbb{S}^n_{+}$
- **negative definite (ND)**: a symmetric matrix $A \in \mathbb{S}^n$ is negative definite if $x^T A x < 0$ for all non-zero vectors $x \in \mathbb{R}^n$. Usually denoted $A \prec 0$ (or just $A < 0$)
- **negative semidefinite (NSD)**: $x^T A x \le 0$ for all vectors $x$; denoted $A \preceq 0$
- **indefinite**: neither positive semidefinite nor negative semidefinite
  - Positive/negative definite matrices are full rank and hence invertible
- **Gram matrix**: $A^T A$ for a matrix $A$ of any shape; a Gram matrix is always positive semidefinite (and so is $AA^T$)
- **eigenvalue / eigenvector**: $Ax = \lambda x$, $x \ne 0$, where $x$ is an eigenvector and $\lambda$ is an eigenvalue
  - Intuitively, multiplying $A$ by the vector $x$ results in a new vector that points in the same direction as $x$, but scaled by a factor $\lambda$
  - Usually we assume the eigenvector is normalized, because if $x$ is an eigenvector then so is $ax$ for any scalar $a \ne 0$
  - $(\lambda I - A)x = 0$ has a non-zero solution $x$ only when $(\lambda I - A)$ has a non-trivial nullspace (its columns are linearly dependent), which is the case exactly when $(\lambda I - A)$ is singular, i.e. $|\lambda I - A| = 0$
  - Some properties: $\mathrm{tr}A = \sum_{i=1}^{n}\lambda_i$ and $|A| = \prod_{i=1}^{n}\lambda_i$. A diagonalizable matrix can be written $A = PDP^{-1}$ with $P$ invertible and the eigenvalues on the diagonal of $D$; then $\mathrm{tr}A = \mathrm{tr}(PDP^{-1}) = \mathrm{tr}(DP^{-1}P) = \mathrm{tr}D = \sum_i \lambda_i$, and likewise $\det A = \det P \det D \det P^{-1} = \det D = \prod_i \lambda_i$
  - The eigenvalues of $A^{-1}$ are the reciprocals of the eigenvalues of $A$
  - The eigenvalues of a diagonal matrix are just its diagonal entries
- **diagonalizable**: $A$ is diagonalizable if it has $n$ linearly independent eigenvectors, in which case $A = PDP^{-1}$ with the eigenvectors as the columns of $P$
  - A matrix is invertible iff all $n$ of its eigenvalues are non-zero; a non-invertible matrix has determinant 0 because it has at least one zero eigenvalue
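A quick NumPy check of the trace and determinant identities above (a minimal sketch; the matrix here is an arbitrary example chosen only for illustration):

```python
import numpy as np

# Arbitrary symmetric example matrix (assumed only for illustration).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

eigvals, eigvecs = np.linalg.eig(A)

# trace(A) equals the sum of the eigenvalues.
print(np.trace(A), eigvals.sum())
# det(A) equals the product of the eigenvalues.
print(np.linalg.det(A), eigvals.prod())
# Diagonalization: A = P D P^{-1}.
P, D = eigvecs, np.diag(eigvals)
print(np.allclose(A, P @ D @ np.linalg.inv(P)))  # True
```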
- **Hessian**: the matrix of second-order partial derivatives, $(\nabla^2_x f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$
- **least squares**: to minimize $\|Ax - b\|_2^2$, set the gradient to zero and solve, which gives the closed-form solution $x = (A^T A)^{-1} A^T b$. The derivation uses the two gradient identities $\nabla_x b^T x = b$ and $\nabla_x x^T A x = 2Ax$ (for symmetric $A$)
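A minimal NumPy sketch of the closed-form solution above (the data is synthetic and assumed only for illustration); in practice `np.linalg.lstsq` or `np.linalg.solve` is preferred over forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))                  # design matrix
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=50)   # noisy targets

# Normal-equation solution x = (A^T A)^{-1} A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Same answer from the library routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_hat, x_lstsq))            # True
```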

Statistics and Probability

- **sample space**: $\Omega$, the set of all outcomes of a random experiment
- **set of events (event space)**: $\mathcal{F}$, a set whose elements $A \in \mathcal{F}$ (called events) are subsets of $\Omega$
  - Properties: $\emptyset \in \mathcal{F}$; $A \in \mathcal{F} \Rightarrow \Omega \setminus A \in \mathcal{F}$; $A_1, A_2, ... \in \mathcal{F} \Rightarrow \bigcup_i A_i \in \mathcal{F}$
- **conditional probability**: $P(A|B) \triangleq \frac{P(A \cap B)}{P(B)}$
- **independent** (events): $P(A \cap B) = P(A)P(B)$, or equivalently $P(A|B) = P(A)$
- **random variable**: a function $X: \Omega \to \mathbb{R}$
- **discrete random variable**: a random variable that takes values in a countable set
- **continuous random variable**: a random variable whose distribution is described by a density over $\mathbb{R}$
- **cumulative distribution function (CDF)**: $F_X(x) \triangleq P(X \le x)$
  - Properties: $0 \le F_X(x) \le 1$; $\lim_{x\to-\infty} F_X(x) = 0$; $\lim_{x\to\infty} F_X(x) = 1$; $F_X$ is non-decreasing
- **probability mass function (PMF, for discrete random variables)**: $p_X(x) = P(X = x)$
  - Properties: $0 \le p_X(x) \le 1$; $\sum_x p_X(x) = 1$; $\sum_{x \in A} p_X(x) = P(X \in A)$
- **probability density function (PDF, for continuous random variables)**: $f_X(x) \triangleq \frac{dF_X(x)}{dx}$
  - Properties: $f_X(x) \ge 0$; $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$; $\int_{x \in A} f_X(x)\,dx = P(X \in A)$
- **expectation** (in the probabilistic sense): for a discrete random variable, $E[g(X)] \triangleq \sum_x g(x)\,p_X(x)$; for a continuous random variable, $E[g(X)] \triangleq \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx$
  - This is a kind of "weighted average" of $g(X)$
  - Properties: $E[a] = a$ for any constant $a$; $E[a f(X)] = a\,E[f(X)]$; $E[f(X) + g(X)] = E[f(X)] + E[g(X)]$ (linearity of expectation)
- **variance**: the variance of a random variable $X$ is a measure of how concentrated the distribution of $X$ is around its mean
  - $\mathrm{Var}[X] \triangleq E[(X - E[X])^2] = E[X^2] - (E[X])^2$
  - Properties: $\mathrm{Var}[a] = 0$ for a constant $a$; $\mathrm{Var}[a f(X)] = a^2\,\mathrm{Var}[f(X)]$
- Some common discrete random variables:
  - **Bernoulli**($p$): $P(X=1) = p$, $P(X=0) = 1-p$
  - **Binomial**($n, p$): $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$
  - **Geometric**($p$): $P(X=k) = (1-p)^{k-1} p$
  - **Poisson**($\lambda$): $P(X=k) = e^{-\lambda}\frac{\lambda^k}{k!}$
- Some common continuous random variables:
  - **Uniform**($a, b$): $f_X(x) = \frac{1}{b-a}$ for $a \le x \le b$ (0 otherwise)
  - **Exponential**($\lambda$): $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
  - **Normal**($\mu, \sigma^2$): $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- **joint cumulative distribution function**: $F_{XY}(x, y) = P(X \le x, Y \le y)$
- **marginal cumulative distribution function**: $F_X(x) = \lim_{y\to\infty} F_{XY}(x, y)$
- **joint probability mass function** (discrete): $p_{XY}(x, y) = P(X = x, Y = y)$
- **marginal probability mass function** (discrete): $p_X(x) = \sum_y p_{XY}(x, y)$
- **joint probability density function**: $f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x\,\partial y}$, with $\iint_{(x,y)\in A} f_{XY}(x, y)\,dx\,dy = P((X, Y) \in A)$
- **marginal probability density function**: $f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$
- **conditional distribution** (discrete): $p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}$
- **conditional probability density**: $f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}$
- **Bayes' rule**: $P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$; the same form holds for random variables with PMFs or PDFs, e.g. $p_{Y|X}(y|x) = \frac{p_{X|Y}(x|y)\,p_Y(y)}{p_X(x)}$
- **independent** (here for multiple random variables): $F_{XY}(x, y) = F_X(x) F_Y(y)$ for all $x, y$ iff $X$ and $Y$ are independent (equivalently, the joint PMF/PDF factorizes)
  - If $X$ is independent of $Y$, then any function of $X$ is independent of any function of $Y$
- **expectation and covariance** (multiple random variables): $E[g(X, Y)] \triangleq \sum_x\sum_y g(x, y)\,p_{XY}(x, y)$ in the discrete case, or the corresponding double integral in the continuous case
  - $\mathrm{Cov}[X, Y] \triangleq E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
  - If $\mathrm{Cov}[X, Y] = 0$, $X$ and $Y$ are uncorrelated
  - For more random variables, collected into a random vector $X \in \mathbb{R}^n$, the covariance matrix is $\Sigma = E[(X - E[X])(X - E[X])^T]$, which is symmetric positive semidefinite
- **law of total probability**: for a partition $A_1, ..., A_k$ of $\Omega$, $P(B) = \sum_{i=1}^{k} P(B|A_i)\,P(A_i)$
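A small worked example combining Bayes' rule with the law of total probability (the numbers are made up purely for illustration):

```python
# Bayes' rule with the law of total probability (toy numbers, assumed for illustration).
p_sick = 0.01                     # class prior P(sick)
p_pos_given_sick = 0.95           # P(+ | sick)
p_pos_given_healthy = 0.05        # P(+ | healthy)

# Denominator via the law of total probability: P(+) = sum_i P(+|A_i) P(A_i).
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Posterior via Bayes' rule: P(sick | +) = P(+|sick) P(sick) / P(+).
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(round(p_sick_given_pos, 3))  # ~0.161
```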

Machine Learning

- **gradient descent**: repeatedly update $\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta)$, where $J(\theta)$ is the loss function and $\alpha$ is the learning rate
- **batch gradient descent**: each update uses the gradient computed over the entire training set
  - For the least-squares cost used with linear regression, $J(\theta)$ is convex, so it has only one global, and no other local, optima
- **stochastic gradient descent (incremental gradient descent, SGD)**: each update uses a single training example; look at one example, update once
  - Often, stochastic gradient descent gets $\theta$ "close" to the minimum much faster than batch gradient descent
  - When the training set is very large, SGD makes progress with every single example it sees, which works well; a single batch-gradient-descent update that sweeps the whole training set is too expensive
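A minimal sketch contrasting the two update rules on linear regression (synthetic data; the learning rate and pass counts are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)
alpha = 0.05  # learning rate (assumed)

# Batch gradient descent: one update per full pass over the data.
theta = np.zeros(2)
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(y)      # gradient of (1/2n)||X theta - y||^2
    theta -= alpha * grad

# Stochastic gradient descent: one update per example.
theta_sgd = np.zeros(2)
for _ in range(5):                             # a few passes over the data
    for i in rng.permutation(len(y)):
        grad_i = (X[i] @ theta_sgd - y[i]) * X[i]
        theta_sgd -= alpha * grad_i

print(theta, theta_sgd)  # both approach [2, -1]
```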
- **least squares**: the gradient descent above requires iteration; least squares (the normal equations) gives a closed-form solution for $\theta$ directly: $\theta = (X^T X)^{-1} X^T \vec{y}$
- **independently and identically distributed (i.i.d.)**
- **likelihood**: $L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$
  - This is the problem turned around: in the dataset, $x$ and $y$ are known, and we ask which value of $\theta$ makes the observed $y$ most probable
  - In short: find the parameter $\theta$ that maximizes this probability
- **maximum likelihood**: we should choose $\theta$ so as to make the data as high probability as possible, i.e. choose $\theta$ to maximize $L(\theta)$
  - For computational convenience we usually maximize the log-likelihood $\ell(\theta) = \log L(\theta)$ instead
- **sigmoid function / logistic function**: $g(z) = \frac{1}{1 + e^{-z}}$, which maps $\mathbb{R}$ into $(0, 1)$; in logistic regression the hypothesis is $h_\theta(x) = g(\theta^T x)$
  - A useful property of the derivative: $g'(z) = g(z)(1 - g(z))$
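A minimal logistic-regression sketch using the sigmoid above, trained by gradient ascent on the log-likelihood (synthetic, linearly separable data; the learning rate and iteration count are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)   # synthetic labels

theta = np.zeros(2)
alpha = 0.1
for _ in range(500):
    h = sigmoid(X @ theta)
    # Gradient of the log-likelihood: X^T (y - h).
    theta += alpha * X.T @ (y - h) / len(y)

preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).mean())   # training accuracy, close to 1
```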
- **regression**
- **multinomial distribution**
- **softmax**: a normalization operation that turns a vector of scores into a probability distribution, $\mathrm{softmax}(t)_i = \frac{e^{t_i}}{\sum_{j} e^{t_j}}$
  - We usually call the $t_i$ here "logits"
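A small sketch of softmax as defined above; subtracting the maximum logit first is a standard trick (not specific to these notes) to avoid overflow without changing the result:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; softmax is shift-invariant.
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # sums to 1
print(softmax(np.array([1000.0, 1001.0])))   # no overflow thanks to the shift
```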
- **Newton's method**: another optimization method; it finds a zero of a function $f$ via the update $\theta := \theta - \frac{f(\theta)}{f'(\theta)}$
  - To optimize $\ell(\theta)$, apply it to the first derivative, i.e. find a point where the gradient is zero: $\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$, or in the vector case $\theta := \theta - H^{-1}\nabla_\theta \ell(\theta)$, where $H$ is the Hessian. At a maximum the Hessian is negative (semi)definite, i.e. the curvature is negative; at a minimum it is positive (semi)definite
  - See the pdf for details
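A one-dimensional sketch of the update above, used here to find a zero of $f(\theta) = \theta^2 - 2$ (i.e. $\sqrt{2}$); the starting point and iteration count are arbitrary:

```python
def newton(f, f_prime, theta0, n_iter=10):
    """Find a zero of f via theta := theta - f(theta) / f'(theta)."""
    theta = theta0
    for _ in range(n_iter):
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Example: a zero of f(theta) = theta^2 - 2 is sqrt(2) ~ 1.41421.
print(newton(lambda t: t**2 - 2, lambda t: 2*t, theta0=1.0))
```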
- **generalized linear models (GLM)**
- **exponential family**: distributions of the form $p(y; \eta) = b(y)\exp(\eta^T T(y) - a(\eta))$
  - A distribution belongs to the exponential family if it can be written in this form; see the pdf for details
- **discriminative learning algorithms**: model $P(y|x)$ directly; given the input, output $y$ directly
- **generative learning algorithms**: model $P(x|y)$, i.e. learn what the features of each class look like, then use Bayes' rule (or another rule) to compute the final output
- **class prior**: the class prior $p(y)$ is the probability that class $y$ occurs before observing any data $x$
  - The posterior is computed with Bayes' rule: $p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$; plug in each possible value of $y$ and compare the resulting probabilities
  - Here $p(x) = p(x|y=1)p(y=1) + p(x|y=0)p(y=0)$; in practice this denominator need not be computed, because we only want the $y$ that maximizes the posterior and the denominator does not depend on $y$
- **multivariate normal distribution**: $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$, with mean vector $\mu$ and covariance matrix $\Sigma$
- **Gaussian discriminant analysis (GDA)**: see pdf 4.1.2; essentially a generative learning algorithm that models $p(x|y)$ with a multivariate normal distribution
  - GDA makes stronger modeling assumptions, and is more data efficient (i.e. requires less training data to learn "well") when the modeling assumptions are correct or at least approximately correct. Logistic regression makes weaker assumptions and is significantly more robust to deviations from the modeling assumptions
  - Stronger assumptions mean weaker generality: GDA is more data-efficient but suits fewer kinds of data, and in practice logistic regression is more common
- **Naive Bayes (NB) assumption**: when computing $P(x|y)$, if $x$ is very high-dimensional, e.g. a 5000-dimensional binary vector, modeling $p(x|y)$ directly requires $2^{5000}-1$ parameters (the probabilities must sum to 1, so the last one is determined by the others)
  - The assumption: the $x_i$'s are conditionally independent given $y$, so that $p(x_1, ..., x_d \mid y) = \prod_i p(x_i \mid y)$
  - Remember this is conditional on $y$: it does not say the $x_i$ are unconditionally independent, i.e. $p(x_1, ..., x_d) = \prod_i p(x_i)$ need not hold
- **Naive Bayes classifier**: predict $\arg\max_y\, p(y)\prod_i p(x_i \mid y)$, with the factors estimated from the training data
- **discretize**: convert continuous variables to discrete variables (usually done when applying naive Bayes to continuous features)
- **Laplace smoothing**: sometimes a feature value never appears in the training set, which drives the estimated probabilities to 0 and can produce results like $0/0$
  - To mitigate this, add 1 to every count in the numerator and $k$ to the denominator: $\phi_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)} = j\} + 1}{n + k}$
- **numerator**
- **denominator**: the $k$ added to the denominator in Laplace smoothing is the number of values the variable can take, which keeps the estimated probabilities summing to 1
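A minimal sketch of a Bernoulli naive Bayes classifier with Laplace smoothing (binary features, two classes; the tiny dataset is made up purely for illustration):

```python
import numpy as np

# Toy binary feature matrix (n examples x d features) and labels (assumed data).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

def fit_naive_bayes(X, y):
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        # Laplace smoothing: +1 in the numerator, +2 in the denominator
        # (2 = number of values each binary feature can take).
        params[c] = {
            "phi_x": (Xc.sum(axis=0) + 1) / (len(Xc) + 2),   # P(x_i = 1 | y = c)
            "prior": (len(Xc) + 1) / (len(X) + 2),           # smoothed P(y = c)
        }
    return params

def predict(params, x):
    scores = {}
    for c, p in params.items():
        likelihood = np.prod(np.where(x == 1, p["phi_x"], 1 - p["phi_x"]))
        scores[c] = likelihood * p["prior"]    # proportional to the posterior
    return max(scores, key=scores.get)

params = fit_naive_bayes(X, y)
print(predict(params, np.array([1, 0, 0])))    # predicted class
```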
- **cubic function**
- **RHS**: right-hand side
- **kernel trick**: by working with kernels instead of explicit feature maps, avoid carrying out large amounts of computation in the high-dimensional feature space (see pdf 5.3)
  - I did not fully understand the last part of the kernel section; re-read it
- **support vector machine (SVM)**
- **separating hyperplane**: the hyperplane $w^T x + b = 0$ that separates the two classes
- **margin**: the "distance" to the separating hyperplane; in general, the farther a point is from the decision boundary, the more confident we are in the prediction for that point
- **functional margin**: $\hat\gamma^{(i)} = y^{(i)}(w^T x^{(i)} + b)$
  - The value is positive when the prediction is correct (the signs agree) and negative when it is wrong (the signs disagree). We also want this margin to be large, which corresponds to a more confident prediction
- **geometric margin**: with $y = w^T x + b$, $w$ is always perpendicular to the separating hyperplane
  - $\gamma^{(i)} = y^{(i)}\left(\left(\frac{w}{\|w\|}\right)^T x^{(i)} + \frac{b}{\|w\|}\right)$, i.e. the functional margin divided by $\|w\|$; see page 63 of the pdf for the derivation
  - It is easy to think of the margin $\gamma$ as a confidence measure; we want to maximize the smallest margin over the training set
- **Lagrange duality**
- **activation function**: typically, in a multi-layer neural network, we do not apply ReLU (or another activation function) at the very end, near the output, especially when the output is not necessarily a positive number
  - NOTE: tanh and sigmoid are used less and less because their gradients vanish as the input goes to positive or negative infinity
- **ReLU**: an activation function, $\mathrm{ReLU}(t) \triangleq \max\{t, 0\}$
- **vectorization**: if every multiplication of parameters with $x$ inside a neural network were computed separately, we would need a large number of for loops, which is very inefficient; instead we express the computation in linear algebra (vectorize it) to speed it up with parallel computation
- **identity function**: e.g. $y = x$. We do not use the identity function as the activation function because we want to introduce non-linearity; applying a linear function to another linear function just yields another linear function
- **feature engineering**: the process of designing the feature maps $\phi(x)$
  - We can view deep learning as a way to automatically learn the right feature map (sometimes also referred to as "the representation")
  - With neural networks, people usually do not hand-design feature maps from domain knowledge as in traditional feature engineering; each neuron can learn its own features
- **multi-layer perceptron (MLP)**: a stack of fully-connected layers, each followed by a non-linear activation
- **residual connection**: add the input of a block back to its output, e.g. $h = x + f(x)$, so the block only has to learn a residual
- **layer normalization (LN)**: normalize the activations of a layer to zero mean and unit standard deviation, then apply a learnable scale and shift
  - The standard deviation here divides by $m$ rather than $m-1$, so the normalized output has exactly unit empirical variance
  - Oftentimes zero mean and standard deviation 1 is not the most desired normalization scheme, which is why the learnable scale and shift are added
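A minimal sketch of the layer-normalization computation described above (the learnable scale `gamma` and shift `beta` are set to 1 and 0 here; `eps` is a small constant assumed for numerical stability):

```python
import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    # Mean and standard deviation over the feature dimension (divide by m, not m-1).
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)      # np.std divides by m by default
    z_hat = (z - mu) / (sigma + eps)           # zero mean, unit std
    return gamma * z_hat + beta                # learnable scale and shift

z = np.array([[1.0, 2.0, 6.0, -1.0]])
out = layer_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=-1), out.std(axis=-1))     # ~0 and ~1
```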
- **scaling-invariant property**: when the activations are normalized, scaling the weights of a layer does not change the function the network computes
  - When we talk about the weights' scaling-invariant property, we usually mean all weights except the last layer, because we usually do not apply normalization (or an activation function) to the last layer
  - Batch normalization and group normalization are more often used in computer vision applications, whereas layer norm is used more often in language applications
- **convolutional layers**: in one dimension, $\mathrm{Conv1D}(z)$ has as parameters a filter vector (kernel) $w \in \mathbb{R}^k$, where $k$ is the filter size; we assume $k = 2\ell + 1$ is odd
  - Each output entry is a sliding dot product of the filter with a window of the input; positions that fall outside the original $z$ are zero-padded
  - Multiple-channel version: each output channel has one filter per input channel, and the per-channel convolutions are summed
  - One-dimensional convolutions are mostly used in NLP, two-dimensional convolutions mostly in vision
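A minimal single-channel Conv1D sketch matching the description above (zero padding so the output has the same length; the example filter is arbitrary):

```python
import numpy as np

def conv1d(z, w):
    """1-D convolution with zero padding; filter size k = 2*ell + 1 is odd."""
    k = len(w)
    ell = k // 2
    z_pad = np.pad(z, (ell, ell))                 # zero-pad both ends
    return np.array([z_pad[i:i + k] @ w for i in range(len(z))])

z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])                   # a simple smoothing filter
print(conv1d(z, w))                               # same length as z
```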
- **backpropagation**
- **partial derivatives**
- **chain rule**: for a composition $J = f(g(x))$, the scalar chain rule is $\frac{dJ}{dx} = \frac{dJ}{dg}\frac{dg}{dx}$; for vectors, if $u = g(x)$ with $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$ and $J = f(u)$, then $\frac{\partial J}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial J}{\partial u_j}\frac{\partial u_j}{\partial x_i}$
  - Backpropagation contains two stages: the forward pass and the backward pass. During the forward pass, it saves all the intermediate variables, which are then reused when computing gradients in the backward pass
- **generalization**
- **overfit**: the model performs well on the training set but not as well on the test set
- **underfit**: the model does not perform well even on the training set
- **bias-variance tradeoff**
- **bias**: the bias captures the part of the error that is introduced due to the lack of expressivity of the model
  - Often caused by a model that is not expressive enough, e.g. fitting data generated by a quadratic function with a linear function; in short, underfitting
- **variance**: the variance term captures how the random nature of the finite dataset introduces errors in the learned model (it measures the sensitivity of the learned model to the randomness in the dataset, and it often decreases as the size of the dataset increases)
  - Often caused by a model that is too powerful (made worse when the dataset is small), so that it captures irrelevant information such as noise in the data-generating process; in short, overfitting
  - Often, there is a tradeoff between bias and variance. If our model is too "simple" and has very few parameters, then it may have large bias (but small variance) and typically suffers from underfitting. If it is too "complex" and has very many parameters, then it may suffer from large variance (but have smaller bias), and thus overfitting
  - See page 121 of the pdf for the mathematical derivation
- **double descent phenomenon**: the error of a model decomposes into bias and variance; typically, as the number of parameters grows, the test error first decreases and then increases (a U shape). Sometimes, however, it descends a second time
  - The same can happen as the amount of training data grows: the first case is model-wise double descent, the second sample-wise double descent
  - To some extent, sample-wise double descent and model-wise double descent are essentially describing similar phenomena: the test error is peaked when n ≈ d
- **regularization**: $J_\lambda(\theta) = J(\theta) + \lambda R(\theta)$, where $\lambda$ is the regularization parameter
  - The regularizer $R(\theta)$ (e.g. an $\ell_2$ norm) is typically chosen to be some measure of the complexity of the model $\theta$
- **weight decay**: regularization with the (squared) Euclidean norm, $R(\theta) = \frac{1}{2}\|\theta\|_2^2$
  - $\|\theta\|_1 = \sum_i |\theta_i|$ and $\frac{1}{2}\|\theta\|_2^2 = \frac{1}{2}\sum_i \theta_i^2$
- **implicit regularization effect**: optimizers can implicitly impose structures on parameters beyond what has been imposed by the regularized loss
  - The regularized training loss may have more than one global optimum; even if two solutions do equally well on the training set, one optimizer may, for example, drive the model toward a sparser (or otherwise better-structured) solution that does better on the test set
  - In summary, the take-home message is that the choice of optimizer does not only affect minimizing the training loss; it also imposes implicit regularization and affects the generalization of the model. Even if your current optimizer already converges to a small training error perfectly, you may still need to tune your optimizer for better generalization
- **Lipschitz**: a "more Lipschitz" model is a smoother model, one whose response to changes in the input is more gradual and stable
- **cross validation**: split the data into a training set and a held-out validation set, train on the training set, and pick the model with the smallest error on the held-out set
- **k-fold cross validation**: plain hold-out cross validation wastes part of the data, since the held-out portion cannot be used for training
  - k-fold: split the data into $k$ folds; for each fold, train on the other $k-1$ folds and validate on it; average the $k$ validation errors (see the sketch below)
- **leave-one-out cross validation**: when data is really scarce, set $k$ equal to the number of examples, holding out a single example each time
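A minimal k-fold cross-validation sketch (the model here is plain least squares via `np.linalg.lstsq`, and k = 5 is an arbitrary choice, both assumed only for illustration):

```python
import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        val_err = np.mean((X[val_idx] @ theta - y[val_idx]) ** 2)
        errors.append(val_err)
    return np.mean(errors)    # average validation error over the k folds

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)
print(k_fold_cv(X, y))
```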
- **maximum likelihood estimation (MLE)**: $\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$
  - This is used to estimate the parameter $\theta$
- **clustering**: there are only $x$'s and no labels $y$; this is an unsupervised learning problem
  - In the clustering problem, we are given a training set $\{x^{(1)}, ..., x^{(n)}\}$ and want to group the data into a few cohesive "clusters"
- **k-means**: first pick $k$ cluster centroids at random; assign each point to the nearest centroid; recompute each cluster's centroid as the mean of its points; repeat until convergence
- **distortion function**: $J(c, \mu) = \sum_{i=1}^{n} \|x^{(i)} - \mu_{c^{(i)}}\|^2$
  - The value of $J$ decreases monotonically over the k-means iterations, so k-means is guaranteed to converge
  - However, k-means may only reach a local optimum; running it several times with different random initializations helps avoid this
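A minimal k-means sketch following the loop described above (random two-blob data and k = 2 are assumed for illustration; ties and empty clusters are ignored for brevity):

```python
import numpy as np

def k_means(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(3, 0.5, size=(50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)    # close to (0, 0) and (3, 3)
```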
- **mixture of Gaussians**: background: we can only observe $x$, but each $x$ is generated from a Gaussian whose parameters are selected by a latent variable $z$
  - Each $x$ corresponds to a hidden, unobservable $z$ that picks the Gaussian it is drawn from, hence a "mixture" of Gaussians
- **latent random variable**: the latent variable $z$ determines which Gaussian generates $x$; if we knew $z$, the maximum-likelihood problem above could be solved in closed form and would look very much like Gaussian discriminant analysis, with $z$ playing the role of the label $y$ in GDA
  - But, unlike GDA, here we do not know $z$!
- **evidence lower bound (ELBO)**
- **EM algorithm (expectation-maximization)**: two alternating steps, E and M. In the E-step we "guess" the values of $z$; in the M-step we pretend the guesses were correct and update the parameters by maximum likelihood
  - (The E-step is not a blind guess: it uses Bayes' rule to compute the posterior probability of $z$ given $x$ under the current parameters, and then treats that posterior as if it were correct)
  - EM is quite similar to k-means. The difference is that k-means uses a hard guess (each point is assigned to exactly one centroid, starting from a random initialization, which can be seen as the first E-step), whereas the EM E-step uses a soft guess (posterior probabilities)
  - Like k-means, EM can also get stuck in a local optimum
  - Using the posterior of $z$ given $x$ as the distribution over $z$ (the soft guess) is what makes the lower bound tight; see pdf 11.3 for the derivation
  - This is really lower-bound optimization: EM maximizes the ELBO (see the sketch below)
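A minimal EM sketch for a two-component, one-dimensional Gaussian mixture (synthetic data; the initial parameters and iteration count are arbitrary assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])

# Initial guesses (assumed).
phi = np.array([0.5, 0.5])      # mixing weights p(z = j)
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: soft guess, w_ij = p(z = j | x_i) via Bayes' rule.
    dens = np.stack([phi[j] * gaussian_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
    w = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters as if the soft assignments were correct.
    Nj = w.sum(axis=0)
    phi = Nj / len(x)
    mu = (w * x[:, None]).sum(axis=0) / Nj
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / Nj

print(mu, var, phi)   # means near -2 and 3
```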
- **Jensen's inequality**: if $f$ is a convex function, then $E[f(X)] \ge f(E[X])$; if $f$ is strictly convex, equality holds only when $X = E[X]$ with probability 1
  - If the Hessian of $f$ is positive semidefinite, $f$ is convex; if it is positive definite, $f$ is strictly convex
  - The same holds for concave functions with the direction of the inequality reversed
- **concave / convex function**: note that "convex" here is the opposite of the everyday Chinese reading of 凹/凸, which is easy to mix up
- **variational inference**: in the EM algorithm we use the posterior $p(z|x, \theta)$ as the distribution over $z$, but in practice it can be very hard to compute. Variational inference therefore searches for an approximation of the true posterior distribution
  - Within a family of distributions $\mathcal{Q}$, find the member that maximizes the lower bound (ELBO)
- **mean field assumption**: for a high-dimensional discrete random variable $z$, use the mean-field assumption: restrict $\mathcal{Q}$ to distributions that factorize as $q(z) = \prod_j q_j(z_j)$
  - The mean-field assumption as stated is for discrete $z$; for a continuous random variable, $\mathcal{Q}$ is typically taken to be a parametric family, e.g. Gaussians whose mean and variance are given by two functions $q$ and $v$ (for example, neural networks)
- **re-parameterization**: rewrite sampling from the approximate posterior as a deterministic function of the parameters plus independent noise, so that gradients can flow through the sampling step
  - See pdf 13.5 for variational inference and variational auto-encoders; I have not fully understood that part yet. Roughly, an approximation from the family $\mathcal{Q}$ is used because the true posterior is too hard to compute, but I did not follow the exact form of the approximation
- **principal components analysis (PCA)**: data compression; the goal is to find the major axis of variation $u_1$
  - Before PCA, first normalize the data: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j}$ with $\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_j^{(i)}$ and $\sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n}(x_j^{(i)} - \mu_j)^2$. This ensures that different attributes are all treated on the same "scale" and are more comparable
  - $u_1$ is the principal eigenvector of the empirical covariance matrix $\Sigma = \frac{1}{n}\sum_{i=1}^{n} x^{(i)} x^{(i)T}$; to project onto a $k$-dimensional subspace, take the top-$k$ eigenvectors $u_1, ..., u_k$ and represent each point by $y^{(i)} = (u_1^T x^{(i)}, ..., u_k^T x^{(i)})$
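A minimal PCA sketch following the recipe above: normalize, form the empirical covariance matrix, take its top eigenvectors, and project (the data and k = 1 are assumed purely for illustration):

```python
import numpy as np

def pca(X, k):
    # Normalize each attribute to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Empirical covariance matrix (1/n) sum_i x_i x_i^T.
    sigma = X.T @ X / len(X)
    # Eigenvectors of a symmetric matrix, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return X @ U             # y_i = (u_1^T x_i, ..., u_k^T x_i)

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=100)])   # nearly 1-D data
Y = pca(X, k=1)
print(Y.shape)               # (100, 1)
```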
- **independent components analysis (ICA)**: background (the cocktail party problem): $x = As$, where $s$ are the independent source signals (the speakers), $A$ is the mixing matrix, and $x$ is what actually gets recorded; the goal is to recover $s$, i.e. to find $W = A^{-1}$ so that $s = Wx$
- **permutation matrix**: a matrix with exactly one 1 in every row and every column (and zeros elsewhere)
- **ICA ambiguities**: see pdf 13.1. Roughly, the scale (and sign) of $W$ usually does not matter: the recovered $s$ differs by a scaling or sign flip, but for audio data this just changes the volume, and in many other applications it is equally unimportant. The ordering of the sources (a permutation) is likewise unidentifiable
  - The rotational symmetry of the Gaussian distribution makes it impossible to recover the original independent sources with ICA; but as long as the sources are non-Gaussian, this ambiguity disappears and the independent sources can be determined (up to the ambiguities above)
  - ICA: write the density of $x$ in terms of $W$ and maximize the log-likelihood of the data. The CDF of the sources is taken here to be the sigmoid; you can choose another CDF yourself, as long as it is not the Gaussian CDF
- **reinforcement learning**: in the reinforcement learning framework, we instead provide our algorithm only a reward function, which indicates to the learning agent when it is doing well and when it is doing poorly
- **Markov decision process (MDP)**: a tuple of states $S$, actions $A$, state-transition probabilities $P_{sa}$, a discount factor $\gamma$, and a reward function $R: S \times A \to \mathbb{R}$
  - The ultimate objective is to maximize the expected sum of discounted rewards, $E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\right]$
- **policy**: $\pi: S \to A$, $a = \pi(s)$; the value function of a policy is $V^{\pi}(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi\right]$
- **Bellman equations**: if the policy $\pi$ is fixed, $V^{\pi}(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s')\,V^{\pi}(s')$
  - This is a recursive equation: the first term is the immediate reward obtained at the start, and the remaining term is the recursion (it is in fact an expectation over the next state, weighted by the transition probabilities $P_{s\pi(s)}(s')$)
  - The Bellman equations are very useful for solving finite-state MDPs under a fixed policy, since they give $|S|$ linear equations in the $|S|$ unknowns $V^{\pi}(s)$
- **value iteration**: for finite states and actions, repeat $V(s) := R(s) + \max_a \gamma \sum_{s'} P_{sa}(s')\,V(s')$ for every state $s$
  - Synchronous update: update all states at once after a full sweep; asynchronous update: update each $V(s)$ immediately, one state at a time
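A minimal value-iteration sketch for a tiny finite MDP (the transition probabilities, rewards, and discount factor are made-up toy values):

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# P[s, a, s'] = P_sa(s'): toy transition probabilities (each row sums to 1).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([0.0, 0.0, 1.0])   # reward depends only on the state here

V = np.zeros(n_states)
for _ in range(100):
    # Synchronous update: V(s) := R(s) + max_a gamma * sum_s' P_sa(s') V(s').
    V = R + gamma * np.max(np.einsum("sap,p->sa", P, V), axis=1)

policy = np.argmax(np.einsum("sap,p->sa", P, V), axis=1)
print(V, policy)
```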
- **policy iteration**: alternate two steps until the policy stops changing: policy evaluation (solve the Bellman equations for $V^{\pi}$ under the current policy) and policy improvement ($\pi(s) := \arg\max_a \sum_{s'} P_{sa}(s')\,V(s')$)
  - Learning the MDP: when the transition probabilities and rewards are unknown, estimate them from experience, e.g. $P_{sa}(s') = \frac{\#\{\text{times action } a \text{ was taken in state } s \text{ and led to } s'\}}{\#\{\text{times action } a \text{ was taken in state } s\}}$, and similarly estimate the expected rewards; then run value or policy iteration with these estimates
- **continuous state MDPs**: above, the states were all discrete; now consider continuous states, i.e. an infinite number of states
- **discretization**: represent a continuous value with a set of discrete values. Two drawbacks:
  1. the resulting piecewise-constant value function is not smooth;
  2. the curse of dimensionality: for $S \subseteq \mathbb{R}^d$, if we discretize each of the $d$ dimensions into $k$ values, there are $k^d$ states; if $d$ is 1000 and $k$ is 10, that number is astronomically large (so discretization is usually only used for low-dimensional problems, say 2 to 6 dimensions)
- **value function approximation**: approximate the value function directly instead of discretizing
  - We assume we have a model, or simulator, of the MDP: something that takes a state and an action and outputs a next state sampled from $P_{sa}$
  - The simulator can take many forms: a physics simulator, a distribution estimated from running several trials, or simply a model you define yourself (a neural network, etc.)
- **fitted value iteration**: assume the state is continuous but there are only a small, discrete number of actions
  - For continuous states, sample a set of states from the state space; for each sampled state and each action, also sample $k$ next states from the simulator (the state space is continuous, so the expectation over next states has to be approximated with these $k$ samples), and fit a function approximator to the resulting target values
- **finite horizon MDP**: drops the discount factor and introduces a time horizon $T$, maximizing $E[R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T)]$; this formulation is general for both discrete and continuous states and actions

 
