Algorithm Review (Personal Notes, Rough Version)

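pca: z = w^T*x; X*X^T*w = r*w (X: centered data, w: eigenvector of X*X^T, r: its eigenvalue); kpca: phi(X)*phi(X)^T*w = r*w (the same eigenproblem after mapping the data through phi)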
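A minimal NumPy sketch of the PCA eigenproblem above. It assumes samples are stored as rows (so the covariance is built from X^T*X rather than X*X^T), and the function name `pca` and the toy data are illustrative only.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n_samples x n_features) onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance (features x features)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    W = eigvecs[:, order]                    # columns are the directions w
    Z = Xc @ W                               # z = W^T x for every sample
    return Z, W, eigvals[order]

# Toy data with very different variance along each axis (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
Z, W, lam = pca(X, 2)
```

Each column of `W` satisfies cov*w = r*w with the matching entry of `lam` as r.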

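svd: start from the eigendecomposition A*x = r*x -> A*W = W*S -> A = W*S*W^-1; for symmetric A, W is orthogonal, so A = W*S*W^T; generalizing to any m x n matrix => A = U*S*V^T (U, V orthogonal, S diagonal with the singular values)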
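The chain above can be checked numerically. This is a small sketch on random matrices (the matrices themselves are arbitrary): the symmetric case rebuilds A from its eigenvectors, and the general case rebuilds A from `np.linalg.svd`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric case: A = W S W^T with orthonormal W.
B = rng.normal(size=(4, 4))
A_sym = B + B.T
S, W = np.linalg.eigh(A_sym)
A_sym_rebuilt = W @ np.diag(S) @ W.T

# General rectangular case: A = U S V^T.
A = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# The squared singular values are the eigenvalues of A^T A (sorted descending here).
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]
```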

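Text topic model, latent semantic indexing (LSI): A = U*S*V^T (A: term-document matrix; keep only the k largest singular values as topics)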
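A rough sketch of LSI as a truncated SVD. The term-document counts below are made up for illustration (two pet documents, two finance documents), and the helper `cosine` is a hypothetical name.

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents).
terms = ["cat", "dog", "pet", "stock", "market"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2                                            # number of topics to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]      # truncate to the k largest singular values

doc_topics = (np.diag(s_k) @ Vt_k).T             # each row: one document in topic space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cosine(doc_topics[0], doc_topics[1])  # two pet documents
sim_diff = cosine(doc_topics[0], doc_topics[2])  # pet document vs finance document
```

In topic space the two pet documents point in the same direction, while a pet document and a finance document are orthogonal.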

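Text topic model, non-negative matrix factorization (NMF): A ≈ W*H with W, H >= 0; (W, H) = argmin(W,H) 1/2*||A-W*H||^2 + alpha*rho*||W||1 + alpha*rho*||H||1 + alpha*(1-rho)/2*||W||^2 + alpha*(1-rho)/2*||H||^2 (||.||: Frobenius norm, ||.||1: elementwise L1 norm, alpha: regularization strength, rho: L1 ratio)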
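A minimal sketch of NMF for the unregularized case only (alpha = 0). It uses the classic multiplicative update rules, which is one simple way to minimize 1/2*||A-W*H||^2 and is not necessarily what any particular library does. The names `nmf` and `loss` and the toy matrix are assumptions.

```python
import numpy as np

def loss(A, W, H):
    """Unregularized objective 1/2 * ||A - W H||_F^2."""
    return 0.5 * np.linalg.norm(A - W @ H) ** 2

def nmf(A, k, n_iter=500, eps=1e-10, seed=0):
    """Factor a non-negative A (m x n) into W (m x k) and H (k x n), both non-negative."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)
W, H = nmf(A, k=2)
final_loss = loss(A, W, H)
```

The L1 and L2 penalty terms from the objective are left out here; adding them changes the update rules.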

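Word segmentation principle: r = argmax(i) P(Ai1, Ai2, ..., Aini) (pick the candidate segmentation i with the highest joint probability); Markov assumption P(Aij|Ai1, Ai2, ..., Ai(j-1)) = P(Aij|Ai(j-1)) -> 2-gram, P(Ai1 Ai2 ... Aini) = P(Ai1)*P(Ai2|Ai1)*P(Ai3|Ai2)*...*P(Aini|Ai(ni-1)); solved with the Viterbi algorithm; P(w2|w1) = freq(w1,w2)/freq(w1)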
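A small sketch of 2-gram segmentation with Viterbi-style dynamic programming. The unigram and bigram counts are invented toy numbers, `<s>` is an assumed sentence-start token, and unseen word pairs are simply treated as impossible (no smoothing).

```python
import math

# Hypothetical toy counts; a real system would estimate these from a corpus.
unigram = {"<s>": 10, "研究": 4, "研究生": 3, "生命": 5, "命": 1, "起源": 4}
bigram = {("<s>", "研究"): 3, ("<s>", "研究生"): 2, ("研究", "生命"): 3,
          ("研究生", "命"): 1, ("生命", "起源"): 4, ("命", "起源"): 1}

def bigram_logp(w1, w2):
    """log P(w2|w1) = log(freq(w1,w2) / freq(w1)); -inf if the pair was never seen."""
    c = bigram.get((w1, w2), 0)
    return math.log(c / unigram[w1]) if c > 0 else -math.inf

def segment(sentence, max_len=3):
    n = len(sentence)
    # best[i][w] = (best log-probability of covering sentence[:i] ending in word w, previous word)
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word not in unigram:
                continue
            for prev, (score, _) in best[start].items():
                cand = score + bigram_logp(prev, word)
                if cand > best[end].get(word, (-math.inf, None))[0]:
                    best[end][word] = (cand, prev)
    if not best[n]:
        return None                       # no segmentation with non-zero probability
    # Backtrack from the best final word to the start token.
    word = max(best[n], key=lambda w: best[n][w][0])
    words, pos = [], n
    while word != "<s>":
        words.append(word)
        prev = best[pos][word][1]
        pos -= len(word)
        word = prev
    return words[::-1]

result = segment("研究生命起源")
```

With these counts the path 研究 / 生命 / 起源 scores 0.3 * 0.75 * 0.8 = 0.18, which beats 研究生 / 命 / 起源 at roughly 0.067.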

TF-IDF: IDF(x) = log((N+1)/(N(x)+1)) + 1 (N: number of documents, N(x): number of documents containing x; the +1 terms are smoothing); tf-idf(x) = tf(x) * idf(x)
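A plain-Python sketch of the smoothed formula above. It assumes tf(x) is the raw count of x in the document, and the three tiny documents are invented.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus.
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]

N = len(docs)
# N(x): number of documents that contain term x.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    return math.log((N + 1) / (df[term] + 1)) + 1

def tfidf(doc):
    tf = Counter(doc)                      # raw term counts within this document
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[1])
```

A term that appears in every document ("the") gets idf = log(4/4) + 1 = 1, so it is never weighted down to zero.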

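Bag of Words: counts how many times each word occurs; Set of Words: records only whether each word occurs; Hash Trick: phi'(j) = sigma(h(i)=j) phi(i) (sum the features whose hash h(i) lands in bucket j); signed version: phi'(j) = sigma(h(i)=j) epsilon(i)*phi(i), epsilon(i) = +/-1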
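A small sketch of the hash trick on word counts. Using `zlib.crc32` as h and a second CRC over a prefixed token as the sign function epsilon are choices made here for determinism, not something the notes prescribe.

```python
import zlib

def hashed_features(tokens, n_buckets=8, signed=True):
    """Map a token list to a fixed-length vector: bucket j sums the tokens with h(token) = j."""
    vec = [0] * n_buckets
    for tok in tokens:
        data = tok.encode("utf-8")
        j = zlib.crc32(data) % n_buckets              # h(i): bucket index
        sign = 1
        if signed:
            # epsilon(i) = +/-1 from a second hash, so collisions tend to cancel out
            sign = 1 if zlib.crc32(b"sign:" + data) & 1 else -1
        vec[j] += sign
    return vec

tokens = ["the", "cat", "sat", "on", "the", "mat"]
counts = hashed_features(tokens, signed=False)   # bag-of-words counts folded into 8 buckets
signed_counts = hashed_features(tokens, signed=True)
```

The vector length is fixed by `n_buckets` no matter how large the vocabulary grows, at the cost of occasional collisions.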

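Chinese text mining: collect data, remove non-text content, handle Chinese encoding, segment words, apply stop words, extract features, build the analysis model.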

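English text mining: collect data, remove non-text content, check spelling, stem and lemmatize, convert to lowercase, apply stop words, extract features, build the analysis model.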
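A rough standard-library sketch of a few of the English steps above: removing non-text content, lowercasing, and stop-word filtering. Spell checking and stemming are left out because they need extra libraries, and the stop-word list is a tiny made-up sample.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)      # remove non-text content (HTML tags here)
    text = text.lower()                       # convert to lowercase
    tokens = re.findall(r"[a-z]+", text)      # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("<p>The cats ARE sitting in the garden!</p>")
```

The resulting token list is what a feature step such as bag of words or TF-IDF would consume.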

word2vec:

lda:

gd: h(X) = X*theta; J = 1/2*(X*theta - Y)^T*(X*theta - Y); theta = theta - alpha*partial(J)/partial(theta); partial(J)/partial(theta) = X^T*(X*theta - Y);
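A minimal NumPy sketch of the gradient descent update above on noise-free toy data. The step size, iteration count, and data are assumptions; because the gradient is not divided by the sample count, alpha has to be small relative to the largest eigenvalue of X^T*X.

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.01, n_iter=2000):
    """Minimize J = 1/2 (X theta - Y)^T (X theta - Y) by batch gradient descent."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - Y)      # partial(J)/partial(theta)
        theta = theta - alpha * grad
    return theta

# Toy data: y = 1 + 2x exactly, with a leading column of ones for the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
X = np.hstack([np.ones_like(x), x])
true_theta = np.array([[1.0], [2.0]])
Y = X @ true_theta
theta = gradient_descent(X, Y)
```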

ls: J = 1/2*(X*theta - Y)^T*(X*theta - Y); partial(J)/partial(theta) = X^T*(X*theta - Y) = 0 => theta = (X^T*X)^-1*X^T*Y
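A short sketch of the closed-form least squares solution. It solves the linear system (X^T*X)*theta = X^T*Y instead of forming the inverse, which is a common numerical choice; the noisy toy data is assumed.

```python
import numpy as np

def normal_equation(X, Y):
    """Solve (X^T X) theta = X^T Y without computing the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Toy data: y = 1 + 2x plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
X = np.hstack([np.ones_like(x), x])
Y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=(50, 1))

theta = normal_equation(X, Y)
gradient_at_theta = X.T @ (X @ theta - Y)   # should vanish at the minimizer
```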

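Linear regression: h_theta(X) = X*theta; J = 1/2*(X*theta - Y)^T*(X*theta - Y); gradient descent: theta = theta - alpha*X^T*(X*theta - Y); least squares: theta = (X^T*X)^-1*X^T*Y; polynomial regression: (x1, x2) -> (1, x1, x2, x1^2, x2^2, x1*x2); generalized linear regression: lnY = X*theta => g(Y) = X*theta, Y = g^-1(X*theta); regularization: J = J above + alpha*||theta||1 (lasso) or + 1/2*alpha*||theta||^2 (ridge); for ridge, theta = (X^T*X + alpha*E)^-1*X^T*Y (E: identity matrix); lasso has no closed form;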
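A sketch combining the degree-2 polynomial feature map with the ridge closed form from the line above. The helper names `poly2_features` and `ridge` and the toy coefficients are assumptions; as in the notes, the penalty covers every coefficient, including the intercept.

```python
import numpy as np

def poly2_features(x1, x2):
    """(x1, x2) -> (1, x1, x2, x1^2, x2^2, x1*x2), one row per sample."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def ridge(X, Y, alpha):
    """theta = (X^T X + alpha E)^-1 X^T Y, computed by solving the linear system."""
    E = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + alpha * E, X.T @ Y)

# Noise-free toy data generated from known coefficients.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=60)
x2 = rng.uniform(-1, 1, size=60)
X = poly2_features(x1, x2)
true_theta = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 3.0])
Y = X @ true_theta

theta_ols = ridge(X, Y, alpha=0.0)       # alpha = 0 reduces to ordinary least squares
theta_ridge = ridge(X, Y, alpha=10.0)    # a larger alpha shrinks the coefficients
```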

Naive Bayes:

knn:

k-means/ k-means++: