
Machine Learning Foundations (機器學習基石): Homework 1

I have jumped into the machine learning rabbit hole and am determined to see it through. Statistical Learning Methods (《統計學習方法》) introduces ten algorithms; it is not too hard, but it deserves a second reading to work carefully through the derivations. Machine Learning in Action (《機器學習實戰》) gives concrete, Python-implemented examples of the various algorithms, well suited for newcomers who want to see how they are applied. I have also enrolled in courses on Coursera: Stanford Andrew Ng's Machine Learning, and NTU Hsuan-Tien Lin's (林軒田) Machine Learning Foundations and Machine Learning Techniques. Andrew's course is the easy one: it skips much of the mathematical derivation and proof, but it is comprehensive and offers plenty of summary and comparison of machine-learning algorithms. Lin's courses contain many mathematical proofs, lean difficult, and call for careful study.
This repository mainly records my notes and programming assignments from Machine Learning Foundations.

I have just finished the first homework of Machine Learning Foundations, 20 multiple-choice questions in total: the first 14 test understanding of the course, while the last 6 require code. Those 6 programming questions are excerpted below, implemented in Python and saved in MLFex1.py.

Questions

Question 15
For Questions 15-20, you will play with PLA and the pocket algorithm. First, we use an artificial data set to study PLA. The data set is at

https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_15_train.dat

Each line of the data set contains one (x_n, y_n) with x_n ∈ R^4. The first 4 numbers of the line are the components of x_n in order; the last number is y_n.
Please initialize your algorithm with w = 0 and take sign(0) as −1.
Implement a version of PLA by visiting examples in the naive cycle using the order of examples in the data set. Run the algorithm on the data set. What is the number of updates before the algorithm halts?
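As a reminder of what is being counted (using the same notation as the update rule in Question 17, plus the bias convention x_0 = 1 from the code below): PLA keeps a weight vector w_t and, whenever the example it visits is misclassified, performs one correction:

h(x) = sign(w^T x)
w_{t+1} ← w_t + y_{n(t)} · x_{n(t)}   whenever sign(w_t^T x_{n(t)}) ≠ y_{n(t)}

Each such correction counts as one update toward the totals below.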
≥201 updates
51 - 200 updates
<10 updates
31 - 50 updates
11 - 30 updates

Question 16
Implement a version of PLA by visiting examples in fixed, pre-determined random cycles throughout the algorithm. Run the algorithm on the data set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average number of updates before the algorithm halts?
≥201 updates
11 - 30 updates
51 - 200 updates
31 - 50 updates
<10 updates

Question 17
Implement a version of PLA by visiting examples in fixed, pre-determined random cycles throughout the algorithm, while changing the update rule to be

w_{t+1} ← w_t + η · y_{n(t)} · x_{n(t)}

with η=0.5. Note that your PLA in the previous Question corresponds to η=1. Please repeat your experiment for 2000 times, each with a different random seed. What is the average number of updates before the algorithm halts?
51 - 200 updates
<10 updates
31 - 50 updates
11 - 30 updates
≥201 updates

Question 18
Next, we play with the pocket algorithm. Modify your PLA in Question 16 to visit examples purely randomly, and then add the 'pocket' steps to the algorithm. We will use

https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_18_train.dat

as the training data set D, and

https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_18_test.dat

as the test set for "verifying" the g returned by your algorithm (see lecture 4 about verifying). The sets are of the same format as the previous one.
Run the pocket algorithm with a total of 50 updates on D, and verify the performance of w_POCKET using the test set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average error rate on the test set?
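Sketched in the same notation, the pocket bookkeeping is: keep making ordinary PLA updates on w_t, but carry a separate pocket vector ŵ that is replaced only when the new weights have strictly lower training error,

ŵ ← w_{t+1}   if E_in(w_{t+1}) < E_in(ŵ)

and report ŵ at the end; Question 19 below instead reports the raw w_50.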
0.6 - 0.8
<0.2
0.4 - 0.6
≥0.8
0.2 - 0.4

Question 19
Modify your algorithm in Question 18 to return w_50 (the PLA vector after 50 updates) instead of ŵ (the pocket vector) after 50 updates. Run the modified algorithm on D, and verify the performance using the test set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average error rate on the test set?
<0.2
≥0.8
0.4 - 0.6
0.6 - 0.8
0.2 - 0.4

Question 20
Modify your algorithm in Question 18 to run for 100 updates instead of 50, and verify the performance of w_POCKET using the test set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average error rate on the test set?
<0.2
0.2 - 0.4
0.6 - 0.8
≥0.8
0.4 - 0.6

Code Notes

import random
import urllib.request

import numpy as np

# Download the data from the web and save it to the current working directory.
# For this exercise, up to three files may end up there:
# MLFex1_15_train.dat, MLFex1_18_train.dat, MLFex1_18_test.dat
def getRawDataSet(url):
    dataSet = urllib.request.urlopen(url)
    # e.g. '...hw1_15_train.dat' -> 'MLFex1_15_train.dat'
    filename = 'MLFex1_' + url.split('_')[1] + '_' + url.split('_')[2]
    with open(filename, 'wb') as fr:    # urlopen returns bytes, so write in binary mode
        fr.write(dataSet.read())
    return filename
# Read the training or test data from a local file and return it as X, Y.
def getDataSet(filename):
    with open(filename, 'r') as f:
        dataSet = f.readlines()     # one example per line
    num = len(dataSet)              # number of examples
    # Extract X and Y; X[i, 0] = 1 is the bias coordinate.
    X = np.zeros((num, 5))
    Y = np.zeros((num, 1))
    for i in range(num):
        data = dataSet[i].strip().split()
        X[i, 0] = 1.0
        X[i, 1] = float(data[0])
        X[i, 2] = float(data[1])
        X[i, 3] = float(data[2])
        X[i, 4] = float(data[3])
        Y[i, 0] = int(data[4])
    return X, Y
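# Hypothetical helper, not part of the original MLFex1.py: a quick sanity check
# of the loader, assuming MLFex1_15_train.dat was already fetched with
# getRawDataSet above.
def loaderSanityCheck():
    X, y = getDataSet('MLFex1_15_train.dat')
    print(X.shape, y.shape)     # (num_examples, 5) and (num_examples, 1)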
# Sign of the inner product <x, w>; per the problem statement, sign(0) is taken as -1.
def sign(x, w):
    if np.dot(x, w)[0] > 0:
        return 1
    else:
        return -1
# The plain PLA training algorithm.
# X, Y: matrices holding the training data, with shapes m*(n+1) and m*1
# w: initial weight vector, shape (n+1)*1
# eta: learning rate
# updates: cap on the number of update rounds
# Returns a flag telling whether training converged (i.e. the final w fits the
# whole training set), the number of updates actually performed, and w itself.
# Each round scans the examples in data-set order, corrects the first mistake
# found, and then restarts the scan from the beginning.
def trainPLA_Naive(X, Y, w, eta, updates):
    iterations = 0  # number of updates actually performed
    num = len(X)    # number of training examples
    flag = True
    for i in range(updates):
        flag = True
        for j in range(num):
            if sign(X[j], w) != Y[j, 0]:
                flag = False
                w += eta * Y[j, 0] * X[j].reshape(-1, 1)
                break
        if flag:
            iterations = i
            break
    return flag, iterations, w
# PLA that visits examples in a fixed, pre-determined random cycle: one random
# permutation is drawn up front and reused for every scan, as Question 16 asks.
# Parameters and return values are the same as in trainPLA_Naive.
def trainPLA_Fixed(X, Y, w, eta, updates):
    iterations = 0  # number of updates actually performed
    num = len(X)    # number of training examples
    flag = True
    rand_order = random.sample(range(num), num)  # the fixed random cycle
    for i in range(updates):
        flag = True
        for j in rand_order:
            if sign(X[j], w) != Y[j, 0]:
                flag = False
                w += eta * Y[j, 0] * X[j].reshape(-1, 1)
                break
        if flag:
            iterations = i
            break
    return flag, iterations, w
# PLA with the pocket strategy: keep making ordinary PLA updates on w, but carry
# the best weights seen so far (the "pocket") judged by training error, and
# return those. Parameters are the same as above.
def pocketPLA(X, Y, w, eta, updates):
    num = len(X)            # number of training examples
    w_pocket = w.copy()     # best weights seen so far
    best_err = errorTest(X, Y, w_pocket)
    for i in range(updates):
        rand_order = random.sample(range(num), num)  # visit examples randomly
        for j in rand_order:
            if sign(X[j], w) != Y[j, 0]:
                w = w + eta * Y[j, 0] * X[j].reshape(-1, 1)  # always update w
                err = errorTest(X, Y, w)
                if err < best_err:      # pocket step: pocket the better weights
                    best_err = err
                    w_pocket = w.copy()
                break
    return w_pocket
# Training routine for Question 19: same random visiting as pocketPLA, but the
# raw PLA weights are returned after the update budget, with no pocket step.
def trainPLA(X, Y, w, eta, updates):
    num = len(X)
    for i in range(updates):
        rand_order = random.sample(range(num), num)
        for j in rand_order:
            if sign(X[j], w) != Y[j, 0]:
                w += eta * Y[j, 0] * X[j].reshape(-1, 1)
                break
    return w
# Evaluation helper: return the error rate of w on the data set (X, y).
# (Visiting order does not affect the error rate, so no shuffling is needed.)
def errorTest(X, y, w):
    error = 0.0   # count of examples the hypothesis gets wrong
    for i in range(len(X)):
        if sign(X[i], w) != y[i, 0]:
            error += 1.0
    return error / len(X)
# Running this function answers Question 15 directly.
def question15():
    url = 'https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_15_train.dat'
    filename = getRawDataSet(url)
    X, y = getDataSet(filename)
    w0 = np.zeros((5, 1))
    eta = 1
    updates = 80    # cap on update rounds; ample for this separable set
    flag, iterations, w = trainPLA_Naive(X, y, w0, eta, updates)
    print(flag)
    print(iterations)
    print(w)
# Running this function answers Question 16 directly (Question 17 is the same
# experiment with eta = 0.5).
def question16():
    # url = 'https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_15_train.dat'
    filename = 'MLFex1_15_train.dat'         # getRawDataSet(url)
    X, y = getDataSet(filename)
    eta = 1          # Question 16 specifies eta = 1; use 0.5 for Question 17
    updates = 200
    times = []
    for i in range(2000):
        w0 = np.zeros((5, 1))
        flag, iterations, w = trainPLA_Fixed(X, y, w0, eta, updates)
        if flag:
            times.append(iterations)
    print(times)
    print(len(times))
    return sum(times) / len(times)
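# Drivers for Questions 18-20 follow the same pattern. A minimal sketch for
# Question 18; the hw1_18 URLs are inferred from the hw1_15 one plus the file
# names in the download comment above, so treat them as assumptions. Question 19
# swaps pocketPLA for trainPLA, and Question 20 raises updates from 50 to 100.
def question18():
    train = getRawDataSet('https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_18_train.dat')
    test = getRawDataSet('https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_18_test.dat')
    X, y = getDataSet(train)
    X_test, y_test = getDataSet(test)
    rates = []
    for i in range(2000):
        w0 = np.zeros((5, 1))
        w = pocketPLA(X, y, w0, 1, 50)   # trainPLA(X, y, w0, 1, 50) for Question 19
        rates.append(errorTest(X_test, y_test, w))
    return sum(rates) / len(rates)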

Author: [@Marcovaldo]
March 24, 2016