Kaggle Competition: Artificial Neural Networks Applied to Taxi Destination Prediction, Code Overview

阿新 • Published: 2018-11-27

Code of the winning entry to the Kaggle ECML/PKDD taxi destination competition. Our approach is described in our paper.

Dependencies

We used the following packages developed at the MILA lab:

Theano. A general GPU-accelerated Python math library with an interface similar to numpy (see [3, 4]). See http://deeplearning.net/software/theano/
Blocks. A deep learning and neural network framework for Python, based on Theano. As Blocks evolves very rapidly, we suggest you use commit 1e0aca9171611be4df404129d91a991354e67730, which we had the code working on. See https://github.com/mila-udem/blocks
Fuel. A data pipelining framework for Blocks. As with Blocks, we suggest you use commit ed725a7ff9f3d080ef882d4ae7e4373c4984f35a. See https://github.com/mila-udem/fuel

We also used scikit-learn (sklearn) for the mean-shift clustering algorithm, and numpy, cPickle and h5py in various places.
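Before running anything, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the repository: the helper name missing_packages and the REQUIRED list are hypothetical, and the import names are the usual top-level module names for these projects. It is written for Python 3 and does not verify the pinned Blocks and Fuel commits.

```python
import importlib.util

# Top-level import names for the packages listed above. cPickle is omitted:
# it only exists on Python 2, while this sketch targets Python 3.
REQUIRED = ["theano", "blocks", "fuel", "sklearn", "numpy", "h5py"]


def missing_packages(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the modules can be found; whether their versions match the suggested commits still has to be checked by hand.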

Structure

Here is a brief description of the Python files in the archive:

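config/*.py : configuration files for the different models we experimented with. The one that gave the best result is dest_mlp_tgtcls_1_cswdtx_alexandre.py.
data/*.py : files related to the data pipeline:
- __init__.py : contains some general statistics about the data
- csv_to_hdf5.py : converts the CSV data files into an HDF5 file for use with Fuel
- hdf5.py : utility functions for working with the HDF5 file
- init_valid.py : initializes the HDF5 file for the validation set
- make_valid_cut.py : generates a validation set using a list of time cuts. Cut lists are stored in Python files in data/cuts/ (we used only one cut file)
- transformers.py : Fuel transformers that turn the training dataset into structures usable by the model
data_analysis/*.py : scripts for various analyses of the dataset
- cluster_arrival.py : the script that generates the mean-shift clustering of the destination points, producing 3392 cluster centers
model/*.py : source code for the various models we tried
- __init__.py : code common to all models, including the code for embedding the metadata
- mlp.py : code common to all MLP models
- dest_mlp_tgtcls.py : code for the MLP destination prediction model that uses target clusters for the output layer
error.py : functions for calculating the error based on the Haversine distance
ext_saveload.py : a Blocks extension for saving and reloading the model parameters, so that an interrupted training run can be resumed
ext_test.py : a Blocks extension that runs the model on the test set and produces an output CSV submission file
train.py : main code for training and testing; contains the main function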
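The sketch below illustrates two ideas from the file list: a Haversine distance, which error.py is said to compute, and an output layer built on cluster centers, as in dest_mlp_tgtcls.py. It is not the repository's code. The function names, the Earth radius of 6371 km, and the choice to form the prediction as a softmax-weighted average of the cluster centers are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_destination(scores, centers):
    """Assumed output layer: softmax-weighted average of cluster centers.

    scores  -- one raw score per cluster (hypothetical network output)
    centers -- list of (lat, lon) cluster centers, same length as scores
    """
    weights = softmax(scores)
    lat = sum(w * c[0] for w, c in zip(weights, centers))
    lon = sum(w * c[1] for w, c in zip(weights, centers))
    return lat, lon


if __name__ == "__main__":
    centers = [(41.15, -8.61), (41.16, -8.63)]  # made-up example centers
    lat, lon = predict_destination([0.0, 0.0], centers)
    print(haversine_km(lat, lon, *centers[0]))
```

With equal scores the prediction falls midway between the two centers; raising one score pulls the prediction toward that center, so the model can output any point inside the region spanned by the cluster centers.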

How to reproduce the winning results?

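There is a helper script, prepare.sh, which may help you by performing steps 1-6 and some other checks. However, if it encounters an error, the script restarts from the beginning. Steps 2, 4 and 5, which come before training, take a long time.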

Note that some scripts expect the repository to be in your PYTHONPATH (go to the root of the repository and type 'export PYTHONPATH="$PWD:$PYTHONPATH"').

1. Set the TAXI_PATH environment variable to the path of the folder containing the CSV files.
2. Run data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5" to generate the HDF5 file (stored in TAXI_PATH). This takes about 20 minutes.
3. Run data/init_valid.py valid.hdf5 to initialize the validation set HDF5 file.
4. Run data/make_valid_cut.py test_times_0 to generate the validation set. This takes a few minutes.
5. Run data_analysis/cluster_arrival.py to generate the destination cluster centers. This takes a few minutes.
6. Create a folder model_data and a folder output (next to the training script), which will receive respectively a regular save of the model parameters and many submission files generated from the model at a regular interval.
7. Run ./train.py dest_mlp_tgtcls_1_cswdtx_alexandre to train the model. Output solutions are generated in output/ every 1000 iterations. You can interrupt training at any time with three consecutive Ctrl+C. The training script is set to stop training after 10 000 000 iterations, but a result file produced after less than 2 000 000 iterations is already the winning solution. We trained our model on a GeForce GTX 680 card and it took about an afternoon to generate the winning solution.
When running the training script, set the following Theano flags environment variable to exploit GPU parallelism:
THEANO_FLAGS=floatX=float32,device=gpu,optimizer=fast_run
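The environment setup described above (the TAXI_PATH variable and the model_data and output folders) can be checked from Python before a long training run. This is a minimal sketch and not what prepare.sh does: the function name check_setup and its arguments are hypothetical, and it assumes the two folders belong directly in the directory that holds train.py.

```python
import os
from pathlib import Path


def check_setup(repo_root, env=None):
    """Validate TAXI_PATH and create model_data/ and output/ under repo_root.

    repo_root -- directory containing train.py (assumed layout)
    env       -- mapping of environment variables; defaults to os.environ
    Returns the list of folder paths that now exist.
    """
    env = os.environ if env is None else env
    taxi_path = env.get("TAXI_PATH")
    if not taxi_path:
        raise RuntimeError("TAXI_PATH is not set")
    if not Path(taxi_path).is_dir():
        raise RuntimeError("TAXI_PATH is not a directory: %s" % taxi_path)

    folders = []
    for name in ("model_data", "output"):
        folder = Path(repo_root) / name
        folder.mkdir(exist_ok=True)  # no error if the folder already exists
        folders.append(folder)
    return folders
```

Calling check_setup(".") from the root of the repository either raises a RuntimeError that explains what is missing or returns the two folder paths. Passing env explicitly makes the check easy to exercise without touching the real environment.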

More information in this pdf
