Installing Torque on Ubuntu

This procedure follows (and is largely translated from) this post: https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/ and this post: https://linuxcluster.wordpress.com/2012/04/01/enabling-torque-for-email-notification/ .

The procedure sets up the current machine as server, compute node, scheduler, and submission host all at once.

Step 1: Install Torque from the Ubuntu repositories

apt-get install torque-server torque-client torque-mom torque-pam

  This installs an older release, Torque 2.4.16. Answer Yes to every prompt.

Step 2: Stop the default services that were started automatically

/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create

  and then:

killall pbs_server

  This step is important; otherwise the changes made below will be overwritten the next time pbs_server restarts.

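Step 3: Panther currently has only an IP address and no FQDN, so I chose panther.ncsu as its domain name.

(Note: according to the referenced blog, the name chosen here must be a two-word domain name of the form server.domain; otherwise problems may appear later.)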

echo panther.ncsu > /etc/torque/server_name
echo panther.ncsu > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo [email protected] > /var/spool/torque/server_priv/acl_svr/operators
echo [email protected] > /var/spool/torque/server_priv/acl_svr/managers

  Also add this line to /etc/hosts:

10.123.32.** panther.ncsu
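
The two conventions in this step (a two-part server.domain name and a matching /etc/hosts entry) can be checked with a small shell sketch. The function names is_two_part_name and add_host_entry are hypothetical helpers, not part of Torque, and the IP address in the commented usage line is only a placeholder for the machine's real address.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of Torque.

# Succeeds only for names of the form server.domain
# (exactly two non-empty, dot-separated labels).
is_two_part_name() {
    case "$1" in
        *.*.*|.*|*.) return 1 ;;   # too many labels, or an empty label
        *.*)         return 0 ;;   # exactly one dot between two labels
        *)           return 1 ;;   # no dot at all
    esac
}

# Appends "IP NAME" to a hosts file unless NAME is already listed there.
# $1 = hosts file, $2 = IP address, $3 = host name
add_host_entry() {
    if awk -v n="$3" '{ for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                      END { exit !found }' "$1" 2>/dev/null; then
        return 0                   # already present, nothing to do
    fi
    printf '%s %s\n' "$2" "$3" >> "$1"
}

# Usage (as root; replace IP_ADDRESS with the real address):
# is_two_part_name panther.ncsu && add_host_entry /etc/hosts IP_ADDRESS panther.ncsu
```

Because add_host_entry checks before appending, running the step twice does not leave duplicate lines in the hosts file.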

  

Step 4: Register the machine itself as a compute node

echo "panther.ncsu np=4" > /var/spool/torque/server_priv/nodes

  Adjust np (the number of processors) to match the actual machine.

Tell the MOM (pbs_mom) on the compute node which host to use:

echo panther.ncsu > /var/spool/torque/mom_priv/config
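
Rather than hard-coding np, the value can be taken from the machine's core count. The sketch below assumes that one slot per core is wanted; nodes_line is a hypothetical helper name, and nproc (coreutils) or getconf is assumed to be available, with a fallback of 1.

```shell
#!/bin/sh
# Hypothetical helper: prints a nodes-file line of the form "NAME np=N".
# $1 = node name, $2 = optional processor count
# Without $2, fall back to nproc, then getconf, then 1.
nodes_line() {
    name=$1
    np=${2:-$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
    printf '%s np=%s\n' "$name" "$np"
}

# Usage (as root, to actually write the file):
# nodes_line panther.ncsu > /var/spool/torque/server_priv/nodes
```

Passing the count explicitly, as in nodes_line panther.ncsu 4, reproduces the line used above.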

Step 5: Restart the Torque services

/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start

Step 6: Set the PBS parameters

qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 1000' # keep completed jobs visible for 1000 seconds
qmgr -c 'set server mom_job_sync = true'
qmgr -c 'create queue std' # create the std queue
qmgr -c 'set queue std queue_type = execution'
qmgr -c 'set queue std started = true'
qmgr -c 'set queue std enabled = true'
qmgr -c 'set queue std resources_default.walltime = 10:00:00'
qmgr -c 'set queue std resources_default.nodes = 1'
qmgr -c 'set server default_queue = std'
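
Since the queue name appears in almost every line, one way to keep it consistent is to generate the directives from a single variable. This is only a sketch: queue_directives is a hypothetical helper, and it relies on qmgr reading directives from standard input when no -c option is given.

```shell
#!/bin/sh
# Hypothetical helper: prints the qmgr directives for one execution queue.
# $1 = queue name, $2 = optional default walltime (HH:MM:SS)
queue_directives() {
    q=$1
    walltime=${2:-10:00:00}
    cat <<EOF
create queue $q
set queue $q queue_type = execution
set queue $q started = true
set queue $q enabled = true
set queue $q resources_default.walltime = $walltime
set queue $q resources_default.nodes = 1
set server default_queue = $q
EOF
}

# Usage (as root, on a machine where qmgr is installed):
# queue_directives std | qmgr
# qmgr -c 'print server'    # show the resulting configuration
```

Printing the directives first also makes it easy to review them before they are applied.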

Then set up the submission pool:

qmgr -c 'set server submit_hosts = panther'
qmgr -c 'set server allow_node_submit = true'

The domain name chosen above was panther.ncsu; here only its short host name, panther, is used as the submission host.

Step 7: Submit a test job
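
A minimal test submission might look like the sketch below. The job name torque_test, the one-minute walltime, and the helper write_test_job are assumptions made for illustration; the queue name std is the one created in Step 6. The script is only submitted when qsub is found on the machine.

```shell
#!/bin/sh
# Hypothetical helper: writes a minimal PBS job script to the given path.
# $1 = output path, $2 = queue name
write_test_job() {
    cat > "$1" <<EOF
#!/bin/sh
#PBS -N torque_test
#PBS -q $2
#PBS -l nodes=1,walltime=00:01:00
#PBS -j oe
hostname
sleep 30
EOF
}

job=$(mktemp)
write_test_job "$job" std

# Submit only where Torque is installed. qsub prints the job ID and
# qstat lists the queued or running job. Run this as a regular user,
# since Torque typically rejects jobs submitted by root.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job" && qstat
fi
```

The sleep keeps the job alive long enough to show up in qstat, and -j oe merges stderr into the job's output file so there is a single file to inspect afterwards.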

Result:

Appendix. Setting up email notification with ssmtp: https://help.ubuntu.com/community/EmailAlerts

Errors and solutions:

1. Errors:

Unable to copy file /var/spool/torque/spool/15.panther.ncsu.OU to zjyx@Panther:/home/zjyx/work/tests/pbs/fdm/oe.15.panther.ncsu
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/15.panther.ncsu.OU

Solutions: (see "Host key verification failed": http://torqueusers.supercluster.narkive.com/Ut2n70R1/host-key-verification-failed)

Delete ~/.ssh/known_hosts, then ssh once to every host name Torque may use for this machine so that the host keys are accepted again. In my case I ran ssh panther.ncsu, ssh localhost, ssh Panther, and ssh panther.
