Server Anti-Crawler Guide: Blocking Specific User Agents with Apache/Nginx/PHP
1. Apache
1) Via the .htaccess file
Edit the .htaccess file in the site's root directory and add one of the following two snippets (either works):
Option 1:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
RewriteRule ^(.*)$ - [F]
Option 2:
SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
Order Allow,Deny
Allow from all
Deny from env=BADBOT
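Both variants enforce the same rule: a case-insensitive, unanchored match of the User-Agent header against the alternation above, with ^$ additionally catching empty UAs in the RewriteCond variant. A minimal Python sketch of that decision (the pattern is a shortened excerpt of the full list, and the name is_blocked is our own, purely illustrative):

```python
import re

# Case-insensitive, unanchored pattern mirroring the .htaccess rules above;
# the ^$ alternative matches requests that send no User-Agent at all.
BADBOT = re.compile(
    r"^$|FeedDemon|Indy Library|AhrefsBot|Python-urllib|HttpClient|MJ12bot",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    # re.search is an unanchored match, like Apache's RewriteCond pattern
    return BADBOT.search(user_agent) is not None
```

Because the match is unanchored and case-insensitive, "ahrefsbot/7.0" anywhere inside the UA string is enough to trigger the 403, which is exactly how the Apache rules behave.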
2) Via the httpd.conf configuration file
Find the section resembling the one below, add/modify it as shown, then restart Apache:
Shell
DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
    Order allow,deny
    Allow from all
    Deny from env=BADBOT
</Directory>
2. Nginx
Go to the conf directory under the Nginx installation directory and save the following as agent_deny.conf:
cd /usr/local/nginx/conf
vim agent_deny.conf
#Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
#Block the listed UAs as well as empty UAs
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}
#Block request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
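The three if blocks are independent per-request checks: a tool blocklist, a UA blocklist (including empty UAs), and a request-method whitelist. A sketch of the combined decision in Python, under the assumption that each rule returns 403 and a request that fires none of them passes through (the name decide and the shortened patterns are ours):

```python
import re

# Mirrors the three checks in agent_deny.conf (patterns abbreviated)
TOOL_UA = re.compile(r"Scrapy|Curl|HttpClient", re.IGNORECASE)
BAD_UA = re.compile(r"FeedDemon|AhrefsBot|Python-urllib|MJ12bot|EasouSpider|^$",
                    re.IGNORECASE)
ALLOWED_METHOD = re.compile(r"^(GET|HEAD|POST)$")

def decide(method: str, user_agent: str):
    """Return 403 if any rule fires, else None (the request passes)."""
    if TOOL_UA.search(user_agent):        # tool blocklist
        return 403
    if BAD_UA.search(user_agent):         # UA blocklist + empty UA
        return 403
    if not ALLOWED_METHOD.match(method):  # method whitelist
        return 403
    return None
```

Note that Nginx evaluates each if independently; order only matters in that the first matching rule produces the response.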
Then, in the site's configuration, insert the following line right after location / {:
Shell
include agent_deny.conf;
For example, a full configuration looks like this:
Shell
[marsge@Mars_Server ~]$ cat /usr/local/nginx/conf/zhangge.conf
location / {
    try_files $uri $uri/ /index.php?$args;
    #Add this one line here:
    include agent_deny.conf;
    rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last;
    rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last;
    rewrite ^/sitemap_m.xml$ /sitemap_m.php last;
}
Save, then run the following command to gracefully reload Nginx:
Shell
/usr/local/nginx/sbin/nginx -s reload
3. PHP
Paste the following code into the site's entry file index.php, right after the first <?php tag:
PHP
//Get the UA string
$ua = $_SERVER['HTTP_USER_AGENT'];
//Put the malicious USER_AGENT values into an array
$now_ua = array('FeedDemon','BOT/0.1 (BOT for JCE)','CrawlDaddy','Java','Feedly','UniversalFeedParser','ApacheBench','Swiftbot','ZmEu','Indy Library','oBot','jaunty','YandexBot','AhrefsBot','MJ12bot','WinHttp','EasouSpider','HttpClient','Microsoft URL Control','YYSpider','Python-urllib','lightDeckReports Bot');
//Block empty USER_AGENT; mainstream scrapers such as dedecms use an empty USER_AGENT, and so do some SQL injection tools
if(!$ua) {
    header("Content-type: text/html; charset=utf-8");
    die('Scraping this site is not allowed!');
} else {
    foreach($now_ua as $value)
        //Check whether the UA contains an entry from the array
        //(the original eregi() was removed in PHP 7; stripos() is a case-insensitive substring check)
        if(stripos($ua, $value) !== false) {
            header("Content-type: text/html; charset=utf-8");
            die('Scraping this site is not allowed!');
        }
}
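Unlike the Apache and Nginx rules, the PHP version uses plain case-insensitive substring matching rather than a regex, plus an explicit branch for an empty UA. A Python sketch of that logic (should_block and the shortened list are our own names, not part of the original code):

```python
# Abbreviated excerpt of the PHP $now_ua blocklist above
BAD_SUBSTRINGS = ["FeedDemon", "ApacheBench", "AhrefsBot", "Python-urllib", "HttpClient"]

def should_block(user_agent: str) -> bool:
    # Empty UA is blocked outright, like the if(!$ua) branch in the PHP code
    if not user_agent:
        return True
    ua = user_agent.lower()
    # stripos()-style case-insensitive substring test against each entry
    return any(s.lower() in ua for s in BAD_SUBSTRINGS)
```

One caveat of substring matching: a short entry such as 'Java' in the original array will also fire on any UA that merely contains that word, so the list has to be curated with care.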
4. Testing the result
If you are on a VPS this is very easy: use curl -A to simulate a crawler. For example:
Simulate the Yisou spider:
Shell
curl -I -A 'YisouSpider' bizhi.bcoderss.com
Simulate a request with an empty UA:
Shell
curl -I -A '' bizhi.bcoderss.com
Simulate Baiduspider:
Shell
curl -I -A 'Baiduspider' bizhi.bcoderss.com