
A Scientific Guide to Practicing on Codeforces: One Chart and One Table Are Enough

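A brief introduction to practicing algorithm problems scientifically in order to improve your problem-solving ability, using a crawler to scrape the Codeforces problem set and analyze the relationship between problem difficulty and algorithm tags.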

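Whatever you do, trying many things, spotting patterns, and then practicing deliberately are all crucial. For fans of the Olympiad in Informatics, the key to spotting patterns is solving many problems. But the sea of problems is vast: on Codeforces alone there were 3206 problems as of January 3, 2017. In other words, someone diligent enough to solve three problems a day would still need almost three years (3206 / 3 is roughly 1069 days) to finish them all, and the problem set keeps growing. Solving problems blindly is therefore a waste of time. Since the first half of 2016 I had been sorting problems by number of solvers, from high to low, and grinding through the easy ones. The result, unsurprisingly, was no progress: what I knew I still knew, what I did not understand I still did not understand, and the only thing that cheered me up was the growing count of solved problems. The essence of practicing scientifically is to keep challenging new heights: once you have practiced at one level long enough to be fluent, move on to the next level of difficulty. To make this easier for everyone, I used a crawler to collect the basic information of all Codeforces problems as of January 3, 2017 and stored it in an Excel file. Going further, this post tries to analyze how often each algorithm tag appears at each difficulty level, and how the number of solves for each tag is distributed across difficulty levels. Finally, I briefly describe my own view of practice and how to crawl the information from Codeforces.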

Conclusions first

One chart

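[Figure: Codeforces_Algorithms_Tag_Frequency.jpg]
The chart above shows how often each algorithm tag (first column) appears at each problem difficulty (first row). From it you can roughly tell which algorithms you should master at which level. Note that I did not distinguish between Div1 and Div2; difficulty is inferred only from the problem letter (A, B, C, and so on). For the easy A problems, most test basic programming skill: implementation (roughly, do what the statement says), math (arithmetic, modulo, rounding, and so on), and brute force (exhaustive enumeration). As difficulty rises, E problems for example mainly test dp (dynamic programming) and data structures. The chart also shows that the hardest problems concentrate on math, geometry (computational geometry), shortest paths (graph theory), and games (game theory). As a free bonus, here is one more chart, showing how the number of solves for each algorithm tag is distributed across problem difficulties.
[Figure: Codeforces_Algorithms_Tag_Solved.jpg]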

One table

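[Figure: Codeforces-ProblemSet.jpg]
Here is the practice catalog: this table collects all the algorithm problems on Codeforces as of January 3, 2017. With it you can practice by number of solvers, by problem difficulty, or by topic. The download link for the file is at the end of the post.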
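As a rough illustration of these three ways of using the table, the sketch below filters and sorts a few rows shaped like the columns of the CSV export in this post (ID, Title, Tags, Solved). The sample rows and the helper names `by_solved`, `by_level`, and `by_tag` are made up for illustration; the numbers are not real Codeforces statistics.

```python
import re

# Hypothetical sample rows; titles and counts are invented, not real data.
rows = [
    {"ID": "1A", "Title": "Problem one", "Tags": "math", "Solved": 900},
    {"ID": "2B", "Title": "Problem two", "Tags": "dp, math", "Solved": 300},
    {"ID": "3E", "Title": "Problem three", "Tags": "dp, data structures", "Solved": 40},
    {"ID": "4A", "Title": "Problem four", "Tags": "brute force", "Solved": 1200},
]

def by_solved(rows):
    """Order problems from most solved to least solved."""
    return sorted(rows, key=lambda r: r["Solved"], reverse=True)

def by_level(rows, letter):
    """Keep problems whose ID carries the given problem letter (A, B, ...)."""
    return [r for r in rows if re.findall(r"[A-Z]", r["ID"])[0] == letter]

def by_tag(rows, tag):
    """Keep problems whose comma-separated tag list contains the tag."""
    return [r for r in rows if tag in [t.strip() for t in r["Tags"].split(",")]]

print([r["ID"] for r in by_solved(rows)])
print([r["ID"] for r in by_level(rows, "A")])
print([r["ID"] for r in by_tag(rows, "dp")])
```

The three helpers can be chained, for example `by_solved(by_tag(rows, "dp"))`, to practice one topic from the most solved problem downward.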

My view of practice

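  • To improve your thinking ability, sort problems by AC rate or by number of solvers and binary-search for the ones that match your current level, then take on somewhat harder problems (binary search takes O(log N) steps, which at least saves time compared with the O(N) of working through everything from easy to hard)
  • To consolidate a specific topic, practice by tag; but because the method is known before you start solving, this does less for your thinking ability
  • If you know nothing yet, practice at random: you broaden your horizons, and there is plenty of room to improve
  • To raise your AC rate or build confidence, do easy problems
  • Mix the strategies above, for example pick a topic and then use binary search to choose problems within it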
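The binary-search idea in the first bullet can be sketched as follows. This is only an illustration under two assumptions that the post does not state: the problems are sorted from easiest to hardest, and whether you can solve a problem is monotone in that order. The function name `find_frontier` and the callback `can_solve` are hypothetical.

```python
def find_frontier(problems, can_solve):
    """Return (index of the first unsolvable problem, number of attempts).

    `problems` is assumed sorted from easiest to hardest, and `can_solve`
    is assumed monotone: once it returns False it stays False.
    """
    lo, hi = 0, len(problems)
    attempts = 0
    while lo < hi:
        mid = (lo + hi) // 2
        attempts += 1
        if can_solve(problems[mid]):
            lo = mid + 1  # too easy: look among harder problems
        else:
            hi = mid      # too hard: look among easier problems
    return lo, attempts

# Stand-in data: difficulty ranks 1..3206, and a solver who can handle
# everything up to rank 1000 (both are invented for the example).
problems = list(range(1, 3207))
frontier, attempts = find_frontier(problems, lambda rank: rank <= 1000)
print(frontier, attempts)
```

With 3206 problems the loop needs at most 12 attempts, compared with trying the problems one by one from the easy end.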

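One more piece of advice: if a problem is too far beyond your current ability, read the editorial after a reasonable amount of effort. People do differ in talent, but experience can be accumulated over time. And even after you get a problem accepted, look hard for room to improve and study other people's code to see how to shorten the code and reduce its time and space complexity.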

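[Figure: tourist.jpg]
It is said that top competitors have solved upwards of ten thousand problems, which is why in official contests you can see many of them finish a problem in under a minute. In competitive programming, getting the problem right is the baseline (unless the problem is too hard); solving it faster is the core difference among the very best. If talent really exists, there is nothing we can do about it; but I hope we can one day say, like the old oil peddler in the classic story, "Nothing special, just a practiced hand."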

How to collect the data with a crawler

Required libraries

import re
import urllib.request
from bs4 import BeautifulSoup
import os
import csv
import time

Crawling all Codeforces problems

#%% retrieve the problem set
def spider(url):
    """Parse one problem set page and store its problems in `codeforces`."""
    response = urllib.request.urlopen(url)
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(response.read(), 'html.parser')
    for row in soup.find_all('tr'):
        item = row.find_all('td')
        try:
            # get the problem id
            pid = item[0].find('a').string.strip()
            col2 = item[1].find_all('a')
            # get the problem title
            title = col2[0].string.strip()
            # get the problem tags
            tags = [foo.string.strip() for foo in col2[1:]]
            # get the number of AC submissions
            solved = re.findall(r'x(\d+)', str(item[3].find('a')))[0]
            # update the problem info
            codeforces[pid] = {'title': title, 'tags': tags, 'solved': solved, 'accepted': 0}
        except (IndexError, AttributeError):
            # header rows and rows without the expected cells are skipped
            continue
    return soup

codeforces = {}
wait = 15  # wait time to avoid the blocking of spider
last_page = 33  # the total page number of problem set page
url = ['http://codeforces.com/problemset/page/%d' % page for page in range(1, last_page + 1)]
for foo in url:
    print('Processing URL %s' % foo)
    spider(foo)
    print('Wait %f seconds' % wait)
    time.sleep(wait)
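The least obvious step in the crawler is how the solved count is pulled out of the table cell with the pattern `x(\d+)`. The snippet below applies the same pattern to a hand-written string. The HTML fragment is only an assumption about what such a cell might have looked like, not a copy of the real page, and `solved_count` is a hypothetical helper name.

```python
import re

# Hypothetical anchor markup; the real page layout may differ.
cell = ('<a title="Participants solved the problem" '
        'href="/problemset/status/4/problem/A">&nbsp;x12345</a>')

def solved_count(cell_html):
    """Return the number after the 'x' marker, or None if it is absent."""
    match = re.search(r'x(\d+)', cell_html)
    return int(match.group(1)) if match else None

print(solved_count(cell))
print(solved_count('<a></a>'))
```

Returning `None` for a cell without a count makes the missing case explicit instead of relying on an exception being caught further up.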

Marking the problems already solved

#%% mark the accepted problems
def accepted(url):
    """Parse one submissions page and flag accepted problems in `codeforces`."""
    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response.read(), 'html.parser')
    table = soup.find_all('table', class_='status-frame-datatable')[0]
    for row in table.find_all('tr'):
        try:
            item = row.find_all('td')
            # check whether this problem is solved
            if 'Accepted' in str(row):
                pid = item[3].find('a').string.split('-')[0].strip()
                codeforces[pid]['accepted'] = 1
        except (IndexError, AttributeError, KeyError):
            # skip header rows and problems missing from the problem set table
            continue
    return soup

wait = 15  # wait time to avoid the blocking of spider
last_page = 10  # the total page number of user submission
handle = 'Greenwicher'  # please input your handle
url = ['http://codeforces.com/submissions/%s/page/%d' % (handle, page) for page in range(1, last_page + 1)]
for foo in url:
    print('Processing URL %s' % foo)
    accepted(foo)
    print('Wait %f seconds' % wait)
    time.sleep(wait)
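The marking step relies on the link text having the form "ID - title" and on the ID being present in the `codeforces` dictionary. A small offline sketch of just that step, with an explicit membership check instead of an exception handler, might look like this. The helper names `problem_id` and `mark_accepted` and the sample entry are hypothetical.

```python
def problem_id(link_text):
    """Take the part before the first hyphen, e.g. '4A' from '4A - Title'."""
    return link_text.split('-', 1)[0].strip()

def mark_accepted(problems, link_text):
    """Flag a problem as accepted if it is present in the dictionary."""
    pid = problem_id(link_text)
    if pid in problems:
        problems[pid]['accepted'] = 1
        return True
    return False

# Hypothetical entry shaped like the `codeforces` dictionary in this post.
problems = {'4A': {'title': 'Some-Title', 'tags': [], 'solved': '0', 'accepted': 0}}
first = mark_accepted(problems, '4A - Some-Title')
second = mark_accepted(problems, '9999Z - Not in the table')
print(first, second, problems['4A']['accepted'])
```

Splitting on the first hyphen only means a hyphen inside the title does not affect the extracted ID.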

Writing the crawled data to a CSV file

#%% output the problem set to csv files
root = os.getcwd()
# newline="" lets the csv module control line endings itself
with open(os.path.join(root, "CodeForces-ProblemSet.csv"), "w", encoding="utf-8", newline="") as f_out:
    f_csv = csv.writer(f_out)
    f_csv.writerow(['ID', 'Title', 'Tags', 'Solved', 'Accepted'])
    for pid in codeforces:
        title = codeforces[pid]['title']
        tags = ', '.join(codeforces[pid]['tags'])
        solved = codeforces[pid]['solved']
        # named is_accepted so the accepted() function is not overwritten
        is_accepted = codeforces[pid]['accepted']
        f_csv.writerow([pid, title, tags, solved, is_accepted])
# the with block closes the file, so no explicit close() is needed
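To check that the export can be read back, the sketch below writes one row in the same column layout to an in-memory buffer and parses it again with `csv.DictReader`. The buffer stands in for the file on disk, and the sample entry is invented for the example.

```python
import csv
import io

# Hypothetical entry shaped like the `codeforces` dictionary in this post.
sample = {'4A': {'title': 'Some title', 'tags': ['math', 'brute force'],
                 'solved': '123', 'accepted': 1}}

buffer = io.StringIO()  # stands in for the CSV file on disk
writer = csv.writer(buffer)
writer.writerow(['ID', 'Title', 'Tags', 'Solved', 'Accepted'])
for pid, info in sample.items():
    writer.writerow([pid, info['title'], ', '.join(info['tags']),
                     info['solved'], info['accepted']])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
tags = rows[0]['Tags'].split(', ')  # undo the ', '.join used when writing
print(rows[0]['ID'], tags, int(rows[0]['Solved']))
```

Because the tag list contains commas, the csv module quotes that field when writing, so it comes back as a single column. Every value is read back as a string and has to be converted where a number is needed.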

Analyzing the relationship between problem difficulty and algorithm tags

#%% analyze the problem set
import operator

# initialize the difficulty and tag lists
difficult_level = {}
tags_level = {}
for pid in codeforces:
    # the first capital letter of the ID (A, B, C, ...) stands for the difficulty
    difficult = re.findall('([A-Z])', pid)[0]
    tags = codeforces[pid]['tags']
    difficult_level[difficult] = difficult_level.get(difficult, 0) + 1
    for tag in tags:
        tags_level[tag] = tags_level.get(tag, 0) + 1
# tags sorted from most to least frequent
tag_level = sorted(tags_level.items(), key=operator.itemgetter(1))[::-1]
tag_list = [foo[0] for foo in tag_level]
# difficulty letters sorted alphabetically
difficult_level = sorted(difficult_level.items(), key=operator.itemgetter(0))
difficult_list = [foo[0] for foo in difficult_level]

# initialize the 2D relationship matrices
# matrix_solved: the number of AC submissions for each tag at each difficulty level
# matrix_freq: the number of occurrences of each tag at each difficulty level
matrix_solved, matrix_freq = [[[0] * len(difficult_list) for _ in range(len(tag_list))] for _ in range(2)]

# construct the 2D relationship matrices
for pid in codeforces:
    difficult = re.findall('([A-Z])', pid)[0]
    difficult_id = difficult_list.index(difficult)
    tags = codeforces[pid]['tags']
    solved = codeforces[pid]['solved']
    for tag in tags:
        tag_id = tag_list.index(tag)
        matrix_solved[tag_id][difficult_id] += int(solved)
        matrix_freq[tag_id][difficult_id] += 1
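The raw counts in `matrix_freq` are hard to compare across difficulty levels, because some letters have far more problems than others. One possible way to turn them into a distribution is to divide each column by its total, as sketched below. This is an assumption about a reasonable normalization, not necessarily how the charts in this post were produced; `column_shares` and the sample counts are hypothetical.

```python
def column_shares(matrix):
    """Divide each entry by its column total; empty columns stay at 0.0."""
    n_cols = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_cols)]
    return [[row[j] / totals[j] if totals[j] else 0.0 for j in range(n_cols)]
            for row in matrix]

# Hypothetical counts: rows are tags, columns are problem letters.
tags = ['implementation', 'dp']
letters = ['A', 'E', 'Z']
freq = [[30, 5, 0],
        [10, 15, 0]]
shares = column_shares(freq)

for tag, row in zip(tags, shares):
    print(tag, [round(value, 2) for value in row])
```

Each column of `shares` then sums to 1 (or stays at 0 for a letter with no tagged problems), so the values can be read as the share of tag occurrences at that difficulty.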