Crawling LeetCode data to generate a README and tidy up your GitHub repository
Project: LeetCodeCrawler
Overview
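Most of us practice problems on LeetCode to some degree and then push the code to GitHub. I am no exception; my practice repository is Algorithm. Re-editing README.md by hand after every solved problem is tedious, so I wrote a tool, LeetCodeCrawler. It crawls LeetCode problem content and your accepted (AC) submissions, and it can generate the corresponding README.md to tidy up the README of your LeetCode repository.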
Usage
Download LeetCodeCrawler.jar to your machine.
Create a config.json file like the one below (you can edit the config.json in the repo directly). config.json must be placed in the same directory as LeetCodeCrawler.jar:
    {
      "username": "leetcode@leetcode",
      "password": "leetcode",
      "language": ["cpp", "java"],
      "outputDir": "."
    }
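- username and password are your LeetCode account name and password.
- language lists the programming languages you use on LeetCode. Multiple values are allowed; the accepted values are listed below (use exactly these spellings):
  - cpp
  - java
  - csharp
  - javascript
  - python
  - python3
  - ruby
  - swift
  - golang
  - scala
  - kotlin
- outputDir is the directory where the source files should be stored. It defaults to . (the current directory).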
Run java -jar LeetCodeCrawler.jar
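Because the language values must match the list above exactly, it can help to check them before running the crawler. The following is only a minimal sketch; ConfigCheck and its methods are hypothetical names for illustration and are not part of LeetCodeCrawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of LeetCodeCrawler.
final class ConfigCheck {
    // The language identifiers listed in the usage section.
    static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
            "cpp", "java", "csharp", "javascript", "python", "python3",
            "ruby", "swift", "golang", "scala", "kotlin"));

    // Returns the configured entries that are not in the supported list, in input order.
    static List<String> unsupported(List<String> configured) {
        List<String> bad = new ArrayList<>();
        for (String lang : configured) {
            if (!SUPPORTED.contains(lang)) {
                bad.add(lang);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // "Java" (wrong case) and "go" (should be "golang") are reported.
        System.out.println(unsupported(Arrays.asList("cpp", "Java", "go")));
    }
}
```

The comparison is case-sensitive on purpose, since the post asks for the values to be spelled exactly as listed.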
Result
Crawling and parsing
Relevant APIs
The data we need comes from two kinds of endpoints:
1. RESTful API
2. GraphQL
The following APIs are useful during crawling:
- Information about all problems: https://leetcode.com/api/problems/all/ . The data looks roughly like this:
    {
      "user_name": "",
      "num_solved": 0,
      "num_total": 949,
      "ac_easy": 0,
      "ac_medium": 0,
      "ac_hard": 0,
      "stat_status_pairs": [
        {
          "stat": {
            "question_id": 993,
            "question__article__live": true,
            "question__article__slug": "tallest-billboard",
            "question__title": "Tallest Billboard",
            "question__title_slug": "tallest-billboard",
            "question__hide": false,
            "total_acs": 1361,
            "total_submitted": 4295,
            "frontend_question_id": 956,
            "is_new_question": false
          },
          "status": null,
          "difficulty": { "level": 3 },
          "paid_only": false,
          "is_favor": false,
          "frequency": 0,
          "progress": 0
        },
        ... (omitted)
      ],
      "frequency_high": 0,
      "frequency_mid": 0,
      "category_slug": "all"
    }
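- Information about the submissions for one problem: https://leetcode.com/api/submissions/two-sum/?offset=0&limit=10&lastkey= . The submission list may span more than one page, so paging logic is needed. The data looks roughly like this: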
    {
      "submissions_dump": [
        {
          "id": xxx,
          "lang": "java",
          "time": "2 weeks, 5 days",
          "timestamp": 154****320,
          "status_display": "Accepted",
          "runtime": "4 ms",
          "url": "/submissions/detail/19****359/",
          "is_pending": "Not Pending",
          "title": ""
        },
        ... (omitted)
      ],
      "has_next": true,
      "last_key": "xxx"
    }
- GraphQL: https://leetcode.com/graphql . Send a query request to this URL to get the data we want.
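The fields in the problems/all response shown above are enough to build one row of a README table. The following is a minimal sketch of that step; the class name ReadmeRow and the three-column layout are hypothetical, and the mapping of difficulty.level 1, 2, 3 to Easy, Medium, Hard is an assumption (it is consistent with "Tallest Billboard" having level 3).

```java
// Hypothetical helper, not taken from LeetCodeCrawler.
final class ReadmeRow {
    // Assumption: difficulty.level 1, 2, 3 means Easy, Medium, Hard.
    static String difficultyLabel(int level) {
        switch (level) {
            case 1: return "Easy";
            case 2: return "Medium";
            case 3: return "Hard";
            default: return "Unknown";
        }
    }

    // Builds one Markdown table row from frontend_question_id, question__title,
    // question__title_slug and difficulty.level.
    static String row(int frontendId, String title, String titleSlug, int level) {
        return String.format("| %d | [%s](https://leetcode.com/problems/%s/) | %s |",
                frontendId, title, titleSlug, difficultyLabel(level));
    }

    public static void main(String[] args) {
        System.out.println(row(956, "Tallest Billboard", "tallest-billboard", 3));
    }
}
```

Note that the row uses frontend_question_id (956) rather than question_id (993), since the former is the number shown on the LeetCode site.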
Simulating the login
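I wrote an earlier post on simulating the login, "Simulating a LeetCode login with OkHttp", which covers it in more detail; here is a short summary. The captured traffic shows that we only need to build a request whose Content-Type is multipart/form-data and attach the Cookie values returned when the login page is first opened. That is enough to complete the simulated login.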
    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import okhttp3.Headers;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.RequestBody;
    import okhttp3.Response;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;

    /**
     * Simulated LeetCode login. For the analysis of the login flow see:
     * https://www.cnblogs.com/ZhaoxiCheung/p/9302510.html
     * The fields csrftoken, __cfduid, boundary, MULTIPART, usrname, passwd and
     * LEETCODE_SESSION, as well as URL, MyCookieJar and out (System.out),
     * are defined elsewhere in the project.
     */
    public boolean doLogin() throws IOException {
        boolean success;
        // Open the login page first to obtain the initial cookies.
        Connection.Response response = Jsoup.connect(URL.LOGIN)
                .method(Connection.Method.GET)
                .execute();
        csrftoken = response.cookie("csrftoken");
        __cfduid = response.cookie("__cfduid");

        OkHttpClient client = new OkHttpClient.Builder()
                .followRedirects(false)
                .followSslRedirects(false)
                .cookieJar(new MyCookieJar())
                .connectTimeout(30, TimeUnit.SECONDS)
                .readTimeout(30, TimeUnit.SECONDS)
                .writeTimeout(30, TimeUnit.SECONDS)
                .build();

        // Hand-built multipart/form-data body with four fields.
        String form_data = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"csrfmiddlewaretoken\"" + "\r\n\r\n"
                + csrftoken + "\r\n"
                + "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"login\"" + "\r\n\r\n"
                + usrname + "\r\n"
                + "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"password\"" + "\r\n\r\n"
                + passwd + "\r\n"
                + "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"next\"" + "\r\n\r\n"
                + "/problems" + "\r\n"
                + "--" + boundary + "--";
        RequestBody requestBody = RequestBody.create(MULTIPART, form_data);

        Request request = new Request.Builder()
                .addHeader("Content-Type", "multipart/form-data; boundary=" + boundary)
                .addHeader("Connection", "keep-alive")
                .addHeader("Accept", "*/*")
                .addHeader("Origin", "https://leetcode.com")
                .addHeader("Referer", URL.LOGIN)
                .addHeader("Cookie", "__cfduid=" + __cfduid + ";" + "csrftoken=" + csrftoken)
                .post(requestBody)
                .url(URL.LOGIN)
                .build();

        Response loginResponse = client.newCall(request).execute();
        if (Main.isDebug) out.println(loginResponse.message());

        // Look for the session cookie among the Set-Cookie headers.
        Headers headers = loginResponse.headers();
        List<String> cookies = headers.values("Set-Cookie");
        for (String cookie : cookies) {
            int found = cookie.indexOf("LEETCODE_SESSION");
            if (found > -1) {
                if (Main.isDebug) out.println(cookie);
                // The value runs from just after "LEETCODE_SESSION=" to the next ';'.
                int last = cookie.indexOf(";", found);
                if (last < 0) last = cookie.length();
                LEETCODE_SESSION = cookie.substring(found + "LEETCODE_SESSION".length() + 1, last);
                if (Main.isDebug) out.println(LEETCODE_SESSION);
            }
        }

        if (LEETCODE_SESSION != null) {
            success = true;
            out.println("Login Successfully");
        } else {
            success = false;
            out.println("Login Unsuccessfully");
        }
        loginResponse.close();
        return success;
    }
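The two string-handling steps in the login method, assembling the multipart/form-data body and pulling a value out of a Set-Cookie header, can be tried in isolation without any network access. The following is a minimal sketch using only the standard library; LoginSketch and its method names are hypothetical, and the cookie helper assumes the header line starts with name=value, as Set-Cookie lines do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of LeetCodeCrawler.
final class LoginSketch {
    // Builds a multipart/form-data body with one part per field, in insertion order,
    // using the same layout as the hand-written string in doLogin().
    static String multipartBody(String boundary, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n")
              .append("Content-Disposition: form-data; name=\"").append(e.getKey()).append("\"")
              .append("\r\n\r\n")
              .append(e.getValue()).append("\r\n");
        }
        // The closing delimiter carries two extra dashes.
        sb.append("--").append(boundary).append("--");
        return sb.toString();
    }

    // Returns the value of the named cookie from one Set-Cookie header line,
    // or null if the line sets a different cookie.
    static String cookieValue(String setCookie, String name) {
        String prefix = name + "=";
        if (!setCookie.startsWith(prefix)) {
            return null;
        }
        int end = setCookie.indexOf(';');
        return end < 0 ? setCookie.substring(prefix.length())
                       : setCookie.substring(prefix.length(), end);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("login", "leetcode@leetcode");   // placeholder values
        fields.put("password", "leetcode");
        System.out.println(multipartBody("----boundary", fields));
        System.out.println(cookieValue("LEETCODE_SESSION=abc123; Path=/", "LEETCODE_SESSION"));
    }
}
```

A LinkedHashMap keeps the parts in the order they were added, which matches the fixed field order of the original body.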
Fetching data with GraphQL
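Not all data is available through the RESTful API; LeetCode serves some of it through GraphQL, for example a problem's Description. I also wrote an earlier post on fetching LeetCode data with GraphQL, "Crawling LeetCode problems: how to send a GraphQL query to fetch data", which goes into more detail. Here I mainly explain how to find out which query to send. In Chrome, press F12 and open the Network tab. In a request's Headers, under Request Payload, there is a query field; it is the key piece of information for constructing our own GraphQL query.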
Other
Getting a problem's description
    import java.io.IOException;
    import okhttp3.Headers;
    import okhttp3.MediaType;
    import okhttp3.RequestBody;
    import okhttp3.Response;

    // okHttpHelper, URL, Login and ProblemContentBean are defined elsewhere in the project.
    public String getProblemDescription(String problemTitle) throws IOException {
        String problemDescriptionString = "";
        // GraphQL query asking only for the content field of the question.
        String postBody = "query{question(titleSlug:\"" + problemTitle + "\") {content}}\n";
        RequestBody requestBody = RequestBody.create(
                MediaType.parse("application/graphql; charset=utf-8"), postBody);
        Headers headers = new Headers.Builder()
                .add("Content-Type", "application/graphql")
                .add("Referer", "https://leetcode.com/problems/" + problemTitle)
                .add("Cookie", "__cfduid=" + Login.__cfduid + ";"
                        + "csrftoken=" + Login.csrftoken + ";"
                        + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                .add("x-csrftoken", Login.csrftoken)
                .build();
        Response graphqlResponse = okHttpHelper.post(URL.GRAPHQL, requestBody, headers);
        if (graphqlResponse != null) {
            ProblemContentBean problemContentBean = okHttpHelper.fromJson(
                    graphqlResponse.body().string(), ProblemContentBean.class);
            problemDescriptionString = problemContentBean.getData().getQuestion().getContent();
            graphqlResponse.close();
        } else {
            // TODO: print an error message
        }
        return problemDescriptionString;
    }
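The method above sends the query as raw application/graphql text. The Request Payload seen in the browser is instead a JSON object with a query field, which is the usual way to post GraphQL over HTTP. The following is a minimal sketch of building such a JSON body by hand; GraphQlPayload and its methods are hypothetical names, and whether LeetCode requires further fields (such as variables) is not covered here.

```java
// Hypothetical helper, not part of LeetCodeCrawler.
final class GraphQlPayload {
    // Query text in the same shape as the one used in getProblemDescription().
    static String questionContentQuery(String titleSlug) {
        return "query{question(titleSlug:\"" + titleSlug + "\") {content}}";
    }

    // Minimal JSON string escaping for quotes, backslashes and control characters.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Wraps the query text in a JSON object with a single "query" field.
    static String jsonBody(String query) {
        return "{\"query\":" + jsonString(query) + "}";
    }

    public static void main(String[] args) {
        System.out.println(jsonBody(questionContentQuery("two-sum")));
    }
}
```

In a real project a JSON library would normally do the escaping; it is written out here only to keep the sketch free of dependencies.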
Getting the code submitted for a problem in a given language
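    import java.io.IOException;
    import okhttp3.Headers;
    import okhttp3.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // okHttpHelper, URL, Login, Main, out and encode(...) are defined elsewhere in the project.
    public String getSubmissionCode(String submissionUrl) throws IOException {
        String url = URL.LEETCODE + submissionUrl;
        if (Main.isDebug) out.println(url);
        String codeString = null;
        Headers headers = new Headers.Builder()
                .add("Cookie", "__cfduid=" + Login.__cfduid + ";"
                        + "csrftoken=" + Login.csrftoken + ";"
                        + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                .build();
        Response response = okHttpHelper.get(url, headers);
        if (response != null) {
            String htmlString = response.body().string();
            Document document = Jsoup.parse(htmlString);
            // The submitted code is embedded in one of the page's <script> tags.
            Elements elements = document.getElementsByTag("script");
            for (Element element : elements) {
                String script = element.toString();
                int indexStart = script.indexOf("submissionCode: '");
                if (indexStart > -1) {
                    // The code ends a few characters before the editCodeUrl field.
                    int indexTo = script.indexOf("editCodeUrl");
                    codeString = script.substring(indexStart + ("submissionCode: '").length(), indexTo - 5);
                    break;
                }
            }
            response.close();
        } else {
            // TODO: handle the error
        }
        // Only decode when the code was actually found.
        if (codeString != null) {
            codeString = encode(codeString);
        }
        return codeString;
    }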
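The encode helper called above is not shown in this post. Assuming the code sits in a JavaScript string literal whose special characters are written as \uXXXX escapes, the extraction and decoding could look roughly like the sketch below. SubmissionCodeSketch, its method names, and the rule that the literal ends at the next single quote are all assumptions for illustration, not the project's actual implementation.

```java
// Hypothetical helper, not part of LeetCodeCrawler.
final class SubmissionCodeSketch {
    // Returns the text between "submissionCode: '" and the next single quote,
    // or null if the marker is missing. Assumes quotes inside the code are escaped.
    static String extract(String script) {
        String startMarker = "submissionCode: '";
        int start = script.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = script.indexOf('\'', start);
        return end < 0 ? null : script.substring(start, end);
    }

    // Replaces every backslash-u escape followed by four hex digits with its character.
    static String unescapeUnicode(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                try {
                    sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                } catch (NumberFormatException ignored) {
                    // Not a valid escape: copy the backslash literally below.
                }
            }
            sb.append(s.charAt(i));
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String script = "var pageData = { submissionCode: 'int a\\u003D1\\u003B', editCodeUrl: '/x' };";
        System.out.println(unescapeUnicode(extract(script)));
    }
}
```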
Getting the code submitted for a problem in the languages listed in the config file
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import okhttp3.Headers;
    import okhttp3.Response;

    // Config, ResultBean, SubmissionBean, okHttpHelper, URL, Login, Main and out
    // are defined elsewhere in the project.
    public synchronized Map<String, String> getSubmissions(String problemTitle, ResultBean resultBean) throws IOException {
        if (Main.isDebug) out.println("pre problemTitle = " + problemTitle);
        // Submitted code, keyed by language
        Map<String, String> submissionMap = new HashMap<>();
        int offset = 0;
        int limit = 10;
        boolean hasNext = true;
        String lastKey = "";
        List<String> languageList = Config.getSingleton().getLanguageList();
        // Languages whose code is already stored locally
        List<String> savedLanguageList = resultBean != null ? resultBean.getLanguage() : new ArrayList<>(0);
        // For each language still missing: has its code been fetched yet?
        Map<String, Boolean> languageMap = new HashMap<>();
        for (int i = 0; i < languageList.size(); i++) {
            boolean hasExist = false;
            // The lists are small, so a brute-force search is fine
            for (int j = 0; j < savedLanguageList.size(); j++) {
                if (languageList.get(i).equals(savedLanguageList.get(j))) {
                    hasExist = true;
                    break;
                }
            }
            if (!hasExist) languageMap.put(languageList.get(i), false);
        }
        // Every requested language is already stored locally for this problem
        if (languageMap.size() == 0) return submissionMap;

        while (hasNext) {
            String submissionsUrl = String.format(URL.SUBMISSIONS_FORMAT, problemTitle, offset, limit, lastKey);
            Headers headers = new Headers.Builder()
                    .add("Cookie", "__cfduid=" + Login.__cfduid + ";"
                            + "csrftoken=" + Login.csrftoken + ";"
                            + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                    .build();
            Response response = okHttpHelper.get(submissionsUrl, headers);
            if (response != null) {
                String responseData = response.body().string();
                SubmissionBean submissionBean = okHttpHelper.fromJson(responseData, SubmissionBean.class);
                List<SubmissionBean.SubmissionsDumpBean> submissionsDumpList = submissionBean.getSubmissions_dump();
                if (submissionsDumpList == null) {
                    if (Main.isDebug) {
                        out.println("submissionsUrl = " + submissionsUrl);
                        out.println("problemTitle = " + problemTitle);
                        out.println("responseData = " + responseData);
                        out.println("status message = " + response.message());
                        out.println("message code = " + response.code());
                    }
                    // Stop here; retrying the same URL unchanged would loop forever.
                    response.close();
                    break;
                }
                for (int i = 0; i < submissionsDumpList.size(); i++) {
                    SubmissionBean.SubmissionsDumpBean submission = submissionsDumpList.get(i);
                    String language = submission.getLang();
                    // Keep the first accepted submission seen for each missing language
                    if (languageMap.containsKey(language) && !languageMap.get(language)
                            && submission.getStatus_display().equals("Accepted")) {
                        submissionMap.put(language, getSubmissionCode(submission.getUrl()));
                        languageMap.put(language, true);
                    }
                }
                // Paging: move to the next page of `limit` entries
                hasNext = submissionBean.isHas_next();
                offset += limit;
                lastKey = submissionBean.getLast_key();
                response.close();
            } else {
                // TODO: report the error. Stop instead of requesting the same page again.
                break;
            }
        }
        return submissionMap;
    }
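The paging part of the method above can be separated from the HTTP and JSON handling, which makes it easy to try with fake data. The following is a minimal sketch under these assumptions: offset advances by limit on each page, the last_key of one response is passed as lastkey in the next request, and a null page ends the loop. SubmissionPager, Page and the fetch function are hypothetical stand-ins, not the project's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of LeetCodeCrawler.
final class SubmissionPager {
    // One page of the response, reduced to the fields the paging logic needs.
    static final class Page {
        final List<String> items;
        final boolean hasNext;
        final String lastKey;

        Page(List<String> items, boolean hasNext, String lastKey) {
            this.items = items;
            this.hasNext = hasNext;
            this.lastKey = lastKey;
        }
    }

    // Same URL shape as the submissions API shown earlier in this post.
    static String pageUrl(String titleSlug, int offset, int limit, String lastKey) {
        return String.format(
                "https://leetcode.com/api/submissions/%s/?offset=%d&limit=%d&lastkey=%s",
                titleSlug, offset, limit, lastKey);
    }

    // fetch stands in for the real HTTP call: it maps a URL to a parsed page,
    // or returns null when the request fails.
    static List<String> collectAll(String titleSlug, int limit, Function<String, Page> fetch) {
        List<String> all = new ArrayList<>();
        int offset = 0;
        String lastKey = "";
        boolean hasNext = true;
        while (hasNext) {
            Page page = fetch.apply(pageUrl(titleSlug, offset, limit, lastKey));
            if (page == null) {
                break; // stop rather than request the same page again
            }
            all.addAll(page.items);
            hasNext = page.hasNext;
            lastKey = page.lastKey;
            offset += limit;
        }
        return all;
    }
}
```

Passing the fetch step in as a function means the loop can be exercised with a simple map from URL to page, with no login or network involved.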
The full code is available on GitHub: LeetCodeCrawler