Fetching web page content with HttpURLConnection: a general way to fix garbled text
阿新 • Published: 2019-01-07
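Web pages are not always encoded in UTF-8, so content fetched through HttpURLConnection often comes out garbled. A page may be encoded in UTF-8, but it may just as well use GBK, GB2312, or some other encoding.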
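As the screenshot below shows, the server includes the original character encoding in the HTTP response headers. We can take the return value of the URLConnection class's getContentType() method (a value such as text/html; charset=GBK) and extract the encoding from it with a regular expression.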
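```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Place this method inside a class of your own.
public String requestWebContenFromUrl(String urlStr) {
    String result = "";
    try {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5 * 1000);
        if (conn.getResponseCode() == 200) {
            // Default to UTF-8 when the header does not declare a charset.
            String charset = "UTF-8";
            // getContentType() returns null when the header is absent.
            String contentType = conn.getContentType();
            if (contentType != null) {
                // Stop at whitespace, ';' or '"' so trailing parameters and quotes are not captured.
                Pattern pattern = Pattern.compile("charset=\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
                Matcher matcher = pattern.matcher(contentType);
                if (matcher.find()) {
                    charset = matcher.group(1);
                }
            }
            // try-with-resources closes the reader and the underlying stream.
            try (InputStream is = conn.getInputStream();
                 BufferedReader reader = new BufferedReader(new InputStreamReader(is, charset))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                result = sb.toString();
            }
        }
    } catch (Exception e) {
        // An unsupported charset name or a network failure ends up here; result stays empty.
        e.printStackTrace();
    }
    return result;
}
```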
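The header-parsing step can also be pulled out into a small helper, which makes it easier to check against awkward Content-Type values. The following is only a sketch: the class name CharsetSniffer and the method charsetFromContentType are hypothetical names that do not appear in the original post, and falling back to UTF-8 for a missing or unrecognized charset is an assumption carried over from the method above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original post): isolates the charset lookup.
final class CharsetSniffer {
    // Matches charset=VALUE, tolerating spaces around '=', optional quotes, and any letter case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or UTF-8 (an assumed default)
    // when the value is null, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html; Charset=\"utf-8\"; boundary=x"));
        System.out.println(charsetFromContentType(null));
    }
}
```

The returned Charset can be passed straight to new InputStreamReader(is, charset), so an unsupported name never reaches the reader and the page is still decoded, at worst with the UTF-8 default.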