Fetching Web Page Content with HttpURLConnection: A General Fix for Garbled Text

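      Web pages are not always encoded in UTF-8, so content fetched through HttpURLConnection often comes out garbled when it is decoded with a hard-coded charset. A page may be encoded in UTF-8, in GBK or GB2312, or in some other encoding entirely.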
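The effect is easy to reproduce without a network request. The following is a minimal sketch, not part of the original method: the class name `MojibakeDemo` and the helper `decode` are hypothetical, and it assumes the JDK in use ships the GBK charset. It encodes a short Chinese string as GBK, as a GBK server would, and then decodes the same bytes once with GBK and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decodes raw bytes with the given charset, the same job InputStreamReader does.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: the running JDK provides the GBK charset.
        Charset gbk = Charset.forName("GBK");

        // Two Chinese characters meaning "web page", written as Unicode escapes
        // so the example does not depend on the source file encoding.
        String original = "\u7f51\u9875";

        // The bytes a GBK-encoded page would put on the wire.
        byte[] raw = original.getBytes(gbk);

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(original.equals(decode(raw, gbk)));

        // Decoding the same bytes as UTF-8 yields different, garbled characters.
        System.out.println(original.equals(decode(raw, StandardCharsets.UTF_8)));
    }
}
```

The bytes never change between the two calls; only the charset used to interpret them does, which is why the fix is to learn the right charset before reading the stream.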

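       As the screenshot of the response headers shows, the server includes the page's original charset in the HTTP headers, specifically in the Content-Type value. We can read that value from the getContentType() method of the URLConnection class and then extract the charset from it with a regular expression. If the header names no charset, the code falls back to UTF-8.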


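    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Fetches the page at urlStr and decodes it with the charset named in the
    // Content-Type header. Returns an empty string on any failure.
    public String requestWebContenFromUrl(String urlStr) {
        String result = "";
        try {
            URL url = new URL(urlStr);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5 * 1000);
            conn.setReadTimeout(5 * 1000); // avoid blocking forever on a stalled response
            if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // Default to UTF-8 when the header is missing or names no usable charset.
                String charset = "UTF-8";
                String contentType = conn.getContentType(); // may be null
                if (contentType != null) {
                    // Capture only the charset name, skipping optional quotes
                    // and stopping before any trailing ';' or parameter.
                    Pattern pattern = Pattern.compile(
                            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:+-]*)",
                            Pattern.CASE_INSENSITIVE);
                    Matcher matcher = pattern.matcher(contentType);
                    if (matcher.find() && Charset.isSupported(matcher.group(1))) {
                        charset = matcher.group(1);
                    }
                }
                // try-with-resources closes the reader and the underlying stream,
                // even if reading fails partway through.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), charset))) {
                    StringBuilder sb = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        sb.append(line).append('\n');
                    }
                    result = sb.toString();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }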
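
The header-parsing step can also be checked on its own, without making a request. The sketch below pulls it into a standalone helper; the class name `ContentTypeCharset` and the method `charsetOf` are hypothetical names chosen for illustration, and the pattern is one reasonable way to read the charset parameter rather than a full Content-Type parser. It accepts quoted values, ignores case, and returns the caller's fallback when the header is missing or names a charset the JVM does not support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {
    // Matches "charset=NAME" with optional whitespace and quotes; the capture
    // group only admits characters that are legal in a Java charset name.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:+-]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or fallback when the
    // value is null, has no charset parameter, or names an unsupported charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Sample header values, made up for illustration.
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html; charset=\"utf-8\"", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

Returning a `Charset` instead of a name means an unsupported value is rejected up front, so the caller never hits an UnsupportedEncodingException while opening the reader.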