Handling Chinese text encodings in Spark

阿新 • Published: 2019-01-26

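Our logs are GBK-encoded, but Hadoop hard-codes UTF-8 for text, so the final output came out garbled.

That sent me off to study how Java handles encodings.

There is a well-known solution online for reading GBK input files in Spark. The Scala code looks like this: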

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.TextInputFormat

// ctx is the SparkContext; file_list holds the input path(s).
val rdd = ctx.hadoopFile(file_list, classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text]).map(
            // Decode only the first getLength bytes of each line as GBK.
            pair => new String(pair._2.getBytes, 0, pair._2.getLength, "GBK"))

The idea comes from the following Java helper:

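import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

public static Text transformTextToUTF8(Text text, String encoding) {
    try {
        // Decode only the valid bytes; Text stores its content as UTF-8.
        String value = new String(text.getBytes(), 0, text.getLength(), encoding);
        return new Text(value);
    } catch (UnsupportedEncodingException e) {
        // Fail loudly instead of passing null to new Text(...).
        throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
    }
}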
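Both snippets decode only the first `getLength()` bytes because `Text.getBytes()` returns the backing array, and only the data up to `getLength()` is valid. The sketch below illustrates why that matters, using an ordinary byte array as a stand-in for a reused buffer. The class `ValidLengthDemo` and its methods are hypothetical names for illustration, not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;

final class ValidLengthDemo {
    /** Decodes only the first validLength bytes of a reused buffer. */
    static String decodeValid(byte[] buffer, int validLength) {
        return new String(buffer, 0, validLength, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the whole buffer, including stale bytes left by an earlier record. */
    static String decodeAll(byte[] buffer) {
        return new String(buffer, StandardCharsets.ISO_8859_1);
    }

    /** Simulates a buffer that held "abcdef" and now holds the shorter record "xy". */
    static byte[] reusedBuffer() {
        byte[] buffer = "abcdef".getBytes(StandardCharsets.ISO_8859_1);
        buffer[0] = 'x';
        buffer[1] = 'y';
        return buffer;
    }

    public static void main(String[] args) {
        byte[] buffer = reusedBuffer();
        System.out.println(decodeValid(buffer, 2)); // only the current record
        System.out.println(decodeAll(buffer));      // current record plus stale tail
    }
}
```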

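But this approach still has a problem.

GBK encodes each character in one or two bytes: ASCII in one, Chinese characters in two. If a log field was cut off at a fixed byte length, the cut can land in the middle of a two-byte character. When the file is then decoded as GBK, the orphaned lead byte can swallow the \t delimiter that follows it, so splitting on \t yields the wrong number of fields (or fields shifted out of position).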

What is needed here is a single-byte decoding that maps every byte to exactly one character, namely ISO-8859-1. The code:

val rdd = ctx.hadoopFile(file_list, classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text]).map(
            // ISO-8859-1 maps each byte to one char, so \t delimiters are never swallowed.
            pair => new String(pair._2.getBytes, 0, pair._2.getLength, "ISO-8859-1"))

But this creates a new problem: the output (saved as UTF-8) is garbled and unusable.
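A small sketch of that trade-off, under stated assumptions: the sample line, the class `Latin1RoundTripDemo`, and its method names are made up for illustration. It decodes a GBK line whose first field is cut off mid-character as ISO-8859-1, and checks that the tab still splits the line into two fields and that the original bytes can be recovered exactly. The decoded text itself is not readable Chinese, which is why the output looks garbled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTripDemo {
    /** Builds a GBK line: one full character, an orphaned lead byte, TAB, then "ok". */
    static byte[] sampleLine() {
        byte[] zhong = "\u4e2d".getBytes(Charset.forName("GBK")); // two bytes in GBK
        return new byte[] {zhong[0], zhong[1], zhong[0], '\t', 'o', 'k'};
    }

    /** Decodes raw bytes one-to-one into chars. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Returns true if encoding the decoded string gives back the same bytes. */
    static boolean roundTrips(byte[] raw) {
        byte[] back = decodeLatin1(raw).getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        byte[] raw = sampleLine();
        String[] fields = decodeLatin1(raw).split("\t", -1);
        System.out.println("field count: " + fields.length);
        System.out.println("lossless round trip: " + roundTrips(raw));
    }
}
```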

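Let's look at the problem from a different angle: how do you convert a GBK file to UTF-8 in Java or Scala? There is plenty of ready-made code online. For our case, processing the file line by line, a sample looks like this: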

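import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Encoding {
    private static final String kGBKEncoding = "GBK";
    private static final String kUTF8Encoding = "UTF-8";

    // args[0]: GBK input file, args[1]: UTF-8 output file
    public static void main(String[] args) {
        // try-with-resources flushes and closes the writer even on failure.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(args[1]), kUTF8Encoding))) {
            // Decode the input bytes as GBK into Java Strings.
            List<String> lines = Files.readAllLines(Paths.get(args[0]), Charset.forName(kGBKEncoding));
            for (String line : lines) {
                // The writer encodes each String as UTF-8 on the way out.
                out.append(line).append("\n");
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}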

This code offers a hint: the encoding conversion happens automatically when the file is written, so there is no need to convert each line by hand.

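Some reading confirmed it: Java uses a single internal representation for text. A String always holds Unicode (UTF-16) characters, and a charset only comes into play when a byte stream is decoded into a String or a String is encoded back into bytes.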
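As a minimal illustration of that boundary (the class `CharsetBoundaryDemo` and its methods are hypothetical names), the same two-character String produces different byte sequences depending on the charset chosen at encode time, and decoding the GBK bytes as GBK gives back the identical String. The text is written with Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;

final class CharsetBoundaryDemo {
    /** The two characters U+4E2D U+6587, written as escapes. */
    static final String TEXT = "\u4e2d\u6587";

    /** Number of bytes the text occupies once encoded with the given charset. */
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    /** Encodes with the given charset and decodes with the same one. */
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        System.out.println("chars in String: " + TEXT.length());
        System.out.println("bytes as GBK:    " + encodedLength(TEXT, "GBK"));
        System.out.println("bytes as UTF-8:  " + encodedLength(TEXT, "UTF-8"));
    }
}
```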

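Putting these two observations together, I simulated the Spark flow of reading the file as ISO-8859-1 and writing it as UTF-8. It turns out that all that is needed is to re-decode each line as a GBK String: recover the original bytes with ISO-8859-1, then decode them as GBK. The resulting file is no longer garbled when opened as UTF-8. The code: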

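import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Encoding {
    private static final String kISOEncoding = "ISO-8859-1";
    private static final String kGBKEncoding = "GBK";
    private static final String kUTF8Encoding = "UTF-8";

    // args[0]: GBK input file, args[1]: UTF-8 output file
    public static void main(String[] args) {
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(args[1]), kUTF8Encoding))) {
            // Read as ISO-8859-1, the way the Spark job does: one char per byte.
            List<String> lines = Files.readAllLines(Paths.get(args[0]), Charset.forName(kISOEncoding));
            for (String line : lines) {
                // Recover the original bytes, then decode them as GBK.
                String gbk_str = new String(line.getBytes(kISOEncoding), kGBKEncoding);
                out.append(gbk_str).append("\n");
            }
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}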

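Problem solved. It took me a full working day to get there, which shows I'm still not fluent enough in Java.

I've written it up in the hope that it is useful to others.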
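The post does not show how the re-decoding step is combined with the tab splitting from earlier, so the following is only a sketch under that assumption: split the ISO-8859-1 line on \t first, while each byte is still exactly one character, and re-decode each field as GBK afterwards. The class `GbkFieldDecoder`, its methods, and the sample line are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GbkFieldDecoder {
    private static final Charset GBK = Charset.forName("GBK");

    /** Re-decodes a String that was read as ISO-8859-1 but really holds GBK bytes. */
    static String latin1ToGbk(String latin1) {
        return new String(latin1.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    /** Splits on TAB while still one char per byte, then re-decodes each field. */
    static String[] splitThenDecode(String latin1Line) {
        String[] fields = latin1Line.split("\t", -1);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = latin1ToGbk(fields[i]);
        }
        return fields;
    }

    /** Sample line as Spark would see it: two characters, an orphaned lead byte, TAB, "ok". */
    static String sampleLatin1Line() {
        byte[] text = "\u4e2d\u6587".getBytes(GBK); // four bytes in GBK
        byte[] raw = {text[0], text[1], text[2], text[3], text[0], '\t', 'o', 'k'};
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String[] fields = splitThenDecode(sampleLatin1Line());
        System.out.println("field count: " + fields.length);
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

In a Spark job the same two helpers could be applied inside the `map` over the ISO-8859-1 lines; the truncated trailing byte then damages only the end of its own field instead of shifting the columns.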

Summary

1. Learn to generalize from one case to others.

2. Learn to use Google; lately I've been relying on it to get by.
