
R: Converting Between Factors and Strings

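When importing a large data set, R versions before 4.0.0 convert every string column to a factor unless you explicitly pass "stringsAsFactors = FALSE" (since R 4.0.0 the default is FALSE). This conversion slows down data processing.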

The sample data is as follows:

name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018

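The data summary shows that the string columns were converted to factors and counted per level (this counting is one of the reasons processing is slow). The summary is as follows: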

     name        math         english     sex        year     
 guagua:1   Min.   :65.0   Min.   :68.0   F:2   Min.   :2018  
 MM    :1   1st Qu.:72.5   1st Qu.:75.5   M:2   1st Qu.:2018  
 yiifaa:1   Median :80.0   Median :83.0         Median :2018  
 yiifee:1   Mean   :80.0   Mean   :83.0         Mean   :2018  
            3rd Qu.:87.5   3rd Qu.:90.5         3rd Qu.:2018  
            Max.   :95.0   Max.   :98.0         Max.   :2018  
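To reproduce this behaviour without a file on disk, the following is a minimal sketch that reads the same sample rows through the `text` argument of `read.table`. The variable name `csv_text` is an assumption made for illustration, and `stringsAsFactors = TRUE` is passed explicitly so the result does not depend on the R version.

```r
# Inline copy of the sample data so the snippet runs without Score.txt
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Ask for factor conversion explicitly (R >= 4.0.0 defaults to FALSE)
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     quote = "\"", stringsAsFactors = TRUE)

# Class of each column: name and sex come back as factors
print(sapply(scores, class))

# A factor stores its distinct values as levels
print(levels(scores$sex))

# summary() counts the rows per level for factor columns
print(summary(scores))
```

Checking `sapply(scores, class)` right after the import is a quick way to see which columns were turned into factors before deciding what to convert.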

These per-level counts are not meaningful here, so the factors need to be converted back to character vectors with "as.character", as follows:

#!/usr/bin/env Rscript
setwd("D:/Workspace/R-Works/R-Stat")
# stringsAsFactors = TRUE reproduces the old default (R < 4.0.0)
scores <- read.table("Score.txt", header = TRUE, sep = ",", quote = "\"", encoding = "UTF-8", stringsAsFactors = TRUE)
# Convert the factor to a character vector
scores$name <- as.character(scores$name)
# Convert one more column as a test
scores$sex <- as.character(scores$sex)
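When a data frame has many factor columns, converting them one by one gets tedious. The sketch below is one possible way to convert all of them in a single step; the small data frame and the helper variable `is_fac` are assumptions made for illustration and are not part of the original script.

```r
# A small stand-in data frame with two factor columns
scores <- data.frame(
  name = factor(c("yiifaa", "yiifee", "guagua", "MM")),
  math = c(65, 95, 75, 85),
  sex  = factor(c("M", "F", "M", "F"))
)

# Logical vector marking which columns are factors
is_fac <- sapply(scores, is.factor)

# Convert only those columns; numeric columns are left untouched
scores[is_fac] <- lapply(scores[is_fac], as.character)

# name and sex are now character, math is still numeric
print(sapply(scores, class))
```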

Checking the summary again gives:

name                math         english         sex                 year     
 Length:4           Min.   :65.0   Min.   :68.0   Length:4           Min.   :2018  
 Class :character   1st Qu.:72.5   1st Qu.:75.5   Class :character   1st Qu.:2018  
 Mode  :character   Median :80.0   Median :83.0   Mode  :character   Median :2018  
                    Mean   :80.0   Mean   :83.0                      Mean   :2018  
                    3rd Qu.:87.5   3rd Qu.:90.5                      3rd Qu.:2018  
                    Max.   :95.0   Max.   :98.0                      Max.   :2018  

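The per-level counts are gone from the summary, replaced by the length, class, and mode of each character column. To get the per-level counts back, recreate the factor, as follows: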

scores$sex <- factor(scores$sex, levels=c("M", "F"), ordered = TRUE)
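The effect of recreating the factor can be checked on a plain vector. This is a small sketch under the assumption that the column holds the four values from the sample data; the names `sex`, `sex_f`, and `sex_u` are illustrative only.

```r
# The sex column from the sample data as a plain character vector
sex <- c("M", "F", "M", "F")

# Same call as above: levels in the given order, treated as ordered
sex_f <- factor(sex, levels = c("M", "F"), ordered = TRUE)

# Levels follow the order of the levels argument, not alphabetical order
print(levels(sex_f))

# The per-level counts are back
print(table(sex_f))

# Converting back to character recovers the original strings
print(as.character(sex_f))

# Unordered alternative: same levels and counts, but no M < F ordering
sex_u <- factor(sex, levels = c("M", "F"))
print(is.ordered(sex_u))
```

Note that `ordered = TRUE` makes R treat the levels as ranked (here M < F), which affects comparisons and some modelling functions. For a purely nominal column, leaving out `ordered` may be the more usual choice.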

Conclusion

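When importing a large data set, work in two steps where possible to improve performance:
1. Explicitly specify "stringsAsFactors = FALSE".
2. Convert only the columns (vectors) you need into factors, one at a time.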
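The two steps can be sketched as follows. The inline `csv_text` and the `factor_cols` vector are assumptions made for illustration; in a real script the data would come from a file and the list of columns would depend on the analysis.

```r
# Inline copy of the sample data so the snippet is self-contained
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Step 1: import with strings kept as character vectors
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     quote = "\"", stringsAsFactors = FALSE)

# Step 2: convert only the columns that should be factors
factor_cols <- c("sex")  # hypothetical list of columns to convert
for (col in factor_cols) {
  scores[[col]] <- factor(scores[[col]])
}

# name stays character, sex is now a factor
print(sapply(scores, class))
print(summary(scores$sex))
```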