R: Converting Between Factors and Strings
阿新 • Published: 2018-12-18
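When importing a large dataset without explicitly setting "stringsAsFactors = FALSE", R converts every string column to a factor by default, which slows down data processing. (This default applies to R versions before 4.0.0; from 4.0.0 onward the default is FALSE.)
The sample data looks like this: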
name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018
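Looking at the summary of the data, we can see that the strings were converted to factors by default and counted per level (this is also one of the reasons processing is slower). The summary is as follows: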
     name        math         english     sex        year
 guagua:1   Min.   :65.0   Min.   :68.0   F:2   Min.   :2018
 MM    :1   1st Qu.:72.5   1st Qu.:75.5   M:2   1st Qu.:2018
 yiifaa:1   Median :80.0   Median :83.0         Median :2018
 yiifee:1   Mean   :80.0   Mean   :83.0         Mean   :2018
            3rd Qu.:87.5   3rd Qu.:90.5         3rd Qu.:2018
            Max.   :95.0   Max.   :98.0         Max.   :2018
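To confirm which columns were actually turned into factors, it can help to inspect the column classes directly rather than reading them off the summary. The following is a minimal sketch, assuming the sample data is supplied inline through the "text" argument of read.table so that it runs without Score.txt; the variable names csv_text and col_classes are illustrative only.

```r
# Inline copy of the sample data, so the sketch does not depend on Score.txt
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Force the factor conversion explicitly so the result is the same on any R version
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     quote = "\"", stringsAsFactors = TRUE)

# Class of every column: the two string columns come back as factors
col_classes <- sapply(scores, class)
print(col_classes)

# The distinct values of a factor are stored as its levels
print(levels(scores$sex))
```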
Per-level counts like these are not meaningful here, so the columns need to be converted to character with "as.character", as follows:
#!/usr/bin/env Rscript
setwd("D:/Workspace/R-Works/R-Stat")
# Read the file with string columns converted to factors
scores <- read.table("Score.txt", header = TRUE, sep = ",", quote = "\"", encoding = "UTF-8", stringsAsFactors = TRUE)
# Convert the factor to character
scores$name <- as.character(scores$name)
# Convert one more column as a test
scores$sex <- as.character(scores$sex)
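When a data frame has many factor columns, converting them one at a time gets tedious. One possible approach, sketched here under the assumption that every factor column should become character, is to detect the factor columns and convert them in a single step. The data frame below is a hand-built stand-in for the sample data, and is_fct is a hypothetical helper name.

```r
# Stand-in for the sample data, built in memory with factor columns
scores <- data.frame(
  name = c("yiifaa", "yiifee", "guagua", "MM"),
  math = c(65, 95, 75, 85),
  sex  = c("M", "F", "M", "F"),
  stringsAsFactors = TRUE
)

# Logical vector marking which columns are factors
is_fct <- vapply(scores, is.factor, logical(1))

# Convert only those columns to character; numeric columns are left alone
scores[is_fct] <- lapply(scores[is_fct], as.character)

str(scores)
```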
Checking the summary again gives:
     name                math         english          sex                 year
 Length:4           Min.   :65.0   Min.   :68.0   Length:4           Min.   :2018
 Class :character   1st Qu.:72.5   1st Qu.:75.5   Class :character   1st Qu.:2018
 Mode  :character   Median :80.0   Median :83.0   Mode  :character   Median :2018
                    Mean   :80.0   Mean   :83.0                      Mean   :2018
                    3rd Qu.:87.5   3rd Qu.:90.5                      3rd Qu.:2018
                    Max.   :95.0   Max.   :98.0                      Max.   :2018
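As you can see, the summary no longer shows per-level counts and reports the length of each character column instead. To get the per-level counts back, the factor has to be created again, as follows: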
scores$sex <- factor(scores$sex, levels=c("M", "F"), ordered = TRUE)
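The "levels" argument fixes the order in which the levels are stored and reported, while "ordered = TRUE" additionally declares a ranking between them (here M < F). The sketch below contrasts the two on a small stand-alone vector; the names sex, sex_fct and sex_ord are illustrative, and whether an ordered factor is appropriate depends on whether the categories really have a natural order.

```r
sex <- c("M", "F", "M", "F")

# Unordered factor with an explicit level order
sex_fct <- factor(sex, levels = c("M", "F"))
print(table(sex_fct))            # per-level counts, reported in the order M, F

# Ordered factor: same levels, plus a ranking M < F
sex_ord <- factor(sex, levels = c("M", "F"), ordered = TRUE)
print(sex_ord[1] < sex_ord[2])   # comparisons are defined only for ordered factors
```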
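Conclusion
When importing a large dataset, it is best to work in two steps wherever possible to improve performance:
1. Explicitly specify "stringsAsFactors = FALSE";
2. Convert only the columns (vectors) you need to factors, one at a time.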
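The two steps above can be sketched as follows. This is a minimal example, assuming the sample data is passed inline through the "text" argument of read.table instead of being read from Score.txt, and assuming "sex" is the only column worth treating as a factor.

```r
# Inline copy of the sample data, so the sketch does not depend on Score.txt
csv_text <- 'name,math,english,sex,year
"yiifaa",65,68,"M",2018
"yiifee",95,98,"F",2018
"guagua",75,78,"M",2018
"MM",85,88,"F",2018'

# Step 1: read the data and keep strings as character vectors
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     quote = "\"", stringsAsFactors = FALSE)

# Step 2: convert only the column that is really categorical
scores$sex <- factor(scores$sex, levels = c("M", "F"))

# name stays character; sex now shows per-level counts
print(summary(scores))
```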