
Python Text Data Processing

1. Basic Text Operations

text1 = 'Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991.'
# Number of characters
print(len(text1))

# Split into words
text2 = text1.split(' ')
print('Number of words:', len(text2))
# Words longer than 3 characters
print([w for w in text2 if len(w) > 3])
# Title-case words (initial capital followed by lowercase letters)
print([w for w in text2 if w.istitle()])
# Words that end with the letter s
print([w for w in text2 if w.endswith('s')])

# Count unique words
text3 = 'TO be or not to be'
text4 = text3.split(' ')
print('Number of words:', len(text4))
print('Number of unique words:', len(set(text4)))
# Count unique words, ignoring case
print(set([w.lower() for w in text4]))
print(len(set([w.lower() for w in text4])))
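
Splitting on a single space leaves punctuation attached to some tokens, such as 'programming,' and '1991.' in text1. The following is a minimal sketch of one way to trim it, assuming that stripping punctuation only from the ends of each token is acceptable; `tokenize` is a hypothetical helper name, not part of the original example.

```python
import string

def tokenize(text):
    """Split on any whitespace and trim punctuation from both ends of each token.

    Hypothetical helper for illustration only.
    """
    tokens = (w.strip(string.punctuation) for w in text.split())
    # Drop tokens that were pure punctuation and are now empty
    return [w for w in tokens if w]

text1 = 'Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991.'
words = tokenize(text1)
print('Number of words:', len(words))
print(words)

# Unique words, ignoring case, using a set comprehension
unique_lower = {w.lower() for w in tokenize('TO be or not to be')}
print(sorted(unique_lower))
```

Hyphens inside a word (for example 'high-level') are kept, because `str.strip` only removes characters from the two ends of the string.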

2. Text Cleaning

text5 = '            A quick brown fox jumped over the lazy dog.  '
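# Splitting on a single space yields an empty string for every extra space
print(text5.split(' '))
print(text5)
# Strip leading and trailing whitespace
text6 = text5.strip()
print(text6)
print(text6.split(' '))
# Remove the trailing newline (rstrip() returns a new string, so reassign it)
text7 = 'This is a line\n'
text7 = text7.rstrip()
print(text7)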
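
Two details of the cleaning steps above are easy to miss: `split(' ')` keeps an empty string for every extra space, and `rstrip()` with no argument removes all trailing whitespace, not only the newline. The sketch below illustrates both; the variable names and the sample `line` string are assumptions made for illustration.

```python
text5 = '            A quick brown fox jumped over the lazy dog.  '

# split(' ') treats every single space as a separator, so padding produces ''
raw_tokens = text5.split(' ')
# split() with no argument splits on runs of whitespace and drops empty strings
clean_tokens = text5.split()
# Rejoin to collapse all whitespace runs into single spaces
normalized = ' '.join(clean_tokens)
print(raw_tokens)
print(clean_tokens)
print(normalized)

line = 'This is a line  \n'
# rstrip('\n') removes only trailing newline characters
only_newline_removed = line.rstrip('\n')
# rstrip() removes every trailing whitespace character
all_trailing_removed = line.rstrip()
print(repr(only_newline_removed))
print(repr(all_trailing_removed))
```

Strings are immutable in Python, so each of these methods returns a new string and leaves the original unchanged.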

3. Regular Expressions

text8 = '"Ethics are built right into the ideals and objectives of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
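print(text8)
text9 = text8.split(' ')
print(text9)
# Find specific tokens
# Tokens that start with #
print([w for w in text9 if w.startswith('#')])
# Tokens that start with @ (this also matches the lone '@')
print([w for w in text9 if w.startswith('@')])
# Match tokens by the pattern of the characters after @
# Pattern rule: @ followed by one or more letters, digits, or underscores
import re
print([w for w in text9 if re.search('@[A-Za-z0-9_]+', w)])
text10 = 'ouagadougou'
# All vowels
print(re.findall('[aeiou]', text10))
# All characters that are not vowels
print(re.findall('[^aeiou]', text10))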
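
As an alternative to splitting first and filtering afterwards, `re.findall` can be applied to the whole string. This is a minimal sketch using the same character class as above; the names `mentions`, `hashtags`, and `vowel_counts` are assumptions made for illustration.

```python
import re
from collections import Counter

text8 = '"Ethics are built right into the ideals and objectives of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'

# @ or # followed by at least one letter, digit, or underscore
mentions = re.findall(r'@[A-Za-z0-9_]+', text8)
hashtags = re.findall(r'#[A-Za-z0-9_]+', text8)
print(mentions)
print(hashtags)

# Count how often each vowel occurs
vowel_counts = Counter(re.findall(r'[aeiou]', 'ouagadougou'))
print(vowel_counts)
```

The lone '@' in '@ NY' is skipped because the pattern requires at least one matching character after the '@'.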