Using a regular expression to strip Chinese punctuation and extract numbers

阿新 • Published: 2019-02-04

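# -*- coding: utf-8 -*-
import re

# Field names to look for in each record.
all_word = ["全水", "分析水", "灰分", "揮發", "固定碳", "焦渣特徵", "硫", "低位熱量"]

# ASCII and full-width punctuation to strip before reading the number
# (\u2014 is the long dash). The decimal point "." is deliberately left
# out of the class so that values such as "22.21" stay intact.
punctuation = re.compile(r"[\s!/_,$%^*(+\"'):：\u2014?【】“”！，。？、~@#￥…&（）]+")
# An integer with an optional fractional part, e.g. "5053" or "22.21".
pattern = re.compile(r"\d+(?:\.\d+)?")

with open("D:/資料/山西/data_no_null.txt", "r", encoding="utf8") as file:
    for line in file:
        # Example line:
        # "全水22.21，分析水8.06，灰分8.87，揮發33.44，固定碳53.12，焦渣特徵2，硫0.82，低位熱量5053。"
        string = line.strip()
        for word in all_word:
            word_hash = {}  # keyword start offset -> number that follows it
            for result in re.finditer(re.escape(word), string):
                # Take the text after the keyword and remove punctuation,
                # so the number (if any) sits at the very start.
                son_string = punctuation.sub("", string[result.end():])
                number = pattern.match(son_string)
                if number is not None:
                    print(string)
                    word_hash[result.start()] = number.group()
                    print(word + ":" + number.group())
                else:
                    print("No result")
            # print(list(word_hash.items()))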
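
The same idea can be tried without the data file. The following is a minimal sketch, not the author's code: the helper name extract_numbers and the module-level names PUNCTUATION and NUMBER are assumptions introduced for illustration, and it only looks at the first occurrence of each keyword. It runs on the sample line quoted in the comment of the script.

```python
import re

# Field names taken from the script above.
KEYWORDS = ["全水", "分析水", "灰分", "揮發", "固定碳", "焦渣特徵", "硫", "低位熱量"]

# Assumed punctuation class (\u2014 is the long dash); "." is left out
# on purpose so decimals such as "22.21" are not broken apart.
PUNCTUATION = re.compile(r"[\s!/_,$%^*(+\"'):：\u2014?【】“”！，。？、~@#￥…&（）]+")
# An integer with an optional fractional part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(line, keywords=KEYWORDS):
    """Return {keyword: number string} using the first occurrence of each keyword."""
    found = {}
    for word in keywords:
        hit = re.search(re.escape(word), line)
        if hit is None:
            continue
        # Drop punctuation from the text after the keyword, then read
        # the number that now sits at the start, if there is one.
        rest = PUNCTUATION.sub("", line[hit.end():])
        number = NUMBER.match(rest)
        if number is not None:
            found[word] = number.group()
    return found


sample = "全水22.21，分析水8.06，灰分8.87，揮發33.44，固定碳53.12，焦渣特徵2，硫0.82，低位熱量5053。"
for key, value in extract_numbers(sample).items():
    print(key + ":" + value)
```

Returning a dictionary instead of printing inside the loop makes the result easy to reuse, for example to write one row per input line. One limitation worth noting: a short keyword such as "硫" would also match inside a longer word that happens to contain it, so real data may need stricter patterns.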
