Natural Language Processing Assignment A1

阿新 • Published: 2019-01-02

Task 1: Convert the HTML into JSON data, then use Python's json package to turn the JSON data into data structures Python can use (dicts, lists…) (chaos2json.py)

Your implementation should have at least one regular expression (to extract the textual content of each line), and use NLTK’s word_tokenize function as the tokenizer. You may also use built-in string methods/operations and write your own helper functions.
The word_tokenize function does not separate hyphens, but this text uses hyphens in place of dashes, so your code should separate them.
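
One way to meet the hyphen requirement is to post-process the token list rather than the raw HTML. The sketch below is only an illustration: `split_hyphens` is a hypothetical helper name, and it assumes the tokens have already been produced by a tokenizer such as `word_tokenize`.

```python
import re


def split_hyphens(tokens):
    """Split every token on runs of hyphens, keeping the hyphens as tokens.

    Hypothetical helper: intended to run on the output of a tokenizer
    that leaves hyphenated words in one piece.
    """
    result = []
    for token in tokens:
        # The capturing group makes re.split keep the hyphen runs;
        # empty strings (from leading/trailing hyphens) are dropped.
        result.extend(part for part in re.split(r'(-+)', token) if part)
    return result


print(split_hyphens(['studding-sail', ',', 'choir']))
```

Unlike replacing hyphens in the HTML before tokenizing, this keeps the original line text unchanged and only affects the tokens.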

Hint 1: The HTML contains (nonstandard) tags like <xxx1> at the beginning of each line. The number is the line within the stanza (between 1 and 4). Ignore ellipsis lines indicating removed stanzas.
Hint 2: When converting to JSON, use the indent argument to make it more human-readable.
(This script should not take extremely long to implement, but it will probably take you longer than you expect.)
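
The extraction step from Hint 1 could be sketched as follows. This is a minimal illustration, not the assignment's reference solution: it assumes the line tags have the form `<xxx1>` to `<xxx4>` (as the `<xxx.>` pattern used in the script suggests) and that each line ends with `<br>` or `</p>`. The name `extract_lines` and the sample HTML are made up for the example.

```python
import re

# Assumed tag shape: <xxxN>, where N is the line number within the stanza.
LINE_RE = re.compile(r'<xxx(\d)>(.*?)(?:<br>|</p>)')


def extract_lines(html):
    """Return (line_number, text) pairs for every tagged line in the HTML."""
    return [(int(num), text.strip()) for num, text in LINE_RE.findall(html)]


# Hypothetical two-line sample, not taken from the real chaos.html.
sample = ('<p><xxx1>Recipe, pipe, studding-sail, choir;<br>\n'
          '<xxx2>A second line here.</p>')
print(extract_lines(sample))
```

Capturing the digit lets the line number come straight from the tag instead of from a separate counter.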

from urllib import request
from bs4 import BeautifulSoup
from nltk import word_tokenize
import re
import json

url = 'file:///E:/學習文件/資料集/a1/chaos.html'

# Open the URL and return the raw HTML
def open_url(url):
    # Build a request for the given URL
    req = request.Request(url)
    # Add a User-Agent header so the request looks like it comes from a browser
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36')
    # Send the request
    response = request.urlopen(req)
    # Return the fetched HTML (as bytes)
    return response.read()

# Locate the poem lines with a regular expression
def find_tag(url, regex = '<xxx.>(.*?)(<br>|</p>)'):
    # (.*?)
    # .  matches any character except \n
    # *  matches zero or more of the preceding item
    # ?  placed after * makes the match non-greedy (on its own it means zero or one of the preceding item)
    # () is a capturing group: re.findall returns the text matched inside each group
    # So (.*?) captures the shortest run of non-newline characters before the next <br> or </p>
    html = open_url(url).decode('utf-8')
    # Hyphen filter:
    # turns "Recipe, pipe, studding-sail, choir" into "Recipe, pipe, studding sail, choir"
    html = re.compile('-').sub(' ', html)
    result = re.findall(regex, html)
    return result

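# Find the rhyme word: a trailing token that is only punctuation is not a rhyme word, so skip it
def find_rhymeWord(tokens):
    # Walk backwards from the end of the line
    for token in reversed(tokens):
        # The rhyme word is the last token that contains at least one letter
        if any(ch.isalpha() for ch in token):
            return token
    # No word found on this line
    return None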



# Build the JSON entry for one line of the poem
def build_line(fragment, numStanza, count):
    # Strip leftover HTML such as <tt>...</tt> with BeautifulSoup,
    # then remove non-breaking spaces (\xa0) with re.sub
    text = re.sub('\\xa0', '', BeautifulSoup(fragment, "lxml").get_text())
    tokens = word_tokenize(text)
    return {"lineId": '{}-{}'.format(numStanza, count), "lineNum": count, "text": text,
            "tokens": tokens, "rhymeWord": find_rhymeWord(tokens)}


tag = find_tag(url)
setDict = []
numStanza = 1
count = 1  # line number within the stanza; this really should be read from the <xxx.> tag, but I was lazy
switch = 0  # becomes 1 once the current stanza has more than one line
for i in range(len(tag)):
    # Start of a stanza
    if tag[i][0][0:3] == '<p>':
        if switch == 1:
            setDict.append(dictionary)
            numStanza += 1
            switch = 0
        count = 1
        dictionary = dict()
        dictionary['stanza'] = numStanza
        dictionary["lines"] = [build_line(tag[i][0], numStanza, count)]
    else:
        switch = 1
        count += 1
        dictionary["lines"].append(build_line(tag[i][0], numStanza, count))

# A stanza is only stored when the next one begins, so store the last one here
if switch == 1:
    setDict.append(dictionary)

js = json.dumps(setDict, indent=4)
print(js)

Task 2: Look up every rhyming word in cmudict and add its possible pronunciations to the JSON data (allpron.py)

How many rhyming words are NOT found in cmudict (they are “out-of-vocabulary”, or “OOV”)? In your code, leave a comment indicating how many and give a few examples.

import cmudict
import json

# cmudict.dict() maps each lowercase word to a list of its pronunciations,
# where each pronunciation is a list of phonemes such as ['AE1', 'P', 'AH0', 'L']
prons = cmudict.dict()

# js is the output of the previous task
setDict = json.loads(js)
list_OOV = []
for i in setDict:
    for j in i['lines']:
        word = j['rhymeWord']
        # The word may be missing from cmudict (out-of-vocabulary)
        if word is not None and word.lower() in prons:
            j['rhymeProns'] = prons[word.lower()]
        else:
            # An empty list marks an OOV rhyme word
            j['rhymeProns'] = []
            list_OOV.append(word)

print(list_OOV)

['Terpsichore',
'reviles',
'endeavoured',
'tortious',
'clangour',
'hygienic',
'inveigle',
'mezzotint',
'Cholmondeley',
'obsequies',
'dumbly',
'vapour',
'fivers',
'gunwale']

Task 3: Use a heuristic to decide whether two pronunciations rhyme; near rhymes do not count (exact_rhymes.py)

How many pairs of lines that are supposed to rhyme actually have rhyming pronunciations according to your heuristic? For how many lines does having the rhyming line help you disambiguate between multiple possible pronunciations? What are some reasons that your heuristic is imperfect?

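I did not really feel like doing this one. A possible approach is to compare the final sounds of the rhyme words of two lines, but how many final phonemes should be compared? One could define a rule: going from the end backwards, the phonemes must all be identical, and at the first mismatch, check whether that phoneme is a non-vowel. (A differing final non-vowel can still rhyme, for example s with z, but that would be a separate rule.)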
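
One possible version of such a rule, sketched under assumptions: it is not the approach the assignment prescribes, but a common heuristic in which two pronunciations rhyme exactly when they are identical from the last stressed vowel to the end. It assumes cmudict-style pronunciations, i.e. lists of ARPAbet phonemes where vowels end in a stress digit (0, 1 or 2). The function names are hypothetical.

```python
def rhyme_part(pron):
    """Return the phonemes from the last stressed vowel to the end."""
    # Prefer the last vowel with primary (1) or secondary (2) stress.
    for idx in range(len(pron) - 1, -1, -1):
        if pron[idx][-1] in '12':
            return pron[idx:]
    # No stressed vowel: fall back to the last vowel of any stress.
    for idx in range(len(pron) - 1, -1, -1):
        if pron[idx][-1].isdigit():
            return pron[idx:]
    # No vowel at all: compare the whole pronunciation.
    return pron


def exact_rhyme(pron_a, pron_b):
    """True if both pronunciations share the same rhyme part."""
    return rhyme_part(pron_a) == rhyme_part(pron_b)


def rhyming_pairs(prons_a, prons_b):
    """All (a, b) combinations of candidate pronunciations that rhyme."""
    return [(a, b) for a in prons_a for b in prons_b if exact_rhyme(a, b)]


print(exact_rhyme(['P', 'AY1', 'P'], ['R', 'AY1', 'P']))
```

When `rhyming_pairs` returns a single pair for two words that each have several pronunciations, the rhyming line has effectively disambiguated them. The heuristic is imperfect: it treats near rhymes such as s with z as failures, and it depends on the stress marks in the dictionary being right for the poem's reading.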
