
Building Databases from unstructured text using…

Moreover, my dataset did not include creative information for productions. As a result, my aggregated findings were likely categorized too broadly, and therefore erroneously, rendering my results non-generalizable. Put simply, a musical with a cast of 30, a running time of 2.5 hours, and one intermission can be considered categorically different from a musical with a cast of 5, a running time of 90 minutes, and no intermission.

Thus, I would need an updated dataset of gross revenues, in addition to a dataset containing creative information for shows, before I could generalize my analysis to the current Broadway market.

Step 4A: Get Better Data — Scrape the Web

Using my preliminary analysis as an indicator of which data would be required, I visited the data-housing websites described earlier to determine which would be most appropriate for web scraping.

I chose to scrape Broadway World, which consists of static HTML pages, because of my comfort with HTML. The other Broadway data sites render their data on dynamic pages through JavaScript.

My programming language of choice for web scraping was Python, because of its speed with large data sets, its sophisticated machine learning capabilities, and its clever web scraping packages.

My packages of choice are the hallmarks of web scraping: BeautifulSoup, urllib, requests, re, and pandas.

Step 4B: Scrape Broadway Grosses

My first step in scraping Broadway grosses was to compile a list of the URLs for every Broadway show in Broadway World’s index: show_links. To build it, I first built a second list of index-page URLs, one for each first letter of a show’s name (a-z, #), 27 pages in total: list_loop_az. Scraping while iterating over list_loop_az produced the full list of show links, 13,568 URLs in length.

Below is a snippet of code where I get a list of URLs for all Broadway shows:

from bs4 import BeautifulSoup
import requests
import re
import string

# Begin: build the 27 index-page URLs (a-z, plus '1' for the page of
# shows whose names begin with a number). url_base is the Broadway World
# index URL defined earlier.
page_base = url_base
abc = list(string.ascii_lowercase)
abc.append('1')
list_loop_az = []
for a in abc:
    list_loop_az.append(page_base + a)

# Now, write a function that collects every link whose href matches a tag:
def getLinks_tagged(url, tag):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    for link in soup.find_all('a', attrs={'href': re.compile(tag)}):
        links.append(link.get('href'))
    return links

# Now use the function:
show_links_nested = []
tag = 'https://www.broadwayworld.com/grosses/'
for page in list_loop_az:
    show_links_nested.append(getLinks_tagged(page, tag))
show_links = sum(show_links_nested, [])

# You now have a list of all show links: show_links

Parsing the gross data for this list with pandas’ read_html function was fairly straightforward. I characterized the tables on each page, selected the one containing gross information, and wrote a sleek hygiene script that cleaned, validated, and type-converted the data. Putting it all together, I had a neat script that returned a csv of all Broadway grosses (file available here).
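For illustration, here is a minimal sketch of that parsing-and-hygiene step. The table-selection heuristic, the column-cleaning rules, and the output filename are my assumptions, not the exact script:

import pandas as pd

def scrape_grosses(show_url):
    # read_html returns every <table> on the page as a DataFrame
    tables = pd.read_html(show_url)
    # Heuristic (assumption): keep the table whose columns mention a gross figure
    grosses = next(t for t in tables
                   if any('Gross' in str(c) for c in t.columns))
    grosses.columns = [str(c).strip() for c in grosses.columns]
    # Hygiene: strip currency formatting, coerce mostly-numeric columns
    # to numbers, and drop rows that are entirely empty
    for col in grosses.columns:
        if grosses[col].dtype == object:
            cleaned = grosses[col].str.replace(r'[$,%]', '', regex=True)
            numeric = pd.to_numeric(cleaned, errors='coerce')
            if numeric.notna().mean() > 0.8:
                grosses[col] = numeric
    return grosses.dropna(how='all')

# Stitch every show's table into one frame and write it out:
all_grosses = pd.concat(scrape_grosses(url) for url in show_links)
all_grosses.to_csv('broadway_grosses.csv', index=False)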

The dataset contains shows from June 1984 until the present. Its dimensions are 14 non-redundant fields by 14,874 rows (half the vertical size of the CORGIS set, for reasons I haven’t yet investigated).
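As a quick sanity check on the finished file (assuming the filename from the sketch above and a 'Week Ending' date column, both of which are hypothetical names), something like this confirms those dimensions:

import pandas as pd

df = pd.read_csv('broadway_grosses.csv')
print(df.shape)  # expect roughly (14874, 14)
# Assuming the week-ending column parses as a date:
dates = pd.to_datetime(df['Week Ending'], errors='coerce')
print(dates.min(), dates.max())  # expect June 1984 through the present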

Mr. Krabs (Bryan Ray Norris) from SpongeBob the Musical would have been very happy with my code, which retrieved all the financial information for Broadway shows.