
Building Databases from unstructured text using…

Moreover, my dataset did not include creative information for productions. As a result, my aggregated findings were likely categorized too broadly, and therefore erroneously, rendering my results non-generalizable. Put simply, a musical with a cast of 30, a running time of 2.5 hours, and one intermission can be considered categorically different from a musical with a cast of 5, a running time of 90 minutes, and no intermission.

Thus, I would need an updated dataset of gross revenues, in addition to a dataset containing creative information for shows, before I could generalize my analysis to the current Broadway market.

Step 4A: Get Better Data — Scrape the Web

Using my preliminary analysis as an indicator of which data would be required, I visited the data-housing websites described earlier to determine which would be most appropriate for web scraping.

I chose to scrape Broadway World, which consists of static HTML pages, because of my comfort with HTML. The other Broadway data sites render their data on dynamic pages through JavaScript.

My programming language of choice for web scraping was Python, because of its speed with large data sets, its sophisticated machine learning capabilities, and its clever web scraping packages.

My packages of choice are the hallmarks of web scraping: BeautifulSoup, urllib, requests, re, and pandas.

Step 4B: Scrape Broadway Grosses

My first step in scraping Broadway grosses was to compile a list of the URLs for every Broadway show in Broadway World’s index: show_links. To build it, I first built a second list of index-page URLs, one for each first letter of a show’s name (a-z, #), 27 pages in total: list_loop_az. Scraping while iterating over list_loop_az produced the full list of show links, 13,568 URLs in length.

Below is a snippet of code where I get a list of URLs for all Broadway shows:

from bs4 import BeautifulSoup
import requests
import re
import string

# Begin: build the 27 index-page URLs (a-z, plus '1' for the page of
# shows whose names begin with a number). url_base is the Broadway World
# index URL defined earlier.
page_base = url_base
abc = list(string.ascii_lowercase)
abc.append('1')
list_loop_az = []
for a in abc:
    list_loop_az.append(page_base + a)

# Now, write a function that collects every link whose href matches a tag:
def getLinks_tagged(url, tag):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    for link in soup.find_all('a', attrs={'href': re.compile(tag)}):
        links.append(link.get('href'))
    return links

# Now use the function:
show_links_nested = []
tag = 'https://www.broadwayworld.com/grosses/'
for page in list_loop_az:
    show_links_nested.append(getLinks_tagged(page, tag))
show_links = sum(show_links_nested, [])

# You now have a list of all show links: show_links

Parsing the gross data for this list with pandas’ read_html function was fairly straightforward. I characterized the tables on each page, selected the one containing gross information, and wrote a sleek hygiene script that cleaned, validated, and type-converted the data. Putting it all together, I had a neat script that returned a csv of all Broadway grosses (file available here).
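For illustration, here is a minimal sketch of that parsing-and-hygiene step. The table-selection heuristic, the column-cleaning rules, and the output filename are my assumptions, not the exact script:

import pandas as pd

def scrape_grosses(show_url):
    # read_html returns every <table> on the page as a DataFrame
    tables = pd.read_html(show_url)
    # Heuristic (assumption): keep the table whose columns mention a gross figure
    grosses = next(t for t in tables
                   if any('Gross' in str(c) for c in t.columns))
    grosses.columns = [str(c).strip() for c in grosses.columns]
    # Hygiene: strip currency formatting, coerce mostly-numeric columns
    # to numbers, and drop rows that are entirely empty
    for col in grosses.columns:
        if grosses[col].dtype == object:
            cleaned = grosses[col].str.replace(r'[$,%]', '', regex=True)
            numeric = pd.to_numeric(cleaned, errors='coerce')
            if numeric.notna().mean() > 0.8:
                grosses[col] = numeric
    return grosses.dropna(how='all')

# Stitch every show's table into one frame and write it out:
all_grosses = pd.concat(scrape_grosses(url) for url in show_links)
all_grosses.to_csv('broadway_grosses.csv', index=False)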

The dataset contains shows from June 1984 until the present. Its dimensions are 14 non-redundant fields by 14,874 rows (half the vertical size of the CORGIS set, for reasons I haven’t yet investigated).
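As a quick sanity check on the finished file (assuming the filename from the sketch above and a 'Week Ending' date column, both of which are hypothetical names), something like this confirms those dimensions:

import pandas as pd

df = pd.read_csv('broadway_grosses.csv')
print(df.shape)  # expect roughly (14874, 14)
# Assuming the week-ending column parses as a date:
dates = pd.to_datetime(df['Week Ending'], errors='coerce')
print(dates.min(), dates.max())  # expect June 1984 through the present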

Mr. Krabs (Bryan Ray Norris) from SpongeBob the Musical would have been very happy with my code, which retrieved all the financial information for Broadway shows.