Python Web Scraping Notes (Deep Crawling of Wikipedia Pages)

#! /usr/bin/env python
# coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import re
import datetime
import random

# Seed the generator with the current time so each run takes a
# different random walk through Wikipedia.
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urllib2.urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only internal article links: hrefs that start with /wiki/
    # and contain no colon (colons mark special pages such as Talk:).
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Follow a randomly chosen link from the current page.
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
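The href filter is what keeps the crawl on ordinary article pages. As a quick illustration (a minimal sketch, separate from the crawler itself; the sample paths are made up for demonstration), the snippet below runs the same regular expression against a few hrefs:

import re

# The same pattern getLinks uses: links that start with /wiki/
# and contain no colon anywhere after it.
pattern = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical sample hrefs, chosen only to show what the filter accepts.
samples = [
    "/wiki/Kevin_Bacon",        # ordinary article link -> matches
    "/wiki/Category:Actors",    # contains a colon (special page) -> rejected
    "/w/index.php?title=Main",  # not under /wiki/ -> rejected
]
for href in samples:
    print(href + " -> " + str(bool(pattern.match(href))))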

PSEUDORANDOM NUMBERS AND RANDOM SEEDS
In the previous example, I used Python’s random number generator to select an article at random on each page in
order to continue a random traversal of Wikipedia. However, random numbers should be used with caution.
While computers are great at calculating correct answers, they’re terrible at just making things up. For this reason,
random numbers can be a challenge. Most random number algorithms strive to produce an evenly distributed and
hard-to-predict sequence of numbers, but a “seed” number is needed to give these algorithms something to work
with initially. The exact same seed will produce the exact same sequence of “random” numbers every time, so for
this reason I’ve used the system clock as a starter for producing new sequences of random numbers, and, thus, new
sequences of random articles. This makes the program a little more exciting to run.
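To see this reproducibility directly, here is a minimal sketch (not part of the crawler) showing that a fixed seed replays the same sequence, while a clock-based seed changes from run to run:

import datetime
import random

# A fixed seed replays exactly the same "random" sequence:
random.seed(42)
print([random.randint(0, 9) for _ in range(5)])
random.seed(42)
print([random.randint(0, 9) for _ in range(5)])  # identical to the line above

# Seeding from the system clock, as the crawler does, varies every run:
random.seed(datetime.datetime.now())
print([random.randint(0, 9) for _ in range(5)])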
For the curious, the Python pseudorandom number generator is powered by the Mersenne Twister algorithm. While
it produces random numbers that are difficult to predict and uniformly distributed, it is slightly processor intensive.
Random numbers this good don’t come cheap!
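If you want to peek at that machinery, Python exposes the generator's internal state. The sketch below assumes CPython's random module, where getstate() returns the Mersenne Twister's 624-word state vector plus an index:

import random

state = random.getstate()
print(state[0])       # state-format version number used by CPython
print(len(state[1]))  # 625 entries: 624 words of Twister state plus an index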