import datetime
import io
import random
import re
import sys
from urllib.parse import urlparse
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
pages = set()
# Seed with a timestamp; random.seed() no longer accepts datetime objects
# directly on Python 3.11+.
random.seed(datetime.datetime.now().timestamp())
# Force UTF-8 output so link text prints cleanly on consoles that default
# to another encoding (e.g. Windows).
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# A browser-like User-Agent; some sites refuse urllib's default one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
def getInternalLinks(bsObj, includeUrl):
    """Collect 'text | url' entries for links that stay on the same site."""
    parsed = urlparse(includeUrl)
    includeUrl = parsed.scheme + "://" + parsed.netloc
    internalLinks = []
    seen = set()
    # Match hrefs that are site-relative ("/...") or contain this site's root;
    # escape the root so dots in the domain are not treated as regex wildcards.
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + re.escape(includeUrl) + ")")):
        href = link.attrs['href']
        # Track raw hrefs in a set: comparing against the formatted
        # "text | url" entries would never deduplicate anything.
        if href is None or href in seen:
            continue
        seen.add(href)
        if href.startswith("/"):
            internalLinks.append("{} | {}".format(link.string, includeUrl + href))
        else:
            internalLinks.append("{} | {}".format(link.string, href))
    return internalLinks
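# A minimal sketch of how getInternalLinks behaves, using a made-up in-memory
# page instead of a live fetch (the HTML and URLs below are illustrative only):
#
#   sample = BeautifulSoup(
#       '<a href="/about">About</a>'
#       '<a href="http://example.com/blog">Blog</a>'
#       '<a href="http://other.org/">Elsewhere</a>',
#       "html.parser")
#   getInternalLinks(sample, "http://example.com/index.html")
#   # -> ['About | http://example.com/about', 'Blog | http://example.com/blog']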
def getExternalLinks(bsObj, excludeUrl):
    """Collect 'text | url' entries for links that leave the current site."""
    externalLinks = []
    seen = set()
    # Match absolute hrefs in which the excluded domain never appears.
    pattern = re.compile("^(http|www)((?!" + re.escape(excludeUrl) + ").)*$")
    for link in bsObj.findAll("a", href=pattern):
        href = link.attrs['href']
        if href is None or href in seen:
            continue
        seen.add(href)
        externalLinks.append("{} | {}".format(link.string, href))
    return externalLinks
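# The same made-up page from the sketch above, run through getExternalLinks
# with example.com excluded:
#
#   getExternalLinks(sample, "example.com")
#   # -> ['Elsewhere | http://other.org/']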
def getRandomExternalLink(startingPage):
    req = Request(startingPage, headers=headers)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    externalLinks = getExternalLinks(bsObj, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print("No external links found; traversing the site's internal links instead")
        domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
        internalLinks = getInternalLinks(bsObj, domain)
        # Entries are "text | url" strings, so recover the bare URL before recursing.
        nextPage = random.choice(internalLinks).rsplit(" | ", 1)[-1]
        return getRandomExternalLink(nextPage)
    else:
        return random.choice(externalLinks)

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link: " + externalLink)
    # Recurse on the bare URL, not the "text | url" display string.
    followExternalOnly(externalLink.rsplit(" | ", 1)[-1])
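# Hypothetical entry point for the random walk; the start URL is a placeholder.
# Note that followExternalOnly recurses with no base case, so a long walk will
# eventually hit Python's recursion limit:
#
#   followExternalOnly("http://example.com")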
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    req = Request(siteUrl, headers=headers)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    domain = urlparse(siteUrl).scheme + "://" + urlparse(siteUrl).netloc
    internalLinks = getInternalLinks(bsObj, domain)
    # Pass the bare host, not the full domain URL: the exclusion regex looks
    # for the host inside each href, and the scheme prefix defeats that.
    externalLinks = getExternalLinks(bsObj, urlparse(siteUrl).netloc)
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print("External link | " + link)
    print('---' * 40)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            print("Internal link | " + link)
if __name__ == '__main__':
    hf = sys.argv[1]
    getAllExternalLinks(hf)
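# Example invocation (the script name and URL are placeholders):
#
#   python link_crawler.py http://example.com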