上一篇講述了Python的運行環境,這一篇要記錄如何使用Python的自動化來抓取網頁資料。我們先建立好運作的環境,打開Anaconda Navigator,點選Environments,選擇要運作的環境後,在右邊的外掛目錄中搜尋selenium並安裝。

安裝外掛與瀏覽器驅動

接下來要安裝瀏覽器自動化的驅動程式,分別有Chrome和Firefox兩個選擇。

Chrome:http://chromedriver.chromium.org/downloads
Firefox:https://github.com/mozilla/geckodriver/releases

下載好後解壓縮並放在自己習慣的路徑,打開 jupyter notebook,新建檔案進行編程

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# 驅動要使用絕對路徑

# Chrome
driver_path = "/Users/Alex/Desktop/python/chromedriver"
driver = webdriver.Chrome(executable_path = driver_path)

# Firefox
driver_path = "/Users/Alex/Desktop/python/geckodriver"
driver = webdriver.Firefox(executable_path = driver_path)

# 使用driver開啟網頁
driver.get("http://www.imdb.com/")

 

使用Python對DOM進行操作

在Python中要對DOM進行操作,可以使用 CSS 選擇器 或是 Xpath 選擇器 對DOM進行選取

# 在搜尋列輸入La La Land
query_elem = driver.find_element_by_css_selector('#navbar-query')
query_elem.send_keys("La La Land")

# xpath版本
xpath_elem = driver.find_element_by_xpath("//input[@id='navbar-query']")
xpath_elem.send_keys("La La Land")

點擊事件寫法跟JS相似:

submit_btn = driver.find_element_by_xpath("//button[@id='navbar-submit-button']")
submit_btn.click()

first_result_elem = driver.find_element_by_css_selector("#findSubHeader+ .findSection .odd:nth-child(1) .result_text a")
first_result_elem.click()

取得DOM中的文字內容:

rating = driver.find_element_by_xpath("//strong/span")
rating.text

最後輸出為JSON檔:

import json

# 檔案名稱為"movie.json",寫入檔案
# json.dump()輸出為json客式並寫入外部檔案
with open("movie.json", 'w') as outfile:
    json.dump(movies_data, outfile)

 

使用 time 功能避免被Ban

在使用Python自動化功能抓取網站大量資料時,可能會被認為是惡意攻擊而被Ban(封鎖),所以會使用time function 讓Python 進行定時休息,避免被認為是機器人而被封鎖:

def get_movie_info(movie_name):
    
    from selenium import webdriver
    # 載入 time 功能
    import time
    
    driver_path = "/Users/SCE/Downloads/geckodriver.exe"
    driver = webdriver.Firefox(executable_path = driver_path)
    driver.get("http://www.imdb.com/")
    
    query_elem = driver.find_element_by_css_selector('#navbar-query')
    query_elem.send_keys(movie_name)
    # 讓Python休息3秒
    time.sleep(3)
    submit_btn = driver.find_element_by_xpath("//button[@id='navbar-submit-button']")
    submit_btn.click()
    # 讓Python休息3秒 
    time.sleep(3)
    first_result = driver.find_element_by_xpath("//div[@id='main']/div/div[2]/table/tbody/tr[1]/td[2]/a")
    first_result.click()
    # 讓Python休息3秒 
    time.sleep(3)
    rating_elem = driver.find_element_by_xpath("//strong/span")
    rating_elem.text
    rating = float(rating_elem.text)
    genre_elem = driver.find_elements_by_xpath("//div[@class='subtext']/a/span[@class='itemprop']")
    genre = [g.text for g in genre_elem]
    cast_elem = driver.find_elements_by_xpath("//*[@id='titleCast']/table/tbody/tr/td[2]/a/span")
    cast = [c.text for c in cast_elem]
    
    driver.close()
    movie_info = {
        "movie_name" : movie_name,
        "rating" : rating,
        "genre" : genre,
        "cast" : cast,
    }
    return movie_info

get_movie_info('The Shawshank Redemption')

突然覺得,使用Python抓取網頁的資料好像也不太難了呢~

Last modified: 2018-06-01

Author

Comments

Write a Reply or Comment

Your email address will not be published.