
Python web scraping: Selenium notes

2024-09-27 15:00:05 | Python资料 | 90 views

These are my notes on using Selenium for web scraping in Python, shared here for reference.

TODO

Read up on XPath and get comfortable with it; it is closely tied to Selenium.

Selenium

Disabling image loading in Selenium makes pages load much faster.

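from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Disable image loading so pages load much faster.
option = Options()
# "none" makes driver.get() return without waiting for the page to finish loading,
# so wait for the elements you need before touching them.
option.page_load_strategy = "none"
prefs = {"profile.managed_default_content_settings.images": 2}  # 2 = block images
option.add_experimental_option("prefs", prefs)  # apply the no-image preference

# Selenium 4 expects options=; the older chrome_options= keyword is deprecated
# and rejected by recent releases.
driver = webdriver.Chrome(options=option)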

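Handling iframes with BeautifulSoup

One solution is to read the iframe's src attribute, then request and parse that URL on its own.
Another is to switch into the iframe with Selenium first: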

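from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

driver.get(url)
# find_elements_by_tag_name was removed in Selenium 4.3; use find_elements(By.TAG_NAME, ...)
iframe = driver.find_elements(By.TAG_NAME, "iframe")[1]  # the second iframe on the page
driver.switch_to.frame(iframe)  # the most important step
soup = BeautifulSoup(driver.page_source, "html.parser")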
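The first approach (reading each iframe's src and fetching it separately) could look roughly like the sketch below. To stay self-contained it uses the standard library's html.parser instead of BeautifulSoup, and the names IframeSrcParser and iframe_urls, as well as the example.com URLs, are made up for illustration. It only collects the absolute URLs; fetching and parsing them is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, base_url):
    """Return absolute URLs for all iframes found in page_html."""
    parser = IframeSrcParser()
    parser.feed(page_html)
    # Relative src values are resolved against the URL of the outer page.
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical page source, e.g. what driver.page_source might return.
html = (
    '<html><body><iframe src="/embed/a"></iframe>'
    '<iframe src="https://cdn.example.com/b"></iframe></body></html>'
)
print(iframe_urls(html, "https://example.com/page"))
# ['https://example.com/embed/a', 'https://cdn.example.com/b']
```

Each returned URL can then be requested and parsed like any ordinary page. This only helps when the iframe content is served as plain HTML; content rendered by JavaScript inside the frame still needs the switch_to.frame route.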

Mistakes and pitfalls I keep running into

driver.execute_script(JS) is the call that runs JavaScript.
Note that the method is execute_script, not execute.

Waiting for the page. This part is critical.

Explicit waits. They feel more verbose to me, and I rarely use them.

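from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)  # poll for up to 10 seconds
element = wait.until(EC.element_to_be_clickable((By.ID, "someid")))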
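To make the explicit wait less mysterious: WebDriverWait polls its condition (every 0.5 seconds by default) until the condition returns something truthy or the timeout expires. The sketch below imitates that loop in plain Python. It is a simplified stand-in, not Selenium's implementation, and wait_until and element_ready are names invented for the example.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for "the element is ready": becomes truthy on the third check.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_ready, timeout=2, poll=0.01))  # element
```

The useful property is that the wait ends as soon as the condition holds, instead of always sleeping for a fixed time.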

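Implicit waits. This is what I recommend. Avoid mixing them with explicit waits on the same driver, since the Selenium documentation warns that the combination can produce unpredictable wait times.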

driver.implicitly_wait(10) # seconds

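Locating elements

Add this line before locating elements, to be on the safe side.

bot.implicitly_wait(10)  # bot is the WebDriver instance; this line matters a lot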

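Ways to find elements (the find_element_by_* helpers were removed in Selenium 4.3; use find_element with a By locator)

from selenium.webdriver.common.by import By

driver.find_element(By.ID, value)
driver.find_element(By.NAME, value)               # the name attribute on the tag
driver.find_element(By.XPATH, value)
driver.find_element(By.LINK_TEXT, value)          # exact link text, e.g. 'Sign In'
driver.find_element(By.PARTIAL_LINK_TEXT, value)
driver.find_element(By.TAG_NAME, value)
driver.find_element(By.CLASS_NAME, value)
driver.find_element(By.CSS_SELECTOR, value)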
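For code that still carries the older find_element_by_* names, a small translation table can ease the migration. The sketch below maps each legacy suffix to the locator strategy string that Selenium's By constants stand for (By.TAG_NAME is "tag name", and so on). LEGACY_TO_STRATEGY and find_by_legacy_name are hypothetical helpers, not part of Selenium, and the block avoids importing selenium so it can be read and run on its own.

```python
# Legacy helper suffix -> locator strategy string (the values of Selenium's By constants).
LEGACY_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find_by_legacy_name(driver, legacy_name, value):
    """Translate e.g. 'find_element_by_id' into driver.find_element('id', value)."""
    prefix = "find_element_by_"
    if not legacy_name.startswith(prefix):
        raise ValueError(f"not a legacy finder name: {legacy_name}")
    strategy = LEGACY_TO_STRATEGY[legacy_name[len(prefix):]]
    return driver.find_element(strategy, value)
```

With a real driver, find_by_legacy_name(driver, "find_element_by_link_text", "Sign In") would end up calling driver.find_element("link text", "Sign In"). In new code it is clearer to write the By constants directly.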

Basic setup and imports

import os
import random
import json
import pickle
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import pyautogui as pt
import pyperclip

Switching frames

When you hit an iframe, it is best to switch into it; see https://blog.csdn.net/huilan_same/article/details/52200586

driver.switch_to.frame(0)  # select the frame by index; the first frame is 0
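An easy thing to forget after switching into a frame is switching back out. One possible way to make that automatic is a context manager, sketched below. inside_frame is a name invented here, and it assumes driver is a Selenium WebDriver (only switch_to.frame and switch_to.default_content are used).

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_ref):
    """Switch into a frame for the duration of a with-block, then switch back out."""
    driver.switch_to.frame(frame_ref)  # accepts an index, a name/id, or a WebElement
    try:
        yield driver
    finally:
        driver.switch_to.default_content()  # return to the top-level document
```

Usage would be along the lines of `with inside_frame(driver, 0): ...`, with all element lookups for that frame placed inside the block. For nested frames, driver.switch_to.parent_frame() may be the better way out, since default_content() always jumps to the top.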

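Clicking elements. For elements that a plain click() cannot reach, use the helper below.

def real_click(driver, ele):
    # Move the mouse over the element first, then click it.
    actions = ActionChains(driver)
    actions.move_to_element(ele)
    actions.click(ele)
    actions.perform()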
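Another common workaround for stubborn elements is to click through JavaScript. The sketch below tries the normal click first and falls back to execute_script. click_with_fallback is a made-up helper name, and the broad `except Exception` is an assumption to keep the block free of selenium imports; in real code it is better to catch selenium.common.exceptions.WebDriverException or a narrower subclass.

```python
def click_with_fallback(driver, ele):
    """Try a normal click first; if it raises, click through JavaScript instead."""
    try:
        ele.click()
        return "native"
    except Exception:  # assumption: narrow this to selenium's WebDriverException in real code
        # arguments[0] refers to the element passed as the extra argument.
        driver.execute_script("arguments[0].click();", ele)
        return "javascript"
```

A JavaScript click bypasses visibility and overlap checks, so it can succeed where a user could not actually click. That is fine for scraping, but worth remembering.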

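Running JavaScript: scrolling the page

# Scroll down first, then back to the top.
# To jump to the very bottom instead: window.scrollTo(0, document.body.scrollHeight);

js = "var q=document.documentElement.scrollTop=500"  # scroll down 500px
bot.execute_script(js)

js2 = "document.body.scrollTop=document.documentElement.scrollTop=0;"  # back to the top
bot.execute_script(js2)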
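For pages that load more content as you scroll, a common pattern is to keep scrolling to the bottom until the page height stops growing. The following is a minimal sketch under that assumption. scroll_to_bottom, pause and max_rounds are names chosen for the example, and the only driver call used is execute_script.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing; return the round count."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded in this round
        last_height = new_height
    return max_rounds
```

The max_rounds cap is there so an endless feed cannot keep the loop running forever. The right pause depends on how quickly the site loads new items.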

Filling in forms. I still need to read up on this.

from selenium.webdriver.common.by import By

element = driver.find_element(By.XPATH, "//select[@name='name']")
choices = element.find_elements(By.TAG_NAME, "option")
for c in choices:
    print("Value is: %s" % c.get_attribute("value"))
    c.click()  # select this option
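The loop above clicks every option in turn. To pick a single option by its value, something like the sketch below could work. select_option_by_value is a hypothetical helper that uses only WebElement calls, and the string "tag name" is the value behind By.TAG_NAME, written out so the block needs no selenium import. Selenium also ships a Select helper in selenium.webdriver.support.ui with a select_by_value method, which is usually the more direct choice.

```python
def select_option_by_value(select_element, wanted_value):
    """Click the <option> whose value attribute equals wanted_value; return True if found."""
    # "tag name" is the string behind selenium's By.TAG_NAME constant.
    for option in select_element.find_elements("tag name", "option"):
        if option.get_attribute("value") == wanted_value:
            option.click()
            return True
    return False
```

Returning False instead of raising makes it easy to log and skip dropdowns that lack the expected value.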

Wrapping up some methods I use often

@staticmethod
def save_html(bot):             # save the page HTML
    filename = 'ret.html'
    data = bot.page_source
    # An explicit encoding avoids write errors on pages with non-ASCII text.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(data)
    print("HTML saved!")

@staticmethod
def real_click(driver, ele):    # click an element
    actions = ActionChains(driver)
    actions.move_to_element(ele)
    actions.click(ele)
    actions.perform()

@staticmethod
def send_word(ele, word):       # type text into an input box and submit
    ele.clear()
    ele.send_keys(word)
    ele.send_keys(Keys.RETURN)

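Interesting and useful members from the source code

Driver

driver.current_url  # a property, so no parentheses
driver.page_source  # the HTML source; save it for local debugging to cut down on network requests
driver.save_screenshot('foo.png')
driver.get_log('driver')
driver.title  # the page title, directly; works well as a file name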
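Page titles often contain characters that file systems reject, so using driver.title as a file name usually needs a cleanup step first. The sketch below is one possible version. title_to_filename and its parameters are invented for this example, and the set of replaced characters is an assumption based on what Windows and Unix commonly disallow.

```python
import re


def title_to_filename(title, extension=".html", max_length=100):
    """Turn a page title into a file name that is safe on common file systems."""
    # Replace characters that Windows and Unix file systems reject or treat specially.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title)
    # Collapse whitespace runs, cap the length, then trim stray spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned)[:max_length].strip(" .")
    return (cleaned or "untitled") + extension


print(title_to_filename("Selenium: notes / tips?"))  # Selenium_ notes _ tips_.html
```

With a live driver this would be called as title_to_filename(driver.title), and the result passed to open() when saving driver.page_source.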

WebElement

ele.id  # available directly as a property
ele.get_attribute("class")  # used all the time

I take on freelance work in Python and R; message me privately if you need something.

Thanks for the support.
