
Question

Raha

Asked: 2024-09-04 21:58:05 +0800 CST

Unable to paginate when web scraping with Selenium


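I am trying to scrape this page, https://mst.dk/publikationer, which is paginated. Looking at the page source, the pagination appears to be handled by the section I have included below.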

<div class="Container_Container__G5vVd Container_Container___width_std__y2_Pn">
    <div class="Pagination_Pagination_wrapper__kp62j">
        <ul class="Pagination_Pagination__UOZ60" role="navigation" aria-label="Pagination">
            <li class="Pagination_Pagination_prev__zIUqn Pagination_Pagination_item___disabled__g5CaR">
                <a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_prevLink__HDKS4" tabindex="-1" role="button" aria-disabled="true" aria-label="Previous page" rel="prev"></a>
            </li>
            <li class="Pagination_Pagination_item__suqyV selected">
                <a rel="canonical" role="button" class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_link___active__to_Os" tabindex="-1" aria-label="Side 1" aria-current="page">1</a>
            </li>
            <li class="Pagination_Pagination_item__suqyV">
                <a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 2" rel="next">2</a>
            </li>
            <li class="Pagination_Pagination_break__dKVzB">
                <a class="Pagination_Pagination_breakLink__jB8Rd" role="button" tabindex="0">...</a>
            </li>
            <li class="Pagination_Pagination_item__suqyV">
                <a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 321">321</a>
            </li>
            <li class="Pagination_Pagination_next__N6tkt">
                <a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_nextLink__mytrA" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"></a>
            </li>
        </ul>
    </div>

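I have tried several approaches: adding page=x to the URL, using different Selenium locators and selectors, increasing the wait time, clicking the next button, and simulating a click on the list items. Nothing has worked for me. Can someone help me understand how this page behaves and how to paginate through it? What I want to do is open every link on every page, find the PDF, and download it. For the first page this works fine with the following code: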

def parse_epa_filtered_keywords():
    # Get number of search results
    page_no = int(int(get_number_of_results(link_filtered)) / 10) + 1
    driver = webdriver.Chrome(options=options)
    search_query = '+'.join(keywords.split())
    
    for i in tqdm(range(1, page_no + 1)):
        try:
            search_url = f"{link_filtered}?search={search_query}&page={i}"
            print(f"Fetching URL: {search_url}")
            
            # Load the search URL
            driver.get(search_url)
            
            # Wait for the page to load completely
            time.sleep(5)  # Adjust the sleep time as needed
            
            # Wait for the main page to load again
            publications = driver.find_elements(By.CSS_SELECTOR, 'a[class^="Link_Link__lzynb SearchResultItem_SearchResult"]')
            ....
driver.quit()

Clearly, this attempt to use the page parameter just opens the first page over and over. I then tried the following locators:

next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")

or

next_button = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.Pagination_Pagination_next_N6tkt a")))

I also made further attempts with other elements. These either resulted in a generic ChromeDriver error or in an error like this one:

An error occurred: Message: element click intercepted: Element is not clickable at point (732, 2911)
  (Session info: chrome=128.0.6613.114)
Stacktrace:
0   chromedriver                        0x0000000104f83998 cxxbridge1$str$ptr + 1887096
1   chromedriver                        0x0000000104f7be00 cxxbridge1$str$ptr + 1855456
2   chromedriver                        0x0000000104b80be0 cxxbridge1$string$len + 89508
3   chromedriver                        0x0000000104bca6fc cxxbridge1$string$len + 391360
4   chromedriver                        0x0000000104bc8d28 cxxbridge1$string$len + 384748
5   chromedriver

2 Answers

GTK · Answer 1 · 2024-09-05T00:22:52+08:00

Here is an alternative solution that uses the API:

import requests

def get_all_results():
    headers = {
        'hostname': 'http://mst.local:3001'
    }

    payload = {
        'key': 'a2369450-5ec7-494c-b910-d72074a73af9',
        'documentTypes': ['articlePage'],
        'subjects': [],
        'categories': [{'name': 'Publikation'}],
        'takeAmount': 100,
        'skipAmount': 0,
        'direction': 'descending',
        'UserTextInputField': ''
    }

    url = 'https://search.mst.dk/api/News/Search'
    
    results = []
    while True:
        response = requests.post(url, headers=headers, json=payload)
        data = response.json()
        results.extend(data['searchResults'])
        payload['skipAmount'] += payload['takeAmount']

        if len(results) >= data['pagination']['totalResults']:
            break

    return results


results = get_all_results()

print(f'{len(results) = }')

*This takes about 30-40 seconds to fetch all 3201 results; you can use async to speed it up.
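As a rough sketch of that speed-up, the pages can be requested concurrently once the total is known. Note that this swaps in a thread pool from the standard library instead of true async I/O, and that `page_offsets`, `fetch_all`, and the `fetch_page` callable are hypothetical names introduced here for illustration; they are not part of the site's API.

```python
from concurrent.futures import ThreadPoolExecutor


def page_offsets(total_results, take_amount):
    """Return the skipAmount value for every page needed to cover total_results."""
    return list(range(0, total_results, take_amount))


def fetch_all(fetch_page, total_results, take_amount=100, max_workers=8):
    """Fetch every page concurrently and return the results in page order.

    fetch_page is a hypothetical callable: fetch_page(skip_amount) -> list of results.
    """
    offsets = page_offsets(total_results, take_amount)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the order of offsets even if requests finish out of order
        pages = list(pool.map(fetch_page, offsets))
    # Flatten the list of pages into one list of results
    return [item for page in pages for item in page]
```

To wire this up, `fetch_page` could wrap the `requests.post(url, headers=headers, json=payload)` call from the code above, using a copy of `payload` with `skipAmount` set to the given offset and returning `data['searchResults']`. The total can be read from `data['pagination']['totalResults']` in a first request. Keep `max_workers` modest so the server is not flooded.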

Shawn · Answer 2 · 2024-09-04T23:40:23+08:00

next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")

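Although the XPath expression in the code above is correct, for some reason it did not click the element. I used ActionChains as shown below, and it clicked the next button successfully.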

next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
actions = ActionChains(driver)
actions.move_to_element(next_button).click().perform()

Here is complete working code that loops through the pages and scrapes them.

Note: I am scraping the first 3 pages and collecting the search result headings; you can scrape whatever you need:

from selenium.webdriver import ActionChains
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def click_next_page():
    next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@aria-label='Next page']")))
    actions = ActionChains(driver)
    actions.move_to_element(next_button).click().perform()

def extract_headings(wait):
    headings = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//li//h3")))
    search_results_headings = ""
    for heading in headings:
        search_results_headings += "\n" + heading.text
    return search_results_headings

driver = webdriver.Chrome()
driver.get("https://mst.dk/publikationer")
driver.maximize_window()
wait = WebDriverWait(driver, 10)

# Use below line of code only if you see accept/reject cookies pop-up
accept_all = wait.until(EC.element_to_be_clickable((By.ID, "CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll")))
driver.execute_script("arguments[0].click();", accept_all)

search_results_headings = ""
# Below for loop iterates 3 times, so 3 pages will be scraped, if you want more pages change the range accordingly
for _ in range(3):
    search_results_headings += extract_headings(wait)
    click_next_page()

print(search_results_headings)

Console output:

Diffus forurening med PFAS i jord, grundvand og overfladevand
Digitale værktøjer til klimatilpasning
Performancebenchmarking
Oprensning af PFAS-forurening i jord, slam og vand - Test af teknologier i praksis
Lokalt funderede analyse – afrapportering
Maritime Emissionsløsninger i Kystnære Farvande
Biokinetisk lattergasreduktion i renseanlæg
Inter DAN NRW
Gennemførelse og anvendelse af slamdirektivet 2023
CombiControl - Combining above- and belowground biological control agents for improved pest control in strawberry tunnel production
Affaldsstatistik 2022
Scientific investigation of ballast water discharge - Random checks on ships in autumn – winter 2022
Control of Biocides 2023
Ny kosteffektiv teknologi til måling af klimagasudledninger fra renseanlæg
Recycling potential of separately collected post-consumer textile waste
Modelling and mapping pesticide exposure risk at the catchment scale (MOMAPEST)
Indberetning af status for anvendelse af almene vandforsyningsboringer i Virk.dk
PFAS i jord - International screening af andre landes praksis for håndtering af jord med PFAS
Anbefalinger til screening og kortlægning af bygge- og anlægsaffald
Emissions of Quaternary Alkylammonium Compounds
Nikotinposer – indhold og miljøkonsekvenser
Udredningsprojekt vedr. analysemetoder til undersøgelse for PFAS-forbindelser i jord, grundvand og overfladevand
Rensningsmuligheder for pesticider med fokus på aktivt kul og membraner
Renholds- og omkostningsanalyse jf. Engangsplastdirektivets oprydningsansvar
Kemiske stoffer i en cirkulær økonomi - Et MUDP projekt
Pesticider og biocider i den danske pindsvinebestand
Kortlægning af madaffald i primærproduktionen samt forarbejdnings- og fremstillingssektoren for 2022
Kortlægning af madaffald og madspild i restaurationsbranchen og restaurationstjenester for 2022
Inhibition of lung surfactant function as an alternative method to predict lung toxicity following exposure to plant protection products
Survey and risk assessment of pesticides in cut flowers from non-EU countries

Process finished with exit code 0
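The loop above stops after a fixed 3 pages. To walk through every page, one option is to stop when the "Next page" link reports `aria-disabled="true"`. The sketch below assumes the site sets that attribute on the last page, in the same way the question's HTML shows it on the "Previous page" link of page 1; I have not verified this. `go_to_next_page` and `current_page_label` are hypothetical helper names. The click goes through JavaScript, like the cookie button above, so that an overlapping element cannot intercept it.

```python
NEXT_XPATH = "//a[@aria-label='Next page']"
CURRENT_XPATH = "//a[@aria-current='page']"


def current_page_label(driver):
    """Return the text of the highlighted page number, e.g. "1"."""
    # "xpath" is the string value of selenium's By.XPATH
    return driver.find_element("xpath", CURRENT_XPATH).text


def go_to_next_page(driver, wait):
    """Click "Next page"; return False if the link is disabled (assumed last page)."""
    next_link = driver.find_element("xpath", NEXT_XPATH)
    if next_link.get_attribute("aria-disabled") == "true":
        return False
    before = current_page_label(driver)
    # Scroll the link into view and click it from JavaScript
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'}); arguments[0].click();",
        next_link,
    )
    # Block until the highlighted page number changes
    wait.until(lambda d: current_page_label(d) != before)
    return True
```

With the `driver` and `wait` objects from the code above, the `for _ in range(3)` loop could become a `while True:` loop that calls `extract_headings(wait)` and then breaks when `go_to_next_page(driver, wait)` returns `False`.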
