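I am trying to scrape this web page, https://mst.dk/publikationer, which is paginated. Looking at the page source, the pagination appears to be handled in the section I have included below.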
<div class="Container_Container__G5vVd Container_Container___width_std__y2_Pn">
<div class="Pagination_Pagination_wrapper__kp62j">
<ul class="Pagination_Pagination__UOZ60" role="navigation" aria-label="Pagination">
<li class="Pagination_Pagination_prev__zIUqn Pagination_Pagination_item___disabled__g5CaR">
<a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_prevLink__HDKS4" tabindex="-1" role="button" aria-disabled="true" aria-label="Previous page" rel="prev"></a>
</li>
<li class="Pagination_Pagination_item__suqyV selected">
<a rel="canonical" role="button" class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_link___active__to_Os" tabindex="-1" aria-label="Side 1" aria-current="page">1</a>
</li>
<li class="Pagination_Pagination_item__suqyV">
<a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 2" rel="next">2</a>
</li>
<li class="Pagination_Pagination_break__dKVzB">
<a class="Pagination_Pagination_breakLink__jB8Rd" role="button" tabindex="0">...</a>
</li>
<li class="Pagination_Pagination_item__suqyV">
<a role="button" class="Pagination_Pagination_link__Z2LW0" tabindex="0" aria-label="Side 321">321</a>
</li>
<li class="Pagination_Pagination_next__N6tkt">
<a class="Pagination_Pagination_link__Z2LW0 Pagination_Pagination_nextLink__mytrA" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"></a>
</li>
</ul>
</div>
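I have tried several approaches: adding page=x to the URL, using Selenium with different locators and selectors, increasing the wait time, clicking the next button, and simulating a click on the list items. Nothing has worked so far. Can someone help me understand how this page behaves dynamically and how to paginate through it? What I want to do is open every link on every page, find the PDF, and download it. For the first page this works fine with the following code: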
def parse_epa_filtered_keywords():
    # Get number of search results
    page_no = int(int(get_number_of_results(link_filtered)) / 10) + 1
    driver = webdriver.Chrome(options=options)
    search_query = '+'.join(keywords.split())
    for i in tqdm(range(1, page_no + 1)):
        try:
            search_url = f"{link_filtered}?search={search_query}&page={i}"
            print(f"Fetching URL: {search_url}")
            # Load the search URL
            driver.get(search_url)
            # Wait for the page to load completely
            time.sleep(5)  # Adjust the sleep time as needed
            # Wait for the main page to load again
            publications = driver.find_elements(By.CSS_SELECTOR, 'a[class^="Link_Link__lzynb SearchResultItem_SearchResult"]')
            ....
    driver.quit()
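A side note on the page count: computing it as int(n / 10) + 1 gives one page too many whenever the number of results is an exact multiple of 10. A minimal sketch of a ceiling division instead, assuming 10 results per page as in the code above (the helper name page_count is hypothetical):

```python
def page_count(total_results: int, per_page: int = 10) -> int:
    """Return how many pages are needed to list total_results items."""
    if total_results <= 0:
        return 0
    # Ceiling division using integers only: round up any partial last page.
    return (total_results + per_page - 1) // per_page
```

With this, 10 results map to 1 page and 11 results to 2 pages, whereas int(10 / 10) + 1 would report 2 pages for 10 results.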
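The parse_epa_filtered_keywords function is my attempt using the page URL parameter; it just opens the first page over and over. I then tried locating the next button with: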
next_button = driver.find_element(By.XPATH, "//li[contains(@class, 'Pagination_Pagination_next')]/a[@rel='next']")
or
next_button = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.Pagination_Pagination_next_N6tkt a")))
and made further attempts with other elements. These either produced a generic ChromeDriver error or an error like this:
An error occurred: Message: element click intercepted: Element is not clickable at point (732, 2911)
(Session info: chrome=128.0.6613.114)
Stacktrace:
0 chromedriver 0x0000000104f83998 cxxbridge1$str$ptr + 1887096
1 chromedriver 0x0000000104f7be00 cxxbridge1$str$ptr + 1855456
2 chromedriver 0x0000000104b80be0 cxxbridge1$string$len + 89508
3 chromedriver 0x0000000104bca6fc cxxbridge1$string$len + 391360
4 chromedriver 0x0000000104bc8d28 cxxbridge1$string$len + 384748
5 chromedriver
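Two observations on the attempts above. First, the CSS selector in the second locator spells the class as Pagination_Pagination_next_N6tkt (single underscore), while the markup has Pagination_Pagination_next__N6tkt, so it cannot match. Second, "element click intercepted" usually means another element covers the target at the click coordinates. A commonly used workaround is to scroll the link into view and trigger the click through JavaScript. The following is a minimal sketch under assumptions, not a verified solution for this site: the helper name click_next is hypothetical, it assumes the next link carries aria-disabled="true" on the last page (mirroring the disabled previous link in the markup), and it assumes the site reacts to a JavaScript-dispatched click.

```python
try:
    from selenium.webdriver.common.by import By
    CSS = By.CSS_SELECTOR
except ImportError:  # fallback keeps the helper importable without Selenium
    CSS = "css selector"  # the literal value of By.CSS_SELECTOR

# Match on the class prefix only; the hashed suffix (e.g. "__N6tkt")
# may change whenever the site is rebuilt.
NEXT_LINK = "li[class*='Pagination_Pagination_next'] a"


def click_next(driver) -> bool:
    """Click the pagination "next" link via JavaScript.

    Returns False when no next link is found or it is marked disabled.
    """
    links = driver.find_elements(CSS, NEXT_LINK)
    if not links:
        return False
    link = links[0]
    # Assumption: the last page marks the next link as aria-disabled="true".
    if link.get_attribute("aria-disabled") == "true":
        return False
    # Center the link in the viewport, then click it from JavaScript so
    # that an overlapping element cannot intercept the click.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", link)
    driver.execute_script("arguments[0].click();", link)
    return True
```

A scraping loop could collect the publication links on the current page, call click_next(driver), and stop once it returns False. Since the results are presumably re-rendered by JavaScript without a full page load, some wait for the list to change would still be needed after each click.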