Most of the time I cannot make a request to the following website:
https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html
library(rvest)
library(tibble)
library(httr2)
base_url <- "https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html"
parsed_base_url <- base_url |>
  read_html() # This works sometimes and I get the underlying html
# THIS NEVER WORKS
pagina_parsed <- base_url |>
  request() |>
  req_user_agent(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
  ) |>
  req_headers(
    Referer = "https://www.adondevivir.com/",
    Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    `Accept-Language` = "es-419,es;q=0.6",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `Cache-Control` = "max-age=0",
    `Sec-Ch-Ua` = '"Brave";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    Priority = "u=0, i"
  ) |>
  req_perform()
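Why does the request to this page fail most of the time (not to mention that it never works with httr2 and the headers shown above)? Is there a way to overcome this "issue" with httr2? Is it related to cookies, or to the way the page protects itself against scraping?
I suppose I could retry the request many times until it works, but I don't think that would teach me much about why it fails.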
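For reference, here is a minimal sketch of the retry idea, extended so that the failing response can be inspected instead of `req_perform()` simply throwing. The helper names `build_request()` and `diagnose()` are hypothetical, the headers are a trimmed subset of the ones above, and treating 403, 429 and 503 as retryable is an assumption on my part, not something the site documents.

```r
library(httr2)

# Hypothetical helper: builds the request without sending it.
build_request <- function(url) {
  request(url) |>
    req_user_agent(
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
    ) |>
    req_headers(
      Referer = "https://www.adondevivir.com/",
      `Accept-Language` = "es-419,es;q=0.6"
    ) |>
    # Assumption: these statuses are worth retrying; adjust as needed.
    req_retry(
      max_tries = 5,
      is_transient = \(resp) resp_status(resp) %in% c(403, 429, 503)
    ) |>
    # Return HTTP error responses as values instead of raising R errors,
    # so that the status and headers can be inspected afterwards.
    # Connection-level failures (no response at all) will still error.
    req_error(is_error = \(resp) FALSE)
}

# Hypothetical helper: performs the request and summarises the response.
diagnose <- function(url) {
  resp <- req_perform(build_request(url))
  list(
    status = resp_status(resp),
    headers = resp_headers(resp)
  )
}

# Usage (needs network access):
# str(diagnose("https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html"))
```

Looking at the returned status code and response headers might at least show whether the server is rejecting the request outright or whether something else is going wrong, which blind retrying would not reveal.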