Scraping the Web with Selenium on Google Cloud Composer (Airflow)


There are already a lot of resources available on creating web scrapers with Python, usually based on either the well-known combination of urllib + beautifulsoup4 or on Selenium. When you face the challenge of scraping a JavaScript-heavy web page, or need a level of interaction with the content that cannot be achieved by simply sending URL requests, Selenium is very likely your preferred choice. I don't want to go into the details here of how to set up your scraping script or the best practices for running it reliably; there are a couple of resources out there that I found particularly helpful.


The problem that we want to solve in this post is: how can I, as a Data Analyst/Data Scientist, set up an orchestrated and fully managed process to run a Selenium scraper with a minimum of DevOps required? The main use case for such a setup is a managed, scheduled solution for running all your scraping jobs in the cloud.


The tools we are going to use are:


  • Google Cloud Composer to schedule jobs and orchestrate workflows

  • Selenium as a framework to scrape websites

  • Google Kubernetes Engine to deploy a Selenium remote driver as a containerized application in the cloud

At HousingAnywhere we were already using Google Cloud Composer for a number of different tasks. Cloud Composer is quite an amazing tool to easily manage, schedule and monitor workflows as directed acyclic graphs (DAGs). It is based on the open-source framework Apache Airflow and uses pure Python, which makes it ideal for everyone working in the data field. The entry barrier to deploying Airflow on your own is relatively high if you are not coming from DevOps, which has led some cloud providers to offer managed Airflow deployments, Google's Cloud Composer being one of them.


When deploying Selenium for web scraping, we are actually using the so-called Selenium WebDriver. This WebDriver is a framework that allows you to control a browser using code (Java, .NET, PHP, Python, Perl, Ruby). For most use cases you would simply download a browser that can directly interact with the WebDriver framework, for example Mozilla Geckodriver or ChromeDriver. The scraping script will then start a browser instance on your local machine and execute all actions as specified. In our use case things are a bit more complicated, because we want to run the script on a recurring schedule without using any local resources. To be able to deploy and run web scraping scripts in the cloud, we need to use a Selenium Remote WebDriver (a.k.a. Selenium Grid) instead of the local Selenium WebDriver.


Source: https://www.browserstack.com/guide/difference-between-selenium-remotewebdriver-and-webdriver
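As a minimal sketch of what this difference looks like in code, connecting to a remote driver means pointing the Selenium client at the grid's hub URL instead of launching a local browser. The host name and port below are illustrative assumptions, not values from the article:

```python
def hub_url(host: str, port: int = 4444) -> str:
    # Selenium Grid exposes its hub endpoint under /wd/hub by default.
    return f"http://{host}:{port}/wd/hub"

def make_remote_driver(host: str, port: int = 4444):
    # Imported lazily so hub_url stays usable without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    opts = Options()
    opts.add_argument("--headless")
    # With a local setup you would call webdriver.Firefox() instead;
    # here we send every command to the remote grid.
    return webdriver.Remote(command_executor=hub_url(host, port), options=opts)
```

The only real change compared to a local WebDriver is the command_executor URL, which will later point at the service we expose on GKE.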

Running remote web browser instances with Selenium Grid

The idea behind Selenium Grid is to provide a framework that allows you to run parallel scraping instances by running web browsers on a single machine or on multiple machines. In this case, we can make use of the provided standalone browsers, which are already wrapped up as Docker images (keep in mind that each of the available browsers, Firefox, Chrome and Opera, is a separate image).


Cloud Composer runs Apache Airflow on top of a Google Kubernetes Engine (GKE) cluster. Furthermore, it is fully integrated with other Google Cloud products. The creation of a new Cloud Composer environment also comes along with a functional UI and a Cloud Storage bucket. All DAGs, plugins, logs and other required files are stored in this bucket.

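As a hedged sketch of how the scheduling side fits in: a scraping job on Cloud Composer is just a regular Airflow DAG file dropped into the environment's bucket. The DAG id, owner, schedule and task below are illustrative assumptions, not taken from the original article:

```python
from datetime import datetime, timedelta

def default_args(owner: str = "data-team", retries: int = 2) -> dict:
    # Typical default_args for a scraping DAG; all values are illustrative.
    return {
        "owner": owner,
        "retries": retries,
        "retry_delay": timedelta(minutes=5),
        "start_date": datetime(2020, 1, 1),
    }

def create_dag(dag_id: str = "selenium_scraper"):
    # Airflow is imported lazily so the helper above can be used without it.
    # The import path matches Airflow 1.10, which Cloud Composer ran at the time.
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(dag_id, default_args=default_args(),
              schedule_interval="@daily", catchup=False)

    def run_scraper():
        # Here you would connect to the remote driver and do the actual scraping.
        pass

    PythonOperator(task_id="scrape", python_callable=run_scraper, dag=dag)
    return dag

# In a file under the bucket's dags/ folder you would expose it at module level:
# dag = create_dag()
```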

Deploy and expose the remote driver on GKE

You can deploy a Docker image for the Firefox standalone browser using the selenium-firefox.yaml file below, and apply the specified configuration to your cluster by running:


kubectl apply -f selenium-firefox.yaml

The configuration file describes what kind of object you want to create, its metadata, and its spec.

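The original article embeds the full selenium-firefox.yaml as a gist, which is not reproduced here. A minimal sketch of such a configuration (object names, ports, replica count and service type are assumptions) could look like:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-firefox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-firefox
  template:
    metadata:
      labels:
        app: selenium-firefox
    spec:
      containers:
        - name: selenium-firefox
          image: selenium/standalone-firefox
          ports:
            - containerPort: 4444
---
apiVersion: v1
kind: Service
metadata:
  name: selenium-firefox
spec:
  type: LoadBalancer        # exposes the remote driver outside the cluster
  selector:
    app: selenium-firefox
  ports:
    - port: 4444
      targetPort: 4444
```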

We can create a new connection in the Admin UI of Airflow and access the connection details later in our plugin. The connection details are either specified in the yaml file or can be found on your Kubernetes cluster.


Airflow Connections · Kubernetes Engine on GCP

After setting up the connection, we can access it in our scraping script (an Airflow plugin), where we connect to the remote browser.

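Putting it together, a sketch of that plugin code might look like the following. The connection id selenium_firefox, the BaseHook import path (Airflow 1.10), and the example URL are assumptions for illustration:

```python
def scrape_title(driver, url: str) -> str:
    # Works with any WebDriver-like object: navigate and read the page title.
    driver.get(url)
    return driver.title

def run_with_airflow_connection(conn_id: str = "selenium_firefox"):
    # Imported lazily so scrape_title stays usable without Airflow/selenium.
    from airflow.hooks.base_hook import BaseHook
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    conn = BaseHook.get_connection(conn_id)  # host/port as stored in the Admin UI
    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Remote(
        command_executor=f"http://{conn.host}:{conn.port}/wd/hub",
        options=opts,
    )
    try:
        return scrape_title(driver, "https://example.com")
    finally:
        driver.quit()
```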

Thank you Massimo Belloni for technical consultancy and advice in realizing the project and this article.


Translated from: https://towardsdatascience.com/scraping-the-web-with-selenium-on-google-cloud-composer-airflow-7f74c211d1a1
