How To Perform Selenium and Python Web Scraping
Vinayak Sharma
Posted On: June 7, 2021
This article is a part of our Content Hub. For more in-depth resources, check out our content hub on Selenium Python Tutorial.
As per the Stack Overflow Developer Survey 2020, Python holds the fourth position among the most preferred programming languages. Moreover, it is supported across a wide range of test automation frameworks, including the Selenium framework. This makes Selenium and Python web scraping one of the most widely used combinations for smarter data collection and intelligent analysis.
‘Data is the new oil,’ the evergreen quote by Clive Humby, becomes much more relevant when the right methods are used to make the most of data. There is a plethora of information (read: data) available on the internet, and acting on the right set of data can reap significant business benefits. Putting the right data collection methods into practice can yield useful insights; incorrect data collection methods, on the other hand, can produce noisy or misleading data.
Web scraping, surveys, questionnaires, focus groups, oral histories, etc., are some of the widely used mechanisms for gathering data that matters. Of all these methods, web scraping is considered the most reliable and efficient. For starters, web scraping (also termed web data extraction) is an automated method for obtaining large amounts of data from websites. Selenium, the popular test automation framework, can be extensively used for scraping web pages. You can also learn more about what Selenium is. In this Selenium Python tutorial, we look at web scraping using Selenium and Python.
We have chosen Python, the popular backend programming language, for demonstrating web page scraping. Along with scraping information from static web pages, we will also look at web scraping of dynamic pages using Python and Selenium.
What is Web Scraping
Web scraping, also known as “crawling” or “spidering,” is a technique for web harvesting, which means collecting or extracting data from websites. Here, we use bots to extract content from HTML pages and store it in a database (or a CSV file or some other file format). A scraper bot can even replicate entire website content, owing to which many varieties of digital businesses have been built around data harvesting and collection.
Python web scraping can help us extract enormous volumes of data about customers, products, people, stock markets, etc. That data then has to be put to ‘optimal use’ for the betterment of the service. Although there is an ongoing debate about whether web scraping is legal, the fact is that it can be used for realizing many legitimate use cases.
This certification is for professionals looking to develop advanced, hands-on expertise in Selenium automation testing with Python and take their career to the next level.
Use Cases for Python Web Scraping
Here are some of the valid (or authorized) use cases of web scraping in Python (and other Selenium-supported programming languages):
- Search Engines: Search engine bots crawl billions of web pages and analyze their content to gather meaningful search results. Search engine crawling is often called spidering. Spiders navigate the web by downloading pages and following the links on those pages to discover new ones. The pages are then ranked according to factors like keywords, content uniqueness, page freshness, and user engagement.
- E-commerce (Price Comparison): Price comparison websites use bots to fetch prices from different e-commerce websites. Python web scraping is a reliable and efficient method of getting product data from target e-commerce sites according to your requirements. These sites acquire data either by building in-house web scraping pipelines or by employing a DaaS (Data as a Service) provider that supplies the requisite data.
- Sentiment Analysis: Market research companies use Python web scraping for sentiment analysis. This kind of analysis helps companies gain customer insights and understand how their customers respond to particular brands and products.
- Job Postings: Job listings for details about job openings and interviews are scraped from a collection of websites. The scraped information is then listed in one place so that it is seamlessly accessible to the users.
Read More– Get started with your easy Selenium Python tutorial!!!
Is Python Web Scraping Legal?
This is a debatable topic since it entirely depends on the intent of the scraping and the target website from which the data is being scraped. Some websites allow web scraping while others don’t. To see whether a website permits it, look at the website’s “robots.txt” file, which can be found by adding “/robots.txt” to the end of the site’s URL.
For example, if we want to scrape the LambdaTest website, we have to see the “robots.txt” file, which is at the URL https://www.lambdatest.com/robots.txt
```
User-agent: *
Allow: /
Sitemap: https://www.lambdatest.com/sitemap.xml
```
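Instead of inspecting robots.txt by hand, the check can be automated with Python’s standard-library urllib.robotparser. Here is a minimal sketch that feeds the parser the same rules shown above (parse() is used so no network request is needed; in a real scraper you would point set_url() at the live file and call read() instead):

```python
from urllib.robotparser import RobotFileParser

# The same rules as in LambdaTest's robots.txt above
rules = [
    "User-agent: *",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch() tells us whether a given user agent may fetch a URL
print(parser.can_fetch("*", "https://www.lambdatest.com/blog/"))  # → True
```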
Python Web Scraping
Getting started with web scraping in Python is easy since it provides tons of modules that ease the process of scraping websites. Here are some of the modules that you should be aware of to realize Python Web Scraping:
- Requests Library for Web Scraping
The requests library is used for making several types of HTTP requests, such as GET, POST, and PUT. Because of its simplicity and ease of use, its motto is “HTTP for Humans.”
However, the requests library cannot parse the HTML it retrieves; for that, it is typically paired with a parser such as lxml or Beautiful Soup.
$ pip install requests
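As a quick illustration of the library’s simplicity, we can build a GET request and inspect the URL it would send, without hitting the network. (The `s` query parameter below is an assumed example, not necessarily the blog’s real search parameter.)

```python
import requests

# Build, but do not send, a GET request with a query parameter
req = requests.Request(
    "GET", "https://www.lambdatest.com/blog/", params={"s": "selenium"}
)
prepared = req.prepare()

print(prepared.method)  # → GET
print(prepared.url)     # → https://www.lambdatest.com/blog/?s=selenium
```

Calling requests.get() with the same arguments would send this request and return a Response object whose .text attribute holds the page HTML.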
- Beautiful Soup Library for Web Scraping
Beautiful Soup is one of the most widely used Python libraries for web scraping. It works by creating a parse tree from HTML and XML documents, automatically converting incoming documents to Unicode and outgoing documents to UTF-8. It sits on top of popular parsers such as lxml and Python’s built-in html.parser, and provides idiomatic methods for navigating, searching, and modifying the parse tree.
$ pip install beautifulsoup4
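A tiny self-contained sketch of Beautiful Soup at work, using made-up HTML shaped like the blog markup scraped later in this article:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a blog listing page
html_doc = '<h2 class="blog-titel"><a href="/post">My Post</a></h2>'

soup = BeautifulSoup(html_doc, "html.parser")

# Find the h2 by class, then read the text of the anchor inside it
title = soup.find("h2", class_="blog-titel").a.text
print(title)  # → My Post
```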
- Scrapy Framework for Web Scraping
Scrapy is a web scraping framework created by Pablo Hoffman and Shane Evans, co-founders of Scrapinghub. It is a full-fledged web scraping tool that does all the heavy lifting and provides spider bots to crawl various websites and extract the data. With Scrapy, we can create spider bots, host them on Scrapy Hub, or use their APIs. It allows us to develop fully functional spiders in a few minutes. We can also add pipelines to process and store data.
Scrapy makes asynchronous requests, meaning it can issue multiple HTTP requests simultaneously. This saves a lot of time and increases scraping efficiency.
$ pip install Scrapy
Note: To further ease the process of writing small tests, Python offers various tools and frameworks. Whether you are a Python beginner or an experienced programmer, pytest helps you write the tests you need and run them reliably. For a quick overview of getting started with pytest, check out the video below from the LambdaTest YouTube Channel.
Static and Dynamic Web Scraping Using Selenium and Python
There is a difference between static and dynamic web pages. On a static web page, the content remains the same until someone changes it manually. On a dynamic web page, the content can differ for different visitors (e.g., it can change as per geolocation, user profile, etc.). This makes scraping more complex and time-consuming, since dynamic web pages render content at the client side, unlike static web pages, which are rendered at the server side.
Static web page content (HTML documents) can be downloaded locally and scraped using relevant scripts. Dynamic web page content, on the other hand, is generated uniquely for every request after the initial page load. In that case, we first need to interact with the page in a real browser: load the URL, enter the search term, submit it, and wait for the results to render.

For this purpose, we need to automate the website, and the same can be achieved using Selenium WebDriver.
Scraping Dynamic Web Page using Python and Selenium
Here are the prerequisites for realizing Selenium and Python Web Scraping:
- Beautiful Soup for scraping HTML content from websites:
$ pip install beautifulsoup4
- Parsing HTML content of websites:
$ pip install lxml
- Selenium for automation:
- Installing Selenium using pip
$ pip install selenium
- Install Selenium using conda
$ conda install -c conda-forge selenium
Read – What is Selenium & how to get started?
Importing modules for Selenium and Python Web Scraping
For demonstration, we would be using the LambdaTest Grid. Cloud-based Selenium Grid on LambdaTest lets you run Selenium automation tests on 2,000+ browsers and operating systems online. You can perform parallel testing at scale using the cloud-based Grid. Once you create an account on LambdaTest, make a note of the user-name & access-key from the LambdaTest profile section.
```python
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as bs
from lxml import html

# Get your username and access key from the LambdaTest profile section,
# ideally via environment variables:
# username = os.environ.get("LT_USERNAME")
# access_key = os.environ.get("LT_ACCESS_KEY")

# Username and Access Key assigned as string variables
username = "user_name"
access_key = "access_key"
```
Now that we have imported all modules let’s get our hands dirty with Selenium and Python Web Scraping.
Locating elements for Selenium and Python Web Scraping
We will scrape the blog titles from the LambdaTest Blog page. To do that, we enter the required topic in the blog’s search bar.
The following Selenium locators can be used for locating WebElements on the web page under test: ID, Name, Class Name, Tag Name, Link Text, Partial Link Text, CSS Selector, and XPath.
Here is an example of the usage of Selenium web locators to locate the search box on the page:
In this case, we would use the XPath method driver.find_element(By.XPATH) to locate the search box on the page.
```python
driver.get("https://www.lambdatest.com/blog/")

searchBarXpath = "/html[1]/body[1]/section[1]/div[1]/form[1]/label[1]/input[1]"

# searching topic
textbox = driver.find_element(By.XPATH, searchBarXpath)
textbox.send_keys(topic)
textbox.send_keys(Keys.ENTER)

source = driver.page_source
```
Now that we have the page source, we can parse it using lxml and Beautiful Soup.
Read: A Complete Tutorial on Selenium Locators
Scraping titles using Beautiful Soup
After loading the page source into Beautiful Soup with Python’s built-in html.parser, we find all h2 tags with the class “blog-titel” and the anchor tags inside them, as these anchor tags contain the blog titles.
```python
title_list = []

soup = bs(source, "html.parser")
for h2 in soup.findAll("h2", class_="blog-titel"):
    for a in h2.findAll("a", href=True):
        title_list.append(a.text)

return title_list
```
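To see the title-extraction loop in action without a live browser session, we can run the same logic against a small made-up snippet shaped like the blog page (the markup below is a hypothetical stand-in for the real driver.page_source):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for driver.page_source
source = """
<h2 class="blog-titel"><a href="/p1">Post One</a></h2>
<h2 class="blog-titel"><a href="/p2">Post Two</a></h2>
<h2 class="other"><a href="/p3">Ignored</a></h2>
"""

title_list = []
soup = bs(source, "html.parser")
for h2 in soup.findAll("h2", class_="blog-titel"):
    for a in h2.findAll("a", href=True):
        title_list.append(a.text)

print(title_list)  # → ['Post One', 'Post Two']
```

Note how the h2 with a different class is skipped; only anchors inside “blog-titel” headings are collected.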
Read – Scraping Dynamic Web Pages Using Selenium And C#
Putting it all together
Let’s combine the code to get the output.
```python
import os

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as bs
from lxml import html

# Get your username and access key from the LambdaTest profile section,
# ideally via environment variables:
# username = os.environ.get("LT_USERNAME")
# access_key = os.environ.get("LT_ACCESS_KEY")

# Username and Access Key assigned as string variables
username = "user_name"
access_key = "access_key"


class blogScraper:
    # Generate capabilities from here:
    # https://www.lambdatest.com/capabilities-generator/
    def setUp(self):
        capabilities = {
            "build": "your build name",
            "name": "your test name",
            "platform": "Windows 10",
            "browserName": "Chrome",
            "version": "91.0",
            "selenium_version": "3.11.0",
            "geoLocation": "IN",
            "chrome.driver": "91.0",
            "headless": True,
        }
        self.driver = webdriver.Remote(
            command_executor="https://{}:{}@hub.lambdatest.com/wd/hub".format(
                username, access_key
            ),
            desired_capabilities=capabilities,
        )

    def tearDown(self):
        self.driver.quit()

    def scrapTopic(self, topic):
        driver = self.driver

        # Url
        driver.get("https://www.lambdatest.com/blog/")

        searchBarXpath = "/html[1]/body[1]/section[1]/div[1]/form[1]/label[1]/input[1]"

        # searching topic
        textbox = driver.find_element(By.XPATH, searchBarXpath)
        textbox.send_keys(topic)
        textbox.send_keys(Keys.ENTER)

        source = driver.page_source

        # scraping titles
        title_list = []
        soup = bs(source, "html.parser")
        for h2 in soup.findAll("h2", class_="blog-titel"):
            for a in h2.findAll("a", href=True):
                title_list.append(a.text)

        return title_list


if __name__ == "__main__":
    obj = blogScraper()
    obj.setUp()
    print(obj.scrapTopic("scrap"))
    obj.tearDown()
```
Here is the execution output:
```
['Scraping Dynamic Web Pages Using Selenium And C#', '9 Of The Best Java Testing Frameworks For 2021', 'The Best Alternatives to Jenkins for Developers', '10 Of The Best Chrome Extensions - How To Find XPath in Selenium', 'How To Take A Screenshot Using Python & Selenium?', 'Top 10 Java Unit Testing Frameworks for 2021', 'Agile Vs Waterfall Methodology', 'Why You Should Use Puppeteer For Testing']
```
Selenium is often essential for extracting data from websites that rely heavily on JavaScript, as it is an excellent tool for automating nearly anything on the web.
Read – Automation Testing with Selenium JavaScript [Tutorial]
Here is the execution snapshot of our Python web automation tests on the LambdaTest Automation Dashboard:
Conclusion
In this blog on Selenium and Python web scraping, we deep-dived into web scraping as a technique that is extensively used by software developers for automating the extraction of data from websites. The purpose of web scraping is to let companies and enterprises manage information efficiently. A number of applications, such as VisualScrapper and HTMLAgilityPack, allow users to scrape data from static web pages, whereas Selenium is the most preferred tool for dynamic web page scraping.
Selenium test automation supports a variety of browsers and operating systems, and LambdaTest offers a cloud-based Selenium Grid that makes it easy to perform cross browser testing at scale across different browsers, platforms, and resolutions.
Happy Scraping!
Got Questions? Drop them on LambdaTest Community. Visit now