Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / DevOps / automation

Automating the Web with Selenium and Python

2.64/5 (3 votes)
17 Apr 2019CPOL3 min read 9.8K   139  
Automated Pakwheels parsing with Selenium and Python

Introduction

Selenium is a tool for automating Web Browsers. Instead of testing websites manually, developers use this tool to test their website automatically. It works just as though a user would be using the website on his/her system. Although it is popular for testing websites, countless boring and repetitive web tasks can and should be automated.

Let's get started.

Diagrammatic Representation

Image 1

  1. Program opens the browser and browses to pakwheels.com.
  2. It further navigates to their Used Cars section and inputs the following query:
    • Car Make or Model: Honda Civic
    • Location: Islamabad
    • Year Range: 2008 - 2012
    • Price Range: 10 - 18 Lacs
  3. The resulted page displays 29 ads on the main page, satisfying the query.
  4. The program gets the links of the 29 ads.
  5. Saves the required data to an Excel Spreadsheet
  6. Finally closes the browser

Technologies Used

  • Selenium: Version 3.141.0
  • xlwt: Version 1.3.0
  • Python: Version 3.6.5

Prerequisites

  • Latest version of Firefox
  • Python 3.6.5 installed

Setting Up

  • Installing geckodriver

Note

  • The below command is for homebrew package manager.
  • geckodriver should be placed in /usr/local/bin/, otherwise you'll have to give the path to where it is located.
    Python
    $ brew install geckodriver
  • requirements.txt: It has all the dependencies to run the program.
    Python
    $ pip3 install -r requirements.txt
  • To run:
    Python
    $ python3 app.py

Code Overview

In the initialization block, I'm importing all the required modules and packages, creating selenium objects such as driver and options, and I'm creating an Excel workbook with the help of xlwt.

Note

  • I have initialized headless mode because it takes up less resources.
  • This can be disabled if you comment out options.add_argument('-headless').

The source code is as follows:

Python
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
import xlwt

options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # To disable headless mode, comment this line
driver = webdriver.Firefox(firefox_options=options)
workbook = xlwt.Workbook()

The program has three functions in total:

navigation()

The following function navigates to the used "Honda Civic" page.

  • Opens up the browser and browses to pakwheels
  • Waits for the pop-up to show up to close it
  • Clicks on Used Cars
  • Inputs the query
  • Clicks on search
  • Goes to the main page of the resulted query

The source code is as follows:

Python
"""
    - The function below, when called, 
      opens up the browser and goes to wwww.pakwheels.com and then navigates to their
      used cars section.
      
    - It then inserts queries into their query bar such as:
      Car Make Model
      City
      Price Range
      From - To Year
      
    - Finally it clicks on the search button and the page with the inserted criteria opens 
"""

def navigation():
    driver.get("https://www.pakwheels.com")
    WebDriverWait(driver, 500).until(EC.element_to_be_clickable_
                 ((By.ID, 'onesignal-popover-cancel-button')))
    driver.find_element_by_id('onesignal-popover-cancel-button').click()
    time.sleep(5)
    driver.find_element_by_link_text('Used Cars').click()
    driver.find_element_by_id('more_option').click()
    driver.find_element_by_name('home-query').send_keys('Honda Civic')
    driver.find_element_by_class_name('chzn-single').click()
    driver.find_element_by_id('UsedCity_chzn_o_4').click()
    pr_range = driver.find_element_by_id('pr-range-filter')
    pr_range.click()
    driver.find_element_by_id('pr_from').send_keys('10')
    driver.find_element_by_id('pr_to').send_keys('18')
    pr_range.click()
    yr_from = Select(driver.find_element_by_id('YearFrom'))
    yr_to = Select(driver.find_element_by_id('YearTo'))
    yr_from.select_by_value('2008')
    yr_to.select_by_value('2012')
    driver.find_element_by_id('used-cars-search-btn').click()

get_car_links()

I'm calling the navigation() function inside this so that selenium can navigate to the desired page.

The function does the following:

  1. Creates an empty list called links.
  2. Gets all the anchor tags of the Honda Civic ads.
  3. Through a for-loop, appends all the href of the anchor tags in the links list.
  4. Finally returns the list

The source code is as follows:

Python
'''

    - This function gets all the required links.
    
    - Firstly it gets all the Anchor tags that has the required links 
      (class = "car-name ad-detail-path")
    
    - Then through the for-loop, it further parses the anchor tags to get 
      the href from the anchor tags one by one and appends it to another list 
      called as links and returns that list.    

'''

def get_car_links():
    navigation()  # Calling the navigation func in get_car_links func
    links = []
    elems = driver.find_elements(By.XPATH, '//a[@class = "car-name ad-detail-path"]')
    for elem in elems:
        links.append(elem.get_attribute('href'))

    return links

scrape_output()

The final function takes in the list that has the links and does the following:

  • Creates a sheet in the workbook
  • Adds headings of the data that will be furnished in the workbook
  • The for-loop loops through the links and:
    • Opens up the link
    • Scrapes the data:
      • Phone Number
      • Milage
      • Year
      • Transmission
      • Price
      • Engine Capacity
      • Color
      • Registered City
    • Writes the scraped data in the Workbook with the ad's link as well.
  • These steps are done till the list is empty.
  • And finally, saves the Workbook with the name "output".

The source code is as follows:

Python
'''
  - The last function takes the returned list from the upper function.
  - A new workbook is created and a sheet is added by the name of Report.
  - The function scrapes the required data:
        phone number
        milage
        car_year
        transmission
        price
        engine_capacity
        color
        registration
    and finally writes it to the worksheet with the ads' link as well.
'''

def scrape_output(links):
    style = xlwt.easyxf('font: bold 1')
    sheet1 = workbook.add_sheet('Report')
    sheet1.write(0, 0, 'Phone Number', style)
    sheet1.write(0, 1, 'Milage', style)
    sheet1.write(0, 2, 'Year', style)
    sheet1.write(0, 3, 'Transmission', style)
    sheet1.write(0, 4, 'Price', style)
    sheet1.write(0, 5, 'Engine Capacity', style)
    sheet1.write(0, 6, 'Color', style)
    sheet1.write(0, 7, 'Registration', style)
    sheet1.write(0, 8, 'Link', style)
    row = 1
    for link in links:
        driver.get(link)
        driver.find_element(By.XPATH, '//button[@class = 
                "btn btn-large btn-block btn-success phone_number_btn"]').click()
        phone_number = driver.find_element(By.XPATH, '//*[@id="scrollToFixed"]/
                                              div[2]/div[1]/button[1]/span').text
        milage = driver.find_element(By.XPATH, 
                       '//*[@id="scroll_car_info"]/table/tbody/tr/td[2]/p').text
        car_year = driver.find_element(By.XPATH, 
         '/html/body/div[2]/section[2]/div/div[2]/div[1]/div/table/tbody/tr/td[1]/p').text
        transmission = driver.find_element(By.XPATH, 
                      '//*[@id="scroll_car_info"]/table/tbody/tr/td[4]/p').text
        price = driver.find_element(By.XPATH, 
                      '//*[@id="scrollToFixed"]/div[2]/div[1]/div/strong').text
        engine_capacity = driver.find_element(By.CSS_SELECTOR, 
                      '#scroll_car_detail > li:nth-child(8)').text
        color = driver.find_element(By.CSS_SELECTOR, 
                      '#scroll_car_detail > li:nth-child(4)').text
        registration = driver.find_element(By.CSS_SELECTOR, 
                      '#scroll_car_detail > li:nth-child(2)').text

        sheet1.write(row, 0, phone_number)
        sheet1.write(row, 1, milage)
        sheet1.write(row, 2, car_year)
        sheet1.write(row, 3, transmission)
        sheet1.write(row, 4, price)
        sheet1.write(row, 5, engine_capacity)
        sheet1.write(row, 6, color)
        sheet1.write(row, 7, registration)
        sheet1.write(row, 8, link)
        row += 1
    workbook.save('output.xls')

Finally, the last bits of code.

In the try block, I'm chaining scrape_output() and get_car_links().

Finally, it closes the browser.

Python
try:
    print('Starting')
    car_links = get_car_links()
    scrape_output(car_links)
finally:
    print('Done')
    driver.quit()

Note

This project can also be viewed on my GitHub repo.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)