Building a Web Scraper with CAPTCHA Solving: A Step-by-Step Guide

Srikar V
10 min read · Aug 2, 2024


Web Scraping with Python

In this post, I’ll guide you through building a web scraper using Selenium to extract student semester results, tackling the challenges posed by CAPTCHAs along the way. The project involved solving image CAPTCHAs with a third-party solver and working around site-specific obstacles such as alert pop-ups and connection cool-downs. Let’s dive into the details!

Overview

The objective was to develop a web scraper that retrieves semester results for students by entering their University Seat Number (USN) and solving an image CAPTCHA. Because the website gates its results behind a CAPTCHA challenge, I integrated a third-party CAPTCHA solver, TrueCaptcha, into my Selenium-based scraper. This solution was built for my project, EduInsights.

Project Setup

Dependencies

To get started, you need the following Python packages:

  1. selenium: For web automation
  2. pydantic: For data validation (if needed)
  3. requests: To interact with the CAPTCHA-solving API
  4. dotenv: For managing environment variables

Install the above packages using pip:

pip install selenium pydantic requests python-dotenv
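The CAPTCHA-solving code further below reads its API key from an environment variable via python-dotenv. A minimal .env file at the project root could look like this (the TRUE_CAPTCHA_API_KEY name matches the variable used later in this post; the value shown is a placeholder):

# .env
TRUE_CAPTCHA_API_KEY=your_truecaptcha_api_key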

Initializing the WebDriver

Before initializing a WebDriver, you need to download the ChromeDriver that matches your version of Chrome. You can download the appropriate version from the ChromeDriver site.

Since I used Selenium for browser automation, here’s how you can set up and initialize the Chrome WebDriver:

from selenium import webdriver
from selenium.common.exceptions import (
    SessionNotCreatedException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver.chrome.service import Service


def initialise_driver():
    """
    Initialize a Chrome WebDriver instance with specific options.

    Returns:
        WebDriver: An instance of Chrome WebDriver.

    Raises:
        WebDriverException: If a WebDriver-related error occurs.
        PermissionError: If there's a permission error while accessing files.
        FileNotFoundError: If the specified ChromeDriver executable is not found.
        SessionNotCreatedException: If a new WebDriver session cannot be created.
        TimeoutException: If a timeout occurs while initializing the WebDriver.
    """
    try:
        # Set up the WebDriver service using the ChromeDriver executable path
        service = Service("/usr/local/bin/chromedriver-linux64/chromedriver")

        # Set up ChromeOptions to configure the Chrome browser instance
        options = webdriver.ChromeOptions()
        options.add_argument("--no-sandbox")  # Disable sandbox mode for headless browsing
        options.add_argument("--disable-dev-shm-usage")  # Disable /dev/shm usage for headless browsing
        options.add_argument("--headless")  # Run Chrome in headless mode (without GUI)

        # Launch the Chrome browser with the specified service and options
        driver = webdriver.Chrome(service=service, options=options)

        return driver

    except (
        WebDriverException,
        PermissionError,
        FileNotFoundError,
        SessionNotCreatedException,
        TimeoutException,
    ) as e:
        print(f"An error occurred while initializing the WebDriver: {e}")

    except Exception as e:
        print(f"An unexpected error occurred: {e}")

In this code, the initialise_driver function sets up the Chrome WebDriver for Selenium. It configures the WebDriver service with the path to the ChromeDriver executable and applies several options for headless browsing. Headless mode means the browser runs without a GUI, which makes it lighter and better suited to running in the background or on a server.

The function includes a try-except block to handle any exceptions that might occur during initialization, ensuring that errors related to WebDriver setup are managed gracefully.
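As a quick smoke test, here’s a minimal usage sketch: it assumes the ChromeDriver path in the function above is valid on your machine, opens a page, prints its title, and shuts the browser down cleanly.

driver = initialise_driver()
if driver:
    driver.get("https://example.com")  # any reachable page works for a smoke test
    print(driver.title)                # should print "Example Domain"
    driver.quit()                      # always release the browser process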

VTU Website

The primary objective of this project is to scrape student semester results from the VTU (Visvesvaraya Technological University) results website. The URL for the VTU results page is available through the official VTU portal but may change periodically.

VTU Results Website

To ensure effective web scraping, follow these steps to analyze the website and understand the scraping requirements:

  1. Access the Website: Navigate to the VTU results website to understand its layout and structure. This site typically contains input fields for the University Seat Number (USN) and CAPTCHA to verify that the request is made by a human.
  2. Identify Key Elements:
  • USN Input Field: Locate the HTML element where the USN needs to be entered.
  • CAPTCHA Input Field: Identify where the CAPTCHA code must be entered.
  • Submit Button: Find the button that submits the request to retrieve the results.
  • Results Section: Understand where the results are displayed after submission.

  3. Handle CAPTCHAs: The website uses CAPTCHAs to prevent automated access. A third-party CAPTCHA solver, TrueCaptcha, is integrated into the scraper to handle these challenges.

  4. Examine the HTML Structure: Inspect the HTML to identify elements and their attributes that are essential for scraping. You will need to find the exact XPath or CSS selectors for:

  • Student Details: Such as name and USN.
  • Marks and Subjects: Information about each subject, including codes and marks.

By thoroughly analyzing the VTU results website, you can design your web scraping strategy to collect and manage student results while handling CAPTCHA challenges effectively.

Finding elements by inspecting the page
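Once you’ve identified these elements in the inspector, locating them with Selenium is straightforward. The sketch below is illustrative: the locators shown (the lns and captchacode field names and the submit button ID) are the ones used later in this post, but re-verify them in the inspector, since the page structure can change.

from selenium.webdriver.common.by import By

# "driver" is the WebDriver instance returned by initialise_driver()
usn_field = driver.find_element(By.NAME, "lns")              # USN input field
captcha_field = driver.find_element(By.NAME, "captchacode")  # CAPTCHA input field
submit_button = driver.find_element(By.ID, "submit")         # form submit button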

Solving CAPTCHAs

To handle CAPTCHAs on the VTU results website, we use the TrueCaptcha service. Here’s how you can integrate TrueCaptcha into your web scraping project:

import base64
import requests
import os
from dotenv import load_dotenv

load_dotenv()

apikey = os.getenv("TRUE_CAPTCHA_API_KEY")


def solve_captcha(imagePath):
    # Read the CAPTCHA image and encode it as a Base64 string
    with open(imagePath, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("ascii")

    url = "https://api.apitruecaptcha.org/one/gettext"

    data = {
        "userid": "your_email@example.com",
        "apikey": apikey,
        "data": encoded_string,
        "mode": "auto",
        "len_str": "6",
    }
    response = requests.post(url=url, json=data, timeout=5)
    data = response.json()
    return data["result"]

TrueCaptcha's API requires the following parameters:

  • userid: Email address of the account making the call.
  • apikey: Secret API key for the TrueCaptcha account.
  • data: Base64 encoding of the CAPTCHA image.
  • mode: Whether to use human or AI solving (human | default | auto).
  • len_str: Length of the CAPTCHA code, if it is fixed.
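The only response field the scraper relies on is result, which holds the decoded text (as the return data["result"] line above shows). A quick usage sketch, with a hypothetical file name and output:

captcha_text = solve_captcha("captcha.png")  # hypothetical saved CAPTCHA image
print(captcha_text)                          # e.g. "X7K2PQ" for a 6-character CAPTCHA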

To automate the extraction and solving of CAPTCHAs in our web scraper, we use Selenium to capture the CAPTCHA image from the website and then solve it with the TrueCaptcha service.

import json
import time
from typing import Optional, Tuple

from pydantic import HttpUrl
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

import webExtractor.trueCaptcha as trueCaptcha
from webExtractor.driver import initialise_driver


# Function to solve captcha using trueCaptcha
def solve_captcha(driver) -> str:
    # Find the captcha image element on the webpage
    div_element = driver.find_element("xpath", '//*[@id="raj"]/div[2]/div[2]/img')

    # Take a screenshot of the captcha image
    div_element.screenshot(r"/api/webExtractor/captcha.png")

    # Solve the captcha using the trueCaptcha solver
    captcha = trueCaptcha.solve_captcha(
        "/api/webExtractor/captcha.png",
    )

    return captcha

In the code above, we locate the CAPTCHA image element on the webpage using its XPath. XPath is a query language for selecting nodes in an XML or HTML document, and here it pinpoints the CAPTCHA image on the page so we can screenshot it and pass the saved image to the solver.

Web Scraping Implementation

In this section, we’ll dive into the core implementation of our web scraper, designed to extract student semester results from a university website. This guide is divided into manageable parts for clarity and better understanding.

Function Signature & Docstring

Below is the function signature and detailed docstring for the scrape_results function. This function is central to our web scraping project, designed to extract student semester results by interacting with a university results website.

async def scrape_results(
    USN: str, result_url: HttpUrl, driver
) -> Tuple[Optional[str], int]:
    """
    Scrape the results of a student with the given USN.

    Args:
        USN (str): The University Seat Number (USN) of the student.
        result_url (HttpUrl): The link to the VTU results website.
        driver: WebDriver object representing the browser driver.

    Returns:
        str: JSON string representing the student's details and marks, or None if the results are not available.
        int: The status code of the request.
            0: Success
            1: Invalid USN or non-existent USN
            2: Invalid captcha
            3: Connection timeout
            4: Connection refused
            5: Other WebDriverException
            6: Other Exception
            >10: 10 + reattempts for invalid captcha
            >20: 20 + reattempts for connection timeout

    Description:
        This function navigates to the VTU results website, fills in the USN and captcha
        fields, submits the form, and retrieves the student's results if available. It
        handles cases such as invalid captcha codes and alerts indicating unavailable
        results. The student's details and marks are extracted, formatted into a dictionary,
        converted to a JSON string, and returned.
    """

Main Scraping Loop

Here’s how you can structure the main scraping loop for your project. The loop repeatedly attempts to scrape a student’s results until it succeeds or a retry threshold is reached.

print("Scraping results for USN:", USN, flush=True)
invalid_count = 0
while True:
try:
# Navigate to the VTU results website
url = str(result_url)
driver.get(url)

# solve the captcha
captcha = solve_captcha(driver)

# Refresh captcha if length is not 6
if len(captcha) != 6:
captcha = refresh_captcha(driver)
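The refresh_captcha helper called above isn’t shown in this post. As a rough sketch, assuming that simply reloading the page serves a new CAPTCHA image, it could look like this:

def refresh_captcha(driver) -> str:
    # Hypothetical helper: reload the page to get a fresh CAPTCHA image,
    # then run the solver again. Adapt this if the site exposes a dedicated
    # "refresh captcha" control instead.
    driver.refresh()
    return solve_captcha(driver)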

Filling Form Fields & Submitting

To fill in the USN and CAPTCHA fields and submit the form on the VTU results website, you can use the following code. This snippet ensures that the necessary elements are present and interactable before proceeding with the form submission.

            # Fill in the USN and captcha fields
            usn_text_field = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.NAME, "lns"))
            )
            usn_text_field.send_keys(USN)

            captcha_text_field = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.NAME, "captchacode"))
            )
            captcha_text_field.send_keys(captcha)

            submit_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.ID, "submit"))
            )
            submit_button.click()

Handling Alerts

After submitting the form, it’s crucial to handle any alerts that may appear on the VTU results website. Here’s how you can manage different types of alerts and take appropriate actions:

            # Check for alerts
            if EC.alert_is_present()(driver):
                alert = driver.switch_to.alert
                if (
                    alert.text
                    == "University Seat Number is not available or Invalid..!"
                ):
                    print(f"Invalid USN {USN}", flush=True)
                    alert.accept()
                    return None, 1
                elif alert.text == "Invalid captcha code !!!":
                    print("Invalid captcha code for " + USN, flush=True)
                    print("Reattempting for " + USN, flush=True)
                    alert.accept()
                    invalid_count += 1
                    if invalid_count == 3:
                        return None, 2
                    continue
                elif alert.text == "Please check website after 2 hour !!!":
                    print("Website cool down...", flush=True)
                    alert.accept()
                    print("Reinitialising driver to bypass cool down...", flush=True)
                    print("Reattempting after 10sec...", flush=True)
                    time.sleep(10)
                    driver = initialise_driver()
                    continue

Extracting Student Data

Once the form is successfully submitted and no alerts are present, we proceed to extract the student’s details and marks from the results page.

The results page after the successful submission looks like this:

Don’t judge my results, please…

Below is the code and explanation for this step:

            else:
                # Wait for the student details to load
                WebDriverWait(driver, 4).until(
                    EC.presence_of_element_located(
                        (
                            By.XPATH,
                            "/html/body/div[2]/div[2]/div[1]/div/div[2]/div[2]/div[1]/div/div/div[1]/div/table/tbody/tr[1]/td[2]",
                        )
                    )
                )

                # Extract student details and marks
                usn_element = driver.find_element(
                    By.XPATH,
                    "/html/body/div[2]/div[2]/div[1]/div/div[2]/div[2]/div[1]/div/div/div[1]/div/table/tbody/tr[1]/td[2]",
                )
                stud_element = driver.find_element(
                    By.XPATH,
                    "/html/body/div[2]/div[2]/div[1]/div/div[2]/div[2]/div[1]/div/div/div[1]/div/table/tbody/tr[2]/td[2]",
                )
                table_element = driver.find_element(
                    By.XPATH,
                    "/html/body/div[2]/div[2]/div[1]/div/div[2]/div[2]/div[1]/div/div/div[2]/div/div/div[2]/div",
                )
                sub_elements = table_element.find_elements(By.XPATH, "div")
                num_sub_elements = len(sub_elements)
                stud_text = stud_element.text
                usn_text = usn_element.text
                print("Student Name: " + stud_text + " | USN: " + usn_text.upper())

                # Extract marks for each subject
                marks_list = []
                for i in range(2, num_sub_elements + 1):
                    marks_data = []
                    for j in range(1, 7):
                        details = driver.find_element(
                            By.XPATH,
                            "/html/body/div[2]/div[2]/div[1]/div/div[2]/div[2]/div[1]/div/div/div[2]/div/div/div[2]/div/div["
                            + str(i)
                            + "]/div["
                            + str(j)
                            + "]",
                        )
                        marks_data.append(details.text)
                    marks_details = {
                        "Subject Code": marks_data[0],
                        "Subject Name": marks_data[1],
                        "INT": marks_data[2],
                        "EXT": marks_data[3],
                        "TOT": marks_data[4],
                        "Result": marks_data[5],
                    }

                    marks_list.append(marks_details)

                marks_list.sort(key=lambda x: x["Subject Code"])

                # Construct the student data dictionary
                student_data = {
                    "USN": usn_text.upper(),
                    "Name": stud_text.upper(),
                    "Marks": marks_list,
                }

                # Convert the dictionary to a JSON string
                student_data = json.dumps(student_data, indent=4)

                # Report reattempts (if any) through the status code
                if invalid_count > 0:
                    return student_data, 10 + invalid_count
                elif error_count > 0:
                    return student_data, 20 + error_count
                else:
                    return student_data, 0

Example JSON response of the student data:

{
    "USN": "1RV17CS001",
    "Name": "JOHN DOE",
    "Marks": [
        {
            "Subject Code": "15CS51",
            "Subject Name": "Management and Entrepreneurship",
            "INT": "20",
            "EXT": "75",
            "TOT": "95",
            "Result": "PASS"
        },
        {
            "Subject Code": "15CS52",
            "Subject Name": "Computer Networks",
            "INT": "18",
            "EXT": "70",
            "TOT": "88",
            "Result": "PASS"
        },
        {
            "Subject Code": "15CS53",
            "Subject Name": "Database Management Systems",
            "INT": "22",
            "EXT": "80",
            "TOT": "102",
            "Result": "PASS"
        },
        {
            "Subject Code": "15CS54",
            "Subject Name": "Automata Theory and Computability",
            "INT": "19",
            "EXT": "65",
            "TOT": "84",
            "Result": "PASS"
        },
        {
            "Subject Code": "15CS55",
            "Subject Name": "Operating Systems",
            "INT": "21",
            "EXT": "78",
            "TOT": "99",
            "Result": "PASS"
        },
        {
            "Subject Code": "15CS56",
            "Subject Name": "Microprocessor",
            "INT": "20",
            "EXT": "72",
            "TOT": "92",
            "Result": "PASS"
        }
    ]
}

This JSON structure provides a clear and organized format for the student’s results, making it easier to interpret and use the data programmatically.
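Since scrape_results returns the data as a JSON string, downstream code can parse it back with json.loads before working with it. A small sketch:

import json

student = json.loads(student_data)  # student_data is the JSON string returned above
print(student["Name"], student["USN"])
for subject in student["Marks"]:
    print(subject["Subject Code"], subject["Subject Name"], subject["Result"])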

Handling Exceptions

In this part of the code, we catch and handle various exceptions that may occur during the web scraping process. Proper exception handling is crucial to ensure that the scraper can handle issues gracefully and attempt to recover when possible. Here’s a detailed breakdown:

        except WebDriverException as e:
            if "ERR_CONNECTION_TIMED_OUT" in str(e):
                print("Connection timed out.", flush=True)
                error_count += 1
                if error_count == 3:
                    return None, 3
                continue
            elif "ERR_CONNECTION_REFUSED" in str(e):
                print("Connection refused.", flush=True)
                time.sleep(5)
                driver = initialise_driver()
                refused_count += 1
                if refused_count == 3:
                    return None, 4
                continue
            else:
                # Handle other WebDriverExceptions if needed
                print("WebDriverException:", e, flush=True)
                return None, 5
        except Exception as e:
            # Handle other exceptions if needed
            print("Exception:", e, flush=True)
            return None, 6

Status Codes:

  • 0: Success
  • 1: Invalid USN or non-existent USN
  • 2: Invalid CAPTCHA
  • 3: Connection Timeout
  • 4: Connection Refused
  • 5: Other WebDriverException
  • 6: Other Exception
  • 10 + X: Success after X reattempts due to invalid CAPTCHA
  • 20 + X: Success after X reattempts due to connection timeout

This approach ensures that the scraper can handle different types of errors effectively and provides a clear way to manage retry attempts and error reporting.
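Putting it all together, here’s a hedged sketch of how the scraper might be driven for a batch of USNs and how the status codes can be interpreted. The module path for scrape_results, the result URL, and the USNs are placeholders; since scrape_results is an async function, it is run here with asyncio.

import asyncio

from webExtractor.driver import initialise_driver
from webExtractor.scraper import scrape_results  # hypothetical module path for scrape_results

async def main():
    driver = initialise_driver()
    result_url = "https://results.vtu.ac.in/"  # placeholder: use the current VTU results URL
    usns = ["1XX21CS001", "1XX21CS002"]        # example USNs

    for usn in usns:
        data, status = await scrape_results(usn, result_url, driver)
        if status == 0 or status > 10:
            print(f"{usn}: scraped successfully (status {status})")
        else:
            print(f"{usn}: failed with status {status}")

    driver.quit()

asyncio.run(main())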

Conclusion

The student results scraping project automates the retrieval of academic results from the VTU website using Selenium and Python. By combining browser automation, third-party CAPTCHA solving, and comprehensive error handling, the scraper navigates the site, resolves CAPTCHAs, and extracts the key data: student details and per-subject marks. This minimizes manual effort, recovers from common failures such as invalid CAPTCHAs and connection issues, and scales to batch-processing student results.

Thank you for taking the time to explore my student results scraping project. Your feedback is immensely valuable, and I would greatly appreciate any thoughts or suggestions for improvement. Please feel free to share your insights and comments.
