Preserving the Soul of Urdu Poetry: Scraping Rekhta.org for Ghazals with Python’s Selenium & BeautifulSoup

Saad Sohail

--

A Deep Dive into Urdu Ghazal Data with Python to Uncover a Timeless Literary Legacy

Urdu Ghazals are not just verses; they embody a rich cultural legacy. Rekhta.org, with its vast repository of Urdu literature, plays a crucial role in preserving this art. However, manually cataloging thousands of Ghazals is as daunting as a librarian trying to index an ever-growing collection of ancient manuscripts. Automation comes to the rescue!

In this blog post, I’ll guide you through my Python project designed to scrape Ghazal data from Rekhta.org. You’ll learn about the project’s purpose, technical details, challenges, and insights, along with a step-by-step breakdown of the code.

Image From the Author

Introduction

The Cultural Significance of Ghazals

Ghazals are soulful expressions of love, loss, and longing. Rekhta.org preserves this poetic tradition, making it accessible to scholars, poets, and enthusiasts. By automating data extraction from this repository, we not only preserve the beauty of Urdu poetry digitally but also enable deeper cultural and linguistic analysis.

The Problem and the Solution

Manually collecting data from Rekhta.org is time-consuming and error-prone. With Python, we can efficiently scrape, clean, and store Ghazal data, paving the way for further analysis such as natural language processing (NLP) and cultural research.

Project Overview

Objectives

  • Dataset Creation: Automatically extract Ghazals, poet names, and relevant metadata.
  • Poet Analysis: Compare poetic styles and thematic trends across different poets.
  • Cultural Insights: Analyze patterns such as common meters and poetic structures.
  • Digital Preservation: Ensure that the timeless beauty of Urdu poetry is preserved in a digital format.

Tech Stack

  • Python: The core programming language.
  • Requests & BeautifulSoup: For HTTP requests and HTML parsing.
  • Selenium: To handle dynamic content loading.
  • CSV Module: To store the scraped data.

Ethical Scraping Practices: We strictly adhere to Rekhta.org’s terms of service, implement rate limiting between requests, and keep the scraping non-disruptive.
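To make the rate limiting concrete, here is a minimal sketch of a polite fetch helper. The helper name (fetch_politely), the fixed two-second delay, and the robots.txt check are illustrative assumptions, not part of the original scraper:

import time
import requests
from urllib import robotparser

ROBOTS_URL = "https://www.rekhta.org/robots.txt"
DELAY_SECONDS = 2.0  # illustrative pause between requests

robots = robotparser.RobotFileParser()
robots.set_url(ROBOTS_URL)
robots.read()

def fetch_politely(url):
    """Fetch a URL only if robots.txt allows it, then pause before returning."""
    if not robots.can_fetch("*", url):
        raise RuntimeError(f"robots.txt disallows fetching {url}")
    response = requests.get(url, timeout=30)
    time.sleep(DELAY_SECONDS)  # simple fixed delay to avoid hammering the server
    return response

Dropping fetch_politely(url) in wherever the script calls requests.get(url) would keep the same flow while spacing out requests.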

Technical Deep Dive

How the Scraper Works

Imagine the process as a librarian cataloging books on shelves: first, you navigate through the library’s sections (web pages), then you identify the relevant books (Ghazal pages), and finally, you record the details (data extraction). This scraper follows a similar flow:

  1. Navigating Rekhta’s Structure:
    Start at the main Ghazal page, then follow URL patterns to access individual categories.
  2. Extracting Data:
    Use Selenium to scroll through pages (loading dynamic content) and BeautifulSoup to parse the HTML for poet names and Ghazal URLs.
  3. Fetching and Processing Ghazals:
    Visit each Ghazal URL, extract the verses, and format them for storage.
  4. Storing Data:
    Write the extracted data to a CSV file for later analysis.

Code Breakdown

To help you follow along, the code is divided into manageable chunks.

Imports and Setup

First, install the necessary libraries:

pip install requests beautifulsoup4 selenium

Now, include the following imports:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from selenium import webdriver
import time
import csv

Explanation:

  • Requests: For making HTTP requests.
  • BeautifulSoup: To parse HTML content.
  • Selenium: To automate browser interactions, especially for dynamically loaded content (a quick driver setup check is sketched after this list).
  • CSV: To store the scraped data.
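
Before running the full scraper, it helps to confirm that Selenium can actually launch Chrome. The script assumes ChromeDriver is installed and on PATH; recent Selenium 4 releases can also resolve the driver automatically. A minimal check (the headless flag is optional and simply keeps a browser window from popping up):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # optional: run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.rekhta.org/shayari/ghazals")
    print("Loaded page title:", driver.title)
finally:
    driver.quit()

If this prints a title and exits cleanly, the browser automation side of the tech stack is ready.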

Processing Pages & Extracting URLs

The ghazalParse function processes each Ghazal category page: it loads the page, scrolls until all content is visible, and extracts poet names and Ghazal URLs.

def ghazalParse(ghazalURL, output_file="ghazals.csv"):
    print(f"\nStarting to parse {len(ghazalURL)} ghazal pages...")
    driver = webdriver.Chrome()  # Ensure ChromeDriver is installed and in PATH
    try:
        for index, url in enumerate(ghazalURL, 1):
            print(f"\nProcessing page {index}/{len(ghazalURL)}: {url}")
            driver.get(url)

            print("Scrolling page to load all content...")
            # Scroll to bottom until all content is loaded
            last_height = driver.execute_script("return document.body.scrollHeight")
            while True:
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)  # Allow time for content to load
                new_height = driver.execute_script("return document.body.scrollHeight")
                if new_height == last_height:
                    break
                last_height = new_height

            print("Extracting poets and URLs...")
            # Parse the page with BeautifulSoup
            html = driver.page_source
            soup = BeautifulSoup(html, "html.parser")
            parent_div = soup.find("div", class_="mainContentBody")

            # Extract poet names and ghazal URLs
            poets = parent_div.find_all("h5") if parent_div else []
            urls = parent_div.find_all("a") if parent_div else []
            poetsList = []
            shairiURL = set()
            shairiText = []

            print(f"Found {len(poets)} poets on this page")
            for poet in poets:
                poetsList.append(poet.text)

            print("Extracting ghazal URLs...")
            for a_tag in urls:
                class_name = a_tag.get("class")
                if class_name is None:
                    href = a_tag.get("href")
                    shairiURL.add(href)

Explanation:

  • Scrolling: Selenium is used to scroll the page so that dynamically loaded content appears (a reusable helper version of this loop is sketched after this list).
  • HTML Parsing: BeautifulSoup locates the main content and extracts poet names and URLs.
  • Data Collection: Poet names are stored in poetsList, and Ghazal URLs in the shairiURL set.
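
If you end up reusing the scrolling logic on other pages, it can be pulled into a small helper with a safety cap, so a page that keeps growing can never loop forever. This is the same height-comparison technique the function already uses; the helper name and the max_rounds cap are additions for illustration:

import time

def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing, or until max_rounds is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height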

Fetching Individual Ghazals & Writing to CSV

After collecting URLs, we fetch each Ghazal’s content, process the verses, and store everything in a CSV file.

            print(f"Processing {len(shairiURL)} individual ghazals...")
for i, href in enumerate(shairiURL, 1):
print(f"Fetching ghazal {i}/{len(shairiURL)}: {href}")
# Getting text from the ghazal
response = requests.get(href)
html = response.text
soup = BeautifulSoup(html, "html.parser")
ghazalText = soup.find_all(class_="c")
ghazalText.remove(ghazalText[0]) # Remove extra text present on the page
shairiText.append(ghazalText)

print("Processing and formatting ghazal texts...")
shairiText_str = [] # For combining verses of a ghazal
ghazalsText_str = [] # For storing complete ghazal texts

for i in range(len(shairiText)):
for text in shairiText[i]:
shairiText_str.append(text.text) # Each verse extracted from span
tempText = "\n".join(shairiText_str) # Combine verses with newline separation
ghazalsText_str.append(tempText)
shairiText_str.clear() # Clear list for the next ghazal

print(f"Writing {len(poetsList)} entries to CSV file...")
# Write to CSV
with open(output_file, mode="a", newline="", encoding="utf-8") as file:
writer = csv.writer(file)
writer.writerow(["Poet", "Poetry Text"]) # Write header
for poet, poetry_text in zip(poetsList, ghazalsText_str):
writer.writerow([poet, poetry_text])

print(f"Successfully processed page {index}")

except Exception as e:
print(f"\nERROR: An error occurred: {e}")
finally:
driver.quit()
print("\nBrowser session closed")

Explanation:

  • Fetching Content: For each Ghazal URL, we use Requests and BeautifulSoup to extract the text.
  • Processing Data: The verses are concatenated to form complete Ghazals.
  • Storing Data: The poet’s name and Ghazal text are written to a CSV file in UTF-8 encoding so the Urdu script is preserved correctly (a sketch for loading the file back for analysis follows).
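
Once the file exists, loading it back for analysis is straightforward. One detail to keep in mind: because the header row is written inside the per-page loop and the file is opened in append mode, the CSV ends up with a repeated "Poet,Poetry Text" header for every category page. A minimal sketch using pandas (pandas is not part of the original tech stack, so treat it as an optional add-on) that drops those repeated headers:

import pandas as pd

df = pd.read_csv("ghazals.csv", encoding="utf-8")

# Rows whose Poet column literally reads "Poet" are the repeated per-page headers.
df = df[df["Poet"] != "Poet"].reset_index(drop=True)

print(df.shape)   # (number of ghazals, 2)
print(df.head())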

The Main Function

The main function kickstarts the scraping process by fetching the base page and extracting category URLs.

def main():
    print("Starting Rekhta Ghazal Scraper...")
    base_url = "https://www.rekhta.org/shayari/ghazals"
    print(f"Fetching main page: {base_url}")

    response = requests.get(base_url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")

    print("Extracting ghazal category URLs...")
    ghazals_url = []
    for link in links:
        href = link.get("href")
        if href:
            absolute_url = urljoin(base_url, href)
            # Filter to get only relevant Ghazal category links
            if "/shayari" in absolute_url and "?" not in absolute_url:
                if absolute_url != base_url:
                    ghazals_url.append(absolute_url)

    print(f"Found {len(ghazals_url)} ghazal category pages")
    ghazalParse(ghazals_url)
    print("\nScript execution completed!")


if __name__ == "__main__":
    main()

Explanation:

  • Starting Point: The script fetches the main Ghazal page from Rekhta.org.
  • URL Extraction: It identifies and collects the relevant category links with urljoin (a small illustration of the filter follows this list).
  • Execution: These URLs are then passed to the ghazalParse function for further processing.
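
To see what the filter in main keeps and discards, here is a small illustration of urljoin combined with the same conditions. The example hrefs are made up for demonstration and are not taken from the live site:

from urllib.parse import urljoin

base_url = "https://www.rekhta.org/shayari/ghazals"

hrefs = [
    "/shayari/ghazals-on-love",      # relative link under /shayari -> kept (hypothetical path)
    "https://www.rekhta.org/poets",  # no "/shayari" in the URL -> skipped
    "/shayari/ghazals?page=2",       # contains a query string -> skipped
    "/shayari/ghazals",              # resolves to the base URL itself -> skipped
]

for href in hrefs:
    absolute_url = urljoin(base_url, href)
    keep = (
        "/shayari" in absolute_url
        and "?" not in absolute_url
        and absolute_url != base_url
    )
    print(f"{absolute_url} -> {'keep' if keep else 'skip'}")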

Conclusion

This project is more than just a data scraper; it’s a journey into preserving and analyzing the rich tapestry of Urdu poetry. By automating the extraction of Ghazals from Rekhta.org, we open up exciting opportunities for linguistic research, cultural studies, and advanced NLP applications.

If you’re interested in contributing or expanding on this project, check out the GitHub repository and join the conversation. Happy coding, and may the timeless verses of Ghazals continue to inspire us all!
