Understanding Pandas Returning Empty Data Frames
As a technical blogger, I have encountered numerous scenarios where a script runs without errors yet produces an empty data frame. In this article, we will dig into why pandas might return an empty data frame, using a real web-scraping question as our example.
Introduction to Pandas
Pandas is a powerful library used for data manipulation and analysis in Python. It provides data structures and functions designed to make working with structured data easy and efficient.
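As a minimal illustration of the symptom this article is about: building a DataFrame from empty columns succeeds silently, which is why a broken scraper "works" but yields no rows.

```python
import pandas as pd

# A DataFrame built from empty lists raises no error --
# pandas happily returns a frame with zero rows.
df = pd.DataFrame({"time": [], "sport": [], "description": []})

print(df.empty)   # True
print(len(df))    # 0
```

The `empty` attribute is the quickest way to detect this condition before going any further.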
Scrape Data Using BeautifulSoup and Selenium
In the provided question, BeautifulSoup and Selenium are used together to scrape data from a JavaScript-heavy website. The goal is to extract the contents of specific columns.
Importing Libraries
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
Setting Up the Browser
# Set up the browser
driver = webdriver.Chrome()
driver.get("https://www.canalplus.com/programme-tv/")
Parsing HTML Content
soup = BeautifulSoup(driver.page_source, 'html.parser')
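BeautifulSoup only sees whatever HTML string it is given. On a JavaScript-heavy site, if driver.page_source is captured before the scripts have rendered the listings, the parser has nothing to match. A small self-contained sketch (the HTML snippets here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snapshots of the same page before and after JS rendering.
before_js = "<html><body><div id='app'></div></body></html>"
after_js = "<html><body><div class='guide___1Ogg9'>19:00 Football</div></body></html>"

# Parsing the pre-render snapshot finds nothing...
print(BeautifulSoup(before_js, "html.parser").select(".guide___1Ogg9"))  # []

# ...while the post-render snapshot matches.
items = BeautifulSoup(after_js, "html.parser").select(".guide___1Ogg9")
print([i.get_text() for i in items])  # ['19:00 Football']
```

In practice this means waiting for the relevant elements (for example with Selenium's WebDriverWait) before grabbing page_source.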
Extracting Data
The provided code snippet attempts to extract data by passing Selenium-style XPath expressions to BeautifulSoup, which is where things go wrong:
time = []
sport = []
description = []

# Bug 1: "guide___1Ogg9" without a leading dot is read as a tag name,
# not a CSS class. It matches nothing, so the loop body never runs.
for item in soup.select("guide___1Ogg9"):
    # Bug 2: BeautifulSoup has no XPath support. It treats the keyword
    # argument find_element_by_xpath=... (a Selenium method name) as an
    # HTML attribute filter, so find_next() always returns None and the
    # else-branch appends "Nan" every time.

    # Programme time
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]'):
        time.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[2]/a/div[1]').text.strip())
    else:
        time.append("Nan")

    # Sport name
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span'):
        sport.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[3]/div/ul/div[2]/ul/li/ul/li[1]/a/div/div[2]/div[1]/span').text.strip())
    else:
        sport.append("Nan")

    # Programme info
    if item.find_next(find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]'):
        description.append(item.find_next(
            find_element_by_xpath='//*[@id="landing_layers_1"]/div/div[2]/div[4]/div[2]/div/div/ul/div[2]/ul/li/ul/li[4]/a/div[2]').text.strip())
    else:
        description.append("Nan")
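The failure mode above is easy to reproduce in isolation. BeautifulSoup's find/find_next methods treat unknown keyword arguments as HTML attribute filters, so passing find_element_by_xpath=... simply searches for a tag whose find_element_by_xpath attribute equals that literal string — and returns None:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><span class='time'>19:00</span></div>", "html.parser")

# bs4 interprets the keyword as: find a tag whose
# find_element_by_xpath attribute equals this string.
result = soup.find(find_element_by_xpath='//*[@id="landing_layers_1"]')
print(result)  # None -- which is why every if-branch above is falsy

# The bs4-native equivalent is a CSS selector:
print(soup.select_one("span.time").get_text())  # 19:00
```

XPath expressions belong to Selenium (driver.find_element); with BeautifulSoup you must use CSS selectors or tag/attribute searches instead.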
Using the API to Fetch Data
In the provided answer, we skip the rendered page entirely and fetch the programme data directly from the site's JSON API.
Importing Libraries
import datetime

import pendulum  # third-party date/time library used below to parse timestamps
import requests
Making a Request to the API
api_url = "https://secure-webtv-static.canal-plus.com/metadata/cpfra/all/v2.2/globalchannels.json"
response = requests.get(api_url).json()
Parsing JSON Response
tv_programme = {
    channel["name"]: [
        [
            e["title"],
            e["subTitle"],
            pendulum.parse(e["timecodes"][0]["start"]).time().strftime("%H:%M"),
            str(datetime.timedelta(
                milliseconds=e["timecodes"][0]["duration"],
            )).split(".")[0],
        ]
        for e in channel["events"]
    ]
    for channel in response["channels"]
}
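The nested comprehension can be hard to follow. Here is the equivalent explicit loop, applied to a made-up sample event in the same shape as the API's JSON (and using only the standard library in place of pendulum):

```python
import datetime

# A hypothetical payload shaped like the API response.
response = {"channels": [{
    "name": "CANAL+",
    "events": [{
        "title": "Football",
        "subTitle": "Ligue 1",
        "timecodes": [{"start": "2023-09-06T19:00:00+02:00",
                       "duration": 5_400_000}],  # milliseconds
    }],
}]}

tv_programme = {}
for channel in response["channels"]:
    rows = []
    for e in channel["events"]:
        start = datetime.datetime.fromisoformat(e["timecodes"][0]["start"])
        duration = datetime.timedelta(milliseconds=e["timecodes"][0]["duration"])
        rows.append([
            e["title"],
            e["subTitle"],
            start.time().strftime("%H:%M"),
            str(duration).split(".")[0],  # drop fractional seconds, if any
        ])
    tv_programme[channel["name"]] = rows

print(tv_programme["CANAL+"])  # [['Football', 'Ligue 1', '19:00', '1:30:00']]
```

Either form builds a dict mapping each channel name to a list of [title, subtitle, start time, duration] rows, ready to feed into a DataFrame.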
Creating a Data Frame
df = pd.DataFrame(tv_programme["CANAL+"], columns=["Title", "Subtitle", "Time", "Duration"])
print("Here is your data. Right I am off to sleep then!")
print(df)
df.to_csv("canalPlusSport.csv", index=False)
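Tying back to the article's theme, it is worth guarding this final step: checking DataFrame.empty before writing the file turns a silently empty CSV into an explicit signal. A small sketch, with hypothetical rows standing in for the scraped data:

```python
import pandas as pd

rows = [["Football", "Ligue 1", "19:00", "1:30:00"]]  # hypothetical extracted rows
df = pd.DataFrame(rows, columns=["Title", "Subtitle", "Time", "Duration"])

if df.empty:
    # An empty frame here means the selectors or the API parsing found nothing.
    raise SystemExit("No rows extracted -- check your selectors or the API response.")

df.to_csv("canalPlusSport.csv", index=False)
print(f"Wrote {len(df)} rows")  # Wrote 1 rows
```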
Conclusion
In this article, we explored why pandas might return an empty data frame. We saw how combining BeautifulSoup and Selenium can silently produce empty results when selectors are wrong or Selenium-style XPath calls are passed to BeautifulSoup, and how fetching from a site's own API can be a simpler and more reliable way to get the data.
Last modified on 2023-09-06