Today I’m very excited to write my first article. It will focus on one of the fantastic new features that have been added to Power BI recently: The Python Data Source.

The Python data source is part of the Python features currently in preview. Refreshing the dataset in the Power BI Service is currently not working. This means the following steps can only be reproduced in Power BI Desktop.

But first, let me show you the final result because it is amazing to see Power BI doing this:

In order to use the Python components in Power BI Desktop, Python must be installed locally and all the libraries in use must be installed using the Anaconda Prompt.

The goal of this article is to demonstrate how to pull data from a website (web scrapping) using Python in order to report it in Power BI.

Python provides many libraries that make web scrapping easier. With the new release of Python features in Power BI, it is now possible to execute the script and get the data directly from Power BI.

Even though Power BI already has a From Web connector, it is not always working as expected, the release of the new Python Data Source provides the ability to overcome this issue by getting the data from web using a Python script.

As a bonus, we decided to login to a website which uses Cookies (that expire) for authentication.

Not sure if this is possible in Power Query, if yes it would probably be complex.

The website is Opal. This is the website of Sydney’s public transport and the data I will retrieve is the list of all trips I made using my opal card.

Without going into the details of the Python script, here is a brief overview:

  • We use the Selenium library along with Chromedrive.exe to navigate the internet, log in to the desired website and click on Next below the table to retrieve all the pages. Chromedriver.exe must be placed in the folder: C:\Program Files (x86)\Google\Chrome\
  • The BeautifulSoup4 library is used to parse the HTML table
  • Pandas is required to pass a DataFrame to Power BI. A Pandas DataFrame is required for Power BI to find the data in a table form. List of Lists are not supported by Power BI (No documentation being available at the moment, it took me some tries to find out).

For the demonstration purpose, the script only iterates over 5 pages but you can retrieve all the pages available by replacing the Foreach loop with the While True loop (lines 30-31).

The complete script is available here:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
import time
from pandas import DataFrame

#Open Chrome and go to Opal.com.au
driver = webdriver.Chrome("C:\Program Files (x86)\Google\Chrome\chromedriver.exe")
driver.get("https://www.opal.com.au/")

#user access information
username = driver.find_element_by_id("h_username")
username.send_keys("Username")
password = driver.find_element_by_id("h_password")
password.send_keys("Password", Keys.TAB, Keys.ENTER)

time.sleep(2)

driver.get("https://www.opal.com.au/registered/opal-card-transactions/?cardIndex=0")
soup = BeautifulSoup(driver.page_source, 'lxml')

datasets = []

#Get the headings
table = soup.find("table", attrs={"id":"transaction-data"})
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets.append(headings) 

#Extract rows from the HTML table
#while True:
for i in range(0, 5):
    time.sleep(2)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    table = soup.find("table", attrs={"id":"transaction-data"})
    
    # The first tr contains the field names.
    headings = [th.get_text() for th in table.find("tr").find_all("th")] 
    for row in table.find_all("tr")[1:]:
        
        dataset = [td.get_text().replace('\n', '').replace('\t', '').replace('\xad','') for td in row.find_all("td")]
        datasets.append(dataset)
    
    try:
        button_next = driver.find_element_by_id("next")
        button_next.click()
    except:
        break

#Close google Chrome
driver.quit()

#Create Pandas DataFrame
headers = datasets.pop(0)
df = DataFrame(datasets, columns=headers)
df

 

You can run the script from a Python IDE after replacing the values Username and Password. The quotes must be kept (lines 13 and 15)

It is possible to use the script to log into any other site, for this simply replace the URLs and make sure the id of the different HTML elements correspond to the ones of the pages you are referring.

The script is integrated in the Power BI report using the Python data source. You can download the report from the OneDrive here. In order to refresh the report successfully, make sure to change the Username and Password parameters of the report with your own credentials.

Hope you enjoyed as much as I did, thanks for reading!

Alexis has been working in the Business Intelligence domain for the past four years, leading projects end-to-end across multiple sectors, delivering true business value. His areas of specialty include data warehouse design and ETL processes. Recently, Alexis was also involved on projects hosting data in the Microsoft Cloud.