
"Following is an example on how to use Selenium Python to navigate a website and download or scrap relevant information.
Selenium is a free open source automated testing framework used to validate web application across different browsers and platforms.In this post I go on to illustrate how one can navigate a website and perform repetitive task and download documents.
Potential usage:
A) Going to a website such as realestate.com.au, inputting different suburbs and scraping property and real estate data.
B) Downloading documents from a CRM for a given set of customers.
I would be keen to learn in the comments section how you used this approach and whether it proved beneficial.
IMPORTANT NOTE ON CHROME
Chrome driver has a limitation: it raises multiple warnings if you use it to download files that are not browser-embedded, and these warnings cannot be suppressed through Selenium. For example, when downloading a file Chrome might force you to approve the download, so if the aim of your application is to scrape documents then the Chrome driver will fail.
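For comparison, the Chrome preferences people usually try look like the sketch below (the download folder path is a placeholder). In my experience they still do not suppress every prompt, which is why the rest of this post uses Firefox.
from selenium import webdriver
chrome_opts = webdriver.ChromeOptions()
chrome_opts.add_experimental_option("prefs", {
    "download.default_directory": r"C:\downloads",  #PLACEHOLDER FOLDER
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
})
#driver = webdriver.Chrome(options=chrome_opts)
#EVEN WITH THESE PREFERENCES CHROME CAN STILL PROMPT OR BLOCK DOWNLOADS IT DEEMS UNSAFE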
Set Up the Firefox Browser:
The code below shows a working example of how to set up a profile for Mozilla Firefox using the gecko driver. My task involved clicking through multiple screens and downloading attachments, and the Firefox gecko driver proved optimal for this task.
SETTING UP PROFILE AND MIMES FOR DOCUMENT DOWNLOAD:
#PACKAGE IMPORTS FOR SELENIUM
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
#THE FOLLOWING CODE SETS UP A FIREFOX PROFILE TO WORK AROUND AUTOMATED AUTHENTICATION OR TO WORK WITH PROXY SETTINGS
profile = FirefoxProfile()
# THE VARIABLE BELOW IS THE DIRECTORY WHERE FIREFOX WILL SAVE THE DOWNLOADED FILES
dirr = r"C:\document_put_your_folder_location_where_files_are_to_be_downloaded"
options = webdriver.FirefoxOptions()
profile.set_preference('network.negotiate-auth.trusted-uris', 'sap-cip.csda.gov.au:8081')
profile.set_preference("network.negotiate-auth.delegation-uris", 'https://your_url.com.au:8081/**')
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference("browser.download.manager.alertOnEXEOpen", False)
profile.set_preference("browser.download.manager.closeWhenDone", False)
profile.set_preference("browser.download.manager.focusWhenStarting", False)
profile.set_preference("services.sync.prefs.sync.browser.download.useDownloadDir", True)
#SET THE FOLDER LOCATION WHERE DOWNLOADED FILES ARE GOING TO BE SAVED
profile.set_preference("browser.download.dir", dirr)
#MIME SETTING SO THAT THE BROWSER AUTOMATICALLY DOWNLOADS AND NO PROMPTS ARE GENERATED
profile.set_preference("browser.helperApps.neverAsk.saveToDisk","image/svg+xml, image/jpeg, image/jpg, image/png, application/vnd.ms-excel, application/msword, application/x-msg, message/rfc822, application/atom+xml, application/atomsvc+xml, text/new, application/forced-download, application/x-msdownload, application/vnd.ms-outlook, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/outlook, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/x-gzip,text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, attachment/excel, attachment/msg, text/comma-separated-values, text/xml, application/xml, application/pdf, application/msg, application/x-unknown, application/octet-stream, application/vnd.ms-excel.sheet.binary.macroEnabled.12, application/xml, application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12")
profile.set_preference("browser.helperApps.neverAsk.openToDisk","image/svg+xml, image/jpeg, image/jpg, image/png, application/vnd.ms-excel, application/msword, application/x-msg, message/rfc822, application/atom+xml, application/atomsvc+xml,text/new, application/forced-download, application/x-msdownload, application/vnd.ms-outlook, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/outlook, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/x-gzip,text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, attachment/excel, attachment/msg, text/comma-separated-values, text/xml, application/xml, application/pdf, application/msg, application/x-unknown, application/octet-stream, application/vnd.ms-excel.sheet.binary.macroEnabled.12, application/xml, application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12")
#OTHER FIREFOX OPTIONS
profile.set_preference("browser.helperApps.alwaysAsk.force", True)
profile.set_preference("pdfjs.disabled", True)
profile.set_preference("dom.disable_open_during_load", True)
profile.set_preference("dom.webnotifications.enabled", False)
options.set_preference("dom.push.enabled", False)
options.set_preference("dom.webnotifications.enabled", False)
# PATH TO FIREFOX GECKO DRIVER
geckodriver="C:/Program Files/Drivers/geckodriver.exe"
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
# CALL DRIVER AND EXECUTE
driver = webdriver.Firefox(executable_path=geckodriver, firefox_profile=profile, options=options)
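With the profile in place, a quick smoke test confirms the driver launches and picks up the settings. This is just a sketch; the URL is the suburb example used later in the post.
#QUICK SMOKE TEST: OPEN A PAGE AND CONFIRM THE BROWSER STARTS WITH THE PROFILE
driver.get("https://www.realestate.com.au/sold/in-annandale,+nsw+2038%3b/list-1")
print(driver.title)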
The code below (an excerpt from above) defines the MIME types in order to avoid download pop-ups and warnings when scraping documents from a website.
profile.set_preference("browser.helperApps.neverAsk.saveToDisk","image/svg+xml, image/jpeg, image/jpg, image/png, application/vnd.ms-excel, application/msword, application/x-msg, message/rfc822, application/atom+xml, application/atomsvc+xml, text/new, application/forced-download, application/x-msdownload, application/vnd.ms-outlook, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/outlook, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/x-gzip,text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, attachment/excel, attachment/msg, text/comma-separated-values, text/xml, application/xml, application/pdf, application/msg, application/x-unknown, application/octet-stream, application/vnd.ms-excel.sheet.binary.macroEnabled.12, application/xml, application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12")
Packages to Import
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import ElementNotVisibleException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import TimeoutException
from seleniumrequests import Firefox
from seleniumrequests.request import RequestMixin
import os, sys
from os.path import join
from os.path import splitext
import re
import traceback
import collections
import shutil
import time
import datetime  #NEEDED FOR THE TIMESTAMP PRINTED LATER IN THE LOOP
The illustrated code aims to complete the following tasks:
1) Loop through a list of URLs, e.g. suburb listings of sold properties:
https://www.realestate.com.au/sold/in-annandale,+nsw+2038%3b/list-1
https://www.realestate.com.au/sold/in-newtown,+nsw+2042%3b/list-1
2) For each member, navigate and click through a series of web pages to download documents or to scrape data.
3) If you are downloading files, wait until they finish downloading, then rename and move them before the next loop cycle commences.
4) Loop through the list and scrape the web page, download files, or do both.
Below I have loosely described how to achieve this and how to handle the various scenarios one will face:
First step: if you are downloading documents from the website, you need to set the folder where the files are downloaded and staged, and the "moved to" folder where the files are moved after each loop cycle. In my example I rename the files once downloaded and then move them to the "moved to" folder; since I might be downloading thousands of documents, any files left in the download folder would be renamed and processed again on every cycle. A minimal sketch of this step follows.
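The post does not include the file-handling code itself, so here is a minimal sketch of that wait/rename/move cycle. The function name, the .part polling trick (Firefox keeps a .part file while a download is in flight) and the renaming scheme are my assumptions; adapt them to your folders.
def wait_rename_move(download_dir, moved_dir, new_name, timeout=120):
    #POLL UNTIL NO FIREFOX .part FILES REMAIN OR THE TIMEOUT EXPIRES (SKETCH)
    end = time.time() + timeout
    files = []
    while time.time() < end:
        files = os.listdir(download_dir)
        if files and not any(f.endswith(".part") for f in files):
            break
        time.sleep(1)
    #RENAME EACH DOWNLOADED FILE AND MOVE IT TO THE "MOVED TO" FOLDER
    for i, f in enumerate(sorted(files)):
        ext = splitext(f)[1]
        shutil.move(join(download_dir, f), join(moved_dir, "%s_%d%s" % (new_name, i, ext)))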
We have also broken down the URL so that we can add a looping element; as per my example above, we can loop through suburbs on realestate.com.au.
The loop will then iterate through a list or a data frame column's values.
What will the list and loop look like?
newtown,+nsw+2042
annandale,+nsw+2038
erskineville,+nsw+2043
In our example, where we are scraping data from realestate.com.au, this list is used to create a new URL for each loop iteration, such as "https://www.realestate.com.au/sold/in-annandale,+nsw+2038%3b/list-1". This is shown in the embedded code below; first, a rough sketch of how such a list might be built.
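As a sketch of what that list might look like in code (pandas and the column layout are my assumptions; the main loop below reads the value at index 1 of each row):
import pandas as pd  #ASSUMPTION: THE SUBURB LIST LIVES IN A DATA FRAME
df = pd.DataFrame({"id": [1, 2, 3],
                   "suburb": ["newtown,+nsw+2042",
                              "annandale,+nsw+2038",
                              "erskineville,+nsw+2043"]})
test = df.values.tolist()  #E.G. [[1, 'newtown,+nsw+2042'], ...]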
So let's begin:
#FOLDER LOCATIONS: IF YOU ARE DOWNLOADING DOCUMENTS, FILES ARE SAVED AND PROCESSED IN src_dir AND THEN MOVED TO dest_dir
src_dir = r"c:\folder_where_the_files_are_first_downloaded"
dest_dir = r"c:\folder_where_the_renamed_files_are_moved_to"
#BREAK DOWN THE URL INTO THREE PARTS (THE MIDDLE PART, THE SUBURB, COMES FROM THE LOOP)
#1 BEGINNING
url_template1 = "https://www.realestate.com.au/sold/in-"
#2 END
url_surf1 = "%3b/list-1"
#EXAMPLE: url_base1 = url_template1 + suburb_id + url_surf1
#COMMENCE THE LOOP: test IS THE DATA FRAME / LIST OF ROWS WHERE THE COLUMN AT INDEX 1 HOLDS THE VALUE WE WANT TO LOOP THROUGH, HENCE row[1]
for guiddd in [row[1] for row in test]:
    #CREATE THE COMPLETE URL
    url_base1 = url_template1 + guiddd + url_surf1
    #LAUNCH THE DRIVER AND BROWSER AND THEN PASTE THE URL ADDRESS INTO IT.
    #THE MOST IMPORTANT PART IS "rust_mozprofile8gjASj": AFTER A PROFILE HAS BEEN
    #CREATED IT IS SAVED IN THE TEMP FOLDER, AND RATHER THAN CREATING A NEW
    #PROFILE FOR EACH LOOP YOU REFER TO THE EXISTING ONE
    profilen = webdriver.FirefoxProfile(r'C:\Users\SW\AppData\Local\Temp\rust_mozprofile8gjASj')
    driver = webdriver.Firefox(executable_path=geckodriver, firefox_profile=profilen)
    print('Selenium webdriver setup complete')
    print(guiddd)
    #TRY TO OPEN THE URL; IF UNSUCCESSFUL, WAIT AND RETRY (DO NOT CLOSE THE DRIVER BEFORE RETRYING)
    try:
        driver.get(url_base1 + '/')
        print('page open')
    except ConnectionRefusedError:
        time.sleep(10)
        driver.get(url_base1 + '/')
    except ConnectionError:
        time.sleep(10)
        driver.get(url_base1 + '/')
    time.sleep(3)
    driver.set_window_size(480, 320)
    time.sleep(1)
    driver.maximize_window()
    time.sleep(2)
    #BELOW ARE A FEW EXAMPLES OF HOW TO EXTRACT ELEMENTS USING THE XPATH METHOD AND HOW TO WORK AROUND VARIOUS SCENARIOS. YOUR SCRIPT WILL OF COURSE BE TAILORED TO YOUR SPECIFIC WEB PAGE AND THE INFORMATION YOU WANT TO DERIVE
    #LOOK FOR AN ELEMENT WHOSE id CONTAINS tabAttachments; WAIT UP TO 10 SECONDS FOR IT TO APPEAR
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[contains(@id,'tabAttachments')]")))
        #IF THE tabAttachments ELEMENT EXISTS THEN CLICK IT
        elm1 = driver.find_element_by_xpath("//*[contains(@id,'tabAttachments')]")
        elm1.click()
    #IF AN EXCEPTION OCCURS THEN REFRESH THE DRIVER, SIMILAR TO A PAGE REFRESH
    except StaleElementReferenceException:
        driver.refresh()
        #REPEAT THE ABOVE PROCESS AGAIN AFTER THE REFRESH
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[contains(@id,'tabAttachments')]")))
        elm1 = driver.find_element_by_xpath("//*[contains(@id,'tabAttachments')]")
        elm1.click()
    time.sleep(3)
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")))
    except TimeoutException:
        elm1 = driver.find_element_by_xpath("//*[contains(@id,'tabAttachments')]")
        elm1.click()
    #AFTER THE ABOVE STEP COMPLETES, LOOK FOR THE NEW ELEMENTS WHICH ARE LINKS TO DOCUMENT DOWNLOADS, THEN COUNT THE NUMBER OF LINKS/DOCUMENTS TO DOWNLOAD. elem2 IS THE LIST OF DOWNLOAD LINKS
    elem2 = driver.find_elements_by_xpath("//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")
    count_docs = len(elem2)
    print("countDocs " + str(count_docs))
    x = datetime.datetime.now()
    print(x)
So far we have created a Firefox profile, built a loop which goes through a list and can visit multiple pages, and shown a few Selenium functions to navigate a web page and isolate elements.
If you are scraping data from the web page itself, you can use Selenium's XPath methods as shown above to get the relevant information; a small sketch follows. But what if you want to download all the documents from a web page?
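As a sketch, extracting text from matched elements looks like this (the XPath is a placeholder; inspect your target page for the real class names):
#PULL THE TEXT OUT OF EACH MATCHED ELEMENT
for price in driver.find_elements_by_xpath("//span[contains(@class,'property-price')]"):
    print(price.text)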
We have already created a list which captures all the document-download links on the page. Now we use this list to loop and click through each link and download the documents.
    #START A LOOP USING elem2 FROM THE ABOVE STEP
    for element in range(len(elem2)):
        try:
            #RE-LOCATE THE DOWNLOAD LINKS (THE DOM MAY HAVE CHANGED SINCE THE LAST CLICK)
            elem2 = WebDriverWait(driver, 25).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")))
        except TimeoutException:
            pass
        #IF FOUND THEN CLICK ON THE LINK TO DOWNLOAD
        elem2 = driver.find_elements_by_xpath("//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")
        try:
            elem2[element].click()
            time.sleep(1)
        #THE FOLLOWING STEPS DEAL WITH ERRORS AND TIMEOUTS.
        #THIS PART OF THE CODE CAN BE EDITED AND MADE FIT FOR PURPOSE FOR YOUR
        #OWN USE. IN MY EXPERIENCE YOU MIGHT HAVE TO ADD MANY FAIL-SAFE
        #MEASURES TO WORK AROUND ERRORS AND ISSUES WHICH MAY ARISE: FOR
        #EXAMPLE, WHEN A WEB PAGE IS LOADING THERE MIGHT BE AN OVERLAY, AND
        #BEFORE YOU MOVE ON TO CLICK YOU HAVE TO WAIT FOR THE OVERLAY TO
        #DISAPPEAR.
        #THE PROGRAM LOOPS THROUGH, CLICKS EACH LINK AND DOWNLOADS THE DOCUMENTS
        except IndexError:
            try:
                elem2 = WebDriverWait(driver, 25).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")))
                elem2[element].click()
                time.sleep(1)
            except ElementClickInterceptedException:
                #WAIT FOR ANY PAGE OVERLAY TO DISAPPEAR, THEN RETRY THE CLICK
                #(THE OVERLAY XPATH IS AN ASSUMPTION - ADAPT IT TO YOUR PAGE)
                WebDriverWait(driver, 25).until(EC.invisibility_of_element_located((By.XPATH, "//*[contains(@class,'overlay')]")))
                elem2[element].click()
                time.sleep(1)