
"Following is an example on how to use Selenium Python to navigate a website and download or scrap relevant information.
Selenium is a free open source automated testing framework used to validate web application across different browsers and platforms.In this post I go on to illustrate how one can navigate a website and perform repetitive task and download documents.
Potential usage:
A) Going to a website such as realestate.com.au, inputting different suburbs and scraping property and real estate data.
B) Downloading documents from a CRM for a given set of customers.
I would be keen to learn in the comments section how you used this approach and whether it proved beneficial.
IMPORTANT NOTE ON CHROME
Chrome driver has a limitation: it raises multiple warnings if you use it to download files that are not browser-embedded, and these warnings cannot be suppressed through Selenium. For example, when downloading a file Chrome might force you to approve the download, so if the aim of your application is to scrape documents then the Chrome driver will fail.
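For comparison, the Chrome preferences people usually try look like the sketch below (the download folder path is a placeholder). In my experience they still do not suppress every prompt, which is why the rest of this post uses Firefox.
from selenium import webdriver
chrome_opts = webdriver.ChromeOptions()
chrome_opts.add_experimental_option("prefs", {
    "download.default_directory": r"C:\downloads",  #PLACEHOLDER FOLDER
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
})
#driver = webdriver.Chrome(options=chrome_opts)
#EVEN WITH THESE PREFERENCES CHROME CAN STILL PROMPT OR BLOCK DOWNLOADS IT DEEMS UNSAFE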
Set Up the Firefox Browser:
The code below shows a working example of how to set up a profile for Mozilla Firefox using the gecko driver. My task involved clicking through multiple screens and downloading attachments, and the Firefox gecko driver proved optimal for this task.
SETTING UP PROFILE AND MIMES FOR DOCUMENT DOWNLOAD:
#PACKAGE IMPORTS FOR SELENIUM
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
#THE FOLLOWING CODE SETS UP A FIREFOX PROFILE TO WORK AROUND AUTOMATED AUTHENTICATION OR TO WORK WITH PROXY SETTINGS
profile = FirefoxProfile()
# THE VARIABLE BELOW IS THE DIRECTORY WHERE FIREFOX WILL SAVE THE DOWNLOADED FILES
dirr = r"C:\document_put_your_folder_location_where_files_are_to_be_downloaded"
options = webdriver.FirefoxOptions()
profile.set_preference('network.negotiate-auth.trusted-uris', 'sap-cip.csda.gov.au:8081')
profile.set_preference("network.negotiate-auth.delegation-uris", 'https://your_url.com.au:8081/**')
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference("browser.download.manager.alertOnEXEOpen", False)
profile.set_preference("browser.download.manager.closeWhenDone", False)
profile.set_preference("browser.download.manager.focusWhenStarting", False)
profile.set_preference("services.sync.prefs.sync.browser.download.useDownloadDir", True)
#SET THE FOLDER LOCATION WHERE DOWNLOADED FILES ARE GOING TO BE SAVED
profile.set_preference("browser.download.dir", dirr)
#MIME SETTING SO THAT THE BROWSER AUTOMATICALLY DOWNLOADS AND NO PROMPTS ARE GENERATED
profile.set_preference("browser.helperApps.neverAsk.saveToDisk","image/svg+xml, image/jpeg, image/jpg, image/png, application/vnd.ms-excel, application/msword, application/x-msg, message/rfc822, application/atom+xml, application/atomsvc+xml, text/new, application/forced-download, application/x-msdownload, application/vnd.ms-outlook, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/outlook, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/x-gzip,text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, attachment/excel, attachment/msg, text/comma-separated-values, text/xml, application/xml, application/pdf, application/msg, application/x-unknown, application/octet-stream, application/vnd.ms-excel.sheet.binary.macroEnabled.12, application/xml, application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12")
profile.set_preference("browser.helperApps.neverAsk.openToDisk","image/svg+xml, image/jpeg, image/jpg, image/png, application/vnd.ms-excel, application/msword, application/x-msg, message/rfc822, application/atom+xml, application/atomsvc+xml,text/new, application/forced-download, application/x-msdownload, application/vnd.ms-outlook, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/outlook, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/x-gzip,text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, attachment/excel, attachment/msg, text/comma-separated-values, text/xml, application/xml, application/pdf, application/msg, application/x-unknown, application/octet-stream, application/vnd.ms-excel.sheet.binary.macroEnabled.12, application/xml, application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12")
#OTHER FIREFOX OPTIONS
profile.set_preference("browser.helperApps.alwaysAsk.force", True)
profile.set_preference("pdfjs.disabled", True)
profile.set_preference("dom.disable_open_during_load", True)
profile.set_preference("dom.webnotifications.enabled", False)
options.set_preference("dom.push.enabled", False)
options.set_preference("dom.webnotifications.enabled", False)
# PATH TO FIREFOX GECKO DRIVER
geckodriver="C:/Program Files/Drivers/geckodriver.exe"
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
# CALL DRIVER AND EXECUTE
driver = webdriver.Firefox(executable_path=geckodriver, firefox_profile=profile, options=options)
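With the profile in place, a quick smoke test confirms the driver launches and picks up the settings. This is just a sketch; the URL is the suburb example used later in the post.
#QUICK SMOKE TEST: OPEN A PAGE AND CONFIRM THE BROWSER STARTS WITH THE PROFILE
driver.get("https://www.realestate.com.au/sold/in-annandale,+nsw+2038%3b/list-1")
print(driver.title)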
The code below (an excerpt from above) defines the MIME types in order to avoid download pop-ups and warnings when scraping documents from a website.
profile.set_preference("browser.helperApps.neverAsk.saveToDisk","image/svg+xml, image/jpeg, image/jpg, image/png, application/vnd.ms-excel, application/msword, application/x-msg, message/rfc822, application/atom+xml, application/atomsvc+xml, text/new, application/forced-download, application/x-msdownload, application/vnd.ms-outlook, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/outlook, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/x-gzip,text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, attachment/excel, attachment/msg, text/comma-separated-values, text/xml, application/xml, application/pdf, application/msg, application/x-unknown, application/octet-stream, application/vnd.ms-excel.sheet.binary.macroEnabled.12, application/xml, application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12")
Packages to Import
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import ElementNotVisibleException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import TimeoutException
from seleniumrequests import Firefox
from seleniumrequests.request import RequestMixin
import os, sys
from os.path import join
from os.path import splitext
import re
import traceback
import collections
import shutil
import time
import datetime  #NEEDED FOR THE TIMESTAMP PRINTED LATER IN THE LOOP
The illustrated code aims to complete the following tasks:
1) Loop through a list of URLs, e.g. suburb listings of sold properties:
https://www.realestate.com.au/sold/in-annandale,+nsw+2038%3b/list-1
https://www.realestate.com.au/sold/in-newtown,+nsw+2042%3b/list-1
2) For each member, navigate and click through a series of web pages to download documents or to scrape data.
3) If you are downloading files, wait until they finish downloading, then rename and move them before the next loop cycle commences.
4) Loop through the list and scrape the web page, download files, or do both.
Below I have loosely described how to achieve this and how to handle the various scenarios one will face:
First step: if you are downloading documents from the website, you need to set the folder where the files are downloaded and staged, and the "moved to" folder where the files are moved after each loop cycle. In my example I rename the files once downloaded and then move them to the "moved to" folder; since I might be downloading thousands of documents, any files left in the download folder would be renamed and processed again on every cycle. A minimal sketch of this step follows.
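The post does not include the file-handling code itself, so here is a minimal sketch of that wait/rename/move cycle. The function name, the .part polling trick (Firefox keeps a .part file while a download is in flight) and the renaming scheme are my assumptions; adapt them to your folders.
def wait_rename_move(download_dir, moved_dir, new_name, timeout=120):
    #POLL UNTIL NO FIREFOX .part FILES REMAIN OR THE TIMEOUT EXPIRES (SKETCH)
    end = time.time() + timeout
    files = []
    while time.time() < end:
        files = os.listdir(download_dir)
        if files and not any(f.endswith(".part") for f in files):
            break
        time.sleep(1)
    #RENAME EACH DOWNLOADED FILE AND MOVE IT TO THE "MOVED TO" FOLDER
    for i, f in enumerate(sorted(files)):
        ext = splitext(f)[1]
        shutil.move(join(download_dir, f), join(moved_dir, "%s_%d%s" % (new_name, i, ext)))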
We have also broken down the URL so that we can add a looping element; as per my example above, we can loop through suburbs on realestate.com.au.
The loop will then iterate through a list or a data frame column's values.
What will the list and loop look like?
newtown,+nsw+2042
annandale,+nsw+2038
erskineville,+nsw+2043
In our example, where we are scraping data from realestate.com.au, this list is used to create a new URL for each loop iteration, such as "https://www.realestate.com.au/sold/in-annandale,+nsw+2038%3b/list-1". This is shown in the embedded code below; first, a rough sketch of how such a list might be built.
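As a sketch of what that list might look like in code (pandas and the column layout are my assumptions; the main loop below reads the value at index 1 of each row):
import pandas as pd  #ASSUMPTION: THE SUBURB LIST LIVES IN A DATA FRAME
df = pd.DataFrame({"id": [1, 2, 3],
                   "suburb": ["newtown,+nsw+2042",
                              "annandale,+nsw+2038",
                              "erskineville,+nsw+2043"]})
test = df.values.tolist()  #E.G. [[1, 'newtown,+nsw+2042'], ...]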
So let's begin:
#FOLDER LOCATIONS: IF YOU ARE DOWNLOADING DOCUMENTS, FILES ARE SAVED AND PROCESSED IN src_dir AND THEN MOVED TO dest_dir
src_dir = r"c:\folder_where_the_files_are_first_downloaded"
dest_dir = r"c:\folder_where_the_renamed_files_are_moved_to"
#BREAK DOWN THE URL INTO THREE PARTS (THE MIDDLE PART, THE SUBURB, COMES FROM THE LOOP)
#1 BEGINNING
url_template1 = "https://www.realestate.com.au/sold/in-"
#2 END
url_surf1 = "%3b/list-1"
#EXAMPLE: url_base1 = url_template1 + suburb_id + url_surf1
#COMMENCE THE LOOP: test IS THE DATA FRAME / LIST OF ROWS WHERE THE COLUMN AT INDEX 1 HOLDS THE VALUE WE WANT TO LOOP THROUGH, HENCE row[1]
for guiddd in [row[1] for row in test]:
    #CREATE THE COMPLETE URL
    url_base1 = url_template1 + guiddd + url_surf1
    #LAUNCH THE DRIVER AND BROWSER AND THEN PASTE THE URL ADDRESS INTO IT.
    #THE MOST IMPORTANT PART IS "rust_mozprofile8gjASj": AFTER A PROFILE HAS BEEN
    #CREATED IT IS SAVED IN THE TEMP FOLDER, AND RATHER THAN CREATING A NEW
    #PROFILE FOR EACH LOOP YOU REFER TO THE EXISTING ONE
    profilen = webdriver.FirefoxProfile(r'C:\Users\SW\AppData\Local\Temp\rust_mozprofile8gjASj')
    driver = webdriver.Firefox(executable_path=geckodriver, firefox_profile=profilen)
    print('Selenium webdriver setup complete')
    print(guiddd)
    #TRY TO OPEN THE URL; IF UNSUCCESSFUL, WAIT AND RETRY (DO NOT CLOSE THE DRIVER BEFORE RETRYING)
    try:
        driver.get(url_base1 + '/')
        print('page open')
    except ConnectionRefusedError:
        time.sleep(10)
        driver.get(url_base1 + '/')
    except ConnectionError:
        time.sleep(10)
        driver.get(url_base1 + '/')
    time.sleep(3)
    driver.set_window_size(480, 320)
    time.sleep(1)
    driver.maximize_window()
    time.sleep(2)
    #BELOW ARE A FEW EXAMPLES OF HOW TO EXTRACT ELEMENTS USING THE XPATH METHOD AND HOW TO WORK AROUND VARIOUS SCENARIOS. YOUR SCRIPT WILL OF COURSE BE TAILORED TO YOUR SPECIFIC WEB PAGE AND THE INFORMATION YOU WANT TO DERIVE
    #LOOK FOR AN ELEMENT WHOSE id CONTAINS tabAttachments; WAIT UP TO 10 SECONDS FOR IT TO APPEAR
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[contains(@id,'tabAttachments')]")))
        #IF THE tabAttachments ELEMENT EXISTS THEN CLICK IT
        elm1 = driver.find_element_by_xpath("//*[contains(@id,'tabAttachments')]")
        elm1.click()
    #IF AN EXCEPTION OCCURS THEN REFRESH THE DRIVER, SIMILAR TO A PAGE REFRESH
    except StaleElementReferenceException:
        driver.refresh()
        #REPEAT THE ABOVE PROCESS AGAIN AFTER THE REFRESH
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[contains(@id,'tabAttachments')]")))
        elm1 = driver.find_element_by_xpath("//*[contains(@id,'tabAttachments')]")
        elm1.click()
    time.sleep(3)
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")))
    except TimeoutException:
        elm1 = driver.find_element_by_xpath("//*[contains(@id,'tabAttachments')]")
        elm1.click()
    #AFTER THE ABOVE STEP COMPLETES, LOOK FOR THE NEW ELEMENTS WHICH ARE LINKS TO DOCUMENT DOWNLOADS, THEN COUNT THE NUMBER OF LINKS/DOCUMENTS TO DOWNLOAD. elem2 IS THE LIST OF DOWNLOAD LINKS
    elem2 = driver.find_elements_by_xpath("//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")
    count_docs = len(elem2)
    print("countDocs " + str(count_docs))
    x = datetime.datetime.now()
    print(x)
So far we have created a Firefox profile, built a loop which goes through a list and can visit multiple pages, and shown a few Selenium functions to navigate a web page and isolate elements.
If you are scraping data from the web page itself, you can use Selenium's XPath methods as shown above to get the relevant information; a small sketch follows. But what if you want to download all the documents from a web page?
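As a sketch, extracting text from matched elements looks like this (the XPath is a placeholder; inspect your target page for the real class names):
#PULL THE TEXT OUT OF EACH MATCHED ELEMENT
for price in driver.find_elements_by_xpath("//span[contains(@class,'property-price')]"):
    print(price.text)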
We have already created a list which captures all the document-download links on the page. Now we use this list to loop and click through each link and download the documents.
    #START A LOOP USING elem2 FROM THE ABOVE STEP
    for element in range(len(elem2)):
        try:
            #RE-LOCATE THE DOWNLOAD LINKS (THE DOM MAY HAVE CHANGED SINCE THE LAST CLICK)
            elem2 = WebDriverWait(driver, 25).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")))
        except TimeoutException:
            pass
        #IF FOUND THEN CLICK ON THE LINK TO DOWNLOAD
        elem2 = driver.find_elements_by_xpath("//*[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")
        try:
            elem2[element].click()
            time.sleep(1)
        #THE FOLLOWING STEPS DEAL WITH ERRORS AND TIMEOUTS.
        #THIS PART OF THE CODE CAN BE EDITED AND MADE FIT FOR PURPOSE FOR YOUR
        #OWN USE. IN MY EXPERIENCE YOU MIGHT HAVE TO ADD MANY FAIL-SAFE
        #MEASURES TO WORK AROUND ERRORS AND ISSUES WHICH MAY ARISE: FOR
        #EXAMPLE, WHEN A WEB PAGE IS LOADING THERE MIGHT BE AN OVERLAY, AND
        #BEFORE YOU MOVE ON TO CLICK YOU HAVE TO WAIT FOR THE OVERLAY TO
        #DISAPPEAR.
        #THE PROGRAM LOOPS THROUGH, CLICKS EACH LINK AND DOWNLOADS THE DOCUMENTS
        except IndexError:
            try:
                elem2 = WebDriverWait(driver, 25).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(@id,'__attribute11-__xmlview2') and contains(@id,'text')]")))
                elem2[element].click()
                time.sleep(1)
            except ElementClickInterceptedException:
                #WAIT FOR ANY PAGE OVERLAY TO DISAPPEAR, THEN RETRY THE CLICK
                #(THE OVERLAY XPATH IS AN ASSUMPTION - ADAPT IT TO YOUR PAGE)
                WebDriverWait(driver, 25).until(EC.invisibility_of_element_located((By.XPATH, "//*[contains(@class,'overlay')]")))
                elem2[element].click()
                time.sleep(1)