Python: Extract text from Word document



Following up on my previous post, where I showed how to convert a PDF document into a text file and then extract the relevant information, I have applied the same approach to a Word document. The big difference is that, rather than writing a whole function to convert the Word document to text, I have used the docx2txt package, which reads in the Word document directly.





Step 1.

Import the necessary packages:

import os             # walk the folder and split file names
import re             # regular expressions for pulling out fields
import pandas as pd   # data frame that collects the results
import docx2txt       # converts a .docx file to plain text

Step 2.

Create a list of all .docx files in the folder to loop through. The comments in the code below explain each step:

# FILE PATH OF THE FOLDER THAT HOLDS THE DOCUMENTS
filepath = r"C:\PATH_TO_FOLDER_WHERE_THE_FILES_ARE"

# EMPTY LIST TO FILL IN ALL WORD DOCUMENTS
document_list = []
# NEW DATA FRAME TO APPEND INFORMATION
df_new = pd.DataFrame()

# COLUMN NAMES IN THE NEW DATA FRAME FOR INFORMATION TO STORE
cnames = ['p_id', 'p_name']

# LOOP TO LIST ALL WORD DOCUMENTS IN THE FOLDER AND ITS SUBFOLDERS
for path, subdirs, files in os.walk(filepath):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding it to our list
        if os.path.splitext(name)[1] == ".docx":
            # Skip the temporary '~$' lock files Word creates while a document is open
            if not name.startswith('~$'):
                document_list.append(os.path.join(path, name))
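
As an aside, the same listing can be written with pathlib. This is only a sketch of an alternative, not the approach used in this post: the helper name list_docx_files and the throwaway demo folder are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_docx_files(folder):
    """Return sorted paths of .docx files under folder, skipping Word's '~$' lock files."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*.docx")
        if p.is_file() and not p.name.startswith("~$")
    )


# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["123456_sam_wad_licencedetails.docx", "~$temp.docx",
                 "notes.txt", "sub/report.docx"]:
        (root / name).touch()
    found = [Path(p).name for p in list_docx_files(root)]

print(found)
```

Only the two real .docx files should be listed; the lock file and the text file are filtered out.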

Step 3.

Loop through the document list (document_list), extract the relevant information and append it to the empty data frame. In addition to the looping, I also check the naming convention of each file so that I can pull information out of the filename itself. For example, suppose I scraped files from the web using Selenium and, while downloading them, named or renamed them with certain attributes such as "license number" or "file owner name", and I now want that information extracted.

e.g. file name: "123456_sam_wad_licencedetails.docx"
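
To make the naming convention concrete, here is a small sketch of the filename rule on its own. The helper name id_from_filename and the "downloads/" folder are hypothetical; the rule itself (take the first part only when the name splits into more than three parts) is the one used in the loop.

```python
import os


def id_from_filename(document_path):
    """Return the leading ID of names like '<id>_<first>_<last>_<suffix>.docx', else None."""
    parts = os.path.basename(document_path).split("_")
    # Only trust the first part as an ID when the name has more than three parts
    return parts[0] if len(parts) > 3 else None


print(id_from_filename("downloads/123456_sam_wad_licencedetails.docx"))  # 123456
print(id_from_filename("downloads/report.docx"))                         # None
```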


          
#Loop starts
for document_path in document_list:
    data = []
    try:
        #READ EACH FILE WITH docx2txt AND CONVERT IT TO TEXT
        document = docx2txt.process(document_path)
    except Exception:
        #SKIP FILES THAT CANNOT BE READ
        continue
    #GET INFORMATION FROM THE FILE NAME
    pid = os.path.basename(document_path)
    listn = pid.split("_")
    #USE THE FIRST PART AS THE ID ONLY WHEN THE NAME HAS MORE THAN 3 PARTS
    pidn = listn[0] if len(listn) > 3 else None
    #GET INFORMATION OUT USING REGEX, HERE THE LICENSE NUMBER
    result = re.search(r'owner license number: (.*)\n\n', document)
    resulta = re.search(r'license for(.*)', document)
    if result:
        p_id = result.group(1)
    elif resulta:
        #re.findall RETURNS A LIST, SO KEEP THE FIRST NUMBER OR FALL BACK TO THE FILE NAME
        numbers = re.findall(r'\d+', resulta.group(0))
        p_id = numbers[0] if numbers else pidn
    else:
        p_id = pidn

    #GET THE NAME FOR LICENSE RECIPIENT
    result1 = re.search(r'owner name:(.*)\n\n', document)
    if result1:
        p_name = result1.group(1)
    elif resulta:
        resultb = re.search(r'document for(.*)\s\(?\d+?', document)
        if resultb:
            p_name = resultb.group(1)
        else:
            p_name = resulta.group(1)
    else:
        p_name = None
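
To see what the two main patterns capture, here is a minimal sketch run against a made-up piece of text. The sample string and the helper first_group are assumptions for illustration (the helper also strips surrounding spaces, which the loop does not do); real documents will need patterns tuned to their own wording.

```python
import re

# Made-up text in the layout the patterns expect: a field followed by a blank line
sample = (
    "owner license number: 123456\n\n"
    "owner name: sam wad\n\n"
    "other details follow\n"
)


def first_group(pattern, text):
    """Return the stripped first capture group of pattern in text, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


p_id = first_group(r"owner license number: (.*)\n\n", sample)
p_name = first_group(r"owner name:(.*)\n\n", sample)
missing = first_group(r"expiry date:(.*)\n\n", sample)

print(p_id, p_name, missing)  # 123456 sam wad None
```

Because "." does not match a newline, each capture stops at the end of its own line, and a field that is absent simply comes back as None.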

Step 4.

Bring it all together: still inside the loop, take the extracted information and append it to the data frame.

    #add data to the list from each loop
    data.append([p_id])
    data.append([p_name])
    
    #convert this list to a data frame
    df = pd.DataFrame(data)
    
    #transpose the staged data frame so that each document becomes one row
    df_tr = df.transpose()
    #Rename the columns of the data frame
    df_tr.rename(columns = {0:'p_id', 1:'p_name'}, inplace = True)
    
    # Append the transposed data frame to the master data frame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    df_new = pd.concat([df_new, df_tr], ignore_index=True)
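
If you prefer to skip the staging data frame, one common alternative is to collect one dictionary per document and build the data frame once at the end. This is a sketch of that alternative, not the code used in this post; the values in extracted are made up and stand in for what the loop would find.

```python
import pandas as pd

# Made-up results, as if produced by three passes of the loop
extracted = [("123456", "sam wad"), ("654321", "jo bloggs"), (None, None)]

rows = []
for p_id, p_name in extracted:
    # One dictionary per document; the keys become the column names
    rows.append({"p_id": p_id, "p_name": p_name})

# Build the data frame once, after the loop
df_all = pd.DataFrame(rows, columns=["p_id", "p_name"])
print(df_all)
```

Building the frame once avoids growing it row by row, which matters mostly when the folder holds many documents.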

The framework above allows you to loop through multiple Word documents, extract the relevant information and collect it into a single data frame. It is easier than working with PDF documents, as there are packages that readily convert Word documents to text.
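
Part of the reason the conversion is easy is that a .docx file is a zip archive with the main text stored as XML in word/document.xml. The sketch below reads paragraph text using only the standard library. It is a simplified stand-in to show the idea, not the docx2txt implementation, and make_sample_docx is a hypothetical helper that builds a bare-bones archive for the demo rather than a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of a paragraph is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def make_sample_docx(lines):
    """Build a minimal in-memory archive holding only word/document.xml (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = make_sample_docx(["owner license number: 123456", "owner name: sam wad"])
text = docx_to_text(sample)
print(text)
```

For real documents, which also carry tables, headers and footers, a dedicated package remains the more practical choice.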

Using regular expressions, you can extract the relevant information. I will go into detail about regular expressions and their usage in an upcoming post.

Hope this helps!

The complete code is below (mind the indentation and modify it to your needs):


import os             # walk the folder and split file names
import re             # regular expressions for pulling out fields
import pandas as pd   # data frame that collects the results
import docx2txt       # converts a .docx file to plain text

# FILE PATH OF THE FOLDER THAT HOLDS THE DOCUMENTS
filepath = r"C:\PATH_TO_FOLDER_WHERE_THE_FILES_ARE"

# EMPTY LIST TO FILL IN ALL WORD DOCUMENTS
document_list = []
# NEW DATA FRAME TO APPEND INFORMATION
df_new = pd.DataFrame()

# COLUMN NAMES IN THE NEW DATA FRAME FOR INFORMATION TO STORE
cnames = ['p_id', 'p_name']

# LOOP TO LIST ALL WORD DOCUMENTS IN THE FOLDER AND ITS SUBFOLDERS
for path, subdirs, files in os.walk(filepath):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding it to our list
        if os.path.splitext(name)[1] == ".docx":
            # Skip the temporary '~$' lock files Word creates while a document is open
            if not name.startswith('~$'):
                document_list.append(os.path.join(path, name))

#Loop starts
for document_path in document_list:
    data = []
    try:
        #READ EACH FILE WITH docx2txt AND CONVERT IT TO TEXT
        document = docx2txt.process(document_path)
    except Exception:
        #SKIP FILES THAT CANNOT BE READ
        continue
    #GET INFORMATION FROM THE FILE NAME
    pid = os.path.basename(document_path)
    listn = pid.split("_")
    #USE THE FIRST PART AS THE ID ONLY WHEN THE NAME HAS MORE THAN 3 PARTS
    pidn = listn[0] if len(listn) > 3 else None
    #GET INFORMATION OUT USING REGEX, HERE THE LICENSE NUMBER
    result = re.search(r'owner license number: (.*)\n\n', document)
    resulta = re.search(r'license for(.*)', document)
    if result:
        p_id = result.group(1)
    elif resulta:
        #re.findall RETURNS A LIST, SO KEEP THE FIRST NUMBER OR FALL BACK TO THE FILE NAME
        numbers = re.findall(r'\d+', resulta.group(0))
        p_id = numbers[0] if numbers else pidn
    else:
        p_id = pidn

    #GET THE NAME FOR LICENSE RECIPIENT
    result1 = re.search(r'owner name:(.*)\n\n', document)
    if result1:
        p_name = result1.group(1)
    elif resulta:
        resultb = re.search(r'document for(.*)\s\(?\d+?', document)
        if resultb:
            p_name = resultb.group(1)
        else:
            p_name = resulta.group(1)
    else:
        p_name = None

    #add data to the list from each loop
    data.append([p_id])
    data.append([p_name])

    #convert this list to a data frame
    df = pd.DataFrame(data)

    #transpose the staged data frame so that each document becomes one row
    df_tr = df.transpose()
    #Rename the columns of the data frame
    df_tr.rename(columns = {0:'p_id', 1:'p_name'}, inplace = True)

    # Append the transposed data frame to the master data frame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    df_new = pd.concat([df_new, df_tr], ignore_index=True)