Some nifty functions in SAS and how to replicate them in Python
What is Regex Or Regexp?
Abbreviation for regular expression it is used for pattern match in strings, similar to find and replace function you use in excel or word spreadsheet.
For more detailed Regex explanation and basic syntax and complex examples refer to the post on Regex(coming out soon). In this post I am comparing *Python regular expression and functions and some SAS specific functions.
Why not then use find and replace and use Regex which looks like a alien language?
Because find and replace functions usually are for specific words, regex allows us to match all patterns such as phone number or ABN (Australian Business Number).
For example regular expression below shows a regular expression for ABN:
This pattern is for SAS only the flags syntax "io" is different for python "re" package
"/(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}/io"
The above regular expression is trying to match a Australian Business Number which are made up of unique 11-digit identifiers making it easier for businesses and government to interact and identify. Usually in the following format "ABN 11 223 491 505".
The above regex can explained as follows
(A?) - The pattern may start with "A" is matched zero or one times which means A may or may not be there.
(.?) - The second character may be a full stop / dot "." matched zero or one times.
(.B) - The third character may be a "B" matched zero or one times.
(.?) - The fourth character may be a full stop / dot "." matched zero or one times.
(.N) - The fifth character may be a "B" matched zero or one times.
(NO) - The sixth character pattern may be a "NO" matched zero or one times.
(.?) - The seventh character may be a full stop / dot "." matched zero or one times.
(:?) - The eighth character may be a full hyphen ":" matched zero or one times.
(\d *?){11} - A digit with space or no space repeated 11 times.
/io - i makes the matching case insensitive. o makes the code compile once and not each time when the function is called. this options saves time and makes the code efficiency.
As you can see it is a good practice to comment any regular expressions as it is quite tedious to comprehend and remember the pattern match of expression even the ones which you wrote is quite difficult.
SAS Regular Expression Functions:
PRXMATCH
PRXSUBSTR (call routine)
PRXPOSN (call routine)
PRXNEXT (call routine)
PRXPAREN
PRXMATCH & PRXNEXT in SAS
Both functions are used to locate the position of a pattern in string. I will emphasize "position" as these functions do not capture and return the pattern but rather return the "first position" where the pattern is found.
For example:
String : "The cat is black"
PRXMATCH("/cat/",STRING) returns 5 as the first letter of "cat" begins at 5th position in the string including space.
The main difference between PRXMATCH & PRXNEXT is that PRXMATCH will stop at the first instance where it finds the match. If there are no match for the pattern in the string PRXMATCH will return 0. While PRXNEXT will find all occurrence of the pattern.
If your issues is that the pattern repeats itself for example in a large paragraph or document how do you extract all instances of occurrences?
For instance the sentence below has three occurrences of "cat":
"Where is the cat. It's my cat and it's a black cat."
PRXNEXT function provides the answer to these issue when you nee do extract all three occurrences of pattern. A real world issue would be where you need to extract mobile numbers out of a text string, each number would be different but each one will follows a 10 digit pattern starting with "04"(Australian mobile number).
A thing which differentiate PRXNEXT is that it's a call function and has few more options than PRXMATCH. I have explained this below in the form of an example. A working example is much easier to understand than just seeing the syntax and options of a function.
In the code below I am attempting to extract all occurrences of Australian Business Number in a string. PRXNEXT give the start position and length of each occurrence. The script uses SUBSTR to concatenate all occurrences into pipe delimited manner.
data _null_ ;
set abntoextract end= eof;
if eof then call symputx('numbs', _N_);
run;
%put &numbs;
data abnextract(DROP=ABN ABN2 abnpattern abnpattern2 z position length start stop line);
do z=1 to &numbs; /*Start the loop before set this for looping through each row of the dataset abntoextract*/
set abntoextract;
length match $100;
call missing(match);
start =1;
/* above step specifies the position where to start looking*/
stop= length(abntext);
/*above step specifies where to end in this case the length of string in column abntext*/
abnpattern =prxparse("/(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}/io");
/* above regex pattern for ABN*/
abnpattern2 =prxparse("/(\d *?){11}/");
/* Above another regex pattern for ABN for illustrative purpose*/
call prxnext(ree, Start, stop, Text, position, length);
do while (position > 0);
/*second loop within first to loop through each word in the string in column abntext of each row*/
line=_N_;
match =catx(' | ',match, substr(Text, position, length));
/* above lines concatenates each instance in the string pattern is matched into pipe delimited format;e.g. ABN1234|ABN2345 */
if start > stop then position = 0;
else
call prxnext(ree, Start, stop, Text, position,length);
retain Text start stop position length match;
end;
/* Above if statement retains position and length of each match and passes it to substr above to extract the pattern*/
output;
end;
run;
The above code is illustration of one of many ways PRXNEXT can be used, since I found it hard to locate a working example hope you would benefit from working example.
So how do we go on to perform the same task in Python?
There are few regex functions in python which can do the things that PRXMATCH AND PRXNEXT in SAS do:
Regex pattern in python differs from SAS. If the above pattern from SAS code is deployed in python then it will divide the pattern into multiple groups.Where each closed round brackets becomes a group which can be extracted individually. Illustrated below is a example of python regex match function. The match function(re.match) looks for the beginning of line and matches the first instance. If the beginning of the string doesn't match then "match" function stops as shown below. The below example has list with three string elements. The "re.match" function is able match only the first and second element in the list as the second element first word "ABN123456" is not a match.
Also worth noting is how to extract where the pattern match. If you want to extract the matched pattern then use group() if you want to display which groups in the pattern were matched use groups(). The second functionality has limited usage as shown below.
Therefore the first element in the list "ABN12345678910 abn match" the output is "ABN12345678910".
For the second element in the list "ABN123456 not match ABN12345678910" there is no output because the first word is not matched to pattern.This shows the limitation of re.match functionality which can be overcome by using re.search functionality
The third element in the list is "ABN123456789101 abn more match" and the output is
"ABN123456789101". Note that the output includes ABN plus 12 digits, this regex extracts the whole sub-string before space.
import re
list =["ABN12345678910 abn match","ABN123456 not match ABN12345678910","ABN123456789101 abn more match"]
for element in list:
z=re.match("(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}",element)
if z:
print((z.group()))
print((z.groups()))
#PRINTED OUTPUT BELOW (NOT PART OF CODE)
ABN12345678910
('A', 'B', '', '', '', '', 'N', '', '', '', '0')
ABN123456789101
('A', 'B', '', '', '', '', 'N', '1', '', '', '1')
Is there an easier and more simplified version of the above regex and how does the output differs?
The answer to above is yes and is shown in the code snippet below. while the output for first element remains the same and there is no matched output for the second,notice how for the third element of the string the output changes. The pattern below has only one group, hence groups() function outputs the one and only group matched --> 0
import re
list =["ABN12345678910 abn match","ABN123456 not match ABN12345678910","ABN123456789101 abn more match"]
for element in list:
z=re.match("(A?\.?B?\.?[NO N]?\.?\d *?){11}",element)
if z:
print((z.group()))
print((z.groups()))
#PRINTED OUTPUT BELOW (NOT PART OF CODE)
ABN12345678910
('0',)
ABN12345678910
('0',)
So how do we overcome the limitation where the first word or sub-string doesn't matches the regex pattern?
The answer to above query is re.search function which checks all lines of the input string and not only the first and returns the first instance where the pattern is a match.
The below working example shows how multiple pattern can be passed through and different outputs which can be derived from the function.
import re
patterns = ['(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}','match']
text = ' where is the number ABN12345678910 abn match'
for pattern in patterns:
print ('Looking for "%s" in ""%s" ->' %(pattern,text), end=' ')
if re.search(pattern,text):
z=re.search(pattern,text)
print('found a match')
print(re.search(pattern,text))
print((z.group()))
else:
print('no match')
#PRINTED OUTPUT BELOW (NOT PART OF CODE)
#1
Looking for "(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}" in ""ABN12345678910 abn match" -> found a match
#2
<re.Match object; span=(19, 35), match='ABN12345678910'>
#3
ABN12345678910
#4
Looking for "match" in ""ABN12345678910 abn match" -> found a match
#5
<re.Match object; span=(20, 25), match='match'>
#6
match
Above code uses the same pattern "(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}" though rather than passing list of string we passed a single string and two patterns to match . Unlike in re.match example the ABN number is not at the start or first line of the string and hence re.match would have returned a null value whilst re.search has successfully matched the pattern. Above code is also handy where more than one pattern need to be matched.
The outputs are explained as per below:
#1 Illustrates the first pattern being matched and whether there was a successful match
#2 Shows the two outputs the part of the sting matched and the position of the string -19 to 35.
#3 Using group() get the output for the first pattern.
#4 Illustrates the second pattern being matched and whether there was a successful match.
#5 Outputs matched part of the sting and the position of the matched string is from 20 to 25.
#6 Using group() get the output for the second pattern.
Another example of using re.search is shown below:
result = re.search('\w+((A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11})\w+',' where is the number ABN12345678910 abn match')
if result:
abn = (result.group(1))
print(abn)
else:
print('no match')
How we replicate functionality of PRXNEXT when there is more than one instance of the pattern in the string and we need to extract all such pattern?
The final python function which we discuss in this post is re.findall which allows us to extract multiple occurrence of pattern within string e.g. what if you want to extract all the Australian Business Number in a documents and not just the first one?
Python re.findall syntax is much simpler than SAS PRXNEXT and is similar to re.search and re.match. Though re.next provides tuple within a list of all groups found. What I mean by this? This is illustrated in the two example below.
abntext = 'unilevers ABN123456789110, company ABNO3456789012, company xyz ABN5678901234512'
patterns = ['(A?)(.?)(B?)(.?)(N?)(.?)(NO?)(.?)(:?)( ?)(\d *?){11}','match']
abns = re.findall(patterns, abntext)
print(abns)
for abn in abns:
print(abn)
#PRINTED OUTPUT BELOW (NOT PART OF CODE)
#LIST OF TUPLES
[('', ' ', '', 'A', '', 'B', 'N', '1', '', '', '0'), ('', ' ', '', 'A', '', 'B', 'NO', '', '', '', '2'), ('', ' ', '', 'A', '', 'B', 'N', '5', '', '', '1')]
#EACH TUPLE PRINTED BY THE FOR LOOP
('', ' ', '', 'A', '', 'B', 'N', '1', '', '', '0')
('', ' ', '', 'A', '', 'B', 'NO', '', '', '', '2')
('', ' ', '', 'A', '', 'B', 'N', '5', '', '', '1')
The challenge above is that the pattern is broken into multiple groups and if you want to extract all the patterns which were matched it won't be as easy as in re.match or re.search where group() gave the output. If you use those option you will get an error.
IMPORTANT: findall does not preserve named groups or capture groups, that means it will give you group number/name but not the value.
In order to use re.findall you might have to use a pattern without groups as per the example below:
import re
abntext = "unilevers ABN123456789110, company ABNO34567890122, company xyz ABN5678901234512"
patterns = r"A?\.?B?\.?NO?\.?N?\.?\d{11}"
abns = re.findall(patterns, abntext)
#abns2 = re.finditer(patterns,abntext)
print(abns)
#print(abns2)
for abn in abns:
print(abn)
#PRINTED OUTPUT BELOW (NOT PART OF CODE)
#ALL ABN PRINTED AS A LIST
['ABN12345678911', 'ABNO34567890122', 'ABN56789012345']
#EACH ABN PRINTED BY THE FOR LOOP BY ITERATING THROUGH THE LIST
ABN12345678911
ABNO34567890122
ABN56789012345
The above seems suitable for a simple regex pattern but for complex regex pattern where multiple groups need to be extracted this may not be a suitable approach.
Well then here is where the bonus function re.finditer comes which preserves named/indexed captured groups. Its return a iterate object and can be used to output all match within a string one by one.
Thinking of a practical solution where you have a data frame with document contained in a column and you need the output by each row. In this case you can have a "For loop" append all matched pattern for a single row into a list and then output to a column. (That's for another post)
In order to keep up to my philosophy of "It's all talk without an example" see illustrations of use below:
list =[]
data = """unilevers ABN123456789110, company ABNO34567890122, company xyz ABN5678901234512"""
for m in re.finditer(r'(?P<percentage>\sA?\.?B?\.?NO?\.?N?\.?\d{11})', data):
item=m.group('percentage')
list.append(item)
print( m.group('percentage'))
print(list)
#PRINTED OUTPUT
#OUTPUT FROM THE LOOP
ABN12345678911
ABNO34567890122
ABN56789012345
#OR YOU CAN CREATE A LIST
[' ABN12345678911', ' ABNO34567890122', ' ABN56789012345']
The above snippet of code shows two ways to get relevant outputs, my preference being second option as it is more applicable method. Since most us will deal with a data frame with multiple rows.
SUMMARY:
Whether SAS or python I have shown above how to apply regular expressions in either language. I have used the above code whilst doing text analytics, create a pseudo tf-dif (term frequency-inverse document frequency) table, used it to scrap relevant texts from PDF and WORD documents, scrapping of data online in conjunction with Selenium; more on that in a different post. I hope this blog was elaborate and illustrative and helped you in your objectives. Please leave comment and share this post if you liked it.
Comments