Introduction To Python RegEx
The concept was introduced by American mathematician Stephen Cole Kleene in 1951, who described regular languages using his mathematical notation called regular events.
A Python regular expression (RegEx) is a special sequence of characters that defines a pattern for complex string-matching functionality.
Regular expressions are also known as regexp, regex, or RE; in Python they are provided by the built-in re module. Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python. Using this language, you specify the rules for the set of possible strings that you want to match. The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done with it. The sections below show how to define patterns and use them to search and manipulate strings.
The simplest way to match strings in Python does not need a pattern at all:
- To check whether two strings are equal, use the equality (==) operator.
Applications
- Used in Search Engines
- Search and Replace dialogs of word processors
- Text editors
re Module:
Python has a built-in package called re for working with regular expressions. The re module provides many functions for matching, searching, and replacing text with Python RegEx.
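Import the re module before using any of its functions:
import re
- [] – Square brackets
Square brackets specify a set of characters you wish to match.
Expression | String | Matched |
[xyz] | x | 1 match |
[xyz] | xy | 2 matches |
[xyz] | Hey man | No match |
[xyz] | xyz yz yx | 7 matches |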
You can also specify a range of characters using a hyphen (-) inside square brackets.
For example:
- [p-t] is the same as [pqrst].
- [5-9] is the same as [56789]. A range covers single characters only, so [5-10] is not a valid way to write "5 to 10".
You can also complement (invert) the character set by placing a caret (^) at the start of the square brackets.
For example:
- [^xyz] means any character except x, y, or z.
- [^0-9] means any non-digit character.
# square bracket
import re
sample = "Fireblaze AI School"
# Find all lowercase characters alphabetically between "a" and "m":
sample_square = re.findall("[a-m]", sample)
print(sample_square)
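To make ranges and negated sets concrete, here is a minimal sketch. The sample string is made up for this illustration and does not come from the examples above.

```python
import re

# Hypothetical sample text, chosen only for this illustration
text = "Room 42, Block B"

digits = re.findall("[0-9]", text)       # a range of digits
capitals = re.findall("[A-Z]", text)     # a range of uppercase letters
others = re.findall("[^a-zA-Z ]", text)  # anything except letters and spaces

print(digits)    # ['4', '2']
print(capitals)  # ['R', 'B', 'B']
print(others)    # ['4', '2', ',']
```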
- . – Period
Matches any single character (except a newline).
Expression | String | Matched |
.. | x | No match |
.. | xy | 1 match |
.. | xyz | 1 match |
.. | wxyz | 2 matches |
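The counts in the period table can be reproduced with re.findall. This is a small sketch; the last line uses a made-up string to show that, by default, the period does not match a newline.

```python
import re

one_char = re.findall("..", "x")        # too short for two characters
three_chars = re.findall("..", "xyz")   # one pair, the last character is left over
four_chars = re.findall("..", "wxyz")   # two non-overlapping pairs
with_newline = re.findall(".", "a\nb")  # the newline is skipped

print(one_char)      # []
print(three_chars)   # ['xy']
print(four_chars)    # ['wx', 'yz']
print(with_newline)  # ['a', 'b']
```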
- ^ – Caret
Used to check whether a string starts with a certain character or pattern.
Expression | String | Matched |
^x | x | 1 match |
^x | xy | 1 match |
^x | zyx | No match |
^xy | xyz | 1 match |
^xy | zyx | No match |
import re
sample = "Fireblaze AI School"
# Check if the string starts with 'Fireblaze':
x = re.findall("^Fireblaze", sample)
if x:
    print("Yes, the string starts with 'Fireblaze'")
else:
    print("No match")
- $ – Dollar
Used to check whether a string ends with a certain character or pattern.
Expression | String | Matched |
x$ | x | 1 match |
x$ | Manx | 1 match |
x$ | Hey man | No match |
import re
sample = "Fireblaze AI School"
# Check if the string ends with 'School':
x = re.findall("School$", sample)
if x:
    print("Yes, the string ends with 'School'")
else:
    print("No match")
- * – Star
The star symbol matches zero or more occurrences of the pattern to its left.
Expression | String | Matched |
gir*l | gi | No match (no l after gi) |
gir*l | girl | 1 match |
gir*l | perl | No match |
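A short sketch of the star in action with the pattern gir*l. The test strings "gil" and "girrrl" are added here for illustration, to show the zero-occurrence and many-occurrence cases.

```python
import re

pattern = "gir*l"  # "gi", then zero or more "r", then "l"

# Map each test string to the list of matches found in it
results = {s: re.findall(pattern, s) for s in ["gil", "girl", "girrrl", "gi", "perl"]}

for text, found in results.items():
    print(text, found)
# gil ['gil']
# girl ['girl']
# girrrl ['girrrl']
# gi []
# perl []
```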
- + – Plus
The plus symbol matches one or more occurrences of the pattern to its left.
Expression | String | Matched |
xma+n | xa | No match (no m character) |
xma+n | xman | 1 match |
xma+n | xmaaan | 1 match |
import re
txt = "Fireblaze AI School"
# Check if the string contains "Sch" followed by one or more "o" characters:
x = re.findall("Scho+", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
['Schoo']
Yes, there is at least one match!
- ? – Question Mark
The question mark matches zero or one occurrence of the pattern to its left.
Expression | String | Matched |
xma?n | xa | No match (no m character) |
xma?n | xman | 1 match |
xma?n | xmaaan | No match (more than one a) |
xma?n | xmn | 1 match (zero a characters) |
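The difference between the question mark and the plus sign is easiest to see side by side. The sketch below assumes the patterns xma?n and xma+n and runs both against one combined test string.

```python
import re

text = "xmn xman xmaaan"

optional_a = re.findall("xma?n", text)     # zero or one "a"
one_or_more_a = re.findall("xma+n", text)  # at least one "a"

print(optional_a)     # ['xmn', 'xman']
print(one_or_more_a)  # ['xman', 'xmaaan']
```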
- {} – Braces
Consider {n,m}. This means at least n, and at most m, repetitions of the pattern to its left.
Expression | String | Matched |
[py]{2,3} | pqr xyz | No match |
[py]{2,3} | pqr xyyz | 1 match (at yy) |
[py]{2,3} | ppqr xyyyz | 2 matches (at pp and yyy) |
[py]{2,3} | ppqr xyyyyz | 2 matches (at pp and yyy; the fourth y is left over) |
import re
sample = "Fireblaze AI School"
# Check if the string contains "az" followed by exactly two "e" characters:
x = re.findall("aze{2}", sample)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
[]
No match
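For contrast with the "No match" result above, here is a sketch where braces do find matches. The number string is invented for illustration; the second call reuses the sample sentence from the examples.

```python
import re

# {2,3}: between two and three digits in a row (hypothetical input)
runs = re.findall("[0-9]{2,3}", "7 42 123 12345")

# {2}: exactly two "o" characters in a row
double_o = re.findall("o{2}", "Fireblaze AI School")

print(runs)      # ['42', '123', '123', '45']
print(double_o)  # ['oo']
```

Note that 12345 is consumed greedily as 123 and then 45, because each match takes at most three digits.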
- | – Alternation
The vertical bar is used for alternation and works as an 'or' operator.
Expression | String | Matched |
x|y | pqr | No match |
x|y | xaz | 1 match (at x) |
x|y | wxypyz | 3 matches (at x, y, and y) |
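Alternation can be checked quickly with re.findall. In this sketch, the first call uses a string from the table and the second applies the same idea to whole words.

```python
import re

letters = re.findall("x|y", "wxypyz")                  # either "x" or "y"
words = re.findall("AI|School", "Fireblaze AI School")  # either word

print(letters)  # ['x', 'y', 'y']
print(words)    # ['AI', 'School']
```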
- () – Group
Parentheses are used to group sub-patterns.
For example, (x|y|z)ab matches any string that contains x, y, or z followed by ab.
Expression | String | Matched |
(x|y|z)ab | xy ab | No match |
(x|y|z)ab | xyab | 1 match (at yab) |
(x|y|z)ab | xay cabxy | No match (no x, y, or z is directly followed by ab) |
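One detail worth knowing when a pattern contains a group: re.findall returns the text captured by the group rather than the whole match. The sketch below uses a made-up test string, "xab zxyab", and shows two ways to get the full match instead.

```python
import re

text = "xab zxyab"  # hypothetical input for this illustration

# With one capturing group, findall returns only the group's text
captured = re.findall("(x|y|z)ab", text)

# finditer yields Match objects, so group() gives the whole match
whole = [m.group() for m in re.finditer("(x|y|z)ab", text)]

# A non-capturing group (?:...) groups without capturing
non_capturing = re.findall("(?:x|y|z)ab", text)

print(captured)       # ['x', 'y']
print(whole)          # ['xab', 'yab']
print(non_capturing)  # ['xab', 'yab']
```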
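- \ – Backslash
The backslash is used to escape various characters, including all metacharacters.
For example, \$x matches if a string contains $ followed by x. Here, $ is not interpreted by the RegEx engine in a special way.
If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it so that it is treated literally. A backslash before a letter or digit can instead start a special sequence such as \d.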
import re
sample = "That will be 123 rupees"
# Find all digit characters (the r prefix keeps the backslash literal):
x = re.findall(r"\d", sample)
print(x)
['1', '2', '3']
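The example above uses the backslash to start a special sequence. Escaping a metacharacter looks like this; the sketch uses invented strings, and re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

text = "price $x, not x$"  # hypothetical input

escaped = re.findall(r"\$x", text)  # a literal "$" followed by "x"
unescaped = re.findall("$x", text)  # "$" means end of string, so nothing can follow it

# re.escape builds a pattern that matches the given text literally
literal = re.search(re.escape("1+1=2"), "so 1+1=2 holds")
as_pattern = re.search("1+1=2", "so 1+1=2 holds")  # here "+" is a quantifier

print(escaped)                # ['$x']
print(unescaped)              # []
print(literal is not None)    # True
print(as_pattern is not None) # False
```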
Special Sequences
Special sequences make commonly used patterns easier to write.
Here is the list of special sequences:
\A, \B, \b, \D, \d, \S, \s, \W, \w, \Z.
- \A – matches if the specified characters are at the start of a string.
Expression | String | Matched |
\Aman | man has | Match |
\Aman | in man | No match |
- \B – matches if the specified characters are not at the beginning or end of a word.
Expression | String | Matched |
\Bfoo | football | No match |
\Bfoo | A football | No match |
\Bfoo | afootball | Match |
- \b – opposite of \B; matches if the specified characters are at the beginning or end of a word.
Expression | String | Matched |
\bfoo | football | Match |
\bfoo | A football | Match |
\bfoo | afootball | No match |
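The two word-boundary tables can be verified with a short loop. This is a sketch using re.search and the three strings from the tables; raw strings are used so that \b is not read by Python as a backspace character.

```python
import re

texts = ["football", "A football", "afootball"]

# \bfoo: "foo" must start at a word boundary
at_boundary = [bool(re.search(r"\bfoo", t)) for t in texts]

# \Bfoo: "foo" must NOT start at a word boundary
inside_word = [bool(re.search(r"\Bfoo", t)) for t in texts]

print(at_boundary)  # [True, True, False]
print(inside_word)  # [False, False, True]
```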
- \D –
Matches any character that is not a decimal digit. Same as [^0-9].
Expression | String | Matched |
\D | 2xy56"90 | 3 matches (the non-digits) |
\D | 9876 | No match |
- \d –
Opposite of \D; matches any decimal digit. Same as [0-9].
Expression | String | Matched |
\d | 54xyz3 | 3 matches (the digits) |
\d | Data science | No match |
- \S –
Matches where a string contains any non-whitespace character.
It is equivalent to [^ \t\n\r\f\v].
Expression | String | Matched |
\S | x y | 2 matches |
\S | (spaces only) | No match |
- \s –
Matches where a string contains any whitespace character.
It is equivalent to [ \t\n\r\f\v].
Expression | String | Matched |
\s | Machine Learning | 1 match |
\s | MachineLearning | No match |
- \W –
Matches any non-alphanumeric character.
It is equivalent to [^a-zA-Z0-9_].
Expression | String | Matched |
\W | 1a2%c | 1 match |
\W | Machine Learning | 1 match (the space) |
- \w –
Matches any alphanumeric character (i.e. digits and letters).
It is equivalent to [a-zA-Z0-9_].
The underscore _ is also considered an alphanumeric character.
Expression | String | Matched |
\w | 12$": ;a | 3 matches |
\w | %">! | No match |
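The six character-class sequences above come in complementary pairs. The sketch below runs all of them against one short, made-up string so the pairs can be compared directly.

```python
import re

text = "ML 101!"  # hypothetical input

digits = re.findall(r"\d", text)      # decimal digits
non_digits = re.findall(r"\D", text)  # everything else
spaces = re.findall(r"\s", text)      # whitespace
word_chars = re.findall(r"\w", text)  # letters, digits, underscore
non_word = re.findall(r"\W", text)    # everything else

print(digits)      # ['1', '0', '1']
print(non_digits)  # ['M', 'L', ' ', '!']
print(spaces)      # [' ']
print(word_chars)  # ['M', 'L', '1', '0', '1']
print(non_word)    # [' ', '!']
```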
- \Z –
Matches if the specified characters are at the end of a string.
Expression | String | Matched |
ML\Z | I like ML | 1 match |
ML\Z | I like ML program | No match |
ML\Z | ML is good | No match |
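\A and \Z look similar to ^ and $, with one practical difference: they always refer to the start and end of the whole string, even when the re.MULTILINE flag is set. A minimal sketch, using the table strings plus a made-up two-line string:

```python
import re

starts = re.search(r"\Aman", "man has") is not None      # True
not_start = re.search(r"\Aman", "in man") is not None    # False
ends = re.search(r"ML\Z", "I like ML") is not None       # True
not_end = re.search(r"ML\Z", "ML is good") is not None   # False

# With re.MULTILINE, ^ matches at every line start, \A only at the string start
caret_lines = re.findall(r"^x", "x\nx", re.MULTILINE)
string_start = re.findall(r"\Ax", "x\nx", re.MULTILINE)

print(starts, not_start, ends, not_end)  # True False True False
print(caret_lines)   # ['x', 'x']
print(string_start)  # ['x']
```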
Match Object
You can list the methods and attributes of a match object using the dir() function.
Some commonly used methods and attributes are explained here:
- match.group()
The group() method returns the part of the string where there is a match.
import re
string = '39801 356, 2102 1111'
# Three-digit number followed by a space followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'
# The match variable contains a Match object (or None if nothing matched).
match = re.search(pattern, string)
if match:
    print(match.group())
else:
    print("pattern not found")
# Output: 801 35
- match.start(), match.end(), and match.span()
The start() method returns the index where the matched substring starts.
The end() method returns the index just past the end of the matched substring.
The span() method returns a tuple containing the start and end index.
match.start()
2
match.end()
8
match.span()
(2, 8)
- match.re and match.string
The re attribute of a match object returns the compiled regular expression object that produced the match.
The string attribute returns the string that was passed to the search.
match.re
re.compile('(\\d{3}) (\\d{2})')
match.string
'39801 356, 2102 1111'
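Because the pattern contains two parenthesized groups, the match object can also return each group separately. A short sketch, reusing the same string and pattern as above:

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

first = match.group(1)         # text captured by the first group
second = match.group(2)        # text captured by the second group
both = match.groups()          # all captured groups as a tuple
second_span = match.span(2)    # start and end index of the second group

print(first)        # 801
print(second)       # 35
print(both)         # ('801', '35')
print(second_span)  # (6, 8)
```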
r prefix:
Placing an r or R prefix before a string literal makes it a raw string, in which a backslash is treated as an ordinary character.
For example, '\n' is a single newline character, whereas r'\n' is two characters: a backslash followed by n.
import re
string = '\n and \r are escape sequences.'
result = re.findall(r'[\n\r]', string)
print(result)
['\n', '\r']
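The effect of the r prefix can be checked by comparing string lengths. In this sketch the last line uses a made-up input to show that a raw pattern and its double-backslash equivalent are the same string.

```python
import re

plain_len = len("\n")   # one character: a newline
raw_len = len(r"\n")    # two characters: a backslash and "n"

same_pattern = (r"\d+" == "\\d+")          # both spell backslash, d, plus
numbers = re.findall(r"\d+", "a1b22c333")  # hypothetical input

print(plain_len)     # 1
print(raw_len)       # 2
print(same_pattern)  # True
print(numbers)       # ['1', '22', '333']
```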