||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MultiSub © 2005 A FREE utility for batch Find/Replace |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MultiSub | |
Use MultiSub for batch Find & Replace * Supports regular expressions * Find text from file contents * Replace text from file contents * Batch mode * Ideal for WebMasters |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Regular Expressions are extremely useful but can be difficult to comprehend at first. This overview does not cover every possibility; it summarizes what you need to know to get started. If you want to go deeper, there are many excellent web sites and books that cover the subject in greater depth.

Although there are many sources of information on regular expressions (known as RegEx and pronounced rej-x), many of them go from a basic RegEx to something far too complicated in the space of a few lines. Given that I learned RegEx the hard way, I feel well qualified to write an introduction. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Special Characters |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| For the purposes of this overview, let us assume that we are searching a file for something we want to replace.

First of all, the characters on your keyboard all represent themselves, with the exception of the characters below:

. # ^ $ \ ? + * | [ ] ( )

These all have a special meaning. So if you try to search for a text string like "Question???" with RegEx enabled, the "?" character is going to assume its special function. To use any of these characters as the actual characters they represent, they need to be 'escaped' using a "\" character. So the previous search would need to be "Question\?\?\?". The \ character indicates that the "?" really is the question mark character and must not assume its special function.

So the first thing to remember is that these special characters take on their special function by default and need to be escaped to assume their normal representation. If we could be taken back in time to when RegEx were first proposed, it would have been better if these characters' normal function was the default. But there we are...

Taking this a step further, if you wanted to search for:

"Hello, what is an * character? (it's not a question mark)."

this would need to become:

"Hello, what is an \* character\? \(it's not a question mark\)\."

which would be interpreted as the previous text, since all special characters are escaped.

A final point on this: if you want the \ character itself to have its normal character meaning, it is escaped exactly as before and becomes \\.

The good news is that you've now dealt with one of the trickiest things to understand! The most common mistake when using RegEx is forgetting to use the \ escape character. So that's part one over with! |
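To see escaping in action outside MultiSub, here is a minimal sketch using Python's `re` module. Python is my choice for illustration only; MultiSub's own RegEx engine may differ in small details. The variable names are made up for the example.

```python
import re

text = "Hello, what is an * character? (it's not a question mark)."

# Hand-escaped pattern, exactly as written in the text above.
manual_pattern = r"Hello, what is an \* character\? \(it's not a question mark\)\."
manual_match = re.search(manual_pattern, text)

# re.escape() inserts the backslashes automatically.
auto_match = re.search(re.escape(text), text)

# Unescaped, the "?" keeps its special function, so the literal "?" is not matched.
unescaped = re.search(r"Question?", "Question?").group()
# Escaped, the "?" is an ordinary question mark.
escaped = re.search(r"Question\?", "Question?").group()

# A literal backslash is escaped in the same way, as \\.
backslash = re.search(r"\\", r"C:\temp").group()

print(manual_match.group() == text, auto_match.group() == text)
print(repr(unescaped), repr(escaped), repr(backslash))
```

If you are building a search string from text you did not type yourself, a helper like `re.escape` saves you from forgetting a backslash.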
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Other special character strings |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| In addition to the special characters, there are a few other special strings:

\d \D \t \n \r \s \S \w \W

These are discussed later. But one is particularly important: "\n". When you type some text into an editor and press return, you get a new line. Unknown to most users, the return key actually inserts an invisible character called a 'line break'. Most software uses the line break represented in RegEx by "\n" (the 'line feed' character), but for reasons known only to Microsoft, a lot of Windows tools put a 'carriage return' character, represented by "\r", in front of it, so each line ends in "\r\n".

So most of the time "\n" is the only one you need to remember; think of it as 'newline'. If you get some weird behaviour with line break characters, bear in mind that you may have a "\r" character present too.

If you wanted to search for all the line breaks in a document, search for \n and all the end of line characters will show up. You can even remove all the line breaks by substituting out \n. |
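The difference between the two line ending styles is easy to see in a short sketch. This uses Python's `re` module purely for illustration (MultiSub may behave differently in places), and the sample strings are invented.

```python
import re

unix_text = "first line\nsecond line\n"
windows_text = "first line\r\nsecond line\r\n"

# Searching for \n finds the line breaks in both styles of file...
unix_breaks = len(re.findall(r"\n", unix_text))
windows_breaks = len(re.findall(r"\n", windows_text))

# ...but substituting out only \n leaves the \r characters behind.
leftover = re.sub(r"\n", "", windows_text)

# Making the \r optional removes both styles of line break cleanly.
joined = re.sub(r"\r?\n", "", windows_text)

print(unix_breaks, windows_breaks)
print(repr(leftover))
print(repr(joined))
```

The stray "\r" characters in `leftover` are exactly the kind of weird behaviour mentioned above.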
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Back
to the special characters, what do they mean? |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Here's an explanation of each special character, with examples. Note that when writing a RegEx you do not need to put it in quotes; I have done that below only to show where each RegEx starts and ends. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| . |
"." - the dot
character means any character except a line break character. So if you wanted to search for all three letter words beginning with b and ending in t, the RegEx would be "b.t"; for similar four letter words it would be "b..t", and so on. In these cases the "." means that any character can be in this position. The only exception is a line break character. So if our text was

bo
at

and we searched for "b..t", the above would not match, since there is a line break present. To search for the specific example above, we would need a RegEx of "b.\n.t". There is a RegEx option in MultiSub (". matches \n") which takes away the exception, so that a "." character truly represents any character, including line breaks. |
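Here is a small sketch of the dot in Python's `re` module, used only for illustration. I am assuming Python's `re.DOTALL` flag is a fair stand-in for MultiSub's ". matches \n" option; the sample words are invented.

```python
import re

words = "bat bit boot belt"
three_letter = re.findall(r"b.t", words)   # b, any one character, t
four_letter = re.findall(r"b..t", words)   # b, any two characters, t

two_lines = "bo\nat"
# The dot refuses to match the line break, so these find nothing.
no_match = re.search(r"b..t", two_lines)
still_none = re.search(r"b...t", two_lines)
# Spelling out the line break works.
explicit = re.search(r"b.\n.t", two_lines)
# With DOTALL the dot matches the line break too (note: three dots,
# because the line break is itself one of the characters in between).
with_dotall = re.search(r"b...t", two_lines, re.DOTALL)

print(three_letter, four_letter)
print(no_match, still_none)
print(repr(explicit.group()), repr(with_dotall.group()))
```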
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| | |
"|" - the pipe
character means "or". If you want to search for fred or joe, the RegEx for that would be "fred|joe". To search for fred, joe or mike, it would be "fred|joe|mike". You can also combine special characters, so with the previous "." examples, you could search for three and four letter words starting with b and ending in t at the same time using a RegEx of "b.t|b..t". |
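The "or" examples above can be tried out with a few lines of Python (again, only an illustration; the sample sentences are my own).

```python
import re

names = re.findall(r"fred|joe|mike", "joe met fred, then mike arrived")

# Combining "|" with the dot: three OR four character matches in one search.
b_words = re.findall(r"b.t|b..t", "bat boot")

print(names)
print(b_words)
```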
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ^ |
"^" - the caret
character means the start of the text. So with some text like "hello, hello these are some words", the RegEx "^hello" would pick out the first word "hello" but not the second occurrence of "hello", since only the first one occurs at the start of the text. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| $ |
"$" - the dollar
character means the end of the text. So with some text like "hello, hello these are some words and more words", the RegEx "words$" would pick out the last occurrence of "words", since this occurs at the end of the text. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ^
$ again |
The ^ and $ characters lead a double life. As mentioned previously, they specify the start/end of the text, even if that text covers several lines. There is an option, however, to make ^ and $ match the start/end of each line: the option "^$ match embedded \n". Let's look again...

This is some text
that straddles more
than one line.

With the option "^$ match embedded \n" deselected, the RegEx "^.h.." would match only the word "This", because it occurs at the start of the entire text, but not the words "that" or "than".

With the option "^$ match embedded \n" selected, the same RegEx of "^.h.." will match "This", "that" and "than", because they all occur at the start of a line.

The same is true of $: with "^$ match embedded \n" deselected, $ means the end of the entire text; with "^$ match embedded \n" selected, $ means the end of a line. |
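The double life of ^ and $ can be sketched in Python, where the `re.MULTILINE` flag plays the role that I am assuming "^$ match embedded \n" plays in MultiSub. This is an illustration, not a description of MultiSub's internals.

```python
import re

text = "This is some text\nthat straddles more\nthan one line."

# Option off: ^ and $ mean the start/end of the entire text.
start_whole = re.findall(r"^.h..", text)
end_whole = re.findall(r"....$", text)

# Option on: ^ and $ also match at the start/end of every line.
start_lines = re.findall(r"^.h..", text, re.MULTILINE)
end_lines = re.findall(r"....$", text, re.MULTILINE)

print(start_whole, start_lines)
print(end_whole, end_lines)
```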
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| So by now, I'm sure you're beginning to see that although
cryptic,
RegEx expressions are actually very powerful. Next... |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ? |
"?" - means 0 or 1
of the previous
character. So "ca?t" would
match "ct", or "cat", but it would not match "coat". Beware greedy and lazy discussed below. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| + |
"+" - is a little
like the "?", but
this means 1 or more of the previous
character. So "ca+t" would
match "cat", "caat" and "caaat" etc., but it
would not match "ct". Beware greedy and lazy discussed soon. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * |
"*" - means 0 or
more of the previous character. For example, if we wanted to search for words beginning with "b", followed by any number of "o" characters and ending in "t", we would use a RegEx of "bo*t". This would match "bt" (zero "o" characters), "bot", "boot", "booot" etc. Beware greedy and lazy discussed very soon. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| .* |
In terms of mixing the special characters, a very common combination is ".*", which means 0 or more of any character. For instance, to search for all words beginning with "b" and ending in "t", we could use "b.*t". Be aware that the "." will happily match spaces too, so the match can run across several words. Beware greedy and lazy discussed very, very soon. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
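The ?, + and * wildcards and the ".*" combination described above can be compared side by side in a short Python sketch. Python's `re` module is used for illustration only, and `whole_match` is a helper I made up that checks whether a RegEx matches an entire word.

```python
import re

def whole_match(pattern, word):
    """Return True if the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None

# "?" : 0 or 1 of the previous character
optional_a = [w for w in ["ct", "cat", "caat", "coat"] if whole_match(r"ca?t", w)]

# "+" : 1 or more of the previous character
one_or_more_a = [w for w in ["ct", "cat", "caat", "coat"] if whole_match(r"ca+t", w)]

# "*" : 0 or more of the previous character
zero_or_more_o = [w for w in ["bt", "bot", "boot", "boat"] if whole_match(r"bo*t", w)]

# ".*" : 0 or more of any character
any_middle = [w for w in ["bt", "bot", "boat", "bird"] if whole_match(r"b.*t", w)]

# In running text, "b.*t" stretches from the first b to the last t on the line.
stretched = re.search(r"b.*t", "a big boat trip").group()

print(optional_a, one_or_more_a)
print(zero_or_more_o, any_middle)
print(repr(stretched))
```

The last result is a first taste of the greedy behaviour covered next.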
|
Lazy/Greedy
Wildcards
|
Here we are at last. The wildcards ?, + and * are incredibly useful. But there is a subtlety that trips everyone up, and I remember having problems with this as I was learning RegEx. Just when you think you have command of the RegEx format, you search for a block of text and a huge great chunk lights up! This is caused by the feature of greedy and lazy (no, not two of Snow White's dwarves). Read on and get your brain ready...
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We have mentioned that ".*" matches zero or more of any character except a line break (unless we switch the ". matches \n" option on). So if I search the word "mississippi" with a RegEx of "m.*s", what would I get? Well, most people would expect "mis" to be the reported match. But no! The match we get is "mississ". Before you start holding your head in your hands, read on; this, believe it or not, is a feature.

This unexpected match is because of laziness and greediness. RegEx wildcards such as the *, + and ? characters are greedy by default. A RegEx of "m.*s" actually means: look for an m followed by an s, with the maximum amount of text in between. So we actually get "mississ" as the match. This is the greedy result; the wildcard has eaten as much text as it can while still giving a valid match.

If we wanted "mis" returned, we need to make the wildcard lazy. We do this by changing the RegEx to "m.*?s", which means: look for an m followed by an s, with the minimum text in between. Note that the question mark here is doing a different job from the "zero or one of the previous character" meaning described earlier. Placed directly after a wildcard, it tells that wildcard to stop at the first available point that yields a valid match.

Probably most of the questions users raise come from misunderstanding the greedy/lazy feature. There are times when you want as much text as possible returned and others when you want the minimum, so it is a useful feature, but like several irritating things in RegEx behaviour, the default is greedy (whereas most users expect the default to be lazy). So when a huge great chunk of text lights up and your RegEx appears to skip straight over the text you intended, do not call the software a pile of horse manure; remember the greedy/lazy feature! |
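The "mississippi" example can be checked directly. The sketch below uses Python's `re` module for illustration; the HTML snippet is an invented example of the "huge great chunk lights up" problem that webmasters tend to hit.

```python
import re

word = "mississippi"
greedy = re.search(r"m.*s", word).group()    # maximum text between m and s
lazy = re.search(r"m.*?s", word).group()     # minimum text between m and s

html = "<b>one</b> and <b>two</b>"
# Greedy: one match running from the first <b> to the last </b>.
greedy_tags = re.findall(r"<b>.*</b>", html)
# Lazy: each bold section is matched separately.
lazy_tags = re.findall(r"<b>.*?</b>", html)

print(greedy, lazy)
print(greedy_tags)
print(lazy_tags)
```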
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| So, we're
getting to the point where you can comprehend a RegEx. By
combining these special characters we can do some very clever things. A few more, then we're
done... |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| [] |
"[]"
- putting characters into square brackets means 'any one of', so [abc] will match a or b or c. So why not do this as a|b|c? Well, the brackets have an additional use: you can specify a range, such as [A-Z], [a-z] or [0-9], or pulling them all together you can have [A-Za-z0-9], which will match any character in the alphabet and any numerical digit. This can also be made specific by putting [abcdefxyz], which matches any one of the letters enclosed.

For example, if we have some text such as:

"Hello, how are you today?, I'll see you at 9:30"

we could find the time part using "[0-9][:][0-9][0-9]". This means any digit, followed by a colon, followed by any digit and then another digit. A better way might be "[0-9]+[:][0-9][0-9]", which would also pick up times that have more than one digit before the colon, such as 12:45. |
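The two time-matching patterns above behave differently once the hour has two digits. A quick Python sketch (for illustration; the second sentence is invented):

```python
import re

text = "Hello, how are you today?, I'll see you at 9:30"
found = re.search(r"[0-9][:][0-9][0-9]", text).group()

later = "The meeting moved to 12:45"
# One digit before the colon: the leading "1" is left out of the match.
narrow = re.search(r"[0-9][:][0-9][0-9]", later).group()
# One or more digits before the colon: the whole time is matched.
wide = re.search(r"[0-9]+[:][0-9][0-9]", later).group()

print(found, narrow, wide)
```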
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| () |
"()" - The final
special characters are round brackets. These surround characters that form a group. So "(ell)" would search for any occurrence of "ell" in that order and would match the "ell" in "Hello". Round brackets can be combined with the other special characters, so (ca)*t would match "t", "cat", "cacat", "cacacat" etc. It would not match the whole of "ct" (only the "t" on its own), because the * applies to the complete group "ca". |
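Grouping is easiest to see by testing whole words. In this Python sketch (illustration only), `re.fullmatch` checks that the RegEx accounts for the entire word, while `re.search` shows what is picked out of a longer string.

```python
import re

candidates = ["t", "cat", "cacat", "cacacat", "ct"]
# The * applies to the whole group "ca", not just to the "a".
whole_words = [w for w in candidates if re.fullmatch(r"(ca)*t", w)]

# Searching inside "ct" still finds the lone "t" (zero repeats of "ca").
inside_ct = re.search(r"(ca)*t", "ct").group()

# A simple group matching the "ell" in "Hello".
ell = re.search(r"(ell)", "Hello").group(1)

print(whole_words)
print(repr(inside_ct), repr(ell))
```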
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ^
again! |
I mentioned that the caret "^"
had a double life, well that was not quite true, it has a triple life!
When a caret is used inside square brackets, it means 'anything but',
or the opposite. So "[^A-Z]"
would match every character except
for the capital letters. |
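A caret inside square brackets is handy for stripping out everything you do not want. A small Python sketch, for illustration, with invented sample strings:

```python
import re

# Remove every character that is NOT a capital letter.
capitals_only = re.sub(r"[^A-Z]", "", "Hello World, RegEx!")

# Find the runs of characters that are NOT digits.
non_digit_runs = re.findall(r"[^0-9]+", "abc123def")

print(capitals_only)
print(non_digit_runs)
```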
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Prefix and Suffix |
OK, one thing to finish with. You want to select text that lies between some characters, e.g. the word "Text", but only when it appears surrounded in this manner: "AAATextBCD". If you use a plain RegEx for this, you will end up selecting the whole string, not just the "Text" part. This is very common when you want to get the text between quotes, or between brackets, but not actually include the quotes or brackets in the result. To the rescue is prefix and suffix support.

For text of AAATextBCD

(?<=AAA)Text(?=BCD)

will give the result "Text".

(?<=AAA) means a prefix of "AAA", but do not include that prefix in the RegEx result.

(?=BCD) means a suffix of "BCD", but do not include that suffix in the RegEx result.

Note that the prefix and suffix are written without quotes, and any special characters in them must be escaped. |
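Prefix and suffix matching can be tried in Python too, where the same (?<=...) and (?=...) syntax is supported. This is an illustration only, and the bracket example is my own; note how the brackets inside the prefix and suffix are escaped.

```python
import re

text = "AAATextBCD"

# A plain RegEx selects the whole string.
plain = re.search(r"AAATextBCD", text).group()

# With a prefix and suffix, only the part in between is selected.
between = re.search(r"(?<=AAA)Text(?=BCD)", text).group()

# Text between round brackets, without the brackets themselves.
# The ( and ) are special characters, so they are escaped as \( and \).
args = re.findall(r"(?<=\()[^)]*(?=\))", "call(foo) and run(bar)")

print(plain, between)
print(args)
```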
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Putting
it all together... |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Now that you know the special characters, what's left? Well, nothing, just putting it all together. By combining the special characters, you can create 'patterns' that allow you to search for just about anything.

So practice with MultiSub: open a text file and try writing a RegEx that will select certain parts of it. You'll soon get the hang of it. A summary is below with some very common and useful examples.

Is it still cryptic? Yes, but it's cryptic in an understandable way when you strip it down. In a very short time you'll be writing long RegEx expressions, and someone looking over your shoulder will wonder what the hell it all means. The weird thing is, you will understand it! |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| A summary... | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MultiSub is Freeware |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||