Difference between revisions of "ReNamer:Regular Expressions"
(Use for renaming files.) |
{{Up|ReNamer}}

== Introduction ==

Regular Expressions (RegEx) are a syntax for specifying patterns of text to search and replace, which can be used for renaming files via the [[ReNamer:Rules:RegEx|Regular Expressions renaming rule]]. Special metacharacters allow you to specify, for instance, that a particular string you are looking for occurs at the beginning or end of a line, or contains N recurrences of a certain character.

Metacharacters, such as <span class="hl-orange">$ . ^ { [ ( | ) * + ? \</span>, are ''interpreted'' according to their individual meaning, instead of being matched literally.

In this document, RegEx patterns are shown in <span class="hl-orange">bold orange</span>. The subject text which is checked against a RegEx pattern for a possible match is shown in '''bold black'''. Parts of the subject text are color-coded to provide a clue as to why a certain part matches (<span class="hl-teal">green</span> color), or does <u>not</u> match (<span class="hl-red">red</span> color).
== Simple matches ==

When the search string does not contain any metacharacters, the RegEx engine works like a "normal" search: it tries to find an exact copy of the search string. (This is also known as a "literal match".)

If you want to find a literal match for a metacharacter, put a backslash '''\''' ''before'' it. (The <span class="hl-orange">\</span> character is called the "''escape character''", because it lets the metacharacter escape from its special duty and act as a normal character. Its combination with a metacharacter is called an "''escape sequence''".)

For example, the metacharacter <span class="hl-orange">^</span> matches the beginning of the string, but <span class="hl-orange">\^</span> matches the character <span class="hl-teal">^</span>.

Note that the RegEx pattern <span class="hl-orange">\\</span> matches the character <span class="hl-teal">\</span>.
{| class="wikitable"
|-
! RegEx pattern
! Matches
! Remarks
|-
| <center><span class="hl-orange">foobar</span></center>
| <center><span class="hl-teal">foobar</span></center>
| This RegEx pattern does not contain any metacharacters, so all characters are matched literally.
|-
| <center><span class="hl-orange">\^FooBarPtr</span></center>
| <center><span class="hl-teal">^FooBarPtr</span></center>
| The <span class="hl-orange">\^</span> escape sequence searches for the character '''^''' ''literally''.
|}
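The difference between a literal match and an escaped metacharacter can be tried out with a short script. This is only a sketch: it uses Python's <code>re</code> module as a stand-in (ReNamer's own RegEx engine may behave differently in details), and the helper name <code>first_match</code> is invented for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

# No metacharacters: the pattern is matched literally.
print(first_match(r"foobar", "my foobar file"))

# \^ matches the character ^ itself.
print(first_match(r"\^FooBarPtr", "x^FooBarPtr"))

# An unescaped ^ means "start of text", so nothing is found here.
print(first_match(r"^FooBarPtr", "x^FooBarPtr"))

# \\ matches a single backslash.
print(first_match(r"\\", "C:\\temp"))
```

Writing the patterns as raw strings (<code>r"..."</code>) keeps Python from interpreting the backslashes before the RegEx engine sees them.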
== Escape sequences ==

We already saw one use of an escape sequence (above).

Specific escape sequences are interpreted as special conditions, as listed below.
{| class="wikitable"
|-
! RegEx pattern
! Matches
|-
| <center><span class="hl-orange">\xnn</span></center>
| Character represented by the hex code ''nn''
|-
| <center><span class="hl-orange">\x{nnnn}</span></center>
| Two-byte character with the hex code ''nnnn'' (Unicode)
|-
| <center><span class="hl-orange">\t</span></center>
| Tab (HT/TAB), same as \x09 (Hex 09)
|-
| <center><span class="hl-orange">\n</span></center>
| New line (NL), same as \x0a (Hex 0a)
|-
| <center><span class="hl-orange">\r</span></center>
| Carriage return (CR), same as \x0d (Hex 0d)
|-
| <center><span class="hl-orange">\f</span></center>
| Form feed (FF), same as \x0c (Hex 0c)
|-
| <center><span class="hl-orange">foo\x20bar</span></center>
| Matches <span class="hl-teal">foo bar</span> (note the space in the middle), but does ''not'' match '''foobar'''
|-
| <center><span class="hl-orange">\tfoobar</span></center>
| Matches <span class="hl-teal">foobar</span> preceded by a tab (the tab is needed for the match)
|}
Note that the tab, new line, carriage return, and form feed are all known as "whitespace" characters, but RegEx can distinguish between them. This allows you to make high-precision searches.
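The escape sequences from the table can be checked with a small sketch. It assumes Python's <code>re</code> module as a stand-in for ReNamer's engine; the helper name <code>is_found</code> is hypothetical. Note that Python does not accept the <span class="hl-orange">\x{nnnn}</span> form (it uses <code>\uXXXX</code> instead), so that row is not exercised here.

```python
import re

def is_found(pattern, text):
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# \x20 is the hex code of the space character.
print(is_found(r"foo\x20bar", "foo bar"))
print(is_found(r"foo\x20bar", "foobar"))

# \t matches a tab, and only a tab: a plain space is not enough.
print(is_found(r"\tfoobar", "\tfoobar"))
print(is_found(r"\tfoobar", " foobar"))

# \t and \x09 describe the same character.
print(is_found(r"\x09", "\t"))
```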
== Character classes ==

A character class is a list of characters surrounded by square brackets "<span class="hl-orange">[</span>" and "<span class="hl-orange">]</span>", which will match any one (and only one) character from the list.

Note that:
* The characters are not separated with a comma or a space.
* If you repeat any character in the list, it is considered only once (duplicates are ignored).
* A hyphen "<span class="hl-orange">-</span>" is used to indicate a range of characters.
{| class="wikitable"
|-
! RegEx Pattern
! Remarks
|-
| <center><span class="hl-orange">[abcdef]</span></center>
| Matches <span class="hl-teal">a</span>, <span class="hl-teal">b</span>, <span class="hl-teal">c</span>, <span class="hl-teal">d</span>, <span class="hl-teal">e</span>, or <span class="hl-teal">f</span> (only ''one'' character), but no other characters
|-
| <center><span class="hl-orange">[c-m]</span></center>
| Matches any one (and only one) of the small alphabetical characters, from <span class="hl-teal">c</span> to <span class="hl-teal">m</span>
|-
| <center><span class="hl-orange">[G-J]</span></center>
| Matches any one (and only one) of the capital alphabetical characters from <span class="hl-teal">G</span> to <span class="hl-teal">J</span>
|-
| <center><span class="hl-orange">[a-zA-Z]</span></center>
| Matches any one (and only one) of the alphabetical characters (capital or small)
|-
| <center><span class="hl-orange">[5-8]</span></center>
| Matches any one (and only one) of the numerical characters from <span class="hl-teal">5</span> to <span class="hl-teal">8</span>
|-
| <center><span class="hl-orange">[\n-\x1F]</span></center>
|
Matches any one (and only one) of the characters with an ordinal value in the range from <span class="hl-teal">#10</span> (<span class="hl-orange">\n</span>) to <span class="hl-teal">#31</span> (<span class="hl-orange">\x1F</span>), which in the [https://en.wikipedia.org/wiki/ASCII ASCII] character table correspond to some non-printable characters.

Note the use of [[#Escape_sequences|escape sequences]] in this example.

|}
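As a quick illustration of the ranges in the table, the sketch below lists every single character that a class picks out of a sample string. It assumes Python's <code>re</code> module rather than ReNamer's own engine, and the helper name <code>class_hits</code> is invented for this example.

```python
import re

def class_hits(char_class, text):
    """List each single character in text that the class matches."""
    return re.findall(char_class, text)

# Only c and m fall inside the range c..m.
print(class_hits(r"[c-m]", "abcxyzm"))

# Digits from 5 to 8.
print(class_hits(r"[5-8]", "0123456789"))

# Letters only; digits and the hyphen are skipped.
print(class_hits(r"[a-zA-Z]", "R2-D2"))

# Duplicates inside the class make no difference.
print(class_hits(r"[aab]", "cab") == class_hits(r"[ab]", "cab"))
```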
There are some special conditions:
* If you do not want any of the characters in the specified class, then place <span class="hl-orange">^</span> at the very beginning of the list (RegEx interprets that as "none of the characters listed in this class").
* If you want <span class="hl-orange">[</span> or <span class="hl-orange">]</span> itself to be a member of a class, put it at the start or end of the list, or create an [[#Simple_.28literal.29_matches|escape sequence]] (by putting <span class="hl-orange">\</span> before it).
{| class="wikitable"
|-
! RegEx Pattern
! Remarks
|-
| <center><span class="hl-orange">[-az]</span></center>
| Matches <span class="hl-teal">a</span>, <span class="hl-teal">z</span>, and <span class="hl-teal">-</span><br/>(since <span class="hl-orange">-</span> is put at the beginning, the escape sequence is not needed)
|-
| <center><span class="hl-orange">[a\-z]</span></center>
| Matches <span class="hl-teal">a</span>, <span class="hl-teal">z</span>, and <span class="hl-teal">-</span><br/>(since <span class="hl-orange">-</span> is ''not'' at the beginning/end, the escape sequence ''is'' needed)
|-
| <center><span class="hl-orange">[^0-9]</span></center>
| Matches any ''non-digit'' character
|-
| <center><span class="hl-orange">[]-a]</span></center>
| Matches any character from <span class="hl-teal">]</span> to <span class="hl-teal">a</span>. <br>(since <span class="hl-orange">]</span> is at the beginning, the escape sequence ''is not'' needed)
|-
| <center><span class="hl-orange">foob[aeiou]r</span></center>
| Matches '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">e</span>r''', etc. but not '''foob<span class="hl-red">b</span>r''', '''foob<span class="hl-red">c</span>r''', etc.
|-
| <center><span class="hl-orange">foob[^aeiou]r</span></center>
| Matches '''foob<span class="hl-teal">b</span>r''', '''foob<span class="hl-teal">c</span>r''', etc. but not '''foob<span class="hl-red">a</span>r''', '''foob<span class="hl-red">e</span>r''', etc.
|}
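The special cases above (a literal hyphen and a negated class) can be sketched as follows. Python's <code>re</code> module is assumed as a stand-in for ReNamer's engine, and the helper name <code>hits</code> is hypothetical.

```python
import re

def hits(pattern, text):
    """List every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# A hyphen at the start of the class is a literal hyphen.
print(hits(r"[-az]", "a-z b"))

# In the middle it must be escaped, otherwise it would form the range a..z.
print(hits(r"[a\-z]", "a-z b"))
print(hits(r"[a-z]", "a-z b"))

# ^ at the start negates the class: everything except digits.
print(hits(r"[^0-9]", "a1b2"))

# A negated class inside a longer pattern.
print(hits(r"foob[^aeiou]r", "foobar foobbr foobcr"))
```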
== Predefined classes ==

Some of the character classes are used so often that RegEx has predefined escape sequences to represent them.

{| class="wikitable"
|-
! RegEx Pattern
! Remarks
|-
| <center><span class="hl-orange">\w</span></center>
| an alphanumeric character, including an ''underscore'' ('''_''')
|-
| <center><span class="hl-orange">\W</span></center>
| a non-alphanumeric character
|-
| <center><span class="hl-orange">\d</span></center>
| a numeric character
|-
| <center><span class="hl-orange">\D</span></center>
| a non-numeric character
|-
| <center><span class="hl-orange">\s</span></center>
| any whitespace character (same as the <span class="hl-orange">[ \t\n\r\f]</span> class)
|-
| <center><span class="hl-orange">\S</span></center>
| a non-whitespace character
|-
| <center><span class="hl-orange">.</span></center>
| any character in a line (the symbol is just a dot)
|}

Notice that the capitalized letter is used to negate (for example, compare <span class="hl-orange">\w</span> with <span class="hl-orange">\W</span>).
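A small sketch of the predefined classes applied to a made-up file name. It assumes Python's <code>re</code> module, whose classes can differ slightly from ReNamer's (for instance, Python's <code>\s</code> also matches a vertical tab); the <code>re.ASCII</code> flag is used to keep <code>\w</code> and <code>\d</code> limited to ASCII characters.

```python
import re

# A hypothetical file name used only for illustration.
name = "IMG_0042 (copy).jpg"

digits = re.findall(r"\d", name, flags=re.ASCII)      # numeric characters
words = re.findall(r"\w+", name, flags=re.ASCII)      # runs of letters, digits, underscore
spaces = re.findall(r"\s", name, flags=re.ASCII)      # whitespace characters
non_words = re.findall(r"\W", name, flags=re.ASCII)   # the negated class of \w

print(digits)
print(words)
print(spaces)
print(non_words)
```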
== Word and text boundaries ==

A word boundary <span class="hl-orange">\b</span> matches a position between a word character <span class="hl-orange">\w</span> and a non-word character <span class="hl-orange">\W</span>. For the purpose of a word boundary position, the start and end of text will be treated as non-word characters <span class="hl-orange">\W</span>. These markers are commonly used for matching patterns as whole words, while ignoring occurrences within words.

{| class="wikitable"
|-
! RegEx Pattern
! Remarks
|-
| <center><span class="hl-orange">\b</span></center>
| word boundary
|-
| <center><span class="hl-orange">\B</span></center>
| not word boundary
|-
| <center><span class="hl-orange">\A</span></center>
| start of text (<span class="hl-orange">^</span> is an alternative)
|-
| <center><span class="hl-orange">\Z</span></center>
| end of text (<span class="hl-orange">$</span> is an alternative)
|}

For example, <span class="hl-orange">\bhis\b</span> will search for a whole word '''his''', but will ignore '''<span class="hl-red">t</span>his''', '''his<span class="hl-red">tory</span>''' or '''<span class="hl-red">w</span>his<span class="hl-red">tle</span>'''.
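The whole-word example can be reproduced with a short sketch, assuming Python's <code>re</code> module as a stand-in for ReNamer's engine. The helper name <code>has_whole_word</code> is invented here.

```python
import re

def has_whole_word(word, text):
    """True if word occurs in text as a whole word (bounded by \\b on both sides)."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

for sample in ["his", "this", "history", "whistle", "it is his book"]:
    print(sample, "->", has_whole_word("his", sample))
```

Passing the word through <code>re.escape</code> is a precaution so that any metacharacters in it are matched literally.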
== Iterators ==

Iterators (quantifiers) are metacharacters that specify how many times the ''preceding'' expression has to repeat. For example, they let you find a numeric sequence that is 3 to 5 digits long.

Iterators can be 'Greedy' or 'Non-Greedy'. Greedy means the expression grabs as ''much'' matching text as possible. In contrast, the non-greedy expression tries to match as ''little'' as possible.

All iterators are greedy by default. Adding <span class="hl-orange">?</span> (question mark) at the end of an iterator makes it non-greedy.

For example:
* when <span class="hl-orange">b+</span> (a greedy expression) is applied to string '''abbbbc''', it matches '''bbbb''' (as many as possible),
* but when <span class="hl-orange">b+?</span> (a non-greedy expression) is applied to '''abbbbc''', it matches only '''b''' (as few as possible).
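The greedy and non-greedy behaviour described above can be observed directly. This is a minimal sketch using Python's <code>re</code> module as a stand-in for ReNamer's engine; the second pair of patterns is an extra illustration that does not come from this page.

```python
import re

def first(pattern, text):
    """Return the first match of pattern in text (assumes there is one)."""
    return re.search(pattern, text).group(0)

# Greedy: takes as many b's as possible.
print(first(r"b+", "abbbbc"))

# Non-greedy: stops after the first b.
print(first(r"b+?", "abbbbc"))

# The same effect with square brackets in a name.
print(first(r"\[.*\]", "[one] and [two]"))
print(first(r"\[.*?\]", "[one] and [two]"))
```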
{| class="wikitable"
|-
! RegEx pattern
! Repetitions
! Greedy?
! Remarks
|-
| <center><span class="hl-orange">*</span></center>
| zero or more
| <center>Yes</center>
| equivalent to <span class="hl-orange">{0,}</span>
|-
| <center><span class="hl-orange">+</span></center>
| one or more
| <center>Yes</center>
| equivalent to <span class="hl-orange">{1,}</span>
|-
| <center><span class="hl-orange">?</span></center>
| zero or one
| <center>Yes</center>
| equivalent to <span class="hl-orange">{0,1}</span>
|-
| <center><span class="hl-orange">{n}</span></center>
| exactly ''n'' times
| <center>Yes</center>
|
|-
| <center><span class="hl-orange">{n,}</span></center>
| at least ''n'' times
| <center>Yes</center>
|
|-
| <center><span class="hl-orange">{n,m}</span></center>
| at least ''n'' but not more than ''m'' times
| <center>Yes</center>
|
|-
| <center><span class="hl-orange">*?</span></center>
| zero or more
| <center>No</center>
| equivalent to <span class="hl-orange">{0,}?</span>
|-
| <center><span class="hl-orange">+?</span></center>
| one or more
| <center>No</center>
| equivalent to <span class="hl-orange">{1,}?</span>
|-
| <center><span class="hl-orange">??</span></center>
| zero or one
| <center>No</center>
| equivalent to <span class="hl-orange">{0,1}?</span>
|-
| <center><span class="hl-orange">{n}?</span></center>
| exactly ''n'' times
| <center>No</center>
|
|-
| <center><span class="hl-orange">{n,}?</span></center>
| at least ''n'' times
| <center>No</center>
|
|-
| <center><span class="hl-orange">{n,m}?</span></center>
| at least ''n'' but not more than ''m'' times
| <center>No</center>
|
|}
Let us see some examples:

{| class="wikitable"
|-
! RegEx pattern
! Remarks
|-
| <center><span class="hl-orange">foob.*r</span></center>
| matches '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">alkjdflkj9</span>r''' and '''foobr'''
|-
| <center><span class="hl-orange">foob.+r</span></center>
| matches '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">alkjdflkj9</span>r''' but not '''foobr'''
|-
| <center><span class="hl-orange">foob.?r</span></center>
| matches '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">b</span>r''' and '''foobr''' but not '''foob<span class="hl-red">alkj9</span>r'''
|-
| <center><span class="hl-orange">fooba{2}r</span></center>
| matches '''foob<span class="hl-teal">aa</span>r'''
|-
| <center><span class="hl-orange">fooba{2,}r</span></center>
| matches '''foob<span class="hl-teal">aa</span>r''', '''foob<span class="hl-teal">aaa</span>r''', '''foob<span class="hl-teal">aaaa</span>r''' etc. but not '''foob<span class="hl-red">a</span>r'''
|-
| <center><span class="hl-orange">fooba{2,3}r</span></center>
| matches '''foob<span class="hl-teal">aa</span>r''', or '''foob<span class="hl-teal">aaa</span>r''' but not '''foob<span class="hl-red">aaaa</span>r''' or '''foob<span class="hl-red">a</span>r'''
|}
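The rows of the table can be verified with a short sketch that tests whether an entire string fits a pattern. It assumes Python's <code>re</code> module as a stand-in for ReNamer's engine; the helper name <code>fits</code> is hypothetical.

```python
import re

def fits(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"foob.*r", "foobr"))          # * allows zero characters
print(fits(r"foob.+r", "foobr"))          # + needs at least one
print(fits(r"foob.?r", "foobalkj9r"))     # ? allows at most one
print(fits(r"fooba{2}r", "foobaar"))      # exactly two a's
print(fits(r"fooba{2,}r", "foobar"))      # at least two a's required
print(fits(r"fooba{2,3}r", "foobaaar"))   # two or three a's
print(fits(r"fooba{2,3}r", "foobaaaar"))  # four a's is too many
```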
== Alternatives ==

A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter <span class="hl-orange">|</span> is used to separate the alternatives.

For example, <span class="hl-orange">fee|fie|foe</span> will match '''fee''', '''fie''', or '''foe''' in the target string.

It can be difficult to see where each alternative starts and ends, so it is common practice to enclose alternatives in parentheses.

For example, <span class="hl-orange">fee|fie|foe</span> can be written as <span class="hl-orange">f(e|i|o)e</span>, to make it easier to understand.

Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen. For example, when matching <span class="hl-orange">foo|foot</span> against '''barefoot''', only the '''foo''' part will match, because that is the first alternative tried, and it successfully matches the target string. (This is important when you are capturing matched text using parentheses.)
{| class="wikitable"
|-
! RegEx Pattern
! Remarks
|-
| <span class="hl-orange">foo(bar|foo)</span>
| matches '''foobar''' or '''foofoo'''
|}

Also remember that alternatives cannot be used inside a character class (square brackets), because <span class="hl-orange">|</span> is interpreted as a literal within <span class="hl-orange">[]</span>. That means <span class="hl-orange">[fee|fie|foe]</span> is the same as <span class="hl-orange">[feio|]</span>. (The other characters are treated as duplicates, and ignored.)
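The left-to-right rule and the character-class caveat can both be seen in a short sketch. Python's <code>re</code> module is assumed as a stand-in for ReNamer's engine, and the helper name <code>first</code> is invented for this example.

```python
import re

def first(pattern, text):
    """Return the first match of pattern in text, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

# The first alternative that works wins, even if a longer one exists.
print(first(r"foo|foot", "barefoot"))
print(first(r"foot|foo", "barefoot"))

# Grouped alternatives.
print(first(r"foo(bar|foo)", "xfoofoo"))
print(first(r"f(e|i|o)e", "fie"))

# Inside square brackets | is just another character in the list.
print(re.findall(r"[fee|fie|foe]", "f|x"))
```

Listing the longer alternative first is one way to make sure it gets a chance to match.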
== Subexpressions ==

Parts of any RegEx pattern can be enclosed in brackets <span class="hl-orange">()</span>, just like using brackets in a mathematics formula. Each part that is enclosed in brackets is called a "''subexpression''".

The brackets serve two main purposes:
* Better readability, as in the mathematical formula '''a+(b+c)'''.
* Make a functional group, as in the mathematical formula '''a(b+c)'''. This group is evaluated first.

Let us see some examples:
{| class="wikitable"
|-
! RegEx Pattern
! Remarks
|-
| <center><span class="hl-orange">(fee)|(fie)|(foe)</span></center>
| Much better readability than the equivalent RegEx pattern <span class="hl-orange">fee|fie|foe</span>.
|-
| <center><span class="hl-orange">(foobar){2,3}</span></center>
| Matches the entire enclosed string '''foobar''' repeated 2 or 3 times<br/>(i.e., matches '''foobarfoobar''' or '''foobarfoobarfoobar''').<br/>(The iterator acts on the entire subexpression. Compare with the example below!)
|-
| <center><span class="hl-orange">foobar{2,3}</span></center>
| Matches '''fooba''' followed by the character '''r''' repeated 2 or 3 times<br/>(i.e., matches '''fooba<span class="hl-teal">rr</span>''' or '''fooba<span class="hl-teal">rrr</span>''').<br/>(The iterator acts only on the last character.)
|-
| <span class="hl-orange">foob([0-9]|a+)r</span>
| Matches '''foob<span class="hl-teal">0</span>r''', '''foob<span class="hl-teal">1</span>r''', '''foob<span class="hl-teal">a</span>r''', '''foob<span class="hl-teal">aa</span>r''', '''foob<span class="hl-teal">aaaa</span>r''', etc.<br/>(The subexpression is evaluated first.)
|}
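The difference between an iterator applied to a subexpression and one applied to a single character is easy to check. The sketch below assumes Python's <code>re</code> module as a stand-in for ReNamer's engine; the helper name <code>fits</code> is hypothetical.

```python
import re

def fits(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# The iterator applies to the whole group.
print(fits(r"(foobar){2,3}", "foobarfoobar"))
print(fits(r"(foobar){2,3}", "foobar"))

# Without the group it applies to the final r only.
print(fits(r"foobar{2,3}", "foobarr"))
print(fits(r"foobar{2,3}", "foobarfoobar"))

# A group holding alternatives: one digit, or one or more a's.
print(fits(r"foob([0-9]|a+)r", "foobaaar"))
print(fits(r"foob([0-9]|a+)r", "foob00r"))
```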
== Backreferences ==

You must have told (or heard) jokes like this one:

"Two guys walk in a bar. The '''''first guy''''' says.... Then the '''''second guy''''' replies....".

Then you are already familiar with ''backreferences''!

A "''backreference''" is a ''numbered reference'' to a previously mentioned thing.

RegEx also has backreferences. Let us understand how backreferences are defined in RegEx.
At this stage, the RegEx engine gives numbers to these matching parts:
* The text that matches the ''entire'' RegEx expression takes the number '0'.
* The text matching any subexpression is given a number based on the position of that subexpression inside the pattern. In other words, text matching the ''n''th subexpression will take the number 'n'.

Now we use those numbers to refer to the entire pattern and/or subexpressions. (That is why these numbers are called "'''backreferences'''".)

The backreference to the ''n''<sup>th</sup> subexpression is written as '''<span class="hl-orange">\n</span>'''.

The backreferences can be used to compose the RegEx pattern itself, as shown below:
{| class="wikitable"
|-
| <center><span class="hl-orange">(.)\1+</span></center>
| matches '''aaaa''' and '''cc''' (any single character that is repeated twice or more)
|-
| <center><span class="hl-orange">(.+)\1+</span></center>
| matches <span class="hl-blue">aa</span><span class="hl-pink">aa</span>, <span class="hl-blue">c</span><span class="hl-pink">c</span>, <span class="hl-blue">ab</span><span class="hl-pink">ab</span><span class="hl-blue">ab</span><span class="hl-pink">ab</span>, <span class="hl-blue">123</span><span class="hl-pink">123</span><br/>(a set of one or more characters, repeated twice or more)<br/>(The character-sets are alternately colored <span class="hl-blue">blue</span> and <span class="hl-pink">pink</span> for easy identification. Observe how a RegEx pattern can match quite different text!)
|}
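Backreferences inside a pattern can be tried with a short sketch. Python's <code>re</code> module is assumed as a stand-in for ReNamer's engine (it also writes the ''n''th backreference as <code>\n</code>); the helper name <code>first</code> is invented here.

```python
import re

def first(pattern, text):
    """Return the first match of pattern in text, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

# One character, followed by one or more copies of itself.
print(first(r"(.)\1+", "xaaaay"))
print(first(r"(.)\1+", "abc"))

# A run of characters, followed by one or more copies of that run.
print(first(r"(.+)\1+", "abababab"))
print(first(r"(.+)\1+", "123123"))

# Number 0 is the whole match, number 1 the first subexpression.
found = re.search(r"(.+)\1+", "123123")
print(found.group(0), found.group(1))
```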
=== Substitution of text using backreference ===

The backreferences are also used in ''find-and-replace'' operations, to re-assemble new text from old.
* The expressions <span class="hl-orange">\1</span> through <span class="hl-orange">\9</span> serve as backreferences to the subexpressions found in the RegEx pattern. The expression <span class="hl-orange">\0</span> is used to represent the text that matches the whole RegEx pattern. These are used in the "find" part of the operation.
* The expressions <span class="hl-orange">$1</span> through <span class="hl-orange">$9</span> represent the actual text that matches the ''respective'' subexpressions. These are used in the "replace" part of the operation.
* The expression <span class="hl-orange">$0</span> refers to the text that matched the whole RegEx pattern. Note: it is not necessary to enclose anything in round brackets <span class="hl-orange">()</span> for this use; <span class="hl-orange">$0</span> is always available.

The replacement text is typically a combination of:
* The text that matched the subexpressions, and
* Some new text.

Note that the RegEx pattern ''may'' have some parts that are not enclosed in (). (In other words, it may have parts that are not subexpressions.) Such parts are not used in the replacement text.

Here are some "find-and-replace" examples:
{| class="wikitable"
|-
! Expression
! Replace
! Description
|-
| <center><span class="hl-orange">(.*) (.*)</span></center>
| <center><span class="hl-orange">$2, $1</span></center>
| Switch two words around and put a comma after the resulting first word. Example: if input string is "John Smith", then output will be "Smith, John".<br/>Notice that the replacement text also has additional literal text in the middle (comma and space).
|-
| <center><span class="hl-orange">\b(\d{2})-(\d{2})-(\d{4})\b</span></center>
| <center><span class="hl-orange">$3-$2-$1</span></center>
| Find date sequences in dd-mm-yyyy format and reverse them into yyyy-mm-dd format. <br>(e.g. 25-10-2007 is converted to 2007-10-25).<br/>'''Note''': This is not a very robust example, because <span class="hl-orange">\d</span> can represent any digit in range of 0-9. That means sequences like 99-99-9999 also will match this pattern, resulting in a problem. This in fact shows that you need to be careful with RegEx patterns!
|-
| <center><span class="hl-orange">\[.*?\]</span></center>
|
| Remove the contents of the [...] (square brackets), and the brackets too. <br>(Replace with ''nothing'' means ''deleting''.)
|}
+ | === Upper case and lower case manipulations === | ||
+ | |||
+ | Backreferences can also be used to adjust the case of certain patterns or fragments, which cannot be easily achieved with generic case manipulation rules. | ||
+ | |||
+ | {| class="wikitable" | ||
+ | ! Flag | ||
+ | ! Description | ||
+ | |- | ||
+ | | <code>\L</code> | ||
+ | | Convert all characters to lowercase. | ||
+ | |- | ||
+ | | <code>\l</code> | ||
+ | | Convert only the first character to lowercase (that's a lower case L). | ||
+ | |- | ||
+ | | <code>\U</code> | ||
+ | | Convert all characters to uppercase. | ||
+ | |- | ||
+ | | <code>\u</code> | ||
+ | | Convert only the first character to uppercase. | ||
|} | |} | ||
− | ''' | + | These flags can be used together with the backreferences in the replace pattern to adjust the case of text inserted by backreferences. |
+ | |||
+ | For example, we can do the following manipulations: | ||
+ | |||
+ | {| class="wikitable" | ||
+ | ! Input || Find || Replace || Result | ||
+ | |- | ||
+ | | test ExAmple || <span class="hl-orange">(.+) (.+)</span> || <span class="hl-orange">$1 $2</span> || <span class="hl-teal">test ExAmple</span> | ||
+ | |- | ||
+ | | test ExAmple || <span class="hl-orange">(.+) (.+)</span> || <span class="hl-orange">\U$1 $2</span> || <span class="hl-teal"><span class="hl-red">TEST</span> ExAmple</span> | ||
+ | |- | ||
+ | | test ExAmple || <span class="hl-orange">(.+) (.+)</span> || <span class="hl-orange">$1 \L$2</span> || <span class="hl-teal">test <span class="hl-red">example</span></span> | ||
+ | |- | ||
+ | | test ExAmple || <span class="hl-orange">(.+) (.+)</span> || <span class="hl-orange">\u$1 \L$2</span> || <span class="hl-teal"><span class="hl-red">T</span>est <span class="hl-red">example</span></span> | ||
+ | |} | ||
+ | |||
+ | '''Note:''' Case manipulation features were added in ''v5.72.4 Beta''. This feature is less common and may not exist in other RegEx engines. | ||
+ | |||
+ | == Limitations for binary data == | ||
+ | |||
+ | One of the known limitations of the RegEx engine when working with binary data is that the <u>input string is not searched beyond the first occurrence of the NULL character (\x00)</u>. This does not affect file names, because they cannot contain NULL characters, but it may affect parsing of binary file content, for example when working in [[ReNamer:Pascal Script|Pascal Script]]. | ||
+ | |||
+ | == Useful references == | ||
− | *[http://www.regular-expressions.info/ www.regular-expressions.info] | + | * [http://www.regular-expressions.info/ www.regular-expressions.info] |
− | * | + | *: Excellent site devoted to regular expressions. It is nicely structured, with many easy to understand examples. |
− | *[ | + | * [https://github.com/andgineer/TRegExpr TRegExpr] component with documentation at [https://regex.sorokin.engineer/ regex.sorokin.engineer] |
− | * | + | *: Regular expressions library for Delphi / Free Pascal. |
− | + | [[Category:ReNamer]] |
Latest revision as of 09:54, 24 January 2022
Introduction
Regular Expressions (RegEx) is a syntax for specifying patterns of text to search and replace, which can be used for renaming files via the Regular Expressions renaming rule. Special metacharacters allow you to specify, for instance, that a particular string you are looking for occurs at the beginning or end of a line, or contains N recurrences of a certain character.
Metacharacters, such as $ . ^ { [ ( | ) * + ? \ are interpreted according to their individual meaning, instead of finding a literal match for them.
In this document, RegEx patterns are shown in bold orange. The subject text which is checked against a RegEx pattern for a possible match is shown in bold black. Parts of the subject text are color-coded to provide a clue as to why a certain part matches (green color), or does not match (red color).
Simple matches
When the search string does not contain any metacharacters, the RegEx engine works like a "normal" search: it tries to find an exact copy of the search string. This is also known as a "literal match".
If you want to find a literal match for a metacharacter, put a backslash \ before it. (The \ character is called the "escape character", because it lets the metacharacter escape from its special duty and act as a normal character. Its combination with a metacharacter is called an "escape sequence".)
For example, metacharacter ^ matches the beginning of string, but \^ matches the character ^.
Note that the RegEx pattern \\ matches the character \.
RegEx pattern | Matches | Remarks |
---|---|---|
This RegEx pattern does not contain any metacharacters; so all characters are matched literally. | ||
The \^ escape sequence searches for the character ^ literally. |
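The difference between a literal match and an escaped metacharacter can be sketched in a few lines. This uses Python's `re` module purely as a stand-in (ReNamer itself uses a different engine, so details may differ), and the sample strings are made up for illustration.

```python
import re

# A pattern without metacharacters is matched literally.
literal = re.search(r"foobar", "my foobar file")

# An unescaped ^ anchors the match to the start of the text,
# so "Ptr" in the middle of the string is not found.
anchored = re.search(r"^Ptr", "Foo^Ptr")

# \^ matches the caret character itself.
escaped = re.search(r"\^Ptr", "Foo^Ptr")

# \\ matches a single backslash.
backslash = re.search(r"\\", r"dir\file")

print(literal.group(0))    # foobar
print(anchored)            # None
print(escaped.group(0))    # ^Ptr
print(backslash.group(0))  # a single backslash
```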
Escape sequences
We already saw one use of an escape sequence above.
Specific escape sequences are interpreted as special conditions, as listed below.
RegEx pattern | Matches |
---|---|
\xnn | character represented by the hex code nn |
\x{nnnn} | two-byte character with hex code nnnn (Unicode) |
\t | tab (HT/TAB), same as \x09 (Hex 09) |
\n | new line (NL), same as \x0a (Hex 0a) |
\r | carriage return (CR), same as \x0d (Hex 0d) |
\f | form feed (FF), same as \x0c (Hex 0c) |
foo\x20bar | matches foo bar (note the space in the middle), but does not match foobar |
\tfoobar | matches foobar preceded by a tab (the tab is needed for the match) |
Note that the tab, new line, carriage return, and form feed are known as "white spaces". But RegEx can distinguish between them. This allows you to make high-precision searches.
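To see how escape sequences tell the different white-space characters apart, here is a small sketch in Python's `re` module. It is only an approximation of ReNamer's engine; in particular, Python does not accept the \x{nnnn} form, so only \xnn, \t, \r and \n are shown. The sample strings are invented.

```python
import re

# \x20 is the hex code of a space: "foo bar" matches, "foobar" does not.
space_hit = re.search(r"foo\x20bar", "foo bar")
space_miss = re.search(r"foo\x20bar", "foobar")

# \t requires a real tab in front of "foobar"; a space is not enough.
tab_hit = re.search(r"\tfoobar", "\tfoobar")
tab_miss = re.search(r"\tfoobar", " foobar")

# \r\n matches only the CR+LF pair, not the lone \n further on.
crlf = re.findall(r"\r\n", "a\r\nb\nc")

print(bool(space_hit), bool(space_miss))  # True False
print(bool(tab_hit), bool(tab_miss))      # True False
print(len(crlf))                          # 1
```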
Character classes
A character class is a list of characters surrounded by square brackets "[" and "]", which will match any one (and only one) character from the list.
Note that:
- The characters are not separated with a comma or a space.
- If you repeat any character in the list, it is considered only once (duplicates are ignored).
- A hyphen "-" is used to indicate range of characters.
RegEx Pattern | Remarks |
---|---|
[abcdef] | Matches a, b, c, d, e, or f (only one character), but no other characters |
[c-m] | Matches any one (and only one) of the small alphabetical characters, from c to m |
[G-J] | Matches any one (and only one) of the capital alphabetical characters from G to J |
[a-zA-Z] | Matches any one (and only one) of the alphabetical characters (capital or small) |
[5-8] | Matches any one (and only one) of numerical characters from 5 to 8 |
[\n-\x1F] | Matches any one (and only one) of characters with their ordinal value in range from #10 (\n) to #31 (\x1F), which in ASCII character table correspond to some non-printable characters. Note the use of escape sequences inside of this example. |
There are some special conditions:
- If you do not want any of the characters in the specified class, then place ^ at the very beginning of the list (RegEx interprets that as "none of the characters listed in this class").
- If you want [ or ] itself to be a member of a class, put it at the start or end of the list, or create an escape sequence (by putting \ before it).
RegEx Pattern | Remarks |
---|---|
[-az] | matches a, z, and - (since - is put at the beginning, the escape sequence is not needed) |
[a\-z] | matches a, z, and - (since - is not at the beginning/end, the escape sequence is needed) |
[^0-9] | matches any non-digit character |
[]-a] | matches any character from ] to a. (since ] is at the beginning, the escape sequence is not needed) |
foob[aeiou]r | Matches with foobar, foober, etc. but not foobbr, foobcr, etc. |
foob[^aeiou]r | Matches with foobbr, foobcr etc. but not foobar, foober, etc. |
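The rules for ranges, negation and literal hyphens or brackets inside a class can be tried out with a short sketch. Python's `re` module is used here as a stand-in for ReNamer's engine, and all sample strings are made up.

```python
import re

# A range matches exactly one character per position.
in_range = re.findall(r"[c-m]", "comma")

# A hyphen is literal at the start of the class, or when escaped.
hyphen_first = re.findall(r"[-az]", "a-z b")
hyphen_escaped = re.findall(r"[a\-z]", "a-z b")

# ^ at the very beginning negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# ] is literal when it is the first character; here it starts a range up to "a".
bracket_range = re.findall(r"[]-a]", "]^_`ab")

# Classes inside a longer pattern.
vowels = re.findall(r"foob[aeiou]r", "foobar foobbr foober")
no_vowels = re.findall(r"foob[^aeiou]r", "foobar foobbr foober")

print(in_range)       # ['c', 'm', 'm']
print(hyphen_first)   # ['a', '-', 'z']
print(non_digits)     # ['a', 'b']
print(bracket_range)  # [']', '^', '_', '`', 'a']
print(vowels)         # ['foobar', 'foober']
print(no_vowels)      # ['foobbr']
```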
Predefined classes
Some of the character classes are used so often that RegEx has predefined escape sequences to represent them.
RegEx Pattern | Remarks |
---|---|
\w | an alphanumeric character, including an underscore (_) |
\W | a non-alphanumeric character |
\d | a numeric character |
\D | a non-numeric character |
\s | any space (same as the [ \t\n\r\f] class) |
\S | a non space |
. | any character in line (the symbol is just a dot) |
Notice that the capitalized letter is used to negate (for example, compare \w with \W).
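As a rough illustration of the predefined classes, the sketch below applies them to a made-up file name using Python's `re` module. Python's \w and \d are Unicode-aware by default, so behaviour on non-ASCII text may differ from ReNamer's engine; the ASCII sample avoids that question.

```python
import re

name = "IMG_0042 final.jpg"  # hypothetical file name

digits = re.findall(r"\d+", name)      # runs of numeric characters
words = re.findall(r"\w+", name)       # runs of alphanumerics and underscores
non_words = re.findall(r"\W", name)    # everything \w rejects
parts = re.split(r"\s+", "a \t b")     # split on any run of white space
per_line = re.findall(r".", "a\nb")    # the dot does not match the line break

print(digits)     # ['0042']
print(words)      # ['IMG_0042', 'final', 'jpg']
print(non_words)  # [' ', '.']
print(parts)      # ['a', 'b']
print(per_line)   # ['a', 'b']
```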
Word and text boundaries
A word boundary \b matches a position between a word character \w and a non-word character \W. For the purpose of a word boundary position, the start and end of text will be treated as non-word characters \W. These markers are commonly used for matching patterns as whole words, while ignoring occurrences within words.
RegEx Pattern | Remarks |
---|---|
\b | word boundary |
\B | not word boundary |
\A | start of text (^ is an alternative) |
\Z | end of text ($ is an alternative) |
For example, \bhis\b will search for a whole word his, but will ignore this, history or whistle.
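The whole-word example can be checked with a short sketch. It uses Python's `re` module as a stand-in for ReNamer's engine, and the sample sentence is invented.

```python
import re

text = "his history: this is his whistle"

# \b on both sides keeps only the stand-alone word.
whole_words = re.findall(r"\bhis\b", text)

# Without boundaries, every occurrence inside other words counts too.
anywhere = re.findall(r"his", text)

# \B requires that there is NO boundary, i.e. "his" continues a word.
inside_word = re.findall(r"\Bhis", text)

print(len(whole_words))  # 2
print(len(anywhere))     # 5
print(len(inside_word))  # 2
```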
Iterators
Iterators (quantifiers) are metacharacters that specify how many times the preceding expression has to repeat. For example, an iterator lets you find a numeric sequence that is 3 to 5 digits long.
Iterators can be 'Greedy' or 'Non-Greedy'. Greedy means the expression grabs as much matching text as possible. In contrast, the non-greedy expression tries to match as little as possible.
All iterators are greedy by default. Adding ? (question mark) at the end of an iterator makes it non-greedy.
For example:
- when b+ (a greedy expression) is applied to string abbbbc, it matches bbbb (as many as possible),
- but when b+? (a non-greedy expression) is applied to abbbbc, it matches only b (as few as possible).
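The greedy and non-greedy behaviour described above can be reproduced with a minimal sketch in Python's `re` module (a stand-in for ReNamer's engine). The second pair of lines uses an invented string to show why the distinction matters in practice.

```python
import re

greedy = re.search(r"b+", "abbbbc").group(0)   # as many b's as possible
lazy = re.search(r"b+?", "abbbbc").group(0)    # as few b's as possible

# With two bracketed parts, the greedy form swallows everything in between.
greedy_span = re.search(r"\[.*\]", "[a] x [b]").group(0)
lazy_span = re.search(r"\[.*?\]", "[a] x [b]").group(0)

print(greedy)       # bbbb
print(lazy)         # b
print(greedy_span)  # [a] x [b]
print(lazy_span)    # [a]
```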
RegEx pattern | Meaning | Greedy? | Remarks |
---|---|---|---|
* | zero or more | Yes | equivalent to {0,} |
+ | one or more | Yes | equivalent to {1,} |
? | zero or one | Yes | equivalent to {0,1} |
{n} | exactly n times | Yes | |
{n,} | at least n times | Yes | |
{n,m} | at least n but not more than m times | Yes | |
*? | zero or more | No | equivalent to {0,}? |
+? | one or more | No | equivalent to {1,}? |
?? | zero or one | No | equivalent to {0,1}? |
{n}? | exactly n times | No | |
{n,}? | at least n times | No | |
{n,m}? | at least n but not more than m times | No | |
Let us see some examples:
RegEx pattern | Remarks |
---|---|
foob.*r | matches foobar, foobalkjdflkj9r and foobr |
foob.+r | matches foobar, foobalkjdflkj9r but not foobr |
foob.?r | matches foobar, foobbr and foobr but not foobalkj9r |
fooba{2}r | matches foobaar |
fooba{2,}r | matches foobaar, foobaaar, foobaaaar etc. but not foobar |
fooba{2,3}r | matches foobaar, or foobaaar but not foobaaaar or foobar |
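A small sketch that runs several iterator patterns against a fixed list of candidate strings makes the differences easy to compare. It uses Python's `re.fullmatch`, so each candidate must match in its entirety; the helper `matching` and the candidate list are hypothetical names introduced only for this illustration.

```python
import re

candidates = ["foobr", "foobar", "foobaar", "foobaaar", "foobaaaar", "foobalkjdflkj9r"]
patterns = [r"foob.*r", r"foob.+r", r"foob.?r", r"fooba{2}r", r"fooba{2,}r", r"fooba{2,3}r"]

def matching(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

results = {p: matching(p) for p in patterns}

for pattern, hits in results.items():
    print(pattern, "->", hits)
```

Only `foob.*r` accepts `foobr`, because `*` allows zero repetitions, while `{2,3}` is the only form that sets an upper limit on the run of a's.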
Alternatives
A RegEx expression can have multiple alternative characters or subexpressions. The metacharacter | is used to separate the alternatives.
For example, fee|fie|foe will match with fee, fie, or foe in the target string.
It can be difficult to see where each alternative starts and ends. This is why it is common practice to enclose alternatives in parentheses.
For example, fee|fie|foe can be written as f(e|i|o)e.
Alternatives are tried from left to right, and the first alternative for which the entire expression matches is the one chosen. For example, when matching foo|foot against barefoot, only the foo part will match, because that is the first alternative tried and it successfully matches the target string. (This is important when you are capturing matched text using parentheses.)
RegEx Pattern | Remarks |
---|---|
foo(bar|foo) | matches foobar or foofoo |
Also remember that alternatives cannot be used inside a character class (square brackets), because | is interpreted as a literal within []. That means [fee|fie|foe] is the same as [feio|]. (The other characters are treated as duplicates and ignored.)
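The left-to-right rule and the behaviour of | inside a character class can both be observed in a short sketch. Python's `re` module stands in for ReNamer's engine here, and the test strings are invented.

```python
import re

# The first alternative that lets the whole expression match wins.
first_wins = re.search(r"foo|foot", "barefoot").group(0)
reordered = re.search(r"foot|foo", "barefoot").group(0)

# Grouped alternatives: collect the full text of every match.
grouped = [m.group(0) for m in re.finditer(r"f(e|i|o)e", "fee fie foe fum")]

# Inside [...] the | is just another literal character in the class.
class_chars = sorted(set(re.findall(r"[fee|fie|foe]", "fee|fum")))

print(first_wins)   # foo
print(reordered)    # foot
print(grouped)      # ['fee', 'fie', 'foe']
print(class_chars)  # ['e', 'f', '|']
```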
Subexpressions
Parts of any RegEx pattern can be enclosed in brackets (), just like using brackets in a mathematics formula. Each part that is enclosed in brackets is called a "subexpression".
The brackets serve two main purposes:
- Better readability, as in the mathematical formula a+(b+c).
- Make a functional group, as in the mathematical formula a(b+c). This group is evaluated first.
Let us see some examples:
RegEx Pattern | Remarks |
---|---|
f(e|i|o)e | Much better readability than the equivalent RegEx pattern fee|fie|foe. |
(foobar){2,3} | Matches with the entire enclosed string foobar repeated 2 or 3 times (i.e., matches with foobarfoobar or foobarfoobarfoobar). |
foobar{2,3} | Matches with fooba followed by the character r repeated 2 or 3 times (i.e., matches with foobarr or foobarrr). The iterator acts only on the last character. |
foob([0-9]|a+)r | Matches foob0r, foob1r, foobar, foobaar, foobaaaar, etc. (The subexpression is evaluated first.) |
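To illustrate how brackets change what an iterator applies to, the sketch below compares a grouped and an ungrouped pattern. It uses Python's `re.fullmatch` as a stand-in for ReNamer's engine, and the helper `full` is a hypothetical convenience function.

```python
import re

def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# With brackets, {2,3} repeats the whole group "foobar".
group_repeat = full(r"(foobar){2,3}", "foobarfoobar")
group_not_r = full(r"(foobar){2,3}", "foobarr")

# Without brackets, {2,3} repeats only the final "r".
char_repeat = full(r"foobar{2,3}", "foobarr")
char_not_group = full(r"foobar{2,3}", "foobarfoobar")

# A group containing alternatives: one digit, or one or more a's.
digit_ok = full(r"foob([0-9]|a+)r", "foob0r")
many_a_ok = full(r"foob([0-9]|a+)r", "foobaaar")
mixed_fails = full(r"foob([0-9]|a+)r", "foob0ar")

print(group_repeat, group_not_r)         # True False
print(char_repeat, char_not_group)       # True False
print(digit_ok, many_a_ok, mixed_fails)  # True True False
```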
Backreferences
You must have told (or heard) jokes like this one:
"Two guys walk in a bar. The first guy says.... Then the second guy replies....".
Then you are already familiar with backreferences!
A "backreference" is a numbered reference to a previously mentioned thing.
RegEx also has backreferences. Let us understand how backreferences are defined in RegEx.
The RegEx engine tries to find text that matches the whole RegEx pattern. If a matching text is found, the RegEx engine identifies the matching text for each of the subexpressions in the pattern.
At this stage, the RegEx engine gives numbers to these matching parts:
- The text that matches the entire RegEx expression takes the number '0'.
- The text matching any subexpression is given a number based on the position of that subexpression inside the pattern. In other words, text matching the nth subexpression will take the number 'n'.
Now we use those numbers to refer to the entire pattern and/or subexpressions. (That is why these numbers are called "backreference".)
The backreference to the nth subexpression is written as \n.
The backreferences can be used to compose the RegEx pattern itself, as shown below:
(.)\1+ | matches aaaa and cc (any single character that is repeated twice or more) |
(.+)\1+ | matches aaaa, cc, abababab, 123123 (a set of one or more characters, repeated twice or more). (The character-sets are alternately colored blue and pink for easy identification. Observe how a RegEx pattern can match quite different text!) |
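Backreferences inside a pattern can be tried with a brief sketch. It uses Python's `re` module, which happens to share the \1 syntax, as a stand-in for ReNamer's engine; the helper `repeats` and the sample strings are assumptions made for the example.

```python
import re

def repeats(pattern, text):
    """Return the full text of every match of the pattern."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# (.) captures one character; \1+ demands that same character again, one or more times.
single_char = repeats(r"(.)\1+", "aaaa b cc")

# (.+) captures a run of characters; \1+ demands the same run again.
run_ab = repeats(r"(.+)\1+", "abababab")
run_123 = repeats(r"(.+)\1+", "123123")

print(single_char)  # ['aaaa', 'cc']
print(run_ab)       # ['abababab']
print(run_123)      # ['123123']
```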
Substitution of text using backreference
The backreferences are also used in find-and-replace operations, to re-assemble new text from old.
- The expressions \1 through \9 serve as backreferences to the subexpressions found in the RegEx pattern. The expression \0 is used to represent the text that matches the whole RegEx pattern. These are used in the "find" part of the operation.
- The expressions $1 through $9 represent the actual text that matches the respective subexpressions. These are used in the "replace" part of the operation.
- The expression $0 refers to the text matched by the whole pattern (the whole original name, if the pattern matches all of it). Note: it is not necessary to enclose anything in round brackets () for this use; $0 is always available.
The replacement text is typically a combination of:
- The text that matched the subexpressions, and
- Some new text.
Note that the RegEx pattern may have some parts that are not enclosed in (). (In other words, it may have parts that are not subexpressions.) Such parts are not used in the replacement text.
Here are some "find-and-replace" examples:
Expression | Replace | Description |
---|---|---|
(.*) (.*) | $2, $1 | Switch two words around and put a comma after the resulting first word. Example: if input string is "John Smith", then output will be "Smith, John". Notice that the replacement text also has additional literal text in the middle (comma and space). |
\b(\d{2})-(\d{2})-(\d{4})\b | $3-$2-$1 | Find date sequences in dd-mm-yyyy format and reverse them into yyyy-mm-dd format. (e.g. 25-10-2007 is converted to 2007-10-25). Note: This is not a very robust example, because \d can represent any digit in range of 0-9. That means sequences like 99-99-9999 also will match this pattern, resulting in a problem. This in fact shows that you need to be careful with RegEx patterns! |
\[.*?\] | | Remove the contents of the [...] (square brackets), and the brackets too. (Replace with nothing means deleting.) |
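The find-and-replace idea can be sketched outside ReNamer as well. Python's `re.sub` writes backreferences in the replacement as \1 or \g<1> rather than $1, so the hypothetical helper `to_python_template` below translates the $n form first; it is an assumption made for this illustration and handles only single-digit references. The sample names are invented.

```python
import re

def to_python_template(replacement):
    """Translate $1..$9 (and $0) into Python's \\g<n> replacement syntax."""
    return re.sub(r"\$(\d)", r"\\g<\1>", replacement)

def rename(find, replace, name):
    """Apply one find-and-replace step to a name."""
    return re.sub(find, to_python_template(replace), name)

swapped = rename(r"(.*) (.*)", "$2, $1", "John Smith")

date_pattern = r"\b(\d{2})-(\d{2})-(\d{4})\b"
iso_date = rename(date_pattern, "$3-$2-$1", "Report 25-10-2007.txt")

# The pattern checks only the shape, not whether the date is valid.
nonsense_date = rename(date_pattern, "$3-$2-$1", "99-99-9999")

# An empty replacement deletes whatever matched.
no_brackets = rename(r"\[.*?\]", "", "Song [live] [2007].mp3")

print(swapped)        # Smith, John
print(iso_date)       # Report 2007-10-25.txt
print(nonsense_date)  # 9999-99-99
print(no_brackets)
```

Note that removing the bracketed parts leaves the surrounding spaces behind, so a follow-up clean-up step is usually needed.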
Upper case and lower case manipulations
Backreferences can also be used to adjust the case of certain patterns or fragments, which cannot be easily achieved with generic case manipulation rules.
Flag | Description |
---|---|
\L | Convert all characters to lowercase. |
\l | Convert only the first character to lowercase (that's a lower case L). |
\U | Convert all characters to uppercase. |
\u | Convert only the first character to uppercase. |
These flags can be used together with the backreferences in the replace pattern to adjust the case of text inserted by backreferences.
For example, we can do the following manipulations:
Input | Find | Replace | Result |
---|---|---|---|
test ExAmple | (.+) (.+) | $1 $2 | test ExAmple |
test ExAmple | (.+) (.+) | \U$1 $2 | TEST ExAmple |
test ExAmple | (.+) (.+) | $1 \L$2 | test example |
test ExAmple | (.+) (.+) | \u$1 \L$2 | Test example |
Note: Case manipulation features were added in v5.72.4 Beta. This feature is less common and may not exist in other RegEx engines.
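As the note says, other engines may lack these flags. Python's `re` module is one such engine: it has no \U, \L, \u or \l in replacement text, so the sketch below emulates the last three table rows with replacement functions. The function names are hypothetical and this is only an approximation of what ReNamer does.

```python
import re

FIND = r"(.+) (.+)"
TEXT = "test ExAmple"

def upper_first_group(m):
    # Emulates "\U$1 $2": first group fully uppercased.
    return m.group(1).upper() + " " + m.group(2)

def lower_second_group(m):
    # Emulates "$1 \L$2": second group fully lowercased.
    return m.group(1) + " " + m.group(2).lower()

def capitalise_and_lower(m):
    # Emulates "\u$1 \L$2": only the first character of group 1 is uppercased.
    first = m.group(1)
    return first[:1].upper() + first[1:] + " " + m.group(2).lower()

upper_result = re.sub(FIND, upper_first_group, TEXT)
lower_result = re.sub(FIND, lower_second_group, TEXT)
mixed_result = re.sub(FIND, capitalise_and_lower, TEXT)

print(upper_result)  # TEST ExAmple
print(lower_result)  # test example
print(mixed_result)  # Test example
```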
Limitations for binary data
One of the known limitations of the RegEx engine when working with binary data is that the input string is not searched beyond the first occurrence of the NULL character (\x00). This does not affect file names, because they cannot contain NULL characters, but it may affect parsing of binary file content, for example when working in Pascal Script.
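The effect of this limitation can be pictured with a small sketch. Python's `re` module does search past NULL bytes, so the limitation is emulated here by keeping only the data before the first NULL; the byte string is invented for the example.

```python
import re

data = b"IMG\x00 2007-10-25"  # hypothetical binary content

# Python itself finds the digits located after the NULL byte.
full_search = re.search(rb"\d{4}", data)

# Emulating the limitation: only the part before the first NULL is visible.
visible = data.split(b"\x00", 1)[0]
limited_search = re.search(rb"\d{4}", visible)

print(full_search.group(0))  # b'2007'
print(visible)               # b'IMG'
print(limited_search)        # None
```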
Useful references
- www.regular-expressions.info
- Excellent site devoted to regular expressions. It is nicely structured, with many easy to understand examples.
- TRegExpr component with documentation at regex.sorokin.engineer
- Regular expressions library for Delphi / Free Pascal.