Negative Lookahead Assertions In Regular Expressions


  • Negative lookahead assertions are ‘zero-width’ assertions that specify that the text ahead in the input string SHOULD NOT match the pattern specified in the assertion
  • Since it is a zero-width assertion, it does not consume any characters in the input string
  • If the text at the current position matches the assertion pattern, the regex fails at that position, even when the part of the regular expression that comes after the assertion would otherwise match there
  • You can specify a negative lookahead assertion in your regular expression by including the assertion pattern inside (?! and )
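
The points above can be seen in a small sketch. The pattern /foo(?!bar)/ and the sample strings are made up for illustration and are not part of the examples that follow; the sketch only shows that the lookahead inspects the text ahead without making it part of the match.

```javascript
// Hypothetical pattern: "foo" that is NOT immediately followed by "bar".
const pattern = /foo(?!bar)/;

const hit = "foobaz".match(pattern);
const miss = "foobar".match(pattern);

console.log(hit[0]);        // "foo" (the lookahead inspected "baz" but did not consume it)
console.log(hit[0].length); // 3
console.log(miss);          // null, because "bar" follows "foo"

// Since nothing was consumed, the rest of a pattern continues right after "foo".
const followUp = "foobaz".match(/foo(?!bar)\w+/);
console.log(followUp[0]);   // "foobaz"
```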

A Simple Example

  • You want to match a pattern of 3 digits followed by 3 lowercase letters, but the letters must not be the string ‘abc’. For example, 123bca is a match but 123abc is not
let wordPttn = /\d{3}(?!abc)[a-z]{3}/;
let targetString = "123bca";
let result = targetString.match(wordPttn);
console.log((result !== null ? `Pattern ${wordPttn} FOUND in '${targetString}'. Matched string is '${result[0]}'` : `Pattern ${wordPttn} NOT FOUND in '${targetString}'`));
targetString = "123abc";
result = targetString.match(wordPttn);
console.log((result !== null ? `Pattern ${wordPttn} FOUND in '${targetString}'. Matched string is '${result[0]}'` : `Pattern ${wordPttn} NOT FOUND in '${targetString}'`));
  • Let us walk through the pattern in the above example /\d{3}(?!abc)[a-z]{3}/
  • The \d{3} sub-expression says that the match must begin with 3 digits
  • The negative lookahead (?!abc) specifies that, once \d{3} has matched, the text immediately ahead in the input string must not be abc
  • That is, the part matched by [a-z]{3} must not be abc. 123xyz, 123pqr and 123bca are all valid matches, but 123abc is not
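
As a quick check of the walkthrough above, the following sketch runs the same pattern against a few extra inputs. The additional sample strings (123abd, 123xabc, 123abcd) are assumptions added for illustration; they show that the lookahead only inspects the text immediately after the digits.

```javascript
const pattern = /\d{3}(?!abc)[a-z]{3}/;

// Sample inputs; the last three are illustrative additions.
const samples = ["123xyz", "123bca", "123abc", "123abd", "123xabc", "123abcd"];

const outcome = {};
for (const sample of samples) {
  // test() returns true when the pattern matches anywhere in the string
  outcome[sample] = pattern.test(sample);
}
console.log(outcome);

// "123xabc" still matches (as "123xab"): the text right after the digits is "xab",
// so the "abc" that appears one character later is never checked by the lookahead.
console.log("123xabc".match(pattern)[0]);
```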

A Slightly More Complex Example

  • You can write a pattern to match all ‘Mission Impossible’ movie titles except the 2nd (because it's the worst)
let wordPttn = /Mission: Impossible(?![ ]2)[\w\W]*/;
let targetStrings = [
  "Mission: Impossible",
  "Mission: Impossible 2",
  "Mission: Impossible III",
  "Mission: Impossible - Ghost Protocol",
  "Mission: Impossible - Rogue Nation",
  "Mission: Impossible - Fallout",
  "Mission: Impossible - Dead Reckoning Part One",
  "Mission: Impossible - The Final Reckoning"
];
for (let targetString of targetStrings) {
  let result = targetString.match(wordPttn);
  console.log(result !== null
    ? `Pattern ${wordPttn} FOUND in '${targetString}'. Matched string is '${result[0]}'`
    : `Pattern ${wordPttn} NOT FOUND in '${targetString}'`);
}
  • Let’s analyze the pattern from the above example
    /Mission: Impossible(?![ ]2)[\w\W]*/
  • The Mission: Impossible part matches the text ‘Mission: Impossible’ literally
  • The pattern ‘[ ]2’ in the negative lookahead (?![ ]2) tells the regex engine that the text matched so far must not be followed by ' 2' (a space followed by 2)
  • The last part of the pattern, ‘[\w\W]*’, says that after matching ‘Mission: Impossible’, the subsequent text can contain any number of word and non-word characters, except of course the ones eliminated by the negative lookahead
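
One consequence of the analysis above is that (?![ ]2) rejects any title that continues with a space and a 2, not just the second film. The sketch below illustrates this with a made-up title, "Mission: Impossible 2049", and a possible stricter variant, (?! 2$), which only rejects ' 2' at the very end of the string. Both the title and the stricter pattern are assumptions for illustration, not part of the original example.

```javascript
const original = /Mission: Impossible(?![ ]2)[\w\W]*/;
// Hypothetical variant: reject " 2" only when it is the end of the string.
const stricter = /Mission: Impossible(?! 2$)[\w\W]*/;

const titles = [
  "Mission: Impossible 2",
  "Mission: Impossible 2049", // made-up title, used only to show the difference
  "Mission: Impossible III",
];

const comparison = titles.map((title) => ({
  title,
  original: original.test(title),
  stricter: stricter.test(title),
}));
console.log(comparison);
// The made-up title is rejected by the original pattern but accepted by the stricter one.
```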

Finding Text Within Quotes

  • You can use negative lookahead to write a pattern to find a string in quotes while allowing single quotes in double quotes and vice versa
let wordPttn = /(['"])((?!\1|\\).|\\.)*\1/g;
let targetStrings = [
  '"Poor devil!" he said, commiseratingly, after he had listened to my misfortunes. "What are you up to now?"',
  "'Poor devil!' he said, commiseratingly, after he had listened to my misfortunes. 'What are you up to now?'",
  "\"Poor devil!\" he said, commiseratingly, after he had listened to my misfortunes. \"What are you up to now?\"",
];
for (let targetString of targetStrings) {
  let results = targetString.matchAll(wordPttn);
  for (let result of results) {
    console.log("match = " + result[0]);
  }
  console.log("\n");
}
  • Let’s analyze the pattern /(['"])((?!\1|\\).|\\.)*\1/g
  • The first part (['"]) matches either a ' (single quote) or a " (double quote). Note that it is in a capture group so that the matched type of quote, either single or double, can be referenced later
  • The next part is ((?!\1|\\).|\\.)* which consists of the two alternatives ‘(?!\1|\\).’ and ‘\\.’
  • The first alternative is the negative lookahead (?!\1|\\) followed by the . (dot). The dot matches everything other than a new line, but the negative lookahead prohibits the text matched by the backreference \1, which is either ' (single quote) or " (double quote), from matching. The negative lookahead also prevents a backslash (\\) from matching
    Note that the negative lookahead only applies to the first alternative and not the second one
  • So (?!\1|\\). matches any character (other than a new line) that is neither the opening quote nor a backslash ‘\’
  • The next alternative \\. matches a backslash followed by any character. It helps to match escape sequences like \"
  • Why is it necessary to exclude a backslash in the negative lookahead ‘(?!\1|\\).’? Because if the first alternative were allowed to match a backslash, as in the modified regex ‘(?!\1).|\\.’, then escape sequences like \" would never be matched by the next alternative \\.
  • In a string like "Poor \"devil!", the modified regex matches the backslash with the first alternative, so the escaped quote (\") is not matched as a unit by the second alternative
  • Instead, the quote after the backslash is rejected by the backreference \1 in the negative lookahead, the repetition stops, and the closing \1 matches that quote. The match ends there, with the matched string being "Poor \" instead of "Poor \"devil!"
  • If you want the dot to match new lines as well, just add the 's' flag to the pattern
    /(['"])((?!\1|\\).|\\.)*\1/gs
  • If you want to learn about positive lookahead assertions, you can see here
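
To make the backslash discussion in the quote-matching analysis concrete, here is a minimal sketch that compares the modified regex (without the backslash exclusion) against the full pattern, and then shows the effect of the 's' flag. The variable names and the two-line sample string are assumptions chosen for illustration.

```javascript
// Modified regex from the discussion: the first alternative may match a backslash.
const withoutBackslashCheck = /(['"])((?!\1).|\\.)*\1/;
// Full pattern: a backslash is left for the second alternative.
const withBackslashCheck = /(['"])((?!\1|\\).|\\.)*\1/;

// String.raw keeps the backslashes, so the text is: "Poor \"devil!" he said
const text = String.raw`"Poor \"devil!" he said`;

const shortMatch = text.match(withoutBackslashCheck)[0];
const fullMatch = text.match(withBackslashCheck)[0];
console.log(shortMatch); // "Poor \"
console.log(fullMatch);  // "Poor \"devil!"

// With the 's' flag the dot also matches a new line inside the quotes.
const dotAll = /(['"])((?!\1|\\).|\\.)*\1/s;
const twoLines = '"first line\nsecond line"';
console.log(withBackslashCheck.test(twoLines)); // false
console.log(dotAll.test(twoLines));             // true
```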
