Regular expression for duplicate words

Regex, Duplicates, Backreference, Capture Group

Regex Problem Overview


I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:

> Paris in **the the** spring.
>
> Not **that that** is related.
>
> Why are you laughing? Are **my my** regular expressions THAT bad??

Is there a single regular expression that will match ALL of the bold strings above?

Regex Solutions


Solution 1 - Regex

Try this regular expression:

\b(\w+)\s+\1\b

Here \b is a word boundary and \1 references the captured match of the first group.
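
As a minimal sketch of how this pattern behaves, the JavaScript below runs it over the three sentences from the question. The helper name findDoubledWords is made up for illustration, and the g flag is an addition so that every occurrence in a sentence is reported.

```javascript
// Hypothetical helper: returns every "word word" pair found in the text.
function findDoubledWords(text) {
  // \b(\w+) captures a whole word; \s+\1\b requires the same word again.
  const pattern = /\b(\w+)\s+\1\b/g;
  return text.match(pattern) || [];
}

const samples = [
  "Paris in the the spring.",
  "Not that that is related.",
  "Why are you laughing? Are my my regular expressions THAT bad??",
];

for (const sentence of samples) {
  console.log(findDoubledWords(sentence));
}
// Logs ["the the"], then ["that that"], then ["my my"].
```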


Solution 2 - Regex

I believe this regex handles more situations:

/(\b\S+\b)\s+\b\1\b/

A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
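
One case where the two patterns appear to differ is a word containing an apostrophe or hyphen, since \S+ accepts any non-whitespace character while \w+ does not. The sketch below compares them on a made-up input; the variable names and the sample string are illustrative and not taken from the answer.

```javascript
// Solution 1 pattern: a word is a run of \w characters only.
const wordPattern = /\b(\w+)\s+\1\b/;
// Solution 2 pattern: a word is a run of any non-whitespace characters.
const nonSpacePattern = /(\b\S+\b)\s+\b\1\b/;

// Illustrative input (an assumption, not from the original answer).
const text = "joe's joe's pizza";

console.log(wordPattern.test(text));         // false
console.log(nonSpacePattern.test(text));     // true
console.log(text.match(nonSpacePattern)[0]); // joe's joe's
```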

Solution 3 - Regex

The expression below finds any number of consecutive repeated words. The matching can be made case-insensitive.

String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(input);

// Check for subsequences of input that match the compiled pattern
while (m.find()) {
     input = input.replaceAll(m.group(0), m.group(1));
}

Sample Input : Goodbye goodbye GooDbYe

Sample Output : Goodbye

Explanation:

The regular expression:

\b : A word boundary

\w+ : One or more word characters

(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the captured word and ends at a word boundary. The + on the whole group lets it match more than one repetition.

Grouping :

m.group(0) : Contains the entire match; in the example above, Goodbye goodbye GooDbYe

m.group(1) : Contains the first word of the match; in the example above, Goodbye

The replaceAll call replaces each run of consecutive repeated words with the first instance of the word.
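
For comparison, the same collapse can be sketched in JavaScript with a single replace call. This is an illustration under assumptions rather than a translation of the Java answer: the function name collapseRepeats is hypothetical, and the g and i flags stand in for the find loop and Pattern.CASE_INSENSITIVE.

```javascript
// Hypothetical helper: collapses each run of repeated words to its first word.
function collapseRepeats(input) {
  // g: handle every run in the input; i: compare words case-insensitively.
  const pattern = /\b(\w+)(\s+\1\b)+/gi;
  // "$1" refers to the first captured word, like m.group(1) in the Java code.
  return input.replace(pattern, "$1");
}

console.log(collapseRepeats("Goodbye goodbye GooDbYe"));  // Goodbye
console.log(collapseRepeats("Paris in the the spring.")); // Paris in the spring.
```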

Solution 4 - Regex

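Try the regular expression below:

  • \b word boundary at the start of the word

  • \w+ one or more word characters (captured as group 1)

  • \W+ one or more non-word characters separating the words

  • \1 the same word that was already matched

  • \b word boundary at the end of the word

  • ()* zero or more repetitions of the separator and repeated word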

     // Requires: import java.util.Scanner; import java.util.regex.Matcher; import java.util.regex.Pattern;
     public static void main(String[] args) {

         // A word followed by zero or more repeats of itself, separated by non-word characters
         String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
         Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

         Scanner in = new Scanner(System.in);

         // The first input line holds the number of sentences that follow
         int numSentences = Integer.parseInt(in.nextLine());

         while (numSentences-- > 0) {
             String input = in.nextLine();

             Matcher m = p.matcher(input);

             // Check for subsequences of input that match the compiled pattern
             while (m.find()) {
                 // Quote the matched text so punctuation in it is not read as regex syntax
                 input = input.replaceAll(Pattern.quote(m.group(0)), m.group(1));
             }

             // Prints the modified sentence.
             System.out.println(input);
         }

         in.close();
     }
    

Solution 5 - Regex

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)

Try this regex, which catches a word that occurs two or more times so that only a single occurrence is left behind. The duplicate words need not even be consecutive.

/\b(\w+)\b(?=.*?\b\1\b)/ig

Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
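
A minimal sketch of how the lookahead pattern could be used for stripping follows. The function name stripEarlierDuplicates and the whitespace cleanup step are assumptions added for illustration; the answer itself only gives the pattern.

```javascript
// Hypothetical helper: removes every word that occurs again later in the text.
function stripEarlierDuplicates(text) {
  // The lookahead (?=.*?\b\1\b) succeeds only if the same word appears again later.
  const pattern = /\b(\w+)\b(?=.*?\b\1\b)/ig;
  // Assumption: collapse the leftover gaps into single spaces afterwards.
  return text.replace(pattern, "").replace(/\s+/g, " ").trim();
}

console.log(stripEarlierDuplicates("one two one three two")); // one three two
console.log(stripEarlierDuplicates("Hello world hello"));     // world hello
```

Because a word is removed whenever a later copy exists, it is the last occurrence of each word that survives, so the remaining words may not be in first-seen order. Note also that . does not cross line breaks unless the s flag is added.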


Solution 6 - Regex

The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):

(\b\w+\b)\W+\1

Solution 7 - Regex

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

Solution 8 - Regex

This is the regex I use to remove duplicate phrases in my twitch bot:

(\S+\s*)\1{2,}

(\S+\s*) looks for any run of non-whitespace characters, followed by optional whitespace.

\1{2,} then requires at least two more consecutive copies of that phrase. In other words, the pattern matches when the same phrase appears three or more times in a row.
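
To make the counting concrete, here is a small JavaScript sketch with made-up chat messages (the variable name and the inputs are illustrative only).

```javascript
// The pattern from the answer: one captured phrase plus at least two more copies.
const repeatedPhrase = /(\S+\s*)\1{2,}/;

console.log(repeatedPhrase.test("spam spam "));      // false: only two copies
console.log(repeatedPhrase.test("spam spam spam ")); // true: three copies
console.log(repeatedPhrase.test("spam spam spam"));  // false: see note below
```

The last case is worth noting: because the trailing whitespace is part of the captured phrase, a final copy with no whitespace after it is not counted, so three copies at the very end of a string go undetected unless a fourth follows or the input ends in whitespace.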

Solution 9 - Regex

Here is one that catches multiple words multiple times:

(\b\w+\b)(\s+\1)+

Solution 10 - Regex

The example in JavaScript: The Good Parts can be adapted to do this:

var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;

\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
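
The limitation can be seen with a word containing accented letters. The sketch below uses a made-up French phrase and drops the g flag so that test() carries no state between calls; both choices are assumptions for illustration.

```javascript
// Pattern from this answer (g flag dropped so test() is stateless).
const doubledWords = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/i;
// The \w-based pattern from the accepted answer.
const asciiOnly = /\b(\w+)\s+\1\b/;

// Illustrative input: "un été été chaud", written with escapes for é.
const text = "un \u00e9t\u00e9 \u00e9t\u00e9 chaud";

console.log(asciiOnly.test(text));    // false: é is not a \w character
console.log(doubledWords.test(text)); // true
```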

Solution 11 - Regex

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.

Pattern: /(\b\S+)(?:\s+\1\b)+/
Replace: $1 (replaces the fullstring match with capture group #1)

This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).

Specifically:

  • \b (word boundary) characters are vital to ensure partial words are not matched.
  • The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
  • The + (one or more) quantifier on the non-capturing group is more appropriate than *, because * would make the regex engine match and replace singleton occurrences as well, which is wasteful pattern design.

Note: if you are dealing with sentences or input strings that contain punctuation, the pattern will need to be refined further.
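
The difference between + and * can be checked by counting matches. The following JavaScript sketch uses a made-up input string; the g flag and the variable names are additions for illustration.

```javascript
const withPlus = /(\b\S+)(?:\s+\1\b)+/g;
const withStar = /(\b\S+)(?:\s+\1\b)*/g;

// Illustrative input (not from the original answer).
const text = "this is is a test test test";

console.log(text.replace(withPlus, "$1")); // this is a test
console.log(text.match(withPlus).length);  // 2: only the repeated runs
console.log(text.match(withStar).length);  // 4: every word is matched, repeated or not
```

Both variants produce the same replaced string, but the * version also matches "this" and "a" and replaces each with itself.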

Solution 12 - Regex

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:

s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")

I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)

First, I put (^|\s+) to make sure the match starts at a full word; otherwise "child's steak" would become "child'steak" (the two "s" characters would match). Then it matches a full word ((\S+)), followed by an end of string ($) or a run of whitespace (\s+) and the same word again, the whole repeated one or more times.

I tried it like this and it worked well:

var s = "here here here     here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result     result";
console.log(s.replace(/(\b\S+\b)(($|\s+)\1)+/g, "$1"));
// --> here is ahi-ahi joe's the result
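
The partial-word problem described above can be reproduced with a short JavaScript sketch. The unanchored pattern here is a simplified stand-in written for this comparison, not something taken from the answer, and the variable names are illustrative.

```javascript
// Simplified pattern with no guard at the start of the word (an assumption for contrast).
const unanchored = /(\S+)(\s+\1)+/g;
// Pattern from the answer: a match must begin at the start of the string or after whitespace.
const anchored = /(^|\s+)(\S+)(($|\s+)\2)+/g;

const text = "child's steak";

console.log(text.replace(unanchored, "$1"));             // child'steak
console.log(text.replace(anchored, "$1$2"));             // child's steak
console.log("the result result".replace(anchored, "$1$2")); // the result
```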

Solution 13 - Regex

Try this regular expression; it covers all cases of consecutive repeated words:

\b(\w+)\s+\1(?:\s+\1)*\b

Solution 14 - Regex

I think another solution would be to use named capture groups and backreferences like this:

/.* (?<mytoken>\w+)\s+\k<mytoken> .*/  OR  /.*(?<mytoken>\w{3,}).+\k<mytoken>.*/

Kotlin:

val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)

Java:

var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);

JavaScript:

const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);

// OR

const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);

All of the above detect test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

Solution 15 - Regex

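Use this if you want case-insensitive matching of duplicate words. The pattern is written as a Java string literal, hence the doubled backslashes; (?i) is the inline case-insensitive flag.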

(?i)\\b(\\w+)\\s+\\1\\b

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Joshua | View Question on Stack Overflow
Solution 1 - Regex | Gumbo | View Answer on Stack Overflow
Solution 2 - Regex | Mike Viens | View Answer on Stack Overflow
Solution 3 - Regex | Akriti | View Answer on Stack Overflow
Solution 4 - Regex | Faakhir | View Answer on Stack Overflow
Solution 5 - Regex | Niket Pathak | View Answer on Stack Overflow
Solution 6 - Regex | soulmerge | View Answer on Stack Overflow
Solution 7 - Regex | Ignacio Vazquez-Abrams | View Answer on Stack Overflow
Solution 8 - Regex | Neceros | View Answer on Stack Overflow
Solution 9 - Regex | synaptikon | View Answer on Stack Overflow
Solution 10 - Regex | Daniel | View Answer on Stack Overflow
Solution 11 - Regex | mickmackusa | View Answer on Stack Overflow
Solution 12 - Regex | Nico | View Answer on Stack Overflow
Solution 13 - Regex | MIsmail | View Answer on Stack Overflow
Solution 14 - Regex | Mahozad | View Answer on Stack Overflow
Solution 15 - Regex | Neelam | View Answer on Stack Overflow