Multiline regex pattern

Task: Parse a file and capture whatever text appears between a pair of double quotes like the following:

"Catch me"

Not so difficult, you could use the following regex:

".*"

This will match any characters that appear between the double quotes.
But is it really any character?

Not quite. Suppose you have to deal with text that spans several lines (that is, it contains CR / LF characters), like the following:

"Catch me
if you can"

The special character . that is usually described as “any character” in fact means “any character, except for newlines”, so it won’t work in our case. You also can’t put it inside a character set like [.\s], hoping that it would mean:
. match any character
\s plus any whitespace character (including newlines)
The problem is that inside [] special characters like . or * or ? lose their meaning and are treated as literals, so [.\s] only matches a literal dot or a whitespace character.
Python addresses this issue with the option re.DOTALL, which makes the dot match really any character, even a newline.
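As a quick check of the three points above, here is a minimal Python sketch. The sample string and the variable names are only illustrative assumptions, not part of the original task.

```python
import re

# Quoted text that spans two lines
text = '"Catch me\nif you can"'

# By default the dot does not match a newline, so nothing is found.
plain = re.search(r'".*"', text)

# Inside a character set the dot is a literal dot, so [.\s] matches only
# dots and whitespace and cannot get past the letters in the text.
char_set = re.search(r'"[.\s]*"', text)

# With re.DOTALL the dot also matches newlines.
dotall = re.search(r'".*"', text, re.DOTALL)

print(plain)            # None
print(char_set)         # None
print(dotall.group(0))  # the whole quoted text, newline included
```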
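If you are working in another language whose regex library lacks such an option, or you simply prefer a pattern that does not depend on a flag, you can use the following trick. The C# example below uses it, although .NET also offers RegexOptions.Singleline for the same purpose:
"[\w\W]*"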

The [\w\W] means: “match any word character or any non-word character”. You could get the same result by combining other special characters, but I find this way especially clear: since you are just joining two complementary sets, there is no doubt that you will truly match any character.
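The trick is not tied to any particular language. As a small sketch, here it is in Python, compared against re.DOTALL; the sample string is again just an assumption for illustration.

```python
import re

text = '"Catch me\nif you can"'

# \w and \W are complementary sets, so [\w\W] matches every character,
# newlines included, without needing any flag.
trick = re.search(r'"[\w\W]*"', text)

# The flag-based version, for comparison.
dotall = re.search(r'".*"', text, re.DOTALL)

print(trick.group(0) == dotall.group(0))  # True
```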

Making things more complex

If you have a file like this:

"Catch me
if you can"
"other line"
"and the last one"

The regex will match from the first double quote all the way to the last one.

When you use * you are saying “match zero or more characters that fulfill the preceding condition.” But because * is a greedy quantifier, the regex engine will choose the longest match possible. To solve this you should use a non-greedy quantifier like *?, which matches as few characters as possible, as in this regex:
"[\w\W]*?"
And to put everything between the double quotes (but not the quotes themselves) in a group, so you can for instance iterate over the results, just add parentheses right inside the quotes:

"([\w\W]*?)"
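To see the difference between the greedy and the non-greedy quantifier, here is a minimal Python sketch using re.findall, which returns the contents of the group for every match. The sample text mirrors the file above; the variable names are illustrative.

```python
import re

text = '"Catch me\nif you can"\n"other line"\n"and the last one"'

# Greedy: a single match that runs from the first quote to the last one.
greedy = re.findall(r'"([\w\W]*)"', text)

# Non-greedy: each match stops at the nearest closing quote.
lazy = re.findall(r'"([\w\W]*?)"', text)

print(len(greedy))  # 1
print(lazy)         # ['Catch me\nif you can', 'other line', 'and the last one']
```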


Code examples

In C# (Visual C# 2010) we need to do the following:


using System;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
	class TestRegularExpressions
	{
		static void Main()
		{
			// doubled "" are used to escape double quotes inside a verbatim string
			// "?<quoted_line>" gives the capture group a simple name
			// @ marks a verbatim string literal, so C# does not process backslash escapes (handy when you write regex patterns)
			string pattern = @"""(?<quoted_line>[\w\W]*?)""";

			Regex regex = new Regex(pattern);

			string text = new System.IO.StreamReader(@"c:\Users\adrian\test.txt").ReadToEnd();
			/*
			* Suppose that c:\Users\adrian\test.txt has the following content:
			*
			"Catch me
			if you can"
			"other line"
			"and the last one"
			*/

			Match m = regex.Match(text);

			// iterate over all the matches
			while (m.Success)
			{
				Console.WriteLine("Captured line: " + m.Groups["quoted_line"]);
				m = m.NextMatch();
			}

			Console.WriteLine();

		}
	}
}

That will print:
Captured line: Catch me
if you can
Captured line: other line
Captured line: and the last one


Of course, in Python it takes much less effort to get the same result.

'''
Created on 25/12/2009

capturing regex groups example

@author: adrian
'''

import re

pattern = r'"(?P<quoted_line>.*?)"'

text = """"Catch me
if you can"
"other line"
"and the last one" """

# Retrieve group(s) by name
for m in re.finditer(pattern, text, re.DOTALL):
    print("Captured line: %s" % m.group("quoted_line"))

The output is the same as before:

Captured line: Catch me
if you can
Captured line: other line
Captured line: and the last one

Note the differences between Python and C#:

  • As previously mentioned, in Python you can use re.DOTALL so that the dot also matches newlines.
  • To name a group “quoted_line” you write ?P<quoted_line> in Python instead of the C# version ?<quoted_line>
  • You write less and get more!
