Discussion:
Start and end of line anchors in middle of regexp
gilles_arcas@hotmail.com [sed-users]
2014-06-27 19:53:17 UTC
I am trying to understand how ^ and $ behave in the middle of a regexp with GNU sed 4.2.1. The help file says they are treated as normal characters (except at the start and end of subexpressions).


This can be checked with the following commands (input.txt contains the line xa^by):


sed -n -e "/a^b/p" input.txt
sed -n -e "/\(a^b\)/p" input.txt


However, when using the -r switch, the equivalent commands no longer match:


sed -r -n -e "/a^b/p" input.txt
sed -r -n -e "/(a^b)/p" input.txt



This is puzzling. Am I missing something?
Davide Brini dave_br@gmx.com [sed-users]
2014-06-30 15:22:18 UTC
Post by ***@hotmail.com [sed-users]
I am trying to understand the working of ^ and $ in the middle of a
regexp with gnused 4.2.1. The help file says they are considered as
normal characters (except at start and end of sub expressions).
sed -n -e "/a^b/p" input.txt
sed -n -e "/\(a^b\)/p" input.txt
sed -r -n -e "/a^b/p" input.txt
sed -r -n -e "/(a^b)/p" input.txt
This is puzzling. Am I missing something?
Without checking, I seem to remember that when using basic RE (the default
if you don't use -r), the ^ and $ anchors are special only if they appear
respectively at the beginning and at the end of a subexpression and normal
characters otherwise, whereas with extended RE (those you get with -r/-E)
they are always special.

So in short, the difference in behavior is expected.
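
A minimal sketch of that rule, assuming GNU sed (POSIX leaves the meaning of ^ right after \( in a BRE up to the implementation). The sample strings are made up for illustration, and -E is used as the equivalent of -r:

```shell
# BRE: '^' right after '\(' acts as an anchor in GNU sed, so "ab" matches.
bre_sub=$(printf 'ab\n' | sed -n '/\(^a\)/p')

# BRE: '^' in the middle of the expression is an ordinary character.
bre_mid=$(printf 'xa^by\n' | sed -n '/a^b/p')

# ERE: '^' is an anchor wherever it appears, so 'a^b' cannot match.
ere_mid=$(printf 'xa^by\n' | sed -E -n '/a^b/p')

echo "BRE, ^ after group open: [$bre_sub]"
echo "BRE, ^ mid-pattern:      [$bre_mid]"
echo "ERE, ^ mid-pattern:      [$ere_mid]"
```

Empty brackets in the output mean the address did not select the line.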
--
D.
Daniel Goldman dgoldman@ehdp.com [sed-users]
2014-06-30 19:21:42 UTC
I also found this puzzling and confusing. This was a new one for me.

Looking through the official documentation, here is what I found on the
"Open Group": "1. A <circumflex> ( '^' ) outside a bracket expression
shall anchor the expression or subexpression it begins to the beginning
of a string; such an expression or subexpression can match only a
sequence starting at the first character of a string. For example, the
EREs "^ab" and "(^ab)" match "ab" in the string "abcdef", but fail to
match in the string "cdefab", and the ERE "a^b" is valid, but can never
match because the 'a' prevents the expression "^b" from matching
starting at the first character."

This makes it clear that with ERE "a^b" can never match. Now why they
would design something that can never match is beyond me. :( Anyway,
that's the way it is.
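
The examples in the quoted passage can be tried directly. This is only a sketch, assuming GNU sed with -E for ERE; the helper name ere_matches is hypothetical and exists only for this illustration:

```shell
# Print "yes" if ERE $1 matches somewhere in string $2, otherwise "no".
# (Hypothetical helper; the regex must not contain a slash.)
ere_matches() {
  if [ -n "$(printf '%s\n' "$2" | sed -E -n "/$1/p")" ]; then
    echo yes
  else
    echo no
  fi
}

echo "^ab   on abcdef: $(ere_matches '^ab' 'abcdef')"
echo "(^ab) on abcdef: $(ere_matches '(^ab)' 'abcdef')"
echo "^ab   on cdefab: $(ere_matches '^ab' 'cdefab')"
echo "(^ab) on cdefab: $(ere_matches '(^ab)' 'cdefab')"
# Even an input that literally contains a^b is not matched by the ERE a^b.
echo "a^b   on a^b:    $(ere_matches 'a^b' 'a^b')"
```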

Here is what I found testing sed, using variations on your examples:

$ sed --version
GNU sed version 4.2.1

///////// First, basic regular expression (BRE) (no -r)

----- As BRE, ^ as char #1 means "begin pattern space"
----- We already knew that.

$ echo 'xaby' | sed 's/^xa/==/'
==by

----- As BRE, ^ NOT char #1 means "literal ^ character"
----- We already knew that.

$ echo 'xa^by' | sed 's/a^b/===/'
x===y

----- As BRE, \^ ALWAYS means "literal ^ character"
----- We already knew that.

$ echo 'xa^by' | sed 's/a\^b/===/'
x===y

$ echo 'xaby' | sed 's/\^xa/==/'
xaby

$ echo '^xaby' | sed 's/\^xa/==/'
==by

///////// Next, extended regular expression (ERE) (-r)

----- As ERE, ^ as char #1 means "begin pattern space"
----- We already knew that. Same as BRE behavior.

$ echo 'xaby' | sed -r 's/^xa/==/'
==by

----- As ERE, ^ NOT char #1 still means "begin pattern space"
----- Big surprise, at least to me. "a^b" NEVER matches as ERE.

$ echo 'xa^by' | sed -r 's/a^b/===/'
xa^by

----- As ERE, \^ ALWAYS means "literal ^ character"
----- We already kind of knew that. Same as BRE behavior.

$ echo 'xa^by' | sed -r 's/a\^b/===/'
x===y

$ echo 'xaby' | sed -r 's/\^xa/==/'
xaby

$ echo '^xaby' | sed -r 's/\^xa/==/'
==by

Assuming I have not missed something or made a mistake, here is what I
learned:

If using ^ as a literal in the middle of a BRE, it's perhaps better to write
\^ instead, so the same expression also works as an ERE. Instead of the
admittedly ambiguous "a^b", it's maybe better to always use "a\^b" to make
clear that a literal hat character is meant. The same logic applies to the
dollar sign, as "e$f" can never match as an ERE.
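
The tests above only cover ^. A parallel sketch for the dollar sign, assuming GNU sed behaves symmetrically for $ (the input string and variable names are made up for illustration; the bracket form [$] is an alternative way to get a literal, not something from the posts above):

```shell
# BRE: '$' in the middle of the pattern is an ordinary character.
bre_plain=$(printf 'e$f\n' | sed 's/e$f/===/')

# ERE: '$' is an anchor even mid-pattern, so nothing is replaced.
ere_plain=$(printf 'e$f\n' | sed -E 's/e$f/===/')

# Escaped '\$' is a literal dollar sign in both flavors.
bre_esc=$(printf 'e$f\n' | sed 's/e\$f/===/')
ere_esc=$(printf 'e$f\n' | sed -E 's/e\$f/===/')

# A bracket expression also yields a literal dollar sign.
ere_brk=$(printf 'e$f\n' | sed -E 's/e[$]f/===/')

echo "BRE e\$f:   $bre_plain"
echo "ERE e\$f:   $ere_plain"
echo "BRE e\\\$f:  $bre_esc"
echo "ERE e\\\$f:  $ere_esc"
echo "ERE e[\$]f: $ere_brk"
```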

I don't think this quirky behavior was generally recognized before your
post. I appreciate you found this.

Daniel
dgoldman@ehdp.com [sed-users]
2014-07-03 17:47:48 UTC
I'm posting for Gilles. He tried to post twice, and both times the message ended up going only to me. The problem is that the Yahoo Groups interface is confusing and poorly designed: the default is to reply to the person, not the group, so the message is easily lost. :( This tripped me up in the past; now it has tripped up someone else. In its wisdom, the Yahoo Groups interface even "hides" the "To:" box, so the poster is left in the dark. :( You can "unhide" the "To:" box by clicking on the double arrow symbol, but it's not obvious at all. Anyway, here is the message from Gilles:


-------------------------------------------------------------------------


Thanks a lot for answers and details. Quoting GNU sed 4.2.1 manual:

"The only difference between basic and extended regular expressions is
in the behavior of a few characters: ‘?’, ‘+’, parentheses, and braces
(‘{}’)".
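
The characters the manual lists can be compared side by side. This is a sketch assuming GNU sed, with -E for ERE; the inputs are made up so that the literal and the special reading give different results:

```shell
# '+': literal in BRE (matches the text "a+"), repetition in ERE.
bre_plus=$(printf 'aaa+\n' | sed 's/a+/X/')
ere_plus=$(printf 'aaa+\n' | sed -E 's/a+/X/')

# '{2}': literal in BRE (matches the text "a{2}"), interval in ERE.
bre_brace=$(printf 'aa{2}\n' | sed 's/a{2}/X/')
ere_brace=$(printf 'aa{2}\n' | sed -E 's/a{2}/X/')

# '(' and ')': literal in BRE, grouping in ERE.
bre_group=$(printf '(ab)\n' | sed 's/(ab)/X/')
ere_group=$(printf '(ab)\n' | sed -E 's/(ab)/X/')

echo "a+   BRE: $bre_plus   ERE: $ere_plus"
echo "a{2} BRE: $bre_brace   ERE: $ere_brace"
echo "(ab) BRE: $bre_group   ERE: $ere_group"
```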

No mention of ^ and $. This could perhaps be added. Are there other known
differences between BRE and ERE?

Thanks again.
dgoldman@ehdp.com [sed-users]
2014-07-03 17:58:56 UTC
<< No mention of ^ and $. This could perhaps be added. Are there
<< other known differences between BRE and ERE?


Yes, the GNU sed docs say nothing about ERE treatment of ^ and $. I wrote a book about sed, basically because the sed documentation and other learning resources are rather outdated and poor. In the course of writing the book, I did a LOT of research on sed and regular expressions. Yet I never discovered what you posted. I updated the web site related to the book to add what you found. It's the first correction to the book! I appreciate your finding it.

I would add that BRE and ERE also treat '|' differently. With BRE, '\|' is used for alternation (in GNU sed this is a GNU extension), while with ERE plain '|' is used. Otherwise, I do not know of any other differences between BRE and ERE. But this has gotten me thinking that maybe there are some other differences lurking out there.
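
A short sketch of the '|' difference, assuming GNU sed (which accepts '\|' in BRE) and -E for ERE; the words and variable names are made up for illustration:

```shell
# BRE: '\|' is alternation, so "dog" is matched by the second branch.
bre_alt=$(printf 'dog\n' | sed 's/cat\|dog/pet/')

# ERE: plain '|' is alternation.
ere_alt=$(printf 'dog\n' | sed -E 's/cat|dog/pet/')

# BRE: plain '|' is an ordinary character, so the whole text "cat|dog" matches.
bre_lit=$(printf 'cat|dog\n' | sed 's/cat|dog/pet/')

# ERE on the same input: only the first alternative, "cat", is replaced.
ere_first=$(printf 'cat|dog\n' | sed -E 's/cat|dog/pet/')

echo "BRE cat\\|dog on dog:     $bre_alt"
echo "ERE cat|dog  on dog:     $ere_alt"
echo "BRE cat|dog  on cat|dog: $bre_lit"
echo "ERE cat|dog  on cat|dog: $ere_first"
```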


I have no idea about updating the GNU sed manual, but posting as you did might help bring that about.

Thanks,
Daniel


PS - I almost posted this to myself, even after my previous explanation. That yahoo interface IS unfriendly.