encountered a sed POSIX gotcha (GNU vs. BSD)

Discussion:

Tim Chase sed@thechases.com [sed-users]

2018-05-06 12:31:30 UTC

Encountered this one yesterday and thought I'd bring it to light.

On a GNU system this works:

sed -n '/pattern/{s/foo/bar/p}'

but fails on BSD sed (tested on OpenBSD & FreeBSD)

sed: 1: "/pattern/{s/foo/bar/p}": bad flag in substitute command: '}'

This is actually POSIX which requires a newline or semicolon before a
close-brace:

"The <right-brace> shall be preceded by a <newline> or <semicolon>
(before any optional <blank> characters preceding the <right-brace>)."

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html#tag_20_116_13_03

Thus on a BSD system it needs to have the extra semi-colon:

sed -n '/pattern/{s/foo/bar/p;}'
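
A minimal sketch of the two portable spellings, assuming a POSIX shell. The input lines and variable names are made up for illustration and are not from the thread:

```shell
# Portable: a semicolon precedes the closing brace.
portable_semicolon=$(printf 'pattern foo\nother foo\n' |
  sed -n '/pattern/{s/foo/bar/p;}')

# Equally portable: a newline precedes the closing brace.
portable_newline=$(printf 'pattern foo\nother foo\n' |
  sed -n '/pattern/{
s/foo/bar/p
}')

printf '%s\n' "$portable_semicolon"
printf '%s\n' "$portable_newline"
```

Both forms should print only the substituted line that matched /pattern/, on GNU and BSD sed alike.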

Unfortunately, I suspect that a lot of scripts out there depend on
the GNU-specific relaxed (but POSIX-violating) rules, so enforcing
the POSIX will likely break them. However, I also know that a number
of GNU tools respect a POSIXLY_CORRECT environment variable which
might be useful in this situation.

And now you know too.

-tim

Stephane Chazelas stephane.chazelas@gmail.com [sed-users]

2018-05-06 18:53:39 UTC


Post by Tim Chase ***@thechases.com [sed-users]
Encountered this one yesterday and thought I'd bring it to light.
sed -n '/pattern/{s/foo/bar/p}'
but fails on BSD sed (tested on OpenBSD & FreeBSD)
sed: 1: "/pattern/{s/foo/bar/p}": bad flag in substitute command: '}'
This is actually POSIX which requires a newline or semicolon before a
"The <right-brace> shall be preceded by a <newline> or <semicolon>
(before any optional <blank> characters preceding the <right-brace>)."
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html#tag_20_116_13_03
sed -n '/pattern/{s/foo/bar/p;}'
Unfortunately, I suspect that a lot of scripts out there depend on
the GNU-specific relaxed (but POSIX-violating) rules, so enforcing
the POSIX will likely break them. However, I also know that a number
of GNU tools respect a POSIXLY_CORRECT environment variable which
might be useful in this situation.

[...]

GNU sed is not violating POSIX here, a script that relies on the
specific behaviour of GNU sed (of treating it as if there was
a ; before the }) or for that matter on the BSD behaviour (of
failing with an error, though it would be very unusual for a
script to rely on a utility returning with an error) would be
the ones being non-conformant.

POSIX doesn't specify the behaviour for {s/x/y/} so either
failing with an error, or the GNU behaviour (or any other
behaviour) are valid behaviours. The GNU behaviour is a more
useful one. The only problem with it is that it doesn't help you
realise that your script is not portable.

But it's not GNU sed's job to do that.

What you're looking for is a linting tool that can spot non
portable constructs in your code. You could make a "sedlint"
fork of GNU sed that does that, like posh forked pdksh to make a
shell that helps identify non-POSIX shell constructs, or one
could write a shellcheck-like static code analysis tool to
identify non-standard/portable code for sed.
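
As a very rough sketch of what such a check might look like, the hypothetical helper below (the name sedlint_braces is invented here) flags script lines where a "}" follows something other than ";" or the start of a line. It is only a naive heuristic and no substitute for a real parser: it will also flag a "}" that appears inside a regexp, such as the "\}" of an interval expression.

```shell
# Hypothetical helper: read a sed script on stdin and print, with line
# numbers, every line where "}" is preceded (ignoring blanks) by a
# character other than ";".
sedlint_braces() {
  grep -n '[^;[:space:]][[:space:]]*}'
}

# "|| true" keeps the assignment from failing when grep finds nothing.
flagged=$(printf '%s\n' '/pattern/{s/foo/bar/p}' | sedlint_braces || true)
clean=$(printf '%s\n' '/pattern/{s/foo/bar/p;}' | sedlint_braces || true)

printf 'flagged: %s\n' "$flagged"
printf 'clean: %s\n' "$clean"
```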

POSIXLY_CORRECT is to have tools align with POSIX when they
don't by default.

For instance, sed 's/[\t]//' is required to remove every
instance of backslash and t characters per POSIX, which GNU sed
doesn't do by default (and in that would not be compliant). GNU
sed only does that under POSIXLY_CORRECT (otherwise it removes
TAB characters instead).
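
A hedged sketch of one way to sidestep the difference: build a literal TAB in the shell so that the bracket expression never contains the two characters "\t". The variable names are illustrative only.

```shell
# A literal TAB character, produced by printf rather than by sed.
tab=$(printf '\t')

# Remove the first TAB on each line. The bracket expression contains a
# real TAB, so it does not depend on how a given sed reads "\t".
no_tab=$(printf 'a\tb\n' | sed "s/[$tab]//")
printf '%s\n' "$no_tab"

# For comparison only: under the POSIX reading, [\t] means "backslash or t".
# The result depends on the sed in use, so it is not asserted here.
posix_reading=$(printf 'a\tb\\t\n' | POSIXLY_CORRECT=1 sed 's/[\t]//')
```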

Here, a more sensible thing to do would be to ask FreeBSD folks
to have their sed align with GNU sed, as it's a useful feature.

And then, if enough implementations align with GNU sed, ask
POSIX to make that into law.

Here however, there's a problem in that for instance:

{w file}

is currently meant to write to a file called "file}" (and that's
also what GNU sed does), and changing that could break backward
compatibility, so it's possible that POSIX would not bother
specifying a "}" that appears even without a preceding
";"/newline. They did however relax the requirement that "b foo}"
should branch to a "foo}" label (which GNU sed didn't even under
POSIXLY_CORRECT), so you never know.
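
A small sketch of the portable way to use "w" inside a group: end the file name with a newline so that the "}" cannot be taken as part of the name. It assumes mktemp -d is available (it is not in POSIX, but GNU and BSD systems provide it); the file name out.txt is made up.

```shell
workdir=$(mktemp -d)

# The file name runs to the end of the line, so "}" goes on its own line.
printf 'one\ntwo\n' | sed -n "/one/{
w $workdir/out.txt
}"

written=$(cat "$workdir/out.txt")
printf '%s\n' "$written"

rm -rf "$workdir"
```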

--
Stephane

Tim Chase sed@thechases.com [sed-users]

2018-05-06 23:40:26 UTC


Post by Stephane Chazelas ***@gmail.com [sed-users]

Post by Tim Chase ***@thechases.com [sed-users]
This is actually POSIX which requires a newline or semicolon
"The <right-brace> shall be preceded by a <newline> or <semicolon>
(before any optional <blank> characters preceding the
<right-brace>)."
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html#tag_20_116_13_03

GNU sed is not violating POSIX here, a script that relies on the
specific behaviour of GNU sed (of treating it as if there was
a ; before the }) or for that matter on the BSD behaviour (of
failing with an error, though it would be very unusual for a
script to rely on a utility returning with an error) would be
the ones being non-conformant.

The violation is that the spec says that a <newline> or <semicolon>
MUST precede a closing brace. GNU sed doesn't adhere to this part of
the spec or enforce it. That means that people can end up writing
sed scripts that they think are POSIX compliant, but aren't; and when
those non-compliant scripts are run in a POSIX-compliant version of
sed, they break.

Post by Stephane Chazelas ***@gmail.com [sed-users]
POSIX doesn't specify the behaviour for {s/x/y/} so either
failing with an error, or the GNU behaviour (or any other
behaviour) are valid behaviours.

POSIX does define correctness though: it SHALL be preceded by a
semicolon or newline. From RFC2119

"MUST: This word, or the terms "REQUIRED" or "SHALL", mean that
the definition is an absolute requirement of the specification."
(*) https://www.ietf.org/rfc/rfc2119.txt

GNU sed does not treat a semicolon/newline-before-close-brace as
required, despite the spec requiring it.

Post by Stephane Chazelas ***@gmail.com [sed-users]
The GNU behaviour is a more useful one. The only problem with it is
that it doesn't help you realise that your script is not portable.

Agreed.

Post by Stephane Chazelas ***@gmail.com [sed-users]
POSIXLY_CORRECT is to have tools align with POSIX when they
don't by default.

Which is exactly what GNU sed is doing here, being misaligned with
POSIX so the best outcome would be for GNU sed to respect

Post by Stephane Chazelas ***@gmail.com [sed-users]
For instance, sed 's/[\t]//' is required to remove every
instance of backslash and t characters per POSIX, which GNU sed
doesn't do by default (and in that would not be compliant). GNU
sed only does that under POSIXLY_CORRECT (otherwise it removes
TAB characters instead).

Also helpful to know.

-tim

Stephane Chazelas stephane.chazelas@gmail.com [sed-users]

2018-05-07 06:05:48 UTC


2018-05-06 18:40:26 -0500, Tim Chase ***@thechases.com [sed-users]:
[...]

Post by Tim Chase ***@thechases.com [sed-users]
The violation is that the spec says that a <newline> or <semicolon>
MUST precede a closing brace. GNU sed doesn't adhere to this part of
the spec or enforce it. That means that people can end up writing
sed scripts that they think are POSIX compliant, but aren't; and when
those non-compliant scripts are run in a POSIX-compliant version of
sed, they break.

No again, that's a common misconception when interpreting a
standard specification and you wouldn't be the first one making
it, you'll see a lot of those on the Austin Group mailing list;
even implementers have been known to code into their
implementations what they thought was a requirement on *them* as
opposed to the applications using *them* (ending up reproducing
the same limitations/misfeatures of historical implementations).

A very important point about terminology when reading the spec
is *implementation* versus *application*.

POSIX specifies a *Portable* Operating System *Interface*.

That is it specifies a portable API. It tells *applications*
(the things that use the API; in the case of "sed", sed
scripts/invocations) how they should be written to be portable.
And it occasionally tells *implementations* (in this case, the
"sed" implementations like GNU, BSD, Solaris sed) how to do
things so they interpret portable code correctly.

POSIX doesn't specify what happens if an application doesn't
follow its specification like when a script does {s/x/y/}. It
doesn't specify what happens if you use the "i" flag (as in
s/x/y/i) or the "x" or "}" flag. It doesn't specify the "v" or
"k" sed commands, the -E, -i or --help options or if you use
"\t", "\+", "\|" in regexps outside of bracket expressions.

That doesn't mean that *implementations* have to report an error
when an *application* uses any of these.
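
For scripts that need to stay portable, some of the regexp extensions listed above have standard spellings. A minimal sketch, with made-up input:

```shell
# "\+" is an extension in basic regular expressions; the interval
# expression \{1,\} is the POSIX way to say "one or more".
one_or_more=$(printf 'aaab\n' | sed 's/a\{1,\}/X/')

# "\|" has no portable BRE equivalent; here two s commands cover both
# alternatives instead.
either=$(printf 'cat\ndog\n' | sed -e 's/cat/pet/' -e 's/dog/pet/')

printf '%s\n' "$one_or_more"
printf '%s\n' "$either"
```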

It would be silly if a specification like POSIX did that, if it
forbade extensions. That would mean that the interface would
have no chance of evolving.

POSIX is primarily *descriptive*: it describes a portable
interface, one portable across implementations that
already existed. It is only very rarely prescriptive (introducing new
APIs that implementations must start implementing to become
compliant).

The -E above is a good example. Since POSIX allows
*implementations* to support any option beside the ones it
specifies (but of course an *application* cannot use them),
several sed *implementations* have added a -E option to support
extended regexps. And as a result, that is being introduced in
a future edition of the POSIX specification
(http://austingroupbugs.net/view.php?id=528 scheduled for issue
8)

Post by Tim Chase ***@thechases.com [sed-users]

POSIX does define correctness though: it SHALL be preceded by a
semicolon or newline. From RFC2119
"MUST: This word, or the terms "REQUIRED" or "SHALL", mean that
the definition is an absolute requirement of the specification."
(*) https://www.ietf.org/rfc/rfc2119.txt

Note that POSIX has nothing to do with the IETF. It's a bad idea
to look at one specification to explain another specification.

For POSIX, look at
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap01.html#tag_21_01_05
which refers to ISO/IEC directives. Like
https://www.iso.org/foreword-supplementary-information.html

The "shall" in the text that was quoted earlier (POSIX generally
doesn't use "must" let alone "MUST") applies to *applications*
(portable applications that make use of that standard API,
shell/sed scripts).

Post by Tim Chase ***@thechases.com [sed-users]
GNU sed does not treat a semicolon/newline-before-close-brace as
required, despite the spec requiring it.

It does treat it as required. sed '{s/x/y/;}' works as specified
in GNU sed. The behaviour for sed '{s/x/y/}' however is not
specified in the POSIX spec. It is specified in the GNU sed
documentation however and again works as documented.

[...]

Post by Tim Chase ***@thechases.com [sed-users]

Post by Stephane Chazelas ***@gmail.com [sed-users]
POSIXLY_CORRECT is to have tools align with POSIX when they
don't by default.

Which is exactly what GNU sed is doing here, being misaligned with
POSIX so the best outcome would be for GNU sed to respect

No, POSIX *requires* sed '/[\t]/!d' to match on \ and t.
/[\t]/!d is a perfectly valid sed script whose behaviour is
fully specified by POSIX. An implementation like GNU sed without
POSIXLY_CORRECT that would fail to match on a line that contains
"t" would not be compliant.

On the other hand:

sed '{s/x/y/}' (or sed -E 's/x/y/' currently) is not a valid
POSIX sed invocation. The behaviour is unspecified as the
*application* did not obey the specification (POSIX doesn't
specify a "}" flag to the "s" command and requires the "}" to be
preceded by ";" or newline to be recognised as a closing group).

So sed *implementations* can do what they want here.

The closest POSIX will say to what you think it says is that
*if* an *implementation* considers it as a syntax error (or if
it doesn't recognise the -E option), then it shall report an
error on stderr (whose text is not specified) and exit with a
non-zero exit status.
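
Since the outcome for '{s/x/y/}' is left to the implementation, a script can only find out by trying. A hedged sketch that probes the local sed through its exit status (the variable names are illustrative):

```shell
# Probe: does this sed accept "}" directly after the s command?
# A sed that treats it as a syntax error exits with a non-zero status.
if printf 'x\n' | sed '{s/x/y/}' >/dev/null 2>&1; then
  brace_support="accepted"
else
  brace_support="rejected"
fi
printf 'unterminated form: %s\n' "$brace_support"

# The specified form works either way.
portable=$(printf 'x\n' | sed '{s/x/y/;}')
printf '%s\n' "$portable"
```

On a GNU system the probe would be expected to report "accepted", and on the BSD seds mentioned earlier in the thread "rejected".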

--
Stephane

Daniel Goldman dgoldman@ehdp.com [sed-users]

2018-05-06 20:32:00 UTC


That's worth discussing. I interpret the POSIX text the same as you do,
that POSIX would require a newline or semicolon before the right-brace.
So on one hand, you seem perfectly right.

OTOH, my understanding was that ';' is used to *separate* sed commands,
not terminate them. The gnu sed manual says, "A semicolon (';') may be
used to separate most simple commands".

Can anyone else find ';' mentioned in POSIX documentation? I looked, did
not see any mention of using ';' to separate commands.

To me, having ';' capability seems pretty essential. If POSIX does not
mention ';' between commands, it would seem POSIX is maybe obsolete, in
need of updating.

I know the original sed did not use ';' syntax. I believe there are
still some sed versions that do not support using ';' between commands.
Those versions seem not adequate to me.

If I was designing from scratch, I think gnu sed probably has it "right"
here, because sed should be able to see that the right-brace is
terminating the { } group. How does the ';' help figure that out? To me
';}' seems OK (like an odd emoticon!), but just '}' seems better.

I tend to take gnu sed as the "truth". It seems to me the best / most
functional version. But maybe I am wrong. Which takes precedence: gnu
sed behavior (my vote) or POSIX specification?

Perhaps no easy answers here. I have the luxury of just using gnu sed,
so not a problem for me. But yes, script portability matters.

Daniel


Tim Chase sed@thechases.com [sed-users]

2018-05-06 22:35:13 UTC


Post by Daniel Goldman ***@ehdp.com [sed-users]
Can anyone else find ';' mentioned in POSIX documentation? I
looked, did not seem any mention of using ';' to separate commands.

From the same POSIX spec I linked to, right above the previous text I
quoted I found this:

"Editing commands other than {...}, a, b, c, i, r, t, w, :, and # can
be followed by a <semicolon>, optional <blank> characters, and
another editing command."
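
A short sketch of what that sentence allows and excludes, using made-up input. Commands on the exception list (here ":" and "t") are given their own -e fragment, each of which ends with an implicit newline:

```shell
# s is not on the exception list, so ";" may separate two s commands.
chained=$(printf 'ac\n' | sed 's/a/b/; s/c/d/')

# ":" and "t" take a label that runs to the end of the line, so each is
# placed in its own -e fragment instead of being followed by ";".
branched=$(printf 'x\n' |
  sed -e ':top' -e 's/x/y/' -e 't done' -e 's/y/z/' -e ':done')

printf '%s\n' "$chained"
printf '%s\n' "$branched"
```

In the second command the "t done" branch is taken after the successful substitution, so the s/y/z/ command is skipped.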

Post by Daniel Goldman ***@ehdp.com [sed-users]
To me, having ';' capability seems pretty essential. If POSIX does
not mention ';' between commands, it would seem POSIX is maybe
obsolete, in need of updating.

So the ";" separating commands *is* part of POSIX.

Post by Daniel Goldman ***@ehdp.com [sed-users]
If I was designing from scratch, I think gnu sed probably has it
"right" here, because sed should be able to see that the
right-brace is terminating the { } group. How does the ';' help
figure that out? To me ';}' seems OK (like an odd emoticon!), but
just '}' seems better.

I agree with the sentiment here -- it's how I found it because I'm
usually in a GNU environment and I usually omit it, but got stung
testing a command on OpenBSD that had worked on Debian.

Post by Daniel Goldman ***@ehdp.com [sed-users]
I tend to take gnu sed as the "truth". It seems to me the best /
most functional version. But maybe I am wrong. Which takes
precedence: gnu sed behavior (my vote) or POSIX specification?

I think this is why I lean towards the GNU convention of a
POSIXLY_CORRECT environment variable. For those cases where POSIX
compliance matters, you can get it; yet without POSIXLY_CORRECT in
the environment, it shouldn't break existing non-POSIX scripts. And
if you claim you are POSIXLY_CORRECT via your environment and aren't
compliant, it should balk at such violations and error out.

-tim

Daniel Goldman dgoldman@ehdp.com [sed-users]

2018-05-06 23:12:03 UTC


"From the same POSIX spec" - Thank you for pointing out the ; text in
the POSIX specification. I searched for ';' and did not see anything
relevant. Did not think to search for 'semicolon'... And thanks again
for pointing out the gotcha, and about POSIXLY_CORRECT. Daniel

Post by Tim Chase ***@thechases.com [sed-users]

Post by Daniel Goldman ***@ehdp.com [sed-users]
Can anyone else find ';' mentioned in POSIX documentation? I
looked, did not seem any mention of using ';' to separate commands.

From the same POSIX spec I linked to, right above the previous text I
"Editing commands other than {...}, a, b, c, i, r, t, w, :, and # can
be followed by a <semicolon>, optional <blank> characters, and
another editing command."

So the ";" separating commands *is* part of POSIX.

I agree with the sentiment here -- it's how I found it because I'm
usually in a GNU environment and I usually omit it, but got stung
testing a command on OpenBSD that had worked on Debian.