repair comb-quoting


Sven Guckes maillists-yahoo@guckes.net [sed-users]

2016-01-21 04:23:42 UTC

yet another problem.. "comb-quoted" text.

this happens on replies to messages where the lines exceed 72 chars.
the mailer cuts off the words which exceed that limit and then puts
them onto the next line. usually this results in short lines.
(besides, this happens a lot with gmail users using the browser.)

I ****'* ****** ********* *** ****'* ******* *** ******* ** *** ******
*******, *** ** ** ****'* *** *** ******** ** ******* *** **** **** ***
**
***** ** ** **** ******* *** ***** ****** *** ** ** *** *******, **
*****
*** **** *****'** **** **** *********.
* ***** ** ***** ** ****** ** ***** *** * ************ ****** ******
******** *** ********* ************* (**** * "***** *** ** *****"
********,
********* ** *********; *** * ******* ** ******* ** ** ****).

as you can see: long and short lines alternate. looks like a comb. ;)

so - how to fix that?

basically, a simple algorithm could work like this:

if current line is quoted and is less than M chars
and previous line was quoted and is more than N chars
then remove quotation and join with previous line.
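
A minimal awk sketch of the rule above, for comparison with the sed answers. It assumes single-level quoting with a leading ">". The function name dequote_comb and the default thresholds (M=30, N=50) are illustrative assumptions, not something the mailer or any tool defines.

```sh
# Join a short quoted line onto the preceding long quoted line.
# $1 = M (a line shorter than this counts as "short"), default 30
# $2 = N (a line longer than this counts as "long"), default 50
dequote_comb() {
    awk -v M="${1:-30}" -v N="${2:-50}" '
        function emit() { if (have) print prev }
        {
            quoted = ($0 ~ /^>/)
            if (have && prev_quoted && quoted && length(prev) > N && length($0) < M) {
                line = $0
                sub(/^> ?/, "", line)   # remove the quote marker of the short line
                prev = prev " " line    # join with the previous line
                next
            }
            emit()                      # previous line is final, print it
            prev = $0; prev_quoted = quoted; have = 1
        }
        END { emit() }
    '
}
```

Usage would be something like "dequote_comb 30 50 < message.txt". Since the joined line stays long, several short lines in a row are all appended to it.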

do you see an easy way to do this within sed?
or is this a better job for awk already
(easier for branches and comparisons)?

ps: yes, i am aware of the t-prot tool
http://www.escape.de/~tolot/mutt/
but this is sed country, right?

Sven

sharma__r@hotmail.com [sed-users]

2016-01-21 05:53:43 UTC

Post by Sven Guckes maillists-***@guckes.net [sed-users]
do you see an easy way to do this within sed?
or is this a better job for awk already
(easier for branches and comparisons)?

Oh yeah, "sed" is up to the task here...

sed -e '
# limit = 72 chars/line

/.\{72\}./bgreater

$q; N; s/\n//; s/^/\n/; D

:greater

s/.\{72\}/&\n/;P;D
' < your_comb_quoted.txt

HTH
-Rakesh


[Non-text portions of this message have been removed]

dgoldman@ehdp.com [sed-users]

2016-01-21 17:34:18 UTC

I often run into the same problem: lines in mail messages wrapped in the wrong places. I don't like them either. It's great that Sven may be able to clean them up automatically.

It seems to me the suggested solution does not accomplish the purpose, and instead basically destroys the mail message. Here is my understanding:

$ cat mail-input.txt
line-start some-text some-text some-text some-text some-text
wrapped-text

> line-start some-text some-text some-text some-text some-text
> wrapped-text
> line-start some-text some-text some-text some-text some-text
> line-start some-text some-text some-text some-text some-text
> line-start some-text some-text some-text some-text some-text
> wrapped-text

line-start some-text some-text some-text some-text some-text
wrapped-text

$ cat desired-output.txt
line-start some-text some-text some-text some-text some-text
wrapped-text

> line-start some-text some-text some-text some-text some-text wrapped-text
> line-start some-text some-text some-text some-text some-text
> line-start some-text some-text some-text some-text some-text
> line-start some-text some-text some-text some-text some-text wrapped-text

line-start some-text some-text some-text some-text some-text
wrapped-text

$ cat script-2.txt
/.\{72\}./ b greater
$q; N; s/\n//; s/^/\n/; D
:greater
s/.\{72\}/&\n/; P; D

$ sed -f script-2.txt mail-input.txt
line-start some-text some-text some-text some-text some-text wrapped-tex
t> line-start some-text some-text some-text some-text some-text > wrappe
d-text> line-start some-text some-text some-text some-text some-text > l
ine-start some-text some-text some-text some-text some-text > line-start
some-text some-text some-text some-text some-text > wrapped-text line-s
tart some-text some-text some-text some-text some-text wrapped-text

To me, the mail message seems messed up by the script. Maybe I'm missing something.

Daniel


dgoldman@ehdp.com [sed-users]

2016-01-21 17:54:11 UTC

Does the t-prot tool take care of the comb-quoting problem? If so, maybe it's not a bad idea to take advantage of that capability in this instance. Again, if you put the filter inside a shell wrapper, you can potentially take advantage of multiple tools, use sed for X problem, t-prot for Y problem, ex for Z problem.

When I see this problem with comb-quoting, I seem to remember it happens when the quote is indented several levels (makes sense). Is that the case with you? If so, that would complicate the solution, because the quoted line might start with > or >> or >>> etc.

Daniel


Sven Guckes maillists-yahoo@guckes.net [sed-users]

2016-01-22 05:24:50 UTC

Post by ***@ehdp.com [sed-users]
Does the t-prot tool take care of the comb-quoting problem?
If so, maybe it's not a bad idea to take advantage of that
capability in this instance. Again, if you put the filter inside a
shell wrapper, you can potentially take advantage of multiple tools,
use sed for X problem, t-prot for Y problem, ex for Z problem.

yes - a script to employ more than one tool is a neat idea.
but for now i am still trying to put it all into sed if possible.
next step is maybe to put the really necessary things into awk.
so it will still be "one program - with one setup file."

t-prot is written in perl5 and simply hides text, like this:

[---=| TOFU protection by t-prot: 23 lines snipped |=---]

DESCRIPTION
This program is a filter to improve the readability of
internet messages (emails and usenet posts) by *hiding*
some annoying parts, e.g. mailing list footers,
signatures, and TOFU (see definition below), as well as
squeezing sequences of blank lines or punctuation.

source: script+manual -> http://www.escape.de/~tolot/mutt/
written by Jochen Striepe

and he references the display_filter script by Phil Gold:
http://aperiodic.net/phil/configs/bin/mutt-display-filter
this one uses t-prot (though commented out) and sed. :-)

Post by ***@ehdp.com [sed-users]
When I see this problem with comb-quoting, I seem to remember it
happens when the quote is indented several levels (makes sense).
Is that the case with you? If so, that would complicate the solution,
because the quoted line might start with > or >> or >>> etc.

indeed - a simple fix caters for the first level of
quotation only. this is far from perfect, of course.

i still have to fix it all when i respond to
the message. but this is where vim comes in.
vim has support for built-in reformatting which
easily caters for multiple levels of quotation.

so once i fix this comb-quoting by joining those
dangling words from short lines to the previous line
then a reformatting of the paragraph does it all. :)

Sven

sharma__r@hotmail.com [sed-users]

2016-01-22 06:59:35 UTC

Post by Sven Guckes maillists-***@guckes.net [sed-users]
so once i fix this comb-quoting by joining those
dangling words from short lines to the previous line
then a reformatting of the paragraph does it all. :)
Sven

A preliminary version of the comb de-quoter is as implemented below:

sed -e '
# to un-quote comb quotes in a mailer
# v1.0

# uncited lines
/^>/!{x; /./p; s/.*//; x; b;}

# cited lines below this
/^>.\{1,30\}$/{
# minor line = line length <= 30
x
/^>.\{50\}/{
# major line = line length >= 50
G; s/\n>//; x; s/.*//; x
}
b
}
x;/./p;d
'

and as a one-liner for incorporation in your mailer setup:

sed -e '/^>/!buncited' -e '/^>.\{30\}./bbigcited' -e 'x;/^>.\{50\}/!b' -e 'G;s/\n>//;x;s/.*//;x;b' -e ':bigcited' -e 'x;/./p;d' -e ':uncited' -e 'x;/./p;s/.*//;x'

HTH.
-Rakesh


Sven Guckes maillists-yahoo@guckes.net [sed-users]

2016-01-22 14:00:00 UTC

A preliminary version of the comb de-quoter ..
sed -e '/^>/!buncited' -e '/^>.\{30\}./bbigcited' -e 'x;/^>.\{50\}/!b' -e 'G;s/\n>//;x;s/.*//;x;b' -e ':bigcited' -e 'x;/./p;d' -e ':uncited' -e 'x;/./p;s/.*//;x'

whee! :) big thanks!

works very nicely on my sample text here (see below).

challenge:
here is a text with *multiple* citations,
wrapped at a textwidth of 72 characters:
http://www.guckes.net/examples/gg.tw72.comb.txt
can you get this to work with this, too?

Sven

----------------------------------------------------------------------

Artikel 5 (1) Jeder hat das Recht, seine Meinung in Wort, Schrift
und Bild
frei zu äußern und zu verbreiten und sich aus allgemein zugänglichen
Quellen
ungehindert zu unterrichten. Die Pressefreiheit und die Freiheit
der
Berichterstattung durch Rundfunk und Film werden gewährleistet. Eine
Zensur
findet nicht statt. (2) Diese Rechte finden ihre Schranken in
den
Vorschriften der allgemeinen Gesetze, den gesetzlichen Bestimmungen
zum
Schutze der Jugend und in dem Recht der persönlichen Ehre. (3)
Kunst und
Wissenschaft, Forschung und Lehre sind frei. Die Freiheit der Lehre
entbindet
nicht von der Treue zur Verfassung.

Jim Hill gjthill@gmail.com [sed-users]

2016-01-22 15:16:18 UTC

On windows atm, gmail's my option, sorry for formatting,

#!/usr/bin/sed -Ef
/^> / { N;
/.{70,}/ s/((> )*)([^>\n][^\n]*)\n\1([^> ].{,})$/\1\3 \4/
P;D;
}

works on your sample, which doesn't seem to have actually been wrapped as
the `tw72` in its name implies because some of the apparently desired
output lines are less than that after joining. The constraints are
arbitrary anyway, play with the {,}s to taste, I think the N;munch;P;D
usage is the payload here.


Sven Guckes maillists-yahoo@guckes.net [sed-users]

2016-01-22 15:42:32 UTC

Post by Jim Hill ***@gmail.com [sed-users]
On windows atm, gmail's my option, sorry for formatting,
#!/usr/bin/sed -Ef
/^> / { N;
/.{70,}/ s/((> )*)([^>\n][^\n]*)\n\1([^> ].{,})$/\1\3 \4/
P;D;
}
works on your sample,

indeed, it does! :) thank you very much!

http://www.guckes.net/examples/gg.tw72.comb.txt

my sed does not have the option "-E",
so i had to change it a bit:

"(> )*" --> "[> ]*"
"{...}" --> "\{...\}"

and the "\4" has to be "\2" ;)

adding a semicolon allows putting it all on one line:

# 2016-01-22 by Jim Hill <***@gmail.com>
/^> / { N; /.\{70,\}/s/\([> ]*\)\([^>\n][^\n]*\)\n\1\([^> ].\{,\}\)$/\1\3 \2/ ; P;D; }

now your solution and Rakesh's one are in my display_filter!

Post by Jim Hill ***@gmail.com [sed-users]
which doesn't seem to have actually been wrapped as the
`tw72` in its name implies because some of the apparently
desired output lines are less than that after joining.
The constraints are arbitrary anyway, play with the {,}s to
taste, I think the N;munch;P;D usage is the payload here.

will do. thanks again! :)

Sven [rewriting his display_filter]

'Ruud H.G. van Tol' rvtol@isolution.nl [sed-users]

2016-01-22 16:47:05 UTC

Post by Sven Guckes maillists-***@guckes.net [sed-users]
http://www.guckes.net/examples/gg.tw72.comb.txt

A Perl alternative: concatenate a short line to its previous line,
but only if it has the same prefix.

perl -e'
my $buf= join "", <>;
$buf =~ s/^(>(?>(?: *>)*) )(.*)\n^\1(.{1,20})$/$1$2 $3/mg;
print $buf;
' ~/mail.txt

-- Ruud

'Ruud H.G. van Tol' rvtol@isolution.nl [sed-users]

2016-01-22 16:54:40 UTC

Post by 'Ruud H.G. van Tol' ***@isolution.nl [sed-users]

Post by Sven Guckes maillists-***@guckes.net [sed-users]
http://www.guckes.net/examples/gg.tw72.comb.txt

A Perl alternative: concatenate a short line to its previous line,
but only if it has the same prefix.
perl -e'
my $buf= join "", <>;
$buf =~ s/^(>(?>(?: *>)*) )(.*)\n^\1(.{1,20})$/$1$2 $3/mg;
print $buf;
' ~/mail.txt

If there can be two short lines one after another, then make the main line:

1 while $buf =~ s/^(>(?>(?: *>)*) )(.*)\n^\1(.{1,20})$/$1$2 $3/mg;
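
For readers without Perl at hand, the same idea can be sketched in awk: append a short line to the previous line only when both carry the same quote prefix, and keep appending while short lines follow. This is a rough equivalent under stated assumptions, not a translation of the regex above. The function name join_same_prefix and the variable MAXSHORT are made up for illustration, and 20 mirrors the {1,20} limit.

```sh
# Append short quoted lines to the previous line when the quote prefix matches.
# $1 = longest body (text after the prefix) that still counts as short, default 20
join_same_prefix() {
    awk -v MAXSHORT="${1:-20}" '
        # Split s into pfx (such as "> " or "> > ") and body; sets globals.
        function split_line(s) {
            if (match(s, /^>( *>)* /)) {
                pfx = substr(s, 1, RLENGTH); body = substr(s, RLENGTH + 1)
            } else {
                pfx = ""; body = s
            }
        }
        {
            split_line($0)
            if (have && pfx != "" && pfx == prev_pfx &&
                length(body) >= 1 && length(body) <= MAXSHORT) {
                prev = prev " " body    # same prefix and short: join
                next
            }
            if (have) print prev
            prev = $0; prev_pfx = pfx; have = 1
        }
        END { if (have) print prev }
    '
}
```

As with the Perl version, the length of the previous line is not checked, so a short line that follows another short line at the same quote level is joined too.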

-- Ruud

sharma__r@hotmail.com [sed-users]

2016-01-22 21:13:09 UTC

sed -e '
# v2.0
/^>/!b
$q
/.\{30\}./!b
N
/\n\(> \)\{1,\}$/b
/\n>.\{30\}./{P;D;}
/^>.\{50\}.*\n/s/\n\(> \)\{1,\}//
' your_comb_quoted_file


sharma__r@hotmail.com [sed-users]

2016-01-22 21:56:54 UTC

sed -e '
# v3.0
# skip uncited, last, & small lines
/^>/!b
$q
/.\{30\}./!b
N
/^\(\(> \)\2*\)[^>].*\n\1[^>]/!{P;D;}
/\n\(> \)\1*$/b
/\n>.\{30\}./{P;D;}
/^>.\{50\}.*\n/s/\n\(> \)\1*//
' comb_quoted_file

can be made into a 1-liner by putting each uncommented line above in an -e '....' block, to get as many -e '...' blocks as the number of lines.
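
A small illustration of the mechanics, using a deliberately tiny script rather than v3.0: sed joins its -e fragments with newlines, so both spellings below should behave identically. The function names and the "QUOTED: " replacement text are only for the demonstration.

```sh
# The same two-command script, once as a multi-line script ...
multi_line() {
    sed -e '
/^>/!b
s/^> */QUOTED: /
'
}

# ... and once with each script line in its own -e block.
one_line() {
    sed -e '/^>/!b' -e 's/^> */QUOTED: /'
}
```

Feeding both the lines "> hi" and "plain" gives "QUOTED: hi" and "plain" in either form.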


dgoldman@ehdp.com [sed-users]

2016-02-01 03:47:45 UTC

I'm sorry, but I need to point out that it is incorrect to say that the given script can be "made into a 1-liner by putting each uncommented line above in an -e '....' block, to get as many -e '...' blocks as the number of lines."

I have never seen a definition of "one-liner". So it's understandable some might be confused. However, this seems pretty basic to me. I submit that a long script cannot be a "one-liner", unless converted to a short script. I believe there is consensus that "one-liner" means "short script written on one line". Here is my reasoning:

1) Usage - Each http://sed.sourceforge.net/sed1line.txt example is a "short script written on one line". The usage on this famous page makes the definition clear.

2) Common sense - If any script can be a one-liner by putting each line into a -e block, then "one-liner" has no meaning, because any script would qualify. However, common sense says "one-liner" does have a meaning ("short script written on one line").

There is nothing wrong with long sed scripts, as long as they work, are easy to understand, and are easy to maintain. The given script is not easy to understand or maintain, and converting it to a "one-liner" makes it even worse. It ends up about 180 columns long, all strung out on one line. This qualifies as a "coding horror", and is a bad idea no matter what we call it:

sed -e '/^>/!b' -e '$q' -e '/.\{30\}./!b' -e 'N' -e '/^\(\(> \)\2*\)[^>].*\n\1[^>]/!{P;D;}' -e '/\n\(> \)\1*$/b' -e '/\n>.\{30\}./{P;D;}' -e '/^>.\{50\}.*\n/s/\n\(> \)\1*//' comb_quoted_file :(

Words are important. To have meaningful discussion, we have to agree on basic concepts and vocabulary. If "one-liner" does not mean "short script written on one line", I'd like to hear. Or if someone agrees with my suggested definition, that would also be helpful.

Thanks.
Daniel


dgoldman@ehdp.com [sed-users]

2016-02-08 04:24:42 UTC

Post by ***@ehdp.com [sed-users]
Words are important. To have meaningful discussion,
we have to agree on basic concepts and vocabulary.
If "one-liner" does not mean "short script written on one line",
I'd like to hear. Or if someone agrees with my suggested
definition, that would also be helpful.

Anybody?


Jim Hill gjthill@gmail.com [sed-users]

2016-02-08 05:34:54 UTC

One-liner's a characterization for humans, not a syntactic entity for
machines.


dgoldman@ehdp.com [sed-users]

2016-02-08 18:22:42 UTC

Post by Jim Hill ***@gmail.com [sed-users]
One-liner's a characterization for humans,
not a syntactic entity for machines.

Yes, "one-liner" is a characterization for humans. But I never suggested that "one-liner" is a "syntactic entity for machines". So I don't understand the comment.

What I did suggest was that it seems wrong to say that a long script "can be made into a 1-liner by putting each uncommented line in an -e '....' block, to get as many -e '...' blocks as the number of lines". I won't repeat the reasoning here, please refer back to and respond to my previous post.

I also suggested that, in the absence of a formal definition anywhere of the commonly used term "one-liner", "short script written on one line" might serve. I was asking for comments (agree?, disagree?) on that definition of "one-liner".

Thanks,
Daniel


sharma__r@hotmail.com [sed-users]

2016-01-22 22:18:16 UTC

Post by Sven Guckes maillists-***@guckes.net [sed-users]
my sed does not have the option "-E",

"(> )*" --> "[> ]*"

Note that the above transformation is not equivalent. The two patterns mean entirely different things.

In your sed, you could express the regex as

(> )* --> \(> \)\1* assuming at least one "> " pair,

while the regex

[> ]* --> can match any sequence of ">" characters or spaces, not necessarily repeats of "> ", though that would be one special case.
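
A quick way to see the difference is to bracket whatever each pattern takes as the quote prefix. This is only a sketch: it assumes a sed that accepts -E, and the two function names are invented for the demonstration.

```sh
# Put [ ] around the part of the line each pattern matches at the start.
group_prefix() { sed -E 's/^(> )*/[&]/'; }   # zero or more repeats of the pair "> "
class_prefix() { sed -E 's/^[> ]*/[&]/'; }   # any run of ">" characters or spaces
```

On the line "> > text" both give "[> > ]text". On ">>  text" (two markers, then two spaces) the group version matches nothing and gives "[]>>  text", while the class version takes the whole run and gives "[>>  ]text".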


Sven Guckes maillists-yahoo@guckes.net [sed-users]

2016-02-09 06:07:54 UTC

so on Feb 8th this list has been in existence for 5555 days. yay!
however, time to move on. more news coming up. stay tuned!

Sven

dgoldman@ehdp.com [sed-users]

2016-02-09 08:12:09 UTC

Thanks for the work you do organizing and maintaining the list. Daniel


Jim Hill gjthill@gmail.com [sed-users]

2016-01-22 23:06:26 UTC

Substituting `[> ]` for `(> )` looks like an improvement to me, but unless
I'm missing something you want `\1\2 \3`.

Out of curiosity, what sed doesn't do `-E`?


dgoldman@ehdp.com [sed-users]

2016-01-23 01:33:33 UTC

Post by Jim Hill ***@gmail.com [sed-users]
Out of curiosity, what sed doesn't do `-E`?

None of the GNU sed documentation mentions the -E (big E) option. So I think it's understandable that someone might be confused by the -E syntax. However, GNU sed allows the (undocumented) -E as a synonym for -r. In a kind of mirror image, some BSD sed versions allow -r as a synonym for -E.

Personally, given the small number of sed options, and with -e (little e) already used so much, -r seems the preferable syntax, but it does not really matter.

To clear things up in case anyone is not sure: when a sed version supports extended regular expressions, -E / -r tells sed to turn them on. An example is the cleaner s/(..)(..)/\2\1/ instead of s/\(..\)\(..\)/\2\1/.
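
The two spellings can be put side by side as below. This is a sketch that assumes the local sed accepts -E; the function names are just labels for the comparison.

```sh
# Swap the first two character pairs of a line, in both regex dialects.
swap_bre() { sed    's/\(..\)\(..\)/\2\1/'; }   # basic regex: escaped parentheses
swap_ere() { sed -E 's/(..)(..)/\2\1/'; }       # extended regex: plain parentheses
```

Both turn "abcd" into "cdab", for example with "echo abcd | swap_ere".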

Daniel


Jim Hill gjthill@gmail.com [sed-users]

2016-02-08 21:42:23 UTC

I was agreeing with you, and suggesting that no formal definition is
possible. Characterizations are a matter of perception. I think "anything
I might repeatedly write at the command prompt rather than bother inventing
a name for" is closer, but why bother trying to get closer?


dgoldman@ehdp.com [sed-users]

2016-02-09 05:08:12 UTC

Post by Jim Hill ***@gmail.com [sed-users]
I was agreeing with you, and suggesting that no formal definition is
possible. Characterizations are a matter of perception. I think "anything
I might repeatedly write at the command prompt rather than bother inventing
a name for" is closer, but why bother trying to get closer?

Thanks for agreeing. I was certainly not expecting any disagreement, because my point is so obvious. However, in the face of a dead silence, some future viewer might be unsure, so I appreciate your post. Many words do not have exact definitions, but we still know when they are being misused, and I thought it would help to point out the incorrect usage. A long, complex sed script typed on the command line is not a "one-liner", it's just a confusing jumble / coding horror. :( I agree that, as you say, a "one-liner" can readily be typed at a command prompt, I think we are basically saying the same thing. - Daniel

