repair comb-quoting
Sven Guckes maillists-yahoo@guckes.net [sed-users]
2016-01-21 04:23:42 UTC
yet another problem.. "comb-quoted" text.

this happens on replies to messages where the lines exceed 72 chars.
the mailer cuts off the words which exceed that limit and then puts
them onto the next line. usually this results in short lines.
(besides, this happens a lot with gmail users using the browser.)
I ****'* ****** ********* *** ****'* ******* *** ******* ** *** ******
*******, *** ** ** ****'* *** *** ******** ** ******* *** **** **** ***
***** ** ** **** ******* *** ***** ****** *** ** ** *** *******, **
*** **** *****'** **** **** *********.
* ***** ** ***** ** ****** ** ***** *** * ************ ****** ******
******** *** ********* ************* (**** * "***** *** ** *****"
********* ** *********; *** * ******* ** ******* ** ** ****).
as you can see: long and short lines alternate. looks like a comb. ;)

so - how to fix that?

basically, a simple algorithm could work like this:

if current line is quoted and is less than M chars
and previous line was quoted and is more than N chars
then remove quotation and join with previous line.
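
A rough sketch of that rule, written in awk inside a shell function. The function name `uncomb`, the default thresholds (M=30, N=50) and the restriction to a single "> " quote level are assumptions made for illustration; they are not part of the original post.

```shell
# Sketch only. Assumptions: one quote level ("> "), M=30, N=50 by default.
# usage: uncomb [M [N]] < message.txt
uncomb() {
    awk -v M="${1:-30}" -v N="${2:-50}" '
        # print the buffered line, if any
        function flush() { if (have) print prev; have = 0 }

        # short quoted line after a long quoted line: drop "> " and join
        have && /^> / && length($0) < M && prev ~ /^> / && length(prev) > N {
            prev = prev " " substr($0, 3)
            next
        }

        # anything else: emit the buffered line and buffer this one
        { flush(); prev = $0; have = 1 }

        END { flush() }
    '
}
```

Because the joined line stays in the buffer, several short lines in a row would all be appended to the same long line; whether that is desirable depends on the mailer that produced the comb.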

do you see an easy way to do this within sed?
or is this a better job for awk already
(easier for branches and comparisons)?

ps: yes, i am aware of the t-prot tool
but this is sed country, right?

sharma__r@hotmail.com [sed-users]
2016-01-21 05:53:43 UTC
Post by Sven Guckes maillists-***@guckes.net [sed-users]
do you see an easy way to do this within sed?
or is this a better job for awk already
(easier for branches and comparisons)?
Oh yeah, "sed" is up to the task here...

sed -e '
# limit = 72 chars/line


$q; N; s/\n//; s/^/\n/; D


' < your_comb_quoted.txt


dgoldman@ehdp.com [sed-users]
2016-01-21 17:34:18 UTC
I often run into the same problem: lines in mail messages wrapped in the wrong places. I don't like them either. It would be great if Sven could clean them up automatically.

It seems to me the suggested solution does not accomplish the purpose, but instead basically destroys the mail message. Here is my understanding:

$ cat mail-input.txt
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text

$ cat desired-output.txt
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text wrapped-text
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text
line-start some-text some-text some-text some-text some-text wrapped-text
line-start some-text some-text some-text some-text some-text

$ cat script-2.txt
/.\{72\}./ b greater
$q; N; s/\n//; s/^/\n/; D
s/.\{72\}/&\n/; P; D

$ sed -f script-2.txt mail-input.txt
line-start some-text some-text some-text some-text some-text wrapped-tex
t> line-start some-text some-text some-text some-text some-text > wrappe
d-text> line-start some-text some-text some-text some-text some-text > l
ine-start some-text some-text some-text some-text some-text > line-start
some-text some-text some-text some-text some-text > wrapped-text line-s
tart some-text some-text some-text some-text some-text wrapped-text

To me, the mail message seems messed up by the script. Maybe I'm missing something.


dgoldman@ehdp.com [sed-users]
2016-01-21 17:54:11 UTC
Does the t-prot tool take care of the comb-quoting problem? If so, maybe it's not a bad idea to take advantage of that capability in this instance. Again, if you put the filter inside a shell wrapper, you can potentially take advantage of multiple tools, use sed for X problem, t-prot for Y problem, ex for Z problem.

When I see this problem with comb-quoting, I seem to remember it happens when the quote is indented several levels (makes sense). Is that the case with you? If so, that would complicate the solution, because the quoted line might start with > or >> or >>> etc.


Sven Guckes maillists-yahoo@guckes.net [sed-users]
2016-01-22 05:24:50 UTC
indeed - a simple fix caters for the first level of
quotation only. this is far from perfect, of course.

i still have to fix it all when i respond to
the message. but this is where vim gets in.
vim has support for built-in reformatting which
easily caters for multiple levels of quotation.

so once i fix this comb-quoting by joining those
dangling words from short lines to the previous line
then a reformatting of the paragraph does it all. :)

sharma__r@hotmail.com [sed-users]
2016-01-22 06:59:35 UTC
A preliminary version of the comb de-quoter is implemented below:

sed -e '
# to un-quote comb quotes in a mailer
# v1.0

# uncited lines
/^>/!{x; /./p; s/.*//; x; b;}

# cited lines below this
# minor line = line length <= 30
# major line = line length >= 50
G; s/\n>//; x; s/.*//; x

and as a one-liner for incorporation in your mailer setup:

sed -e '/^>/!buncited' -e '/^>.\{30\}./bbigcited' -e 'x;/^>.\{50\}/!b' -e 'G;s/\n>//;x;s/.*//;x;b' -e ':bigcited' -e 'x;/./p;d' -e ':uncited' -e 'x;/./p;s/.*//;x'


Sven Guckes maillists-yahoo@guckes.net [sed-users]
2016-01-22 14:00:00 UTC
A preliminary version of the comb de-quoter ..
sed -e '/^>/!buncited' -e '/^>.\{30\}./bbigcited' -e 'x;/^>.\{50\}/!b' -e 'G;s/\n>//;x;s/.*//;x;b' -e ':bigcited' -e 'x;/./p;d' -e ':uncited' -e 'x;/./p;s/.*//;x'
whee! :) big thanks!

works very nicely on my sample text here (see below).

here is a text with *multiple* citations,
wrapped at a textwidth of 72 characters:
can you get this to work with this, too?


Artikel 5 (1) Jeder hat das Recht, seine Meinung in Wort, Schrift
und Bild
frei zu äußern und zu verbreiten und sich aus allgemein zugänglichen
ungehindert zu unterrichten. Die Pressefreiheit und die Freiheit
Berichterstattung durch Rundfunk und Film werden gewährleistet. Eine
findet nicht statt. (2) Diese Rechte finden ihre Schranken in
Vorschriften der allgemeinen Gesetze, den gesetzlichen Bestimmungen
Schutze der Jugend und in dem Recht der persönlichen Ehre. (3)
Kunst und
Wissenschaft, Forschung und Lehre sind frei. Die Freiheit der Lehre
nicht von der Treue zur Verfassung.
Jim Hill gjthill@gmail.com [sed-users]
2016-01-22 15:16:18 UTC
On windows atm, gmail's my option, sorry for formatting,

#!/usr/bin/sed -Ef
/^> / { N;
/.{70,}/ s/((> )*)([^>\n][^\n]*)\n\1([^> ].{,})$/\1\3 \4/
P; D
}

works on your sample, which doesn't seem to have actually been wrapped as
the `tw72` in its name implies because some of the apparently desired
output lines are less than that after joining. The constraints are
arbitrary anyway, play with the {,}s to taste, I think the N;munch;P;D
usage is the payload here.
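
To make the N;munch;P;D pattern concrete, here is a stripped-down sketch in basic (non -E) sed syntax. The function name `join_short` and the limits (at least 50 characters after the marker on the first line, at most 20 on the second) are illustrative assumptions, and it handles only one level of "> " quoting.

```shell
# Sketch only: a two-line sliding window over quoted lines.
# usage: join_short < message.txt
join_short() {
    sed -e '
/^> /{
  # N: pull the next line into the pattern space (unless on the last line)
  $!N
  # munch: long quoted line + short quoted line -> one line
  s/^\(> .\{50,\}\)\n> \([^>].\{0,19\}\)$/\1 \2/
  # P: print up to the first newline; D: delete that part and restart
  P
  D
}'
}
```

When the substitution does not fire, P prints the first line and D restarts the script on the second one without reading new input, so every quoted line gets its turn as the first line of the window.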

Sven Guckes maillists-yahoo@guckes.net [sed-users]
2016-01-22 15:42:32 UTC
will do. thanks again! :)

Sven [rewriting his display_filter]
'Ruud H.G. van Tol' rvtol@isolution.nl [sed-users]
2016-01-22 16:47:05 UTC
If there can be two short lines one after another, then make the main line:

1 while $buf =~ s/^(>(?>(?: *>)*) )(.*)\n^\1(.{1,20})$/$1$2 $3/mg;

-- Ruud
sharma__r@hotmail.com [sed-users]
2016-01-22 21:13:09 UTC
sed -e '
# v2.0
/\n\(> \)\{1,\}$/b
/^>.\{50\}.*\n/s/\n\(> \)\{1,\}//
' your_comb_quoted_file

sharma__r@hotmail.com [sed-users]
2016-01-22 21:56:54 UTC
sed -e '
# v3.0
# skip uncited, last, & small lines
/^>/!b
$q
/.\{30\}./!b
N
/^\(\(> \)\2*\)[^>].*\n\1[^>]/!{P;D;}
/\n\(> \)\1*$/b
/\n>.\{30\}./{P;D;}
/^>.\{50\}.*\n/s/\n\(> \)\1*//
' comb_quoted_file

can be made into a 1-liner by putting each uncommented line above in an -e '....' block, to get as many -e '...' blocks as the number of lines.
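
As a small illustration of that conversion, the two functions below run the same toy script, once as a multi-line script and once with one -e block per command. sed joins successive -e blocks with newlines, so the two forms should behave the same. The toy commands and the function names are made up for this example; they are not the script from this thread.

```shell
# Toy script (an assumption for illustration only).
multi_line() {
    sed -e '
# strip one leading quote marker
s/^> //
# squeeze runs of spaces
s/  */ /g
'
}

# Same commands, one per -e block; comment lines are simply left out.
one_line() {
    sed -e 's/^> //' -e 's/  */ /g'
}
```

Feeding the same input to `multi_line` and `one_line` is a quick way to check that nothing was lost in the conversion.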

dgoldman@ehdp.com [sed-users]
2016-02-01 03:47:45 UTC
I'm sorry, but I need to point out that it is incorrect to say that the given script can be "made into a 1-liner by putting each uncommented line above in an -e '....' block, to get as many -e '...' blocks as the number of lines."

I have never seen a definition of "one-liner". So it's understandable some might be confused. However, this seems pretty basic to me. I submit that a long script cannot be a "one-liner", unless converted to a short script. I believe there is consensus that "one-liner" means "short script written on one line". Here is my reasoning:

1) Usage - Each http://sed.sourceforge.net/sed1line.txt example is a "short script written on one line". The usage on this famous page makes the definition clear.

2) Common sense - If any script can be a one-liner by putting each line into a -e block, then one-liner has no meaning, because any script would qualify. However, common sense says "one-liner" does have a meaning ("short script written on one line").

There is nothing wrong with long sed scripts, as long as they work, are easy to understand, and easy to maintain. The given script is not easy to understand and maintain. However, converting it to a "one-liner" makes it even worse. It ends up about 180 columns long, all strung out on one line. This qualifies as a "coding horror", is a bad idea no matter what we call it:

sed -e '/^>/!b' -e '$q' -e '/.\{30\}./!b' -e 'N' -e '/^\(\(> \)\2*\)[^>].*\n\1[^>]/!{P;D;}' -e '/\n\(> \)\1*$/b' -e '/\n>.\{30\}./{P;D;}' -e '/^>.\{50\}.*\n/s/\n\(> \)\1*//' comb_quoted_file :(

Words are important. To have meaningful discussion, we have to agree on basic concepts and vocabulary. If "one-liner" does not mean "short script written on one line", I'd like to hear. Or if someone agrees with my suggested definition, that would also be helpful.



dgoldman@ehdp.com [sed-users]
2016-02-08 04:24:42 UTC
Jim Hill gjthill@gmail.com [sed-users]
2016-02-08 05:34:54 UTC
One-liner's a characterization for humans, not a syntactic entity for

dgoldman@ehdp.com [sed-users]
2016-02-08 18:22:42 UTC
sharma__r@hotmail.com [sed-users]
2016-01-22 22:18:16 UTC
Sven Guckes maillists-yahoo@guckes.net [sed-users]
2016-02-09 06:07:54 UTC
so on Feb 8th this list has been in existence for 5555 days. yay!
however, time to move on. more news coming up. stay tuned!

dgoldman@ehdp.com [sed-users]
2016-02-09 08:12:09 UTC
Thanks for the work you do organizing and maintaining the list. Daniel

Jim Hill gjthill@gmail.com [sed-users]
2016-01-22 23:06:26 UTC
Substituting `[> ]` for `(> )` looks like an improvement to me, but unless
I'm missing something you want `\1\2 \3`.

Out of curiosity, what sed doesn't do `-E`?

dgoldman@ehdp.com [sed-users]
2016-01-23 01:33:33 UTC
Jim Hill gjthill@gmail.com [sed-users]
2016-02-08 21:42:23 UTC
I was agreeing with you, and suggesting that no formal definition is
possible. Characterizations are a matter of perception. I think "anything
I might repeatedly write at the command prompt rather than bother inventing
a name for" is closer, but why bother trying to get closer?

dgoldman@ehdp.com [sed-users]
2016-02-09 05:08:12 UTC
Post by Jim Hill ***@gmail.com [sed-users]
I was agreeing with you, and suggesting that no formal definition is
possible. Characterizations are a matter of perception. I think "anything
I might repeatedly write at the command prompt rather than bother inventing
a name for" is closer, but why bother trying to get closer?
Thanks for agreeing. I was certainly not expecting any disagreement, because my point is so obvious. However, in the face of a dead silence, some future viewer might be unsure, so I appreciate your post. Many words do not have exact definitions, but we still know when they are being misused, and I thought it would help to point out the incorrect usage. A long, complex sed script typed on the command line is not a "one-liner", it's just a confusing jumble / coding horror. :( I agree that, as you say, a "one-liner" can readily be typed at a command prompt, I think we are basically saying the same thing. - Daniel

