Discussion:
sed on html pages
Silvio Siefke
2014-03-01 17:36:37 UTC
Hello,

I use a static website generator and clean the generated pages with tidy. But
tidy makes some mistakes, and I think sed is an easy way to fix them.

I delete the empty lines with sed -i '/^$/d' index.html and the div tags
with sed -i 's|<[/]\?div[^>]*>||g' index.html. But I have not found a way to
break a line and insert 4 spaces of whitespace.

Before:

<title>Silvio Siefke | Blog</title><!--[if lt IE 9]>
<script src="/static/js/html5.js" type="text/javascript"></script>
<![endif]-->
<link href="/static/css/style.css" rel="stylesheet" type="text/css">

After:

<title>Silvio Siefke | Blog</title>

<!--[if lt IE 9]>
<script src="/static/js/html5.js" type="text/javascript"></script>
<![endif]-->

<link href="/static/css/style.css" rel="stylesheet" type="text/css">

How can I achieve this?

Can I combine all the steps (delete empty lines, delete the div tags, delete
id="ext" and id="vid", and insert this line break) into one?


Thank you for help & Nice Day
Silvio
Joel Hammer
2014-03-01 21:52:11 UTC
Not sure I understand all you want to do, but to insert a line break,
this works with GNU sed (quote the expression so the shell leaves the
backslash and the space alone):

sed 's/target/\n &/'
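
For the sample line from the original post, the suggestion can be sketched as follows. This is only a minimal sketch and assumes GNU sed, which treats \n in the replacement as a newline; the variable names are made up for the illustration.

```shell
# Sample input taken from the original post.
input='<title>Silvio Siefke | Blog</title><!--[if lt IE 9]>'

# GNU sed: \n in the replacement inserts a newline, & is the matched text.
# Two \n give a line break plus one empty line before the comment.
result=$(printf '%s\n' "$input" | sed 's/<!--/\n\n&/')
printf '%s\n' "$result"
```

To indent the new line by four spaces, put the spaces between the last \n and the &. If a separate sed '/^$/d' pass runs afterwards, it removes the inserted empty line again, so the order of the passes matters.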
Silvio Siefke
2014-03-01 22:30:05 UTC
Post by Joel Hammer
Not sure I understand all you want to do, but, to insert a line break,
sed 's/target/\n &/'

OK, at first that did not work the way I wanted: it broke the line at
</title>. But I have now tried it with <!-- and it works. Funny, I swear I
searched all day for "sed line break" and nothing I found worked. Maybe I am
too stupid for Google :)

One more question and then I am happy. When I delete an HTML attribute, how
can I also remove the whitespace it leaves behind?

Example (tidy appends id="ext"):
<a href="whatever" title="whatever" name="ext" id="ext">whatever</a>

I use sed:
find -type f -name '*html' -exec sed -i 's!id="ext"!!g' {} \;

This comes out:
<a href="whatever" title="whatever" name="ext" >whatever</a>

But can sed clean it up so that no whitespace is left in it?

This would be better:
<a href="whatever" title="whatever" name="ext">whatever</a>
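
One possible way to get that result is to let the pattern also match the spaces in front of the attribute, so they are deleted together with it. This is a sketch under that assumption, not something proposed in the thread; note that the pattern would also hit any other attribute whose name merely ends in id.

```shell
# Sample input taken from the post above.
input='<a href="whatever" title="whatever" name="ext" id="ext">whatever</a>'

# " *" matches any run of spaces directly before id="ext",
# so the leading whitespace is removed together with the attribute.
result=$(printf '%s\n' "$input" | sed 's/ *id="ext"//g')
printf '%s\n' "$result"
```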


And a last question; I am happy with my first use of sed :)

Can I combine the following commands into one sed command:

siefke blog $ cat ~/.bin/scripts/webclean
#!/bin/bash
find -type f -name '*html' -exec tidy -m -config ~/.config/tidy/com {} \;
find -type f -name '*html' -exec sed -i '/^$/d' {} \;
find -type f -name '*html' -exec sed -i 's|<[/]\?div[^>]*>||g' {} \;
find -type f -name '*html' -exec sed -i 's!id="ext"!!g' {} \;
find -type f -name '*html' -exec sed -i 's!id="vid"!!g' {} \;
find -type f -name '*html' -exec sed -i 's!&nbsp;!!g' {} \;
exit


Thank you for help & Nice Day
Silvio
Sven Guckes
2014-03-02 14:42:55 UTC
Post by Silvio Siefke
siefke blog $ cat ~/.bin/scripts/webclean
#!/bin/bash
find -type f -name '*html' -exec tidy -m -config ~/.config/tidy/com {} \;
find -type f -name '*html' -exec sed -i '/^$/d' {} \;
find -type f -name '*html' -exec sed -i 's|<[/]\?div[^>]*>||g' {} \;
find -type f -name '*html' -exec sed -i 's!id="ext"!!g' {} \;
find -type f -name '*html' -exec sed -i 's!id="vid"!!g' {} \;
find -type f -name '*html' -exec sed -i 's!&nbsp;!!g' {} \;
exit
yes, you can put all commands into a file
and tell sed to run them from there:

sed -f file

this would probably change your script to this:

$ cat sedcommands
s|<[/]\?div[^>]*>||g
s!id="ext"!!g
s!id="vid"!!g
s!&nbsp;!!g

$ cat silvioscript
#!/bin/bash
find -type f -name '*html' -exec tidy -m -config ~/.config/tidy/com {} \;
find -type f -name '*html' -exec sed -i -f sedcommands {} \;

but here you use the same find command twice,
finding all the html files yet again for the next command.
you need to find all html files only once,
then execute the tidy+sed commands on these in one go.
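
A rough sketch of the single-pass idea, assuming GNU find and GNU sed. It works in a throwaway directory with a made-up sample file, and the tidy step is left out so that the sketch runs without tidy installed. The empty-line deletion is added at the end of the command file so that lines emptied by the div removal are dropped as well.

```shell
# Work in a throwaway directory so nothing real is modified.
workdir=$(mktemp -d)
mkdir -p "$workdir/site/posts"
printf '%s\n' '<div class="x">' '<a name="ext" id="ext">a&nbsp;b</a>' \
  '</div>' '' '<p>text</p>' > "$workdir/site/posts/index.html"

# The sed commands from the thread in one file, empty-line deletion last.
cat > "$workdir/sedcommands" <<'EOF'
s|<[/]\?div[^>]*>||g
s!id="ext"!!g
s!id="vid"!!g
s!&nbsp;!!g
/^$/d
EOF

# One find pass; sed applies every command in the file to each html file.
find "$workdir/site" -type f -name '*html' \
  -exec sed -i -f "$workdir/sedcommands" {} \;

result=$(cat "$workdir/site/posts/index.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```

In a real script the tidy call could go into the same find as an earlier -exec. find runs a later -exec only when the earlier one exited with status 0, and tidy may exit non-zero when it reports warnings, so that combination needs checking before relying on it.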

if your shell is zsh then finding the html files
is as easy as using this pattern: **/*html
so you can put the finding of files
outside the script.

and when you put tidy+sed into one script
then it should all boil down to this:

script **/*html

okay, the filename globbing *might* exceed the maximum length
of a command line, depending on how many html files you actually got.
in this case you are advised to use find+xargs.

find ... | xargs script
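
The find+xargs variant might look like the sketch below. The directory layout and file names are invented for the example, and sed -i stands in for the script; -print0 and -0 are used so that file names containing spaces survive the pipe.

```shell
# Throwaway directory with two sample files, one in a path with a space.
workdir=$(mktemp -d)
mkdir -p "$workdir/site/a b"
printf '%s\n' '<p>one&nbsp;two</p>' > "$workdir/site/index.html"
printf '%s\n' '<p>three&nbsp;four</p>' > "$workdir/site/a b/page.html"

# find emits NUL-separated names; xargs splits the list into as many
# sed invocations as needed to stay below the argument-length limit.
find "$workdir/site" -type f -name '*html' -print0 |
  xargs -0 sed -i 's!&nbsp;!!g'

result=$(cat "$workdir/site/index.html" "$workdir/site/a b/page.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```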

so much for ideas. :)

Sven
Silvio Siefke
2014-03-02 16:59:35 UTC
Hello,

On Sun, 2 Mar 2014 15:42:55 +0100 Sven Guckes
Post by Sven Guckes
yes, you can put all commands into a file
sed -f file
Ah okay, I had put them one after another, separated with ; but I think I
now understand what Ron meant in his earlier mail. I got some error messages.
Post by Sven Guckes
$ cat sedcommands
s|<[/]\?div[^>]*>||g
s!id="ext"!!g
s!id="vid"!!g
s!&nbsp;!!g
Ah okay, I will try it that way.
Post by Sven Guckes
$ cat silvioscript
#!/bin/bash
find -type f -name '*html' -exec tidy -m -config ~/.config/tidy/com {} \;
find -type f -name '*html' -exec sed -i -f sedcommands {} \;
but here you use the same find command twice to find
all the html files yet again for the next command.
you only need to find all html files only once,
then execute the tidy+sed commands on these in one go.
Yes, it should run over all the html files.
Post by Sven Guckes
if your shell is zsh then finding the html files
is as easy as using this pattern: **/*html
so you can put the finding of files
outside the script.
and when you put tidy+sed into one script
script **/*html
Ah yes, that is what I have done. I put everything in a bash script, and
after the website build it runs tidy and then sed.
Post by Sven Guckes
so much for ideas. :)
Thank you so much for the help and the ideas.


Thank you for help & Nice Day
Silvio
Ron Scott-Adams
2014-03-02 03:47:37 UTC
There’s a larger problem here: do not use RegEx to parse HTML. It ends in sadness, guaranteed. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

It’s tempting to assume that with a little RegEx you can whip HTML pages into the shape you want, but it is a steadily degrading process, and it is simply the wrong way to do things, for various reasons outlined in the above link and elsewhere. There’s a reason why dynamic web languages were invented.

At some point, the best way to provide help with sed is to assist in the understanding of when it is the wrong tool for the job. This is such a point.
Silvio Siefke
2014-03-02 16:55:30 UTC
Post by Ron Scott-Adams
There’s a larger problem here: do not use RegEx to parse HTML. It
ends in sadness, guaranteed. See
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

OK, I will need a little more time to understand that. Thank you for the
link, but what are the alternatives? Doing it by hand? I have now run sed
more than 10 times over the HTML blog and have not found a single mistake.
Post by Ron Scott-Adams
It’s tempting to assume that with a little RegEx you can whip HTML
pages into the shape you want, but it is a steadily degrading
process, and it is simply the wrong way to do things, for various
reasons outlined in the above link and elsewhere. There’s a reason
why dynamic web languages were invented.

Yes, so that the website is open to more security holes. I can understand
it when you have a bigger site and more than 10 articles a day. I used to
run a PHP script and I hated it. Not everyone is a programmer, and while my
blog was running on it I did not feel comfortable, because I never knew
whether it had a security hole or not. Now I use Nikola (a Python static
blog generator) and I am happy. The only thing I hate is code that is hard
to read. That is why I generate the pages and run tidy over them, but tidy
should only clean up, and instead it adds divs and ids. That is why I
thought sed was the best option to fix it: it is already on my system and I
can use it directly. Only when I see what people do with it do I realize
that, in typical Linux fashion, you have to start studying before you can
use it.
Post by Ron Scott-Adams
At some point, the best way to provide help with sed is to assist in
the understanding of when it is the wrong tool for the job. This is
such a point.

No problem, then I will not ask when I need help. If I find nothing else I
can use geany. It is not that much work, but I wonder what I need a computer
for if in the end everything has to be done manually. Not everyone is a
professional who can search out the ultimate tool for a specific job. If sed
is not the right tool for HTML pages, why is the sed page built with this
tool? To understand sed you surely have to understand regex, and to
understand regex you need to study. I am a hobby computer user; I work in a
construction company. I can certainly explain to you which tools you need to
build a house, but for a PC I look for the easy way, without installing many
programs on my system. That is why I use Linux: I can use software without
much clutter.

I am open to all advice, I swear, and I am always impressed by people who
understand these things. As far as I can see, the link you sent only
explains the problem; it does not name an actual tool. Or I am blind, but I
will read the article later when I have finished my other work.


Thank you for help & Nice Day
Silvio


PS: Sorry, my English is not perfect; I hope everything is understandable.
If not, my inbox is open for a dialog.
d***@ehdp.com
2014-03-11 04:57:20 UTC
Regardless of all the discussion (and silliness) on the link you provided - You're right sed is not well suited to "whip HTML pages into shape". I doubt any generic tool (sed, awk, perl, etc.) is up to that task, because HTML is complex, HTML allows many variants, much HTML code is sloppy, and you run into multi-line processing challenges. So it is good to warn about the potential for "steadily degrading process". Keep some backups, in case there are unexpected "side effects".

However, I think sed is fine for editing HTML (or any text file), as long as the task is somewhat limited and clearly defined, and the sed syntax is right. sed is just a tool, does exactly what you tell it.
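
As a small illustration of the backup advice: sed can keep a copy of the original itself when a suffix is attached to -i. The file name and the sample content below are made up for the sketch.

```shell
# Throwaway directory with one sample file.
workdir=$(mktemp -d)
printf '%s\n' '<p id="vid">clip</p>' > "$workdir/index.html"

# A suffix attached to -i keeps the original as index.html.bak.
sed -i.bak 's/ *id="vid"//g' "$workdir/index.html"

# Review what changed; diff exits non-zero when the files differ.
changes=$(diff "$workdir/index.html.bak" "$workdir/index.html" || true)
printf '%s\n' "$changes"

result=$(cat "$workdir/index.html")
backup=$(cat "$workdir/index.html.bak")
rm -rf "$workdir"
```

Looking over the diff after each run is a cheap way to spot unexpected side effects before the backups are deleted.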

Another option for cleaning up HTML files is vi / vim, which will let you speed things up a lot, but still confirm the changes and maybe avoid some "gotchas", but can get pretty tedious with the n.n.n.n. routine. There is a lot of overlap between sed and vi - same heritage, shared syntax, big learning curve, great power.

I can also suggest HTP http://htp.sourceforge.net/ "HTML pre-processor" for helping maintain HTML files. I used to use (abuse?) cpp a lot to pre-process HTML, but HTP is a lot better, because it is specifically designed for the task.

Daniel

In case helpful - For those using the yahoo groups interface, the default is to reply to an individual, not the group (I messed up on this). You have to click on the left hand side of the formatting bar (two "hats") to expose the "To:" line so you can change the recipient to the entire group.
Petr Lázňovský
2014-03-11 10:59:37 UTC
Do not know if this is related to your discussion, but have you ever seen Xidel?

http://videlibri.sourceforge.net/xidel.html
d***@ehdp.com
2014-03-11 16:44:02 UTC
xidel looks potentially pretty useful for helping extract data from HTML. I've never used it, but added it to my "list". I think part of the point is that xidel reads the HTML file into data structures, and "knows" about DOM, so it is a lot more powerful at manipulating HTML. A specialized tool like xidel can greatly outperform a general-purpose tool, again for what it does. But if you want xidel to do something "different", of course it cannot. For example, I don't think it would help with the task posed by the OP.


Daniel
