Discussion:
"sed libraries"
Thierry Blanc
2013-01-31 11:55:38 UTC
Hi


I was thinking about a kind of sed "library" that can help to convert
between formats, e.g. html to latex. Usually there are programs for
that, but when you want to add something special, a script covering the
main conversions that can be adapted would be very helpful.
E.g. I want to convert html to latex, and every <font 16pt ...> should
give me a subsection entry. I cannot use html2latex, as this information
is lost and no subsection is created. Therefore I have to write my own
script, which leaves me having to set up a huge conversion table for
all special chars, etc.
For that, a sed script listing all these conversions would be of immense
help.
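
A minimal sketch of such a custom rule, assuming the heading text sits on one line inside something like `<font size="16pt">...</font>` (the exact attribute syntax is an assumption, since `<font 16pt ...>` above is only shorthand). The function name `font_to_subsection` is hypothetical. A rule like this would have to run before any generic tag-stripping rule.

```shell
# Hypothetical helper: turn a one-line <font ...16pt...> element into a
# LaTeX \subsection. Assumes opening tag, text and closing tag share a line.
font_to_subsection() {
  sed 's|<font[^>]*16pt[^>]*>\([^<]*\)</font>|\\subsection{\1}|g'
}

# Sample input (made up for illustration); other lines pass through unchanged.
printf '%s\n' '<font size="16pt">Results</font>' '<p>Plain text</p>' |
  font_to_subsection
```

Headings that span several lines would need the lines joined first (for example with `N`), which this sketch does not attempt.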

Let's create a sed library with such scripts (if it does not exist
already ...)



See example below.


---
html to latex sed:
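# delete header (the 0,/re/ address form is a GNU sed extension;
# [^>]* also covers a <body> tag that carries attributes)
0,/<body[^>]*>/d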

# convert signs
s|\$|\\$|g
s|%|\\%|g
s|\#|\\#|g
s|{|\\{|g
s|}|\\}|g
s|_|\\_|g


# Windows-1252 punctuation bytes
s|\x84|,,|g
s|\x85|{\\ldots}|g
s|\x91|{\\textquoteleft}|g
s|\x92|{\\textquoteright}|g
s|\x93|{\\textquotedblleft}|g
s|\x94|{\\textquotedblright}|g
s|\x95|{\\textbullet}|g
s|\x96|{\\textendash}|g
s|\x97|{\\textemdash}|g

s|'|{\\textquoteright}|g

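s|&copy;|\\copyright|g
s|&le;|$\\le$|g
s|&ge;|$\\ge$|g
# &lt; and &gt; are converted under "sign conversion", after the tags are
# stripped; converting them here would let "&lt; ... &gt;" be removed as a tag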


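s|\.\.\.|{\\ldots}|g

# do you want newlines expressed as newlines?
# (sed reads one line at a time, so \n in the pattern only matches after
# lines have been joined, e.g. with N)
s|\n| \\n\\n |g
s|<br */*>|\n\n |g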

# style conversion
s|<i [^>]*>|&\\textit{|g
s|<i>|&\\textit{|g
s|</i>|}|g
s|<b [^>]*>|&\\textbf{|g
s|<b>|&\\textbf{|g
s|</b>|}|g
s|\&nbsp;| |g
s|<ol[^>]*>|&\n\\begin{enumerate}|g
s|</ol>|\\end{enumerate}|g
s|<ul[^>]*>|&\n\\begin{itemize}|g
s|</ul>|\\end{itemize}|g
s|<li[^>]*>|&\\item |g


# formatting
s|<h1>|\\section*{|
s|<h2>|\\subsection*{|
s|<h3>|\\subsubsection*{|
s|<h4>|\\textbf{|
# h5 and h6 get no LaTeX markup, so only their tags are removed
# (closing them with } would leave an unbalanced brace)
s|</*h[56]>||g

s|</h[1-4]>|}|

# sign conversion
s/<[^>]*>//g
s|&nbsp;||g
s|&lt;|<|g
s|&gt;|>|g
s|&amp;|\\\&|g
s|&middot;|{\\textperiodcentered}|g
s|&ldquo;|{\\textquotedblleft}|g
s|&rdquo;|{\\textquotedblright}|g
s|&lsquo;|{\\textquoteleft}|g
s|&ndash;|--|g
s|&mdash;|---|g
s|&rsquo;|{\\textquoteright}|g
s|&hellip;|{\\ldots}|g
s|&agrave;|\\`{a}|g
s|&atilde;|\\~{a}|g
s|&iuml;|\\"{\\i}|g
s|&eacute;|\\'{e}|g
s|&uacute;|\\'{u}|g
s|&euro;|{\\euro}|g
s|&ccedil;|\\c{c}|g
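
One possible way to turn a script like this into a "library" is to keep each group of rules in its own file and combine them with repeated `-f` options, which sed applies in the order given. This is only a sketch: the file names and the `html2latex` wrapper are hypothetical, and the rule files are cut down to two lines each.

```shell
# Sketch of a rule "library": one file per group of rules (names are made up).
libdir=$(mktemp -d)

cat > "$libdir/latex-escape.sed" <<'EOF'
s|%|\\%|g
s|_|\\_|g
EOF

# Note: in a replacement, a bare & means "the matched text", so a literal
# ampersand has to be written as \& (hence \\\& for backslash + ampersand).
cat > "$libdir/html-entities.sed" <<'EOF'
s|&eacute;|\\'{e}|g
s|&amp;|\\\&|g
EOF

# Hypothetical wrapper: apply the rule files in a fixed order.
html2latex() {
  sed -f "$libdir/latex-escape.sed" -f "$libdir/html-entities.sed"
}

printf '%s\n' 'caf&eacute; &amp; 100% cotton' | html2latex
```

A special-purpose conversion would then only add its own small rule file in front of or between the shared ones, instead of copying the whole table.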
Mark Edgar
2013-02-01 12:45:59 UTC
Post by Thierry Blanc
Let's create a sed library with such scripts (if it does not exist
already ...)
There is the PYX format, supported by http://xmlstar.sf.net/ for XML
languages. I'm not sure whether that meets your needs, but I've used
it before in combination with tidy (to convert HTML to XHTML):

# Change TITLE to "New Title"
tidy -asxml <input.html 2>/dev/null | xmlstarlet pyx |
  sed '1,/(title/b;/)title/,$b;s/.*/-New Title/' | xmlstarlet p2x >output.xhtml
-Mark
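
To illustrate why PYX suits sed: it puts one event per line, with the first character giving the type (`(` start tag, `)` end tag, `A` attribute, `-` text). The sketch below runs the same sed expression on hand-written PYX lines standing in for real `xmlstarlet pyx` output, so it needs neither tidy nor xmlstarlet. The function name `retitle_pyx` is hypothetical, and the `b;` form assumes GNU sed.

```shell
# Hypothetical helper wrapping the sed expression from the post above.
retitle_pyx() {
  # Lines up to "(title" and from ")title" onward branch to the end of the
  # script and are printed unchanged; any line in between is replaced.
  sed '1,/(title/b;/)title/,$b;s/.*/-New Title/'
}

# Hand-written PYX for <html><head><title>Old Title</title></head></html>
printf '%s\n' '(html' '(head' '(title' '-Old Title' ')title' ')head' ')html' |
  retitle_pyx
```

Only the `-Old Title` line changes; the element lines around it stay as they were, so the result can be passed to `xmlstarlet p2x` as in the pipeline above.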