Get and substitute filenames in file paths that contain umlauts/special chars

Discussion:

zharif@arcor.de [sed-users]

2015-11-08 15:46:58 UTC

Env: German Win8.1x64 with sed v407 from UnxUtils.

I want to query full file paths and find any filename (+extension)
containing umlauts and some special characters. The goal is to do some
substitutions on the filenames only and to redirect the results into a new
file.
This is what I have so far (working):

Command to get a list of filenames that contain umlauts and some special
chars (piping the DIR command):
DIR /B /S /A:-D "E:\"|sed --text -n
"s/^\(.*\\\)\(.*[äüöÄÖÜß&\d183]\+[^\\]\+\)$/\1\2/p" >> OutFile1

The resulting file should:
1. contain a renaming command (REN+space) at start of each line in OutFile1.
2. contain the filename of the filepath only at end of each line.
Example: REN "e:\XYZ\XY\2011 - Schürer _Low Back Pain.pdf"
"2011-Schuerer_Low Back Pain.pdf"

This is my loop command in which substitutions are made to the original
file. Output is redirected to a new file:
FOR /F "tokens=1* delims= " %%a IN (OutFile1) DO (
FOR /F "tokens=1* delims= " %%A IN ('ECHO "%%a"^|sed --text
"s/Ä/Ae/g;s/ä/ae/g;s/Ü/Ue/g;s/ü/ue/g;s/Ö/Oe/g;s/ö/oe/g;s/ß/ss/g;s/&/+/g;s/\d183/-/g;s/[[:space:]]*\([-+_.]\)\{1,\}[[:space:]]*/\1/g;s/\([-+_.]\)\{2,\}/\1/g;s/[[:space:]]\{2,\}/ /g"') DO (ECHO REN "%%a" "%%~nxA" >> %2)
)
%%~nxA prints only the filename(n)+extension(x) of the full path(A).

This code is incredibly slow for long files (as always with FOR loops).
Questions:
1. I'm sure it must be possible to solve this task with sed only, without
using a FOR loop. I'm running out of ideas here.
2. What about the sed commands used? Do they need improving, or are
they even faulty?

3. Another one; is sed able to count and print characters for each line
of a file (similar to wc)?

Thanks much in advance for any reply
Zharif

zharif@arcor.de [sed-users]

2015-11-16 11:01:37 UTC


Because no one replied I assume that my request was not clear.
Maybe I used wrong phrasing or yahoo ate some of my code?

So let me try to re-compose my request with a simple example.
I have an input file containing the paths of files inside a directory.
inputfile example:
D:\dir1\subdir1\2006_A programme.-.a randomized controlled trial.pdf
D:\dir1\subdir1\2011_Schürer-Low Back Pain.pdf
...
...
I want sed to print the full filepath+filename+extension,
then to do some substitutions on the FILENAMES ONLY and
append the substituted filename to the same line (separated by a
space).
Substitutions in this example:
- replace ".-." with "-"
- replace umlaut "ü" or "Ü" with "ue" or "Ue"

Desired output should be like this:
D:\dir1\subdir1\2006_A programme.-.a randomized controlled trial.pdf 2006_A programme-a randomized controlled trial.pdf
D:\dir1\subdir1\2011_Schürer-Low Back Pain.pdf 2011_Schuerer-Low Back Pain.pdf
...
...

Is this somehow possible with a one-liner or maybe a sed script?

Again, thanks for any reply
Zharif


Thierry Blanc Thierry.Blanc@gmx.ch [sed-users]

2015-11-16 18:09:59 UTC


sed -r 'p;s|.*\\||;s|ü|ue|g;s|\.-\.|-|' yourfile

* print the line
* replace the path with nothing
* replace umlauts and other characters (you have to extend the list here)


gnfalex@gmail.com [sed-users]

2015-11-16 19:49:43 UTC


Greetings.
Sorry for bad English

dir /b /s | sed -r -e "h;s/.*\\//;s/[ÄÖÜ]/\0e/g;s/ß/ss/g;y/ÄÖÜäüö&\d183/AOUaou+-/;s/([-+_.\x20]){2,}/\1/g;s/\x20*.?([-+_.]).?\x20*/\1/g;H;x;s/^/ren \x22/;s/$/\x22/;s/\n/\x22\x20\x22/"
Main fragment:
"h;s/.*\\//;H;x;s/^/ren \x22/;s/$/\x22/;s/\n/\x22\x20\x22/"
Copy the line to hold space; delete everything up to the last backslash; append the result to hold space (with \n as delimiter); swap hold and pattern space; add text at the beginning of the string; add text at the end of the string; replace the \n.
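
The main fragment can be tried in isolation. The following is only a sketch, assuming GNU sed called from a POSIX shell instead of cmd.exe (so single quotes are used and a literal " stands in for the \x22 escapes); the function name make_ren_line is made up for illustration.

```shell
# Sketch: turn each full Windows path on stdin into a REN line.
# No filename substitutions are applied here.
make_ren_line() {
  sed -E 'h
s/.*\\//
H
x
s/^/ren "/
s/$/"/
s/\n/" "/'
}

# h   : hold space = full path
# s   : pattern space = filename only
# H   : hold space = full path, newline, filename
# x   : pattern space = full path, newline, filename
# the last three commands add the quotes and replace the newline
printf '%s\n' 'D:\dir1\subdir1\2011_Low Back Pain.pdf' | make_ren_line
```

The output has the shape ren "full path" "filename", which is the same shape the original one-liner produces on Windows.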

Best regards


zharif@arcor.de [sed-users]

2015-11-16 23:05:30 UTC


Thanks to Thierry and ***@gmail for your replies.

Thierry,
sed --text -r "p;s|.*\\||;s|ü|ue|g;s|\.-\.|-|" "infile" > "outfile"

Result:
Line1: D:\dir1\subdir1\2006_A programme.-.a randomized controlled trial.pdf
Line2: 2006_A programme-a randomized controlled trial.pdf
Line3: D:\dir1\subdir1\2011_Schürer-Low Back Pain.pdf2011_Schuerer-Low
Back Pain.pdf

Only the third line gives the desired result.
The first file path is split across two lines.

***@gmail.com:
Your approach fits the original request of the first mail I sent -
thanks much.

"Main fragment":
...does exactly what I want. It appends the filename only to the same
line (without any substitution of the filename).
Result:
Line1: ren "D:\dir1\subdir1\2006_A programme.-.a randomized controlled trial.pdf" "2006_A programme.-.a randomized controlled trial.pdf"
Line2: ren "D:\dir1\subdir1\2011_Schürer-Low Back Pain.pdf" "2011_Schürer-Low Back Pain.pdf"

Full command:
sed --text -r -e "h;s/.*\\//;s/[ÄÖÜ]/\0e/g;s/ß/ss/g;y/ÄÖÜäüö&\d183/AOUaou+-/;s/([-+_.\x20]){2,}/\1/g;s/\x20*.?([-+_.]).?\x20*/\1/g;H;x;s/^/ren \x22/;s/$/\x22/;s/\n/\x22\x20\x22/" "infile" > "outfile"
...results in an error message:
sed: -e expression #1, char 58: strings for y command are different lengths

Removing the "\d183" on the LHS and the "-" on the RHS of the y command
results in the following output file:
Line1: ren "D:\dir1\subdir1\2006_A programme.-.a randomized controlled trial.pdf" "200_programm.randomized controlled tria.df"
Line2: ren "D:\dir1\subdir1\2011_Schürer-Low Back Pain.pdf" "201_chore-ow Back Pai.df"

Any further help is much appreciated
Zharif


Thierry Blanc Thierry.Blanc@gmx.ch [sed-users]

2015-11-17 10:01:18 UTC


sed -r 'h;s|.*\\||;s|ü|ue|g;s|\.-\.|-|;x;G;s|\n||'


zharif@arcor.de [sed-users]

2015-11-18 11:54:05 UTC


Thierry,
thank you much. This code works.
sed --text -r "h ;s/.*\\// ;s/ü/ue/g ;s/\.-\./-/ ;x ;G ;s/\n/ /" "InFile" > "OutFile"

Just for my understanding (at the risk of pestering you), is
this explanation right:
1. Line 1: copy the line to hold space (remember it unchanged)
2. Line 1: delete everything up to the last backslash
3. Line 1: do some more replacements
4. exchange the pattern space with the hold space
5. append the hold space to the pattern space (adding a newline \n)
6. replace the newline with something (a space) and print
7. repeat this for the next line(s)
?
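
The steps listed above can be followed in a small runnable sketch. It assumes GNU sed in a POSIX shell (not UnxUtils sed under cmd.exe), and the function name append_clean_name is hypothetical. Note that G appends the hold space to the pattern space, not the other way round.

```shell
# Sketch: print "full path<space>cleaned filename" for each input line.
append_clean_name() {
  sed -E 'h
s/.*\\//
s/ü/ue/g
s/\.-\./-/
x
G
s/\n/ /'
}

# h : hold space = original line
# s : pattern space = filename, then cleaned up
# x : pattern space = original line, hold space = cleaned filename
# G : pattern space = original line, newline, cleaned filename
# s : the newline becomes a single space; sed then prints the line
printf '%s\n' 'D:\dir1\subdir1\2006_A programme.-.a randomized controlled trial.pdf' | append_clean_name
```

Further substitutions belong between the second command and x, because only there does the pattern space contain the filename alone.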

Hold buffer: Are "H" and "h" the same?

***@gmail.com,
your input is also much appreciated.
I simply removed the "y" command but must confess that your code
sometimes gave some unexpected results.
But anyway, thanks.

Zharif


gnfalex@gmail.com [sed-users]

2015-11-19 06:42:25 UTC


Greetings
Sorry for long silence.

Post by ***@arcor.de [sed-users]
Hold buffer: Is "H" and "h" the same?

No. "h" replaces the current hold space with the pattern space. "H" appends the pattern space to the current hold space (after a newline).
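
The difference is easy to see with a three-line input. This is a small sketch assuming GNU sed in a POSIX shell; the function names hold_last and hold_all are made up for the demonstration.

```shell
# h overwrites the hold space on every line, so only the last line survives.
hold_last() { sed -n 'h;${x;p;}'; }

# H appends a newline plus the current line, so the hold space accumulates
# all lines (with a leading empty line, because it starts out empty).
hold_all() { sed -n 'H;${x;p;}'; }

printf 'a\nb\nc\n' | hold_last   # prints: c
printf 'a\nb\nc\n' | hold_all    # prints an empty line, then a, b, c
```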

Post by ***@arcor.de [sed-users]
Removing the "\d183" on the LHS and the "-" on the RHS of the y command

It works for me. Checked with sed 4.2.2.93-31c8-dirty - windows binary https://db.tt/0nXKF049
Please check the command
y/\d034\d032\d111/123/
It must replace quote ("), space and "o" with 1, 2 and 3 respectively.

Post by ***@arcor.de [sed-users]
I simply removed the "y" command but must confess that your code
sometimes gave some unexpected results.

Sorry, I made an error in the replacement. I wrote
s/\x20*.?([-+_.]).?\x20*/\1/g
Correct string
s/\x20*\.?([-+_.])\.?\x20*/\1/g

s/[ÄÖÜ]/\0e/g
Add "e" to umlauts

s/ß/ss/g
Replace ß with ss

y/ÄÖÜäüö&\d183/AOUaou+-/
Replace umlauts and other symbols with analogues

s/([-+_.\x20]){2,}/\1/g
Remove duplicates of [-+_.\x20]

s/\x20*\.?([-+_.])\.?\x20*/\1/g
Remove spaces and dots around [-+_.]
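
The effect of the missing backslashes can be reproduced with a short comparison. This is a sketch assuming GNU sed in a POSIX shell; a literal space replaces \x20, and the function names trim_buggy and trim_fixed are hypothetical.

```shell
# Unescaped "." matches ANY character, so neighbours of a separator get eaten.
trim_buggy() { sed -E 's/ *.?([-+_.]).? */\1/g'; }

# Escaped "\." matches only a literal dot next to the separator.
trim_fixed() { sed -E 's/ *\.?([-+_.])\.? */\1/g'; }

printf '%s\n' 'Low - Back.pdf' | trim_buggy   # prints: Low-Bac.df
printf '%s\n' 'Low - Back.pdf' | trim_fixed   # prints: Low-Back.pdf
```

The buggy variant swallows the "k" and "p" around the dot, which is the same kind of damage as the "tria.df" and "Pai.df" results reported earlier in the thread.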

sed --text -r -e "h;s/.*\\//;s/[ÄÖÜ]/\0e/g;s/ß/ss/g;y/ÄÖÜäüö&\d183/AOUaou+-/;s/([-+_.\x20]){2,}/\1/g;s/\x20*\.?([-+_.])\.?\x20*/\1/g;H;x;s/^/ren \x22/;s/$/\x22/;s/\n/\x22\x20\x22/" "infile" > "outfile"

Best regards.


zharif@arcor.de [sed-users]

2015-11-19 08:17:33 UTC


***@gmail.com,
I understood your code and had already changed some commands
(regarding "." and "\.").
Thanks for clarifying the difference between "h" and "H".

I downloaded your file and did a simple test with the "y" command.
UnxUtils sed v407 is available here (check UnxUpdates):
http://www.weihenstephan.de/~syring/win32/

Indeed and in comparison with your version,
it's not possible to use dec or hex values either on the LHS or RHS
with UnxUtils sed 407 using the "y" command.
This seems to be a "bug" I never discovered before.
Your code works as expected with the version you use.
With UnxUtils sed your code must be changed (regarding the "y" command).

Snip:
;s/[ÄÖÜ]/\0e/g ;s/ß/ss/g ;y/ÄÖÜäüö&\d183/AOUauo+-/
(fixed a typo: LHS "äüö", RHS was "aou")
...is much more clever than mine and saves some bytes in my
batch file.

Due to the fact that I'm a frequent user of sed only,
I wonder if one of the two codes provided by you and Thierry has some
drawbacks?

Thanks very much
Zharif


Thierry Blanc Thierry.Blanc@gmx.ch [sed-users]

2015-11-19 14:16:11 UTC


Yes, your interpretation is correct. I would suggest moving the
commands into a sed script, one command per line. One command per line is
clearer when you have a long list of replacements.

There was a discussion on the y command recently. I am not sure whether there are
drawbacks, so I would stick to the s command.


zharif@arcor.de [sed-users]

2015-11-21 15:55:51 UTC


Thanks ***@gmail.com.
Could you tell me the official download location of
sed 4.2.2.93-31c8-dirty - windows binary?

Thanks Thierry,
your response about the "interpretation" was very important for me.

Following your suggestion I wrote a sed script (sedcmd.txt) containing
all the desired steps. Help to improve it is still much appreciated.

The real world environment:
A huge server repository of papers from medical journals (free full text).
This server is a unix server.
It is shared with a number of users (mostly students).
They have the rights to add papers to and download papers from this repository,
often without any sense for simple rules of filename conventions and ordering.
In some rare cases the filenames are too long when copied from the
server to a local Windows PC (259+null chars boundary) and must be
shortened somehow. But this is another task, I think.

We (me in most cases) need to name (or rename :-( ) these files with
a leading year "four digits_" followed by the word "Review_"
(if this appears somewhere in the filename).

It's not possible to change all naming mistakes but the code below tries
to cover the most common issues here.

Regarding the y command I re-arranged some other commands to work
with sed v407 from the UnxUtils.

Command: DIR /B /S /A:-D "DirPath"|sed -r -f "sedcmd.txt" > "OutFile"

--- start sedcmd.txt---
# copy line to hold space
h

# delete all up to the last backslash
s/.*\\//

# add "e" to all "ÄÖÜäüö"
s/[ÄÖÜ]/\0e/g
s/[äüö]/\0e/g

# replace "ÄÖÜäüö" with "AOUauo"
y/ÄÖÜäüö/AOUauo/

# replace all "ß" with "ss"
s/ß/ss/g

# replace one or more occurrences of
# "PlusMinusHyphenPeriodCommaUnderscoreAmpersandMiddotDashes"
# with one "Hyphen" (all)
s/[+-.,_&\d183\d150\d151]{1,}/-/g

# print filename extension and add one preceding "Dot" for lines
# containing a "Hyphen" and "1-3 chars" at end of line
# (dot has been overwritten due to the preceding command)
# (should be improved to cover any filename extension)
s/-(.{1,3}$)/.\1/

# replace two or more occurrences of a "Hyphen" with one "Hyphen" (all)
s/-{2,}/-/g

# print "1-4 digits" (= Year) and add one "Underscore" for lines starting
# with "1-4 digits" followed by zero or more "Hyphens"
s/(^[[:digit:]]{1,4})-*/\1_/

# replace the word "Review" (case insensitive) followed by zero or
# more "Hyphens" with "ReviewUnderscore"
# (sometimes the word "Review" appears somewhere in the filename.
# it would be nice to shift the word "Review" right behind
# the Year_. I don't know how to achieve this)
s/Review-*/Review_/I

# remove any space (incl. horiz. tab) from start and end of line (TrimLR)
# (not needed for listings piped by the DIR command)
# s/^[[:space:]]*//g
# s/[[:space:]]*$//g

# Some of the filenames contain unicode chars that are displayed as
# "QuestionMark" in the command line window. To make this somehow clear
# replace any occurrence of a "QuestionMark" with an "Asterisk"
s/\?/*/g

# exchange the pattern space with the hold space
x

# append the hold space to the pattern space (adding a newline \n)
G

# replace the newline with "QuoteSpaceQuote"
s/\n/\d34\d32\d34/

# add the word "REN" and "SpaceQuote" at start of line
s/^/REN \d34/

# append a "Quote" at end of line
s/$/\d34/
--- end sedcmd.txt---

Desired result (OutFile) is a renaming script that can be checked and
may be called inside a batch or via command line directly.
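
The open wish in the script comments, moving "Review_" directly behind the year, could be approached with one extra substitution placed after the s/Review-*/Review_/I step and before the x command. The following is only a sketch, assuming GNU sed with extended regular expressions in a POSIX shell; it has not been tried with UnxUtils sed v407, and the function name move_review is hypothetical.

```shell
# Sketch: move "Review_" right behind a leading "digits_" year prefix.
# Group 1 is the year prefix, group 2 is everything between the year
# and "Review_"; the two are swapped around the keyword.
move_review() {
  sed -E 's/^([0-9]{1,4}_)(.*)Review_/\1Review_\2/'
}

printf '%s\n' '2011_Low-Back-Pain-Review_Smith.pdf' | move_review
# prints: 2011_Review_Low-Back-Pain-Smith.pdf
```

Names without a leading year, or with "Review_" already in second position, pass through unchanged. If "Review_" occurs more than once, the greedy .* moves the last occurrence.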

Zharif


Thierry Blanc Thierry.Blanc@gmx.ch [sed-users]

2015-11-21 18:41:20 UTC


Post by ***@arcor.de [sed-users]
They do have the rights to add and to download papers into this repository,
often without sense for simple rules of filename conventions and ordering.

Forbid it and allow uploads only through a web interface. That way you can
prevent wrong characters, lengths and so on.

Post by ***@arcor.de [sed-users]
Desired result (OutFile) is a renaming script that can be checked and
may be called inside a batch or via command line directly.

If you have ssh access to the unix server, you can use a shell script
with a loop that reads the two names (original and new one) and renames file
by file. You can check and edit the output file with the two names in
a text editor beforehand.
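
Such a loop could look roughly like the sketch below. It assumes a POSIX shell on the server and a list file in which each line holds the original path and the new filename separated by a tab; that file format and the function name rename_from_list are assumptions for illustration, not something specified in this thread.

```shell
# Sketch: rename files according to a reviewed list.
# $1 is the list file; each line is "original path<TAB>new filename".
rename_from_list() {
  while IFS="$(printf '\t')" read -r old new; do
    # skip empty or incomplete lines
    [ -n "$old" ] && [ -n "$new" ] || continue
    dir=$(dirname -- "$old")
    # never overwrite an existing file
    if [ -e "$dir/$new" ]; then
      printf 'skip (target exists): %s\n' "$old" >&2
      continue
    fi
    mv -- "$old" "$dir/$new"
  done < "$1"
}
```

Calling rename_from_list with the checked list file renames the files in place; lines whose target already exists are reported on stderr and left alone.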


zharif@arcor.de [sed-users]

2015-11-21 19:40:30 UTC


Yes, your suggestion was planned more than a year ago :-( !!!
As so often, things move slowly when additional costs are involved.
I'm not even sure a web interface will ever be implemented.
I don't want to bore anyone with details, but in the meantime the
current/existing repository is the problem. Sensitivity for
many search strategies is low due to rising numbers of false negative
search results. The same goes for the exclusion of false positive results. I do
have unrestricted sftp access to this directory but not to the root of
the server. As a "lost windows user" with some programming and
scripting knowledge I simply connect to the server share and do my work.
My final goal is to automate some corrections (via script or app) and to
keep the time spent as low as possible.


gnfalex@gmail.com [sed-users]

2015-11-22 18:28:21 UTC


Greetings

Post by ***@arcor.de [sed-users]
Could you tell me the official download location of
sed 4.2.2.93-31c8-dirty - windows binary?

Sorry, I don't know. I compiled this exact build myself.
You can try sed 4.2.2 from https://code.google.com/p/gnu-on-windows/ , but I have never tried it...
Best regards.
