Forum: >>> Magnum BBS <<<

Character Encoding (Was: while loop taking input from file via ico

From Java Jive@21:1/5 to Spiros Bousbouras on Sun Aug 15 16:00:15 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 15/08/2021 13:58, Spiros Bousbouras wrote:

On Sun, 15 Aug 2021 12:57:24 +0100
Java Jive <java@evij.com.invalid> wrote:

Now the crunch, when I unzip these on a Linux machine, I see different
bastardisations of accented characters. So, for example where the full
7zip archive when extracted shows an e acute correctly in both a console
and a file manager listing ...
"Chat Botté, Le" [e is correctly acute]
... (if you're wondering, a French children's picture book version of
apparently 'Puss In Boots'), while with the WinZip main archive a
console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"
... and a file manager listing has a graphic character resembling a 2x2
matrix, concerning which note that while \302 octal = \xC2 hex, and
\202 octal = \x82 hex, only the second of these and not the first
appears in the symbol:
|00|
|82|

You aren't going to get anywhere with using high level tools for this. You need to go low level and see the values of the actual bytes in the filenames. So for example something like

ls *Chat* | od -A n -t x1

which will show the bytes in hexadecimal.
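
A minimal sketch of that byte-level check, assuming bash and an od that accepts the POSIX -A and -t options. The helper name show_name_bytes is made up for illustration; it dumps the bytes of whatever string it is handed, so the name can come from a glob instead of being typed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the bytes of a name as space-separated hex.
show_name_bytes() {
    # printf '%s' avoids the trailing newline that ls or echo would add.
    printf '%s' "$1" | od -A n -t x1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Usage: loop over a glob so the shell supplies the exact bytes.
for name in *Chat*; do
    [ -e "$name" ] || continue      # skip the literal pattern if nothing matched
    printf '%s: %s\n' "$name" "$(show_name_bytes "$name")"
done
```

Piping the name through printf rather than ls keeps the dump free of the newline and any quoting that ls might add.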

Thanks again, will look into that.

My problem is that I can't find a search term to trap this strange
character to correct it, for example the following, and a few similar
that I've tried, don't work because they don't find the directory:
mv "Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
mv Chat\ Bott\'$\'\\302\\202\'\',\ Le "Chat Botté, Le"

What directory ? Your post says that some files have strange names. Do also some directories have strange names ? In any case , the commands above do not show a directory separator.

As part of my manual investigations of the problem, I changed to the
directory of which the problem directory is a direct sub-directory, to
allow experimentation without having to type tediously extended pathnames.

--

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From J.O. Aho@21:1/5 to Paul on Sun Aug 15 21:21:19 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 15/08/2021 20.33, Paul wrote:

If I did the test purely in Linux, against an NTFS file
system, who knows whether the text string display would
look just like it does on Windows. I'm not a character
set expert and cannot predict what those look like on
the Linux side.

As long as the systems use the same charset, there shouldn't be any differences; this applies not just to Linux but to other operating systems
such as Microsoft Windows.

It's unlikely at the moment, that
Linux will even mount that file system (MFTMIRR) :-/ Thanks
to Microsoft. Only Fedora could mount it without whining.

Much depends on the NTFS module loaded. The current in-kernel NTFS
support is crappy and still used by some distributions, but most do support the ntfs-3g driver, though you may have to install it manually.
The good news is that this driver will be in the kernel in the near future.

Mounting BitLocker-encrypted file systems can also be done on Linux,
just in case you need to access files from your work computer's hard drive.

It would have been nice to see in-kernel exFAT support too, but I doubt
Microsoft has need of that in their Linux distributions, so I doubt they
will provide a driver.

--

//Aho


From Paul@21:1/5 to Java Jive on Sun Aug 15 19:24:22 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

Java Jive wrote:

On 13/08/2021 20:28, Java Jive wrote:

I have the following lines in a shell script ...

while [ -n "${LINE}" ]
do
    if [ -n "${LINE}" ]
    then
        # Do processing
    fi
done < "${DATA}"

.... and this works fine for all but two lines in the data file, which
contain accented characters. A file erroneously named with an e acute
needs to be renamed to have an e grave, and a filename containing an e
umlaut needs to be moved to a new location and given a new name.

Uggghhh! The reason for this disgust will become clear shortly!

This is a follow up question about character encodings ...

Previously I have released to my family two versions of the same archive
of family documents going back to the reign of Queen Anne, some items possibly a little earlier. These documents were scanned (1o for
original scan) and then put through four possible stages of
post-processing:
2n Contrast 'normalised' using pnnorm
3t Textcleaned
4nt n followed by t
5tn t followed by n

For each document, the best result was copied into the main archive,
while the above preprocessing stages were left in an '_all'
sub-directory structure, with five subdirectories named as above, each
of which having beneath it a directory tree mirroring the main archive.

The main version of the archive, which most family members seem to have downloaded, only included the main archive and didn't include the _all subdirectory with all the pre-processing results, the full version
included this directory. IIRC, the former was compressed by WinZip from
the archive as it existed on a Windows PC at the time, but WinZip threw
a wobbly over the size of the full archive, so for that I had to use 7zip.

Now the crunch, when I unzip these on a Linux machine, I see different bastardisations of accented characters. So, for example where the full
7zip archive when extracted shows an e acute correctly in both a console
and a file manager listing ...
"Chat Botté, Le" [e is correctly acute]
... (if you're wondering, a French children's picture book version of apparently 'Puss In Boots'), while with the WinZip main archive a
console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"
... and a file manager listing has a graphic character resembling a 2x2 matrix, concerning which note that while \302 octal = \xC2 hex, and
\202 octal = \x82 hex, only the second of these and not the first
appears in the symbol:
|00|
|82|

My problem is that I can't find a search term to trap this strange
character to correct it, for example the following, and a few similar
that I've tried, don't work because they don't find the directory:
mv "Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
mv Chat\ Bott\'$\'\\302\\202\'\',\ Le "Chat Botté, Le"

I could use a glob wildcard character such as '?', but currently all the filenames are within quotes, where globbing doesn't seem to work, and it would be a hell of a business removing the quotes, because many names in
the archive use many characters that would each need to be anticipated
and escaped for in an unquoted filename, such as spaces, ampersands, brackets, etc.

Can anyone suggest a sequence that will find the file, when put inside
quotes as the filename in the controlling data file mentioned previously
in the thread, so that it can just be treated like all the other lines?
As someone here suggested the data file is now stored as UTF-8 rather
than ANSI as it was formerly, and some example lines are given below in
a form for easier readability in a ng - in reality the fields are tab separated but here are separated by double spacing and have been further abbreviated to keep them from wrapping; leading symbols such as '+' and
'=' have special meanings for the program doing the work; and, yes, the commands are basically DOS commands which for Linux are translated to
their bash equivalents:

=ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
=RD "./F H /_all/1o/Blessig & Heyder"
REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
[etc]
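
A rough sketch of how such a tab-separated control file could be read in bash, offered as an illustration and not as the original script. The function name process_data_file is hypothetical, and the assumed layout (a command field followed by arguments that may be wrapped in double quotes) is inferred only from the sample lines above.

```shell
#!/usr/bin/env bash
# Hypothetical reader for a tab-separated control file.
# Assumed layout: field 1 = command (with optional '+' or '=' prefix),
# remaining fields = arguments, each possibly wrapped in double quotes.
process_data_file() {
    local line field
    local -a fields
    # `|| [ -n "$line" ]` keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue                 # skip blank lines
        IFS=$'\t' read -r -a fields <<< "$line"    # split on tabs only
        for field in "${fields[@]}"; do
            field=${field#\"}                      # drop one leading quote
            field=${field%\"}                      # drop one trailing quote
            printf '[%s]' "$field"
        done
        printf '\n'
    done < "$1"
}
```

Because a tab counts as IFS whitespace, consecutive tabs collapse into one separator, so empty fields are not preserved by this approach.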

https://stackoverflow.com/questions/4177783/xc3-xa9-and-other-codes/4177813#4177813

It looks like perhaps this "text string" for the filename,
went through some web encoding at some point. With a hex
editor, I can change C3 A9 to E9 hex, and the character in
the hex editor (on the right hand side) looks visually correct.

https://i.postimg.cc/TP57bLD9/C3-A9-to-E9.gif

You could do such an operation, in Perl, right on the
file system.

*********************** rename2.ps *************************
printf("this is a test\n");

$start = "Chat Bott";
$finish = ", Le";
$naughty1 = <\x{C3}\x{A9}> ;
$naughty2 = <\x{E9}> ;

$x = $start.$finish ;
$y = $start.$naughty1.$finish ;
$z = $start.$naughty2.$finish ;

open(OUT, ">>$x") || die("Cannot create X");
close(OUT);

open(OUT, ">>$y") || die("Cannot create Y");
close(OUT);

open(OUT, ">>$z") || die("Cannot create Z");
close(OUT);

use Cwd;

$c = getcwd ;

printf("Making a mess in %s\n", $c );

#rename( $y , $z );

exit(0);
*********************** end of rename2.ps *************************

I ran this in Windows 11, by double-clicking the file. I
could not run it using one of their terminals. I just thought
it was mildly amusing as to what the filenames looked like.

The idea of the script above, is you run it multiple times,
commenting out a line here or there, while you do your tests.
For example, comment out the creation of file $z and
enable the rename(y,z) command near the bottom, to see
if the created $y can be renamed to the presumed operational $z value.

https://i.postimg.cc/gksLyGFL/rename2-output.gif [Picture]

So far, I only tested it as copy/pasted above. I haven't
tested the rename.

Then, you'd need to pick up a recursive tree ("find-next-file")
type pattern, and look for a filename with $naughty1 in it,
and rename it somehow. Maybe something like one of the
examples here. You would probably need to look for a
substring of $naughty1, in the filenames returned.

https://stackoverflow.com/questions/5089680/how-to-find-files-folders-recursively-in-perl-script

File renaming, is the only thing I've done with Perl :-)
I'll never be a Perl person I guess.

Paul


From Paul@21:1/5 to Paul on Sun Aug 15 19:33:13 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

Paul wrote:

I ran this in Windows 11

Now, before everyone gets on my case about where I ran it,
I needed to be able to see what the users see when they
unpack their 7zip in Windows, and whether the filename
looks as intended.

If I did the test purely in Linux, against an NTFS file
system, who knows whether the text string display would
look just like it does on Windows. I'm not a character
set expert and cannot predict what those look like on
the Linux side. It's unlikely at the moment, that
Linux will even mount that file system (MFTMIRR) :-/ Thanks
to Microsoft. Only Fedora could mount it without whining.

It's hardly easy to do anything in a heterogenous
environment now. Like pulling teeth with dull pliers.

Paul


From jak@21:1/5 to All on Sun Aug 15 22:27:02 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 15/08/2021 13:57, Java Jive wrote:

On 13/08/2021 20:28, Java Jive wrote:

I have the following lines in a shell script ...

while [ -n "${LINE}" ]
     do
         if [ -n "${LINE}" ]
             then
                 # Do processing
         fi
     done < "${DATA}"

.... and this works fine for all but two lines in the data file, which
contain accented characters. A file erroneously named with an e acute
needs to be renamed to have an e grave, and a filename containing an e
umlaut needs to be moved to a new location and given a new name.

Uggghhh! The reason for this disgust will become clear shortly!

This is a follow up question about character encodings ...

Previously I have released to my family two versions of the same archive
of family documents going back to the reign of Queen Anne, some items possibly a little earlier. These documents were scanned (1o for
original scan) and then put through four possible stages of
post-processing:
    2n    Contrast 'normalised' using pnnorm
    3t    Textcleaned
    4nt    n followed by t
    5tn    t followed by n

For each document, the best result was copied into the main archive,
while the above preprocessing stages were left in an '_all'
sub-directory structure, with five subdirectories named as above, each
of which having beneath it a directory tree mirroring the main archive.

The main version of the archive, which most family members seem to have downloaded, only included the main archive and didn't include the _all subdirectory with all the pre-processing results, the full version
included this directory. IIRC, the former was compressed by WinZip from
the archive as it existed on a Windows PC at the time, but WinZip threw
a wobbly over the size of the full archive, so for that I had to use 7zip.

Now the crunch, when I unzip these on a Linux machine, I see different bastardisations of accented characters. So, for example where the full
7zip archive when extracted shows an e acute correctly in both a console
and a file manager listing ...
    "Chat Botté, Le"    [e is correctly acute]
... (if you're wondering, a French children's picture book version of apparently 'Puss In Boots'), while with the WinZip main archive a
console listing shows a very odd character sequence instead of the e
acute ...
    "Chat Bott'$'\302\202'', Le"
... and a file manager listing has a graphic character resembling a 2x2 matrix, concerning which note that while \302 octal = \xC2 hex, and
\202 octal = \x82 hex, only the second of these and not the first
appears in the symbol:
    |00|
    |82|

My problem is that I can't find a search term to trap this strange
character to correct it, for example the following, and a few similar
that I've tried, don't work because they don't find the directory:
    mv "Chat Bott'$'\302\202'', Le"    "Chat Botté, Le"
    mv Chat\ Bott\'$\'\\302\\202\'\',\ Le "Chat Botté, Le"

I could use a glob wildcard character such as '?', but currently all the filenames are within quotes, where globbing doesn't seem to work, and it would be a hell of a business removing the quotes, because many names in
the archive use many characters that would each need to be anticipated
and escaped for in an unquoted filename, such as spaces, ampersands, brackets, etc.

Can anyone suggest a sequence that will find the file, when put inside
quotes as the filename in the controlling data file mentioned previously
in the thread, so that it can just be treated like all the other lines?
As someone here suggested the data file is now stored as UTF-8 rather
than ANSI as it was formerly, and some example lines are given below in
a form for easier readability in a ng - in reality the fields are tab separated but here are separated by double spacing and have been further abbreviated to keep them from wrapping; leading symbols such as '+' and
'=' have special meanings for the program doing the work; and, yes, the commands are basically DOS commands which for Linux are translated to
their bash equivalents:

=ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
=RD "./F H /_all/1o/Blessig & Heyder"
REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
[etc]

Hi,
you could use the find command looking for filenames as a regular
expression, then use the command you need on them.
In this example I search for files with the extension ".o", display the
name with the command 'echo' and display it again converted to
uppercase:

find . -iregex '.*\.o$' -exec bash -c 'echo -n "original: $1" && echo " modified: $1" | tr "a-z" "A-Z"' _ {} \;

There should be everything you need.

cheers


From Java Jive@21:1/5 to jak on Mon Aug 16 10:23:23 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 15/08/2021 22:27, jak wrote:

On 15/08/2021 13:57, Java Jive wrote:

Can anyone suggest a sequence that will find the file, when put inside
quotes as the filename in the controlling data file mentioned
previously in the thread, so that it can just be treated like all the
other lines? As someone here suggested the data file is now stored as
UTF-8 rather than ANSI as it was formerly, and some example lines are
given below in a form for easier readability in a ng - in reality
the fields are tab separated but here are separated by double spacing
and have been further abbreviated to keep them from wrapping; leading
symbols such as '+' and '=' have special meanings for the program
doing the work; and, yes, the commands are basically DOS commands
which for Linux are translated to their bash equivalents:

=ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
=RD "./F H /_all/1o/Blessig & Heyder"
REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
[etc]

Hi,
you could use the find command looking for filenames as a regular
expression, then use the command you need on them.
In this example I search for files with the extension ".o", display the
name with the command 'echo' and display it again converted to
uppercase:

find . -iregex '.*\.o$' -exec bash -c 'echo -n "original: $1" && echo " modified: $1" | tr "a-z" "A-Z"' _ {} \;

There should be everything you need.

Thanks but no, that doesn't work. I had considered, before the script
works through the data file, of running a pre-process to find and rename
all these characters, but neither find nor ls will actually find the
erroneous characters *DIRECTLY*. The best either can do is find the
characters either side, but that means I have to know in advance where
all the problems are, and I'm not sure yet that I do. Really, if I'm
going to go down that road, I need a way of searching the entire archive structure directly for affected files and renaming them, as a separate
process from working through the data file.

So, for example, this works because I'm specifying and finding the
neighbouring characters of one known instance, not because ls is finding
the oddball characters directly ...
ls Chat\ Bott?,\ Le | sed 's~\xc2\x82~é~g'
.... whereas these don't, with neither single nor double backslashes nor various other combinations that I've tried, because neither find nor ls
seem able to find the oddball characters directly:
find . -regex ".*\\xc2\\x82.*"
ls -R *\\xc2\\x82*
ls -R *'$'\\302\\202''*


From Java Jive@21:1/5 to Spiros Bousbouras on Mon Aug 16 15:33:15 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 13:47, Spiros Bousbouras wrote:

On Mon, 16 Aug 2021 10:23:23 +0100
Java Jive <java@evij.com.invalid> wrote:

So, for example, this works because I'm specifying and finding the
neighbouring characters of one known instance, not because ls is finding
the oddball characters directly ...
ls Chat\ Bott?,\ Le | sed 's~\xc2\x82~é~g'
... whereas these don't, with neither single nor double backslashes nor
various other combinations that I've tried, because neither find nor ls
seem able to find the oddball characters directly:
find . -regex ".*\\xc2\\x82.*"
ls -R *\\xc2\\x82*
ls -R *'$'\\302\\202''*

Try ls -R *$'\302\202'*

No luck with that either ...
ls: cannot access '*'$'\302\202''*': No such file or directory

I think the trouble with all these methods is that they are specifying a succession of two characters, where as far as unicode is concerned the
oddballs are single characters, so I fear they will never match, no
matter what magic incantation is used.
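
A hedged alternative, not a diagnosis of the failure above: $'\302\202' in bash does expand to the two raw bytes, but a shell glob such as *...* is only ever matched against entries of the current directory, whereas find tests its pattern against every name at every depth. The sketch below assumes bash and a POSIX find; the function name find_mangled is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every path under a directory whose base name
# contains the byte pair C2 82 (octal 302 202).
find_mangled() {
    local bad=$'\302\202'     # ANSI-C quoting yields the two raw bytes
    # -depth lists a directory's contents before the directory itself,
    # which is the safer order if the results are later renamed.
    find "${1:-.}" -depth -name "*${bad}*" -print
}
```

Run as `find_mangled .` from the top of the archive, this should list affected files and directories without anyone needing to know the neighbouring characters in advance.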

So I've been looking at putting in the following as a hack around. It's designed to search for wildcards in the file name coming from the data
file, and if one is found, escape all the other 'dodgy' characters in it
and use it without quotes, but, although everything *LOOKS* as though it
should work, it gives an error message at the final file testing if
statement:

Before, works except for filenames containing wildcard characters:

if [ -n "${Debug}" ]
then
    echo "CE3 = ${CE3}"
fi
if [ "${CE3/./}" != "${CE3}" ]
then
    # Is file spec
    if [ ! -f "${CE3}" ] # Note quotes
    then
        if [ -n "${Debug}" ]
        then
            echo "WARNING - File '${CE3}' does not exist!"
        fi
        Result=1
        CE2=""
    fi
else
    # Is path spec
    if [ ! -d "${CE3}" ] # Note quotes
    then
        if [ -n "${Debug}" ]
        then
            echo "WARNING - Directory '${CE3}' does not exist!"
        fi
        Result=1
        CE2=""
    fi
fi

After, try to escape wildcard containing filenames:

# Need to remove any enclosing quotes at this stage
while [ "${CE3:0:1}" == "'" ] && [ "${CE3: -1:1}" == "'" ]
do
    CE3="${CE3:1:${#CE3}-2}"
done
while [ "${CE3:0:1}" == "\"" ] && [ "${CE3: -1:1}" == "\"" ]
do
    CE3="${CE3:1:${#CE3}-2}"
done
# Check for wildcard chars
if [ "${CE3/\?/}" != "${CE3}" ] || [ "${CE3/\*/}" != "${CE3}" ]
then
    # Wildcards, cannot quote, so escape difficult characters
    CE3=$(echo "${CE3}" | sed "s~\([ #&'(),;-]\)~\\\\\1~g")
else
    # No wildcards, enclose in quotes
    CE3="\"${CE3}\""
fi
if [ -n "${Debug}" ]
then
    echo "CE3 = ${CE3}"
    # Example output here seems correct, for example ...
    # CE3 = ... /Newscuttings\ \-\ Wedding\ Of\ Zo?\ <Surname>.png
fi
if [ "${CE3/./}" != "${CE3}" ]
then
    # Is file spec
    if [ ! -f ${CE3} ] # Note no quotes
    # ... but errors here: <scriptname>: line <num>: [: too many arguments
    then
        if [ -n "${Debug}" ]
        then
            echo "WARNING - File '${CE3}' does not exist!"
        fi
        Result=1
        CE2=""
    fi
else
    # Is path spec
    if [ ! -d ${CE3} ] # Note no quotes
    # ... and here: <scriptname>: line <num>: [: too many arguments
    then
        if [ -n "${Debug}" ]
        then
            echo "WARNING - Directory '${CE3}' does not exist!"
        fi
        Result=1
        CE2=""
    fi
fi
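
A possible reading of the "too many arguments" error, offered as an assumption: backslashes that arrive through a variable are not re-parsed as shell escapes, so the unquoted ${CE3} is still split at every space and [ receives several words. One workaround, sketched below, is to leave the pattern unescaped and let the shell expand it into an array with word splitting switched off. The function name resolve_pattern is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a wildcard pattern that may contain spaces,
# ampersands, etc., and print the match only if exactly one path matches.
resolve_pattern() {
    local IFS=              # empty IFS: no word splitting, so spaces survive
    local -a matches
    matches=( $1 )          # deliberately unquoted: pathname expansion only
    # An unmatched pattern is left as-is, so check that the result exists.
    if [ "${#matches[@]}" -eq 1 ] && [ -e "${matches[0]}" ]; then
        printf '%s\n' "${matches[0]}"
    else
        return 1
    fi
}

# Usage: the result can then be tested inside quotes as before, e.g.
# if target=$(resolve_pattern "${CE3}") && [ -f "${target}" ]; then ...
```

Characters such as [ and ] in a name would still be treated as pattern syntax, so this is not a complete answer for arbitrary filenames.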


From Java Jive@21:1/5 to Martin Gregorie on Mon Aug 16 17:28:06 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 16:58, Martin Gregorie wrote:

On Mon, 16 Aug 2021 15:33:15 +0100, Java Jive wrote:

No luck with that either ...
ls: cannot access '*'$'\302\202''*': No such file or directory

Might be worth writing a noddy Java program to see if it can resolve your problem character codes.

The Java 'char' primitive can hold multibyte character values, and the Character class provides methods to recognise character types, lengths,
and non-Unicode characters.

But I can't be sure that any of the target machines will have Java,
Perl, or Python installed. This has to be achieved with what will
normally be installed on a Linux or MacOS box.


From jak@21:1/5 to All on Mon Aug 16 17:46:19 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 11:23, Java Jive wrote:

On 15/08/2021 22:27, jak wrote:

On 15/08/2021 13:57, Java Jive wrote:

Can anyone suggest a sequence that will find the file, when put
inside quotes as the filename in the controlling data file mentioned
previously in the thread, so that it can just be treated like all the
other lines? As someone here suggested the data file is now stored as
UTF-8 rather than ANSI as it was formerly, and some example lines are
given below in a form for easier readability in a ng - in reality
the fields are tab separated but here are separated by double spacing
and have been further abbreviated to keep them from wrapping; leading
symbols such as '+' and '=' have special meanings for the program
doing the work; and, yes, the commands are basically DOS commands
which for Linux are translated to their bash equivalents:

=ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
=RD "./F H /_all/1o/Blessig & Heyder"
REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
[etc]

Hi,
you could use the find command looking for filenames as a regular
expression, then use the command you need on them.
In this example I search for files with the extension ".o", display the
name with the command 'echo' and display it again converted to
uppercase:

  find . -iregex '.*\.o$' -exec bash -c 'echo -n "original: $1" && echo " modified: $1" | tr "a-z" "A-Z"' _ {} \;

There should be everything you need.

Thanks but no, that doesn't work. I had considered, before the script
works through the data file, of running a pre-process to find and rename
all these characters, but neither find nor ls will actually find the erroneous characters *DIRECTLY*. The best either can do is find the characters either side, but that means I have to know in advance where
all the problems are, and I'm not sure yet that I do. Really, if I'm
going to go down that road, I need a way of searching the entire archive structure directly for affected files and renaming them, as a separate process from working through the data file.

So, for example, this works because I'm specifying and finding the neighbouring characters of one known instance, not because ls is finding
the oddball characters directly ...
    ls Chat\ Bott?,\ Le | sed 's~\xc2\x82~é~g'
... whereas these don't, with neither single nor double backslashes nor various other combinations that I've tried, because neither find nor ls
seem able to find the oddball characters directly:
    find . -regex ".*\\xc2\\x82.*"
    ls -R *\\xc2\\x82*
    ls -R *'$'\\302\\202''*

Ok. I finally understood your problem (late age?). I tried to reproduce
your problem and in my opinion you could use this way:

add the -b option to the ls command; this will translate the bad
characters into an octal escape sequence written as text, so that this:

-rw-r--r-- 1 jak NONE 0 Aug 16 16:57 'foo'$'\302\202'

$ ls -1 foo*
'foo'$'\302\202'

will become:
$ ls -1b foo*
foo\302\202

now you can search for it as if it were text. For example with the grep command:

$ ls -1b foo* | grep -F "\\302"
foo\302\202

$ ls -1b | grep -F "foo\\302\\202"
foo\302\202
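
The same idea can be wrapped up to scan a whole tree. This is only a sketch, assuming GNU ls; the function name list_escaped_names is hypothetical, and parsing ls output remains fragile for unusual names (for example, ones containing newlines).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list names under a directory that `ls -b` renders
# with three-digit octal escapes.
list_escaped_names() {
    # -1: one name per line, -R: recurse, -b: C-style escapes for nongraphic bytes
    ls -1Rb "${1:-.}" | grep -E '\\[0-7]{3}'
}
```

The single-quoted regular expression matches a literal backslash followed by three octal digits, so it reports any escaped byte, not just \302\202.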

I hope it helps you
cheers


From Martin Gregorie@21:1/5 to Java Jive on Mon Aug 16 16:58:25 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On Mon, 16 Aug 2021 15:33:15 +0100, Java Jive wrote:

No luck with that either ...
ls: cannot access '*'$'\302\202''*': No such file or directory

Might be worth writing a noddy Java program to see if it can resolve your problem character codes.

The Java 'char' primitive can hold multibyte character values, and the Character class provides methods to recognise character types, lengths,
and non-Unicode characters.

--
--
Martin | martin at
Gregorie | gregorie dot org


From Andy Burns@21:1/5 to Java Jive on Mon Aug 16 17:42:01 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

Java Jive wrote:

console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"

Are you sure the filename is exactly as you say/think? What does

ls -b

show?


From Paul@21:1/5 to Andy Burns on Mon Aug 16 19:46:56 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

Andy Burns wrote:

Java Jive wrote:

console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"

Are you sure the filename is exactly as you say/think? What does

ls -b

show?

Using a Perl script, I created some examples.
File "Y" is the php-failure induced problem name the OP has.
File "Z" is the visually-correct one.

https://i.postimg.cc/gksLyGFL/rename2-output.gif

So you can create your own for a test.

*********************** rename2.ps *************************
printf("this is a test\n");

$start = "Chat Bott";
$finish = ", Le";
$naughty1 = <\x{C3}\x{A9}> ;
$naughty2 = <\x{E9}> ;

$x = $start.$finish ;
$y = $start.$naughty1.$finish ;
$z = $start.$naughty2.$finish ;

open(OUT, ">>$x") || die("Cannot create X");
close(OUT);

open(OUT, ">>$y") || die("Cannot create Y");
close(OUT);

open(OUT, ">>$z") || die("Cannot create Z");
close(OUT);

use Cwd;

$c = getcwd ;

printf("Making a mess in %s\n", $c );

#rename( $y , $z );

exit(0);
*********************** end of rename2.ps *************************

Paul


From Java Jive@21:1/5 to Martin Gregorie on Tue Aug 17 00:12:03 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 21:47, Martin Gregorie wrote:

On Mon, 16 Aug 2021 17:28:06 +0100, Java Jive wrote:

On 16/08/2021 16:58, Martin Gregorie wrote:

On Mon, 16 Aug 2021 15:33:15 +0100, Java Jive wrote:

No luck with that either ...
ls: cannot access '*'$'\302\202''*': No such file or directory

Might be worth writing a noddy Java program to see if it can resolve
your problem character codes.

The Java 'char' primitive can hold multibyte character values, and the
Character class provides methods to recognise character types,
lengths, and non-Unicode characters.

But I can't be sure that any of the target machines will have Java,
Perl, or Python installed. This has to be achieved with what will
normally be installed on a Linux or MacOS box.

Does that matter? I thought you were treating this archived article name sanitization as either a one-off activity or something that doesn't
happen regularly and, anyway, that it was something that you did on your system before distributing the results round your family group.

No, I have to run one or other of the programs on the machine
of any family member who has already downloaded the first and, as I've
now discovered, faulty version of the archive.

As it happens I've just knocked up a bit of Java to see just what it can
do in the way of automated character translation, so if you'd care to
send me, martin@gregorie.org, a short file (100-500 chars max) containing
a mix of readable and non-readable example text, I'll run it through my
code.

Attaching it as a gzipped file should get it here without further
mangling.

Thanks, but I'm busy writing my own solution based on what I've already
posted.


From Java Jive@21:1/5 to Paul on Mon Aug 16 22:59:23 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 19:46, Paul wrote:

Andy Burns wrote:

Java Jive wrote:

console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"

Are you sure the filename is exactly as you say/think? What does

ls -b

show?

Thanks to you and 'jak' for suggesting this!

While still none of the following work ...
ls -R -b | grep '\xc2\x82'
ls -R -b | grep -E '\xc2\x82'
ls -R -b | grep '\uc282'
ls -R -b | grep -E '\uc282'
ls -R -b | grep '\u82c2'
ls -R -b | grep -E '\u82c2'
.... this at least finds all the files that I'm already aware of,
suggesting that I may know about all of them ...
ls -R -b | grep -E '\\[0-7]{3}'

There are 35 files or directories at fault, nearly all are e acute, but
there are a couple of e umlaut and 6 files with both an e grave and an e
acute :-(

Now I have to devise a method of renaming them, in other words of
ensuring that the mv command will find them. I've just tried the
following manual command to see what happens (it'll wrap, but originally
it was all one command-line):

OLDIFS=${IFS}; IFS=$'\n'; for A in $(ls -1Rb * | grep -E
'(:|\\302\\202)'); do if [ "${A: -1}" == ":" ]; then export
LASTDIR="${A/:/}"; else pushd "${LASTDIR}"; mv ${A/\\302\\202/?} ${A/\\302\\202/é}; popd; fi; done; unset LASTDIR; IFS=${OLDIFS}

Guess what now! The files were renamed, but the slashes that were
supposed to escape the spaces were included in the name! FFS, HOW
INCONSISTENT IS THAT???!!! Why are the slashes successful in escaping
the spaces in the source name but getting included as part of the target
name? Alright, so I can programme around that, but I shouldn't have to,
the illogicality of it all is just maddening!
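
One possible explanation for the inconsistency, offered as an assumption rather than something tested in this thread: the source word contains a ?, so bash treats the whole expanded word as a glob pattern, and inside a pattern a backslash quotes the next character. The target word has no glob character, so it reaches mv unchanged, backslashes included. A minimal sketch that sidesteps both ls -b and globbing by working on the raw bytes follows; fix_names is a hypothetical helper name, and bash plus GNU find are assumed.

```bash
#!/usr/bin/env bash
# fix_names (hypothetical name): rename every entry under $1 whose name
# contains the raw bytes C2 82, replacing them with a UTF-8 e acute.
fix_names() {
    local root=$1 path dir base new
    local bad=$'\302\202'     # what ls -b shows as \302\202
    local good=$'\303\251'    # UTF-8 bytes of e acute
    # -depth lists children before their parent directory, so renaming a
    # directory never invalidates paths that are still waiting in the pipe.
    find "$root" -depth -name "*${bad}*" -print0 |
    while IFS= read -r -d '' path; do
        dir=${path%/*}
        base=${path##*/}
        new=${base//$bad/$good}
        mv -- "$path" "$dir/$new"
    done
}

# Usage: fix_names .
```

Because every name is passed to mv inside double quotes, spaces need no escaping and nothing is glob-expanded.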

Using a Perl script, I created some examples.
File "Y" is the php-failure induced problem name the OP has.
File "Z" is the visually-correct one.

PHP was not involved, it was WinZip that created the problem, whereas 7z
did not, but for one thing, I didn't notice at the time, and for
another, people would have had to install software to handle *.7z files, whereas the ability to handle *.zip files is native to many/most/all
modern OSs.

https://i.postimg.cc/gksLyGFL/rename2-output.gif

So you can create your own for a test.

*********************** rename2.ps *************************
printf("this is a test\n");

$start = "Chat Bott";
$finish = ", Le";
$naughty1 = <\x{C3}\x{A9}> ;
$naughty2 = <\x{E9}> ;

I think this is suffering from the same problem that all the other
approaches have had, that you're creating two characters not one. BTW,
it's hex C2, followed by hex 82.

After some further thought, I remembered about the \u regular expression syntax. Being unsure of the correct byte order, I tried both, but
neither of the following work either, whereas logically I would have
thought that one of them should:
find . -regex ".*\uc282.*"
find . -regex ".*\u82c2.*"

But at least now there's hope, see above.
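
As far as I know, neither plain grep nor find's -regex understands a \u escape, and c282 is in any case the UTF-8 byte pair rather than a code point (the code point would be U+0082). A hedged sketch that has the shell produce the raw bytes before find ever sees the pattern; find_bad is a hypothetical helper name, and in bash $'\302\202' would do the same job as the printf call.

```bash
#!/usr/bin/env bash
# find_bad (hypothetical name): list entries under $1 (default: current
# directory) whose names contain the raw bytes C2 82.
find_bad() {
    find "${1:-.}" -name "*$(printf '\302\202')*" -print
}

# Usage: find_bad .                      # list the matches
#        find_bad . | od -A n -t x1      # dump their bytes in hex
```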

Tx again to all.

--

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

From Martin Gregorie@21:1/5 to Java Jive on Mon Aug 16 21:47:35 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On Mon, 16 Aug 2021 17:28:06 +0100, Java Jive wrote:

On 16/08/2021 16:58, Martin Gregorie wrote:

On Mon, 16 Aug 2021 15:33:15 +0100, Java Jive wrote:

No luck with that either ...
ls: cannot access '*'$'\302\202''*': No such file or directory

Might be worth writing a noddy Java program to see if it can resolve
your problem character codes.

The Java 'char' primitive can hold multibyte character values, and the
Character() class provides methods to recognise character types, lengths,
and non-Unicode characters.

But I can't be sure that any of the target machines will have Java,
Perl, or Python installed. This has to be achieved with what will
normally be installed on a Linux or MacOS box.

Does that matter? I thought you were treating this archived article name
sanitization as either a one-off activity or something that doesn't
happen regularly and, anyway, that it was something that you did on your
system before distributing the results round your family group.

As it happens I've just knocked up a bit of Java to see just what it can
do in the way of automated character translation, so if you'd care to
send me, martin@gregorie.org, a short file (100-500 chars max) containing
a mix of readable and non-readable example text, I'll run it through my
code.

Attaching it as a gzipped file should get it here without further
mangling.

--
Martin | martin at
Gregorie | gregorie dot org

From jak@21:1/5 to All on Tue Aug 17 10:41:10 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 23:59, Java Jive wrote:

[...]

Now I have to devise a method of renaming them, in other words of
ensuring that the mv command will find them.

[...]

try this way to rename your file with the strange name:

$ find . -iname `echo -e "foo\0302\0202"` -exec mv {} new_name \;

From jak@21:1/5 to All on Tue Aug 17 07:50:28 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 16/08/2021 23:59, Java Jive wrote:

[...]

PHP was not involved, it was WinZip that created the problem, whereas 7z
did not, but for one thing, I didn't notice at the time, and for
another, people would have had to install software to handle *.7z files, whereas the ability to handle *.zip files is native to many/most/all
modern OSs.

mmmumble...
... WinZip probably got it wrong when saving/restoring files between
systems that have different code pages. The "é" (e-acute), in fact,
corresponds to position 0x82 in the CP863 table (the Canadian French
code page), which is probably not the default on your system. To work
around this problem it is necessary to enable "Store Unicode filenames
in Zip files" in the "Advanced options" of WinZip. This can also be done
on systems that have WinZip integrated.
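
The code-page idea can be checked directly with iconv. A small sketch, assuming a glibc-style iconv that knows both names: it decodes the three single bytes behind the \202, \211 and \212 escapes. Which DOS code page WinZip really used is not established in this thread; CP863 is the suggestion above, and CP850 is added only as a comparison because it has the same three letters at those positions. decode_as is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# decode_as (hypothetical name): decode the bytes 0x82 0x89 0x8A using the
# code page named in $1 and print the result as UTF-8.
decode_as() {
    printf '\202\211\212' | iconv -f "$1" -t UTF-8
}

for cp in CP863 CP850; do
    printf '%s: %s\n' "$cp" "$(decode_as "$cp")"
done
```

Both lines should show éëè, so these three bytes alone cannot tell the two code pages apart.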

From Andy Burns@21:1/5 to Java Jive on Tue Aug 17 08:55:03 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

Java Jive wrote:

Thanks to you and 'jak' for suggesting this!

While still none of the following work ...

You could show us the "ls -b" output for your previous Chat Botté
filename ...

From Jasen Betts@21:1/5 to Java Jive on Tue Aug 17 14:54:43 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 2021-08-16, Java Jive <java@evij.com.invalid> wrote:

On 16/08/2021 19:46, Paul wrote:

Andy Burns wrote:

Java Jive wrote:

console listing shows a very odd character sequence instead of the e
acute ...
"Chat Bott'$'\302\202'', Le"

That's a control character \u0082 "break permitted here"
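
A quick way to confirm this, assuming iconv and od are available: decode the two bytes as UTF-8 and look at the resulting code point. show_codepoint is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# show_codepoint (hypothetical name): decode the bytes C2 82 as UTF-8 and
# print the character as big-endian UTF-16, one hex byte at a time.
show_codepoint() {
    printf '\302\202' | iconv -f UTF-8 -t UTF-16BE | od -A n -t x1
}

show_codepoint
```

The output is 00 82, i.e. the single character U+0082, which also matches the 00 over 82 box drawn by the file manager.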

Are you sure the filename is exactly as you say/think? What does

ls -b

show?

Thanks to you and 'jak' for suggesting this!

While still none of the following work ...
ls -R -b | grep '\xc2\x82'
ls -R -b | grep -E '\xc2\x82'

There's no chance of that working; try fgrep instead, or double up
the backslashes.
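
Both suggestions can be sketched as follows, assuming GNU ls and grep; list_bad and list_bad_regex are hypothetical helper names. The point is that ls -b has already turned the two bytes into the eight text characters \302\202, so that text is what has to be searched for.

```bash
#!/usr/bin/env bash
# list_bad (hypothetical name): fixed-string search, the modern spelling
# of fgrep. The pattern is taken literally, backslashes and all.
list_bad() {
    ls -R -b "${1:-.}" | grep -F '\302\202'
}

# list_bad_regex (hypothetical name): the same search as a regular
# expression, with each backslash doubled so it matches a literal one.
list_bad_regex() {
    ls -R -b "${1:-.}" | grep '\\302\\202'
}

# Usage: list_bad .
```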

what does "ls -b" show?

--
Jasen.

From Java Jive@21:1/5 to Java Jive on Tue Aug 17 13:52:13 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 15/08/2021 12:57, Java Jive wrote:

Can anyone suggest a sequence that will find the file, when put inside
quotes as the filename in the controlling data file mentioned previously
in the thread, so that it can just be treated like all the other lines?
As someone here suggested the data file is now stored as UTF-8 rather
than ANSI as it was formerly, and some example lines are given below in
a form for easier readability in a ng - in reality the fields are tab separated but here are separated by double spacing and have been further abbreviated to keep them from wrapping; leading symbols such as '+' and
'=' have special meanings for the program doing the work; and, yes, the commands are basically DOS commands which for Linux are translated to
their bash equivalents:

=ATTRIB -R "./F H/Close/Sts Mary & John Churchyard Monuments.pdf"
=RD "./F H /_all/1o/Blessig & Heyder"
REN "./Chat Bott'$'\302\202'', Le" "Chat Botté, Le"
MOVE "./Photo - D & M Close.png" "./Photos/D & M Close.png"
[etc]

I've completely fixed the problem with the following code inserted
before processing the data file. Thanks for all the help here that
enabled me to do this. It'll wrap of course, sorry can't help that,
beyond reducing the tabs to two spaces:

# Search for WinZip's botched accented characters
# in the main download of v1: MacFarlane-Main.zip
# 35 pathnames affected, botched characters are:
#    Intended         Stored incorrectly as
#    Char             Octal       Hex
#    é (acute)        \302\202    \xC2\x82
#    ë (diaeresis)    \302\211    \xC2\x89
#    è (grave)        \302\212    \xC2\x8A
#    Á (acute)        µ

OLDIFS=${IFS} # Normally IFS=$' \t\n'
IFS=$'\n'
LASTREN=""
for A in $(ls -1bR | grep -E '(:|µ|\\[0-7]{3}\\[0-7]{3})')
do
    if [ -n "${Debug}" ]
      then
        echo "A = \"${A}\""
    fi
    if [ "${A: -1}" == ":" ]
      then
        THISDIR="${A/:/}"
        if [ "${THISDIR}" == "${LASTREN/ -> .*/}" ]
          then
            THISDIR="${LASTREN/.* -> /}"
        fi
        if [ -n "${Debug}" ]
          then
            echo "THISDIR = \"${THISDIR}\""
        fi
      else
        SC="${A}"
        DS="${A}"
        while [ -n "$(echo \"${SC}\" | grep -E '(µ|\\[0-7]{3}\\[0-7]{3})')" ]
          do
            case $(echo "${SC}" | sed -E 's~^.*(µ|\\[0-7]{3}\\[0-7]{3}).*$~\1~') in
              "µ")         # A acute
                    SC="${SC//µ/?}"
                    DS="${DS//µ/Á}"
                    ;;
              "\302\202") # e acute
                    SC="${SC//\\302\\202/?}"
                    DS="${DS//\\302\\202/é}"
                    ;;
              "\302\211") # e diaeresis
                    SC="${SC//\\302\\211/?}"
                    DS="${DS//\\302\\211/ë}"
                    ;;
              "\302\212") # e grave
                    SC="${SC//\\302\\212/?}"
                    DS="${DS//\\302\\212/è}"
                    ;;
            esac
          done

        DS="${DS//\\/}"
        pushd "${THISDIR}"
        echo "mv ${SC} \"${DS}\""
        if [ -z "${Dummy}" ]
          then
            mv ${SC} "${DS}"
        fi
        popd

        # Remember rename in case it's a directory containing others
        LASTREN="${THISDIR}/${A//\\ / } -> ${THISDIR}/${DS}"
        if [ -n "${Debug}" ]
          then
            echo "LASTREN = \"${LASTREN}\""
        fi

    fi
done
IFS=${OLDIFS}

--

Fake news kills!

I may be contacted via the contact address given on my website:
www.macfh.co.uk

From jak@21:1/5 to All on Tue Aug 17 23:37:10 2021

Subject: Re: Character Encoding (Was: while loop taking input from file via
iconv )

On 17/08/2021 14:52, Java Jive wrote:

I've completely fixed the problem with the following code inserted
before processing the data file.

[...]

Just because I had also tried to write a shell-script version:

These are the files I created for testing:

$ ls -1 jak/foo*
'jak/foo'$'\302\202'
'jak/foo'$'\302\202\302\202'
'jak/foo'$'\302\202\302\211'
'jak/foo'$'\302\212\302\202'
'jak/foo'$'\302\212\302\202''foo'

This is the result of the script:

$ ./renbadch
mv "./jak/foo\302\202" "./jak/fooé"
mv "./jak/foo\302\202\302\202" "./jak/fooéé"
mv "./jak/foo\302\202\302\211" "./jak/fooéë"
mv "./jak/foo\302\212\302\202" "./jak/fooèé"
mv "./jak/foo\302\212\302\202foo" "./jak/fooèéfoo"

This is the code:

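#! /usr/bin/bash

# Leading text, then two backslash-octal escapes as printed by ls -b
regex='([^\\]*[^0-7]*)(\\[0-7]{3})(\\[0-7]{3})'

while read -r ll
do
    orig=$ll
    transl=""
    while [[ $ll =~ $regex ]]
    do
        start=${BASH_REMATCH[1]}
        # Second escape (e.g. \202) without its leading backslash
        goodch=$(printf %d ${BASH_REMATCH[3]:1})
        # Emit that single byte and decode it as CP863
        newch=$(echo -e "\0${goodch}" | iconv -f 'CP863' -t 'UTF-8')
        transl="${transl}${start}${newch}"
        # Drop everything up to and including the matched part
        m=${BASH_REMATCH[0]}
        ll=${ll##*"$m"}
    done
    # Only print the mv commands; nothing is actually renamed
    echo "mv \"${orig}\" \"${transl}${ll}\""
done < <(find . -type f -exec ls -1b {} + | egrep '\\[0-7]{3}\\[0-7]{3}')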
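
The same iconv idea can be applied to a whole name rather than one escape at a time. This is a sketch under stated assumptions, not a tested fix for the archive in question: it assumes the names were written as single DOS code-page bytes and later stored as if those bytes were Latin-1 characters encoded in UTF-8. repair_name is a hypothetical helper name, and CP850 is an assumption; it is chosen because 0xB5 is Á in CP850, which would fit the µ/Á pair in Java Jive's table, while é, ë and è sit at the same positions as in CP863.

```bash
#!/usr/bin/env bash
# repair_name (hypothetical name): undo the assumed mis-decoding of one name.
repair_name() {
    printf '%s' "$1" |
        iconv -f UTF-8 -t ISO-8859-1 |   # back to the original single bytes
        iconv -f CP850 -t UTF-8          # decode them with the assumed code page
}

# Usage: repair_name "$damaged_name"
```

It should only be fed names already known to be damaged, since a correctly stored accented letter would be converted as well.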
