• Curious email anomaly

    From Newyana2@21:1/5 to All on Fri Sep 8 10:57:41 2023
    Just wondering if anyone's ever seen this before.

    =?utf-8?Q?IMG=5F0506.PNG?=

    That was a filename in email attachments. I saved
    them and then couldn't rename or delete them! I've
    never run into anything like this. Some kind of unicode
    corruption? They saved like so: 5F0506.PNG I
    edited the email source code like so: 5F0506.jpg.

    The sender was using gmail, no program listed in
    the header, and he mistakenly named the files PNG
    when they were actually JPG. I'm guessing he was
    probably doing gmail through Safari on a Mac, but
    I'm not sure.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Apd@21:1/5 to All on Fri Sep 8 16:44:30 2023
    "Newyana2" wrote:
    > Just wondering if anyone's ever seen this before.

    Yes, particularly in Usenet message header fields, where UTF-8 gets
    hex-encoded.

    > =?utf-8?Q?IMG=5F0506.PNG?=

    5F = hex for underscore.

    Without encoding: IMG_0506.png
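
    For anyone who wants to check the decoding, here is a minimal Python
    sketch using the standard library's email.header module. The variable
    names are only illustrative.

```python
from email.header import decode_header, make_header

raw = "=?utf-8?Q?IMG=5F0506.PNG?="

# "=5F" is a hex escape: byte 0x5F is the ASCII underscore.
assert bytes.fromhex("5F") == b"_"

# decode_header() splits the encoded-word into (bytes, charset) pairs;
# make_header() joins them back into a readable string.
decoded = str(make_header(decode_header(raw)))
print(decoded)
```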

    > That was a filename in email attachments. I saved
    > them and then couldn't rename or delete them!

    Strange.

  • From Newyana2@21:1/5 to Apd on Fri Sep 8 19:32:43 2023
    "Apd" <not@all.invalid> wrote

    | > =?utf-8?Q?IMG=5F0506.PNG?=
    |
    | 5F = hex for underscore.
    |
    | Without encoding: IMG_0506.png
    |

    Ah. Thanks. I didn't think of that. But it makes no sense,
    since _ is within ASCII, so it's also proper UTF-8. And the
    whole thing still doesn't make sense. UTF-8 is not valid
    in the Windows file system as far as I know. The sender said he
    used something called "Spark". I figured it was probably
    some kind of Apple shenanigans, but it seems to be a
    Windows program.

  • From Apd@21:1/5 to All on Sat Sep 9 09:56:56 2023
    "Newyana2" wrote:
    "Apd" wrote
    | > =?utf-8?Q?IMG=5F0506.PNG?=
    |
    | 5F = hex for underscore.
    |
    | Without encoding: IMG_0506.png

    > Ah. Thanks. I didn't think of that. But it makes no sense,
    > since _ is within ASCII, so it's also proper UTF-8.

    Indeed. Just recently I saw an x-face in a Usenet message header that
    partially encoded some ASCII like this. Only a few non-alphabetic
    chars and not consistently. Of course, it completely broke it.

    > and the whole thing still doesn't make sense. UTF-8 is not valid
    > in the Windows file system as far as I know. The sender said he
    > used something called "Spark". I figured it was probably
    > some kind of Apple shenanigans, but it seems to be a
    > Windows program.

    It's not unusual to see internet messages (which include email) using
    UTF-8 when non-ASCII really is present. I think Windows tries to
    convert it to 1252 (or whatever codepage charset it's using these
    days). If it fails and inserts a '?' for a char it doesn't understand,
    that's going to cause problems with file names. It shouldn't have been
    an issue here.

  • From Newyana2@21:1/5 to Apd on Sat Sep 9 08:08:17 2023
    "Apd" <not@all.invalid> wrote

    | > | > =?utf-8?Q?IMG=5F0506.PNG?=
    | > |

    | It's not unusual to see internet messages (which include email) using
    | UTF-8 when non-ASCII really is present. I think Windows tries to
    | convert it to 1252 (or whatever codepage charset it's using these
    | days). If it fails and inserts a '?' for a char it doesn't understand,
    | that's going to cause problems with file names. It shouldn't have been
    | an issue here.
    |

    Found it: https://en.wikipedia.org/wiki/MIME

    It's called Q-encoding. I'd never heard of this. Bizarre.
    It includes the text encoding designation within the filename
    field, and all those = and ? are part of the required format!

    So it seems there were two problems. Spark email mistakenly
    encoded _ and had to use Q-encoding, while my TBird seems
    to only partially recognize Q-encoding. It dropped everything
    except 0506.png, but it must have written a corrupt filename
    to disk, perhaps including = and ?, resulting in saving file names
    with "illegal" characters. (I've noticed that's often feasible. For
    example I can create a .htaccess file with VBScript but Explorer
    won't let me start a file name with a period.)

    So that might explain why I couldn't rename or delete the files.
    They were recorded with corrupt file names. I was able to delete
    them with File Assassin, but that program had failed to let me
    rename them. Which also makes sense, I guess, because the
    files were never locked -- only corrupted as file system entries.

  • From Apd@21:1/5 to All on Sat Sep 9 14:53:48 2023
    "Newyana2" wrote:
    > Found it: https://en.wikipedia.org/wiki/MIME
    >
    > It's called Q-encoding. I'd never heard of this. Bizarre.
    > It includes the text encoding designation within the filename
    > field, and all those = and ? are part of the required format!

    Yes, RFC 2047 refers.

    > So it seems there were two problems. Spark email mistakenly
    > encoded _ and had to use Q-encoding,

    Something I didn't know is that an underscore represents a space and
    so needs to be encoded. Normally, a space would be "=20" but they say
    it's for readability. There was mention in the RFC about underscores
    not passing through some mail gateways (I don't know how true that is
    nowadays), so perhaps the email program was encoding just to be safe.
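
    The underscore rule can be sketched in a few lines of Python. The
    q_encode_word helper below is hypothetical and deliberately
    conservative (it escapes more than RFC 2047 strictly requires); only
    decode_header and make_header come from the standard library.

```python
from email.header import decode_header, make_header

# Characters this sketch leaves unescaped inside an encoded-word.
SAFE = set("abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "0123456789!*+-/")

def q_encode_word(text, charset="utf-8"):
    """Hypothetical helper: wrap text in an RFC 2047 Q encoded-word."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch == " ":
            out.append("_")             # a space may be written as "_"
        elif ch in SAFE:
            out.append(ch)
        else:
            out.append("=%02X" % byte)  # everything else, including "_"
    return "=?%s?Q?%s?=" % (charset, "".join(out))

def q_decode(value):
    """Decode an encoded-word with the standard library."""
    return str(make_header(decode_header(value)))

encoded = q_encode_word("IMG_0506.PNG")
print(encoded)
print(q_decode(encoded))

# A raw underscore inside an encoded-word is read back as a space,
# which is why a literal underscore has to travel as "=5F".
print(q_decode("=?utf-8?Q?IMG_0506.PNG?="))
```

    So an encoder that decides to Q-encode a file name has no choice but
    to escape the underscore; whether it needed to Q-encode a plain ASCII
    name at all is a separate question.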

    > while my TBird seems
    > to only partially recognize Q-encoding. It dropped everything
    > except 0506.png, but it must have written a corrupt filename
    > to disk, perhaps including = and ?, resulting in saving file names
    > with "illegal" characters.

    Bad Tbird!

  • From Paul@21:1/5 to All on Sat Sep 9 11:58:01 2023
    On 9/8/2023 7:32 PM, Newyana2 wrote:
    "Apd" <not@all.invalid> wrote

    | > =?utf-8?Q?IMG=5F0506.PNG?=
    |
    | 5F = hex for underscore.
    |
    | Without encoding: IMG_0506.png
    |

    > Ah. Thanks. I didn't think of that. But it makes no sense,
    > since _ is within ASCII, so it's also proper UTF-8. And the
    > whole thing still doesn't make sense. UTF-8 is not valid
    > in the Windows file system as far as I know. The sender said he
    > used something called "Spark". I figured it was probably
    > some kind of Apple shenanigans, but it seems to be a
    > Windows program.

    There is likely to be more than one program named Spark.

    https://spark.apache.org/

    *******

    I knew NTFS accepted wide characters, but not the details of what
    you stuff in them. The reason I have to know about this stuff is
    that, when analyzing Registry entries, some file system paths are
    stored in 16-bit mode in the Registry. The practice makes it damn
    hard to search for stuff.

    https://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as

    "NTFS stores filenames in UTF-16, however fopen is using ANSI (not UTF-8).

    In order to use an UTF16-encoded file name you will need to use the Unicode versions
    of the file open calls. Do this by defining UNICODE and _UNICODE in your project.
    Then use the CreateFile call or the wfopen call."
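
    To illustrate why 16-bit paths are awkward to search for, here is a
    small Python sketch. The blob is made up for the example and only
    stands in for binary data that holds a path as UTF-16LE text.

```python
name = "IMG_0506.PNG"

utf8_bytes = name.encode("utf-8")        # one byte per ASCII character
utf16_bytes = name.encode("utf-16-le")   # two bytes per character here

# Made-up stand-in for binary data that stores a path as UTF-16LE.
blob = b"\x01\x00" + "C:\\Pics\\IMG_0506.PNG".encode("utf-16-le") + b"\x00\x00"

# A plain byte search for the 8-bit form misses the 16-bit form,
# because every character is followed by a zero byte.
found_as_utf8 = utf8_bytes in blob
found_as_utf16 = utf16_bytes in blob
print(len(utf8_bytes), len(utf16_bytes), found_as_utf8, found_as_utf16)
```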

    I saw in a ProcMon trace once, the usage of a file opening option,
    which seemed to be "open the file but *delete* the file when you
    close it". Basically a "read and delete" kind of semantic. A properly
    crafted wfopen command just might be enough to delete it :-)
    Deleting in NTFS is not difficult, and a single byte in the 1024 byte
    $MFT entry carries the info that the file is deleted and that the
    $MFT entry can be "reused, any time it is convenient for you". There
    is no procedure in NTFS to shrink or consolidate the $MFT, so
    filenames remain visible until you "create" enough files to reuse
    all the unused $MFT entries.

    And I don't think changing languages would help. For one filesystem issue,
    I was able to use Perl to make a correction. But the handling of anything
    other than ANSI, is likely to be just as convoluted as the StackOverflow description.

    It seems the NTFS file system calls just don't have enough
    sanitization in them. At a guess. Someone else recently had a problem
    where a filename definitely violated a cardinal NTFS rule, and of
    course the user could not rename or delete either, because as soon as
    the illegal filename was presented to File Explorer, File Explorer
    said "here, let me fix this for you, by removing the illegal
    portion", and then of course the result is "file not found". And
    that's why you're not able to rename or delete: it *does* do the
    sanitizing when it is inconvenient to do it, but *does not* do the
    sanitizing for the "browser wedges file system" cases :-/
    Some kind of subroutine call browsers are using seems to be bad for
    your situation.

    *******

    You can try deleting the file in question, using the short file name
    in a Command Prompt window, as that name may have fewer
    representation issues.

    del somename.ext

    You would need to look up how to get the short names to show (if they
    exist). The short name is effectively an alias.
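
    In a Command Prompt, "dir /x" lists the short names next to the long
    ones. As a rough sketch, the same alias can be asked for from Python
    through the Win32 GetShortPathNameW call. The short_path helper is
    hypothetical, only does anything on Windows, and assumes the volume
    actually keeps short names; if it does not, the call may simply hand
    back the long path.

```python
import sys

def short_path(long_path):
    """Return the short (8.3) alias of long_path on Windows, else None."""
    if sys.platform != "win32":
        return None  # the API below only exists on Windows
    import ctypes
    from ctypes import wintypes

    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    get_short.restype = wintypes.DWORD

    buf = ctypes.create_unicode_buffer(260)
    length = get_short(long_path, buf, len(buf))
    if length == 0 or length > len(buf):
        return None  # lookup failed or the buffer was too small
    return buf.value

# Example path is made up; substitute the real location of the file.
print(short_path(r"C:\Downloads\funny-named-thing.ext"))
```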

    *******

    And I'm still chuckling here, as I DID find a way to make illegal filenames
    on NTFS :-) (Removing an illegal file, may still have its challenges,
    but I cannot reproduce your issue exactly, unless I can find a way to
    duplicate it.)

    It turns out, that in Linux,

    [fuse filesystem ntfs.3g]

    # Passes a mount option to sanitize filenames. This prevents "mistakes".
    sudo mount -o windows_names,rw /dev/sda1 /mnt

    # This is UNPROTECTED naming. Used to make the following picture.
    sudo mount -o rw /dev/sda1 /mnt

    You can see I had fun, by putting a "dot" on the end of a filename.

    I ran CHKDSK in Windows, and it does not do a damn thing about that file.

    [Picture]

    https://i.postimg.cc/DZS8LbY4/illegal-filename-via-knoppix531.gif

    https://linux.die.net/man/8/ntfs-3g

    "windows_names
    This option prevents files, directories and extended attributes
    to be created with a name not allowed by windows
    "
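
    As a rough idea of what "a name not allowed by windows" means, here
    is a Python sketch of the commonly documented Windows naming rules.
    The windows_name_problems helper is hypothetical and is not how
    ntfs-3g itself implements the option.

```python
import re

RESERVED_CHARS = set('<>:"/\\|?*')
RESERVED_NAMES = re.compile(
    r"^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\..*)?$", re.IGNORECASE)

def windows_name_problems(name):
    """Hypothetical checker: list reasons Windows would reject a file name."""
    problems = []
    bad = sorted(c for c in set(name) if c in RESERVED_CHARS or ord(c) < 32)
    if bad:
        problems.append("reserved characters: " + " ".join(repr(c) for c in bad))
    if name.endswith((".", " ")):
        problems.append("ends with a dot or space")
    if RESERVED_NAMES.match(name):
        problems.append("reserved device name")
    return problems

for candidate in ["IMG_0506.PNG", "report.", "NUL.txt",
                  "=?utf-8?Q?IMG=5F0506.PNG?="]:
    print(repr(candidate), windows_name_problems(candidate))
```

    Note that "=" is an ordinary character in a Windows file name, while
    "?" is not, so an undecoded Q-encoded name would trip the first rule.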

    *******

    Why did I use Knoppix-531 DVD ?

    There was no kernel level NTFS driver back then. Only NTFS-3G
    existed, and it was ready to use from the DVD.

    You click the disk icon on the desktop. From Terminal (icon on taskbar)

    cat /etc/mtab

    and that will show the options list for an ordinary mount. The context
    menu has an option for "mount read/write", as normally Knoppix 531
    safe-mounts disks in read-only mode. In any case, you will notice
    that Knoppix does not have "windows_names" in the options list.

    At some point, the developers made "windows_names" the default, and,
    they did not provide a "windows_names=No" option or similar. This means
    the in-kernel NTFS mount on a modern system (Ubuntu 23.04), would already be
    enforcing valid NTFS filenames.

    But, the ntfs-3g still exists, and on Ubuntu, it is already installed

    gnome-disks (or just "disks" maybe) # This utility allows discovering names for things

    sudo /sbin/mount.ntfs-3g -o rw /dev/sda1 /mnt

    cd /mnt
    ls
    ...
    rm "funny-named-thing.ext" # This is the challenging part.

    cd ~
    sudo umount /mnt # Put away partition, before shutdown.

    Paul

  • From Newyana2@21:1/5 to Paul on Sat Sep 9 13:37:10 2023
    "Paul" <nospam@needed.invalid> wrote

    | There is likely to be more than one program named Spark.
    |
    | https://spark.apache.org/
    |
    That's something else. I only find one Spark for
    email, made by Readdle. There's no X-Mailer or
    UserAgent field in the header. Looking around I see
    that identifying the sending program has become rare.


    | I knew NTFS accepted wide characters, but not the details of what
    | you stuff in them. The reason I have to know about this stuff,
    | is when analyzing Registry entries, as some file system paths are stored
    | in 16-bit mode
    | in the Registry. The practice makes it damn hard to search for stuff.
    |
    | https://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as
    |
    | "NTFS stores filenames in UTF-16, however fopen is using ANSI (not
    | UTF-8).
    |

    Are you sitting down?... I'm on FAT32. Cuts down on the
    nonsense and complications. Permissions are impossible to
    enforce.

    I didn't know that about NTFS file names. I've never run
    into problems. But Windows has been mainly unicode for
    a long time. I mostly work with VB6, which converts it
    automatically. And Windows Script Host? I can't think of
    any software that doesn't transfer seamlessly between
    FAT32 and NTFS. I would have thought that Windows
    would just manage that.

    |
    | You can try deleting the file in question, using the short file name
    | in a Command Prompt windows. As that name may have fewer representation issues.
    |

    File Assassin did it. I think the problem was just
    that the stored file name in the file system probably
    didn't match what Explorer saw.

  • From Newyana2@21:1/5 to Apd on Sat Sep 9 13:41:05 2023
    "Apd" <not@all.invalid> wrote

    | Something I didn't know is that an underscore represents a space and
    | so needs to be encoded. Normally, a space would be "=20" but they say
    | it's for readability. There was mention in the RFC about underscores
    | not passing through some mail gateways (I don't know how true that is
    | nowadays), so perhaps the email program was encoding just to be safe.
    |

    I think it's just a bug. An underscore is often
    used instead of a space, where a space can't be
    used, like a URL. But it's not a space character.
    And there's no problem with space characters in
    ASCII. It's not necessary in email for a file name.
    The name is in quotes. So filename: "kids at beach.jpg"
    would be no problem. There's no reason at all to
    be encoding the file name. On the other hand, there's
    also no reason that TBird couldn't handle it. So it's
    a screw-up on both ends.

  • From Apd@21:1/5 to All on Sat Sep 9 22:56:09 2023
    "Newyana2" wrote:
    "Apd" wrote
    | Something I didn't know is that an underscore represents a space and
    | so needs to be encoded. Normally, a space would be "=20" but they say
    | it's for readability. There was mention in the RFC about underscores
    | not passing through some mail gateways (I don't know how true that is
    | nowadays), so perhaps the email program was encoding just to be safe.

    > I think it's just a bug. An underscore is often
    > used instead of a space, where a space can't be
    > used, like a URL. But it's not a space character.
    > And there's no problem with space characters in
    > ASCII.

    Sure, but I was thinking it started out as an underscore, not a space,
    so the mail agent decided to encode it as per RFC comments about
    gateways.

  • From Frank Slootweg@21:1/5 to Newyana2@invalid.nospam on Sun Sep 10 15:18:21 2023
    Newyana2 <Newyana2@invalid.nospam> wrote:
    "Apd" <not@all.invalid> wrote

    | > | > =?utf-8?Q?IMG=5F0506.PNG?=
    | > |

    | It's not unusual to see internet messages (which include email) using
    | UTF-8 when non-ASCII really is present. I think Windows tries to
    | convert it to 1252 (or whatever codepage charset it's using these
    | days). If it fails and inserts a '?' for a char it doesn't understand,
    | that's going to cause problems with file names. It shouldn't have been
    | an issue here.
    |

    > Found it: https://en.wikipedia.org/wiki/MIME
    >
    > It's called Q-encoding. I'd never heard of this. Bizarre.
    > It includes the text encoding designation within the filename
    > field, and all those = and ? are part of the required format!
    >
    > So it seems there were two problems. Spark email mistakenly
    > encoded _ and had to use Q-encoding, while my TBird seems
    > to only partially recognize Q-encoding. It dropped everything
    > except 0506.png, but it must have written a corrupt filename
    > to disk, perhaps including = and ?, resulting in saving file names
    > with "illegal" characters. (I've noticed that's often feasible. For
    > example I can create a .htaccess file with VBScript but Explorer
    > won't let me start a file name with a period.)
    >
    > So that might explain why I couldn't rename or delete the files.
    > They were recorded with corrupt file names. I was able to delete
    > them with File Assassin, but that program had failed to let me
    > rename them. Which also makes sense, I guess, because the
    > files were never locked -- only corrupted as file system entries.

    You later mentioned that you use FAT32. The 'corrupt file names' issue
    is probably related to that, as on my (Windows 11) system, with NTFS,
    File Explorer *can* delete and rename the example (Q-encoded) file name.

    As to why the file name got Q-encoded in the first place:

    I suspect that at some time, the file name was used in some e-mail
    header - probably in Subject; - and for some reason (see Apd's
    responses), some mailer somewhere thought the file name should be
    Q-encoded (perhaps indeed because of the underscore). Once encoded,
    nobody should see the encoded form, only the decoded form. BUT if someone
    would copy and paste the file name from the header, the clipboard would
    probably contain the encoded name, resulting in the havoc you
    experienced.

  • From Newyana2@21:1/5 to Frank Slootweg on Sun Sep 10 13:21:41 2023
    "Frank Slootweg" <this@ddress.is.invalid> wrote

    | You later mentioned that you use FAT32. The 'corrupt file names' issue
    | is probably related to that, as on my (Windows 11) system, with NTFS,
    | File Explorer *can* delete and rename the example (Q-encoded) file name.
    |
    Yes. Paul mentioned that NTFS is more sophisticated.
    I didn't test to see what the file might have been named
    on another system. There were 3 different issues: Spark
    writing an unnecessary, arguably corrupted file name.
    TBird apparently not properly parsing that name. Then
    Windows allowing the name to be recorded differently
    from what Explorer saw.

    | As to why the file name got Q-encoded in the first place:
    |
    | I suspect that at some time, the file name was used in some e-mail
    | header - probably in Subject; - and for some reason (see Apd's
    | responses), some mailer somewhere thought the file name should be
    | Q-encoded (perhaps indeed because of the underscore). Once encoded,
    | nobody should see the encoded form, only the decoded form. BUT if someone
    | would copy and paste the file name from the header, the clipboard would
    | probably contain the encoded name, resulting in the havoc you
    | experienced.

    There was no excuse for the encoding, except that Spark
    was composing in UTF-8 and it's "legal" to encode it. The
    mystery is why TBird, or something, dropped out the underscore.
    An underscore is a perfectly legit ASCII character. But somehow
    Explorer ended up not showing it. That made me curious whether
    there's a way to directly read the file system, but I'm not
    aware of such a tool. I'm curious how the file was recorded.
    Since File Assassin could delete it but not let me rename it,
    I'm guessing the name was corrupted between the file
    system and Explorer. Maybe it was recorded as including an = sign,
    for example.

    I just sent myself an image with underscore and a space. It
    came through normally:

    Content-Type: image/jpeg;
     name="_e-device spying.jpg"
    Content-Transfer-Encoding: base64
    Content-Disposition: attachment;
     filename="_e-device spying.jpg"
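
    The same kind of test can be reproduced without a mail client. This
    is a minimal Python sketch with the standard library's EmailMessage
    class; the subject, body text and placeholder image bytes are made up
    for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "test image"
msg.set_content("Image attached.")

# Placeholder bytes stand in for real JPEG data.
msg.add_attachment(b"\xff\xd8\xff\xe0", maintype="image", subtype="jpeg",
                   filename="_e-device spying.jpg")

# Read the name back the way a receiving program would.
attachment = next(msg.iter_attachments())
print(attachment["Content-Disposition"])
print(attachment.get_filename())
```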

    I'm curious how common this Q-encoding is. I've never
    seen it before. I'd never heard of it. It's clearly "legal",
    but I've even written email software and never saw such
    a thing. There's no possible reason for it except to transmit
    characters that don't exist in ASCII. Even then, it would
    be converted on most systems. That is, if you send me
    something like a file with a Chinese character then I'd
    probably receive something like ~1/4.jpg if it worked at
    all.

    Which raises the question of unicode on Windows. Windows
    has been unicode-16 for many years, but that's different
    from UTF-8, using 2 bytes for all characters. I'm not sure
    Explorer is capable, or Windows itself capable, of handling
    a UTF-8 file name if there are characters not allowed in
    Explorer.
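
    The size difference between the two encodings is easy to measure. A
    small Python sketch follows; the sample characters are arbitrary.
    UTF-16 takes two bytes for most characters but four for those outside
    the Basic Multilingual Plane, while UTF-8 takes one to four.

```python
# A, underscore, e-acute, a CJK character, and an emoji.
samples = ["A", "_", "\u00e9", "\u4e2d", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))
    sizes[ch] = (utf8_len, utf16_len)
    print("U+%04X" % ord(ch), "UTF-8:", utf8_len, "UTF-16:", utf16_len)
```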

    It's funny how quickly character encoding gets confusing.

  • From Frank Slootweg@21:1/5 to Newyana2@invalid.nospam on Sun Sep 10 20:05:25 2023
    Newyana2 <Newyana2@invalid.nospam> wrote:
    "Frank Slootweg" <this@ddress.is.invalid> wrote
    [...]
    | As to why the file name got Q-encoded in the first place:
    |
    | I suspect that at some time, the file name was used in some e-mail
    | header - probably in Subject; - and for some reason (see Apd's
    | responses), some mailer somewhere thought the file name should be
    | Q-encoded (perhaps indeed because of the underscore). Once encoded,
    | nobody should see the encoded form, only the decoded form. BUT if someone
    | would copy and paste the file name from the header, the clipboard would
    | probably contain the encoded name, resulting in the havoc you
    | experienced.

    > There was no excuse for the encoding, except that Spark
    > was composing in UTF-8 and it's "legal" to encode it. The
    > mystery is why TBird, or something, dropped out the underscore.
    > An underscore is a perfectly legit ASCII character. But somehow
    > Explorer ended up not showing it. That made me curious whether
    > there's a way to directly read the file system, but I'm not
    > aware of such a tool. I'm curious how the file was recorded.
    > Since File Assassin could delete it but not let me rename it,
    > I'm guessing the name was corrupted between the file
    > system and Explorer. Maybe it was recorded as including an = sign,
    > for example.
    >
    > I just sent myself an image with underscore and a space. It
    > came through normally:
    >
    > Content-Type: image/jpeg;
    >  name="_e-device spying.jpg"
    > Content-Transfer-Encoding: base64
    > Content-Disposition: attachment;
    >  filename="_e-device spying.jpg"
    >
    > I'm curious how common this Q-encoding is. I've never
    > seen it before. I'd never heard of it. It's clearly "legal",
    > but I've even written email software and never saw such
    > a thing. There's no possible reason for it except to transmit
    > characters that don't exist in ASCII. Even then, it would
    > be converted on most systems. That is, if you send me
    > something like a file with a Chinese character then I'd
    > probably receive something like ~1/4.jpg if it worked at
    > all.

    As I said, the Q-encoding is relevant to and possibly justified in
    e-mail *headers*, for example in 'Subject:'. A header must be ASCII,
    because any MIME headers define the encoding and charset of the *body*,
    not of the headers.
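
    A quick way to see this in action is to let a mail library build a
    non-ASCII Subject. This is a minimal Python sketch with the standard
    library; the subject text is made up, and which of the two
    encoded-word forms (Q or B) the library picks is its own choice.

```python
from email.header import Header, decode_header, make_header

subject = "Gr\u00fc\u00dfe aus K\u00f6ln"   # contains non-ASCII letters

# What actually goes on the wire: an ASCII-only encoded-word.
wire_form = Header(subject, "utf-8").encode()
print(wire_form)

# What a receiving program should show after decoding.
round_trip = str(make_header(decode_header(wire_form)))
print(round_trip)
```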

    This is nicely explained in the MIME page you referenced:

    <https://en.wikipedia.org/wiki/MIME#Encoded-Word>

    and specifically the example in

    <https://en.wikipedia.org/wiki/MIME#Difference_between_Q-encoding_and_quoted-printable>

    So as I mentioned, my suspicion is that the Q-encoded file name was
    probably in some header, probably the 'Subject:' header.

    The question remains *why* it was Q-encoded, as all the characters in
    the file name are normal printing characters. But as Apd mentioned,
    perhaps the underscore ('_') is a printable character, but still an
    exception on some systems, so it was encoded, just to be on the safe
    side.

    Just for kicks, I used Thunderbird to send myself a message with
    "Subject: IMG_0506.PNG", but when viewing the Message Source, I saw that
    the name in the message was *not* encoded. So I could not confirm my
    suspicion (but also not disprove it).

    > Which raises the question of unicode on Windows. Windows
    > has been unicode-16 for many years, but that's different
    > from UTF-8, using 2 bytes for all characters. I'm not sure
    > Explorer is capable, or Windows itself capable, of handling
    > a UTF-8 file name if there are characters not allowed in
    > Explorer.
    >
    > It's funny how quickly character encoding gets confusing.

    As your reference says, Q-encoding is similar to 'quoted-printable'.
    That latter term was often qualified as 'quoted-unreadable'! :-)
