• UTF-8 mail encoding procedure?

    From Tuxedo@21:1/5 to All on Wed Sep 15 22:19:10 2021
    Hello,

    How can I process the input of an HTML form in UTF-8 with CGI and pass it through MIME::Lite's sending procedure intact?

    I would like all contents, including the mail headers (Subject, Reply-To and
    From), to be UTF-8 compatible. So far, I have only managed to print a
    form's input to the browser, but not to encode it correctly through the mail procedure.

    For example, "Commenter" (as in the 'From' string below) could become "Σχολιαστής" if a commenter entered his or her name in
    localised characters in the relevant form field.

    --------- commenter.cgi ----------

    #!/usr/bin/perl -w

    use CGI;
    use MIME::Lite;
    use Encode qw(encode encode_utf8 );
    use utf8;

    $query = new CGI;

    $comments = $query->param('comments');

    # If I collect a UTF-8 charset subject line it becomes
    # gobbledygook once mailed

    $subject_line = $query->param('subject');

    # but if I define a UTF-8 character string here it works
    # in the subject line of the resulting mail

    # $subject_line = "μερικές ελληνικές λέξεις";


    MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");

    $msg = MIME::Lite->new (
    From => "\"Commenter\" <no-reply\@example.com>",
    To => 'comments@example.com',
    Type =>'multipart/mixed',
    Subject => encode( 'MIME-Header', $subject_line)
    );

    $body = "Comments: $comments";

    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

    $msg->send ();

    # either way, the UTF-8 character input prints in the browser

    print "Content-type: text/html\n\n";
    print $comments;

    ------------- comment.html ------------

    <form ENCTYPE="multipart/form-data" method="post" action="comment.cgi">

    <input type="text" name="subject">

    <textarea name="comments"></textarea>

    <input type="submit" name="send" value="Submit">

    </form>


    --------------

    Many thanks for any tips on the correct UTF-8 mail process.

    Tuxedo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Eli the Bearded@21:1/5 to tuxedo@mailinator.net on Wed Sep 15 22:36:26 2021
    In comp.lang.perl.misc, Tuxedo <tuxedo@mailinator.net> wrote:
    --------- commenter.cgi ----------

    #!/usr/bin/perl -w

    use CGI;
    use MIME::Lite;
    use Encode qw(encode encode_utf8 );
    use utf8;

    Versions? I have Perl v5.24.3 handy.

    $query = new CGI;

    $comments = $query->param('comments');

    # If I collect a UTF-8 charset subject line it becomes
    # goobledegook once mailed

    $subject_line = $query->param('subject');

    # but if I define a UTF-8 character string here it works
    # in the subject line of the resulting mail

    # $subject_line = "μερικές ελληνικές λέξεις";

    That's curious. I'd look at what encoding your query string has.

    MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");

    $msg = MIME::Lite->new (
    From => "\"Commenter\" <no-reply\@example.com>",
    To => 'comments@example.com',
    Type =>'multipart/mixed',
    Subject => encode( 'MIME-Header', $subject_line)
    );

    My Encode module does not document a 'MIME-Header' encoding. I use MIME::EncWords for that.

    use MIME::EncWords qw( encode_mimeword );

    ...
    Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')

    Where "B" is for base64 and 'Q' would be 'quoted-printable'.

    $body = "Comments: $comments";

    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

    $msg->send ();

    # either way, utf-8 character input print in the browser

    print "Content-type: text/html\n\n";
    print $comments;


    I also use taint checking on CGI. You'll need to clean up the PATH,
    etc, for that.
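A minimal sketch of the taint-mode cleanup mentioned above, assuming the script is started with `perl -T`. The directory list and the `untaint_address` helper (with its deliberately simple pattern) are illustrative assumptions, not something from the original script.

```perl
use strict;
use warnings;

# Under taint mode (perl -T), Perl refuses to run external programs such
# as sendmail until PATH and a few shell-related variables are made safe.
$ENV{PATH} = '/usr/bin:/bin';
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

# Form input stays tainted until it has passed through a regex capture.
# Hypothetical helper: the pattern only illustrates the mechanism and is
# not a complete email address validator.
sub untaint_address {
    my ($input) = @_;
    return unless defined $input;
    return $input =~ /\A([\w.+-]+\@[\w.-]+)\z/ ? $1 : undef;
}
```

Anything taken from `$query->param(...)` that ends up on a command line or in a file name would need the same capture-based untainting before use.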

    Elijah
    ------
    didn't check versions of the modules

  • From Tuxedo@21:1/5 to Eli the Bearded on Thu Sep 16 07:41:43 2021
    Eli the Bearded wrote:

    In comp.lang.perl.misc, Tuxedo <tuxedo@mailinator.net> wrote:
    --------- commenter.cgi ----------

    #!/usr/bin/perl -w

    use CGI;
    use MIME::Lite;
    use Encode qw(encode encode_utf8 );
    use utf8;

    Versions? I have Perl v5.24.3 handy.

    Perl itself is v5.10.1 and it can't easily be updated on the target machine.

    MIME::Lite is an ancient 1.147 version. The other modules are core modules
    in the Perl 5.10.1 version as far as I know.

    MIME::Lite is in an external directory and I haven't updated it, since that would require updating dependency modules, which I think in turn requires installing various other dependency modules.


    $query = new CGI;

    $comments = $query->param('comments');

    # If I collect a UTF-8 charset subject line it becomes
    # goobledegook once mailed

    $subject_line = $query->param('subject');

    # but if I define a UTF-8 character string here it works
    # in the subject line of the resulting mail

    # $subject_line = "μερικές ελληνικές λέξεις";


    It just reads "some Greek words". Odd that it does not show on your end. My news reader and news posting window are set to UTF-8.

    I try again: μερικές ελληνικές λέξεις. Does anyone see the Greek UTF-8
    characters?

    That's curious. I'd look at what encoding your query string has.

    MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");

    $msg = MIME::Lite->new (
    From => "\"Commenter\" <no-reply\@example.com>",
    To => 'comments@example.com',
    Type =>'multipart/mixed',
    Subject => encode( 'MIME-Header', $subject_line)
    );

    My Encode module does not document a 'MIME-Header' encoding. I use MIME::EncWords for that.

    use MIME::EncWords qw( encode_mimeword );

    ...
    Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')

    Where "B" is for base64 and 'Q' would be 'quoted-printable'.

    Thanks for the tips. I will give MIME::EncWords a try.

    Another tool I've used for something similar is
    Email::MIME::RFC2047::Encoder


    $body = "Comments: $comments";

    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

    $msg->send ();

    # either way, utf-8 character input print in the browser

    print "Content-type: text/html\n\n";
    print $comments;


    I also use taint checking on CGI. You'll need to clean up the PATH,
    etc, for that.

    I'm not sure where the UTF-8 conversion fails in the mail or CGI.


    Elijah
    ------
    didn't check versions of the modules

  • From Otto J. Makela@21:1/5 to Tuxedo on Thu Sep 16 11:40:29 2021
    Tuxedo <tuxedo@mailinator.net> wrote:

    Mime-Version: 1.0
    Content-Type: text/plain; charset="UTF-8"
    Content-Transfer-Encoding: 8Bit
    [...]

    It just reads "some Greek words". Odd that it does not show on your end.
    My news reader and news posting window are set to UTF-8.

    I try again: μερικές ελληνικές λέξεις. Does anyone see the Greek UTF-8
    characters?

    It is correctly formatted, and does show correctly here.
    --
    /* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
    /* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
    /* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
    /* * * Computers Rule 01001111 01001011 * * * * * * */

  • From Otto J. Makela@21:1/5 to Tuxedo on Thu Sep 16 11:47:54 2021
    Tuxedo <tuxedo@mailinator.net> wrote:

    Mime-Version: 1.0
    Content-Type: text/plain; charset="UTF-8"
    Content-Transfer-Encoding: 8Bit
    User-Agent: KNode/4.14.10
    [...]
    # $subject_line = "μερικές ελληνικές λέξεις";

    Eli the Bearded <*@eli.users.panix.com> wrote:

    Mime-Version: 1.0
    Content-Type: text/plain; charset="UTF-8"
    User-Agent: Vectrex rn 2.1 (beta)
    [...]
    In comp.lang.perl.misc, Tuxedo <tuxedo@mailinator.net> wrote:
    # $subject_line = "μεÏικές ελληνικές λέξεις";

    That's curious. I'd look at what encoding your query string has.

    I saw Tuxedo's UTF-8 characters correctly, it seems somewhere on
    the way to you their encoding was borken?

    --
    /* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
    /* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
    /* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
    /* * * Computers Rule 01001111 01001011 * * * * * * */

  • From Eli the Bearded@21:1/5 to tuxedo@mailinator.net on Thu Sep 16 19:09:52 2021
    In comp.lang.perl.misc, Tuxedo <tuxedo@mailinator.net> wrote:
    It just reads "some Greek words". Odd that it does not show on your end. My news reader and news posting window are set to UTF-8.

    I saw the Greek originally, but I had an editor hiccup that clearly
    screwed that up. Sorry. That's also why I picked B instead of Q for
    the encoding. Q is best suited for mostly ASCII content like French
    or German.

    $msg = MIME::Lite->new (
    From => "\"Commenter\" <no-reply\@example.com>",
    To => 'comments@example.com',
    Type =>'multipart/mixed',
    Subject => encode( 'MIME-Header', $subject_line)
    );

    My Encode module does not document a 'MIME-Header' encoding. I use MIME::EncWords for that.

    use MIME::EncWords qw( encode_mimeword );

    ...
    Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')

    Where "B" is for base64 and 'Q' would be 'quoted-printable'.

    Thanks for the tips. I will give MIME::EncWords a try.

    Another tool I've used for something similar is Email::MIME::RFC2047::Encoder

    I don't know that one, but from the name it's doing the same thing.
    RFC-2047 defines "MIME encoded words" for putting non-ASCII content
    into 7-bit clean mail headers.

    I'm not sure where the UTF-8 conversion fails in the mail or CGI.

    Try adding some logging. Sometimes for CGI stuff I find it easiest
    to open my own log file and write to that.
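A minimal sketch of that kind of logging, assuming a writable log path chosen by the script author. The `log_value` helper is hypothetical; it records whether Perl sees the value as characters or as raw bytes, plus a hex dump, which is usually enough to spot where an encoding step went wrong.

```perl
use strict;
use warnings;

# Hypothetical helper: append one line per value to a private log file,
# showing Perl's internal UTF-8 flag and the bytes in hex.
sub log_value {
    my ($logfile, $label, $value) = @_;
    open my $fh, '>>', $logfile or die "Cannot open $logfile: $!";
    my $copy = $value;
    my $kind = utf8::is_utf8($copy) ? 'chars' : 'bytes';
    # Character strings are turned into UTF-8 bytes before dumping.
    utf8::encode($copy) if utf8::is_utf8($copy);
    my $hex = join ' ', map { sprintf('%02x', ord($_)) } split //, $copy;
    printf {$fh} "%s [%s] %s\n", $label, $kind, $hex;
    close $fh;
}
```

Calling it once on the value straight from `$query->param('subject')` and once on the hard-coded string would show whether the two really hold the same data.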

    I see from other follow-ups this is Perl 5.10.x. I have a 5.10.1 here,
    and I tried the code, but I don't have MIME::Lite or MIME::EncWords
    for that install.

    Elijah
    ------
    only willing to try so hard to duplicate an environment

  • From Eric Pozharski@21:1/5 to Tuxedo on Fri Sep 17 09:24:12 2021
    with <shth5p$caq$1@solani.org> Tuxedo wrote:
    Hello,

    How can I process the input of an HTML form in UTF-8 with CGI and pass it through a MIME-Lite's sending procedure intact?

    *SKIP*
    use Encode qw(encode encode_utf8 );
    use utf8;

    $query = new CGI;

    $comments = $query->param('comments');

    # If I collect a UTF-8 charset subject line it becomes
    # goobledegook once mailed

    $subject_line = $query->param('subject');

    # but if I define a UTF-8 character string here it works
    # in the subject line of the resulting mail

    # $subject_line = "μερικές ελληνικές λέξεις";

    This suggests (because 'use utf8') MIME-Lite is fine. Anyway,


    MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");

    $msg = MIME::Lite->new (
    From => "\"Commenter\" <no-reply\@example.com>",
    To => 'comments@example.com',
    Type =>'multipart/mixed',
    Subject => encode( 'MIME-Header', $subject_line)

    Wow! Encode can do 'MIME-Header'?! I see the light!

    *SKIP*
    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

    That looks like copy-paste, but "carset"?

    Anyway, as you see for yourself: if you pass non-Latin-1 contents
    properly stored in Perl's internal encoding (due to 'use utf8') to
    MIME-Lite (which is aware of Perl's internal encoding, apparently) you are
    fine. I don't remember 5.10 now (and digging through Changes isn't
    feasible); *if* your Perl were younger (like 5.14) I'd suggest replacing 'use
    utf8' with 'use feature qw/ unicode_strings /' instead (but that might
    not be an option).

    Anyway, I suggest (unless you absolutely need 'use utf8' for something)
    that you drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'. What you need
    is *decoding* the strings that come out of CGI.pm. Apparently, CGI.pm
    doesn't decode whatever comes from the network; it turns out that you are the one who has
    to do it (the decoding). Better yet, 'use Encode qw/ decode /', figure out
    which encoding came with the request that CGI.pm dealt with, and then
    decode properly (there are more encodings out there than just UTF-8).
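The decode-then-encode order suggested here can be sketched as follows, assuming the browser submitted the form as UTF-8. `header_from_form_bytes` is a hypothetical helper name; the byte string stands in for what `$query->param('subject')` would return without any decoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Turn raw UTF-8 bytes (as CGI.pm returns them by default) into a Perl
# character string first, and only then into an RFC 2047 header value.
sub header_from_form_bytes {
    my ($raw_bytes) = @_;
    my $chars = decode('UTF-8', $raw_bytes);   # bytes -> characters
    return encode('MIME-Header', $chars);      # characters -> encoded words
}

my $raw    = "\xce\xbc\xce\xb5";   # the bytes a browser sends for the two letters mu, epsilon
my $header = header_from_form_bytes($raw);
print "$header\n";
```

CGI.pm also documents a `-utf8` import flag that is meant to do this decoding for you, with caveats around file uploads; the sketch does the step explicitly so it is visible.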

    *CUT*

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom

  • From Tuxedo@21:1/5 to Eli the Bearded on Fri Sep 17 16:56:47 2021
    Thanks for the tips and feedback.

    I am trying to generate a message in UTF-8 as submitted via a form and pass it through a mail sending procedure to my own address. UTF-8 may appear in the mail headers (the name part in From, To and Subject) and anywhere in the message body.

    UTF-8 may also be needed in the (From) email address part to be compatible
    with IDN strings. I think that input will need to be converted into Punycode domain representations to work with any email module and to pass
    email address syntax checking.

    While returning valid UTF-8 via CGI and into a browser is simple, I'm not
    sure which mail sending module may best serve the purpose.

    After all, MIME::Lite is deprecated. Alternatively, I've used Mail::Sender
    for a different application in the past, but that is now also deprecated.
    Both offer easy ways to include inline attachments, HTML and plain text alternatives in case it will be needed. I will try with Mail::Sender unless someone has another recommendation.

    Tuxedo

  • From Tuxedo@21:1/5 to Eric Pozharski on Fri Sep 17 17:16:54 2021
    Eric Pozharski wrote:

    ...


    *SKIP*
    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

    That looks like copy-paste, but "carset"?

    I'm not sure where I got that from but yes, it's likely copy-paste :-)


    Anyway, as you see for yourself: if you pass non-Latin-1 contents
    properly stored in Perl's internal encoding (due to 'use utf8') to
    MIME-Lite (which is aware of Perl's internal encoding, apparently) you are
    fine. I don't remember 5.10 now (and digging through Changes isn't
    feasible); *if* your Perl were younger (like 5.14) I'd suggest replacing 'use
    utf8' with 'use feature qw/ unicode_strings /' instead (but that might
    not be an option).

    Anyway, I suggest (unless you absolutely need 'use utf8' for something)
    that you drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'. What you need
    is *decoding* the strings that come out of CGI.pm. Apparently, CGI.pm
    doesn't decode whatever comes from the network; it turns out that you are the one who has
    to do it (the decoding). Better yet, 'use Encode qw/ decode /', figure out
    which encoding came with the request that CGI.pm dealt with, and then
    decode properly (there are more encodings out there than just UTF-8).


    Thanks for the above comments. I think they also highlight compatibility
    issues I have with some other applications where I've so-far resorted to
    using HTML entities as a cumbersome workaround to CGI generated HTML output.

    Tuxedo

  • From Grant Taylor@21:1/5 to Tuxedo on Fri Sep 17 09:44:54 2021
    On 9/15/21 2:19 PM, Tuxedo wrote:
    I would like all contents, including mailheaders (Subject, Reply-to
    and From headers to be UTF-8 compatible. So far, I only managed
    to print a form's input to the browser but not encode it correctly
    through the mail procedure.

    I'm late to the party, but I wanted to add the following comment:

    Email headers use different (and I believe incompatible) encoding than
    the MIME body of the email.

    I'd have to go back and (re)read the pertinent RFCs for how to correctly
    encode non-ASCII characters in email headers. But I'm quite certain
    that traditional MIME encoding methods will /not/ work.



    --
    Grant. . . .
    unix || die

  • From Tuxedo@21:1/5 to Grant Taylor on Fri Sep 17 20:45:38 2021
    Grant Taylor wrote:

    On 9/15/21 2:19 PM, Tuxedo wrote:
    I would like all contents, including mailheaders (Subject, Reply-to
    and From headers to be UTF-8 compatible. So far, I only managed
    to print a form's input to the browser but not encode it correctly
    through the mail procedure.

    I'm late to the party, but I wanted to add the following comment:

    Email headers use different (and I believe incompatible) encoding than
    the MIME body of the email.

    I just managed to submit data through CGI and send it through MIME::Lite
    with UTF-8 intact, but so far not including the subject line of the mail message. Thank you for pointing this out.
    For other parts, I realise my previous mistake was simply forgetting to
    declare utf-8 above the HTML form:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    I did not show the full code of comments.html in my original post so no one could have known.


    I'd have to go back and (re)read the pertinent RFCs for how to correctly encode non-ASCII characters in email headers. But I'm quite certain
    that traditional MIME encoding methods will /not/ work.

    I will test with Email::MIME::RFC2047::Encoder

    Tuxedo

  • From Eli the Bearded@21:1/5 to gtaylor@tnetconsulting.net on Fri Sep 17 19:00:52 2021
    In comp.lang.perl.misc, Grant Taylor <gtaylor@tnetconsulting.net> wrote:
    I'm late to the party, but I wanted to add the following comment:

    Email headers use different (and I believe incompatible) encoding than
    the MIME body of the email.

    RFC2047 MIME "encoded words". It's very close to regular MIME encoding,
    but not 100% the same. Each word looks like:

    =?${charset}?${encoding}?${encoded_bit}?=

    $charset = 'utf-8'; # this and next are case insensitive
    $encoding = 'B'; # base64 (easier) or 'Q' for quasi-quoted printable
    $encoded_bit = base64(encode($charset, $string));

    I'd have to go back and (re)read the pertinent RFCs for how to correctly encode non-ASCII characters in email headers. But I'm quite certain
    that traditional MIME encoding methods will /not/ work.

    My off-the-top-of-my-head recollection is that encoded words add special rules for whitespace in quoted-printable and have some added rules about what
    whitespace between each word means. It's enough of a nuisance that I
    prefer to use a module rather than roll my own from the lower MIME
    encoding parts.
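For illustration only, the pseudocode above can be written out with core modules. This is a sketch that builds a single 'B' encoded word and ignores the length limit and word-splitting rules of RFC 2047, which is exactly the part a module handles for you. `encoded_word` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Build one 'B' encoded word by hand: =?charset?B?base64-of-bytes?=
# No folding or splitting into shorter words is attempted here.
sub encoded_word {
    my ($charset, $string) = @_;
    my $bytes = encode($charset, $string);            # characters -> bytes
    return sprintf '=?%s?B?%s?=', $charset, encode_base64($bytes, '');
}

# U+03BC U+03B5 are the Greek letters mu and epsilon.
print encoded_word('UTF-8', "\x{3bc}\x{3b5}"), "\n";   # =?UTF-8?B?zrzOtQ==?=
```

The second argument to `encode_base64` is the line ending; passing an empty string keeps the output on one line, which matters inside a header.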

    Actual in-the-wild Subject: nasty example I have saved in comments:

    =?utf-8?b?QXR0ZW50aW9uISBJbXBvcnRhbnQgUGFyZW50IEFubm91bmNlbWVudHMgZm9yIHRoZSB3ZWVrIG9mIE5vdmVtYmVyIDE0LTE5dGguIApQbGVhc2UgUmVhZCBDYXJlZnVsbHkh?=

    You might think, oh, it's nasty because it's plain ASCII that's been
    base64 encoded to make it unreadable to non-MIME aware readers (eg,
    grep with a procmail.log file) or because it is so long instead of
    being broken up in to several shorter MIME encoded words. No, those
    are reasons it's ugly, not nasty. The nasty bit is there's a new line in
    there, so if you decode it as is, you can break message header parsing
    that assumes properly formatted continued lines.
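A minimal sketch of the whitespace normalisation hinted at here, assuming a single 'B' encoded word as input. `decode_b_word_safely` is a hypothetical helper; it returns the decoded bytes without interpreting the charset, and collapses embedded newlines and other whitespace into single spaces.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Decode the payload of one 'B' encoded word and collapse any run of
# whitespace (including embedded newlines) into a single space.
# Returns nothing if the input is not a 'B' encoded word.
sub decode_b_word_safely {
    my ($word) = @_;
    my ($payload) = $word =~ /\A=\?[^?]+\?[Bb]\?([^?]*)\?=\z/
        or return;
    my $text = decode_base64($payload);
    $text =~ s/\s+/ /g;
    return $text;
}
```

Run against the in-the-wild subject above, this would yield one line of text instead of two, so a later header parser never sees the stray newline.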

    Elijah
    ------
    now normalizes whitespace

  • From Eric Pozharski@21:1/5 to Tuxedo on Sat Sep 18 12:11:21 2021
    with <si2876$1on$1@solani.org> Tuxedo wrote:
    Eric Pozharski wrote:

    vvvvvv
    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
    ^^^^^^
    That looks like copy-paste, but "carset"?
    I'm not sure where I got that from but yes, it's likely copy-paste :-)

    What about "carset" then?

    *CUT*

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom

  • From Tuxedo@21:1/5 to Eric Pozharski on Sun Sep 19 11:44:45 2021
    Eric Pozharski wrote:

    with <si2876$1on$1@solani.org> Tuxedo wrote:
    Eric Pozharski wrote:

    vvvvvv
    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
    ^^^^^^
    That looks like copy-paste, but "carset"?
    I'm not sure where I got that from but yes, it's likely copy-paste :-)

    What about "carset" then?

    I'm not sure what you mean?

    *CUT*

    Meanwhile, I tested a different sending procedure instead of MIME-Lite, namely Mail::Sender, but I have the same difficulty with UTF-8 in email transmission for data going through CGI.

    I can however transmit a string intact via mail if it's hard-coded in the
    perl script:

    use Mail::Sender;
    use utf8;

    $subject = "μερικές ελληνικές λέξεις";

    my $sender = new Mail::Sender;

    from => $from_email,
    to => $to_email,
    subject => $subject,
    charset => 'utf-8',
    });

    $sender->Close();

    But if passed through a CGI form, like this:

    use CGI;
    use utf8;
    use Email::MIME::RFC2047::Encoder;

    $subject = $query->param('subject');

    my $utf8_subject_encoder = Email::MIME::RFC2047::Encoder->new;
    my $utf8_encoded_subject = $utf8_subject_encoder->encode_text($subject);

    from => $from_email,
    to => $to_email,
    subject => $utf8_encoded_subject,
    charset => 'utf-8',
    });

    $sender->Close();

    ... the subject will show something like follows in a resulting email
    subject line:

    μερικές ελληνικές λέξεις


    The form collecting the "μερικές ελληνικές λέξεις" string uses <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    And the proper "μερικές ελληνικές λέξεις" will print fine on the output of the
    CGI generated HTML result page after being passed through a form.

    The output page has <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    It just won't mail for some mysterious reason, maybe relating to CGI.

    "use Email::MIME::RFC2047::Encoder;" is meant to encode for email headers, as far as I understand.

    Yet, I can pass "μερικές ελληνικές λέξεις" into the subject line of an
    email without the Encoder procedure, as long as I declare 'use utf8;' at the top of the script. As said, it only works
    to email intact if the string is literally coded into the perl script and not passed as a variable through CGI.

    The correct UTF-8 characters will display fine on a CGI result page whether hard-coded in the script or passed through a form.

    The result was the same with MIME-Lite, so it's not the mailer that's the issue. I'm not sure exactly what is.

    Tuxedo

  • From Tuxedo@21:1/5 to Tuxedo on Sun Sep 19 18:16:25 2021
    My issue can be reduced to a difference in the submitted form data compared with the fixed typed-in string in my perl code, although both flavors of
    UTF-8 characters appear identical in a browser window through Perl.

    One works to email and the other does not. For example, I test with a simple HTML form submit:

    <!DOCTYPE html>
    <html><head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>

    <body>

    <form ENCTYPE="multipart/form-data" method="post" action="compare.pl">

    <input type="text" name="subject" size="30" value="μερικές ελληνικές
    λέξεις">

    <input type="submit" value="Submit">

    </form>
    </body>
    </html>

    And it's submitted to the following compare.pl script:

    #!/usr/bin/perl -w

    use CGI;
    use utf8;
    use Email::MIME::RFC2047::Encoder;

    my $fixed_subject;


    # Only the following passes through an email
    # subject intact:

    $fixed_subject = "μερικές ελληνικές λέξεις";

    my $query = new CGI;

    # This value will display correctly in a web browser
    # but not after having been sent in a subject
    # line of an email via Mime-Lite or other:

    my $submitted_subject = $query->param('subject');


    # The following $utf8_encoded_submitted_subject will not display correctly
    # in a browser or email subject line:

    my $utf8_submitted_subject_encoder = Email::MIME::RFC2047::Encoder->new;
    my $utf8_encoded_submitted_subject = $utf8_submitted_subject_encoder->encode_text($submitted_subject);

    print "Content-type: text/html\n\n";
    print "<!DOCTYPE html>\n";
    print "<html><head>\n";
    print "<title>Compare</title>\n";
    print "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n";
    print "</head>\n";
    print "<body>\n";
    print "\$fixed_subject: $fixed_subject\n";
    print "<hr>";
    print "\$submitted_subject: $submitted_subject\n";
    print "<hr>";
    print "\$utf8_encoded_submitted_subject: $utf8_encoded_submitted_subject\n";

    print "</body></html>\n";


    I leave out the email code here but, as said, the $fixed_subject typed
    directly into the perl code works in a subject line of a mail transmission through MIME::Lite or Mail::Sender, while the $submitted_subject that was collected as a form value through CGI does not.

    What exactly happens to $submitted_subject in the process, and how can it
    be made identical to the $fixed_subject string?

    In a browser, the $fixed_subject prints as:
    μερικές ελληνικές λέξεις

    And the $submitted_subject prints the same:
    μερικές ελληνικές λέξεις

    The $utf8_encoded_submitted_subject prints as:

    =?utf-8?Q?=c3=8e=c2=bc=c3=8e=c2=b5=c3=8f=c2=81=c3=8e=c2=b9=c3=8e=c2=ba?= =?utf-8?Q?=c3=8e=c2=ad=c3=8f=c2=82_=c3=8e=c2=b5=c3=8e=c2=bb=c3=8e=c2=bb?= =?utf-8?Q?=c3=8e=c2=b7=c3=8e=c2=bd=c3=8e=c2=b9=c3=8e=c2=ba=c3=8e=c2=ad?= =?utf-8?Q?=c3=8f=c2=82_=c3=8e=c2=bb=c3=8e=c2=ad=c3=8e=c2=be=c3=8e=c2=b5?= =?utf-8?Q?=c3=8e=c2=b9=c3=8f=c2=82?=

    If I send the "μερικές ελληνικές λέξεις" characters in the subject of an
    email using Thunderbird, they display fine in the email program.
    Thunderbird's specific subject line source code appears as follows:

    =?UTF-8?B?zrzOtc+BzrnOus6tz4IgzrXOu867zrfOvc65zrrOrc+CIM67zq3Ovs61?=
    =?UTF-8?B?zrnPgg==?=

    The source of the $fixed_subject line of the perl generated mail looks as follows:

    =?utf-8?Q?=ce=bc=ce=b5=cf=81=ce=b9=ce=ba=ce=ad=cf=82_=ce=b5=ce=bb=ce=bb?= =?utf-8?Q?=ce=b7=ce=bd=ce=b9=ce=ba=ce=ad=cf=82_=ce=bb=ce=ad=ce=be=ce=b5?= =?utf-8?Q?=ce=b9=cf=82?=

    So the $fixed_subject displays fine. How can the $submitted_subject string
    be made or kept identical? After all, it's the same set of
    characters, but with a somewhat different encoding or copying in perl, I guess.
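The two header dumps above differ in a recognisable way: where the working subject has "=ce=bc", the broken one has "=c3=8e=c2=bc", which is what results when the bytes CE BC are each treated as a Latin-1 character and encoded to UTF-8 a second time. The following is a minimal sketch of that effect using only the Encode module, with a byte string standing in for the undecoded form value.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render a byte string as space-separated hex for comparison.
sub hex_bytes { join ' ', map { sprintf('%02x', ord($_)) } split //, $_[0] }

my $from_form = "\xce\xbc";   # raw UTF-8 bytes for the Greek letter mu

# Encoding the undecoded bytes treats each byte as its own character.
my $double = encode('UTF-8', $from_form);
# Decoding first yields one character, which then encodes cleanly.
my $single = encode('UTF-8', decode('UTF-8', $from_form));

print hex_bytes($double), "\n";   # c3 8e c2 bc
print hex_bytes($single), "\n";   # ce bc
```

If this is indeed the cause, decoding the value from `$query->param('subject')` before handing it to the header encoder should make it behave like the hard-coded string.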

    Thanks in advance for any suggestions.

    Tuxedo

  • From Tuxedo@21:1/5 to Tuxedo on Sun Sep 19 19:42:07 2021
    Tuxedo wrote:

    Tuxedo wrote:

    Eric Pozharski wrote:

    with <si2876$1on$1@solani.org> Tuxedo wrote:
    Eric Pozharski wrote:

    vvvvvv
    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
    ^^^^^^
    That looks like copy-paste, but "carset"?
    I'm not sure where I got that from but yes, it's likely copy-paste :-)

    What about "carset" then?

    I'm not sure what you mean?

    *CUT*

    Meanwhile, I tested a different sending procedure instead of MIME-Lite,
    namely Mail::Sender, but have the same difficulty with UTF-8 for email
    transmission for data going through CGI.

    I can however transmit a string intact via mail if it's hard-coded in the
    perl script:

    use Mail::Sender;
    use utf8;

    $subject = "μερικές ελληνικές λέξεις";

    my $sender = new Mail::Sender;

    from => $from_email,
    to => $to_email,
    subject => $subject,
    charset => 'utf-8',
    });

    $sender->Close();

    But if passed through a CGI form, like this:

    use CGI;
    use utf8;
    use Email::MIME::RFC2047::Encoder;

    $subject = $query->param('subject');

    my $utf8_subject_encoder = Email::MIME::RFC2047::Encoder->new;
    my $utf8_encoded_subject = $utf8_subject_encoder->encode_text($subject);

    from => $from_email,
    to => $to_email,
    subject => $utf8_encoded_subject,
    charset => 'utf-8',
    });

    $sender->Close();

    ... the subject will show something like follows in a resulting email
    subject line:

    μερικές ελληνικές λέξεις


    The form collecting the "μερικές ελληνικές λέξεις" string uses
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    And the proper "μερικές ελληνικές λέξεις" will print fine on the output of
    the CGI generated HTML result page after being passed through a form.

    The output page has <meta http-equiv="Content-Type" content="text/html;
    charset=utf-8">

    It just won't mail for some mysterious reason, maybe relating to CGI.

    'use Email::MIME::RFC2047::Encoder;' is meant to encode text for email
    headers as far as I understand.

    Yet, I can pass "μερικές ελληνικές λέξεις" into the subject line of an
    email without the Encoder procedure, as long as I declare 'use utf8;' at
    the top of the script. As said, the string emails intact only if it is
    coded literally into the perl script and not passed as a variable
    through CGI.

    The correct UTF-8 characters will display fine on a CGI result page
    whether hard-coded in the script or passed through a form.

    The result was the same with MIME-Lite, so it's not the mailer that's the
    issue. I'm not sure exactly what is.

    Tuxedo

    My issue can be reduced to a difference in the submitted form data
    compared with the fixed typed-in string in my perl code, although both flavors of UTF-8 characters appear identical in a browser window through Perl.

    One works to email and the other does not. For example, I test with a
    simple HTML form submit:

    <!DOCTYPE html>
    <html><head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>

    <body>

    <form ENCTYPE="multipart/form-data" method="post" action="compare.pl">

    <input type="text" name="subject" size="30" value="μερικές ελληνικές λέξεις">

    <input type="submit" value="Submit">

    </form>
    </body>
    </html>

    And it's submitted to the following compare.pl script:

    #!/usr/bin/perl -w

    use CGI;
    use utf8;
    use Email::MIME::RFC2047::Encoder;

    my $fixed_subject;


    # Only the following passes through an email
    # subject line intact:

    $fixed_subject = "μερικές ελληνικές λέξεις";

    my $query = new CGI;

    # This value will display correctly in a web browser
    # but not after having been sent in a subject
    # line of an email via Mime-Lite or other:

    my $submitted_subject = $query->param('subject');


    # The following $utf8_encoded_submitted_subject will not display correctly
    # in a browser or email subject line:

    my $utf8_submitted_subject_encoder = Email::MIME::RFC2047::Encoder->new;
    my $utf8_encoded_submitted_subject =
        $utf8_submitted_subject_encoder->encode_text($submitted_subject);

    print "Content-type: text/html\n\n";
    print "<!DOCTYPE html>\n";
    print "<html><head>\n";
    print "<title>Compare</title>\n";
    print "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n";
    print "</head>\n";
    print "<body>\n";
    print "\$fixed_subject: $fixed_subject\n";
    print "<hr>";
    print "\$submitted_subject: $submitted_subject\n";
    print "<hr>";
    print "\$utf8_encoded_submitted_subject: $utf8_encoded_submitted_subject\n";

    print "</body></html>\n";


    I leave out the email code here but as said the $fixed_subject typed
    directly into the perl code works in a subject line of a mail transmission through Mime::Lite or Mail::Sender while the $submitted_subject that was corrected as a form value through CGI does not.

    I meant ... the $submitted_subject that was *submitted* as a form value
    through CGI does not.

    What exactly happens to $submitted_subject in the process and how can
    it be made identical to the $fixed_subject string?

    As Eric Pozharski pointed out earlier, I think it's necessary to decode
    what comes through CGI and re-encode it correctly for use in email headers
    and perhaps somewhat differently for the email body.

    How can this be done with the single form submit field CGI example and with
    a fairly old Perl version (5.10.1) and its relatively old set of modules?
    Maybe with 'use Encode qw / decode_utf8 /'.

    Thanks again for any suggestions and example code bits if possible.

    Tuxedo

  • From Eric Pozharski@21:1/5 to Tuxedo on Sun Sep 19 12:25:08 2021
    with <si6tgg$cq6$1@solani.org> Tuxedo wrote:
    Eric Pozharski wrote:
    with <si2876$1on$1@solani.org> Tuxedo wrote:
    Eric Pozharski wrote:

    vvvvvv
    $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
    ^^^^^^
    That looks like copy-paste, but "carset"?
    I'm not sure where I got that from but yes, it's likely copy-paste
    :-)
    What about "carset" then?
    I'm not sure what you mean?

    OK, follow me on this. If it's copy-paste then it's (likely) your
    running code. In your running code you have this

    Type =>'text/plain; carset=utf-8'

    This 'carset' can't be right.

    *SKIP*
    It just won't mail for some mysterious reason, maybe relating to CGI.
    *SKIP*
    The result was the same with MIME-Lite, so it's not the mailer that's
    the issue. I'm not sure exactly what is.

    Told you so. Now you're back to $square{zero}

    [[ out of order ]]
    $subject = $query->param('subject');

    Just pouring modules and/or pragmas in your code is mad science.
    Please, try this way:

    # use utf8;
    use Encode qw/ decode_utf8 /;
    # ... boilerplate ...
    $subject = decode_utf8( $query->param( 'subject' ));
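
    A minimal sketch of the decode-then-encode order suggested above, without
    CGI.pm so it runs stand-alone. The byte string stands in for what
    $query->param('subject') is assumed to return, and 'MIME-Header' is the
    Encode encoding already used in the first script of this thread; none of
    this is the poster's actual code:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode decode);

# Stand-in for $query->param('subject'): the raw UTF-8 bytes of "μερικές".
my $submitted = "\xce\xbc\xce\xb5\xcf\x81\xce\xb9\xce\xba\xce\xad\xcf\x82";

my $subject = decode_utf8($submitted);          # bytes -> characters
my $header  = encode('MIME-Header', $subject);  # characters -> RFC 2047 encoded-word

print length($submitted), " bytes, ", length($subject), " characters\n";  # 14 bytes, 7 characters
print "Subject: $header\n";
```

    Decoding the header again with decode('MIME-Header', $header) should give
    back the same seven characters, which is a quick way to check the result
    before mailing it.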

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom

  • From Tuxedo@21:1/5 to Eric Pozharski on Mon Sep 20 06:27:42 2021
    Eric Pozharski wrote:

    ...

    That looks like copy-paste, but "carset"?

    Anyway, as you see for yourself: if you pass non-latin1 contents
    properly stored in Perl's internal encoding (due to 'use utf8') to
    MIME-Lite (which is Perl's internal encoding aware, apparently) you are
    fine. I don't remember 5.10 now (and digging through Changes isn't
    feasible), *if* your perl were younger (like 5.14) I'd suggest to replace
    'use utf8' with 'use feature qw/ unicode_strings /' instead (but it might
    be not an option).

    Anyway, I suggest, (unless you absolutely need 'use utf8' for something)
    drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'.

    If I drop 'use utf8;' and replace it with:

    use Encode qw/ decode_utf8 /;

    The fixed characters that were typed in directly in the perl script ("μερικές ελληνικές λέξεις") become "μερικές ελληνικές λέ
    ξεις" when passed through the email procedure in a subject line while
    the source code of the subject line in the resulting email appears as
    follows:

    Subject: =?utf-8?Q?=C3=8E=C2=BC=C3=8E=C2=B5=C3=8F=C2=81=C3=8E=C2=B9=C3=8E=C2=BA=C3=8E=C2=AD=C3=8F=C2=82=20?==?utf-8?Q?=C3=8E=C2=B5=C3=8E=C2=BB=C3=8E=C2=BB=C3=8E=C2=B7=C3=8E=C2=BD=C3=8E=C2=B9=C3=8E=C2=BA=C3=8E=C2=AD=C3=8F=C2=82=20?==?utf-8?Q?=C3=8E=C2=BB=C3=8E=C2=AD=C3=8E=C2=BE=C3=8E=C2=B5=C3=8E=C2=B9=C3=8F=C2=82?=

    As for the form-submitted "μερικές ελληνικές λέξεις", it appears
    the same (""μερικές ελληνικές λέξεις") when 'use
    Encode qw/ decode_utf8 /;' is in place, and as above in the source.

    What can be done to properly decode/encode user-submitted UTF-8 data in a
    way that the data can be the same as if typed directly in the perl code so
    it can pass through email?

    The:
    use feature qw/ unicode_strings /;
    ... caused an error on the perl version I have.

    What you need
    is *decoding* strings that come out of CGI.pm. Apparently, CGI.pm
    doesn't decode whatever comes from network, turns out that's you who has
    to do it (decoding). Better yet, 'use Encode qw/ decode /', figure out
    what encoding was with the request that CGI.pm dealt with and then
    decode properly (there are more encodings outside than just UTF-8).

    How exactly can 'use Encode qw/ decode /' figure out in perl what encoding
    was used when it's user-submitted via CGI.pm? It can be any set of UTF-8
    characters. On the HTML form I define:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    Thanks,
    Tuxedo

  • From Tuxedo@21:1/5 to Tuxedo on Mon Sep 20 14:46:52 2021
    Tuxedo wrote:

    ...


    What you need
    is *decoding* strings that come out of CGI.pm. Apparently, CGI.pm
    doesn't decode whatever comes from network, turns out that's you who has
    to do it (decoding). Better yet, 'use Encode qw/ decode /', figure out
    what encoding was with the request that CGI.pm dealt with and then
    decode properly (there are more encodings outside than just UTF-8).

    How exactly can 'use Encode qw/ decode /' figure out in perl what encoding
    was used when it's user-submitted via CGI.pm? It can be any set of UTF-8
    characters. On the HTML form I define:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    Thanks,
    Tuxedo

    It was indeed CGI that changed things.

    I finally got the form input to arrive by email with its UTF-8 intact
    by adding '-utf8' to the CGI call.

    use CGI '-utf8';

    This made the Greek words work in both mail (in headers and body) and when generated onto a results page on a web browser.

    But for some reason, that also turned French accented characters into little black symbols with question marks in a web browser (although the email
    result worked, both in headers and body).

    Maybe it doesn't display the same for everyone here but it's basically just
    the "ç", as with any caractères accentués :

    Quelques mots fran�ais

    But if within the perl script I then declare:

    binmode(STDOUT, ':utf8');

    Now the Greek words work, the French accented words work, the €-sign and hopefully more or less everything else.
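
    A likely explanation for the black question marks, sketched below: once
    the parameters are character strings, a string whose characters all fit in
    Latin-1 (such as "français") is written to a handle without an encoding
    layer as single bytes, and the lone byte e7 is not valid UTF-8 for the
    browser. With the ':utf8' layer the same character goes out as c3 a7. The
    helper bytes_written is invented here and writes to an in-memory
    filehandle instead of STDOUT:

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through a handle opened with the given
# layer ('' for none) and return the bytes that were written, as hex pairs.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

my $c_cedilla = "\x{e7}";   # the character "ç"

print bytes_written($c_cedilla, ''),      "\n";   # e7
print bytes_written($c_cedilla, ':utf8'), "\n";   # c3 a7
```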

    Thanks for everyone's comments, which all in all led me in the right
    direction.

    Tuxedo

  • From Eric Pozharski@21:1/5 to Tuxedo on Mon Sep 20 12:53:24 2021
    with <si8v9s$lka$1@solani.org> Tuxedo wrote:
    Eric Pozharski wrote:

    *SKIP*
    Anyway, as you see for yourself: if you pass non-latin1 contents
    properly stored in Perl's internal encoding (due to 'use utf8') to
    MIME-Lite (which is Perl's internal encoding aware, apparently) you
    are fine. I don't remember 5.10 now (and digging through Changes
    isn't feasible), *if* your perl were younger (like 5.14) I'd suggest to
    replace 'use utf8' with 'use feature qw/ unicode_strings /' instead
    (but it might be not an option).

    Anyway, I suggest, (unless you absolutely need 'use utf8' for
    something) drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'.

    If I drop 'use utf8;' and replace it with:
    use Encode qw/ decode_utf8 /;
    *SKIP*

    I have a disturbing feeling that you don't realise an important
    distinction. 'utf8.pm' (per 'use utf8;') is a *pragma* (so are
    'strict.pm', 'feature.pm', 'bytes.pm' and so on). Pragmas alter the
    behaviour of perl when compiling *your* script. As you have observed
    yourself: no 'use utf8', and now your fancy strings (in *your* script!)
    result in garbage.

    'Encode.pm' is a *module* (for purists, yes, calling 'Encode.pm' a
    "module" is a stretch and huge one). A module is just an addition to
    your toolbox -- no more, no less. A hammer (or drill, or 3D-printer, or nuclear reactor) without application will patiently sit where you've put
    it (until it rots). *Without* application.

    (I expect it to be rush, but whatever) Throwing random shit on your
    screen is not a way to go through life.

    What can be done to properly decode/encode user-submitted UTF-8 data
    in a way that the data can be the same as if typed directly in the
    perl code so it can pass through email?

    'decode_utf8' *must* be applied to whatever strings of *bytes* (or
    'octets' might be more relevant) are taken out of 'CGI.pm' to make them
    strings of *characters*. Bytes are not characters, characters are not
    bytes -- it's a Perl thing.

    The: use feature qw/ unicode_strings /; ... caused an error on the
    perl version I have.

    So my memories aren't faulty (in regard of 'unicode_strings').

    What you need is *decoding* strings that come out of CGI.pm.
    Apparently, CGI.pm doesn't decode whatever comes from network, turns
    out that's you who has to do it (decoding). Better yet, 'use Encode
    qw/ decode /', figure out what encoding was with the request that
    CGI.pm dealt with and then decode properly (there are more encodings
    outside than just UTF-8).

    How exactly can 'use Encode qw/ decode /' figure out in perl what
    encoding was used when it's user-submitted via CGI.pm? It can be any set
    of UTF-8 characters. On the HTML form I define:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    Per "$string = decode(ENCODING, $octets [, CHECK])", 'decode' can't, it
    must be told what encoding "$octets" are in. 'decode_utf8' is already
    told the encoding is utf-8, and returns the decoded character string.

    As for "http-equiv", it might be an improvement, I guess. Like, a
    permissive application takes whatever the remote (through 'CGI.pm') has
    sent, asks 'CGI.pm' what encoding the remote suggests, and decodes
    appropriately. A repressive application tells the remote to send utf-8,
    decodes, and if decoding fails throws the input away (yup, decoding
    might fail).
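
    The "repressive" variant could look roughly like the sketch below. It is
    only an illustration: strict_decode_utf8 is a name invented here, and the
    byte strings stand in for values that would come out of 'CGI.pm'.
    'FB_CROAK' makes 'decode' die on malformed input, and 'LEAVE_SRC' keeps it
    from touching the source octets:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded character string,
# or undef if $octets is not valid UTF-8.
sub strict_decode_utf8 {
    my ($octets) = @_;
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text;
}

my $good = strict_decode_utf8("fran\xc3\xa7ais");   # valid UTF-8 bytes
my $bad  = strict_decode_utf8("fran\xe7ais");       # Latin-1 byte, invalid as UTF-8

print defined $good ? "accepted\n" : "rejected\n";   # accepted
print defined $bad  ? "accepted\n" : "rejected\n";   # rejected
```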

    Still, 'CGI.pm' won't decode your inputs automagically.

    p.s. Also, I'm in no way a 'CGI.pm' expert, but my perldoc-fu is good
    enough,.. But I'd rather exercise restraint ;)

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
