• Printing UTF8 (Unicode)

    From David Newall@21:1/5 to All on Fri Jan 21 21:56:44 2022
    Copy: glaukon.ariston@gmail.com (Glaukon)

    Hello All,

    I've written some PostScript to allow me to print UTF8-encoded strings:

    (UTF-8 Encoded String.....) utfshow

    I'm happy to send you the full source, or, if appropriate, publish it
    here; however, the exposition below includes everything you should need.

    I use a UTF-8 decoder which was written (in C) by Bjoern Hoehrmann (see http://bjoern.hoehrmann.de/utf-8/decoder/dfa/):

    %/ Copyright (c) 2008-2010 Bjoern Hoehrmann <bjoern@hoehrmann.de>
    %/ See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

    /UTF8_ACCEPT 0 def
    /UTF8_REJECT 12 def

    /utf8d [
    %/ The first part of the table maps bytes to character classes that
    %/ to reduce the size of the transition table and create bitmasks.
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
    7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
    8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    10 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 11 6 6 6 5 8 8 8 8 8 8 8 8 8 8 8

    %/ The second part is a transition table that maps a combination
    %/ of a state of the automaton and a character class to a state.
    0 12 24 36 60 96 84 12 12 12 48 72 12 12 12 12 12 12 12 12 12 12 12 12
    12 0 12 12 12 12 12 0 12 0 12 12 12 24 12 12 12 12 12 24 12 24 12 12
    12 12 12 12 12 12 12 24 12 12 12 12 12 24 12 12 12 12 12 12 12 24 12 12
    12 12 12 12 12 12 12 36 12 36 12 12 12 36 12 12 12 12 12 36 12 36 12 12
    12 36 12 12 12 12 12 12 12 12 12 12
    ] def

    % codep state byte decode codep' state'
    /decode {
        utf8d 1 index get                   % type
        % codep state byte type
        2 index UTF8_ACCEPT ne              % state not UTF8_ACCEPT?
        { exch 16#3F and 4 -1 roll 6 bitshift or }
        { dup neg 16#FF exch bitshift 3 -1 roll and 4 -1 roll pop }
        ifelse                              % state type codep'
        3 1 roll add 256 add utf8d exch get % codep' state'
    } def

    %***************************************************************************/
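
    To see the decoder working on its own, here is a minimal sketch that
    collects the code points of a string into an array. It assumes utf8d,
    UTF8_ACCEPT and decode are defined exactly as above; the name
    utf8codepoints is made up for this illustration and is not part of the
    original code. Malformed input is simply dropped rather than reported.

    ```postscript
    % string utf8codepoints array   (hypothetical helper, illustration only)
    /utf8codepoints {
        [ exch                      % mark string
        0 UTF8_ACCEPT               % mark string codep state
        3 -1 roll                   % mark codep state string
        {                           % mark ... codep state byte
            decode                  % mark ... codep' state'
            dup UTF8_ACCEPT eq {
                % a code point is complete: keep a copy of it
                % underneath the working codep/state pair
                1 index 3 1 roll    % mark ... cp codep' state'
            } if
        } forall
        pop pop                     % drop the working codep and state
        ]
    } def

    % "A", e-acute (C3 A9) and the euro sign (E2 82 AC)
    (A\303\251\342\202\254) utf8codepoints ==    % prints [65 233 8364]
    ```

    Because decode discards the old codep whenever the state is
    UTF8_ACCEPT, the copies left below the working pair are never touched
    by later bytes.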


    I also use a table which Adobe published ("UNICODE translation table for non-ASCII characters"), which they say is for going from a glyph name to
    a Unicode codepoint. I (ab)use it in the reverse direction. I turned
    it into a dictionary keyed on the codepoint.

    The table is currently at https://github.com/adobe-type-tools/agl-aglfn.
    Some codepoints have multiple possible glyph names, so the dictionary
    has an array of potential glyph names for each codepoint. Finally,
    fonts often have glyphs named /uniHHHH, where HHHH is the codepoint.

    I converted the table to PS using awk:

    BEGIN { FS = "[; ]" }
    {
        for (i = 2; i <= NF; i++) {
            if (!($i in h)) { h[$i] = ++n; v[n] = $i }
            g[$i] = g[$i] "/" $1
        }
    }
    END {
        print "/unicode <<"
        for (i = 1; i <= n; i++)
            print "\t16#" v[i] "[" g[v[i]] "/uni" toupper(v[i]) "]"
        print ">> def"
    }

    Adobe's table is turned into this:

    /unicode <<
    16#0041[/A/uni0041]
    16#00C6[/AE/uni00C6]
    ...
    16#305A[/zuhiragana/uni305A]
    16#30BA[/zukatakana/uni30BA]
    >> def

    The crux of printing Unicode code points is to find which of the
    possible glyphs the current font defines. I search currentfont's
    CharStrings.

    % look for one of the glyphs in fontdict's CharStrings
    % [/glyph ...] fontdict chooseglyph /glyph true    (found)
    %                                   false          (not found)
    /chooseglyph {
        /CharStrings get exch   % the glyphs defined in fontdict
        false 3 1 roll          % assume we don't find a glyph
        % false CharStrings [glyphs]
        { 2 copy known { true 4 2 roll exch pop exit } { pop } ifelse } forall
        pop                     % remove CharStrings
    } def

    I've noticed that Symbol sometimes contains glyphs that other fonts
    don't, so, if I don't find a glyph in currentfont I look through Symbol.

    I thought it might be a good idea to also try ZapfDingbats. In
    retrospect, that might be a red herring.

    Adobe also publish a table like the Unicode table, giving the names of
    that font's glyphs. It's at the same place, and converts using the same
    awk:

    /zapf <<
    16#275E[/a100/uni275E]
    16#2761[/a101/uni2761]
    ...
    16#275D[/a99/uni275D]
    16#2720[/a9/uni2720]
    >> def


    This is the code which prints a unicode code point (or .notdef if a
    glyph cannot be found):

    % SPDX-License-Identifier: LGPL-2.1-or-later
    %
    % Copyright (c) 2022 by davidnewall.com. All rights reserved.

    % print a single unicode codepoint:
    % integer unicodeshow -
    /unicodeshow {
        % load array of known glyph names for this code point
        unicode 1 index known
        { unicode exch get }    % array of possible glyphs
        { pop [] }              % unknown code point
        ifelse
        {
            dup currentfont chooseglyph { glyphshow exit } if
            dup /ZapfDingbats findfont chooseglyph {
                currentfont exch /ZapfDingbats currentfontsize selectfont
                glyphshow setfont exit
            } if
            dup /Symbol findfont chooseglyph {
                currentfont exch /Symbol currentfontsize selectfont
                glyphshow setfont exit
            } if
            /.notdef glyphshow exit
        } loop
        pop
    } def


    I get the current font size using this:

    /currentfontsize {
    currentfont dup /OrigFont get
    2 { /FontMatrix get 3 get exch } repeat div
    } bind def


    Finally (at last!), to print a UTF-8 string:

    /utfshow {
        UTF8_ACCEPT 0 UTF8_ACCEPT           % prev codep current
        4 -1 roll {
            decode
            dup UTF8_ACCEPT eq { 1 index unicodeshow } if
            dup UTF8_REJECT eq {
                (%% Bad UTF-8 sequence\n) print pop
                UTF8_ACCEPT /.notdef glyphshow
            } if
            3 -1 roll pop dup 3 1 roll      % prev = current
        } forall
        pop pop pop
    } def
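
    A short usage sketch, assuming every definition above (utf8d, decode,
    the unicode dictionary, chooseglyph, currentfontsize, unicodeshow and
    utfshow) has already been loaded. The font, position and text are
    arbitrary choices for the example, and whether each glyph is actually
    found depends on the font the interpreter supplies.

    ```postscript
    /Helvetica 12 selectfont
    72 720 moveto
    % the octal escapes are the UTF-8 bytes of e-acute and the euro sign
    (caf\303\251 \342\202\254 5) utfshow
    showpage
    ```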


    Regards,

    David

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Carlos@21:1/5 to All on Fri Jan 21 14:23:03 2022
    David Newall <davidn@davidnewall.com>:

    Hello All,

    I've written some PostScript to allow me to print UTF8-encoded
    strings:

    This is great!

    [...]
    % print a single unicode codepoint:
    % integer unicodeshow -
    /unicodeshow {
    [...]
    /utfshow {
    UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
    4 -1 roll {
    decode
    dup UTF8_ACCEPT eq { 1 index unicodeshow } if
    [...]

    Doesn't "x glyphshow y glyphshow" lose the kerning between x and y?
    (I'm not really sure)

    If it does, an alternative could be to create a (probably composite)
    temporary font out of the characters used in the string and "show" a
    reencoded string using that font. Too complicated though :)

    Carlos.

  • From David Newall@21:1/5 to Carlos on Sat Jan 22 12:27:49 2022
    On 22/1/22 12:23 am, Carlos wrote:
    David Newall <davidn@davidnewall.com>:
    I've written some PostScript to allow me to print UTF8-encoded
    strings:

    This is great!

    Thank you. It seemed a problem which needed to be solved. I hope I've
    made a start that's good enough to criticize.

    Doesn't "x glyphshow y glyphshow" lose the kerning between x and y?
    (I'm not really sure)

    PostScript doesn't automatically kern. There are operators you can use
    to do that, but it is something you have to do.
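
    As a rough illustration of doing it by hand with kshow, here is a
    minimal sketch. The kernpairs table, its values and the name kernshow
    are all invented for the example; real pair values would have to come
    from the font's metrics (an AFM file, for instance). It works on
    ordinary single-byte strings with a base font, not on the glyphshow
    calls used by utfshow.

    ```postscript
    % Hypothetical pair-kerning table (assumption, illustration only).
    % Keys are names built from the two characters of a pair; values are
    % horizontal adjustments in glyph-space units.
    /kernpairs << /AV -80 /VA -80 /To -60 >> def

    % string kernshow -
    /kernshow {
        {                               % kshow pushes c1 c2 between glyphs
            2 string                    % c1 c2 s
            dup 0 4 index put           % s[0] = c1
            dup 1 3 index put           % s[1] = c2
            3 1 roll pop pop            % s
            cvn                         % /pair
            kernpairs exch 2 copy known % kernpairs /pair bool
            {
                get 0                   % k 0, in glyph space
                currentfont /FontMatrix get dtransform
                rmoveto                 % move the current point before c2
            }
            { pop pop }
            ifelse
        } exch kshow
    } def

    /Times-Roman 36 selectfont
    72 700 moveto (AVATAR To) kernshow
    showpage
    ```

    Transforming the adjustment through the current font's FontMatrix
    keeps the table independent of the point size.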

  • From David Newall@21:1/5 to David Newall on Sun Jan 23 13:31:54 2022
    On 21/1/22 9:56 pm, David Newall wrote:
    I've written some PostScript to allow me to print UTF8-encoded strings

    There was an error in unicodeshow. I wasn't attempting /uniXXXX for
    codepoints that weren't in Adobe's table.

    Apparently it's also not uncommon to use /uXXXX through /uXXXXXX (4 to 6
    hex digits), so I check for those, too.

    % integer unicodeshow - show glyph for unicode code point
    /unicodeshow {
        % load array of known glyph names for this code point, supplemented
        % with /uXXXXXX (4 - 6 hex chars) and /uniXXXX (when codepoint fits
        % in 4 hex chars)
        [
            unicode 2 index known {unicode 2 index get aload pop} if
            % convert number to hex for /uXXXX.. and /uniXXXX
            (0000000) 6 counttomark 1 add index % string index number
            {
                % finished once no digits remain; this is tested first so
                % that a full 6-digit code point (e.g. 16#10FFFF) is not
                % mistaken for an overflow
                dup 0 eq { pop exit } if
                % number must fit in 6 hex digits
                1 index 0 eq {
                    pop pop pop
                    /unicodeshow cvx /rangecheck
                    /.error where {pop .error} {signalerror} ifelse
                } if
                3 copy 16 mod dup 9 gt { 55 } { 48 } ifelse add put
                16 idiv exch 1 sub exch
            } loop
            % require min 4 hex digits
            dup 2 gt { -1 3 { 1 index exch 16#30 put } for 2 } if
            % /uXXXX - /uXXXXXX
            2 copy 7 1 index sub getinterval dup 0 16#75 put cvn 3 1 roll
            % /uniXXXX
            2 eq { dup 0 (uni) putinterval dup cvn exch } if
            pop
        ] exch pop
        %[(candidates)2 index]== pstack(---)==
        dup currentfont chooseglyph not { /.notdef } if glyphshow
        pop
    } bind def

  • From David Newall@21:1/5 to David Newall on Sun Jan 23 14:10:12 2022
    Hi All,

    I'm soliciting opinions...

    On 21/1/22 9:56 pm, David Newall wrote:
    I've written some PostScript to allow me to print UTF8-encoded strings
    ...
    I also use a table which Adobe published ("UNICODE translation table for non-ASCII characters"), which they say is for going from a glyph name to
    a Unicode codepoint.  I (ab)use it in the reverse direction.  I turned
    it into a dictionary keyed on the codepoint.

    Many (most?) fonts have glyphs which aren't in Adobe's table, or which
    are named differently. Fontforge can write a table of glyphs in a font
    and their corresponding codepoints. Using that table, unicodeshow looks
    more like this:

    % lookup a unicode codepoint (int) in a list of known glyphs (dict)
    % and display the glyph found.
    % dict int unicodeshow -
    /unicodeshow {
    2 copy known { get } { pop pop /.notdef } ifelse glyphshow
    } bind def

    While this looks much neater, it requires pre-generating a dictionary
    for each font used.

    I can't decide which approach is better.

    I'm not delighted by needing to add a dictionary that's specific to the
    current font to utfshow and unicodeshow because it feels wrong.

    I suppose whatever fonts are used to print unicode will be embedded in
    the PS, so I could add the table to each font's dictionary. I wonder if
    that would cause confusion to anybody reading the code:

    /unicodeshow { % int unicodeshow -
    currentfont /unicode 2 copy known not {
    pop pop /unicodeshow cvx /invalidfont
    /.error where {pop .error} {signalerror} ifelse
    } if
    get exch 2 copy known { get } { pop pop /.notdef } ifelse glyphshow
    } bind def

    Maybe that's not so awful.

    Opinions? Is adding to a font dictionary going to break things?
    (I'm looking at you, Acrobat and Distiller.)

    Regards,

    David

  • From Carlos@21:1/5 to All on Sun Jan 23 13:56:10 2022
    On Sun, 23 Jan 2022 14:10:12 +1100
    David Newall <davidn@davidnewall.com> wrote:

    Hi All,

    I'm soliciting opinions...

    On 21/1/22 9:56 pm, David Newall wrote:
    I've written some PostScript to allow me to print UTF8-encoded
    strings ...
    I also use a table which Adobe published ("UNICODE translation
    table for non-ASCII characters"), which they say is for going from
    a glyph name to a Unicode codepoint.  I (ab)use it in the reverse direction.  I turned it into a dictionary keyed on the codepoint.
    Many (most?) fonts have glyphs which aren't in Adobe's table, or which
    are named differently. Fontforge can write a table of glyphs in a
    font and their corresponding codepoints. Using that table,
    unicodeshow looks more like this:

    % lookup a unicode codepoint (int) in a list of known glyphs (dict)
    % and display the glyph found.
    % dict int unicodeshow -
    /unicodeshow {
    2 copy known { get } { pop pop /.notdef } ifelse glyphshow
    } bind def

    While this looks much neater, it requires pre-generating a dictionary
    for each font used.

    I can't decide which approach is better.

    I think if a font has a mapping between unicode points and glyphs that
    you can extract (with Fontforge or whatever), then it surely also has
    uni/u aliases. The Adobe table is for older fonts that don't have them,
    so it's the only lookup table you need.

    I'm not delighted by needing to add a dictionary that's specific to
    the current font to utfshow and unicodeshow because it feels wrong.

    Also, having to pre-process the files to insert the tables is not good.

    [...]
    Opinions? Is adding to a font dictionary going to break things?
    (I'm looking at you, Acrobat and Distiller.)

    Don't know about that, I only use Ghostscript. But if the reason to add
    a lookup is speed, a possible optimization could be not to call
    unicodeshow on each codepoint, but identify string intervals where all
    bytes are either <= 127 or > 127. Call show on the former, and utfshow
    on the latter.
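
    A minimal sketch of that splitting, assuming utfshow is defined as
    earlier in the thread. The name mixedshow is made up for the example.
    It also assumes that the current font's Encoding maps the ASCII bytes
    the way the text expects, which is not true of every encoding (in
    StandardEncoding the two quote characters differ, for instance).

    ```postscript
    % string mixedshow -   (hypothetical helper, illustration only)
    /mixedshow {
        {
            dup length 0 eq { pop exit } if
            dup 0 get 128 lt        % str ascii?   (kind of the first byte)
            1                       % str ascii? n (length of the run so far)
            {
                % stop at the end of the string
                2 index length 1 index le { exit } if
                % stop when the next byte is of the other kind
                2 index 1 index get 128 lt 2 index ne { exit } if
                1 add
            } loop
            3 copy exch pop         % str ascii? n str n
            0 exch getinterval      % str ascii? n head
            exch                    % str ascii? head n
            4 -1 roll               % ascii? head n str
            exch                    % ascii? head str n
            1 index length 1 index sub
            getinterval             % ascii? head tail
            3 1 roll exch           % tail head ascii?
            { show } { utfshow } ifelse
        } loop
    } def

    /Helvetica 12 selectfont
    72 720 moveto
    (Price: 5 \342\202\254, na\303\257ve) mixedshow
    ```

    Every byte of a multi-byte UTF-8 sequence is above 127, so a run
    boundary never falls inside a sequence.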

    C.

  • From Carlos@21:1/5 to All on Sun Jan 23 13:35:11 2022
    On Sun, 23 Jan 2022 13:31:54 +1100
    David Newall <davidn@davidnewall.com> wrote:

    On 21/1/22 9:56 pm, David Newall wrote:
    I've written some PostScript to allow me to print UTF8-encoded
    strings

    There was an error in unicodeshow. I wasn't attempting /uniXXXX for codepoints that weren't in Adobe's table.

    Apparently it's also not uncommon to use /uXXXX through /uXXXXXX (4
    to 6 hex digits), so I check for those, too.

    Adobe's table (or one similar to it) is included in Ghostscript (AdobeGlyphList), and maybe other interpreters, too.

    Here's an old snippet that gets a glyph name (or uniXXXX) based on its
    code:

    /RevList AdobeGlyphList length dict dup begin
    AdobeGlyphList { exch def } forall
    end def
    % code -- (uniXXXX)
    /uniX { 16 6 string cvrs dup length 7 exch sub exch
    (uni0000) 7 string copy dup 4 2 roll putinterval } def
    % font code -- glyphname
    /unitoname { dup RevList exch known
    { RevList exch get }
    { uniX cvn } ifelse
    exch /CharStrings get 1 index known not
    { pop /.notdef } if
    } def

    (It doesn't contemplate several names per code... I thought it was a
    1-1 relationship.)

    If you know you are dealing with modern fonts that include the uni/u
    aliases, you can get rid of the Adobe table lookup altogether... You
    don't need the canonical glyph names for those fonts.

  • From luser droog@21:1/5 to David Newall on Mon Jan 24 08:33:13 2022
    On Saturday, January 22, 2022 at 9:10:23 PM UTC-6, David Newall wrote:

    Opinions? Is adding to a font dictionary going to break things?
    (I'm looking at you, Acrobat and Distiller.)

    Regards,

    David

    I don't see how that could be a problem unless the additions conflict
    with existing names. It's possible that findfont will give you a dictionary without write access. But you could copy everything into a new dictionary
    and then call `definefont` on that and you should be good to go. (Take
    care *not* to copy the /UniqueID key since definefont will want to
    generate a new one.)
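
    A minimal sketch of that copy-and-redefine approach. The name
    addunicode, the new font name and the one-entry table are invented for
    the example. It skips /UniqueID as suggested above, and also /FID,
    which definefont supplies itself.

    ```postscript
    % fontname newname unicodedict addunicode font   (hypothetical helper)
    /addunicode {
        3 -1 roll findfont          % newname unicodedict font
        dup length 1 add dict       % newname unicodedict font newdict
        begin
            {                       % key value
                1 index /FID eq 2 index /UniqueID eq or
                { pop pop } { def } ifelse
            } forall                % newname unicodedict
            /unicode exch def       % newname
            currentdict
        end                         % newname newdict
        definefont
    } def

    % example: a copy of Helvetica carrying a (tiny) codepoint table
    /Helvetica /Helvetica-U << 16#20AC /Euro >> addunicode pop
    /Helvetica-U 12 selectfont
    currentfont /unicode known ==   % expected to print true
    ```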

  • From luser droog@21:1/5 to Carlos on Mon Jan 24 08:37:58 2022
    On Sunday, January 23, 2022 at 6:56:13 AM UTC-6, Carlos wrote:
    On Sun, 23 Jan 2022 14:10:12 +1100
    David Newall <dav...@davidnewall.com> wrote:

    [...]
    Opinions? Is adding to a font dictionary going to break things?
    (I'm looking at you, Acrobat and Distiller.)
    Don't know about that, I only use Ghostscript. But if the reason to add
    a lookup is speed, a possible optimization could be not to call
    unicodeshow on each codepoint, but identify string intervals where all
    bytes are either <= 127 or > 127. Call show on the former, and utfshow
    on the latter.

    C.

    Or if speed is not a problem, you could implement a replacement for
    kshow instead of show. Then the whole show family can easily be built
    off of that.

  • From David Newall@21:1/5 to luser droog on Wed Jan 26 15:06:39 2022
    On 25/1/22 3:33 am, luser droog wrote:
    On Saturday, January 22, 2022 at 9:10:23 PM UTC-6, David Newall wrote:

    Opinions? Is adding to a font dictionary going to break things?
    (I'm looking at you, Acrobat and Distiller.)

    I don't see how that could be a problem unless the additions conflict
    with existing names. It's possible that findfont will give you a dictionary without write access. But you could copy everything into a new dictionary
    and then call `definefont` on that and you should be good to go.

    Thanks. I can't see how it could, either, but I have little experience
    with actual Adobe software, as I use Ghostscript for almost all of my PostScript work.

    I might have been unclear in "adding to a font dictionary". I'm not contemplating /name findfont { modify } definefont, but fontforge font;
    awk '...' font.g2n; vi font.t42.

    Regards,

    David

  • From David Newall@21:1/5 to Carlos on Wed Jan 26 14:59:09 2022
    Hi Carlos,

    Thanks for your very useful feedback.

    I will say, up-front, that using Adobe Glyph List (glyphlist.txt found
    at https://github.com/adobe-type-tools/agl-aglfn) is often sufficient, depending on what unicode values need to be painted and what font is to
    be used. But I want to do better than "often".

    I'm using https://antofthy.gitlab.io/info/data/utf8-demo.txt to test my
    code. Its coverage is ... extensive (and my current code seems to work
    for all of it -- fonts notwithstanding.)


    On 23/1/22 11:35 pm, Carlos wrote:
    Adobe's table (or one similar to it) is included in Ghostscript (AdobeGlyphList), and maybe other interpreters, too.

    I didn't know about AdobeGlyphList. The one in Ghostscript (9.50) has
    multiple names for some unicode values. Conversely, Adobe Glyph List (glyphlist.txt found at https://github.com/adobe-type-tools/agl-aglfn)
    contains multiple values for some names.

    No font is guaranteed to use any of these names and many fonts that I've examined use different names for unicode values (and different values
    for some names.)

    If you know you are dealing with modern fonts that include the uni/u
    aliases, you can get rid of the Adobe table lookup altogether... You
    don't need the canonical glyph names for those fonts.

    No font that I've examined includes uni/u names for every glyph, or even
    for most glyphs.

    One can't rely on any pre-determined glyph name, nor any pre-determined
    lookup table. What a mess.


    On 23/1/22 11:56 pm, Carlos wrote:
    I think if a font has a mapping between unicode points and glyphs that
    you can extract (with Fontforge or whatever), then it surely also has
    uni/u aliases. The Adobe table is for older fonts that don't have them,
    so it's the only lookup table you need.

    I wish that were true, but it's not.

    After your comment about older fonts, I examined Courier, a Type 1 font (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
    The CharStrings array breaks my
    assumptions and my code completely fails.

    I'm not delighted by needing to add a dictionary that's specific to
    the current font to utfshow and unicodeshow because it feels wrong.

    Also, having to pre-process the files to insert the tables is not good.

    I completely agree. I don't like it. I want to be able to use any font without preprocessing, but I can't see how.


    a possible optimization could be not to call
    unicodeshow on each codepoint, but identify string intervals where all
    bytes are either <= 127 or > 127. Call show on the former, and utfshow
    on the latter.

    Agreed. Ps2pdf slows down dramatically with a large number of glyphshows. https://antofthy.gitlab.io/info/data/utf8-demo.txt, which is 50K, takes
    4 minutes to process using utf8show and ps2pdf. The utf8-decode phase
    takes 20ms and Ghostscript takes 510ms.

    For anyone interested, https://davidnewall/software/utf8show. It's
    still a work-in-progress.

    David

  • From Carlos@21:1/5 to David Newall on Thu Feb 10 15:05:37 2022
    On Wed, 26 Jan 2022 14:59:09 +1100
    David Newall <davidn@davidnewall.com> wrote:
    No font is guaranteed to use any of these names and many fonts that
    I've examined use different names for unicode values (and different
    values for some names.)

    If you know you are dealing with modern fonts that include the uni/u aliases, you can get rid of the Adobe table lookup altogether... You
    don't need the canonical glyph names for those fonts.

    No font that I've examined includes uni/u names for every glyph, or
    even for most glyphs.

    One can't rely on any pre-determined glyph name, nor any
    pre-determined lookup table. What a mess.

    Well, that's disappointing...

    After your comment about older fonts, I examined Courier, a Type 1
    font (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
    The CharStrings array breaks my assumptions and my code completely
    fails.

    What assumptions?

    C.

  • From David Newall@21:1/5 to Carlos on Wed Feb 16 13:55:22 2022
    On 11/2/22 01:05, Carlos wrote:
    After your comment about older fonts, I examined Courier, a Type 1
    font
    (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
    The CharStrings array breaks my assumptions and my code completely
    fails.
    What assumptions?

    The issue wasn't type 1 fonts, after all, that was just the thread I
    pulled at. The issue is CharStrings. Not all fonts have one. In
    particular, type 3 fonts don't. Type 3 fonts have a BuildGlyph or
    BuildChar procedure which often use a CharProcs dictionary, but that's
    not guaranteed.

    I am now taking the position that a font must have CharStrings or CharProcs
    to be used with this body of code. In practice that's unlikely to be a problem.
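
    A minimal sketch of how the glyph search might accept either
    dictionary. The names glyphdict and chooseglyph2 are made up for the
    example; the search loop itself is the one from chooseglyph earlier in
    the thread. A font with neither dictionary simply reports no glyph.

    ```postscript
    % fontdict glyphdict dict true    (CharStrings, else CharProcs)
    %                    false        (font has neither)
    /glyphdict {
        dup /CharStrings known
        { /CharStrings get true }
        { dup /CharProcs known { /CharProcs get true } { pop false } ifelse }
        ifelse
    } def

    % [/glyph ...] fontdict chooseglyph2 /glyph true
    %                                    false
    /chooseglyph2 {
        glyphdict
        {                               % [glyphs] dict
            false 3 1 roll exch         % false dict [glyphs]
            { 2 copy known { true 4 2 roll exch pop exit } { pop } ifelse }
            forall
            pop                         % remove the glyph dictionary
        }
        { pop false }                   % no usable dictionary
        ifelse
    } def

    % example: prefer /Euro, fall back to /uni20AC, else .notdef
    /Helvetica 12 selectfont 72 720 moveto
    [/Euro /uni20AC] currentfont chooseglyph2 not { /.notdef } if glyphshow
    ```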
