Forum: >>> Magnum BBS <<<

Printing UTF8 (Unicode)

From David Newall@21:1/5 to All on Fri Jan 21 21:56:44 2022

Copy: glaukon.ariston@gmail.com (Glaukon)

Hello All,

I've written some PostScript to allow me to print UTF8-encoded strings:

(UTF-8 Encoded String.....) utfshow

I'm happy to send you the full source, or, if appropriate, publish it
here; however, the exposition below includes everything you should need.

I use a UTF-8 decoder which was written (in C) by Bjoern Hoehrmann (see http://bjoern.hoehrmann.de/utf-8/decoder/dfa/):

%/ Copyright (c) 2008-2010 Bjoern Hoehrmann <bjoern@hoehrmann.de>
%/ See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

/UTF8_ACCEPT 0 def
/UTF8_REJECT 12 def

/utf8d [
%/ The first part of the table maps bytes to character classes, which
%/ reduces the size of the transition table and creates bitmasks.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
10 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 11 6 6 6 5 8 8 8 8 8 8 8 8 8 8 8

%/ The second part is a transition table that maps a combination
%/ of a state of the automaton and a character class to a state.
0 12 24 36 60 96 84 12 12 12 48 72 12 12 12 12 12 12 12 12 12 12 12 12
12 0 12 12 12 12 12 0 12 0 12 12 12 24 12 12 12 12 12 24 12 24 12 12
12 12 12 12 12 12 12 24 12 12 12 12 12 24 12 12 12 12 12 12 12 24 12 12
12 12 12 12 12 12 12 36 12 36 12 12 12 36 12 12 12 12 12 36 12 36 12 12
12 36 12 12 12 12 12 12 12 12 12 12
] def

% codep state byte decode codep' state'
/decode {
utf8d 1 index get % type
% codep state byte type
2 index UTF8_ACCEPT ne % state not UTF8_ACCEPT?
{ exch 16#3F and 4 -1 roll 6 bitshift or }
{ dup neg 16#FF exch bitshift 3 -1 roll and 4 -1 roll pop }
ifelse % state type codep'
3 1 roll add 256 add utf8d exch get % codep' state'
} def

%***************************************************************************/
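For readers who find the stack manipulation hard to follow, here is a minimal Python sketch of the same table-driven decoding. It is a transcription for illustration only, not part of the posted code; the function names `decode` and `codepoints` and the use of `None` for a rejected sequence are assumptions made for this example.

```python
# Python transcription of the table-driven UTF-8 decoder (illustrative sketch).
UTF8_ACCEPT = 0
UTF8_REJECT = 12

# First 256 entries: byte -> character class.
# Remaining 108 entries: (state + class) -> next state.
UTF8D = (
    [0] * 128 + [1] * 16 + [9] * 16 + [7] * 32 + [8] * 2 + [2] * 30
    + [10] + [3] * 12 + [4] + [3] * 2 + [11] + [6] * 3 + [5] + [8] * 11
    + [0, 12, 24, 36, 60, 96, 84, 12, 12, 12, 48, 72]
    + [12] * 12
    + [12, 0, 12, 12, 12, 12, 12, 0, 12, 0, 12, 12]
    + [12, 24, 12, 12, 12, 12, 12, 24, 12, 24, 12, 12]
    + [12, 12, 12, 12, 12, 12, 12, 24, 12, 12, 12, 12]
    + [12, 24, 12, 12, 12, 12, 12, 12, 12, 24, 12, 12]
    + [12, 12, 12, 12, 12, 12, 12, 36, 12, 36, 12, 12]
    + [12, 36, 12, 12, 12, 12, 12, 36, 12, 36, 12, 12]
    + [12, 36, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]
)


def decode(state, codep, byte):
    """Feed one byte to the automaton; return (new_state, new_codep)."""
    cls = UTF8D[byte]
    if state != UTF8_ACCEPT:
        # Continuation byte: append its low six bits.
        codep = (byte & 0x3F) | (codep << 6)
    else:
        # Lead byte: mask off the length marker bits.
        codep = (0xFF >> cls) & byte
    return UTF8D[256 + state + cls], codep


def codepoints(data):
    """Decode bytes to a list of codepoints; None marks a bad sequence."""
    state, codep, out = UTF8_ACCEPT, 0, []
    for byte in data:
        state, codep = decode(state, codep, byte)
        if state == UTF8_ACCEPT:
            out.append(codep)
        elif state == UTF8_REJECT:
            out.append(None)          # stands in for .notdef
            state = UTF8_ACCEPT       # resynchronise on the next byte
    return out
```

As in the PostScript, the byte that triggers a rejection is consumed, so a valid character immediately after a truncated sequence can be lost.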

I also use a table which Adobe published ("UNICODE translation table for non-ASCII characters"), which they say is for going from a glyph name to
a Unicode codepoint. I (ab)use it in the reverse direction. I turned
it into a dictionary keyed on the codepoint.

The table is currently at https://github.com/adobe-type-tools/agl-aglfn.
Some codepoints have multiple possible glyph names, so the dictionary
has an array of potential glyph names for each codepoint. Finally,
fonts often have glyphs named /uniHHHH, where HHHH is the codepoint.

I converted the table to PS using awk:

BEGIN{FS="[; ]"}
{
for(i=2; i<=NF; i++) {
if(!($i in h)) {h[$i]=++n;v[n]=$i}
g[$i]=g[$i]"/"$1
}
}
END{
print "/unicode <<"
for(i=1;i<=n;i++) print "\t16#"v[i]"["g[v[i]]"/uni"toupper(v[i])"]"
print ">> def"
}

Adobe's table is turned into this:

/unicode <<
16#0041[/A/uni0041]
16#00C6[/AE/uni00C6]
...
16#305A[/zuhiragana/uni305A]
16#30BA[/zukatakana/uni30BA]
>> def
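The same inversion can be sketched in Python, which may be easier to adapt than the awk. This is only an illustration under assumptions: the function name `invert_glyph_list` is made up here, input lines are assumed to have the form `name;HHHH` (optionally with several space-separated codes), and, unlike the awk script, the sketch skips blank lines and `#` comments.

```python
def invert_glyph_list(lines):
    """Map each codepoint to the list of glyph names that may represent it."""
    table = {}                                  # dicts preserve insertion order
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                            # skip blanks and comments
        name, _, codes = line.partition(";")
        # Like the awk script, treat every code on the line separately.
        for code in codes.split():
            table.setdefault(int(code, 16), []).append(name)
    # Append the conventional uniHHHH alias as the last candidate.
    for codepoint, names in table.items():
        names.append("uni%04X" % codepoint)
    return table
```

Calling it on the lines `A;0041` and `AE;00C6` gives `{0x41: ["A", "uni0041"], 0xC6: ["AE", "uni00C6"]}`, which corresponds to the first two dictionary entries shown above.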

The crux of printing Unicode code points is to find which of the
possible glyphs the current font defines. I search currentfont's
CharStrings.

% look for one of the glyphs in fontdict's CharStrings
% [/glyph ...] fontdict chooseglyph /glyph true
% false
/chooseglyph {
/CharStrings get exch % the glyphs defined in fontdict
false 3 1 roll % assume we don't find a glyph
% false CharStrings [glyphs]
{ 2 copy known {true 4 2 roll exch pop exit}{pop} ifelse } forall
pop % remove CharStrings
} def
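In other words, chooseglyph returns the first candidate name that the font defines. A minimal Python sketch of that search follows; the font is modelled as a plain dict with a `CharStrings` entry, and the name `choose_glyph` and the `None` return value (in place of `false`) are assumptions for illustration.

```python
def choose_glyph(candidates, font):
    """Return the first candidate defined in font['CharStrings'], else None."""
    charstrings = font["CharStrings"]
    for name in candidates:
        if name in charstrings:
            return name                 # first match wins, as with `exit`
    return None                         # corresponds to leaving `false`
```

Because the first match wins, the order of names in each candidate array decides which glyph is used when a font defines more than one of them.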

I've noticed that Symbol sometimes contains glyphs that other fonts
don't, so, if I don't find a glyph in currentfont I look through Symbol.

I thought it might be a good idea to also try ZapfDingbats. In
retrospect, that might be a red herring.

Adobe also publish a similar table giving the names of the ZapfDingbats
glyphs. It's in the same repository, and converts using the same awk
script:

/zapf <<
16#275E[/a100/uni275E]
16#2761[/a101/uni2761]
...
16#275D[/a99/uni275D]
16#2720[/a9/uni2720]
>> def

This is the code which prints a unicode code point (or .notdef if a
glyph cannot be found):

% SPDX-License-Identifier: LGPL-2.1-or-later
%
% Copyright (c) 2022 by davidnewall.com. All rights reserved.

% print a single unicode codepoint:
% integer unicodeshow -
/unicodeshow {
% load array of known glyph names for this code point
unicode 1 index known
{unicode exch get} % array of possible glyphs
{ pop []} % unknown code point
ifelse
{
dup currentfont chooseglyph { glyphshow exit } if
dup /ZapfDingbats findfont chooseglyph {
currentfont exch /ZapfDingbats currentfontsize selectfont
glyphshow setfont exit } if
dup /Symbol findfont chooseglyph {
currentfont exch /Symbol currentfontsize selectfont
glyphshow setfont exit } if
/.notdef glyphshow exit
} loop
pop
} def

I get the current font size using this:

/currentfontsize {
currentfont dup /OrigFont get
2 { /FontMatrix get 3 get exch } repeat div
} bind def
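The arithmetic here is just a ratio of two matrix entries: the vertical scale of the current (scaled) font divided by that of the unscaled original. A small Python sketch, assuming the font is modelled as a dict with `FontMatrix` and `OrigFont` entries as in the PostScript above (the function name is hypothetical):

```python
def current_font_size(font):
    """Ratio of the scaled font's vertical scale to the original's."""
    scaled = font["FontMatrix"][3]
    original = font["OrigFont"]["FontMatrix"][3]
    return scaled / original
```

The result is only a meaningful "size" when the font was scaled uniformly; for a font set with makefont and an anisotropic or rotated matrix, a single number cannot describe the transformation.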

Finally (at last!), to print a UTF-8 string:

/utfshow {
UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
4 -1 roll {
decode
dup UTF8_ACCEPT eq { 1 index unicodeshow } if
dup UTF8_REJECT eq {
(%% Bad UTF-8 sequence\n) print pop
UTF8_ACCEPT /.notdef glyphshow } if
3 -1 roll pop dup 3 1 roll % prev = current
} forall
pop pop pop
} def

Regards,

David

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Carlos@21:1/5 to All on Fri Jan 21 14:23:03 2022

David Newall <davidn@davidnewall.com>:

Hello All,

I've written some PostScript to allow me to print UTF8-encoded
strings:

This is great!

[...]

% print a single unicode codepoint:
% integer unicodeshow -
/unicodeshow {

[...]

/utfshow {
UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
4 -1 roll {
decode
dup UTF8_ACCEPT eq { 1 index unicodeshow } if

[...]

Doesn't "x glyphshow y glyphshow" lose the kerning between x and y?
(I'm not really sure)

If it does, an alternative could be to create a (probably composite)
temporary font out of the characters used in the string and "show" a
reencoded string using that font. Too complicated though :)

Carlos.


From David Newall@21:1/5 to Carlos on Sat Jan 22 12:27:49 2022

On 22/1/22 12:23 am, Carlos wrote:

David Newall <davidn@davidnewall.com>:

I've written some PostScript to allow me to print UTF8-encoded
strings:

This is great!

Thank you. It seemed a problem which needed to be solved. I hope I've
made a start that's good enough to criticize.

Doesn't "x glyphshow y glyphshow" lose the kerning between x and y?
(I'm not really sure)

PostScript doesn't kern automatically. There are operators you can use
to adjust spacing, but you have to do it yourself.


From David Newall@21:1/5 to David Newall on Sun Jan 23 13:31:54 2022

On 21/1/22 9:56 pm, David Newall wrote:

I've written some PostScript to allow me to print UTF8-encoded strings

There was an error in unicodeshow. I wasn't attempting /uniXXXX for
codepoints that weren't in Adobe's table.

Apparently it's also not uncommon to use /uXXXX through /uXXXXXX (4 to 6
hex digits), so I check for those, too.

% integer unicodeshow - show glyph for unicode code point
/unicodeshow {
% load array of known glyph names for this code point, supplemented
% with /uXXXXXX (4 - 6 hex chars) and /uniXXXX (when codepoint fits
% in 4 hex chars)
[
unicode 2 index known {unicode 2 index get aload pop} if
% convert number to hex for /uXXXX.. and /uniXXXX
(0000000) 6 counttomark 1 add index % string index number
{
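% stop once every digit has been written
dup 0 eq { pop exit } if
% digits remain but the buffer is full: number needs more than 6 hex digits
1 index 0 eq {
pop pop pop /unicodeshow cvx /rangecheck
/.error where {pop .error} {signalerror} ifelse
} if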
3 copy 16 mod dup 9 gt { 55 } { 48 } ifelse add put
16 idiv exch 1 sub exch
} loop
% require min 4 hex digits
dup 2 gt { -1 3 { 1 index exch 16#30 put } for 2 } if
% /uXXXX - /uXXXXXX
2 copy 7 1 index sub getinterval dup 0 16#75 put cvn 3 1 roll
% /uniXXXX
2 eq { dup 0 (uni) putinterval dup cvn exch } if
pop
] exch pop
%[(candidates)2 index]== pstack(---)==
dup currentfont chooseglyph not { /.notdef } if glyphshow
pop
} bind def
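The candidate list this version builds can be summarised in a few lines of Python. This is a sketch for illustration only: the name `candidate_names` and the optional `table` argument (standing in for the `unicode` dictionary) are assumptions, and the six-hex-digit limit mirrors the buffer size used in the PostScript.

```python
def candidate_names(codepoint, table=None):
    """Glyph names to try, in order, for one Unicode codepoint."""
    if not 0 <= codepoint <= 0xFFFFFF:
        raise ValueError("codepoint must fit in 6 hex digits")
    # Names from the Adobe-derived table come first, if there are any.
    names = list((table or {}).get(codepoint, []))
    hexdigits = "%04X" % codepoint          # at least 4 digits, upper case
    names.append("u" + hexdigits)           # /uXXXX .. /uXXXXXX
    if len(hexdigits) == 4:
        names.append("uni" + hexdigits)     # /uniXXXX only for 4 digits
    return names
```

For example, `candidate_names(0x41, {0x41: ["A"]})` yields `["A", "u0041", "uni0041"]`, while a codepoint outside the Basic Multilingual Plane such as 0x1F600 yields only `["u1F600"]`.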


From David Newall@21:1/5 to David Newall on Sun Jan 23 14:10:12 2022

Hi All,

I'm soliciting opinions...

On 21/1/22 9:56 pm, David Newall wrote:

I've written some PostScript to allow me to print UTF8-encoded strings
...
I also use a table which Adobe published ("UNICODE translation table for non-ASCII characters"), which they say is for going from a glyph name to
a Unicode codepoint. I (ab)use it in the reverse direction. I turned
it into a dictionary keyed on the codepoint.

Many (most?) fonts have glyphs which aren't in Adobe's table, or which
are named differently. Fontforge can write a table of glyphs in a font
and their corresponding codepoints. Using that table, unicodeshow looks
more like this:

% lookup a unicode codepoint (int) in a list of known glyphs (dict)
% and display the glyph found.
% dict int unicodeshow -
/unicodeshow {
2 copy known { get } { pop pop /.notdef } ifelse glyphshow
} bind def

While this looks much neater, it requires pre-generating a dictionary
for each font used.

I can't decide which approach is better.

I'm not delighted by needing to add a dictionary that's specific to the
current font to utfshow and unicodeshow because it feels wrong.

I suppose whatever fonts are used to print unicode will be embedded in
the PS, so I could add the table to each font's dictionary. I wonder if
that would cause confusion to anybody reading the code:

/unicodeshow { % int unicodeshow -
currentfont /unicode 2 copy known not {
pop pop /unicodeshow cvx /invalidfont
/.error where {pop .error} {signalerror} ifelse
} if
get exch 2 copy known { get } { pop pop /.notdef } ifelse glyphshow
} bind def

Maybe that's not so awful.

Opinions? Is adding to a font dictionary going to break things?
(I'm looking at you, Acrobat and Distiller.)

Regards,

David


From Carlos@21:1/5 to All on Sun Jan 23 13:56:10 2022

On Sun, 23 Jan 2022 14:10:12 +1100
David Newall <davidn@davidnewall.com> wrote:

Hi All,

I'm soliciting opinions...

On 21/1/22 9:56 pm, David Newall wrote:

I've written some PostScript to allow me to print UTF8-encoded
strings ...
I also use a table which Adobe published ("UNICODE translation
table for non-ASCII characters"), which they say is for going from
a glyph name to a Unicode codepoint. I (ab)use it in the reverse
direction. I turned it into a dictionary keyed on the codepoint.

Many (most?) fonts have glyphs which aren't in Adobe's table, or which
are named differently. Fontforge can write a table of glyphs in a
font and their corresponding codepoints. Using that table,
unicodeshow looks more like this:

% lookup a unicode codepoint (int) in a list of known glyphs (dict)
% and display the glyph found.
% dict int unicodeshow -
/unicodeshow {
2 copy known { get } { pop pop /.notdef } ifelse glyphshow
} bind def

While this looks much neater, it requires pre-generating a dictionary
for each font used.

I can't decide which approach is better.

I think if a font has a mapping between unicode points and glyphs that
you can extract (with Fontforge or whatever), then it surely also has
uni/u aliases. The Adobe table is for older fonts that don't have them,
so it's the only lookup table you need.

I'm not delighted by needing to add a dictionary that's specific to
the current font to utfshow and unicodeshow because it feels wrong.

Also, having to pre-process the files to insert the tables is not good.

[...]

Opinions? Is adding to a font dictionary going to break things?
(I'm looking at you, Acrobat and Distiller.)

Don't know about that, I only use Ghostscript. But if the reason to add
a lookup is speed, a possible optimization could be not to call
unicodeshow on each codepoint, but identify string intervals where all
bytes are either <= 127 or > 127. Call show on the former, and utfshow
on the latter.
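A minimal Python sketch of that splitting step, for illustration: it groups a byte string into maximal runs of ASCII (<= 127) and non-ASCII (> 127) bytes. The function name `split_runs` is made up here. Since every byte of a multi-byte UTF-8 sequence is > 127, a run boundary never falls inside a well-formed character.

```python
from itertools import groupby


def split_runs(data):
    """Yield (is_ascii, chunk) pairs that cover data in order."""
    for is_ascii, group in groupby(data, key=lambda byte: byte <= 127):
        yield is_ascii, bytes(group)
```

The ASCII runs would go to show and the others to utfshow. One caveat: this assumes the current font's Encoding maps ASCII codes to the expected glyphs, which is not always so; in StandardEncoding, for instance, codes 39 and 96 select quoteright and quoteleft rather than the ASCII apostrophe and grave accent.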

C.


From Carlos@21:1/5 to All on Sun Jan 23 13:35:11 2022

On Sun, 23 Jan 2022 13:31:54 +1100
David Newall <davidn@davidnewall.com> wrote:

On 21/1/22 9:56 pm, David Newall wrote:

I've written some PostScript to allow me to print UTF8-encoded
strings

There was an error in unicodeshow. I wasn't attempting /uniXXXX for codepoints that weren't in Adobe's table.

Apparently it's also not uncommon to use /uXXXX through /uXXXXXX (4
to 6 hex digits), so I check for those, too.

Adobe's table (or one similar to it) is included in Ghostscript (AdobeGlyphList), and maybe other interpreters, too.

Here's an old snippet that gets a glyph name (or uniXXXX) based on its
code:

/RevList AdobeGlyphList length dict dup begin
AdobeGlyphList { exch def } forall
end def
% code -- (uniXXXX)
/uniX { 16 6 string cvrs dup length 7 exch sub exch
(uni0000) 7 string copy dup 4 2 roll putinterval } def
% font code -- glyphname
/unitoname { dup RevList exch known
{ RevList exch get }
{ uniX cvn } ifelse
exch /CharStrings get 1 index known not
{ pop /.notdef } if
} def

(It doesn't contemplate several names per code... I thought it was a
1-1 relationship.)

If you know you are dealing with modern fonts that include the uni/u
aliases, you can get rid of the Adobe table lookup altogether... You
don't need the canonical glyph names for those fonts.


From luser droog@21:1/5 to David Newall on Mon Jan 24 08:33:13 2022

On Saturday, January 22, 2022 at 9:10:23 PM UTC-6, David Newall wrote:

Opinions? Is adding to a font dictionary going to break things?
(I'm looking at you, Acrobat and Distiller.)

Regards,

David

I don't see how that could be a problem unless the additions conflict
with existing names. It's possible that findfont will give you a dictionary without write access. But you could copy everything into a new dictionary
and then call `definefont` on that and you should be good to go. (Take
care *not* to copy the /UniqueID key since definefont will want to
generate a new one.)


From luser droog@21:1/5 to Carlos on Mon Jan 24 08:37:58 2022

On Sunday, January 23, 2022 at 6:56:13 AM UTC-6, Carlos wrote:

On Sun, 23 Jan 2022 14:10:12 +1100
David Newall <dav...@davidnewall.com> wrote:

[...]

Opinions? Is adding to a font dictionary going to break things?
(I'm looking at you, Acrobat and Distiller.)

Don't know about that, I only use Ghostscript. But if the reason to add
a lookup is speed, a possible optimization could be not to call
unicodeshow on each codepoint, but identify string intervals where all
bytes are either <= 127 or > 127. Call show on the former, and utfshow
on the latter.

C.

Or if speed is not a problem, you could implement a replacement for
kshow instead of show. Then the whole show family can easily be built
off of that.


From David Newall@21:1/5 to luser droog on Wed Jan 26 15:06:39 2022

On 25/1/22 3:33 am, luser droog wrote:

On Saturday, January 22, 2022 at 9:10:23 PM UTC-6, David Newall wrote:

Opinions? Is adding to a font dictionary going to break things?
(I'm looking at you, Acrobat and Distiller.)

I don't see how that could be a problem unless the additions conflict
with existing names. It's possible that findfont will give you a dictionary without write access. But you could copy everything into a new dictionary
and then call `definefont` on that and you should be good to go.

Thanks. I can't see how it could, either, but I have little experience
with actual Adobe software, as I use Ghostscript for almost all of my PostScript work.

I might have been unclear in "adding to a font dictionary". I'm not contemplating /name findfont { modify } definefont, but fontforge font;
awk '...' font.g2n; vi font.t42.

Regards,

David


From David Newall@21:1/5 to Carlos on Wed Jan 26 14:59:09 2022

Hi Carlos,

Thanks for your very useful feedback.

I will say, up-front, that using Adobe Glyph List (glyphlist.txt found
at https://github.com/adobe-type-tools/agl-aglfn) is often sufficient, depending on what unicode values need to be painted and what font is to
be used. But I want to do better than "often".

I'm using https://antofthy.gitlab.io/info/data/utf8-demo.txt to test my
code. Its coverage is ... extensive (and my current code seems to work
for all of it -- font notwithstanding.)

On 23/1/22 11:35 pm, Carlos wrote:

Adobe's table (or one similar to it) is included in Ghostscript (AdobeGlyphList), and maybe other interpreters, too.

I didn't know about AdobeGlyphList. The one in Ghostscript (9.50) has
multiple names for some unicode values. Conversely, Adobe Glyph List (glyphlist.txt found at https://github.com/adobe-type-tools/agl-aglfn)
contains multiple values for some names.

No font is guaranteed to use any of these names and many fonts that I've examined use different names for unicode values (and different values
for some names.)

If you know you are dealing with modern fonts that include the uni/u
aliases, you can get rid of the Adobe table lookup altogether... You
don't need the canonical glyph names for those fonts.

No font that I've examined includes uni/u names for every glyph, or even
for most glyphs.

One can't rely on any pre-determined glyph name, nor any pre-determined
lookup table. What a mess.

On 23/1/22 11:56 pm, Carlos wrote:

I think if a font has a mapping between unicode points and glyphs that
you can extract (with Fontforge or whatever), then it surely also has
uni/u aliases. The Adobe table is for older fonts that don't have them,
so it's the only lookup table you need.

I wish that were true, but it's not.

After your comment about older fonts, I examined Courier, a Type 1 font (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
The CharStrings array breaks my
assumptions and my code completely fails.

I'm not delighted by needing to add a dictionary that's specific to
the current font to utfshow and unicodeshow because it feels wrong.

Also, having to pre-process the files to insert the tables is not good.

I completely agree. I don't like it. I want to be able to use any font without preprocessing, but I can't see how.

a possible optimization could be not to call
unicodeshow on each codepoint, but identify string intervals where all
bytes are either <= 127 or > 127. Call show on the former, and utfshow
on the latter.

Agreed. Ps2pdf slows down dramatically with a large number of glyphshow calls. https://antofthy.gitlab.io/info/data/utf8-demo.txt, which is 50K, takes
4 minutes to process using utf8show and ps2pdf. The utf8-decode phase
takes 20ms and Ghostscript takes 510ms.

For anyone interested, https://davidnewall/software/utf8show. It's
still a work-in-progress.

David


From Carlos@21:1/5 to David Newall on Thu Feb 10 15:05:37 2022

On Wed, 26 Jan 2022 14:59:09 +1100
David Newall <davidn@davidnewall.com> wrote:

No font is guaranteed to use any of these names and many fonts that
I've examined use different names for unicode values (and different
values for some names.)

If you know you are dealing with modern fonts that include the uni/u aliases, you can get rid of the Adobe table lookup altogether... You
don't need the canonical glyph names for those fonts.

No font that I've examined includes uni/u names for every glyph, or
even for most glyphs.

One can't rely on any pre-determined glyph name, nor any
pre-determined lookup table. What a mess.

Well, that's disappointing...

After your comment about older fonts, I examined Courier, a Type 1
font (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
The CharStrings array breaks my assumptions and my code completely
fails.

What assumptions?

C.


From David Newall@21:1/5 to Carlos on Wed Feb 16 13:55:22 2022

On 11/2/22 01:05, Carlos wrote:

After your comment about older fonts, I examined Courier, a Type 1
font
(https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
The CharStrings array breaks my assumptions and my code completely
fails.

What assumptions?

The issue wasn't type 1 fonts, after all, that was just the thread I
pulled at. The issue is CharStrings. Not all fonts have one. In
particular, type 3 fonts don't. Type 3 fonts have a BuildGlyph or
BuildChar procedure which often uses a CharProcs dictionary, but that's
not guaranteed.

I'm now taking the position that a font must have CharStrings or CharProcs
to be used with this body of code. In practice that's unlikely to be a problem.
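That position amounts to a small change in where glyph names are looked up. As a hedged Python sketch (the font is again modelled as a plain dict, and the name `glyph_table` is hypothetical, not taken from the posted code):

```python
def glyph_table(font):
    """Return the dictionary of glyph names the font defines.

    Prefers CharStrings and falls back to CharProcs, the dictionary
    that Type 3 fonts often (but not necessarily) provide.
    """
    for key in ("CharStrings", "CharProcs"):
        if key in font:
            return font[key]
    raise ValueError("font has neither CharStrings nor CharProcs")
```

A font whose BuildGlyph or BuildChar procedure draws glyphs without either dictionary still cannot be handled this way, which is the limitation accepted above.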

