Forum: >>> Magnum BBS <<<

"almost all programming can be viewed as an exercise in caching"

From Robert Prins@21:1/5 to All on Tue Mar 8 23:58:51 2016

XPost: comp.lang.asm.x86, comp.lang.pascal.misc, comp.lang.pl1

Yes, the line that has been in Terje Mathisen's email signature for about a zillion years...

But how far can you actually go caching data?

As an example I will use the development of the program I use to extract all kinds of weird, wonderful and mostly just totally useless statistics from the data I collect when I'm hitchhiking.

Let's start at the beginning...

Sometime, like well over 20 years ago, the timestamps of the oldest files that I
still have is 1994-03-26, I wrote a Turbo Pascal (V3.01a) program to help me to process some of my hitchhike statistics. The program was tiny, just a 49kb .COM file, and did what it had to do. In the years that followed it slowly grew, because I added more and more tables, (Driver: "Have you ever calculated this or
that?" Me: "No, but it might be interesting...") and the last (60th) Turbo Pascal V3.01a version, with files dated @ 1996-08-05, came in at a hefty 54kb. :)

As a side note, in my day-to-day job I work as a PL/I programmer on z/OS, so I also translated the Pascal code to PL/I and the oldest (of currently 35) versions written in that language dates back to 1995-10-22.

From version 46 (1995-07-30) the code contained lots of inline (the old type inline, with hexadecimal codes) code, and from version 51 (1995-10-13) yours truly started adding 0x66 (386) prefixes to it. The last TP 3.01a version (60) saw the light on 1996-08-05, when ...

A fellow contractor at KLM gave me his no longer being used version of TP 6.

Given that my carefully byte counted TP 3.01a inline code sometimes crossed actual Pascal code, the first TP 6 version (obviously) didn't run, and I had to convert all inline code back to normal Pascal, ouch!

The last (53rd) TP 6 version saw the light on 2008-10-06, and the reason for abandoning Turbo Pascal was the fact that it could no longer cope with the amount of data, at the time I had hitched 2,206 rides and the input file had reached a size of 502kb, way too much to handle in the 16-bit DOS environment, even using using a special CONFIG.SYS menu and AUTOEXEC.BAT that left most of memory free, and a unit that extended the TP heap into unused UMBs.

I had to change again, and having been a lifelong user of Pascal (and its rather
more functional grandparent, PL/I), I choose to go for Virtual Pascal, a 32-bit clone of Turbo/Borland Pascal. I did think about FreePascal at the time, and actually gave it a try, but in the end (don't ask me for the then "Why?") it didn't cut it, and having tried it again several times in the past few years, it
still doesn't cut it, the problem *seems* to be that FP uses a different philosophy with regards to alignment of data, which even causes the "pure" Pascal version to generate code that eventually results in a GPF.

Obviously, having recovered from the TP 3.01a "inline" to TP 6 conversion fiasco, I had made sure that nothing of this nature would ever occur again, by keeping two parallel versions of the TP 6 source, a "pure" Pascal one, and one with inline assembler, and other than having to change a few registers in one not entirely "pure" Pascal routine, some integers to longint's and, although this happened somewhat later and wasn't trivial, all Borland 6-byte REALs to IEEE doubles, the first Virtual Pascal version was nearly identical to the last TP6 version. (I'm currently working on the 93rd version coded in VP)

Besides the fact that VP is a 32-bit compiler, which means I will never run out of space again, it's a real clone of the Borland software, which means that the generated code isn't, to say it very politely, much better than the TP 6 (AD 1990) or even TP 3.01a (AD 1985) code, but unlike the Borland compilers, it actually produces an assembler listing that shows the generated code and as a result I've spent quite a bit of time over the last few years hacking code from Pascal to inline assembler. I usually let the compiler do the heavy lifting when
it comes to new code, but once the Pascal code is working, the routine is duplicated between conditional compilation tags, and, using the original assembler listing as a starting point, converted into inline assembler, which in
many, if not most, cases now only vaguely resembles the original code, I try to keep data in registers, avoid the use of the VP RTL where possible, and if ebx, esi and edi, the inter-procedural to-be-saved registers aren't used in the calling chain, I will remove the saving and restoring of them, or save them in the calling procedure, rather than in the called one. The result of all of this fiddling? A program that is around 35% faster and nearly 25% smaller that the pure Pascal version... In (slightly outdated, I started writing this post in December 2015) figures, with the running time averaged over 10 runs?

Inline assembler version: 76,288 bytes, running time 0:00:00.23 sec
Pure Pascal version : 101,376 bytes, running time 0:00:00.35 sec

Yes, all this work for a program that runs in a fraction of a second...

So why this posting?

In the past years I've posted a few snippets of the program:

Subject: Optimization problem
Message-ID: <85akp6F4bpU1@mid.individual.net>
Date: Sun, 16 May 2010 18:28:26 +0000

Thread ran to 70 postings, and in the end I actually found the bug in the code sent to me by Paul Green way back in May 1998, which does not require any sorting, although it (currently) does an excessive(?) amount of allocating and freeing of memory via the RTL - 858 allocates and frees at the last run of the equivalent PL/I program. (This could, in theory and obviously in practice too, be avoided by pre-allocating, once, an array to hold the data)

Subject: x86 to x86-64 conversion issue
Message-ID: <9lppukFgb0U1@mid.individual.net>
Date: Mon, 26 Dec 2011 01:29:22 +0000

Thread ran to 17 postings, I'm using Steven Fürst's solution, due to lack of support for anything over MMX in Virtual Pascal :(

Subject: Replacing single byte in doubleword
Message-ID: <9v7tt9Fn4vU1@mid.individual.net>
Date: Wed, 18 Apr 2012 15:29:19 +0200

Tread ran to 12 postings, I'm using Terje's solution :)

So I'm now going to be a heretic, and abandon x86 in favour of PL/I, as that makes it a bit easier to show the data-structures I'm working with.

When I read in the data, from a simple codepage 437 spaced-out CSV file (currently just over 1Mb, and containing the data for the 3,725 recorded rides I've had since 1980-06-16T07:47), all non-meta-data ends up in a linked-list, that looked like this in the first PL/I program, with a second copy below, the way it looks now. For those who are not familiar with PL/I, a quick explanation about datatypes:

ptr : pointer (32 bit)

char(n) : character string of length(n)

fixed bin(n): basically an integer of n bits with a 1-bit sign. Newer versions of PL/I allow the attribute "unsigned". No restrictions on n (up to 63/64), but in practice only n=7/15/31/64 (signed) and n=8/16/32/64 are used, other values should lead to immediate dismissal...

fixed(p,q) : short for "fixed dec(p,9), fixed point decimal, p total digits, q fractional digits, storage requirements trunc((p+1)/2) bytes, the last four bits
are used for the sign.

bit(n) : bit string of n bits, best used with the attribute "aligned", which makes it occupy one byte.

The 'init (whatever)' will initialize the various fields to "whatever" upon allocation of the structure, a very neat feature!

"sysnull()" is a PL/I built-in function that returns one (0x00000000) of the two
PL/I equivalents of "nil" - the other (original AD 1964, from the days System/360 used 24-bit addressing) built-in is "null()", which returns 0xFF000000.

From LIFT001, 1994-10-22:

dcl 1 lift_list based(lift_ptr),
2 lift_nxt ptr init (sysnull()),
2 lift_prv ptr init (prv_ptr),
2 trip fixed (3) init (liftrec.trip),
2 ride fixed (3) init (liftrec.ride),
2 day char (3)
init (translate(liftrec.day, ' ', ',')),
2 km fixed (7,1) init (liftrec.km),
2 time fixed (7) init (0),
2 v fixed (5,1) init (liftrec.v),
2 vx fixed (15,12) init (0),
2 nat char (3)
init (translate(liftrec.nat, ' ', ',')),
2 type char (4)
init (translate(liftrec.type, ' ', ',')),
2 country char (3)
init (translate(liftrec.country, ' ', ',')),
2 wtime fixed (7) init (-1),
2 split char (1) init (liftrec.split),
2 date,
3 year fixed (5) init (liftrec.date.year),
3 month fixed (3) init (liftrec.date.month),
3 day fixed (3) init (liftrec.date.day),
2 jdn fixed (7) init (-1),
2 ferry bit (1)
init ((liftrec.ferry = ':')) aligned,
2 padding char (35) init ((35)' '),
2 ll_nxt ptr init (sysnull()),
2 tr_nxt ptr init (sysnull()),
2 tc_nxt ptr init (sysnull()),
2 tl_nxt ptr init (sysnull()),
2 cl_nxt ptr init (sysnull()),
2 nl_nxt ptr init (sysnull()),
2 yr_nxt ptr init (sysnull()),
2 yc_nxt ptr init (sysnull());

From LIFT035, 2016-01-01:

dcl 1 lift_list based(lift_ptr),
2 lift_nxt ptr init (sysnull()),
2 trip fixed bin (31) init (liftrec.trip),
2 ride fixed bin (31) init (liftrec.ride),
2 lift_prv ptr init (prv_ptr),
2 id fixed bin (31) init (0),
2 day fixed (3) init (-1),
2 cday char (3) init (liftrec.cday),
2 km fixed (9,1) init (liftrec.km),
2 time fixed (9) init (0),
2 vx fixed (13,9) init (0),
2 nat char (4)
init (translate(liftrec.nat, ' ', ',')),
2 type char (4)
init (translate(liftrec.type, ' ', ',')),
2 cnty char (4)
init (translate(liftrec.cnty, ' ', ',')),
2 ferry bit (1)
init ((liftrec.ferry = ':')) aligned,
2 wferry bit (1)
init ((liftrec.dt = ':')) aligned,
2 wtime fixed bin (31) init (-1),
2 #wa fixed bin (31) init (-1),
2 date,
3 year fixed bin (31) init (liftrec.date.year),
3 month fixed bin (31) init (liftrec.date.month),
3 day fixed bin (31) init (liftrec.date.day),
2 dow fixed bin (31) init (-1),
2 dtime fixed (5) init (-1),
2 atime fixed (5) init (-1),
2 split char (1) init (liftrec.split),
2 cross bit (1)
init ((liftrec.dt = 'x')) aligned,
2 etime fixed (5) init (0),
2 itime fixed (5) init (0),
2 jdn fixed (7) init (-1),
2 aitime fixed (5) init (-1),
2 odo fixed (7,1) init (liftrec.odo),
2 deploc char (43) init (liftrec.deploc),
2 arrloc char (43) init (liftrec.arrloc),
2 padding char (253) init ((253)' '),
2 wid fixed bin (31) init (0),
2 #sd fixed bin (31) init (0),
2 #ld_l fixed bin (31) init (0),
2 wift_nxt ptr init (sysnull()),
2 wift_skip ptr init (sysnull()),
2 owner ptr init (sysnull()),
2 ll_nxt ptr init (sysnull()),
2 tr_nxt ptr init (sysnull()),
2 tc_nxt ptr init (sysnull()),
2 tl_nxt ptr init (sysnull()),
2 xty_ptr ptr init (sysnull()),
2 cl_nxt ptr init (sysnull()),
2 xco_ptr ptr init (sysnull()),
2 nl_nxt ptr init (sysnull()),
2 xna_ptr ptr init (sysnull()),
2 yr_nxt ptr init (sysnull()),
2 yc_nxt ptr init (sysnull());

Other than the fact that the second structure contains a bit more "real" data, it also contains six additional pointers, so let's take all of the pointers out and explain how they are used (to cache things):

First of all we've got the two "standard" pointers for a linked list, to the next and previous items:

- lift_nxt : the address of the next item in the list of ALL rides
- lift_prv : the address of the previous item in the list of ALL rides

To understand some of the pointers below, here's a tiny bit of the reduced-to-interesting-bits-only, input file, compressed avoid line-wraps. The fields end up in a structure called "liftrec", which is basically an overlay over the line, and their meanings are:

a : the trip
bb : the ride in this trip
ccc : the "day" in this trip (or meta-day, or n-th split)
ddddd : the distance for this ride (or partial ride split)
eeee : the pure driving time for this ride
fffff : the rounded average speed for this ride (ddddd/eeee)
gg : the nationality of the driver
hh : the type of the driver (Male/Female/Truck/Van/etc) or type of split ii : the country (or meta-country) for this ride
jjjj : the waiting time for this ride
k : the type of split
LLxLL : departure time of the ride (x may contain meta-data)
mmmmm : arrival time of the ride
nnnnnnnnnn: departure date of the ride, only on first ride of a day

a bb ccc ddddd eeee fffff gg hh ii jjjj k LLxLL mmmmm nnnnnnnnnn
1, 1, 1, 3.2,0.05, 38.4,NL,V, NL, , , 7.47, 7.52,1980-06-16
1, 2, 1,135.0,1.13,111.0,NL,-, NL,0.20, , 8.12, 9.25,
1, 3, 1, 12.0,0.18, 40.0,NL,-, NL,0.33, , 9.58,10.28,
1, 3,! 1, , , , , !C, , ,!,10.08,10.20,
1, 4, 1, 31.0,0.37, 50.3,D, -, *, , ,10.49,11.33,
1, 4, 1a, 23.0,0.24, ,D, -, NL,0.21,*, , ,
1, 4, 1b, 8.0,0.13, ,D, -, D, ,*, , ,
1, 4,! 1, , , , , !B,NL, ,!,11.13,11.20,
1, 5, 1, 19.0,0.16, 71.3,D, -, D, 0.28, ,12.01,12.17,
.
.
.
1,45, 12, 34.0,0.16,127.5,S, -, S, , , 7.25, 7.41,1980-07-06
1,46, 12, 25.0,0.20, 75.0,S, V, S, 5.09, ,12.50,13.10,
1,47, #,924.0,9.30, 97.3,N, -, *, , ,13.41, ,
1,47, 12,687.0,6.30, ,N, -, , ,#, , ,
1,47, 13,237.0,3.00, ,N, -, , ,#, , 3.15,1980-07-07 1,47,12a,157.0,1.40, ,N, -, S, 0.31,*, , ,
1,47,12b,202.0,2.00, ,N, -, DK, ,*, , ,
1,47,12c,413.0,4.02, ,N, -, D, ,*, , ,
1,47,13d,152.0,1.48, ,N, -, NL, ,*, , ,
1,47,! 1, , , , , !B,S, ,!,15:21,17.00,
1,47,! 2, , , , , !B,DK, ,!,19:00,21.00,
1,47,! 3, , , , , , , ,!,21.10,21.15,
1,47,! 4, , , , , , , ,!,23.55, 0.05,
1,47,! 5, , , , , !B,D, ,!, 1.17, 1.27,
2, 1, 1,

More info about the format of the input file, and all its (too) many idiosyncrasies can be found at <http://hitchwiki.org/en/Keeping_statistics>. The
manual for the Pascal programs can be found at <http://hitchwiki.org/en/User:Prino/pascal>. The programs, their sources and input data can be found on my Google Drive at <https://goo.gl/EN3xeN>. You're likely to be only interested in the four .RAR files and maybe in lift & v100+.ods and LibreOffice spreadsheet that contains some graphs. If you want to play with the PL/I version (also there as lift.pli) shout and I'll provide the Compile, Link & Go JCL, or a z/OS load module (Enterprise PL/I V4.3 ARCH(10)) and some "Go" JCL. Input file here is lift.dat (extracted from liftdat.zip) uploaded to a RECFM=VB,LRECL=259 format. Note that the Pascal programs contain both "pure" Pascal as well as "pure inline assembler" versions of most procedures, but that my comments are somewhat sparse...

============
- owner : the address of first item for this ride

This is the first "caching" pointer, pointing to the first item of a ride, i.e. if a ride consists of more than one input record, this pointer will point to the
first item for the ride, avoiding having to use a loop using lift_prv to go back
to the first item. In the above lift_prv in the "1,47,! 5" item would point to the "1,47, #" item

============
- ll_nxt : the address of the next item in the list of ALL rides, skipping any
records containing data relating to day or country splits

This pointer is used to hop from "1, 4, 1" to "1, 5, 1", skipping the three items for ride 4 where field "k" <> ' ', avoiding a forward loop using lift_nxt

============
- tr_nxt : the address of the next ride for the same trip
- tc_nxt : the address of the next ride for the same trip, skipping any records containing data relating to day or country splits

These two pointers, anchored in a not-shown list containing data-per-trip allow immediate access from a trip to the rides in that trip. tr_nxt is analogous to lift_nxt, in that it points to the next item, tc_nxt, like ll_nxt skips "k" <> '
' items

============
- yr_nxt : the address of the next ride for the same year
- yc_nxt : the address of the next ride for the same year, skipping any records containing data relating to day or country splits (currently unused)

These two pointers are similar to tr_nxt and tc_nxt, but connect the items in the lift_list to their corresponding year

============
- tl_nxt : the address of the next ride for the same type of driver or vehicle
- cl_nxt : the address of the next ride for the same country, skipping any records containing data relating to day splits and records containing multiple country totals
- nl_nxt : the address of the next ride for the same nationality of the driver

These three pointers, anchored in (again) not shown lists of Types, Countries and Nationalities build sub-lists of rides with the same Type, Country or Nationality, i.e. tl_nxt in "1, 1, 1" would point to the next item where "hh" =
'V' (the not shown "1,13, 2". Using these pointers I can move through the various sublists of Types, Countries, and Nationalities without using a loop with lift_nxt and a check for the same Type, Country or Nationality.

============
- wift_nxt : the address of the next item in the list of waits
- wift_skip: the address of the next item in the list of waits for a specific type of wait (all/trip/country/year)

These are actually not really "caching" pointers, they were added recently to allow me to use the lift_list items in lieu of a separate list of items that have to do with waiting info.

============
- xty_ptr : the address of the type_list for the type of this ride
- xco_ptr : the address of the cnty_list for the country of this ride

- xna_ptr : the address of the nat_list for the nationality of this ride

During the process of reading in the input file, I create, in-line, four lists,

1) obviously a list of all rides, plus
2) a list of trips, plus
3) a list of years, plus
4) a list of (calendar) days

I also build strings of Types, and Countries and Nationalities, the latter two in alphabetic order. After reading in the entire input file, these strings are parsed and I build lists of Types, Countries, and Nationalities.

During the very first pass through the entire list (generating the "Summary" output data), the xty/co/na_ptr pointers are linked to the correct type/cnty/nat_list items, so that on subsequent passes (generating output per trip and per year), it is no longer necessary to scan the entire type/cnty/nat_list from the top. For what it's worth, access to these lists while generating the "Summary" output is also cached, given that once I'm in a country, it's pretty likely that the next ride will be in the same country, so rather than scanning the cnty_list from the top, I first compare the lift_list.cnty with the current one on the cnty_list.

Again, all this work for a program that runs in less than a second...

Now I have two questions:

1) Without actually looking at the code (it's as mentioned before a bit (too) sparsely commented), but with what I've written above, can anyone think of more ways to squeeze a few more micro-seconds out of the code, or ways to cache even more accesses. I'm especially interested in a way to speed up access to the correct item in the nat_list if I do not catch a cached nationality. I currently
have 87 nationalities, and although the code, here as PL/I:

process_nationality: proc;
/* xn?_ptr will only be set in summary processing */
if ^@summary then
nat_ptr = lift_list.xna_ptr; /* Direct hit! */
else
if s_na_ptr -> nat_list.nat ^= lift_list.nat then
do;
/* No match, scan from top or middle :( */
if lift_list.nat >= nat_mid then
nat_ptr = nat_mptr;
else
nat_ptr = nat_top;

do nat_ptr = nat_ptr repeat nat_nxt
while(nat_ptr ^= sysnull())
until(nat_list.nat = lift_list.nat);
end;

s_na_ptr = nat_ptr;
end;
else
nat_ptr = s_na_ptr;

nat_list.#c = nat_list.#c + 1;
nat_list.km = nat_list.km + lift_list.km;
nat_list.time = nat_list.time + lift_list.time;
end process_nationality;

is for all but the summary very efficient, my "feeble" attempt to halve at least
the number of iterations in the inner loop by testing for a halfway item still result in substantial statement (PL/I) & cycle (x86) counts. Given that there are less than 256 different nationalities in this planet, some smart way of hashing a four-byte nationality code into a unique index would be very nice. I know there is hashing code in my disassembly of Borland's TP6 code that does this to look up keywords and opcodes, but I cannot find it right now, it's probably still on one of many uncataloged floppy disks... However, I do have doubts about the speed of this, compared to the very simple inner loop code of

@02:
cmp [eax + offset nat_list.s_nat], edx
je @03

mov eax, [eax] { nat_list.nat_nxt is first element in list! }
test eax, eax
jnz @02 { there must be a match, so no test, and just jmp? }

@03:
mov nat_ptr, eax

An alternative would be to move the pointers to the nat_list items into an array
and do a binary search, which should yield a result in never more than 8 compares...

To end this ridiculously long post, some final notes:

1) I'd love to get some feedback on the quality (or lack of it) of my code. I'm well aware that most people in this group would rather write code than look at the code of others, but maybe...

2) And for the third time, all this work for a program that runs in less than a second? So a little background information... Demand for PL/I programmers on z/OS is low. After having lost my job more than six years ago, I've had about a dozen interviews, gatecrashed two German GSE PL/I+COBOL meetings, all without much result. In order not to go completely mad, you can do only so much around the house, my wife occasionally, in a friendly way, kicks me out onto the street
with a "Go HH'ing for a few days, I need some rest (but don't get into cars with
young blonde women...)" and when I come back, have entered the data into "lift.dat", have run "allhitch.bat", I usually spend a bit of time looking at the code, and while doing so after my trip in January, I realised that processing the input file led to some code needlessly being executed multiple times, in the above segment of the input file, the first seven "1,47,..." entries are all for a Norwegian male driver (the late Roger Albertsen, who played for FC Den Haag at the time), so why bother checking to see if Norway and
"male" have already been added to the strings of nationalities and types more than once? Looking at the code not having looked at it in some cases for quite a
long period will, at least for me, sometimes make me realize that it can be done
differently, and differently usually means better, and the following is probably
the strongest example of this,

3) While working on the PL/I code sometime in late 2012/early 2013, I realized that cleaning up linked lists could be done far more efficiently if PL/I would allow me to allocate storage in an "AREA" variable, but in discrete amounts. Using such an approach would allow the entire linked list (or far more complex allocated at run-time tree/graph/etc structures) to be cleared in a single PL/I "AREA=EMPTY();" statement, which translates into a single(!) z/OS assembler instruction to move 16-bytes of storage into the control part of the AREA variable. A bit of simple detective work revealed the undocumented layout of the
control part of the AREA, allowing me to test my hypothesis, and on 26 February 2013 I entered an RFE (Request For Enhancement) on the IBM website, see <https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=31632>. My request made it into Enterprise PL/I V4.4, which was released a mere seven months later, in September 2013... In other words, by thinking about what I've been doing, and about how it can be done better, Fortune 1000 companies will (potentially) be saving substantial amounts of CPU charges in the years to come.

To finally end this posting with the hope that no-one of you will mind if I, on occasion, come back with more questions on how to shave a few more milliseconds from my program. ;)

Robert
--
Robert AH Prins
robert(a)prino(d)org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	296
Nodes:	16 (2 / 14)
Uptime:	13:32:24
Calls:	6,645
Files:	12,190
Messages:	5,326,844

"almost all programming can be viewed as an exercise in caching"

Who's Online

System Info