• On the ugly Heapsort implementation which becomes beautiful in VFX Forth

    From Michał Kasprzak@21:1/5 to All on Sat Aug 13 13:58:11 2022
    Hello Waldek, Stephen and others!

    Not so long ago my compatriot professor Waldek Hebisch "has coded" the Forth version of the Heap Sort Algorithm.

    http://www.math.uni.wroc.pl/~hebisch/prog/taxi_hs.fs

    Heap Sort is a beautiful algorithm, and studying it will give a lot to each of you! You will learn what "binary trees" are, when they are "complete", how to represent a binary tree as an ordinary array, why this array should be indexed from 1, and how
    that index relates to moving through the tree. You will learn what "heap condition" a complete binary tree must satisfy in order to be a "heap". Finally, you will learn the surprising, non-obvious thing: why building a heap from arbitrary
    data takes only O(n) worst-case time, which is faster than you thought!
    The Heap Sort Algorithm sorts in O(n*log n) worst-case time, which is better than Quicksort's worst case of O(n^2). Admittedly Merge Sort has this bound too, but Heap Sort sorts in place while Merge Sort needs a second, extra array.
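    The ideas above can be sketched compactly. Below is a minimal hedged heapsort in C (the names sift_down and heapsort_c are mine, not from taxi_hs.fs), using the 1-based indexing described here, so node i has children 2i and 2i+1 and parent i/2:

```c
#include <assert.h>
#include <stddef.h>

/* 1-based access into an ordinary 0-based C array. */
#define AT(a, i) ((a)[(i) - 1])

/* Restore the heap condition below node i, assuming the subtrees
   rooted at its children already satisfy it (max-heap of size n). */
static void sift_down(long *a, size_t i, size_t n)
{
    long v = AT(a, i);
    while (2 * i <= n) {
        size_t j = 2 * i;                     /* left child */
        if (j < n && AT(a, j) < AT(a, j + 1))
            j++;                              /* pick the larger child */
        if (AT(a, j) <= v)
            break;                            /* heap condition holds */
        AT(a, i) = AT(a, j);                  /* promote the child */
        i = j;
    }
    AT(a, i) = v;
}

/* In-place heapsort: the build phase (first loop) is O(n) worst case,
   the extraction phase is O(n*log n). */
static void heapsort_c(long *a, size_t n)
{
    if (n < 2) return;
    for (size_t i = n / 2; i >= 1; i--)       /* build the heap */
        sift_down(a, i, n);
    for (size_t k = n; k > 1; k--) {          /* move the maximum to the end */
        long t = AT(a, 1); AT(a, 1) = AT(a, k); AT(a, k) = t;
        sift_down(a, 1, k - 1);
    }
}
```

    The O(n) build bound comes from summing the sift depths: most nodes sit near the bottom of the tree and sift only a short distance.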

    You can INCLUDE the above file after downloading it (after SET-SIZE to 2GB and reloading 64-bit VFX Forth), or I will paste the 2 words here:

    : sift
    ( rra ra ir ii )
    DUP 2 * 1+ ( rra ra ir ii jj )
    BEGIN
    >R OVER R@ SWAP R> OVER OVER ( rra ra ir ii jj ir jj ir jj )
    > WHILE ( rra ra ir ii jj ir jj )
    1+ > IF ( rra ra ir ii jj )
    >R >R OVER R> SWAP R> SWAP OVER ( rra ra ir ii jj ra jj )
    CELLS + DUP @ SWAP CELL+ @ < IF ( rra ra ir ii jj )
    1 +
    THEN
    THEN ( rra ra ir ii jj )
    >R >R >R OVER OVER R> ROT ROT R> ROT ROT R@
    ROT ROT R> ( rra ra ir ii jj rra ra jj )
    CELLS OVER + @ ROT ( rra ra ir ii jj ra rr_jj rra )
    OVER < IF ( rra ra ir ii jj ra rr_jj )
    >R >R OVER CELLS R> + R> SWAP !
    SWAP DROP DUP DUP + 1+
    ELSE
    DROP DROP DROP ( rra ra ir ii )
    OVER 1 +
    THEN ( rra ra ir ii jj )
    REPEAT
    DROP DROP DROP SWAP DROP ( rra ra ii )
    CELLS + !
    ;

    : heapsort ( ra n )
    0 OVER 1 - 2 / DO
    OVER OVER OVER ( ra n ra n ra )
    I CELLS + @ ROT ROT I ( ra n rra ra n I )
    sift
    -1 +LOOP
    1 - 1 SWAP DO ( ra )
    DUP DUP I CELLS + ( ra ra ra+I )
    DUP @ ( ra ra ra+I rra )
    SWAP ROT ( ra rra ra+I ra )
    @ SWAP ! ( ra rra )
    I 1 = IF
    OVER !
    ELSE
    OVER I 0 ( rra ra I 0 )
    sift
    THEN
    -1 +LOOP
    ;

    As everybody can see, the above code is ugly. Alas, this code is even an example of why not to use Forth. Admit it: the code discourages you.

    You probably want to see how it works. Let's create an array of 10 numbers and sort it:

    create example 13 , 7 , 9 , 12 , 1 , 8 , 2 , 99 , 14 , 3 ,
    : print 10 0 do example i cells + @ . loop ;
    print
    example 10 heapsort
    print

    Note 1: The word heapsort incorrectly leaves an address on the stack. There is no DROP before ; in heapsort.
    There is even more junk on the stack after all the rama_taxis is done.

    Question to Waldek: Basically your heapsort works, so it could be used all over the world, and the question of copyright arises. Do you give your heapsort code to the community for free and agree to any use of it, even without mentioning that
    it is you who "has coded" it?

    Question to all of you: Can you rewrite this algorithm so that it becomes Forth's pride and not Forth's insult?
    Maybe you might want to use variables or locals, or add human-readable comments?
    You don't have to stay in CORE, but even if you want to stay there like Waldek, VARIABLE and ( ) are in CORE, right Waldek? I don't know where this fear of using variables among Forthers comes from.

    But when we use VFX Forth, the above code becomes beautiful.
    It's thanks to Stephen, who designed the VFX code generator.
    Be sure to disassemble words sift and heapsort to see this beauty!

    dasm sift
    ( 00977880 488BD3 ) MOV RDX, RBX
    ( 00977883 48D1E3 ) SHL RBX, # 1
    ( 00977886 48FFC3 ) INC RBX
    ( 00977889 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 0097788D 48895500 ) MOV [RBP], RDX
    ( 00977891 90 ) NOP
    ( 00977892 90 ) NOP
    ( 00977893 90 ) NOP
    ( 00977894 90 ) NOP
    ( 00977895 90 ) NOP
    ( 00977896 90 ) NOP
    ( 00977897 90 ) NOP
    ( 00977898 53 ) PUSH RBX
    ( 00977899 488B1C24 ) MOV RBX, [RSP]
    ( 0097789D 5A ) POP RDX
    ( 0097789E 483B5508 ) CMP RDX, [RBP+08]
    ( 009778A2 488D6DF0 ) LEA RBP, [RBP+-10]
    ( 009778A6 488B4D18 ) MOV RCX, [RBP+18]
    ( 009778AA 48894D00 ) MOV [RBP], RCX
    ( 009778AE 48895D08 ) MOV [RBP+08], RBX
    ( 009778B2 488BDA ) MOV RBX, RDX
    ( 009778B5 0F8DC1000000 ) JNL/GE 0097797C
    ( 009778BB 48FFC3 ) INC RBX
    ( 009778BE 483B5D00 ) CMP RBX, [RBP]
    ( 009778C2 488B5D08 ) MOV RBX, [RBP+08]
    ( 009778C6 488D6D10 ) LEA RBP, [RBP+10]
    ( 009778CA 0F8D2A000000 ) JNL/GE 009778FA
    ( 009778D0 53 ) PUSH RBX
    ( 009778D1 48FF7500 ) PUSH QWORD [RBP]
    ( 009778D5 5B ) POP RBX
    ( 009778D6 5A ) POP RDX
    ( 009778D7 488BCA ) MOV RCX, RDX
    ( 009778DA 48C1E103 ) SHL RCX, # 03
    ( 009778DE 48034D10 ) ADD RCX, [RBP+10]
    ( 009778E2 488B01 ) MOV RAX, 0 [RCX]
    ( 009778E5 483B4108 ) CMP RAX, [RCX+08]
    ( 009778E9 48895D00 ) MOV [RBP], RBX
    ( 009778ED 488BDA ) MOV RBX, RDX
    ( 009778F0 0F8D04000000 ) JNL/GE 009778FA
    ( 009778F6 4883C301 ) ADD RBX, # 01
    ( 009778FA 53 ) PUSH RBX
    ( 009778FB 48FF7500 ) PUSH QWORD [RBP]
    ( 009778FF 48FF7508 ) PUSH QWORD [RBP+08]
    ( 00977903 5B ) POP RBX
    ( 00977904 5A ) POP RDX
    ( 00977905 488B0C24 ) MOV RCX, [RSP]
    ( 00977909 58 ) POP RAX
    ( 0097790A 48C1E003 ) SHL RAX, # 03
    ( 0097790E 48034510 ) ADD RAX, [RBP+10]
    ( 00977912 4C8B00 ) MOV R8, 0 [RAX]
    ( 00977915 4C3B4518 ) CMP R8, [RBP+18]
    ( 00977919 488D6DF0 ) LEA RBP, [RBP+-10]
    ( 0097791D 488B4520 ) MOV RAX, [RBP+20]
    ( 00977921 48894500 ) MOV [RBP], RAX
    ( 00977925 48894D08 ) MOV [RBP+08], RCX
    ( 00977929 48895510 ) MOV [RBP+10], RDX
    ( 0097792D 48895D18 ) MOV [RBP+18], RBX
    ( 00977931 498BD8 ) MOV RBX, R8
    ( 00977934 0F8E31000000 ) JLE/NG 0097796B
    ( 0097793A 53 ) PUSH RBX
    ( 0097793B 48FF7500 ) PUSH QWORD [RBP]
    ( 0097793F 488B5D10 ) MOV RBX, [RBP+10]
    ( 00977943 48C1E303 ) SHL RBX, # 03
    ( 00977947 5A ) POP RDX
    ( 00977948 4803DA ) ADD RBX, RDX
    ( 0097794B 5A ) POP RDX
    ( 0097794C 488913 ) MOV 0 [RBX], RDX
    ( 0097794F 488B5D08 ) MOV RBX, [RBP+08]
    ( 00977953 48035D08 ) ADD RBX, [RBP+08]
    ( 00977957 48FFC3 ) INC RBX
    ( 0097795A 488B5508 ) MOV RDX, [RBP+08]
    ( 0097795E 48895510 ) MOV [RBP+10], RDX
    ( 00977962 488D6D10 ) LEA RBP, [RBP+10]
    ( 00977966 E90C000000 ) JMP 00977977
    ( 0097796B 488B5D18 ) MOV RBX, [RBP+18]
    ( 0097796F 4883C301 ) ADD RBX, # 01
    ( 00977973 488D6D10 ) LEA RBP, [RBP+10]
    ( 00977977 E91CFFFFFF ) JMP 00977898
    ( 0097797C 488B5D10 ) MOV RBX, [RBP+10]
    ( 00977980 48C1E303 ) SHL RBX, # 03
    ( 00977984 48035D20 ) ADD RBX, [RBP+20]
    ( 00977988 488B5528 ) MOV RDX, [RBP+28]
    ( 0097798C 488913 ) MOV 0 [RBX], RDX
    ( 0097798F 488B5D30 ) MOV RBX, [RBP+30]
    ( 00977993 488D6D38 ) LEA RBP, [RBP+38]
    ( 00977997 C3 ) RET/NEXT
    ( 280 bytes, 86 instructions )

    dasm heapsort
    ( 009779D0 488BD3 ) MOV RDX, RBX
    ( 009779D3 4883C3FF ) ADD RBX, # -01
    ( 009779D7 B902000000 ) MOV ECX, # 00000002
    ( 009779DC 488BC3 ) MOV RAX, RBX
    ( 009779DF 488BDA ) MOV RBX, RDX
    ( 009779E2 4899 ) CQO
    ( 009779E4 48F7F9 ) IDIV RCX
    ( 009779E7 488D6DF0 ) LEA RBP, [RBP+-10]
    ( 009779EB 48C7450000000000 ) MOV QWord [RBP], # 00000000
    ( 009779F3 48895D08 ) MOV [RBP+08], RBX
    ( 009779F7 488BD8 ) MOV RBX, RAX
    ( 009779FA E81195A9FF487A970000000 CALL 00410F10 (DO) 0000000000977A48
    ( 00977A07 90 ) NOP
    ( 00977A08 498BD6 ) MOV RDX, R14
    ( 00977A0B 48C1E203 ) SHL RDX, # 03
    ( 00977A0F 48035500 ) ADD RDX, [RBP]
    ( 00977A13 498BCE ) MOV RCX, R14
    ( 00977A16 488D6DE0 ) LEA RBP, [RBP+-20]
    ( 00977A1A 48895D00 ) MOV [RBP], RBX
    ( 00977A1E 488B4520 ) MOV RAX, [RBP+20]
    ( 00977A22 48894508 ) MOV [RBP+08], RAX
    ( 00977A26 488B02 ) MOV RAX, 0 [RDX]
    ( 00977A29 48894510 ) MOV [RBP+10], RAX
    ( 00977A2D 48895D18 ) MOV [RBP+18], RBX
    ( 00977A31 488BD9 ) MOV RBX, RCX
    ( 00977A34 E847FEFFFF ) CALL 00977880 SIFT
    ( 00977A39 4983C6FF ) ADD R14, # -01
    ( 00977A3D 4983C7FF ) ADD R15, # -01
    ( 00977A41 71C5 ) JNO 00977A08
    ( 00977A43 415E ) POP R14
    ( 00977A45 415F ) POP R15
    ( 00977A47 58 ) POP RAX
    ( 00977A48 4883C3FF ) ADD RBX, # -01
    ( 00977A4C 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 00977A50 48C7450001000000 ) MOV QWord [RBP], # 00000001
    ( 00977A58 E8B394A9FFD77A970000000 CALL 00410F10 (DO) 0000000000977AD7
    ( 00977A65 90 ) NOP
    ( 00977A66 90 ) NOP
    ( 00977A67 90 ) NOP
    ( 00977A68 498BD6 ) MOV RDX, R14
    ( 00977A6B 48C1E203 ) SHL RDX, # 03
    ( 00977A6F 4803D3 ) ADD RDX, RBX
    ( 00977A72 488B0A ) MOV RCX, 0 [RDX]
    ( 00977A75 488B03 ) MOV RAX, 0 [RBX]
    ( 00977A78 488902 ) MOV 0 [RDX], RAX
    ( 00977A7B 498BD6 ) MOV RDX, R14
    ( 00977A7E 4883FA01 ) CMP RDX, # 01
    ( 00977A82 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 00977A86 48895D00 ) MOV [RBP], RBX
    ( 00977A8A 488BD9 ) MOV RBX, RCX
    ( 00977A8D 0F8514000000 ) JNZ/NE 00977AA7
    ( 00977A93 488B5500 ) MOV RDX, [RBP]
    ( 00977A97 48891A ) MOV 0 [RDX], RBX
    ( 00977A9A 488B5D00 ) MOV RBX, [RBP]
    ( 00977A9E 488D6D08 ) LEA RBP, [RBP+08]
    ( 00977AA2 E921000000 ) JMP 00977AC8
    ( 00977AA7 498BD6 ) MOV RDX, R14
    ( 00977AAA 488D6DE8 ) LEA RBP, [RBP+-18]
    ( 00977AAE 48895500 ) MOV [RBP], RDX
    ( 00977AB2 488B5518 ) MOV RDX, [RBP+18]
    ( 00977AB6 48895508 ) MOV [RBP+08], RDX
    ( 00977ABA 48895D10 ) MOV [RBP+10], RBX
    ( 00977ABE BB00000000 ) MOV EBX, # 00000000
    ( 00977AC3 E8B8FDFFFF ) CALL 00977880 SIFT
    ( 00977AC8 4983C6FF ) ADD R14, # -01
    ( 00977ACC 4983C7FF ) ADD R15, # -01
    ( 00977AD0 7196 ) JNO 00977A68
    ( 00977AD2 415E ) POP R14
    ( 00977AD4 415F ) POP R15
    ( 00977AD6 58 ) POP RAX
    ( 00977AD7 C3 ) RET/NEXT
    ( 264 bytes, 71 instructions )

    Everyone can now see how powerful Stephen's optimizer in VFX Forth is.
    It turned even ugly Forth code into beautiful assembly code.
    I will not show off my broken English trying to describe how beautifully VFX Forth shortened and rearranged everything.

    For example, let's look at the inside of the loop in our PRINT word:
    ( 009778D8 498BD6 ) MOV RDX, R14
    ( 009778DB 488B14D530789700 ) MOV RDX, [+RDX*8+00977830]
    ( 009778E3 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 009778E7 48895D00 ) MOV [RBP], RBX
    ( 009778EB 488BDA ) MOV RBX, RDX
    ( 009778EE E8555DAAFF ) CALL 0041D648 .
    You can see that the 5 words example i cells + @ are compiled into almost a single assembly instruction!
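    As a hedged aside (not from the thread): in C the same fetch is just an array subscript, and an optimizing compiler routinely folds the scaling and base address into one scaled-index load, much like the VFX output above:

```c
#include <assert.h>
#include <stdio.h>

long example[10] = {13, 7, 9, 12, 1, 8, 2, 99, 14, 3};

/* The Forth phrase "example i cells + @ ." corresponds to reading and
   printing example[i]; at -O2 the load typically compiles to a single
   scaled-index mov of the form [base + i*8]. */
void print_all(void)
{
    for (int i = 0; i < 10; i++)
        printf("%ld ", example[i]);
    printf("\n");
}
```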

    Question to Stephen Pelc: After all, your wonderful optimization fairy tale can be carried on. Why not use the single instruction:
    MOV RDX, [8*R14+00977830]
    instead of two of yours:
    MOV RDX, R14
    MOV RDX, [8*RDX+00977830]
    or even load the target register RBX right away:
    MOV RBX, [8*R14+00977830]
    instead of your 3 instructions?
    MOV RDX, R14
    MOV RDX, [8*RDX+00977830]
    MOV RBX, RDX

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Pelc@21:1/5 to All on Sun Aug 14 09:11:20 2022
    On 13 Aug 2022 at 22:58:11 CEST, "Michał Kasprzak" <siarczek83@gmail.com> wrote:
    Question to Stephen Pelc: After all, your wonderful optimization fairy tale can be carried on. Why not use the single instruction:
    MOV RDX, [8*R14+00977830]
    instead of two of yours:
    MOV RDX, R14
    MOV RDX, [8*RDX+00977830]
    or even load the target register RBX right away:
    MOV RBX, [8*R14+00977830]
    instead of your 3 instructions?
    MOV RDX, R14
    MOV RDX, [8*RDX+00977830]
    MOV RBX, RDX

    R14 is a special register which must not be changed. The optimiser simply copies it rather than marking it as read-only. That'll get fixed in two CPUs' time. The line MOV RBX, RDX comes from the stack shuffle at a basic-block boundary and is not part of the complex load. Whether the destination register choice is optimal depends very much on coding style.

    Stephen
    --
    Stephen Pelc, stephen@vfxforth.com
    MicroProcessor Engineering, Ltd. - More Real, Less Time
    133 Hill Lane, Southampton SO15 5AF, England
    tel: +44 (0)23 8063 1441, +44 (0)78 0390 3612,
    +34 649 662 974
    http://www.mpeforth.com - free VFX Forth downloads

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Hendrix@21:1/5 to siarc...@gmail.com on Sun Aug 14 02:34:37 2022
    On Saturday, August 13, 2022 at 10:58:12 PM UTC+2, siarc...@gmail.com wrote: [..]
    Question to all of you: Can you rewrite this algorithm so that it becomes Forth's pride and not Forth's insult?
    Maybe you might want to use variables or locals or arrange human comments?
    You don't have to stay in CORE, but even if you want to stay like Waldek, then VARIABLE and ( ) are rather in Core, right Waldek? I don't know where this fear of using variables by Forthers comes from.
    [..]

    The most beautiful thing about this is that the unmodified
    Forth code works even on my machine using a different compiler :--)

    FORTH> ' heapsort idis
    $01340180 : heapsort
    $0134018A pop rbx
    $0134018B lea rdi, [rbx -1 +] qword
    $0134018F push rbx
    $01340190 mov rbx, 0 d#
    $01340197 mov rcx, rbx
    $0134019A mov rbx, rdi
    $0134019D sar rbx, 1 b#
    $013401A1 call (DO) offset NEAR
    $013401AB lea rax, [rax 0 +] qword
    $013401B0 pop rdi
    $013401B1 mov rax, [rbp 0 +] qword
    $013401B5 mov rdx, [rbp 0 +] qword
    $013401B9 push rdi
    $013401BA push rbx
    $013401BB push [rdi rax*8] qword
    $013401BE push rdi
    $013401BF push rbx
    $013401C0 push rdx
    $013401C1 lea rbx, [rdx*2 1 +] qword
    $013401C9 lea rax, [rax 0 +] qword
    $013401D0 pop rdi
    $013401D1 pop rax
    $013401D2 cmp rbx, rax
    $013401D5 push rax
    $013401D6 push rdi
    $013401D7 push rbx
    $013401D8 push rax
    $013401D9 jge $01340255 offset NEAR
    $013401DF pop rdi
    $013401E0 lea rax, [rbx 1 +] qword
    $013401E4 cmp rax, rdi
    $013401E7 jge $01340208 offset NEAR
    $013401ED pop rbx
    $013401EE pop rdi
    $013401EF pop rax
    $013401F0 pop rdx
    $013401F1 mov rcx, [rdx rbx*8] qword
    $013401F5 cmp rcx, [rdx rbx*8 8 +] qword
    $013401FA push rdx
    $013401FB push rax
    $013401FC push rdi
    $013401FD jge $01340207 offset NEAR
    $01340203 lea rbx, [rbx 1 +] qword
    $01340207 push rbx
    $01340208 pop rbx
    $01340209 pop rdi
    $0134020A pop rax
    $0134020B pop rdx
    $0134020C pop r9
    $0134020E mov rcx, [rdx rbx*8] qword
    $01340212 cmp rcx, r9
    $01340215 push r9
    $01340217 push rdx
    $01340218 push rax
    $01340219 push rdi
    $0134021A push rbx
    $0134021B push rdx
    $0134021C push [rdx rbx*8] qword
    $0134021F mov rbx, rcx
    $01340222 jle $0134023F offset NEAR
    $01340228 pop rbx
    $01340229 pop rdi
    $0134022A pop rax
    $0134022B pop rdx
    $0134022C mov [rdi rdx*8] qword, rbx
    $01340230 push rax
    $01340231 lea rbx, [rax*2 1 +] qword
    $01340239 push rbx
    $0134023A jmp $0134024D offset NEAR
    $0134023F pop rbx
    $01340240 pop rbx
    $01340241 pop rbx
    $01340242 pop rbx
    $01340243 mov rdi, [rsp] qword
    $01340247 push rbx
    $01340248 lea rbx, [rdi 1 +] qword
    $0134024C push rbx
    $0134024D pop rbx
    $0134024E jmp $013401D0 offset NEAR
    $01340253 push rbx
    $01340254 pop rbx
    $01340255 pop rbx
    $01340256 pop rbx
    $01340257 pop rbx
    $01340258 pop rdi
    $01340259 pop rdi
    $0134025A pop rax
    $0134025B mov [rdi rbx*8] qword, rax
    $0134025F mov rbx, -1 d#
    $01340266 add [rbp 0 +] qword, rbx
    $0134026A add [rbp 8 +] qword, rbx
    $0134026E pop rbx
    $0134026F jno $013401B0 offset NEAR
    $01340275 add rbp, #24 b#
    $01340279 push 1 b#
    $0134027B lea rbx, [rbx -1 +] qword
    $0134027F pop rcx
    $01340280 call (DO) offset NEAR
    $0134028A nop
    $0134028B lea rax, [rax 0 +] qword
    $01340290 mov rdi, [rbp 0 +] qword
    $01340294 mov rax, [rbx rdi*8] qword
    $01340298 mov rdx, [rbx] qword
    $0134029B mov [rbx rdi*8] qword, rdx
    $0134029F mov rdi, [rbp 0 +] qword
    $013402A3 cmp rdi, 1 b#
    $013402A7 push rbx
    $013402A8 mov rbx, rcx
    $013402AB mov rcx, rax
    $013402AE mov rbx, rcx
    $013402B1 jne $013402C3 offset NEAR
    $013402B7 pop rdi
    $013402B8 mov [rdi] qword, rbx
    $013402BB mov rbx, rdi
    $013402BE jmp $01340369 offset NEAR
    $013402C3 pop rdi
    $013402C4 mov rax, [rbp 0 +] qword
    $013402C8 push rdi
    $013402C9 push rbx
    $013402CA push rdi
    $013402CB push rax
    $013402CC xor rcx, rcx
    $013402CF mov rbx, 1 d#
    $013402D6 nop
    $013402D7 nop
    $013402D8 pop rdi
    $013402D9 cmp rbx, rdi
    $013402DC push rdi
    $013402DD push rcx
    $013402DE push rbx
    $013402DF push rdi
    $013402E0 jge $0134035E offset NEAR
    $013402E6 pop rdi
    $013402E7 lea rax, [rbx 1 +] qword
    $013402EB cmp rax, rdi
    $013402EE jge $0134030F offset NEAR
    $013402F4 pop rbx
    $013402F5 pop rdi
    $013402F6 pop rax
    $013402F7 pop rdx
    $013402F8 mov rcx, [rdx rbx*8] qword
    $013402FC cmp rcx, [rdx rbx*8 8 +] qword
    $01340301 push rdx
    $01340302 push rax
    $01340303 push rdi
    $01340304 jge $0134030E offset NEAR
    $0134030A lea rbx, [rbx 1 +] qword
    $0134030E push rbx
    $0134030F pop rbx
    $01340310 pop rdi
    $01340311 pop rax
    $01340312 pop rdx
    $01340313 pop r9
    $01340315 mov rcx, [rdx rbx*8] qword
    $01340319 cmp rcx, r9
    $0134031C push r9
    $0134031E push rdx
    $0134031F push rax
    $01340320 push rdi
    $01340321 push rbx
    $01340322 push rdx
    $01340323 push [rdx rbx*8] qword
    $01340326 mov rbx, rcx
    $01340329 jle $01340346 offset NEAR
    $0134032F pop rbx
    $01340330 pop rdi
    $01340331 pop rax
    $01340332 pop rdx
    $01340333 mov [rdi rdx*8] qword, rbx
    $01340337 push rax
    $01340338 lea rbx, [rax*2 1 +] qword
    $01340340 push rbx
    ok
    FORTH> create example 13 , 7 , 9 , 12 , 1 , 8 , 2 , 99 , 14 , 3 , ok
    FORTH> : print 10 0 do example i cells + @ . loop ; ok
    FORTH> print 13 7 9 12 1 8 2 99 14 3 ok
    FORTH> example 10 heapsort ok
    [1]FORTH> print 1 2 3 7 8 9 12 13 14 99 ok
    [1]FORTH> . 20187136 ok

    -marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to siarczek83@gmail.com on Sun Aug 14 10:07:11 2022
    Michał Kasprzak <siarczek83@gmail.com> writes:
    Hello Waldek, Stephen and others!

    Not so long ago my compatriot professor Waldek Hebisch "has coded" the Forth version of the Heap Sort Algorithm.

    http://www.math.uni.wroc.pl/~hebisch/prog/taxi_hs.fs

    Heap Sort is a beautiful algorithm, and studying it will give a lot to each of you! You will learn what "binary trees" are, when they are "complete", how to represent a binary tree as an ordinary array, why this array should be indexed from 1, and how that index relates to moving through the tree. You will learn what "heap condition" a complete binary tree must satisfy in order to be a "heap". Finally, you will learn the surprising, non-obvious thing: why building a heap from arbitrary data takes only O(n) worst-case time, which is faster than you thought!
    The Heap Sort Algorithm sorts in O(n*log n) worst-case time, which is better than Quicksort's worst case of O(n^2). Admittedly Merge Sort has this bound too, but Heap Sort sorts in place while Merge Sort needs a second, extra array.

    But Quicksort and Mergesort have good spatial locality, and Heapsort
    has bad spatial locality. Using the taxi_hs.fs program linked to
    above, and the taxi-qs.fs variant that uses quicksort instead of
    heapsort <http://www.complang.tuwien.ac.at/forth/programs/taxi-qs.fs>,
    we see this nicely:

                 Hash table                ?sort        Heapsort     Quicksort
                 taxi2.4th                 taxi.cc      taxi_hs.fs   taxi-qs.fs
    gforth-fast  vfx64        g++ -O2      vfx64        vfx64
    187_333_032  218_295_645  147_818_881  837_764_079  124_043_270  l2 accesses
    102_958_191  117_248_139    6_476_984  589_376_272    8_858_405  l2 misses

    The sort used by taxi.cc is probably quicksort or mergesort.

    The bottom line (on a Ryzen 5800X) is also not great for taxi-qs.fs:

        s  cycles          instructions
     3.35  16_039_014_129  10_085_237_291  gforth-fast 9000 taxi2.4th
     1.56   7_353_893_755   2_627_580_253  vfx64 9000 taxi2.4th
     1.70   8_053_294_428  10_664_441_167  a.out (taxi.cc binary)
    10.71  50_793_570_097  64_996_544_038  vfx64 taxi_hs.fs
     2.56  11_446_320_529  17_253_502_741  vfx64 taxi-qs.fs

    The number of executed instructions is also a lot higher for taxi_hs.fs
    than for taxi-qs.fs (and the taxi.cc binary), so it is not just the
    cache misses that slow heapsort down.

    <snip>
    As everybody can see the above code is ugly. Alas, this code is even an example of why not to use Forth.

    One might argue about "how not to use Forth", although the suggestions for alternatives will differ. Waldek Hebisch suggests using locals, but
    of course this is anathema to purists. One might consider heapsort
    another challenge to those who claim that every problem can be solved
    without locals by good factoring.
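    To make the challenge concrete, here is a hedged C transcription of the sift word with named variables taken from its stack comments (rra, ra, ir, ii, jj); this is roughly what a locals-based Forth version expresses, modulo syntax. Note that taxi_hs.fs actually uses 0-based indexing, so the children of node i are 2i+1 and 2i+2:

```c
#include <assert.h>

/* Sift the saved value rra down from node ii of the heap ra[0..ir-1].
   The comments name the Forth phrases each line corresponds to. */
void sift(long rra, long *ra, long ir, long ii)
{
    long jj = 2 * ii + 1;                 /* DUP 2 * 1+ : left child */
    while (jj < ir) {                     /* > WHILE */
        if (jj + 1 < ir && ra[jj] < ra[jj + 1])
            jj++;                         /* right child is larger: 1 + */
        if (rra < ra[jj]) {               /* OVER < IF */
            ra[ii] = ra[jj];              /* promote the child */
            ii = jj;
            jj = 2 * jj + 1;              /* DUP DUP + 1+ */
        } else {
            jj = ir + 1;                  /* OVER 1 + : leave the loop */
        }
    }
    ra[ii] = rra;                         /* CELLS + ! */
}
```

    With the variables named, the control structure of the original becomes visible at a glance, which is exactly the argument for locals.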

    But when we use VFX Forth, the above code becomes beautiful.
    It's because of Stephen who designed the VFX code generator.
    Be sure to disassemble words sift and heapsort to see this beauty!

    Actually lxf produces nicer code (but because the 64-bit version has
    not been released yet, it does not work for the taxi problem with the parameters we have been using):

    VFX64 lxf
    MOV RDX, RBX imul eax , ebx , # 2h
    SHL RBX, # 1 inc eax
    INC RBX mov [ebp-4h] , ebx
    LEA RBP, [RBP+-08] mov ebx , eax
    MOV [RBP], RDX lea ebp , [ebp-4h]
    NOP cmp [ebp+4h] , ebx
    NOP mov eax , [ebp+4h]
    NOP mov [ebp-8h] , eax
    NOP mov [ebp-4h] , ebx
    NOP lea ebp , [ebp-8h]
    NOP jle "0804FC80"
    NOP inc ebx
    PUSH RBX cmp [ebp] , ebx
    MOV RBX, [RSP] mov ebx , [ebp+4h]
    POP RDX lea ebp , [ebp+8h]
    CMP RDX, [RBP+08] jle "0804FC31"
    LEA RBP, [RBP+-10] mov eax , ebx
    MOV RCX, [RBP+18] shl eax , 2h
    MOV [RBP], RCX add eax , [ebp+8h]
    MOV [RBP+08], RBX mov ecx , eax
    MOV RBX, RDX mov ecx , [ecx]
    JNL/GE 0097797C add eax , # 4h
    INC RBX mov eax , [eax]
    CMP RBX, [RBP] cmp ecx , eax
    MOV RBX, [RBP+08] jge "0804FC31"
    LEA RBP, [RBP+10] add ebx , # 1h
    JNL/GE 009778FA mov eax , ebx
    PUSH RBX shl eax , 2h
    PUSH QWORD [RBP] add eax , [ebp+8h]
    POP RBX mov eax , [eax]
    POP RDX cmp [ebp+Ch] , eax
    MOV RCX, RDX mov ecx , [ebp+8h]
    SHL RCX, # 03 mov [ebp-8h] , ecx
    ADD RCX, [RBP+10] mov [ebp-4h] , ebx
    MOV RAX, 0 [RCX] mov ebx , eax
    CMP RAX, [RCX+08] lea ebp , [ebp-8h]
    MOV [RBP], RBX jge "0804FC72"
    MOV RBX, RDX mov eax , [ebp+8h]
    JNL/GE 009778FA shl eax , 2h
    ADD RBX, # 01 add eax , [ebp]
    PUSH RBX mov [eax] , ebx
    PUSH QWORD [RBP] mov ebx , [ebp+4h]
    PUSH QWORD [RBP+08] add ebx , [ebp+4h]
    POP RBX inc ebx
    POP RDX mov eax , [ebp+4h]
    MOV RCX, [RSP] mov [ebp+8h] , eax
    POP RAX lea ebp , [ebp+8h]
    SHL RAX, # 03 jmp "0804FC7B"
    ADD RAX, [RBP+10] mov ebx , [ebp+Ch]
    MOV R8, 0 [RAX] add ebx , # 1h
    CMP R8, [RBP+18] lea ebp , [ebp+8h]
    LEA RBP, [RBP+-10] jmp "0804FBF0"
    MOV RAX, [RBP+20] shl dword [ebp+8h] , 2h
    MOV [RBP], RAX mov ebx , [ebp+10h]
    MOV [RBP+08], RCX add ebx , [ebp+8h]
    MOV [RBP+10], RDX mov eax , [ebp+14h]
    MOV [RBP+18], RBX mov [ebx] , eax
    MOV RBX, R8 mov ebx , [ebp+18h]
    JLE/NG 0097796B lea ebp , [ebp+1Ch]
    PUSH RBX ret near
    PUSH QWORD [RBP]
    MOV RBX, [RBP+10]
    SHL RBX, # 03
    POP RDX
    ADD RBX, RDX
    POP RDX
    MOV 0 [RBX], RDX
    MOV RBX, [RBP+08]
    ADD RBX, [RBP+08]
    INC RBX
    MOV RDX, [RBP+08]
    MOV [RBP+10], RDX
    LEA RBP, [RBP+10]
    JMP 00977977
    MOV RBX, [RBP+18]
    ADD RBX, # 01
    LEA RBP, [RBP+10]
    JMP 00977898
    MOV RBX, [RBP+10]
    SHL RBX, # 03
    ADD RBX, [RBP+20]
    MOV RDX, [RBP+28]
    MOV 0 [RBX], RDX
    MOV RBX, [RBP+30]
    LEA RBP, [RBP+38]
    RET/NEXT
    ( 280 bytes, 86 instructions )          ( 178 bytes, 50 instructions )

    The reason is that lxf is analytical about the return stack, while VFX
    is not. But HEAPSORT does not use the return stack (except for counted
    loops), so lxf does not do significantly better because of that.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2022: https://euro.theforth.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From antispam@math.uni.wroc.pl@21:1/5 to siarczek83@gmail.com on Mon Aug 15 00:41:16 2022
    Michał Kasprzak <siarczek83@gmail.com> wrote:
    Hello Waldek, Stephen and others!

    Not so long ago my compatriot professor Waldek Hebisch "has coded" the Forth version of the Heap Sort Algorithm.

    http://www.math.uni.wroc.pl/~hebisch/prog/taxi_hs.fs

    <snip>

    As everybody can see the above code is ugly. Alas, this code is even an example of why not to use Forth. Admit, the code discourages you.

    You probably want to see how it works. Let's create an array of 10 numbers and sort it:

    create example 13 , 7 , 9 , 12 , 1 , 8 , 2 , 99 , 14 , 3 ,
    : print 10 0 do example i cells + @ . loop ;
    print
    example 10 heapsort
    print

    Note 1: The word heapsort incorrectly leaves an address on the stack. There is no DROP before ; in heapsort.
    There is even more junk on the stack after all the rama_taxis is done.

    I added the missing DROPs and did a few (very minimal) cleanups.

    Question to Waldek: But basically your heapsort works so it could be used all over the world. The question of copyright arises. Do you give your heapsort code to the community for free and agree to any use of your heapsort, even without mentioning that
    it is you who "has coded" it?

    Yes. This is a pretty standard algorithm and the coding is pretty
    straightforward.

    Question to all of you: Can you rewrite this algorithm so that it becomes Forth's pride and not Forth's insult?
    Maybe you might want to use variables or locals or arrange human comments?

    In the other discussion I provided a version using locals. Concerning comments, the data stack is used to store variables, and the stack comments tell you which variables
    are on the stack at the important points. Note that the actual computations in both versions are the same; the only difference is that the first version uses stack manipulations to access and store the variables.

    There is also different version at Rosetta code:

    https://rosettacode.org/wiki/Sorting_algorithms/Heapsort#Forth


    You don't have to stay in CORE, but even if you want to stay like Waldek, then VARIABLE and ( ) are rather in Core, right Waldek? I don't know where this fear of using variables by Forthers comes from.

    Well, one reason to write the code the way I did was to give the compiler
    (optimizer) a reasonable chance to generate good code. Namely,
    a sequence of stack/return-stack operations with no control
    flow between them means that the compiler can compute the
    stack effect of each such sequence. I keep the stack balanced,
    and in each iteration I access the same stack locations,
    so the compiler can cache the used parts of the stack in registers and
    replace all stack manipulations by register transfers.
    After that there will be some redundant transfers, but
    rather simple optimizations can remove them. So with a
    sufficiently smart compiler the version with locals
    (which I consider much nicer) and the CORE version should
    give the same code.

    Going beyond CORE (but with no locals) I would use PICK and
    TUCK. VARIABLE has both a runtime cost and makes a program
    _much_ harder to understand: in my ugly version one needs
    to look at a few 2-3 line fragments to understand what
    happens; with liberal use of VARIABLE in a bigger program
    one may be forced to look at hundreds or thousands of
    lines and analyse their interactions.

    But when we use VFX Forth, the above code becomes beautiful.
    It's because of Stephen who designed the VFX code generator.
    Be sure to disassemble words sift and heapsort to see this beauty!

    Nice, but I admit that I hoped for a better result. The version
    with locals translated to C gives 21 computational
    instructions. C passes arguments in registers, so sticking
    to the Forth convention we would need 5 additional instructions,
    giving 26 in total. AFAICS the VFX version has 79 computational
    instructions (and 7 NOPs for padding). In particular
    I see 33 accesses to the data stack and a bunch of adjustments
    of the data stack pointer, while 4 accesses and 1 adjustment are
    enough.


    dasm sift
    <snip>

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From minforth@arcor.de@21:1/5 to siarc...@gmail.com on Mon Aug 15 09:54:53 2022
    siarc...@gmail.com wrote on Monday, August 15, 2022 at 18:42:29 UTC+2:


    Seems Gavino is back from therapy ...

  • From =?UTF-8?Q?Micha=C5=82_Kasprzak?=@21:1/5 to All on Mon Aug 15 09:42:27 2022
    Thank you Waldek on behalf of the community and my own for your heapsort code! :-)

    I have been working on the code since this morning. I hope I am not the only one; if not all of you, then at least Zombie (Anton's new nickname) is also looking for holes and ways to rewrite the code ;)
    So far I have found a non-critical mistake in the implementation of the algorithm. Take it easy, sorting is still working well thanks to a favorable coincidence.

    Note 2: The algorithm starts with building the heap. Running from the end through the tree nodes, it tries to restore the heap condition at each node.
    Leaf nodes always satisfy the heap condition, so we don't have to fix them and can omit leaves from the heap-building process. That is why we start the loop at the index of the last parent node.
    Remember that in practice, unfortunately, we index the array from 0, not from 1 as the theory would like.
    In our situation, the index of the last parent node is n/2-1. You can check it by drawing on a sheet of paper. Unfortunately, at the beginning of heapsort Waldek calculates it like this:
    OVER 1 - 2 /
    This means that Waldek calculates (n-1)/2. For example, if we wanted to sort 11 items, n = 11. The correct formula gives n/2-1 = 4 (we use integer arithmetic); Waldek's formula gives (n-1)/2 = 5.
    Check on paper to see who is right. The last parent node has index 4, counting from 0. Node 5 is a leaf.
    Instead, 2/ 1- should follow OVER at the beginning of the heapsort word (by the way, both 2/ and 1- are Core words):
    OVER 2/ 1-
    Despite this mistake Waldek's heapsort code works, but sometimes it does unnecessary extra work.
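    You can also check the claim mechanically with a throwaway C helper (my illustration, not part of the Forth code):

```c
/* 0-based indexing: node i is a parent iff its left child 2*i+1 exists,
   i.e. 2*i+1 < n.  The last parent node is therefore at index n/2 - 1. */
static int is_parent(int i, int n) { return 2 * i + 1 < n; }
static int last_parent(int n)      { return n / 2 - 1; }
```

    For n = 11 this gives last_parent = 4; node 4 still has child 9, while node 5, Waldek's (n-1)/2, is already a leaf. Note that for even n the two formulas coincide, which is why the mistake stays hidden half the time.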


    Meanwhile, while waiting for more results...
    I found Marcel Hendrix's post in another thread from 16 years ago. It shows what Marcel was happy about and, with fond memories, lets you see how much progress our community has made ;)
    The discussion was about the "forth Insertion Sort". From here to the end of my post is a quote from Marcel:


    This one is complete, ready-to-run, and works. It has been tested on all the Forths I think worthwhile.

    -marcel

    -- ==========================
    ( descending INSERTION-SORT )

    \ ANEW -insert

    0 [IF] From Wikipedia, the free encyclopedia

    Insertion sort is much less efficient on large lists than the more
    advanced algorithms such as quicksort, heapsort, or merge sort, but it
    has various advantages:

    - Efficient on (quite) small data sets
    - Efficient on data sets which are already substantially sorted
    - More efficient in practice than most other simple O(n^2) algorithms such
    as selection sort or bubble sort: the average time is n^2/4 and it is
    linear in the best case
    - Stable (does not change the relative order of elements with equal keys)
    - In-place (only requires a constant amount O(1) of extra memory space)

    It is an online algorithm, in that it can sort a list as it receives
    it. In abstract terms, each iteration of an insertion sort removes an
    element from the input data, inserting it at the correct position in
    the already sorted list, until no elements are left in the input.

    Sorting is done in-place. The result array after k iterations contains
    the first k entries of the input array and is sorted. In each step, the
    first remaining entry of the input is removed, inserted into the result
    at the right position, thus extending the result.

    The most common variant, which operates on arrays, can be described as:

    Suppose we have a method called insert designed to insert a value into a
    sorted sequence at the beginning of an array. It operates by starting at
    the end of the sequence and shifting each element one place to the right
    until a suitable position is found for the new element. It has the side
    effect of overwriting the value stored immediately after the sorted
    sequence in the array. To perform insertion sort, start at the left end
    of the array and invoke insert to insert each element encountered into
    its correct position. The ordered sequence into which we insert it is
    stored at the beginning of the array in the set of indexes already
    examined. Each insertion overwrites a single value, but this is okay
    because it's the value we're inserting.
    [THEN]

    ( ----------------------------------------------------------------------------------- )

    : @+ ( addr -- addr' val ) DUP CELL+ SWAP @ ;
    : 0>= ( n -- bool ) 0< 0= ;

    : 'a[jj] ( -- addr ) S" jj CELLS array + " EVALUATE ; IMMEDIATE
    : 'a[jj+1] ( -- addr ) S" jj 1+ CELLS array + " EVALUATE ; IMMEDIATE
    : a[I] ( -- ) S" I CELLS array + @ " EVALUATE ; IMMEDIATE
    : a[jj] ( -- ) S" jj CELLS array + @ " EVALUATE ; IMMEDIATE

    ( ----------------------------------------------------------------------------------- )

    : insertion-sort ( array size -- )
        0 0 LOCALS| temp jj size array |
        size 1 ?DO
            I TO jj
            a[I] TO temp
            BEGIN  jj 1- DUP TO jj 0>=
            WHILE  temp a[jj] >
            WHILE  'a[jj] @+ SWAP !
            REPEAT THEN
            temp 'a[jj+1] !
        LOOP ;

    : .DATA ( array count -- ) CR 0 ?DO DUP @ . CELL+ LOOP DROP ;

    ( ----------------------------------------------------------------------------------- )

    9 CONSTANT mycount
    CREATE myarray
    9 , 3 , 4 , 12 , 7 , 1 , 4 , 3 , 2 ,

    : test ( -- )
    myarray mycount insertion-sort
    myarray mycount .DATA ;

    ( ----------------------------------------------------------------------------------- )
    ( tested with Win32Forth 4.2, Win32Forth 6.10.04, gForth, SwiftForth, VFX, iForth 2.0 )

    \ FORTH> test
    \ 12 9 7 4 4 3 3 2 1 ok

  • From none) (albert@21:1/5 to siarczek83@gmail.com on Tue Aug 16 13:11:51 2022
    In article <277ad6ed-e704-410d-bd31-335f2a7e8b17n@googlegroups.com>,
    Michał Kasprzak <siarczek83@gmail.com> wrote:
    Hello Waldek, Stephen and others!

    Not so long ago my compatriot professor Waldek Hebisch "has coded" the
    Forth version of the Heap Sort Algorithm.

    http://www.math.uni.wroc.pl/~hebisch/prog/taxi_hs.fs

    Heap Sort is a beautiful algorithm which learning will give a lot to each
    of you! You will learn what "binary trees" are, when they are "complete",
    how to represent a binary tree as an ordinary array, why this array should
    be indexed from 1, and what relates this binary index to moving through
    the tree. You will learn what "heap condition" a complete binary tree must
    satisfy in order to be a "heap". Finally, you will learn the surprising
    non-obvious thing, why you are creating a heap of arbitrary data in O(n)
    pessimistic time, which is faster than you thought!
    The Heap Sort Algorithm sorts in pessimistic O(n*log n) time which is
    better than QuickSort which does it in O(n^2). Admittedly Merge Sort has
    this time too, but Heap Sort sorts in place and Merge Sort needs extra
    second memory.

    Not true for a typical use case.
    See my MERGE-SORT in ciforth. Merge Sort should properly be used with
    records of variable size, pointing to each other via links.
    Then it needs no extra storage. The implementation I suggest uses at
    worst O(log(n)) memory for keeping track.
    Note that quicksort needs worst case O(n) for this, which can be
    diminished heuristically -not guaranteed- to O(log(n)).
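    For readers who have not seen it, a linked-list merge sort of this kind can be sketched in C roughly as follows (my illustration, not the ciforth MERGE-SORT itself):

```c
#include <stddef.h>
#include <stdint.h>

struct node { int64_t key; struct node *next; };

/* Merge two sorted lists by relinking existing nodes; no element copies. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node head = {0, NULL}, *tail = &head;
    while (a && b) {
        if (a->key <= b->key) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Top-down merge sort: split with a slow/fast runner, sort halves, merge.
   The only extra memory is the O(log n) recursion depth from halving. */
static struct node *merge_sort(struct node *list)
{
    if (!list || !list->next) return list;
    struct node *slow = list, *fast = list->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    struct node *right = slow->next;
    slow->next = NULL;                    /* cut the list in half */
    return merge(merge_sort(list), merge_sort(right));
}
```

    Since the merge relinks nodes in place, no second array is needed, only the O(log(n)) bookkeeping on the recursion stack.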

    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

  • From Paul Rubin@21:1/5 to albert@cherry. on Tue Aug 16 13:15:09 2022
    albert@cherry.(none) (albert) writes:
    its worst case O(log(n)) memory for keeping track.
    Note that quicksort needs worst case O(n) for this, which can be
    diminished heuristically -not guaranteed- to O(log(n)).

    Doesn't quicksort guarantee O(log(n)) if you always sort the smaller
    partition first after pivoting?

  • From =?UTF-8?Q?Micha=C5=82_Kasprzak?=@21:1/5 to All on Tue Aug 16 15:47:41 2022
    Hello Waldek, I am still analyzing your heapsort code.
    I must praise you for how you swap the values in the nodes of the subtree in the sift word.
    I really like how you delay storing the subtree's root value until you reach the target location :-)

    It looks like you are probably proud of your sift word and paid less attention to the heapsort word ;)

    Your sift word uses 4 parameters, which you named as follows:
    rra is the value of the ii-th array element,
    ra is the address of the array,
    ir is the total number of elements in the array,
    ii is the index of the subtree's root node, which is where we want to restore the heap condition.
    You also use the notation jj for the index of the left child of node ii.

    Note 3: Your word sift controls index ranges very well and also works well for a single-element array, when invoked with the parameters ir = 1 and ii = 0, right?
    In the sorting loop in the heapsort word near the end of the code you put the following condition IF:
    I 1 = IF
    OVER !
    ELSE
    OVER I 0 ( rra ra I 0 )
    sift
    THEN
    This IF condition is tested n-1 times in the loop. You do these costly tests just to distinguish the case I = 1.
    But this is not needed, because the case I = 1 is handled by sift just as well as the other cases.
    Even if you wanted to speed up this one case, it is definitely not worth the price of n-1 checks.
    But perhaps the real reason for singling out this case was some concern about the execution of this edge case.
    Maybe when you started writing the sift word you didn't know how it would end, and sift evolved as you applied optimizations.
    Please correct me if I am wrong, but this whole IF is not needed; what you do after ELSE is enough.
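    To see why I = 1 needs no special handling, consider sift rendered in C (a hypothetical sketch following the rra ra ir ii stack comment):

```c
#include <stddef.h>
#include <stdint.h>

/* Restore the heap condition in ra[0..ir-1] starting at node ii,
   placing the value rra (0-based indexing). */
static void sift(int64_t rra, int64_t *ra, size_t ir, size_t ii)
{
    size_t jj = 2 * ii + 1;
    while (jj < ir) {
        if (jj + 1 < ir && ra[jj] < ra[jj + 1]) jj++;
        if (rra >= ra[jj]) break;
        ra[ii] = ra[jj];
        ii = jj;
        jj = 2 * ii + 1;
    }
    /* with ir = 1 and ii = 0 the loop body never runs (jj = 1 >= ir),
       so only this store remains - exactly what  I 1 = IF OVER !  does */
    ra[ii] = rra;
}
```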

  • From none) (albert@21:1/5 to no.email@nospam.invalid on Wed Aug 17 00:57:34 2022
    In article <87zgg4x7w2.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    albert@cherry.(none) (albert) writes:
    its worst case O(log(n)) memory for keeping track.
    Note that quicksort needs worst case O(n) for this, which can be
    diminished heuristically -not guaranteed- to O(log(n)).

    Doesn't quicksort guarantee O(log(n)) if you always sort the smaller
    partition first after pivoting?

    To quote my numerical math professor:
    In a pathological case the smaller partition can be O(1) size.
    That makes for O(n) steps. By randomizing the partitioning, the chance
    however may become vanishingly small.

    Knuth gives an example of a naive implementation that takes O(n)
    steps for an already sorted set.

    There is a good reason to select the smaller partition though.
    It cannot prevent the time to explode, but it restricts the storage.
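    In C the technique looks roughly like this (an illustrative sketch, not code from the thread): recurse only into the smaller partition and iterate on the larger one, so the pending-work stack stays O(log(n)) even when the running time degenerates.

```c
#include <stdint.h>

/* Quicksort on a[lo..hi], recursing into the smaller partition and
   looping on the larger one, bounding recursion depth to O(log n). */
static void qsort_small_first(int64_t *a, long lo, long hi)
{
    while (lo < hi) {
        int64_t pivot = a[lo + (hi - lo) / 2];
        long i = lo, j = hi;
        while (i <= j) {                      /* Hoare-style partition */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int64_t t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        if (j - lo < hi - i) {
            qsort_small_first(a, lo, j);      /* smaller left part */
            lo = i;                           /* iterate on the right */
        } else {
            qsort_small_first(a, i, hi);      /* smaller right part */
            hi = j;                           /* iterate on the left */
        }
    }
}
```

    This restricts only the storage, as noted above; a pathological input can still make the running time quadratic.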

    Groetjes Albert
    --

  • From Paul Rubin@21:1/5 to albert@cherry. on Tue Aug 16 17:28:04 2022
    albert@cherry.(none) (albert) writes:
    To quote my numerical math professor:
    In a pathological case the smaller partition can be O(1) size.
    That makes for O(n) steps.

    I thought we were talking about the amount of auxiliary space needed,
    not the runtime. Yes, the runtime can be quadratic in the worst case:
    http://www.cs.dartmouth.edu/~doug/mdmspe.pdf

  • From antispam@math.uni.wroc.pl@21:1/5 to siarczek83@gmail.com on Wed Aug 17 12:07:49 2022
    Michał Kasprzak <siarczek83@gmail.com> wrote:
    Hello Waldek, I have been still analyzing your heapsort code.
    I must praise you for how you swap the values in the nodes of the subtree in the sift word.
    I really like how you delay saving the subtree's root value until you reach the target location :-)

    To be clear: that is classic heapsort; it is in textbooks in this
    form. No invention on my part.

    It looks like you are probably proud of your sift word and paid less attention to the heapsort word ;)

    Your sift word uses 4 parametrs which you named as follows:
    rra is the value of the ii-th array element,
    ra is the address of the array,
    ir is the total number of elements in the array,
    ii is the index of the subtree's root node, which is where we want to restore the heap condition.
    You also use the notation jj for the index of the left child of node ii.

    Note 3: Your word sift controls index ranges very well and also works well for a single-element array, when invoked with the parameters ir = 1 and ii = 0, right?
    In the sorting loop in the heapsort word near the end of the code you put the following condition IF:
    I 1 = IF
    OVER !
    ELSE
    OVER I 0 ( rra ra I 0 )
    sift
    THEN
    This IF condition is tested n-1 times in the loop. You do these costly tests just to distinguish the case of I = 1.
    But this is not needed because the case I = 1 is handled as well as the rest of the cases by sift.
    If you wanted to get faster acceleration of this one case, it is definitely not worth doing it for the price of n-1 checks.

    Well, concerning "costly": the test for I = 1 should be two machine
    instructions. If it is more costly, then you have a bigger problem.
    And if you want the best speed, use this test as the loop termination
    test.

    But perhaps the real reason for highlighting this case was some concern about the execution of this edge case.
    Maybe when you started writing the sift word, you didn't know how it would end, and the sift evolved - you applied optimizations.

    No. I have similar implementations in several languages. Except for
    Forth, in the other languages I simply return from the I = 1 branch and
    there is no separate loop termination test. In Forth you cannot simply
    return: there are stack cleanups, and the way to return from or exit a
    loop depends on which kind of loop you have. So I did not try to
    optimize the loop termination test.

    Let me add that from the start it was clear that the solution of the
    taxi problem using heapsort would be slower than the version using
    hash tables. Rather, I wanted to show that the solution, while
    slower, is still practical. And to see how well Forth
    implementations can execute this code. Clearly, the efficiency
    of the Forth implementation has _much_ more influence than the
    extra test. In fact, a smart enough Forth could see that the I = 1
    case terminates the loop, skip the normal termination test and
    arrange a jump out of the loop.

    Please correct me if I am wrong, but all this IF is not needed and all you do after ELSE is enough.

    Yes, ELSE branch is enough.

    --
    Waldek Hebisch

  • From =?UTF-8?Q?Micha=C5=82_Kasprzak?=@21:1/5 to So in VFX Forth we can on Mon Aug 22 12:37:34 2022
    I checked Mouse's (Waldek's new nickname) information about how good his code looks when compiled as C. On my Fedora Linux 36 (which is known to be very convenient for programmers because it has the latest software versions installed out of the box) I
    have installed:
    gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
    I got similarly good results as Mouse. Although gcc achieves excellent optimization results, they are still far from perfect.
    gcc didn't really show off when we replaced jj = ir with break in sift; this should have helped gcc, but it destroyed the optimization of the entire code ;(

    I have a gift for all of you: Mouse's Ugly Heapsort written in x64 assembly language, which works in all native 64-bit Forths.
    It is supposed to be the best, because the fastest, heapsort ever created. We define the best in such a way that if, after the presentation of the code and your criticism (which the whole world counts on), no one can speed up or improve this code, then the
    code is practically the best.
    Since this code will be used by future generations, long after we are gone, try to improve it to leave something for the world after our vacation this year. Of course, all this code is intended for free use by anyone, without even mentioning the authors.
    First, I will describe the programming technique used.
    Since the code is supposed to work on every native 64-bit Forth, we cannot use the built-in assembler because this one is different in every Forth.
    We will use an external assembler which will generate a binary code that can be inserted into any 64-bit native Forth.
    In this case, I used NASM that you can install on Fedora with the command: sudo dnf install nasm
    Source files in NASM are very easy to write, just remember to put one directive at the beginning of the file: BITS 64
    The default NASM format is binary (plain code, no extra data), so we don't need to specify the format when assembling. We can add the generation of a so-called listing file, which can help us make choices about the use of certain instructions. We will be
    dealing with three files written in assembly language: heapsort.asm, prolog.asm, epilog.asm. We assemble these files with the following commands:
    nasm -l heapsort.lst heapsort.asm
    nasm -l prolog.lst prolog.asm
    nasm -l epilog.lst epilog.asm
    After assembly we will get 3 short binary files: heapsort, prolog, epilog (without extension).
    The code we write must be relocatable, which is very simple and natural in x64 assembly.
    The code we write must also have a property called "Passthrough": execution enters at the beginning and always leaves at the end. Such code can easily be combined with other code having this property, and also inlined in a native Forth.
    The question remains how to pass parameters from any Forth to our assembly code.
    Each native Forth can implement the parameter stack differently. These can be typical downward-growing stacks, or upward-growing stacks, which have the added benefit that two stacks fit in one block of memory. These stacks may or may not cache the top element
    in a CPU register.
    We need to find some common ground among all native Forths. Native means that a Forth calls words using the processor's call instruction, so each native Forth uses the CPU stack as the return stack.
    Each Forth can move an element from the parameter stack to the return stack using the word >R
    So we can move 2 arguments onto the return stack using >R >R and the assembly language code can handle them.
    Even though Pony's Ciforth does not live up to the natural assumption of using the CPU stack as a return stack, he is not in a lost position either. In Ciforth the CPU stack is used as the parameter stack, and the parameters we need are already there,
    but in reverse order. So in Ciforth you just need to use SWAP instead of >R >R
    In the future, when creating the word heapsort in Forth, we could automatically detect which situation we are dealing with and, using conditional compilation, insert >R >R or SWAP
    We will write our assembler heapsort in the file heapsort.asm. It takes two parameters: the address of the array to be sorted in rdi and the number of elements to sort in rsi. The choice of registers is in principle free; these were chosen to conform to
    the Linux C calling convention.
    We need to save the CPU registers used by Forth before running our code. Since we don't generally know which registers in different Forths need to be kept, we keep all registers used by our code.
    The code in the prolog.asm file takes care of this. In addition to saving the used registers, it loads the prepared parameters from the processor stack to the rdi and rsi we set.
    It's easy to see what epilog.asm does. Of course, it restores the stored CPU registers and removes used parameters from the return stack.
    Of course, if you know your Forth you can simplify the prolog.asm and epilog.asm files. You don't have to keep all the registers used then, just the ones your Forth requires. You don't even need to move parameters to the return stack using >R >R, just
    transfer them from the parameter stack to the rdi and rsi registers immediately.
    I am using VFX Forth to test these solutions.
    In VFX Forth it is very easy to inline binary code when defining a word using the DATA-FILE word. If your Forth doesn't have that word, I highly recommend creating it. Ultimately, however, we will not need it. In the final version portable between
    different Forths, we will use the comma code insertion method.
    So in VFX Forth we can write the word heapsort as follows:

    : heapsort
    >R >R
    [
    DATA-FILE /home/fedora/prolog DROP
    DATA-FILE /home/fedora/heapsort DROP
    DATA-FILE /home/fedora/epilog DROP
    ] ;

    Do you like my solution? I am asking for criticism.
    Below are 3 files: heapsort.asm, prolog.asm, epilog.asm. See if you can make any improvements to the code in heapsort.asm, even if it is only one CPU cycle or one byte.
    As my English is very poor, I would also like to ask you to fix the comments in the heapsort.asm file. These comments, in my opinion, are of great value.
    If you try to improve and do not get anything done, share this information - this is also a valuable result!
    All language programmers will be able to use this code, let them have something from our Forthers group!

    ---------- start of file: heapsort.asm, cut below this line ----------
    BITS 64
    ; rdi = array's address
    ; rsi = array's number of elements = n
    heapsort: mov rax, rsi
    shr rax, 1
    ; nothing to sort if there is 0 or 1 element in the array
    jz the_end
    ; now we are sure that always n>=2
    ; r10 = index of the last parent node
    ; r10 is also the counter of the heap building loop
    lea r10, [rax-1]
    ; rbx = index of the left child of the last parent node
    ; rbx is also index of the left child of r10 node in the loop
    lea rbx, [rax+rax-1]
    ; heap building loop
    ; sift the counter node's value stored in r9 to the proper position
    ; r9 = array[r10]
    build_loop: mov r9, qword [rdi+8*r10]
    ; r11 is index of target place for sifted values
    ; we start sifting from the counter node's index
    mov r11, r10
    ; rax is the index of a current child of r11 node
    ; we start rax with the index of the left child
    mov rax, rbx
    jmp sift1_while
    ; we will find the child with bigger value and store its index in rax
    ; we will load this bigger value to rdx
    ; rcx holds the index of the right child
    sift1: lea rcx, [rax+1]
    mov rdx, qword [rdi+8*rax]
    ; note that the right child may not exist
    ; then we consider the left child to be bigger
    cmp rcx, rsi
    jnb sift1_check
    ; we load the right child node's value to r8
    mov r8, qword [rdi+8*rcx]
    ; if the right child node's value is bigger than the left child's
    cmp r8, rdx
    jle sift1_check
    ; then we will proceed the right child node
    mov rdx, r8
    mov rax, rcx
    ; compare the value of the current child node with the sifting node
    sift1_check: cmp r9, rdx
    ; if the sifting node's value is bigger then we end sifting
    jnl build_next
    ; we store the child's value into the parent node's place
    mov qword [rdi+8*r11], rdx
    ; we make the child's node the new parent's node
    mov r11, rax
    lea rax, [rax+rax+1]
    ; do sift if there is still at least one child
    sift1_while: cmp rax, rsi
    jb sift1
    ; we store sifted value into the target place and repeat the loop
    build_next: sub r10, 1
    mov qword [rdi+8*r11], r9
    sub rbx, 2
    cmp r10, -1
    jne build_loop
    ; the sorting loop starts with n-1
    ; rax will hold the counter of the sorting loop
    lea rax, [rsi-1]
    ; the sorting loop
    ; rdx = array[0]
    ; r9 = array[counter]
    sort_loop: mov rdx, qword [rdi]
    mov r9, qword [rdi+8*rax]
    ; we will sift the node r10=0 whose value is in r9
    xor r10d, r10d
    ; we store array[counter]=array[0]
    mov qword [rdi+8*rax], rdx
    ; now we will store in rdx the left child node's index
    mov edx, 1
    jmp sift2_while
    ; sift2 works like sift1 with changed registers
    ; we will find the child with bigger value and store its index in rdx
    ; we will load this bigger value to rsi
    ; rcx holds the index of the right child
    sift2: lea rcx, [rdx+1]
    mov rsi, qword [rdi+8*rdx]
    ; note that the right child may not exist
    ; then we consider the left child to be bigger
    cmp rcx, rax
    jnb sift2_check
    ; we load the right child node's value to r8
    mov r8, qword [rdi+8*rcx]
    ; if the right child node's value is bigger than the left child's
    cmp r8, rsi
    jle sift2_check
    ; then we will proceed the right child node
    mov rsi, r8
    mov rdx, rcx
    ; compare the value of the current child node with the sifting node
    sift2_check: cmp r9, rsi
    ; if the sifting node's value is bigger then we end sifting
    jnl sort_next
    ; we store the child's value into the parent node's place
    mov qword [rdi+8*r10], rsi
    ; we make the child's node the new parent's node
    mov r10, rdx
    lea rdx, [rdx+rdx+1]
    ; do sift if there is still at least one child
    sift2_while: cmp rdx, rax
    jb sift2
    ; we store sifted value into the target place
    sort_next: mov qword [rdi+8*r10], r9
    ; next step of the sorting loop
    sub rax, 1
    ; the sorting loop ends with 1
    jnz sort_loop
    the_end:
    ---------- end of file: heapsort.asm, cut above this line ----------


    ---------- start of file: prolog.asm, cut below this line ----------
    BITS 64
    ; We assume that the following were pushed onto the CPU stack:
    ; [rsp+8] = array's number of elements = n
    ; [rsp] = array's address
    ; We keep all used registers for the best compatibility
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push r8
    push r9
    push r10
    push r11
    ; Now we can load the heapsort arguments
    mov rdi, [rsp+80]
    mov rsi, [rsp+88]
    ---------- end of file: prolog.asm, cut above this line ----------


    ---------- start of file: epilog.asm, cut below this line ----------
    BITS 64
    ; We restore the contents of all used registers
    pop r11
    pop r10
    pop r9
    pop r8
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax
    ; We remove the provided arguments from the stack
    add rsp, 16
    ---------- end of file: epilog.asm, cut above this line ----------

  • From =?UTF-8?Q?Micha=C5=82_Kasprzak?=@21:1/5 to All on Tue Aug 23 12:15:52 2022
    Look at the following code snippet in the heapsort.asm file starting with the build_next label:
    build_next: sub r10, 1
    mov qword [rdi+8*r11], r9
    sub rbx, 2
    cmp r10, -1
    jne build_loop
    I was in a hurry to deliver my gift to you as soon as possible and I missed correcting it ;)
    Of course, it is worth replacing the above piece of code with the following code:
    build_next: mov qword [rdi+8*r11], r9
    sub rbx, 2
    sub r10, 1
    jnc build_loop
    The code is now 4 bytes shorter, one instruction smaller, and maybe faster :) Below I am inserting a revised version of the heapsort.asm file.

    By the way, although "dec reg64" is 1 byte shorter than "sub reg64, 1", I don't use "dec reg64" on purpose. This is because of the known partial-flag stall problem that you can read about on the internet.

    Using the otherwise unused rbp register instead of the extended r8-r11 registers does not save any code length in the heapsort.asm file. Although push/pop rbp is 1 byte shorter than push/pop r8, which would benefit prolog.asm and epilog.asm, once we target a
    specific Forth it will probably turn out that every Forth needs rbp preserved while none of them needs r8 preserved, so it is not worth forcing yourself to use rbp ;)

    Be brave, criticize :)

    ---------- start of file: heapsort.asm, cut below this line ----------
    BITS 64
    ; rdi = array's address
    ; rsi = array's number of elements = n
    heapsort: mov rax, rsi
    shr rax, 1
    ; nothing to sort if there is 0 or 1 element in the array
    jz the_end
    ; now we are sure that always n>=2
    ; r10 = index of the last parent node
    ; r10 is also the counter of the heap building loop
    lea r10, [rax-1]
    ; rbx = index of the left child of the last parent node
    ; rbx is also index of the left child of r10 node in the loop
    lea rbx, [rax+rax-1]
    ; heap building loop
    ; sift the counter node's value stored in r9 to the proper position
    ; r9 = array[r10]
    build_loop: mov r9, qword [rdi+8*r10]
    ; r11 is index of target place for sifted values
    ; we start sifting from the counter node's index
    mov r11, r10
    ; rax is the index of a current child of r11 node
    ; we start rax with the index of the left child
    mov rax, rbx
    jmp sift1_while
    ; we will find the child with bigger value and store its index in rax
    ; we will load this bigger value to rdx
    ; rcx holds the index of the right child
    sift1: lea rcx, [rax+1]
    mov rdx, qword [rdi+8*rax]
    ; note that the right child may not exist
    ; then we consider the left child to be bigger
    cmp rcx, rsi
    jnb sift1_check
    ; we load the right child node's value to r8
    mov r8, qword [rdi+8*rcx]
    ; if the right child node's value is bigger than the left child's
    cmp r8, rdx
    jle sift1_check
    ; then we will proceed with the right child node
    mov rdx, r8
    mov rax, rcx
    ; compare the values of the current child node with the sifting node
    sift1_check: cmp r9, rdx
    ; if the sifting node's value is not smaller, we end sifting
    jnl build_next
    ; we store the child's value into the parent node's place
    mov qword [rdi+8*r11], rdx
    ; we make the child's node the new parent's node
    mov r11, rax
    lea rax, [rax+rax+1]
    ; do sift if there is still at least one child
    sift1_while: cmp rax, rsi
    jb sift1
    ; we store sifted value into the target place and repeat the loop
    build_next: mov qword [rdi+8*r11], r9
    sub rbx, 2
    sub r10, 1
    jnc build_loop
    ; the sorting loop starts with n-1
    ; rax will hold the counter of the sorting loop
    lea rax, [rsi-1]
    ; the sorting loop
    ; rdx = array[0]
    ; r9 = array[counter]
    sort_loop: mov rdx, qword [rdi]
    mov r9, qword [rdi+8*rax]
    ; we will sift the node r10=0 whose value is in r9
    xor r10d, r10d
    ; we store array[counter]=array[0]
    mov qword [rdi+8*rax], rdx
    ; now we will store in rdx the left child node's index
    mov edx, 1
    jmp sift2_while
    ; sift2 works like sift1 with changed registers
    ; we will find the child with bigger value and store its index in rdx
    ; we will load this bigger value to rsi
    ; rcx holds the index of the right child
    sift2: lea rcx, [rdx+1]
    mov rsi, qword [rdi+8*rdx]
    ; note that the right child may not exist
    ; then we consider the left child to be bigger
    cmp rcx, rax
    jnb sift2_check
    ; we load the right child node's value to r8
    mov r8, qword [rdi+8*rcx]
    ; if the right child node's value is bigger than the left child's
    cmp r8, rsi
    jle sift2_check
    ; then we will proceed with the right child node
    mov rsi, r8
    mov rdx, rcx
    ; compare the values of the current child node with the sifting node
    sift2_check: cmp r9, rsi
    ; if the sifting node's value is not smaller, we end sifting
    jnl sort_next
    ; we store the child's value into the parent node's place
    mov qword [rdi+8*r10], rsi
    ; we make the child's node the new parent's node
    mov r10, rdx
    lea rdx, [rdx+rdx+1]
    ; do sift if there is still at least one child
    sift2_while: cmp rdx, rax
    jb sift2
    ; we store sifted value into the target place
    sort_next: mov qword [rdi+8*r10], r9
    ; next step of the sorting loop
    sub rax, 1
    ; the sorting loop ends with 1
    jnz sort_loop
    the_end:
    ---------- end of file: heapsort.asm, cut above this line ----------

  • From Marcel Hendrix@21:1/5 to none albert on Tue Aug 23 13:04:17 2022
    On Tuesday, August 23, 2022 at 9:32:56 PM UTC+2, none albert wrote:
    [..]
    This is quite a valuable technique. It is an alternative to calling c-routines
    from Forth.

    Hardly. I am not going to write, e.g., the Faddeev-Leverrier algorithm
    in assembler.

    -marcel

  • From none albert@21:1/5 to siarczek83@gmail.com on Tue Aug 23 21:32:53 2022
    In article <c3e7bae3-a98a-439e-a6ec-e7e7e3175f8cn@googlegroups.com>,
    Michał Kasprzak <siarczek83@gmail.com> wrote:
    I checked Mouse's (Waldek's new nickname) report on how good his code looks when compiled in C. On my Fedora
    Linux 36 (which is known to be very convenient for programmers because it has the latest software versions installed out
    of the box) I have installed:
    gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
    I got similarly good results as Mouse. Although gcc achieves excellent optimization results, they are still far from perfect.
    gcc didn't really shine when we replaced jj = ir with break in sift; that change should have helped gcc, but it destroyed the
    optimization of the entire code ;(

    I have a gift for all of you: Mouse's Ugly Heapsort written in x64 assembly language that works in all native 64-bit Forths.
    It is meant to be the best, that is, the fastest heapsort ever created. We define "best" in such a way that if,
    after the presentation of the code and your criticism (which the whole world counts on), no one can speed up or improve
    this code, then the code is practically the best.
    Since this code will be used by future generations, long after we are gone, try to improve it and leave something for the
    world after our vacation this year. Of course, all this code is intended for free use by anyone, without even mentioning
    the authors.
    First, I will describe the programming technique used.
    Since the code is supposed to work on every native 64-bit Forth, we cannot use the built-in assembler, because it is
    different in every Forth.
    We will use an external assembler which generates binary code that can be inserted into any 64-bit native Forth.
    In this case, I used NASM, which you can install on Fedora with the command: sudo dnf install nasm
    Source files in NASM are very easy to write; just remember to put one directive at the beginning of the file: BITS 64
    The default NASM output format is binary (plain code, no extra data), so we don't need to specify the format when assembling. We
    can also generate a so-called listing file, which can help us make choices about the use of certain instructions.
    We will be dealing with three files written in assembly language: heapsort.asm, prolog.asm, epilog.asm. We assemble these
    files with the following commands:
    nasm -l heapsort.lst heapsort.asm
    nasm -l prolog.lst prolog.asm
    nasm -l epilog.lst epilog.asm

    This is quite a valuable technique. It is an alternative to calling c-routines from Forth.
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

  • From none albert@21:1/5 to mhx@iae.nl on Tue Aug 23 23:47:47 2022
    In article <05ffffa4-cb98-42d0-8228-5b6d0a743b44n@googlegroups.com>,
    Marcel Hendrix <mhx@iae.nl> wrote:
    On Tuesday, August 23, 2022 at 9:32:56 PM UTC+2, none albert wrote:
    [..]
    This is quite a valuable technique. It is an alternative to calling c-routines
    from Forth.

    Hardly. I am not going to write, e.g., the Faddeev-Leverrier algorithm
    in assembler.

    You don't have to. Assembler trumps a high-level language.
    Write that algorithm in C, then use the assembler output to
    add it to Forth. No OS-specific techniques required.



    Groetjes Albert
    --
    "in our communism country Viet Nam, people are forced to be
    alive and in the western country like US, people are free to
    die from Covid 19 lol" duc ha
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

  • From Michał Kasprzak@21:1/5 to All on Thu Aug 25 10:17:26 2022
    Look at the xor r10d, r10d instruction in the heapsort.asm file.
    This known technique of limiting code length by working on the lower 32 bits of 64-bit registers, which works great for the rax-rdx registers, unfortunately does not work for the extended registers r8-r15: even the 32-bit xor r10d, r10d requires a REX prefix byte, so the entire instruction is 3 bytes long.
    It's a pity, but we can swap the roles of the r10 and rax registers in the sorting-loop part of the code, write xor eax, eax (2 bytes) instead of xor r10d, r10d, and get code that is 1 byte shorter :)
    And that's exactly what I did in the patched version of heapsort.asm below.
    Now the revised assembled code is 181 bytes long. It is really short! When you honestly compare this length with the numbers you saw earlier in this thread, remember that this code has the sift routine inlined in two places for quicker operation.

    Who among you will be the next to shorten this code or find some improvement to it? :)

    ---------- start of file: heapsort.asm, cut below this line ----------
    BITS 64
    ; rdi = array's address
    ; rsi = array's number of elements = n
    heapsort: mov rax, rsi
    shr rax, 1
    ; nothing to sort if there is 0 or 1 element in the array
    jz the_end
    ; now we are sure that always n>=2
    ; r10 = index of the last parent node
    ; r10 is also the counter of the heap building loop
    lea r10, [rax-1]
    ; rbx = index of the left child of the last parent node
    ; rbx is also index of the left child of r10 node in the loop
    lea rbx, [rax+rax-1]
    ; heap building loop
    ; sift the counter node's value stored in r9 to the proper position
    ; r9 = array[r10]
    build_loop: mov r9, qword [rdi+8*r10]
    ; r11 is index of target place for sifted values
    ; we start sifting from the counter node's index
    mov r11, r10
    ; rax is the index of a current child of r11 node
    ; we start rax with the index of the left child
    mov rax, rbx
    jmp sift1_while
    ; we will find the child with bigger value and store its index in rax
    ; we will load this bigger value to rdx
    ; rcx holds the index of the right child
    sift1: lea rcx, [rax+1]
    mov rdx, qword [rdi+8*rax]
    ; note that the right child may not exist
    ; then we consider the left child to be bigger
    cmp rcx, rsi
    jnb sift1_check
    ; we load the right child node's value to r8
    mov r8, qword [rdi+8*rcx]
    ; if the right child node's value is bigger than the left child's
    cmp r8, rdx
    jle sift1_check
    ; then we will proceed with the right child node
    mov rdx, r8
    mov rax, rcx
    ; compare the values of the current child node with the sifting node
    sift1_check: cmp r9, rdx
    ; if the sifting node's value is not smaller, we end sifting
    jnl build_next
    ; we store the child's value into the parent node's place
    mov qword [rdi+8*r11], rdx
    ; we make the child's node the new parent's node
    mov r11, rax
    lea rax, [rax+rax+1]
    ; do sift if there is still at least one child
    sift1_while: cmp rax, rsi
    jb sift1
    ; we store sifted value into the target place and repeat the loop
    build_next: mov qword [rdi+8*r11], r9
    sub rbx, 2
    sub r10, 1
    jnc build_loop
    ; the sorting loop starts with n-1
    ; r10 will hold the counter of the sorting loop
    lea r10, [rsi-1]
    ; the sorting loop
    ; rdx = array[0]
    ; r9 = array[counter]
    sort_loop: mov rdx, qword [rdi]
    mov r9, qword [rdi+8*r10]
    ; we will sift the node rax=0 whose value is in r9
    xor eax, eax
    ; we store array[counter]=array[0]
    mov qword [rdi+8*r10], rdx
    ; now we will store in rdx the left child node's index
    mov edx, 1
    jmp sift2_while
    ; sift2 works like sift1 with changed registers
    ; we will find the child with bigger value and store its index in rdx
    ; we will load this bigger value to rsi
    ; rcx holds the index of the right child
    sift2: lea rcx, [rdx+1]
    mov rsi, qword [rdi+8*rdx]
    ; note that the right child may not exist
    ; then we consider the left child to be bigger
    cmp rcx, r10
    jnb sift2_check
    ; we load the right child node's value to r8
    mov r8, qword [rdi+8*rcx]
    ; if the right child node's value is bigger than the left child's
    cmp r8, rsi
    jle sift2_check
    ; then we will proceed with the right child node
    mov rsi, r8
    mov rdx, rcx
    ; compare the values of the current child node with the sifting node
    sift2_check: cmp r9, rsi
    ; if the sifting node's value is not smaller, we end sifting
    jnl sort_next
    ; we store the child's value into the parent node's place
    mov qword [rdi+8*rax], rsi
    ; we make the child's node the new parent's node
    mov rax, rdx
    lea rdx, [rdx+rdx+1]
    ; do sift if there is still at least one child
    sift2_while: cmp rdx, r10
    jb sift2
    ; we store sifted value into the target place
    sort_next: mov qword [rdi+8*rax], r9
    ; next step of the sorting loop
    sub r10, 1
    ; the sorting loop ends with 1
    jnz sort_loop
    the_end:
    ---------- end of file: heapsort.asm, cut above this line ----------
