• Testing whether address register odd

  • From Mux@21:1/5 to Bruce Mardle on Sun Oct 25 22:48:34 2015
    On Monday, October 20, 2014 at 1:55:36 AM UTC-7, Bruce Mardle wrote:
    Hi, all.
    What's the quickest way of testing whether the contents of an address register is odd (on a 68010)?
    Is there anything faster than:
    move a0, d0
    btst #0, d0

    Hi!

    A simple 'and' would do the trick as well (i.e. and #1,d0). Don't know if bit-testing is faster than and'ing. Alternatively you can shift the value right and check the carry flag.

    -Y

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tom Evans@21:1/5 to Bruce Mardle on Mon Oct 26 15:37:05 2015
    On Monday, October 20, 2014 at 7:55:36 PM UTC+11, Bruce Mardle wrote:
    Hi, all.
    What's the quickest way of testing whether the contents of an address register is odd (on a 68010)?
    Is there anything faster than:
    move a0, d0
    btst #0, d0

    Hello from over a year ago (and to Mux, yesterday). I hope you weren't waiting for these answers, and looked in the Reference Manual instead.

    Which shows "btst" to be the WORST choice. "btst #bit" takes 10 clocks. "asr" takes 6+2n, which with "n" being "1" takes 8 clocks. Which is the same time as "andi" as it has to fetch two words, taking 8 clocks.

    If you can keep a data register spare to hold a "1" then you can use "and" in 4 clocks or "btst" in 6. The above assumes zero wait-state memory, which changes the numbers.

    Tom

  • From Tom Evans@21:1/5 to All on Tue Oct 27 06:04:35 2015
    There's also the tricky code that some versions of gcc generate in place of a bit test, which is detailed (together with some of its problems) here:

    https://community.freescale.com/message/501384#501384

    To summarise the above, it generates code like:

    } else if (cc->status & cd_BUS_RWARN) {
    4010903a: 0800 000c btst #12,%d0
    4010903e: 6604 bnes 40109044 <comm_check_status+0x62>
    } else if (cc->status & cd_OVERRUN) {
    40109040: 44c0 movew %d0,%ccr
    40109042: 6a02 bpls 40109046 <comm_check_status+0x64>

    Note the weird word-saving trick in the last compare? It copies the data to the CCR and then tests the "N" bit.

    You can do that to test the lowermost four bits in a register, corresponding to the CCR C, V, Z and N bits (bits 0 to 3). It saves one 16-bit fetch. That should save four clocks, except that this instruction takes TWELVE clocks to execute on the 68000 and 68010.

    On the ColdFire (the subject of the referenced post) there's good and bad news. The good news is that it only takes one clock. The bad news is that on the MCF53 series it corrupts the branch predict bit in the status register, and there's no known way to stop gcc from generating that code.

    Tom


  • From Mux@21:1/5 to All on Tue Oct 27 13:08:34 2015
    Complete cop-out but if you're looking for the msb AND it's in a register you can simply 'add' the register to itself and check the carry flag. Ahh, assembly language... gotta love it..

    On a completely different note, I'm reading 'I am error' about the NES hardware which is pretty awesome in that the author gets into a lot of detail and goes as far as digging into the (disassembled) code for SMB. Due to the little amount of memory the NES had they actually generated most of the levels in SMB with some really nifty bit packing. Good read in case anyone's interested..

    -Mux

  • From Bruce Mardle@21:1/5 to Tom Evans on Wed Oct 28 06:09:57 2015
    Thanks, Mux and Tom. I think I'll go with...

    On Monday, 26 October 2015 22:37:06 UTC, Tom Evans wrote:
    If you can keep a data register spare to hold a "1" then you can use "and" in 4 clocks or "btst" in 6. The above assumes zero wait-state memory, which changes the numbers.

    ... especially when I need to do it twice. (As in a 'memcpy'. I have to treat even source/even destination differently from odd src/odd dst and from 1 odd/1 even.)
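    The three cases named here can be told apart with one XOR and at most two single-bit tests. The following is a minimal sketch in C, not anything posted in the thread: `copy_class` and `classify` are made-up names, and the code assumes that converting a pointer to `uintptr_t` preserves the low address bit (true on flat-address machines, but implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum copy_class { BOTH_EVEN, BOTH_ODD, MIXED };

/* Classify a source/destination pair by the parity of the two addresses. */
static enum copy_class classify(const void *src, const void *dst)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;

    if ((s ^ d) & 1u)           /* low bits differ: one odd, one even */
        return MIXED;
    return (s & 1u) ? BOTH_ODD : BOTH_EVEN;
}
```

    In the BOTH_ODD case a single leading byte copy makes both pointers even; in the MIXED case no amount of leading bytes lines them up, which is why it needs separate handling.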

    <ramble> I must get back into 68k programming. Haven't done any for months. Been writing a Z280 assembler. 68k is much saner! </ramble>

  • From Charles Richmond@21:1/5 to All on Wed Oct 28 17:56:56 2015
    "Bruce Mardle" <marblypup@yahoo.co.uk> wrote in message news:dbd4c1f7-1b92-403c-8431-998f14047add@googlegroups.com...
    Thanks, Mux and Tom. I think I'll go with...

    On Monday, 26 October 2015 22:37:06 UTC, Tom Evans wrote:
    If you can keep a data register spare to hold a "1" then you can use
    "and" in 4 clocks or "btst" in 6. The above assumes zero wait-state
    memory, which changes the numbers.

    ... especially when I need to do it twice. (As in a 'memcpy'. I have to
    treat even source/even destination differently from odd src/odd dst and
    from 1 odd/1 even.)

    <ramble> I must get back into 68k programming. Haven't done any for
    months. Been writing a Z280 assembler. 68k is much saner! </ramble>


    68K assembly language is a *sweetheart*!!! The only assembly language I
    might like better is the 6809 assembly.

    --

    numerist at aquaporin4 dot com

  • From Bruce Mardle@21:1/5 to Tom Evans on Thu Oct 29 21:32:05 2015
    On Wednesday, 28 October 2015 23:15:35 UTC, Tom Evans wrote:
    Which is 14 clocks for looping "MOVE.W (A0)+, (A1)+" and 22 clocks for "MOVE.L (A0)+, (A1)+"

    I can vouch for that. It was one of the first things I tried on my 68010 :-)

    But if you ignore loop mode and simply unroll the copy loop by eight, then it
    takes (8 * 20 + 10) = 170 clocks while the loop-mode takes 176.

    Good point! But then I'd need more-complicated code to deal with 'stragglers': the bytes left to copy after copying 32-byte chunks. Lots of design decisions to be made! At least the 68010 is happy to read/write longwords at addresses ending in 10b; that simplifies things a little.

  • From Tom Evans@21:1/5 to Bruce Mardle on Wed Oct 28 16:15:34 2015
    On Thursday, October 29, 2015 at 12:09:58 AM UTC+11, Bruce Mardle wrote:
    Thanks, Mux and Tom. I think I'll go with...

    On Monday, 26 October 2015 22:37:06 UTC, Tom Evans wrote:
    ... especially when I need to do it twice. (As in a 'memcpy'.
    I have to treat even source/even destination differently from
    odd src/odd dst and from 1 odd/1 even.)

    On a 68010. Where you've got "loop mode". The following code from a project I worked on in 1991 should (see later) take advantage of this:

    |
    | copy bytes, using movb,movw, or movl as appropriate.
    | NB: a len of <= 0 is treated as = 0, ie: do nothing.
    |
    .globl _bcopy
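    _bcopy: movl    sp@(4),d0       | first argument: source address
            movl    d0,a0
            movl    d0,d1           | d1 collects the alignment bits
            movl    sp@(8),d0       | second argument: destination address
            movl    d0,a1
            orl     d0,d1
            movl    sp@(12),d0      | third argument: length in bytes
            bles    6$              | len <= 0: nothing to do
            orl     d0,d1           | d1 = src | dst | len
            btst    #0,d1           | anything odd?
            beqs    2$              | no: words or longs will do
            subql   #1,d0           | dbra runs count+1 times
    1$:     movb    a0@+,a1@+       | byte copy
            dbra    d0,1$
            rts

    2$:     btst    #1,d0           | is len a multiple of 4?
            beqs    4$              | yes: copy longs
            asrl    #1,d0           | no: len/2 words
            subql   #1,d0
    3$:     movw    a0@+,a1@+       | word copy
            dbra    d0,3$
            rts

    4$:     asrl    #2,d0           | len/4 longs
            subql   #1,d0
    5$:     movl    a0@+,a1@+       | long copy
            dbra    d0,5$
    6$:     rts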
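
    For readers who find the assembly hard to follow, the way the listing picks a transfer width can be restated in C. This is only an illustrative sketch: `bcopy_width` is a made-up helper that is not part of the original project, and it assumes that converting a pointer to `uintptr_t` keeps the low address bits.

```c
#include <stdint.h>

/* Hypothetical helper: returns the transfer width in bytes (1, 2 or 4)
   that the listing would choose, or 0 when len <= 0 (nothing is copied). */
static int bcopy_width(const void *src, const void *dst, long len)
{
    if (len <= 0)
        return 0;

    /* Mirrors d1 = src | dst | len in the listing. */
    uintptr_t bits = (uintptr_t)src | (uintptr_t)dst | (uintptr_t)len;

    if (bits & 1u)      /* any odd address or odd length: bytes */
        return 1;
    if (len & 2)        /* even length, not a multiple of 4: words */
        return 2;
    return 4;           /* length is a multiple of 4: longs */
}
```

    Note that only the length is checked for the long-word case, so even addresses ending in 10b still get long moves, as in the listing.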

    Note: I say "should" because the Motorola 68000 User Manual is confusing and most likely dead wrong.

    "APPENDIX A MC68010 LOOP MODE OPERATION" gives as an example of Loop Mode:

    LOOP LEA SOURCE, A0 Load A Pointer To Source Data
    LEA DEST, A1 Load A Pointer To Destination
    MOVE.W #LENGTH, D0 Load The Counter Register
    MOVE.W (A0);pl, (A1)+ Loop To Move The Block Of Data
    DBEQ D0, LOOP Stop If Data Word Is Zero

    Figure A-1. DBcc Loop Mode Program Example

    I'm pretty sure ";pl" is meant to be a "+" in the above. So it is the classic block-move operation with the magic 68k auto-increment on the address registers.

    Fine, except the next table in the book, "Table A-1. MC68010 Loop Mode Instructions" lists all the acceptable addressing mode combinations, and "(Ay)+ to (Ax)+" is NOT THERE. The table says the most used addressing mode isn't supported.

    That has to be wrong because "Table 9-2. Move Byte and Word Instruction Execution Times" documents the timing for this most useful case.

    Which is 14 clocks for looping "MOVE.W (A0)+, (A1)+" and 22 clocks for "MOVE.L (A0)+, (A1)+"

    But if you ignore loop mode and simply unroll the copy loop by eight, then it takes (8 * 20 + 10) = 170 clocks while the loop-mode takes 176. Word mode is 106 for unrolled and 112 for loop mode. Loop mode is better if your memory has wait states though.

    The big win is changing simple and dumb "move bytes" code to moving words and longs when it can, as you're doing.

    But the fastest way to copy memory is to design your system so you don't have to copy at all, but just copy/read it ONCE and then pass pointers around.

    When you get into the RISC CPUs it gets really complicated. The fastest way to copy (external DDR) memory on even a middle of the range ColdFire chip is to copy 64 words (32 bits) from the external DDR to the internal SRAM, and then copy from there back to DDR. That keeps the caches happy and the memory controller "on page". And since it is RISC, all copies have to go through registers! So DDR3 --> Register --> SRAM --> Register --> DDR3.

    Tom

  • From Tom Evans@21:1/5 to Bruce Mardle on Fri Oct 30 20:18:12 2015
    On Monday, October 20, 2014 at 7:55:36 PM UTC+11, Bruce Mardle wrote:
    Hi, all.
    What's the quickest way of testing whether the contents of an
    address register is odd (on a 68010)?

    The absolutely fastest way is to ignore the problem completely and just go ahead with the memory copy. Then handle the unaligned cases in the exception routine.

    That only works well if the overwhelming majority of them are aligned.

    Otherwise, why worry about saving a couple of clock cycles in a function that spends 95+% of its time in a 20-clock loop moving memory around?

    Tom

  • From Tom Evans@21:1/5 to All on Sun Nov 1 02:34:02 2015
    But if you ignore loop mode and simply unroll the
    copy loop by eight, then it takes (8 * 20 + 10) = 170 clocks
    while the loop-mode takes 176.

    Loop takes (22(2/2) * N). Unrolling by "M" takes (N * 20(3/2) + (10(2/0) * N / M)). The point is that it saves ONLY 6 clocks or 3.5% or a maximum of 10% for an "infinite unroll", so why bother? It makes a heap of difference (to unroll the loop) on the 68000 or CPU32, but not the 68010.
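
    The arithmetic behind those totals can be checked with a few lines of C. This is just a sketch of the two formulas as stated in the thread; the function names are made up, the clock figures are the ones quoted above (zero wait states), and the per-move figure of 12 for unrolled word moves is the value implied by the 106-clock total given earlier.

```c
/* Loop mode: every move costs the same number of clocks, no separate
   DBRA cost. per_move is 22 for longs and 14 for words in the thread. */
static unsigned loop_mode_clocks(unsigned per_move, unsigned n)
{
    return per_move * n;
}

/* Unrolled by m: n moves plus one DBRA per group of m moves.
   Assumes n is a multiple of m. per_move is 20 for longs, dbra is 10. */
static unsigned unrolled_clocks(unsigned per_move, unsigned dbra,
                                unsigned n, unsigned m)
{
    return per_move * n + dbra * (n / m);
}
```

    With N = 8 and M = 8 this gives 176 against 170 clocks for long moves, and 112 against 106 for word moves, matching the figures in the thread.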

    Good point! But then I'd need more-complicated code to
    deal with 'stragglers':

    Which would take a lot of clocks.

    Have you ever heard of "Duff's Device"? It is a magic fix to the "straggler" problem. It is horribly ugly C code, even worse than "ternary abuse", but is completely legal.

    https://en.wikipedia.org/wiki/Duff's_device#Original_version
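
    As a rough illustration of the idea, here is a Duff's-device-style copy in C. It is a sketch, not the original: Duff's version wrote every value to a single fixed output register, whereas this variant (with the made-up name `duff_copy`) advances both pointers so it behaves like a byte-wise memory copy. The `switch` jumps into the middle of the unrolled loop to handle the stragglers first, and the remaining passes each do eight moves.

```c
#include <stddef.h>

/* Copy n bytes with the loop unrolled by eight. The case labels sit
   inside the do-while body, so the first pass copies n % 8 bytes
   (or eight when n is a multiple of 8). */
static void duff_copy(unsigned char *to, const unsigned char *from, size_t n)
{
    if (n == 0)
        return;

    size_t passes = (n + 7) / 8;    /* number of trips through the loop */

    switch (n % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

    Some compilers warn about the deliberate fall-through between cases; that is expected here.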

    Tom

  • From Bruce Mardle@21:1/5 to Tom Evans on Sun Nov 1 04:16:44 2015
    On Sunday, 1 November 2015 10:34:03 UTC, Tom Evans wrote:
    Have you ever heard of "Duff's Device"? It is a magic fix to the "straggler" problem. It is horribly ugly C code, even worse than "ternary abuse", but is completely legal.

    https://en.wikipedia.org/wiki/Duff's_device#Original_version

    I hadn't, but the idea of doing a calculated jump into the loop for the first 0-7 (or 1-8) copies occurred to me about a day after my previous post. In my defence, I'm a big fan of structured programming so my brain revolts at such ideas :-) (When writing C, I try to avoid `continue` and `break`ing from loops, never mind `goto`.)

    Anyway, thanks, everyone, for all the suggestions.

  • From Mux@21:1/5 to Bruce Mardle on Fri Nov 6 14:23:08 2015
    On Sunday, November 1, 2015 at 4:16:45 AM UTC-8, Bruce Mardle wrote:
    On Sunday, 1 November 2015 10:34:03 UTC, Tom Evans wrote:
    Have you ever heard of "Duff's Device"? It is a magic fix to the "straggler" problem. It is horribly ugly C code, even worse than "ternary abuse", but is completely legal.

    https://en.wikipedia.org/wiki/Duff's_device#Original_version

    I hadn't, but the idea of doing a calculated jump into the loop for the first 0-7 (or 1-8) copies occurred to me about a day after my previous post. In my defence, I'm a big fan of structured programming so my brain revolts at such ideas :-) (When writing C, I try to avoid `continue` and `break`ing from loops, never mind `goto`.)

    Anyway, thanks, everyone, for all the suggestions.

    That looks... weird. I've always used 'reverse' jumptables for doing polygon fillers and what not. Basically got you the maximum amount of performance especially because you knew that you'd never be filling more than the screen width.

    -Mux
