Try this for faster multiply?
#1
(where /should/ I be putting this sort of thing?)

Plug this in for mul16.asm

Code:
__MUL16:        ; Multiplies HL with the last value stored into the stack
                ; Works for both signed and unsigned

        PROC

        LOCAL __MUL16LOOP1
        LOCAL __MUL16NOADD1
        LOCAL __MUL16LOOP2
        LOCAL __MUL16NOADD2

        ex de, hl
        pop hl          ; Return address
        ex (sp), hl     ; CALLEE caller convention

;;__MUL16_FAST: ; __FASTCALL ENTRY: HL = 1st operand, DE = 2nd Operand
;;        ld c, h
;;        ld a, l         ; C,A => 1st Operand
;;
;;        ld hl, 0        ; Accumulator
;;        ld b, 16
;;
;;__MUL16LOOP:
;;        sra c           ; C,A >> 1 (Arithmetic)
;;        rra
;;
;;        jr nc, __MUL16NOADD
;;        add hl, de
;;
;;__MUL16NOADD:
;;        sla e
;;        rl d
;;
;;        djnz __MUL16LOOP

__MUL16_FAST:
        ld b, 8
        ld a, d         ; A = high byte of the multiplier
        ld c, e         ; C = low byte, kept for the second loop
        ex de, hl
        ld hl, 0

__MUL16LOOP1:
        add hl, hl      ; hl << 1
        ;sla c
        rla             ; a << 1, top bit goes to carry
        jr nc, __MUL16NOADD1
        add hl, de

__MUL16NOADD1:
        djnz __MUL16LOOP1

        ld a, c
        ld b, 8

__MUL16LOOP2:
        add hl, hl      ; hl << 1
        rla             ; a << 1, top bit goes to carry
        jr nc, __MUL16NOADD2
        add hl, de

__MUL16NOADD2:
        djnz __MUL16LOOP2

        ret             ; Result in hl (16 lower bits)

        ENDP

I think it saves on average about 110 T states per multiply, according to my tests. If I counted correctly, it's 10 bytes longer.

Why it's faster:

SLA C is a long, slow opcode (two bytes, 8 T states) compared to just doing the RLA (one byte, 4 T states). It's faster to loop twice and roll the A register through each half of the multiplier in turn than it is to roll the 16-bit pair on every pass.
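
For anyone who wants to convince themselves that rolling A through the two halves gives the same answer, here is a rough Python model of the routine. It is only a sketch: `mul16_fast` is a made-up name, the registers are plain integers, and the carry handling assumes the usual Z80 behaviour of ADD HL,HL and RLA.

```python
def mul16_fast(hl, de):
    """Model of the two-loop routine: low 16 bits of hl * de."""
    a, c = (de >> 8) & 0xFF, de & 0xFF    # ld a, d / ld c, e
    de, hl = hl & 0xFFFF, 0               # ex de, hl / ld hl, 0
    for half in range(2):
        if half == 1:
            a = c                         # ld a, c (low byte for loop 2)
        for _ in range(8):                # ld b, 8 ... djnz
            carry = hl >> 15              # add hl, hl: carry out of bit 15
            hl = (hl << 1) & 0xFFFF
            a = (a << 1) | carry          # rla: old carry enters bit 0
            carry, a = a >> 8, a & 0xFF   # old bit 7 leaves into carry
            if carry:                     # jr nc skips the add
                hl = (hl + de) & 0xFFFF   # add hl, de
    return hl


print(mul16_fast(1234, 56) == (1234 * 56) & 0xFFFF)
```

The junk bits that RLA pulls in from ADD HL,HL land in bit 0 of A and never get as far as bit 7 within eight passes, which is why the separate SLA C is not needed.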

Also in this case, JR is a better choice than the original JP instruction. Not only is it a byte shorter, but it's faster on average. Probably.

16 JP NC instructions = 160 T states.
JR is 7 if the condition fails, 12 if it passes. We can assume that, on average, half of the bits will be 1 and half will be 0. So that's an average of (8*12)+(8*7)=152 T states. It's worth saving the byte too, which helps make up for the double loop being a few extra bytes.
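
If you want to redo the sum for a different mix of bits, a few lines of Python will do it. This is just a sketch of the arithmetic above; `branch_cost` and its parameters are made-up names, and the timings are the standard Z80 figures for JP cc and JR cc.

```python
JP_CC = 10         # JP cc: 10 T states whether or not it jumps
JR_TAKEN = 12      # JR cc when the jump is taken
JR_NOT_TAKEN = 7   # JR cc when it falls through


def branch_cost(branches=16, taken=8):
    """Return (JP total, JR total) in T states for a run of branches."""
    jp_total = branches * JP_CC
    jr_total = taken * JR_TAKEN + (branches - taken) * JR_NOT_TAKEN
    return jp_total, jr_total


jp_total, jr_total = branch_cost()
print("JP:", jp_total, "JR:", jr_total)
```

By the same sum, JR stays ahead as long as nine or fewer of the sixteen branches are taken; from ten upwards JP would win on time, though not on size.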

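I also wondered about swapping the DJNZ for dec b / jp nz __MUL16LOOP1, since that will jump most times, but DJNZ is 13 T states when it jumps and the pair comes to 4 + 10 = 14, so there's nothing to gain and it costs bytes. Having two short loops actually speeds up the DJNZ a little too, presumably because the cheap fall-through case (8 T states) now happens twice instead of once.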
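
The DJNZ side of that can be tallied the same way. Again a sketch only: `djnz_cost` is a made-up helper, and the figures are the standard Z80 timings for DJNZ, LD A,C and LD B,n.

```python
DJNZ_TAKEN = 13   # DJNZ when it loops again
DJNZ_EXIT = 8     # DJNZ when B hits zero and it falls through


def djnz_cost(iterations):
    """T states spent in DJNZ alone for one loop of `iterations` passes."""
    return (iterations - 1) * DJNZ_TAKEN + DJNZ_EXIT


one_loop = djnz_cost(16)       # 15 * 13 + 8 = 203
two_loops = 2 * djnz_cost(8)   # 2 * (7 * 13 + 8) = 198
reload = 4 + 7                 # ld a, c (4) + ld b, 8 (7) before loop 2

print(one_loop, two_loops, reload)
```

So the DJNZ saving on its own is small, and smaller than the cost of setting up the second loop; the real gain still comes from dropping the SLA C.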
#2
I'm wondering if similar optimizations could be made with other 16 bit operations?