
britlion · 2011-03-23, 01:26 AM

(where /should/ I be putting this sort of thing?)

Plug this in for mul16.asm

Code:
__MUL16:	; Multiplies HL with the last value stored into the stack
			; Works for both signed and unsigned

		PROC

		LOCAL __MUL16LOOP1
		LOCAL __MUL16NOADD1
		LOCAL __MUL16LOOP2
		LOCAL __MUL16NOADD2

		
		ex de, hl
		pop hl		; Return address
		ex (sp), hl ; CALLEE caller convention

;;__MUL16_FAST:	; __FASTCALL ENTRY: HL = 1st operand, DE = 2nd Operand
;;		ld c, h
;;		ld a, l	 ; C,A => 1st Operand
;;
;;		ld hl, 0 ; Accumulator
;;		ld b, 16
;;
;;__MUL16LOOP:
;;		sra c	; C,A >> 1  (Arithmetic)
;;		rra
;;
;;		jr nc, __MUL16NOADD
;;		add hl, de
;;
;;__MUL16NOADD:
;;		sla e
;;		rl d
;;			
;;		djnz __MUL16LOOP

__MUL16_FAST:
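        ld b, 8     ; 8 bits per pass
        ld a, d     ; a = high byte of the 2nd operand
        ld c, e     ; c = low byte, kept for the second pass
        ex de, hl   ; de = 1st operand (the value that gets added)
        ld hl, 0    ; accumulator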

__MUL16LOOP1:
        add hl, hl  ; hl << 1
        ;sla c      ; no longer needed: only a is rotated
        rla         ; a << 1, top bit -> carry
        jr nc, __MUL16NOADD1
        add hl, de

__MUL16NOADD1:
        djnz __MUL16LOOP1

        ld a,c      ; second pass: low byte of the 2nd operand
        ld b,8

__MUL16LOOP2:
        add hl, hl  ; hl << 1
        rla         ; a << 1, top bit -> carry
        jr nc, __MUL16NOADD2
        add hl, de

__MUL16NOADD2:
        djnz __MUL16LOOP2



		ret	; Result in hl (16 lower bits)

		ENDP

I think it saves on average about 110 T states per multiply, according to my tests. If I counted correctly, it's 10 bytes longer.

Why it's faster:

SLA C is a slow opcode (8 T states, two bytes) compared to just doing the RLA (4 T states, one byte). It's faster to loop twice, rolling each half of the multiplier through the A register in turn, than it is to roll the 16 bit pair.
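For anyone who wants to check the idea without an emulator, here is a rough Python model of the two-loop routine. It is only a sketch: the function name `mul16_fast` and the way the registers are modelled are my own assumptions, not part of the library. It also shows why the carry that `add hl, hl` leaks into bit 0 of A is harmless: in 8 rotations it never reaches bit 7.

```python
def mul16_fast(hl, de):
    """Model of __MUL16_FAST: returns the low 16 bits of hl * de."""
    high = (de >> 8) & 0xFF          # ld a, d
    low = de & 0xFF                  # ld c, e
    addend = hl & 0xFFFF             # ex de, hl : value added on each set bit
    acc = 0                          # ld hl, 0  : accumulator

    for byte in (high, low):         # first loop uses d, second uses c
        a = byte
        for _ in range(8):           # ld b, 8 ... djnz
            acc <<= 1                # add hl, hl
            carry = acc >> 16        # carry out of bit 15
            acc &= 0xFFFF
            a = (a << 1) | carry     # rla: old carry enters bit 0
            carry = a >> 8           # bit 7 falls out into carry
            a &= 0xFF
            if carry:                # jr nc skips the add when carry is clear
                acc = (acc + addend) & 0xFFFF   # add hl, de
    return acc


print(mul16_fast(300, 7))            # 2100
print(hex(mul16_fast(0xFFFD, 5)))    # 0xfff1, i.e. -3 * 5 = -15 as a signed word
```

Because only the low 16 bits are kept, the same bit pattern comes out for signed and unsigned operands, which matches the comment at the top of the routine.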

Also in this case, JR is a better choice than the original JP instruction. Not only is it a byte shorter, but it's faster on average. Probably.

16 JP NC instructions = 160 T states.
JR is 7 if condition fails, 12 if it passes. We can assume that for bits, half will be 1 and half will be 0. So that's an average of (8*12)+(8*7)=152 T states. It's worth saving the byte on each jump, which helps compensate for the double loop being a few extra bytes.
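The jump arithmetic is easy to tally in a few lines of Python. This is just a back-of-envelope sketch using the standard Z80 timings (JP cc 10 T states either way, JR cc 12 taken and 7 not taken); the helper name `jr_total` is made up for illustration.

```python
JP_CC = 10                      # JP cc: 10 T states whether or not it jumps
JR_TAKEN, JR_NOT_TAKEN = 12, 7  # JR cc: 12 if the jump happens, 7 if not


def jr_total(bits_clear, bits_set):
    # jr nc is taken (skipping the add) when the shifted-out bit is 0
    return bits_clear * JR_TAKEN + bits_set * JR_NOT_TAKEN


jp_cost = 16 * JP_CC
jr_cost = jr_total(8, 8)        # half the bits 0, half 1
print(jp_cost, jr_cost)         # 160 152
```

The extremes are worth a glance too: `jr_total(16, 0)` is 192 for a multiplier of all zero bits and `jr_total(0, 16)` is 112 for all ones, so JR only wins on average, not in every case.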

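I also wondered about swapping DJNZ for dec b / jp nz __MUL16LOOP1, since that jump is taken most times, but DEC B (4) plus JP NZ (10) comes to 14 T states against 13 for a taken DJNZ, so it's not worth the bytes. Having two short loops actually speeds up the DJNZ a little too, since the cheaper fall-through case (8 T states) happens twice instead of once :)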
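A similar quick tally for the loop counter, again only a sketch with the standard Z80 timings (DJNZ 13 when it jumps and 8 when it falls through, DEC B 4, JP cc 10, LD r,r' 4, LD r,n 7). The names here are illustrative, not from the library.

```python
DJNZ_TAKEN, DJNZ_FALLTHROUGH = 13, 8
DEC_B, JP_CC = 4, 10
SECOND_PASS_SETUP = 4 + 7       # ld a,c + ld b,8 before the second loop


def djnz_loop(iterations):
    # every iteration jumps back except the last, which falls through
    return (iterations - 1) * DJNZ_TAKEN + DJNZ_FALLTHROUGH


one_loop = djnz_loop(16)        # a single 16-pass loop
two_loops = 2 * djnz_loop(8)    # two 8-pass loops
dec_jp = 16 * (DEC_B + JP_CC)   # dec b / jp nz on every pass
print(one_loop, two_loops, dec_jp)   # 203 198 224
```

On these numbers the DJNZ instructions themselves cost 5 T states less when split into two loops, although the 11 T states of setup for the second pass outweigh that. DEC B / JP NZ comes out slower than DJNZ either way.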

britlion · 2011-03-23, 07:01 PM

I'm wondering if similar optimizations could be made with other 16 bit operations?
