When a processor does multiplication, it doesn’t quite do multiplication in the way you can do it in your head. It uses a process you probably wouldn’t think about. In fact, with the ARM processors, every multiply operation takes at least ~4-6 cycles.
MUL operation #
The MUL operation is the “basic” multiplication operation. It takes the format
mul Rd, Rn, Rm where Rd is the destination and Rn and Rm are input registers.
This opcode, as well as all of the other multiply opcodes, only take registers as inputs: there is no possibility to use the flexible operand2.
MLA operation #
MLA stands for multiply and accumulate. As it sounds, this operation multiples two
registers and then adds the value to another register. The format is
mla Rd, Rm, Rs, Rn
where Rd = (Rm * Rs) + Rn.
You may have (or many not have) realized something. If you take two 32-bit numbers, multiply them together, you will possibly get a number that takes up more than 32-bits. The above operations (MUL, MLA) will only return the 32 least significant bits. The 32 most significant bits are discarded. For that reason, there are the following operations.
MULL and MLAL operations #
These are the “multiply long” and “multiply long and accumulate” operations. However, to complicate things a little more, they never exist in this form. They are always prefixed with a ‘U’ and a ‘S’ to indicate unsigned or signed multiplication.
umull r1, r2, r3, r4 @ r3 * r4, store high bits in r2, low in r1 smull r1, r2, r3, r4 @ same as above, but for signed numbers umlal r1, r2, r3, r4 @ (r3 * r4) + (r2,r1) @ high bits in r2, low bits in r1 smlal r1, r2, r3, r4 @ same as above, but for signed numbers
SMULxy and SMLAxy operations #
Only because your head isn’t spinning enough, you can add letters to the end of the signed multiplication opcode to multiply two 16-bit integers and store them in a 32-bit register. The x and y can either be a t or a b to indicate the top or bottom half of the registers.
smultb r1, r2, r3 @ multiply the top 16 bits of r2 and the @ bottom 16 bits of r3 and store in r1 smulbb r1, r2, r3 @ multiply the bottom 16 bits of r2 and the @ bottom 16 bits of r3 and store in r1