Speed optimization on Arduino chips

When working with an Atmega328 we’re dealing with a really slow and weak processor. But it isn’t! In fact, it’s faster and more powerful than the chip driving the once-popular Commodore 64 computer! Today I’m going to give some advice on how to make your stuff run faster.


After an inexcusable delay, I’m getting back on track. As with the previous blog post, I reserve the right to improve this one in the future. This is kinda draft article that you, the reader of my blog, has the luck to read before it’s ready for production. Well, consider it this way, I sure hope to convert all this rambling into something useful (aka profitable) for me in the future.

In any case, here are my thoughts on making your programs run faster on an 8-bit microcontroller, namely Atmega328.


1 – forget floating point

This is a very common advice. When dealing with microcontrollers, especially with 8-bit Atmega328, it is a very good idea to forget about the existence of the floating-point math altogether. There’s no hardware here to deal with floating points, so the math is very, very, very slow. A float variable takes up 4 bytes, the controller must calculate all the bits in it, and you have no control over the accuracy depth it’ll try to achieve (that is, the number of digits after the decimal point). And the fixed-point variables are not included in the avr-gcc version supplied with Arduino IDE.

There are, however, situations where the float is viable. In fact, that’s anything that doesn’t work constantly in real time. If your sketch calculates some value and prints it on the screen once or does some stuff and then waits for user input, or just waits for some fixed amount of time in between iterations, it’s ok. It is also ok to use floats in the setup() section.

But when you’re making a library for people who may want to use it for some constant real-time action, like making animations on the LED matrix, you must forget floats. And we’re talking about such library here.

The question is – how can we ditch them? You can’t draw a circle without the Pi, right? Well, in fact, you can, but that’s for the future post. Right now, we’ll jump to the solution: determine the accuracy level you want to achieve, i.e. the depth of the fractional part, move the decimal point to the right accordingly and deal with an integer. Thus, the aforementioned Pi becomes 314 if you’re ok with 10-2 granularity, and in most Arduino cases you will be. Certainly so with circles being drawn on an 8×8 screen. And then if you’ll need the float variable at some later time, simply reverse the process.

Well, ‘simply’ is not the good word here. Because while it is simple if you’re doing addition and subtraction, things start turning ugly with multiplication and division. Like, whether you do 54.1/3.14 or 5410/314, you get the same 17.2292993 etc etc float. But if you multiply the same numbers, you get 169.874 and 1 698 740. Of course, it’s not a problem if you’re doing just a single operation and know precisely what you want to get out of it, but if you’re going with the same variables over and over, adding, subtracting, multiplying and dividing them right and left, you end up with an intangible mess.

Thus, we return to where we started: forget the float. Spend half a year writing sketches without it. Find ways to get the result you want without it (even as a cheat in parentheses). Imagine YOU are the microcontroller and you don’t have the hardware to deal with it. Won’t be easy at the start, but you’ll get the hang of it and will be able to efficiently avoid it after some practice. Believe me, I had and still have the same headache. There’s no universal guide, you have to think up a solution for each problem arisen. But eventually, your code will become noticeably faster.

Also, you can learn Assembler and go the enlightened way of manual fixed-point floats.


2 – avoid division, modulus and, to a lesser extent, multiplication

Another popular advice must be taken with caution. If you google it up, you’ll find plenty of posts on replacing division with some code that does the same thing presumably faster. Well, it does, but the ‘faster’ part is questionable at best (and don’t forget to check what architecture the advice you found has been tested on, as the tricks vary greatly).

You see, the Arduino IDE has avr-gcc compiler at its core, and this is an open-source software, backed by Atmel. Which means the same people whose posts you googled up actually had an option to include all their ideas right in there. And they did. That’s the beauty of the open-source: everyone can commit stuff. So, if someone out there found a way to do a lightning-fast division by 37, be sure they included this in the division routines of avr-gcc, even if it will be needed by like 0.001% of the users once in a decade. After all, this is a compiler we’re talking about, it’s task is to make the most efficient, small and fast code possible, and it doesn’t matter if it works a bit slower itself while searching for the best option.

So, you may find good reads about the ways to improve stuff, but it will be mostly useless. Like, x/8 is as fast as x>>3, and x%16 is as fast as x&0xF. Same is true for other examples like division by five. Look:

uint8_t divBy5 (int gsValue){
return (gsValue>>2) - (gsValue>>4) + (gsValue>>6) - (gsValue>>8); // + (gsValue>>10) - (gsValue>>12));

Should be lightning-fast, I even truncated it to suit my needs! Nope, sorry. You can’t outsmart avr-gcc.

But you CAN confuse it, making it do an inefficient code.

Check this video:

Both contraptions run the same graphics routine as fast as they can (no delays). But one is more than two times faster. Why? The answer hides in one line of code.

When I made the first 8×8 RGB LED matrix prototypes I wrote some routines for it, they worked ok, and I didn’t bother to check them for speed until now. The culprit was hidden in the setPointXY() function that started like this:

void setPointXY(uint8_t x, uint8_t y, uint16_t pixVal) {
uint16_t pixNum=0;
if (x<12) pixNum=96;
pixNum = pixNum + y*12;
pixNum = pixNum + (x%12);

As my 8×8 boards are electrically 4×16 matrices, this routine was needed to determine which half of the ‘screen’ we’re working with (the function itself just writes stuff into the ‘video memory’ that is an array of 12-bit values). So, well, I get the active half and then truncate x accordingly. A clean and readable code. But slow. I already gave a hint, but here’s the new code:

void setPointXY(uint8_t x, uint8_t y, uint16_t pixVal) {
uint16_t pixNum=0;
if (pixVal<0) pixVal = 0;
if (pixVal>4095) pixVal = 4095;
if (x<12) pixNum=96; else x-=12;
pixNum = pixNum + (y<<2)+(y<<2)+(y<<2) + x;

It’s more than 2 times faster even with the two added ifs!

Here I simplified the code by getting rid of the unneeded modulus altogether. However, it was unneeded while working with a single 8×8 RGB array. I wrote the modulus function right at the start for future compatibility with bigger matrices (*). So, I tested another solution:

uint8_t modulus(uint8_t a, uint8_t b){
while (a>=b) a-=b;
return (a);

It’s also two times faster than the original x%12 (although it’s a tad slower than the previous code due to the = in a>=b). The reason this works is I know that x can’t be more than 23, but the compiler doesn’t and assumes it can be anything in the 0 – 255 range. So, it selects the average fastest routine.

I’m sure it will remain faster than modulus with x values up to about a hundred, although I can’t check this right now.

The compiler converts your program into the machine code in the best possible way. But it doesn’t know your intentions, so it deals with the raw text you wrote. Thus, the lesson here is: try to make things easier for the compiler. You know stuff it doesn’t, use this knowledge to make compilers life easier. If the compiler sees x = 532908/786 it will convert it to x = 678. If it sees x/5 it will utilize the best algorithm for division by 5, noting the size of x (byte, int, long, etc). And if it encounters x/y it will have to compose a code that can deal with any x and any y in their respectable variable types, even if these variables can’t be anything but 2, 4, 8 and 16 in your code.

And yes, you should avoid division and modulus if you can. You’ll help the compiler immensely by doing this. Just don’t overdo it.


3 – use look-up tables

Remember I wrote about the legitimate use of floats in the setup() section? Use it to build a look-up table. Sure, you’ll eat up some valuable memory, but you’ll get a lightning-fast code as a result.

In fact, you should calculate the look-up table once, print it via the serial monitor, copy and include in your code as a constant. This way you can place it in the program memory (aka PROGMEM) and keep the RAM for more important stuff.

The examples include, but are not limited to, sine waves, partial circles, square roots and all the other stuff that you know you’ll need and that can be limited to a reasonable finite number of possibilities.


4 – use the smallest footprints

Byte, not int. Byte, not int. Byte, not int.

Be furious when seeing for (int blah=0; blah<100; blah++) as an example in some book or on a web page. The blah variable is limited to 100, why does it occupy two bytes? It must be uint8_t!

The modulus function I’ve shown earlier started its life like this:

uint8_t modulus(int a, uint8_t b){
 while (a>=b) a-=b;
 return (a);

Reasonable for a universal function, considering a can be big, but b and return can’t. But it was visibly slower than the one using uint8_t (aka byte in Arduino terms) for a.

Again, a word of caution: don’t overdo this. Sometimes I spend a lot of time trying to find the bug that was hidden in an uint8_t variable that grew out of its allotted space due to some code expansion.


5 – don’t fix it if it works, test your ideas

As stated earlier, a lot of common advises for code fastening do not yield any result because the compiler is clever enough to know them. Thus, you should test your ideas. After all, you’re sacrificing the readability of your code here, and if your new unreadable code doesn’t give any advantages, why bother?

See the examples above, the ones with the weird (y<<2)+(y<<2)+(y<<2) line in them. This was just y*12 at the start. And this particular ‘improvement’ didn’t work. But now I have to decipher this each time I deal with this part of the code.

So, the advice here is simple: while creating your code stick to normal conventions (well, avoid floats still, but keep divisions, modulus, square roots, etc.). Get your program to do what you want. And really keep it as is if you’re ok with the speed, don’t try to fix something that isn’t broke. Remember, you may return to this code in a year or two. If it was kept simple, you’ll quickly get into it. If you ‘optimized’ it like I did in the example above, well, problem. Only optimize if you’re sure the solution works and only if you’re sure you need your code to run faster.


6 – learn oct and hex

If possible, always use numbers that are the powers of two. Forget decimal, try to think in terms of oct and hex. Me, I can’t, yet. But if you learn to, your programs will run a lot faster than mine.

This is because computers have 8 fingers per hand, not 5 as we do. They think differently. So, if you’re serious about it, try to catch their wave and get in their flow. Remember, unlike most modern programmers who don’t give a fig about their code size or speed, we’re dealing with just a bunch of MHz and mere KBs of memory, not GHzs and GBs.


7 – use optimized math

The problem of slow computing of the complex math was known for decades, and people searched for roundabouts for many years. Maybe the most famous cheat is the Quake Magic Number, but it works only on 32 bit systems. In any case, google stuff.

I also recommend searching for things already written. For example, some guys made a library for the evil LEDs-with-a-chip thingies, and while I don’t need its light-emitting routines, it has some neat math cheats for rough sins and coses. Cool, I’ll take it.

Note that I give this advice as the last one. There’s a reason: we’re dealing with really small displays here, so look-up tables are better in most cases. Like, you just need a quarter of a circle to draw it in full, why not pre-calculate a couple dozen points and draw lines between them? 24 XY points take just 48 bytes of space but are capable to let you instantly draw almost any circle you wish on an 8×8 screen!

But let me keep the circle’s dilemmas for the next post/article.


*ah, and yes, I did this routine for a 24×8 matrix. It goes like this:

uint16_t pixNum=480; //288 for 16x8
while (x>11) {x-=12; pixNum-=96;}

Note that this is for the current batch of UltiBlink Mx boards, the newer ones are wired more efficiently. They still require some computations though. It is great if you can avoid it totally on a hardware level, but it was not possible in my case due to size restrictions and possible ghosting problems.

Have a comment or question? Please post on the Facebook!

Leave a comment

Your email address will not be published. Required fields are marked *