Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / programming / performance

Optimizing your C/C++ programs for Arm Platforms

0.00/5 (No votes)
30 Apr 2024CPOL4 min read 2K  
The article discusses various techniques for improving program performance by optimizing code for compilers, focusing on the ARM platform.

Today’s compilers are quite good at producing highly optimized code on their own. However, there are several cases where you as the programmer can help compilers generate better code. This blog post covers techniques and tips that are useful to create better performing programs whether you are creating Android, Desktop or Server applications.

Memory aliasing and the ‘restrict’ keyword

Whenever a compiler auto-vectorizes code, it needs to first be sure that this is safe to do. One of the performed safety checks is for pointer aliasing. This check is used to see if the pointers the compiler is reading and writing can be pointing to the same data. When the compiler cannot determine this statically, it has to insert a runtime check.

These checks can slow your program down significantly or, even worse, fail to vectorize entirely if it can not insert the runtime checks. However, there is a way for you as the programmer to tell the compiler to go ahead and assume the pointers do not alias. Read the following Learning Path in the URL that explains the importance of using the ‘restrict’ keyword in C correctly.

Learn about the restrict keyword

Memory latency

Loading and storing data to memory is an activity that takes time for the CPU to complete. How much time depends on several factors, but there are various things that you as the programmer can do to improve this access time. Often compilers cannot do these for you and so knowing these techniques can give your program the edge it needs. Read the Memory Latency Learning Path in the following URL to gain a better understanding of caches, prefetching and data alignment on Arm platforms.

Learn about memory latency

Leveraging integer vs floating point

Performing operations using integer arithmetic instead of floating-point arithmetic can often result in significantly faster programs, as CPUs tend to have more bandwidth to perform integer arithmetic. However, there are cases where, due to the semantics of the programming language, you inadvertently end up with floating point operations. One common pitfall is implicit conversions to floating-point. Read the Speed benefit of integer vs float Learning Path in the following URL to find out how to avoid these pitfalls and leverage the power of integer performance for faster programs.

Learn about integer vs floating point performance

Leveraging auto-vectorization in compilers

Modern compilers are often referred to as optimizing compilers, because they perform various optimizations and transformations on your input program to get better performance. One such optimization is the transformation of your program from scalar to vector. The act of vectorization refers to transforming your program from handling one value at a time into one that can handle multiple values at a time in each operation.

While compilers are very good at this and constantly improving, there are still various ways you can structure the flow of your program to make it easier for the compiler to perform auto-vectorization and leverage the power of Advanced SIMD and SVE instructions.

Intrigued? Read more in the following URL:

Learn how to leverage auto-vectorization

Modifying loop layout to be auto-vectorization friendly

Of equal importance when writing auto-vectorization friendly programs is the data layout. When the compiler is transforming loops during auto-vectorization It makes a significant difference whether it can be load data sequentially, or whether it needs to skip some elements, for instance loading every other element. Even accesses like reading a field of a struct inside an array, such as data[i].x, can result in strided accesses.

An efficient data layout can be the difference between a slow and very fast program. This is one area where the compiler often does not have enough context to be able to help and where it is important for the programmer to understand how they can help the compiler. Interested in taking your program’s performance to the next level? Read more in the link below.

Learn about data layout

Summary

The Arm architecture has plenty of great features that when used properly can significantly improve your program's performance. It is easy to take advantage of them if you keep the tips and background knowledge in mind.

Be sure to read the other Learning Paths here: https://learn.arm.com for other helpful and informative tips on how to best leverage all that the Arm platform provides.

Read more Arm Learning Paths

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)