Performance in Software Engineering

Inefficiency in software has exactly two causes: failing to use the resources available, and using more resources than necessary. In my experience, most optimization problems fall into the latter category: one way or another, nearly all poor performance comes down to performing more operations to complete a task than the task requires.

There are a few interrelated categories of optimization we can use to reduce operations:

• use faster languages/runtimes
• use data structures optimized for specific operations, and use the operations they are optimized for
• caching – the art of keeping data (closer to) where you need it

In most cases the choice of language or runtime can only be made at the start of a project; changing that choice later means rewriting the product. An example would be selecting Rust/Axum for a REST API instead of Spring Boot: if you know your application will need to squeeze out every bit of performance possible, you'll select Rust, since it reduces operations by avoiding garbage collection in favour of statically guaranteed memory safety. Another example is selecting the correct database type for your application. In a work project, the early architects incorrectly selected DynamoDB for the application database, despite the deeply relational character of our data model. The result was that we often needed to perform operations in slower Node.js code that could otherwise have been pushed down to the level of SQL queries, running on a database engine implemented in C. This had the double cost of needing to transmit extra data over the network from the database service, only for it to be filtered out in the lambda service.
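To make the cost concrete, here is a minimal sketch of the two approaches, assuming a Postgres-style relational database and the node-postgres client; the orders table and its columns are invented for illustration.

```ts
import { Pool } from "pg"; // assumes the node-postgres client

const pool = new Pool();

// Filtering in Node: the database ships every row over the network,
// and the service discards most of them.
async function activeOrdersInApp(userId: string) {
  const { rows } = await pool.query("SELECT * FROM orders");
  return rows.filter((r) => r.user_id === userId && r.status === "active");
}

// Filtering in SQL: only matching rows leave the database, and an
// index on (user_id, status) lets the C-level engine do the work.
async function activeOrdersInSql(userId: string) {
  const { rows } = await pool.query(
    "SELECT * FROM orders WHERE user_id = $1 AND status = 'active'",
    [userId]
  );
  return rows;
}
```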

When it comes to data structures and algorithms, I've found there are mainly three cases where they have an effect: 1) applying indexes to our database, and ensuring that the index selected matches how we're querying, 2) converting data from one format to another temporarily, and 3) implementing algorithms ourselves. The second case mainly applies to the compression of data for network transmission. In the third case, I have noticed a couple of practices that contribute to optimization. First, it is necessary to be conscientious about the runtime of the library functions we use. I have observed among junior developers a common tendency not to consider asymptotic runtime at all, ensuring only that the code has the desired effect. I recall one case where a paginating query function requested large amounts of data before returning, combining the results after each page with allData.concat(...newData). Since concat copies the entire accumulated array, each of the $n/s$ pages (with page size $s = 50$) re-copies everything fetched so far, for a total of $s\sum_{k=1}^{n/s} k = \frac{n(n+s)}{2s} = O(n^2)$ element copies. This is trivial to reduce to $O(n)$, but only if one actually considers performance in the first place (see the sketch below). Aside from being conscientious about libraries, the other practice is to explicitly document the asymptotic runtime of algorithms wherever possible.
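A sketch of both versions of the pagination loop, with fetchPage as a hypothetical stand-in for the real paginated query function:

```ts
type Item = unknown;
declare function fetchPage(
  cursor?: string
): Promise<{ items: Item[]; next?: string }>;

// O(n^2): concat copies the entire accumulated array on every page.
async function fetchAllQuadratic(): Promise<Item[]> {
  let allData: Item[] = [];
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor);
    allData = allData.concat(...page.items); // full copy each iteration
    cursor = page.next;
  } while (cursor);
  return allData;
}

// O(n): push appends in place, amortized constant time per element.
async function fetchAllLinear(): Promise<Item[]> {
  const allData: Item[] = [];
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor);
    allData.push(...page.items);
    cursor = page.next;
  } while (cursor);
  return allData;
}
```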

Finally, there are 1000 kinds of caches, and they all have their place: CDNs, server-side API caches (most API frameworks have some caching solution built in), and client-side caching (like React Query). In some low-level systems it may also be necessary to consider the caching strategies of specific CPUs, but that is outside my area of expertise. To optimize my computer cluster's node deployment times, which were previously very slow due to needing to build packages separately on each machine, I configured each machine to use all the others as binary caches. This works because the nix daemon on each machine also functions as a cache server: when a build is triggered, then for every package not present on the target machine, it checks all the other cluster nodes before building the package. The result is that any package needs to be built at most once (unless the package was garbage collected, and not counting the fact that the cluster includes three different platforms: aarch64-darwin, aarch64-linux, and x86_64-linux).
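The lookup order can be modelled with a short conceptual sketch; this is not nix code, and hasLocally, fetchFromPeer, and build are illustrative stand-ins for what the daemon does internally:

```ts
declare function hasLocally(pkg: string): Promise<boolean>;
declare function fetchFromPeer(peer: string, pkg: string): Promise<boolean>;
declare function build(pkg: string): Promise<void>;

async function realise(pkg: string, peers: string[]): Promise<void> {
  if (await hasLocally(pkg)) return;            // already in the local store
  for (const peer of peers) {
    if (await fetchFromPeer(peer, pkg)) return; // another node already built it
  }
  await build(pkg);                             // fall back to building locally
}
```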

Optimizations that aim to more fully use available resources include horizontal scaling and concurrency, whether async or multithreaded, especially for IO-bound tasks such as writing to disk or performing network requests (either by sending many requests at once before awaiting the results, or by starting a request and doing other work before awaiting it). The three resources we aim to keep fully used are time (so that the CPU is not idle when it could be doing something useful), memory (in the form of caching, if we can spare the space), and processing power (mainly through the use of multiple cores, if our workload can be parallelized, but also through scaling and, in some cases, hardware acceleration). Setting up my cluster for distributed builds is an example of using all available resources: during a deployment, if a node needs to build 5 packages (assuming no dependencies between them), it can distribute the build instructions for one package to each of 5 other nodes, completing the operation ~5x faster.
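A sketch of the difference, with fetchUser standing in for any network call:

```ts
type User = { id: string };
declare function fetchUser(id: string): Promise<User>;
declare function doOtherWork(): void;

// Sequential: total latency is the sum of all request latencies.
async function loadUsersSequential(ids: string[]): Promise<User[]> {
  const users: User[] = [];
  for (const id of ids) users.push(await fetchUser(id));
  return users;
}

// Concurrent: all requests are in flight at once, so total latency is
// roughly that of the slowest single request.
async function loadUsersConcurrent(ids: string[]): Promise<User[]> {
  return Promise.all(ids.map((id) => fetchUser(id)));
}

// Fire now, await later: start the request, do other work, then await.
async function loadWithOverlap(id: string): Promise<User> {
  const pending = fetchUser(id); // request is already in flight
  doOtherWork();                 // CPU is not idle while we wait
  return pending;
}
```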

It is not sufficient to conclude by saying that we ensure our products are fast by using fast technology, scaling correctly, using concurrency, caching, and choosing the correct data structures and algorithms. The challenge when confronted with poor performance is typically to determine which optimization should be performed, by determining what the bottleneck in our application is. The best way to do that is through profiling. In a recent project I used OpenTelemetry with Jaeger to identify which API endpoints were slowest; there are many similar tools, such as CloudWatch, Prometheus, etc. To be concise: applications are complex, we don't always know beforehand where the bottlenecks will be, and time spent optimizing a component that isn't the bottleneck is often time wasted. The best practice is to implement the most essential and the low-cost (in time) optimizations first and let profiling handle the rest, possibly also setting alerts to notify the team if certain operations are unacceptably slow – then determine and optimize the bottlenecks in order of priority.
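As an illustration, here is a minimal sketch of manual instrumentation with the OpenTelemetry JS API; it assumes the SDK and a Jaeger exporter are configured elsewhere, and the tracer, span, and helper names are illustrative:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("orders-service"); // illustrative name

declare function queryDatabase(): Promise<void>;

async function handleRequest(): Promise<void> {
  await tracer.startActiveSpan("GET /orders", async (span) => {
    try {
      await queryDatabase(); // time spent here is attributed to this span
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // duration is recorded and exported to Jaeger
    }
  });
}
```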



2025-09-26