Performance issues can be very tricky to solve. Take one of the most important flows or interactions in your app: it could be your product catalog screen or even the cold start. One fine day after a release, users start complaining that this flow takes longer than usual to launch. In such cases, developers mostly go through the changelog to identify the bottleneck. There is nothing wrong with this approach, but it is a kind of "hit and trial" way of finding performance issues, and you may or may not find the culprit. There are multiple reasons for this:
- Codebases are large, and if you work with large teams, going through the changelog can be exhausting because of the number of changes shipped in a release.
- The answer may not even lie in your changelog. For example, a remote config value changed by someone from engineering or product around the release could have an unknown side effect that hurts your launch times.
Finding the Source of Truth
The most annoying thing about performance issues like the ones above is that they are often not reproducible in the developer's environment. There are multiple reasons for this: OEM-specific issues, different configurations serving the production and developer environments, device constraints, etc. So, to solve these issues, we have to start relying on production telemetry events.
Production telemetry events can carry various things: device-related attributes, session-related attributes, and vitals such as thread CPU time and memory info. These events help identify the exact bottlenecks on the main thread.
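To make the shape of such an event concrete, here is a minimal sketch in Python of what one record might look like once it reaches your analytics backend. Every field name and sample value is an assumption made up for illustration; it is not the schema of Profilo or of any other tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TelemetryEvent:
    """One production telemetry record (all field names are hypothetical)."""
    # Session-related attributes
    flow: str                  # the interaction being measured, e.g. "cold_start"
    session_id: str
    user_id: str
    # Device-related attributes
    device_model: str
    os_version: str
    # Vitals-related attributes
    render_time_ms: int
    main_thread_cpu_ms: int
    available_memory_mb: int


# Made-up sample values, only to show how a record is built and serialized.
event = TelemetryEvent(
    flow="cold_start",
    session_id="s-1",
    user_id="u-1",
    device_model="example-device",
    os_version="11",
    render_time_ms=5400,
    main_thread_cpu_ms=3100,
    available_memory_mb=512,
)

# Serialize to JSON so the record can be shipped to a backend for analysis.
payload = json.dumps(asdict(event), sort_keys=True)
print(payload)
```

Keeping device, session, and vitals attributes in the same record is what lets you later slice slow sessions by device or OS version.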
I found Profilo, a tool in which Facebook has invested a lot, to be a great fit. It helps capture telemetry related to the main thread, memory info, method traces with their respective timings, and more, even from the production environment. With data from production, it becomes much easier to identify the bottlenecks in a user interaction.
Optimizing the rendering ⏰
Let's say the product team asks you to optimize the cold start time or the rendering time after an increase in user complaints. What will your team do next?
- Look online for blogs like "Reducing cold starts at company X by 80%".
- Start playing with the Android Studio profiler to identify bottlenecks in your interaction and address them.
- Randomly change things that you think might be creating a bottleneck on the main thread.
- Apply everything you picked up from the steps above and ship it 🚀
Although this may give you small performance gains, it will not solve the issues completely. In fact, Brendan Gregg, a performance engineer at Netflix, identifies some of these approaches as "anti-methodologies" for solving performance issues: the Streetlight Anti-Method and the Random Change Anti-Method. According to him,
- The Streetlight Anti-Method is picking tools that are familiar, found on the internet, or found at random, running them, and looking for obvious issues.
- The Random Change Anti-Method is randomly changing things until you see some improvement.
Gregg identifies other anti-methodologies as well. Let's try to understand what the problem with these approaches is.
The common problem when solving performance issues is the absence of a starting point. You can start optimizing anywhere, but you cannot be sure how much it will improve performance in production, because that environment has so many uncertainties. Ideally, you want the largest gains first, bringing rendering time down with as little effort as possible.
Doing this demands a methodology or a strategy rather than going all out on each and every issue you found from a random tool or a random change.
Creating a methodology 🗒
In this section, I will try to explain the methodology that works for me. To improve our metric, we created the following methodology:
- Capture rendering time for all users.
- Segment the users facing the slowest rendering times by the percentage of their sessions that are slow for a particular flow. For example, user X can have 80% of sessions with slow rendering time.
- Define "slow" per flow, since the definition is subjective. For example, for cold starts the Android vitals dashboard flags a session as "slow" when startup takes 5 seconds or longer. Check the Android vitals docs for details.
- Capture telemetry events for all users who experienced slow rendering time in more than 50% of their sessions for a particular flow.
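The segmentation step is easy to sketch in code. The Python snippet below is an illustration under stated assumptions, not our production pipeline: it assumes each session has already been reduced to a `(user_id, render_time_ms)` pair, uses the 5-second cold-start example as the "slow" threshold, and applies the 50% cutoff from the list above. The function names and the sample data are made up.

```python
from collections import defaultdict

SLOW_THRESHOLD_MS = 5000  # mirrors the 5-second cold-start example; tune per flow
SLOW_SHARE_CUTOFF = 0.5   # select users with more than 50% slow sessions


def slow_session_share(sessions, threshold_ms=SLOW_THRESHOLD_MS):
    """Map user_id -> fraction of that user's sessions at or above the threshold."""
    total = defaultdict(int)
    slow = defaultdict(int)
    for user_id, render_time_ms in sessions:
        total[user_id] += 1
        if render_time_ms >= threshold_ms:
            slow[user_id] += 1
    return {user: slow[user] / total[user] for user in total}


def affected_users(sessions, cutoff=SLOW_SHARE_CUTOFF, threshold_ms=SLOW_THRESHOLD_MS):
    """Return the users whose slow-session share is strictly above the cutoff."""
    shares = slow_session_share(sessions, threshold_ms)
    return sorted(user for user, share in shares.items() if share > cutoff)


# Made-up sessions for one flow: (user_id, render_time_ms)
sessions = [
    ("user_x", 6200), ("user_x", 5100), ("user_x", 7400), ("user_x", 5900), ("user_x", 1800),
    ("user_y", 1200), ("user_y", 5300),
    ("user_z", 900), ("user_z", 1100),
]

print(slow_session_share(sessions))
print(affected_users(sessions))
```

In this sample, `user_x` has 4 slow sessions out of 5 (80%) and is selected, while `user_y` sits at exactly 50% and is not, because the cutoff is "more than 50%". Detailed telemetry capture would then be enabled only for the selected users.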
From the telemetry events, such as the method traces and the time taken by each method, we get to know the issues, and we solve them iteratively in the order that gives us the most performance gain for the least code-change effort.
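One simple way to arrive at such an order is to rank methods by the time they cost divided by a rough effort estimate. The Python sketch below assumes the method traces have been flattened into `(method_name, duration_ms)` samples and that the team supplies its own effort score per method (for example, in days). The method names, the durations, and the scoring formula are all hypothetical; treat this as one possible heuristic rather than the exact process we used.

```python
from collections import defaultdict


def rank_hotspots(trace_samples, effort_by_method):
    """Rank methods by total traced time per unit of estimated effort.

    Returns (method, total_ms, gain_per_effort) tuples, best candidate first.
    Methods without an effort estimate default to an effort of 1.
    """
    total_ms = defaultdict(int)
    for method, duration_ms in trace_samples:
        total_ms[method] += duration_ms

    ranked = [
        (method, ms, ms / effort_by_method.get(method, 1))
        for method, ms in total_ms.items()
    ]
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


# Made-up trace samples collected from the affected users.
trace_samples = [
    ("initAnalyticsSdk", 900),
    ("inflateCatalogLayout", 600),
    ("initAnalyticsSdk", 1100),
    ("readRemoteConfig", 400),
]
# Made-up effort estimates (higher means a more expensive code change).
effort_by_method = {
    "initAnalyticsSdk": 5,
    "inflateCatalogLayout": 1,
    "readRemoteConfig": 2,
}

ranked = rank_hotspots(trace_samples, effort_by_method)
for method, total, score in ranked:
    print(method, total, score)
```

Here `inflateCatalogLayout` ranks first even though `initAnalyticsSdk` costs more total time, because it is estimated to be much cheaper to fix.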
Deferring the use of any tool and identifying the affected population first is a huge win, because it gives you the starting point from which to use the tools.
After the improvements, I realized that it is better to create a strategy or a method, because randomly solving everything through tools may give you a bottleneck but not the bottleneck.
Do you follow methodologies to solve performance issues? Reach out to me @droid_singh and let me know.