The 3 Phases of My API Optimization
This post shares what I learned about speeding up an API by improving the database and the tradeoffs that came with it.
I work at a CRM company where one feature lets a user see their own details along with those of everyone in their "family": the user who created their account, the users they have created, the users created by those users, and so on. In other words, it shows a full family tree of accounts: parents, grandparents, children, and siblings.
To build this, we use an API that runs a recursive query in PostgreSQL. It starts from a user, follows the parent_id links up until it reaches the very first ancestor, and then walks back down through all the children. With millions of records, this is heavy work for the database. Even on our powerful servers, the query was taking more than 1.5 seconds to run. The frontend team flagged this as a serious issue: with thousands of users making the same request, the whole system could slow down.
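To make the shape of that query concrete, here is a minimal sketch of an up-then-down recursive CTE. It uses SQLite from Python purely for illustration (SQLite supports the same `WITH RECURSIVE` syntax as PostgreSQL); the table layout and names here are simplified assumptions, not our production schema.

```python
import sqlite3

# In-memory stand-in for the users table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    INSERT INTO users VALUES
        (1, 'root',     NULL),
        (2, 'child',    1),
        (3, 'grandkid', 2),
        (4, 'sibling',  1);
""")

# Walk up from the starting user to the top-level ancestor,
# then walk back down through every descendant of that ancestor.
FAMILY_TREE_SQL = """
WITH RECURSIVE
ancestors(id, parent_id) AS (
    SELECT id, parent_id FROM users WHERE id = ?
    UNION ALL
    SELECT u.id, u.parent_id FROM users u
    JOIN ancestors a ON u.id = a.parent_id
),
family(id) AS (
    SELECT id FROM ancestors WHERE parent_id IS NULL  -- the root
    UNION ALL
    SELECT u.id FROM users u JOIN family f ON u.parent_id = f.id
)
SELECT u.id, u.name FROM users u JOIN family f ON u.id = f.id ORDER BY u.id;
"""

rows = conn.execute(FAMILY_TREE_SQL, (3,)).fetchall()
print(rows)  # every member of user 3's family tree, root included
```

On a toy table this is instant, but both recursive legs have to be re-evaluated on every call, which is why it gets expensive at scale.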
Time to debug. I figured I must have missed something obvious. Maybe the parent_id column wasn't indexed? But it was a foreign key and the index was already there, so no luck. Next, maybe the ORM I was using wasn't generating an efficient query. I rewrote it in raw SQL to see if that helped. It was faster, but only by about 72%, still far short of my 500% target.

At this point, it was clear I needed to change the way the data was structured. Since I was pulling user info by joining multiple tables, I thought of using a materialized view: essentially a read-only table that stores the result of a complex query, joins and all, so everything I need is already in one place. The view then gets refreshed every 15 minutes with updated data. When I tried this, the difference was huge: the query dropped to around 200 ms, about 7.5× faster than the original 1.5 s. Great! I hit my performance goal and then some.

But then I started thinking: is this really the best approach? I was duplicating a lot of data, and every refresh put extra load on the database. It worked, but it didn't feel clean or sustainable.
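The mechanics of a materialized view can be sketched like this. Since SQLite (used here so the example runs anywhere) has no materialized views, the sketch simulates one with a plain table rebuilt from the join; in PostgreSQL the same idea is `CREATE MATERIALIZED VIEW` plus a scheduled `REFRESH MATERIALIZED VIEW`. Table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    CREATE TABLE profiles (user_id INTEGER, email TEXT);
    INSERT INTO users VALUES (1, 'root', NULL), (2, 'child', 1);
    INSERT INTO profiles VALUES (1, 'root@example.com'), (2, 'child@example.com');
""")

def refresh_user_details():
    """Rebuild the precomputed join from scratch.

    In PostgreSQL this whole function collapses to one statement,
    e.g.  REFRESH MATERIALIZED VIEW user_details;
    """
    conn.executescript("""
        DROP TABLE IF EXISTS user_details;
        CREATE TABLE user_details AS
            SELECT u.id, u.name, u.parent_id, p.email
            FROM users u LEFT JOIN profiles p ON p.user_id = u.id;
    """)

refresh_user_details()  # in production, scheduled (e.g. every 15 minutes)

# Reads now hit one flat table instead of a multi-table join.
row = conn.execute("SELECT name, email FROM user_details WHERE id = 2").fetchone()
print(row)  # ('child', 'child@example.com')
```

The tradeoff is visible even in the sketch: reads are trivial, but the refresh recomputes and duplicates everything, and any write landing between refreshes is invisible until the next one.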
I started thinking there had to be a simpler way. What if I just added a root_parent_id column to every record in the users table? Each user would then always point to their top-level ancestor, no matter where they sit in the family tree. Whenever I need the full bloodline, I can simply query all users with the same root_parent_id. I tried it out. After updating the database and running some tests, the response time averaged under 300 ms. That's fast enough, and it didn't involve as much data duplication or heavy refreshes on the server.

But soon I hit problems. This setup gave me the full bloodline, but what if another feature later needed only parents, children, or siblings? For that, I'd still be stuck with the same expensive recursive query I was trying to avoid. Keeping `root_parent_id` up to date was another issue: every time a user was added or updated, the system would need to recalculate it. Doing that during a save would slow things down and could cause timeouts, so I'd have to push it into a job queue. It worked, but it felt like too much effort for something that still wasn't flexible. I needed an even simpler fix.
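Here is a small sketch of the root_parent_id idea: compute the top-level ancestor once at write time, then answer the bloodline question with a single indexed lookup. Again SQLite stands in for PostgreSQL, and the helper names (`add_user`, `bloodline`) are hypothetical, not our actual API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY, parent_id INTEGER, root_parent_id INTEGER)""")

def add_user(user_id, parent_id):
    """Insert a user, precomputing its top-level ancestor by walking up
    the parent links (in production this recalculation would live in a
    background job to keep saves fast)."""
    root, current = user_id, parent_id
    while current is not None:
        root, current = conn.execute(
            "SELECT id, parent_id FROM users WHERE id = ?", (current,)).fetchone()
    conn.execute("INSERT INTO users VALUES (?, ?, ?)", (user_id, parent_id, root))

def bloodline(user_id):
    """Fetch the whole family with one flat query instead of a recursive CTE."""
    (root,) = conn.execute(
        "SELECT root_parent_id FROM users WHERE id = ?", (user_id,)).fetchone()
    return [r[0] for r in conn.execute(
        "SELECT id FROM users WHERE root_parent_id = ? ORDER BY id", (root,))]

add_user(1, None)  # root points to itself
add_user(2, 1)
add_user(3, 2)
print(bloodline(3))  # [1, 2, 3]
```

Note what the sketch cannot do: `bloodline` returns the entire tree only. Asking for just the parents or just the siblings still needs the parent_id walk this column was meant to replace.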
What if I just let the API run as it had been, through the slow, complex recursive CTE, and cached the response for a given user and payload? The first API call took 1.6 seconds, but subsequent ones came back in under 300 milliseconds. Great; now I just needed to make sure the response stayed accurate. I set the cache timeout to one hour. If a user gets updated, there's a chance their parent has changed, which could affect not just that user but many others as well. So in that case I simply invalidate the entire user cache. Users aren't updated very often, so this is fine.
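The caching scheme can be sketched as a tiny TTL cache keyed by user and payload, with a blunt invalidate-everything hook for user updates. This is a minimal in-process illustration of the idea; the class and method names are mine, and a production setup would more likely be something like Redis with a one-hour expiry.

```python
import time

class BloodlineCache:
    """TTL cache keyed by (user_id, payload). A sketch of the approach,
    not a production cache."""

    def __init__(self, ttl_seconds=3600):  # one-hour timeout
        self.ttl = ttl_seconds
        self._store = {}  # (user_id, payload) -> (expires_at, response)

    def get(self, user_id, payload):
        entry = self._store.get((user_id, payload))
        if entry is None or entry[0] < time.monotonic():
            return None  # miss or expired: caller re-runs the slow CTE
        return entry[1]

    def put(self, user_id, payload, response):
        self._store[(user_id, payload)] = (time.monotonic() + self.ttl, response)

    def invalidate_all(self):
        """A single user update can reparent many users, so rather than
        tracking who is affected, drop everything."""
        self._store.clear()

cache = BloodlineCache()
assert cache.get(42, "tree") is None              # first call: slow CTE runs
cache.put(42, "tree", {"family": [1, 2, 42]})
assert cache.get(42, "tree") == {"family": [1, 2, 42]}  # cached fast path
cache.invalidate_all()                            # some user was updated
assert cache.get(42, "tree") is None              # next call recomputes
```

Invalidating the whole cache on any update is deliberately crude, but it trades a few extra slow queries for guaranteed correctness, which is a fair deal when updates are rare.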
I told the frontend team that the slow endpoint had been fixed, and to this day I haven't received a complaint, so I guess it's working.