What Goes Down Better Come Up a.k.a. Adventures in HBase Diagnostics
Earlier this year, the feedly cloud had a rough patch: API requests and feed updates started slowing down, eventually reaching the point where we experienced outages for a few days during our busiest time of the day (weekday mornings). For a cloud-based company, being down for any period of time is soul-crushing, never mind for a few mornings in a row. This led to a lot of frustration, late nights, and general questioning of the order of the universe. But with some persistence we managed to get through it and figure out the problem. We thought it might be some combination of instructive and interesting for others to hear, so we’re sharing the story.
But first we’d especially like to thank Lars George. Lars is a central figure in the HBase community who has now founded a consulting company, OpenCore. It’s essentially impossible to find a great consultant in these situations, but through a bit of luck Lars happened to be in the Bay Area during this period and stopped by our offices for a few days to help out. His deep understanding of the HBase source code as well as past experiences was pivotal in diagnosing the problem.
The Cloud Distilled
Boiled down to basics, the feedly cloud does 2 things: it downloads new articles from websites (which we call “polling”) and serves API requests so users can read articles via our website, mobile apps, and even third party applications like Reeder. This sounds simple, and in some sense it is. Where it gets complex is the scale at which our system operates. On the polling side, there are about 40 million sources producing over 1000 articles every second. On the API side, we have about 10 million users generating over 200 million API requests per day. That’s a whole lotta bytes flowing through the system.
Due to this amount of data, the feedly cloud has grown significantly over the last 3 years: crawling more feeds, serving more users, and archiving more historical content – to allow users to search, go back in time, and dig deeper into topics.
Another source of complexity is openness. As a co-founder, this is one aspect of feedly that I really love. We allow essentially any website to be able to connect with any user. We also allow 3rd party apps to use our API in their application. As an engineer, this can cause lots of headaches. Sourcing article data from other websites leads to all kinds of strange edge cases — 50MB articles, weird/incorrect character encodings, etc. And 3rd party apps can generate strange/inefficient access patterns.
Both of these factors combine to make performance problems particularly hard to diagnose.
We experienced degraded performance during the week of April 10th and more severe outages the following week. It was fairly easy to narrow the problem down to our database (HBase). In fact, in the weeks prior, we had noticed occasional ‘blips’ in performance and, during those blips, a slowdown in database operations, albeit on a much smaller scale.
Fortunately our ops team had already been collecting HBase metrics into a graphing system. I can’t emphasize enough how important this was. Without any historical information, we’d have been at a total loss as to what had changed in the system. After poking around the many, many, many HBase metrics we found something that looked off (the “fsSyncLatencyAvgTime” metric). Better still, these anomalies roughly lined up with our down times. This led us to come up with a few theories:
- We were writing larger values. This could occur if user or article data changed somehow or due to a buggy code change.
- We were writing more data overall. Perhaps some new features we built were overwhelming HBase.
- Some hardware problem.
- We hit some kind of system limit in HBase and things were slowing down due to the amount or structure of our data.
Unfortunately all these theories were extremely hard to prove or disprove, and each team member had their own personal favorite. This is where Lars’s experience really helped. After reviewing the graphs, he dismissed the “system limit” theory. Our cluster is much smaller than some other companies’ clusters out there and the configuration seemed sane. His feeling was that it was a hardware/networking issue, but there was no clear indicator.
Theory 1: Writing Larger Values
This theory was kind of a long shot. The idea was that perhaps every so often we were writing really big values and that caused HBase to have issues. We added more metrics (this is a common theme when performance problems hit) to track when outlier read/write sizes occur, e.g. if we read or wrote a value larger than 5MB. After examining the charts, large read/writes kind of lined up with slowdowns but not really. To eliminate this as a possibility, we added an option to reject any large read/writes in our code. This wouldn’t be a final solution (all you oddballs that subscribe to 20,000 sources wouldn’t be able to access your feedly), but it let us confirm that this was not the root cause, as we continued to have problems.
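To make the idea concrete, here is a minimal Python sketch of that kind of size guard. It is not our production code: the `SizeGuard` class, its method names, and the table names are made up for illustration, and only the 5MB threshold comes from the story above.

```python
from collections import Counter

# 5MB threshold from the post; everything else here is hypothetical.
LARGE_VALUE_BYTES = 5 * 1024 * 1024


class ValueTooLargeError(Exception):
    """Raised when rejection is enabled and a value exceeds the threshold."""


class SizeGuard:
    """Counts outlier read/write sizes and optionally rejects them."""

    def __init__(self, threshold=LARGE_VALUE_BYTES, reject=False):
        self.threshold = threshold
        self.reject = reject
        self.outliers = Counter()  # (operation, table) -> number of outliers

    def check(self, op, table, value):
        """Record the size of a value about to be read or written."""
        size = len(value)
        if size > self.threshold:
            self.outliers[(op, table)] += 1
            if self.reject:
                raise ValueTooLargeError(
                    f"{op} of {size} bytes on {table} exceeds {self.threshold}"
                )
        return size
```

Running with `reject=False` only feeds the charts. Flipping it to `True` is the experiment: if slowdowns continue even though no large values get through, large values are not the cause.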
Theory 2: Writing More Data
This theory was perhaps more plausible than theory 1. The idea was that as feedly is growing, we eventually just reached a point where our request volume was too much for our database cluster to handle. We again added some metrics to track overall data read and write rates to HBase. Here again, things kind of lined up but not really. But we noticed we had high write volume on our analytics data table. This table contains a lot of valuable information for us, but we decided to disable all read/write activity to it as it’s not system critical.
After deploying the change, things got much better! Hour-long outages were reduced to a few small blips. But this didn’t sit well with us. Our cluster is pretty sizable, and should be able to handle our request load. Also, the rate of increase in downtime was way faster than our increase in storage used or request rate. So we left the analytics table disabled to keep performance manageable but continued the hunt.
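The read/write rate tracking described in this section can be sketched roughly as below. This is an illustrative Python stand-in, not what we actually ran: the `TableThroughput` class and the table names are hypothetical, and a real setup would ship these numbers to the graphing system rather than keep them in memory.

```python
import time
from collections import defaultdict


class TableThroughput:
    """Accumulates bytes read/written per table and reports average rates."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the tracker can be tested
        self._start = clock()
        self._bytes = defaultdict(int)  # (table, op) -> total bytes

    def record(self, table, op, nbytes):
        self._bytes[(table, op)] += nbytes

    def rates(self):
        """Average bytes per second for each (table, op) since start."""
        elapsed = max(self._clock() - self._start, 1e-9)
        return {key: total / elapsed for key, total in self._bytes.items()}

    def busiest(self, op="write"):
        """Table with the most bytes for the given op, or None if no data."""
        totals = {t: b for (t, o), b in self._bytes.items() if o == op}
        return max(totals, key=totals.get) if totals else None
```

A per-table breakdown like `busiest("write")` is what makes a heavy table stand out. An overall rate alone would not have pointed us at the analytics table.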
Theory 3: Hardware Problem
As a software engineer this is always my favorite theory. It generally means I’ve done nothing wrong and don’t have to do anything to fix the problem. Unfortunately hardware fails in a myriad of oddball ways, so it can be very hard to convince everyone this is the cause and more importantly to identify the failing piece of equipment. This ended up being the root cause, but was particularly hard to pin down in this case.
How we Found the Problem and Fixed it
Here again, Lars’s experience helped us out. He recommended isolating the HBase code where the problem surfaced and then creating a reproducible test by running it in a standalone manner. So after about a day of work I was able to build a test we could run on our database machines, but independent of our production data. And it reproduced the problem! When debugging intermittent issues, having a reproducible test case is 90% of the battle. I was able to enable all the database log messages during the test and I noticed 2 machines were always involved in operations when slowdowns occurred, dn1 and dn3.
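Spotting the two machines boiled down to asking which hosts show up in every slow operation. The sketch below shows that tally in Python under some assumptions: the record format (a duration plus the list of hosts involved), the one-second cutoff, and the sample host names other than dn1 and dn3 are invented for illustration, and the real information came from reading HBase log messages.

```python
from collections import Counter


def hosts_in_slow_ops(ops, slow_ms=1000.0):
    """Return, per host, the share of slow operations it took part in.

    `ops` is an iterable of (duration_ms, hosts) pairs, a hypothetical
    format standing in for whatever gets parsed out of the logs.
    """
    slow = [set(hosts) for duration_ms, hosts in ops if duration_ms >= slow_ms]
    if not slow:
        return {}
    counts = Counter(host for hosts in slow for host in hosts)
    return {host: n / len(slow) for host, n in counts.items()}


sample_ops = [
    (20.0, ["dn2", "dn4", "dn5"]),
    (2000.0, ["dn1", "dn2", "dn3"]),
    (1500.0, ["dn1", "dn3", "dn4"]),
]
shares = hosts_in_slow_ops(sample_ops)
suspects = sorted(host for host, share in shares.items() if share == 1.0)
```

A host with a share of 1.0 is present in every slow operation, which makes it a suspect rather than a proven culprit. It still needs a targeted test to confirm.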
I then extended the test to separately simulate the networking and disk write behavior the HBase code performed. This let us narrow the problem down to a network issue. We removed the 2 nodes from our production cluster and things immediately got better. Our ops team found out the problem was actually in a network cable or patch panel. This was an especially insidious failure as it didn’t manifest itself in any machine logs. Incidentally, a network issue was actually Lars’s original guess as to the problem!
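For readers who want to try the same split, here is a rough Python sketch of timing disk syncs and network sends separately. It is a simplified stand-in for the actual test, which exercised the HBase code paths. The function names, payload sizes, and repeat counts are assumptions, and `tcp_send` only times connection setup plus handing the payload to the kernel, so it needs a listener you control on the target host and port.

```python
import os
import socket
import statistics
import tempfile
import time


def time_calls(fn, repeats=20):
    """Call fn repeatedly and return the duration of each call in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def synced_write(path, payload):
    """Append payload to a file and force it to disk, like a log sync."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def tcp_send(host, port, payload, timeout=5.0):
    """Open a TCP connection and send payload; no disk involved."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


def summarize(samples):
    """Median shows the typical case; max exposes intermittent stalls."""
    return {"median": statistics.median(samples), "max": max(samples)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sync_test.bin")
        disk = time_calls(lambda: synced_write(path, b"x" * 4096))
        print("disk:", summarize(disk))
```

Running the disk half on each machine and the network half between each pair of machines gives two sets of numbers to compare. In our case the disk side looked normal and the network side did not.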
The important thing when dealing with performance problems (outside of, you know, fixing them) is trying to learn what you did well and what you could have done better.
Things we did well:
- Had a good metrics collection/graphing system in place. This should go without saying, but lots of times these types of projects can get delayed or deferred.
- Got expert help. There are lots of great resources out there. If you can’t find a great consultant, lots of people are generally willing to help on message boards or other places.
- Stayed focused/methodical. It can get crazy when things are going wrong, but having a scientific process and logical way to attack the problem can make things manageable.
- Dug into our tech stack. We rely almost exclusively on open source software. This enabled us to really understand and debug what was going on.
Things we could have done better:
- Communicated. While Lars suggested networking, I initially discounted it since the problem manifested everywhere in our system, not just on one machine. Had I talked it through with him, I would have learned there are some shared resources specific to data center build-outs.
- Gone more quickly to the hardware possibility. We did a lot of Google searching for the symptoms we were seeing in our system, but there was not much out there. This is kind of an indicator that something weird is probably happening in your environment. A hardware issue is pretty likely.
- Attacked the problem earlier. As I mentioned, we had seen small blips prior to the outages and even done some (mostly unproductive) diagnostic work. Unfortunately not giving this top priority came back to bite us.
But there’s a happy ending to this story. As this post hopefully demonstrates, we’ve learned a lot and come out stronger: the feedly cloud is faster than ever and we have a much better understanding of the inner workings of HBase. We realize speed is very important to our users and will continue to invest in making the feedly cloud faster and more resilient. Speaking of resilience, though we had a small downturn in feedly pro signups in April, we are back to normal. This speaks to what a great community we have!