With thousands of servers powering Yammer and almost a million users on it, billions of logs are produced per day: some that tell stories, and some that are just there to be read. But the logs you may not want to read now may be the very logs you do want to read later; in the long run, an exception can be just as telling as a success.
Yammer logs around 35,000 events per second, adding up to nearly a terabyte of data per day! It’s difficult enough to hop onto a single machine, look through thousands of streaming logs, and diagnose an issue. But what if a service is down and it spans a hundred machines? An enormous amount of engineering time would go toward isolating a problem to one machine, or a handful of them.
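As a back-of-the-envelope check, those two figures imply roughly three billion events per day at a few hundred bytes per event. The per-event size below is derived from the post's totals, not a measured number:

```python
# Back-of-the-envelope sizing for the stated log volume.
# 35,000 events/sec and ~1 TB/day come from the post; the
# average event size is derived from them, not measured.
EVENTS_PER_SEC = 35_000
SECONDS_PER_DAY = 86_400

events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY
print(f"events/day: {events_per_day:,}")  # ~3 billion

TB = 10**12
avg_event_bytes = TB / events_per_day
print(f"implied avg event size: {avg_event_bytes:.0f} bytes")  # ~330 bytes
```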
That scenario is the primary use case for log aggregation at Yammer: on-call incident response. When services are in trouble, log analysis is very useful for 1) detecting which machines are at fault and 2) finding clues as to what may have occurred, such as stack traces. In the long term, persistent issues can also be identified through trends.
The Beginning: Strengths, Weaknesses
We were interested in the Elasticsearch, Logstash, and Kibana (ELK) stack, and decided to build our log aggregation pipeline around it.
As the diagram shows, Logstash-Forwarders configured on each of Yammer’s servers ship logs to Logstash agents, which run filters before finally indexing the events into Elasticsearch, the log store and search engine of the pipeline. Yammer engineers can then query Elasticsearch through a compatible browser UI: Kibana.
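A minimal sketch of what the Logstash agent's side of that flow might look like; the port, certificate paths, filter pattern, and host name here are hypothetical, not Yammer's actual configuration:

```
# Hypothetical Logstash agent config: receive events from
# Logstash-Forwarders over the lumberjack protocol, run a filter,
# then index into Elasticsearch.
input {
  lumberjack {
    port            => 5043
    ssl_certificate => "/etc/pki/logstash.crt"
    ssl_key         => "/etc/pki/logstash.key"
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["es01:9200"]
  }
}
```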
The pipeline handled all of the log traffic well. It was awesome. It IS awesome.
…but things changed. Ultimately, growth in usage caught up with us. Yammer’s logs come from a variety of places, including application logs, and the volume flowing through the pipeline grows directly with our customer base. More logs meant more disk space consumed, which meant disk usage reaching critical levels.
With Logstash struggling to process all of the events and Elasticsearch running out of disk capacity, the unanticipated blockage in the pipeline introduced backpressure into the applications themselves. Sometimes logs would arrive hours late; sometimes they were lost entirely. After many episodes of peak load and unsuccessful temporary fixes, we needed a rework.
We had two issues to resolve: handling an ever-increasing amount of log traffic, and avoiding backpressure from Elasticsearch indexing. Both seemed simple: just scale out Elasticsearch, right? Then scale out the Logstash agents as well. Boom. Done.
But not so fast. We took the time to really revisit the pipeline architecture and see what was going on. We did what every Elasticsearch tutorial advises you to do in the beginning: know your data.
So we took everything apart. We evaluated the sizes of our events, the number of events that a Logstash agent could handle at once, and how much disk capacity we’d need for a 30-day retention period. Then we put it back together.
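The retention math itself is simple to sketch. The daily volume is the post's figure; the replica count and index-overhead multiplier below are illustrative assumptions, not Yammer's actual settings:

```python
# Rough Elasticsearch capacity estimate for a 30-day retention window.
# Daily volume is the post's figure; replicas and index overhead
# are illustrative assumptions.
DAILY_VOLUME_TB = 1.0
RETENTION_DAYS = 30
REPLICAS = 1            # one replica copy per primary shard (assumed)
INDEX_OVERHEAD = 1.2    # assumed multiplier for index structures

raw_tb = DAILY_VOLUME_TB * RETENTION_DAYS
total_tb = raw_tb * (1 + REPLICAS) * INDEX_OVERHEAD
print(f"raw: {raw_tb:.0f} TB, with replicas + overhead: {total_tb:.0f} TB")
```

Under these assumptions, a terabyte a day turns into roughly 72 TB of cluster capacity for a 30-day window, which makes it clear why disk space, not CPU, was the first wall we hit.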
What we ended up with was a more streamlined system: we decoupled the collection of logs from the indexing of logs, and inserted Kafka between those two components for buffering and future archival consumption (more on this in a bit).
By separating the collection and indexing of logs, we were able to scale out each component individually. With a good amount of tuning, we created a pipeline with plenty of headroom: one that not only achieves sub-second ingest-to-index times, but also lets multiple consumers ingest at their own rates.
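The "at their own rates" property comes from Kafka's consumption model: the broker keeps an ordered, append-only log, and each consumer tracks its own read offset, so a slow consumer never holds up a fast one. A toy illustration of that idea in plain Python (no Kafka client, purely a model):

```python
# Toy model of Kafka's consumer-offset idea: one shared append-only
# log; each consumer keeps its own read position, so a slow consumer
# never blocks a fast one.
log = []  # the shared "topic partition"

def produce(event):
    log.append(event)

class Consumer:
    def __init__(self):
        self.offset = 0  # per-consumer position in the log

    def poll(self, max_events):
        batch = log[self.offset:self.offset + max_events]
        self.offset += len(batch)
        return batch

for i in range(10):
    produce(f"event-{i}")

indexer = Consumer()   # e.g. the Elasticsearch indexing path
archiver = Consumer()  # e.g. a slower backup/archival path

fast = indexer.poll(10)  # reads everything available
slow = archiver.poll(3)  # lags behind without affecting the indexer
print(len(fast), len(slow), indexer.offset, archiver.offset)  # 10 3 10 3
```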
Backing up log data is important for compliance, and inserting Kafka gives Yammer a simpler, more efficient way to do so. We intend to use a Logstash agent to consume the data and back it up to our backup servers. Look out for a follow-up blog post once that phase of the project is complete. 😀
The Logstash-Forwarders are conveniently deployed with a default configuration across all of Yammer’s servers that need log analysis. The Forwarders also make it very easy to configure additional logs when needed, leaving it up to each engineer to decide what to track, and how. If the log type already flows through the pipeline’s daily traffic, there’s nothing more to do; if not, support is just a simple grok filter away.
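For example, a new log format can be parsed by adding a grok filter along these lines; the log line and field names here are hypothetical, not one of Yammer's actual formats:

```
# Hypothetical grok filter for a custom log line such as:
#   2016-03-01T12:00:00Z INFO rate-limiter allowed request id=42
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{NOTSPACE:component} %{GREEDYDATA:detail}"
    }
  }
}
```

The named captures (`timestamp`, `level`, `component`, `detail`) become structured fields in Elasticsearch, which is what makes the logs filterable in Kibana rather than just searchable as raw text.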
The Use Cases: Supporting On-Call and Deployments
There are many cases where our engineers find log aggregation particularly helpful.
Suppose an alert fires because a machine has become unavailable. With its logs already aggregated, we can diagnose the issues that may have led up to the failure. If the machine is part of a distributed service, logs from other nodes in the same cluster may hold clues as well.
Or suppose there have been access issues with a user account. Our security team can retrieve access and OS logs to find related activity.
Or suppose we’re redeploying a large service. Assuming we have informative logs, near-real-time indexing lets us watch Kibana for any anomalies caused by the deploy, immediately telling us whether the deploy is solid or needs to be rolled back.
The best part of all: the utility of log aggregation only increases as we gather more data, grow our customer base, and add more services.
That’s all for now, but there are quite a few changes to look forward to in terms of log aggregation at Yammer. We’re planning on moving the pipeline fully into Azure, and we may even explore other pipeline options that abstract maintenance away from our Infrastructure engineers.
There were a few hurdles, but we now have a solid tool that directly aids the on-call response at Yammer. And as Yammer grows, log aggregation will only become more and more useful.