Larger clusters submit more operations, and in the presence of the discussed performance regression, larger clusters also require a longer time to complete each operation. This combination results in a superlinear performance impact.
In the process of investigating our original incident, we found that the performance regression was introduced by a seemingly innocuous change that had slipped into several upstream releases. The goal of the original change (discovered and fixed a few months later) was to fix a bug in how load was calculated when choosing DataNodes to replicate blocks onto. It seemed fairly benign, with only about two dozen lines of non-test code changes, but it turned out that it also significantly increased the time it took the NameNode to add a new block to a file. Increasing the duration of this operation is particularly bad because every access to the file namespace occurs under a single read-write lock, so any increase in block addition time is an increase in the time that all other operations are blocked.

The increase in block allocation time is proportional to the size of the cluster, so its impact becomes more severe as clusters are expanded. Larger clusters also have a more intense workload, and thus more block allocations, further amplifying the effects of this regression. As shown above, the compounding effects of increased operation rate and increased operation runtime mean that adding nodes can have a superlinear effect on overall NameNode load; similarly, some operations' runtimes are affected by the number of blocks or files in the system, making aggregate scaling effects difficult to predict in advance. This was a prime example of a scaling issue that was difficult for us to predict or to notice on our smaller testing clusters, but which we should have been able to catch prior to deployment.
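The compounding effect can be made concrete with a back-of-envelope model. This is our own illustration with hypothetical linear coefficients, not measurements from the incident: it simply assumes both the request rate and the per-operation lock hold time grow linearly with cluster size.

```python
# Illustrative model (not measured data) of how NameNode lock contention
# compounds as a cluster grows. Assumptions:
#   - block-allocation request rate grows linearly with DataNode count
#   - per-allocation time under the regression also grows linearly
def namenode_lock_demand(num_datanodes, base_rate=1.0, base_op_time=1.0):
    """Return lock-seconds demanded per second by block allocations,
    under the linear assumptions above."""
    request_rate = base_rate * num_datanodes   # ops/sec, linear in size
    op_time = base_op_time * num_datanodes     # sec/op, linear in size
    return request_rate * op_time              # lock-seconds per second

# Doubling the cluster quadruples lock demand: a superlinear effect.
small = namenode_lock_demand(1000)
large = namenode_lock_demand(2000)
assert large / small == 4.0
```

Under these (simplified) assumptions, doubling the number of nodes quadruples the time the namespace lock is occupied by block allocations, which is exactly the kind of effect that is invisible on a small test cluster.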
The requirements for our solution
As discussed above, we reduced our HDFS scaling problem to a problem of NameNode scalability. We identified three key factors which affect its performance:
Number of DataNodes in the cluster;
Number and structure of objects managed (files/directories and blocks);
Client workload: request volume and the characteristics of those requests.
We needed a solution which would allow us to control all three parameters to match our real systems. To address the first point, we developed a way to run multiple DataNode processes per physical machine, and to easily adjust both how many machines are used and how many DataNode processes are run on each machine. The second point, in isolation, would be fairly trivial: start a NameNode and fill it with objects that do not contain any data. However, achieving a similar client workload significantly complicates this step, as explained below.
The last point is particularly tricky. The nature of requests can have huge performance implications for the system. Write requests are obviously more expensive than read requests; however, even within a single request type, there can be significant performance variations. For example, performing a listing operation against a very large directory can be thousands of times more expensive than performing a listing operation against a single-item directory, and this has significant implications for garbage collection efficiency and optimal tuning. To capture these effects, we set out with a requirement that our testing should be able to simulate exactly the same workload that our production clusters experience. This dictates not only that we execute the same commands, but that those commands are executed against the same namespace, so the trivial solution to our second point is not sufficient.
Building off of these factors, and adding in a few additional requirements, we came up with the following list of goals:
The simulated HDFS cluster should have a configurable number of DataNodes and a file namespace which is identical to our production cluster.
We should be able to replay the same workload that our production cluster experienced against this simulated cluster. To plan for even larger configurations, we should be able to induce heavier workloads, for example by playing back a production workload at an increased rate.
It should be easy to operate. Ideally, every prospective change would be run through Dynamometer to validate whether it improves or degrades the performance of the NameNode.
The coupling between Dynamometer and the implementation of HDFS should be loose enough that we can easily test multiple versions of HDFS using a single version of Dynamometer and a single underlying host cluster.
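The second goal, replaying a recorded workload at an increased rate, amounts to compressing the inter-arrival times of the trace. The sketch below is our own hypothetical illustration of that idea (not Dynamometer's actual replay code); `send_request` stands in for whatever issues the request against the simulated NameNode.

```python
import time

def replay(trace, send_request, rate_multiplier=1.0):
    """Replay a recorded trace, optionally compressed in time to induce a
    heavier-than-production workload.

    trace: list of (timestamp_sec, request) pairs, sorted by timestamp.
    rate_multiplier: 1.0 replays in real time; 2.0 replays twice as fast.
    """
    start_ts = trace[0][0]
    clock = time.monotonic()
    for ts, request in trace:
        # Compress the recorded offset by the rate multiplier.
        offset = (ts - start_ts) / rate_multiplier
        delay = (clock + offset) - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        send_request(request)
```

With `rate_multiplier=2.0`, a one-hour production trace replays in thirty minutes, roughly doubling the request rate while preserving the relative order and mix of requests.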
To meet the aforementioned requirements, we implemented Dynamometer as an application on top of YARN, the cluster scheduler in Hadoop. We rely on YARN heavily at LinkedIn for other Hadoop-based processing, so this was a natural choice that allowed us to leverage our existing infrastructure. YARN allows us to easily scale Dynamometer by adjusting the amount of resources requested, and helps to decouple the simulated HDFS cluster from the underlying host cluster.
There are three main components to the Dynamometer setup:
The infrastructure component is the simulated HDFS cluster itself.
The workload component simulates HDFS clients to generate load against the simulated NameNode.
The driver coordinates the other two components.
The logic encapsulated in the driver enables a user to perform a full test execution of Dynamometer with a single command, making it possible to do things like sweeping over different parameters to find optimal configurations.
The infrastructure application is written as a native YARN application in which a single NameNode and numerous DataNodes are launched and wired together to create a fully simulated HDFS cluster. To meet our requirements, we need a cluster which contains, from the NameNode’s perspective, the same information as our production cluster. To achieve this, we first collect the FsImage file (containing all file system metadata) from a production NameNode and place it onto the host HDFS cluster; our simulated NameNode can use it as-is.

To avoid having to copy an entire cluster’s worth of blocks, we leverage the fact that the actual data stored in blocks is irrelevant to the NameNode, which is only aware of the block metadata. We first parse the FsImage using a modified version of Hadoop’s Offline Image Viewer, extract the metadata for each block, and then partition this information onto the nodes which will run the simulated DataNodes. We use SimulatedFSDataset to bypass the DataNode storage layer and store only the block metadata, loaded from the information extracted in the previous step. This scheme allows us to pack many simulated DataNodes onto each physical node, as the size of the metadata is many orders of magnitude smaller than the data itself.
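The partitioning step can be sketched as follows. This is our own simplification, not Dynamometer's code: it assigns each block's metadata to one simulated DataNode by block ID, whereas in practice each block replica must land on a different simulated DataNode.

```python
# Hypothetical sketch of partitioning block metadata (as extracted from the
# fsimage) across simulated DataNodes, so each process loads only its share.
def partition_blocks(block_metadata, num_datanodes):
    """block_metadata: iterable of (block_id, generation_stamp, num_bytes)
    tuples. Returns one list of block tuples per simulated DataNode."""
    partitions = [[] for _ in range(num_datanodes)]
    for block in block_metadata:
        block_id = block[0]
        # Stable assignment: the same block always lands on the same node,
        # so the partitioning is reproducible across runs.
        partitions[block_id % num_datanodes].append(block)
    return partitions
```

Because each simulated DataNode holds only these small metadata tuples rather than real block files, many such processes fit on one physical machine, which is what makes full-scale simulation affordable.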