Google File System
I did some reading of the GFS paper from 2003! In this post I’ll try to explain the paper. Get some beer.
I started reading the paper because I came across and grew interested in Apache HDFS, which is an implementation of this paper's design. GFS also provides the basis of the file system for the entire MapReduce infrastructure.
Google came up with GFS because of the following requirements that they had:
- Filesystems and file servers crash constantly. This might not be true of any single machine, but when you have tens of thousands of servers (circa 2003) running, there's a high chance that some of them will be unavailable at any given time.
- File sizes were getting into hundreds of MB or multiple GBs.
- Google's file workload was unusual: files were mostly read sequentially or appended to, rather than written at random offsets. So GFS optimises for append operations.
Master and Chunkservers
A GFS cluster has a single master server and a large number of chunkservers. With respect to files, the master is responsible for holding just the metadata of the files (described below), while the file contents actually reside on the chunkservers.
When the master starts, it iterates through all the chunkservers in the cluster and notes down their information. Any time a new chunkserver joins the cluster, it announces itself to the master. These are the only two ways the master learns about chunkservers. After that, each chunkserver keeps communicating with the master through heartbeat messages, which keep its health information up to date.
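The heartbeat bookkeeping on the master can be sketched roughly like this. This is a minimal illustration, not the paper's actual protocol: the class and method names and the 10-second timeout are my own assumptions.

```python
import time

# Illustrative timeout: how long without a heartbeat before the master
# treats a chunkserver as unavailable. The actual value is an assumption.
HEARTBEAT_TIMEOUT = 10.0  # seconds

class Master:
    def __init__(self):
        # chunkserver id -> timestamp of its last heartbeat
        self.last_heartbeat = {}

    def register_chunkserver(self, server_id, now=None):
        # Called during the master's startup scan, or when a new
        # chunkserver joins the cluster and announces itself.
        self.last_heartbeat[server_id] = now if now is not None else time.time()

    def receive_heartbeat(self, server_id, now=None):
        # Each heartbeat simply refreshes the server's timestamp.
        self.last_heartbeat[server_id] = now if now is not None else time.time()

    def dead_servers(self, now=None):
        # Servers that have gone silent for longer than the timeout.
        now = now if now is not None else time.time()
        return [sid for sid, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]
```

The point of the design is that the master never polls: chunkservers push their status, and silence itself is the failure signal.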
A file is divided into multiple chunks (of 64MB each) and each chunk has multiple copies called replicas. Each replica resides on a different chunkserver for redundancy, and is persisted as a simple 64MB file on the disk.
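Because the chunk size is fixed at 64 MB, mapping a byte offset to its chunk is simple arithmetic; a quick sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB chunk size from the paper

def chunk_index(byte_offset):
    # A byte offset within a file falls into exactly one chunk.
    return byte_offset // CHUNK_SIZE

def num_chunks(file_size):
    # Number of chunks needed to hold a file of the given size
    # (the last chunk may be partially filled).
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
```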
For each new incoming file, the master creates chunk records and assigns each chunk's replicas to different chunkservers. This placement mapping is part of the metadata stored on the master.
For example, a 640 MB file (say, a video) will be divided into 10 chunks, each having multiple replicas distributed over multiple chunkservers. The master notes down which chunkserver holds which replica; this mapping is the metadata the master keeps.
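The shape of that metadata can be pictured as two lookup tables: file name to an ordered list of chunk handles, and chunk handle to the servers holding its replicas. A hypothetical sketch (all names are made up, not the paper's data structures):

```python
def locate(file_to_chunks, chunk_locations, filename, chunk_idx):
    """Resolve (filename, chunk index) to a chunk handle and its replica servers."""
    handle = file_to_chunks[filename][chunk_idx]
    return handle, chunk_locations[handle]

# Hypothetical metadata for a 3-chunk file with 3 replicas per chunk.
file_to_chunks = {"/videos/cat.mp4": ["h0", "h1", "h2"]}
chunk_locations = {
    "h0": ["cs1", "cs4", "cs7"],
    "h1": ["cs2", "cs5", "cs8"],
    "h2": ["cs3", "cs6", "cs9"],
}
```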
Application/Clients & File Modifications
When an application wants to do a file operation (most of the time an append), it knows the file name and the offset from the start of the file. It sends these to the master, which replies with the location of the relevant chunk, and the client caches this information.
Caching this information reduces the client's dependency on the master. The client can fetch the locations of multiple chunks at once, accumulate several operations against each chunk, and then fire them off directly at the chunkservers.
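A minimal sketch of that client-side cache, assuming a stubbed-out master lookup (the names here are illustrative, not GFS's actual client API):

```python
class Client:
    def __init__(self, master_lookup):
        # master_lookup: callable (filename, chunk_idx) -> list of replica servers,
        # standing in for the real RPC to the master.
        self.master_lookup = master_lookup
        self.cache = {}        # (filename, chunk_idx) -> replica list
        self.master_calls = 0  # counts how often we actually hit the master

    def get_replicas(self, filename, chunk_idx):
        key = (filename, chunk_idx)
        if key not in self.cache:
            # Cache miss: ask the master once, then remember the answer.
            self.master_calls += 1
            self.cache[key] = self.master_lookup(filename, chunk_idx)
        return self.cache[key]
```

Repeated operations on the same chunk then cost zero extra round-trips to the master, which is what keeps the single master from becoming a bottleneck.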
The master maintains a list of all the chunks for each file. Each chunk, as described above, is replicated across multiple copies called replicas, and the master tracks these as well.
There's a cluster-wide minimum number of replicas that must be maintained for each chunk. If the chunkserver holding a given replica goes down, it stops sending heartbeat messages to the master. The master then re-creates the lost replicas on other servers by copying from the surviving ones. This way a sufficient number of replicas is always available.
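The re-replication decision can be sketched as follows. When a server goes silent, every chunk that had a replica on it is below target, and the master issues copy orders from a surviving replica to a fresh server. The replication target of 3 matches the paper's default; all function and server names are illustrative.

```python
REPLICATION_TARGET = 3  # the paper's default replica count per chunk

def plan_rereplication(chunk_locations, dead_server, live_servers):
    """Return (chunk_handle, source_server, destination_server) copy orders
    for every chunk that lost a replica on dead_server."""
    orders = []
    for handle, servers in chunk_locations.items():
        if dead_server not in servers:
            continue  # this chunk had no replica on the failed server
        survivors = [s for s in servers if s != dead_server]
        # Any live server that doesn't already hold this chunk is a candidate.
        candidates = [s for s in live_servers if s not in survivors]
        if survivors and candidates and len(survivors) < REPLICATION_TARGET:
            # Copy from one surviving replica to one new server.
            orders.append((handle, survivors[0], candidates[0]))
    return orders
```

Note that the copy source is a surviving replica, not the master: the master only plans, the data flows between chunkservers.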
This is a basic extract of the GFS paper. There are many more technical details in it, but I found those pretty non-exciting when all I needed was the basic idea.
One key takeaway from all of this was the interaction between the chunkservers and the master server. It's not the master's job to keep querying the status of the various chunks. It holds just the metadata; the actual changes are the concern of the chunkservers and the applications using them. The master only tracks the health of the chunkservers, which it checks via the "heartbeat" messages.
The knowledge that system failure is a matter of "when" and not "if" really allows you to architect things in a different way.