Gluster - The Red Hat of Storage?
By Tony Asaro on Dec 29, 2009 | In Data Management, Virtualization, Storage, Storage Management
Gluster is a startup that is providing an open source file system that is EASY to use and manage, has fast and versatile performance, is good for people who have 2 TB of storage or 20 PB or more (and everything in between), provides scale up and scale out, and will also work with existing file systems like ext3, ext4 and ZFS (and I also suggest they support Btrfs soon). But given all of that - the first time I heard about these guys my reaction was - NOT ANOTHER FILE SYSTEM!!! Haven't we heard it all before? Yes and no.
The most important thing to me about Gluster, and why it matters, is their go-to-market strategy. They basically have the Red Hat model: giving the software away for free and charging for support. Assuming that the software is as good as they say it is - this model could be landscape changing.
Consider people who are implementing test / dev environments and want a low cost networked storage solution that is easy to manage. Consider companies that want to build huge archives of content but don't have a huge infrastructure budget. Consider the unbridled growth of content and the fact that a new model is required to contain costs and manage scale. And that is just the early adopters - consider how a solution like this can begin to nibble away at the bread and butter storage environment as it gains credibility and traction over time.
Certainly, the traditional storage market will continue to thrive - just like Unix has not been obsoleted by Linux - but I believe that the timing may be right for someone like Gluster. Think about it - thousands and thousands of users that build their own scalable storage system using commodity hardware with a free open source solution that is EASY, FAST, INTELLIGENT and RELIABLE. All four of these attributes have to be true for the masses to use it to the scale that Gluster I am sure is hoping for. And of course, Gluster's support has to be stellar and worth the annual pricing - which I believe is modest.
The Red Hat of storage? Not such a crazy idea and the timing may just be perfect given all of the market dynamics.
* * * * * * * * *
The following is a very informative email discussion I had with Jack O'Brian and Kamal Varma of Gluster:
Tony Asaro: If GlusterFS is a file system - why do you need another file system?
Gluster Folks: GlusterFS is a complete storage stack that includes volume management, software RAID, I/O scheduling, cache management, distributed locking, etc. For storing the data persistently on disk, Gluster (like many others) chose to use a standard disk format, rather than introduce a new one. Gluster features and performance are not dependent on the disk file system, but we benefit from leveraging a proven existing technology that is more convenient for the customer. Other file systems (from Lustre to Hadoop) also leverage the disk file system for low level management of block devices, however they do so with proprietary data formats. Gluster stores the data just as files and folders.
Tony Asaro: GlusterFS is in user space - why did you take that approach?
Gluster Folks: The two primary reasons are simplicity and flexibility, and we knew this could be achieved without sacrificing performance. Kernel implementations of file systems are by nature complex with many dependencies. Installation in user space is as simple as installing other common applications - without any need for unique kernel modifications or patches. User space implementation also enabled superior configuration flexibility. File system functionality is implemented in modules that can be stacked in user space to match the configuration to a given workload. User space modules also accelerate time to market and rapid maturity of new features - new modules can be quickly developed and integrated without the complexity of a monolithic code base with kernel dependencies.
Tony Asaro: The first conclusion will be that if it runs in user space it will not perform all that well.
Gluster Folks: We understood scalability from our experience building the first cluster that scaled to over 1000 nodes. We also had extensive OS design experience including microkernel design where OS functionality is implemented in user space. To validate our approach, we tested it. Network latency and disk latency are much higher than context switching. You can actually be faster by being in user space with techniques like eliminating the metadata server and making system calls very efficient. Sophisticated optimization algorithms are much simpler to implement in user space. If being in user space was the problem that people perceive it to be, VMware wouldn't exist.
Tony Asaro: You claim you have a great volume manager - explain why you believe your volume manager provides value to users.
Gluster Folks: Gluster Storage Platform creates and manages volumes across multiple machines in one global namespace. This gives clients seamless access to the entire storage pool in the cluster from one mount point. The distribution of data (for load balancing) is done automatically. Use cases that require high availability can set up mirrored configurations across nodes (synchronous replication) - so our volume management functionality is quite flexible. Unlike others we do not have a metadata server - neither a centralized nor a distributed one - Gluster clients use a deterministic elastic hashing algorithm to compute the location of files, thereby eliminating the need for the metadata server. In addition to removing this choke point, this enables linear scalability and better reliability. From an architecture standpoint, the power of Gluster volume management comes from the stackable design (unique to Gluster) which could be viewed as a programmable file system.
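To make the "no metadata server" idea concrete, here is a minimal sketch of deterministic hash-based placement. It is not Gluster's actual elastic hashing algorithm: the brick names, the use of MD5 and the equal-range split of the hash space are all assumptions chosen for illustration. The point is only that every client can compute a file's location from its path alone, without asking a central server.

```python
import hashlib


def locate(path, bricks):
    """Return the brick that holds `path`, computed from the path alone."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big")  # 32-bit hash value
    # Split the 32-bit hash space into equal contiguous ranges, one per brick
    # (ceiling division so the last range still covers the top of the space).
    range_size = (2**32 + len(bricks) - 1) // len(bricks)
    return bricks[value // range_size]


# Hypothetical brick names, for illustration only.
BRICKS = ["server1:/export", "server2:/export",
          "server3:/export", "server4:/export"]

if __name__ == "__main__":
    for name in ["/photos/a.jpg", "/photos/b.jpg", "/logs/app.log"]:
        print(name, "->", locate(name, BRICKS))
```

Any two clients with the same brick list arrive at the same answer, so no lookup traffic is needed. A naive split like this one would move many files whenever a brick is added, which is presumably the problem the "elastic" part of the real algorithm addresses.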
Tony Asaro: You work with ZFS - which also has a volume manager - how do you integrate with ZFS and do you have overlapping functionality?
Gluster Folks: ZFS is limited to a single system and doesn't have the capability to scale across multiple machines under a unified namespace. Gluster works in conjunction with ZFS, managing the task of unifying individual ZFS volumes across the cluster in the global namespace. ZFS manages the physical disks within each system as described above.
Tony Asaro: You claim that your performance is versatile and is good with large files, small files, streaming and transactional - how do you achieve this?
Gluster Folks: This is again a benefit of the modular, stackable design. Different optimization techniques are written in individual modules, and the stackable design allows them to be combined in clever ways to match a given workload. For example, if you have many files less than 64k the quick-read module fetches the file in one network operation. For streaming large files the read-ahead, I/O caching, and DMA capabilities provide optimization. If you have an application with a lot of stat calls, Gluster uses the stat pre-fetch module to recognize this and fetch the information in bulk to serve from a local stat table, freeing the file system for other work.
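The stackable idea can be sketched in a few lines. This is a toy model under stated assumptions, not Gluster code: the class names (`Backend`, `Module`, `StatPrefetch`, `ReadCache`) are hypothetical, and `ReadCache` is a plain in-memory cache for small files rather than a model of the quick-read module. Each module wraps the one below it, so optimizations can be combined per workload.

```python
class Backend:
    """Stands in for the remote storage server; counts round trips."""

    def __init__(self, files):
        self.files = files  # path -> bytes
        self.round_trips = 0

    def stat(self, path):
        self.round_trips += 1
        return {"size": len(self.files[path])}

    def stat_dir(self, prefix):
        self.round_trips += 1
        return {p: {"size": len(d)} for p, d in self.files.items()
                if p.startswith(prefix)}

    def read(self, path):
        self.round_trips += 1
        return self.files[path]


class Module:
    """Base module: passes every call down to the next module in the stack."""

    def __init__(self, child):
        self.child = child

    def stat(self, path):
        return self.child.stat(path)

    def stat_dir(self, prefix):
        return self.child.stat_dir(prefix)

    def read(self, path):
        return self.child.read(path)


class StatPrefetch(Module):
    """Fetches stats for a whole directory once, then answers locally."""

    def __init__(self, child):
        super().__init__(child)
        self.table = {}

    def stat(self, path):
        if path not in self.table:
            directory = path.rsplit("/", 1)[0] + "/"
            self.table.update(self.child.stat_dir(directory))
        return self.table[path]


class ReadCache(Module):
    """Keeps whole files under a size limit in memory after the first read."""

    def __init__(self, child, limit=64 * 1024):
        super().__init__(child)
        self.limit = limit
        self.cache = {}

    def read(self, path):
        if path in self.cache:
            return self.cache[path]
        data = self.child.read(path)
        if len(data) < self.limit:
            self.cache[path] = data
        return data


if __name__ == "__main__":
    backend = Backend({"/data/a": b"x" * 10, "/data/b": b"y" * 20})
    stack = ReadCache(StatPrefetch(backend))  # modules stacked per workload
    stack.stat("/data/a")
    stack.stat("/data/b")   # served from the local stat table
    stack.read("/data/a")
    stack.read("/data/a")   # served from the read cache
    print("round trips to backend:", backend.round_trips)
```

Because every module exposes the same interface, the stack can be reordered or trimmed without touching the modules themselves, which is the flexibility the answer above describes.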
Tony Asaro: What are your File OPS? Based on what configuration? Are you going to participate in SpecFS?
Gluster Folks: Here's an interesting test result. We tested a read workload using 128k blocks vs. the more common use of small block size such as 4k (it was a throughput test and we were not optimizing for IOPS). We achieved 131,000 IOPS across 8 storage nodes (16,375 per node). The configuration used 32 clients running IOzone. Each server had 3 RAID controllers and 18TB of storage (142TB total capacity). Interconnect was InfiniBand QDR (one card per server). We are planning to run additional tests to get some eye-opening IOPS numbers, we'll keep you posted.
We do plan to participate in SpecFS down the road. It hasn't been a focus since it is very NFS and CIFS dependent and most of our customers prefer the Gluster native protocol with other benchmarks or their own application tests.
Tony Asaro: You mentioned using your own agent above, which to me isn't nearly as big of a deal in user space as it is with host operating systems. But others have tried and failed with this model. How can you justify an agent?
Gluster Folks: As you note, the big difference is that Gluster is not kernel-dependent. Those that are kernel-dependent suffer from being difficult to install, require custom patches, and are tied to a specific version of the kernel. Those in user space required a proprietary API, so you had to change your application (e.g. Hadoop, Maxiscale).
Our FUSE model is portable (and POSIX compliant) and installs with one command.
Of course we also give the choice of NFS, CIFS, WebDAV, etc., but most customers choose the Gluster native protocol.
Tony Asaro: What is your throughput performance? Based on what configuration?
Gluster Folks: With the configuration above, we achieved 16 GB/sec for read and 12 GB/sec for write. The test was designed so that data was served off the disk rather than from cache.
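As a sanity check, the read throughput figure lines up with the IOPS result quoted earlier (131,000 IOPS with 128k blocks across 8 storage nodes). The short calculation below is only a sketch under stated assumptions: that "128k" means 128 KiB, that every I/O transfers one full block, and that "GB" is read as binary gigabytes.

```python
# Figures taken from the interview answers.
IOPS_TOTAL = 131_000
NODES = 8
BLOCK_BYTES = 128 * 1024  # assumption: "128k" means 128 KiB

iops_per_node = IOPS_TOTAL / NODES
read_bytes_per_sec = IOPS_TOTAL * BLOCK_BYTES
read_gib_per_sec = read_bytes_per_sec / 2**30   # binary gigabytes
read_gb_per_sec = read_bytes_per_sec / 10**9    # decimal gigabytes

if __name__ == "__main__":
    print(f"IOPS per node:   {iops_per_node:,.0f}")
    print(f"Read throughput: {read_gib_per_sec:.2f} GiB/s "
          f"({read_gb_per_sec:.2f} GB/s decimal)")
    print(f"Per node:        {read_gib_per_sec / NODES:.2f} GiB/s")
```

Under those assumptions the aggregate comes to just under 16 GiB/s, or roughly 2 GiB/s per storage node, which matches the quoted read number. The 12 GB/sec write figure cannot be cross-checked this way because no write IOPS number was given.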
Tony Asaro: How many downloads of your file system have there been?
Gluster Folks: Over 60,000 from the Gluster download site. We also have several mirror download sites, and GlusterFS is included in Debian, Ubuntu, Fedora, FreeBSD, and other distributions. That's cumulative since Jan 2007 (1.0 release). The current download rate is ~4500 per month with steady month-over-month increases.
7 comments
Comment from: Steve Duplessie [Visitor]
I would argue that the storage world will stay vibrant - as the Unix world has - The Unix world is dying every day for mainstream applications - really only Solaris remains and for how long? As people flock from Unix they go to either Microsoft (gasp) or to Red Hat (and to a lesser degree, Novell) but either way, they are leaving.
The same will eventually be true in storage. Heavy weight OS type functions embedded in a storage controller are the same thing as MPE in an HP PA-RISC system 20 years ago - bloated, hard to support, and have diminishing value to customers. Removing the voodoo and opening up these functions has a history of working, so I figure it's just a matter of time.
Only question in my mind is how long will it take?
Cheers
12/29/09 @ 16:52
Steve - I think that we actually are saying the same thing - it is an issue of how long the status quo remains dominant. Remember that people have been predicting the demise of Unix for 10 years now. But actually IBM sold over $6 billion worth of Unix servers, Sun sold over $4 billion and HP sold over $4 billion in 2009 - so I think it is far from dead. Unix will be around for a very long time. The same will be true for traditional storage - it will take years for people to completely make the shift. In that time - could a true open source storage system have a major impact on the market? I believe the answer is yes.
Regardless, I do think that GlusterFS may be the start of something that is exciting. But it is a long road with lots of cool milestones and challenges on the way. To take a page from your recent blog on why startups fail - http://tinyurl.com/ybnv6u4 - in addition to needing a solid product they need great marketing.
12/29/09 @ 18:35
Comment from: Eli Collins [Visitor]
12/30/09 @ 14:37
Comment from: Max Cohen [Visitor]
------------
A filesystem requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the namenode and downloads a snapshot of the primary Namenode's directory information, which is then saved to a directory. This Secondary Namenode is used together with the edit log of the Primary Namenode to create an up-to-date directory structure.
Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace has been developed to address this problem, at least for Linux and some other Unix systems.
---------------
Now as you read this, it is outrageous to have a distributed filesystem with a single point of failure, and even more ridiculous to replay all outstanding operations, which is said to take over half an hour. Now here is a funny question: has HDFS ever been installed with 1000 clients? Did they try replaying operations from that? I wouldn't be surprised at the very first approach. Also, you can't mount HDFS as a normal filesystem, which is even stranger. This is what I think the Gluster folks were trying to say: it is not even POSIX compliant. Yahoo! uses it just because they didn't have any other solution, so they built their applications around it with HTTP get and put requests. Stranger still, it is mentioned that you would need a userspace filesystem to access files from HDFS.
All in all, Hadoop is a far cry from even calling itself a filesystem. Lustre is far better than Hadoop in many cases, as it feels like a filesystem per se. But again, Lustre has the same problem of the single metadata concept. I am not sure why people cannot see that pointing fingers and writing code to handle metadata is just stupid, as the backend filesystems have done this job amazingly well over the years.
MogileFS came with some promise, but its performance sucks and it has several design considerations.
12/31/09 @ 15:11
Comment from: Anand Babu Periasamy [Visitor]
HDFS is a distributed object storage system with a centralized metadata server. It is specifically designed for the map-reduce framework and can only store large objects (64MB and above). For general purpose storage, users are not willing to make changes to their applications to use HDFS APIs.
HDFS objects are stored as structured files on top of regular disk filesystems. You still need the metadata to restore its objects. Data is stored in a format proprietary to HDFS.
As your storage volumes grow from 10s of TBs to 100s of TBs, it becomes painful to recover from a crash. Filesystem check downtime can take from days to weeks. That is why keeping the files and folders as is (similar to NFS) is crucial to scalability.
12/31/09 @ 16:30
01/04/10 @ 17:09
Comment from: Dan [Visitor]
It seems like a simple Virtual IP addition to the product would be nice.
10/31/10 @ 02:59
