Inktank Ceph webinar series Jan-Feb 2013

#1 Webinar: Getting Started with Ceph

Learn in this webinar:

  • The architectural requirements of the Ceph Cluster
  • The role of the core RADOS components
  • What happens if an OSD fails
  • How to spin up a cluster using a VM image
  • What is required to expand the cluster

Recorded Webinar

YouTube: Getting Started with Ceph

Audience's Questions

Will Ceph work with Xen? Curious since it works with kvm right now. Is it in planning?

Ceph hasn't been integrated with the Xen hypervisor in the same way as with QEMU/KVM, so the Ceph block device needs to be accessed via the Linux disk interface. The RBD volume is mapped into Linux as /dev/rbdX, and then Xen can connect the instance to that device.
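The mapping step described above can be sketched in Python as follows. This is a minimal illustration rather than an official tool: the pool and image names are hypothetical, the helper names are invented for this example, and the `phy:` disk line assumes a classic Xen guest configuration file. The kernel chooses the actual /dev/rbdX number, so it has to be checked on the host after mapping.

```python
import subprocess


def rbd_map_command(pool, image):
    """Build the rbd CLI call that maps an image to a /dev/rbdX device."""
    return ["rbd", "map", image, "--pool", pool]


def xen_disk_line(device, guest_dev="xvda"):
    """Format a Xen guest disk entry for an already-mapped block device."""
    return "disk = ['phy:%s,%s,w']" % (device, guest_dev)


def map_image(pool, image):
    """Run the mapping on a host with the rbd kernel module (needs root)."""
    subprocess.check_call(rbd_map_command(pool, image))


if __name__ == "__main__":
    # "vms" and "guest01-disk" are hypothetical names used for illustration.
    print(" ".join(rbd_map_command("vms", "guest01-disk")))
    # Assumes the kernel assigned /dev/rbd0; verify this on the host.
    print(xen_disk_line("/dev/rbd0"))
```

Only `map_image` touches the system; the other two helpers just build strings, so they can be reviewed before anything is run as root.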

Is there any interface for Ceph, like administration dashboard... or just by cmd?

Today access is via the command line interface and the various Ceph-related libraries that can be integrated directly into user applications. Inktank is working to make a number of other interface mechanisms available in the future.

Any use cases you can provide for using Ceph's rbd and CephFS in I/O intensive HPC environments. Maybe using IB or similar high speed networks?

We've been working with a couple of the National Labs to evaluate performance with IB setups. The focus there has primarily been on CephFS as a potential "future technology" for HPC scratch storage space. We definitely want to explore high-performance RBD solutions too. In fact, as far as I know, we are still planning on focusing primarily on RBD and RGW in the near term rather than CephFS.

Do you have some performance good practices if I want to use Ceph for database or i/o intensive tasks? (bandwidth, networking technologies)

Inktank's Mark Nelson has published a series of detailed blog posts covering performance testing on the Ceph blog. That would be a great starting point. Inktank professional services can help refine the solution to specifically meet the application needs.

Does Ceph support Infiniband RDMA as a transport?

Not at this time. Members of the Ceph community are working to refine and test Ceph using IP-over-IB, and there is some consideration about integrating native IB verb support directly into Ceph at some point in the future.

Is there a build/port for Solaris SPARC or x64?

Not at this time. Some Ceph client functionality is available as FUSE ports, and some work has been happening in the Ceph community to test Ceph FUSE clients on other platforms. Those may be able to meet your needs on Solaris.

#2 Webinar: Intro to Ceph with OpenStack

Learn in this webinar:

  • What you need to consider for selecting the best cloud storage system
  • Overview of the Ceph architecture and unique features and benefits
  • Best practices in deploying cloud storage with Ceph and OpenStack

Recorded Webinar

YouTube: Intro to Ceph with OpenStack

Audience’s Questions

For our needs we might also need iSCSI-access to our storage. What are the experiences with using Linux as an iSCSI-target backed by Ceph?

Several users are doing this today and finding it a good solution that meets their needs. Inktank is working to contribute code that simplifies the process further by more tightly integrating Ceph with iSCSI target software. The patches are currently being tested for integration upstream.

How small can OSDs be? Would it make more sense to have a larger number of small OSDs (for example, 24 OSDs of 100GB each per cluster node)? What are the best practices for configuring OSDs? Can you cover the differences between xfs, ext4 and btrfs for OSDs? Can an FC SAN LUN be used as an OSD?

There are several questions here. (1) OSDs can be arbitrarily small, but each should be large enough to justify the CPU and memory resources it consumes. It makes sense to have one OSD per drive, and with SSDs, having many 100GB OSDs may make sense for certain applications. (2) As a general-purpose guideline, we recommend about 1GHz of CPU and 2GB of memory per OSD on spinning disks. This covers recovery scenarios; an OSD should use much less during standard operations. The series of performance-related posts on the Ceph blog can help with sizing, and Inktank has professional services to help match configurations to application needs. (3) The performance blog posts explore some of the differences between the filesystem choices for the OSD. In general, xfs is a safe middle choice, btrfs is great for peak performance but may be less reliable and can show strange corner-case performance, and ext4 is the most ubiquitous and stable at the cost of lower performance. (4) It's certainly possible to use FC LUNs to build OSDs. As long as the LUNs are visible to the Linux disk subsystem, they can be used. It's unlikely, but you may need to tweak the device start order in the OS to make sure the LUNs are available before Ceph needs them. If the hardware is already there, this might be a good way to experiment and prototype a solution. In the long run, using FC storage is discouraged because it adds considerable unnecessary cost to the solution.

So I have 1 SSD PER storage node for journaling?

Not necessarily. It depends on a number of factors. In some cases that may be sufficient, and in others the SSD can become a bottleneck and rapidly wear out. Different applications will have a different ideal ratio of SSD journals to spinning disks, taking into account the rate of write IO and the bandwidth requirements for the node.
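As a rough way to reason about that ratio, the sketch below compares the sustained write bandwidth of one SSD with that of the spinning disks it would journal for, and estimates how quickly a given write load consumes the SSD's rated endurance. All figures in the example (400 MB/s, 100 MB/s, 1000 TB of endurance) are hypothetical placeholders, not measurements or recommendations, and the function names are invented for this illustration.

```python
def disks_per_journal_ssd(ssd_write_mbps, disk_write_mbps):
    """How many spinning disks one SSD can journal for before it bottlenecks.

    Assumes every write lands on the journal first, so the SSD has to
    absorb the combined write bandwidth of the disks behind it.
    A result of 0 means the SSD is slower than a single disk.
    """
    if ssd_write_mbps <= 0 or disk_write_mbps <= 0:
        raise ValueError("bandwidth figures must be positive")
    return int(ssd_write_mbps // disk_write_mbps)


def ssd_lifetime_days(endurance_tb, sustained_write_mbps):
    """Days until the rated write endurance is used up at a constant rate."""
    if sustained_write_mbps <= 0:
        raise ValueError("write rate must be positive")
    total_mb = endurance_tb * 1000000.0        # TB -> MB (decimal units)
    mb_per_day = sustained_write_mbps * 86400  # seconds in a day
    return total_mb / mb_per_day


if __name__ == "__main__":
    # Hypothetical hardware figures, for illustration only.
    print(disks_per_journal_ssd(400, 100))
    print(round(ssd_lifetime_days(1000, 100), 1))
```

Real workloads are bursty, so measured write rates for the application should replace the constant-rate assumption before drawing conclusions.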

Is there a limit to the number of physical disks possible in a single storage node? Or is it only the 500MB per disk recommendation? So is it practical to have something like only 6 storage nodes, but each one has plenty of ram - say 64GB, and that way each storage node can hold up to 128 disks?

The recommended sizing guideline for OSDs is 1GHz of CPU and 2GB of memory per OSD in order to handle recovery scenarios smoothly. For large clusters and general purpose storage, something like a 36 drive node with dual 8-core CPUs and 96GB memory might work well. Specialized deployments may be able to use denser platforms. Keep in mind that node size also impacts failure domains, recovery times, and performance in degraded states. If a node with many OSDs fails in a small cluster, the remaining nodes may be too busy handling recovery tasks to meet application SLAs.
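The guideline above is simple arithmetic, sketched here in Python. The per-OSD constants come from the answer; the 2.4GHz clock speed used for the 36-drive example is an assumption, since the answer only gives the core count, and the function names are invented for this illustration.

```python
GHZ_PER_OSD = 1.0     # guideline from the answer above
GB_RAM_PER_OSD = 2.0  # guideline from the answer above


def node_requirements(num_osds):
    """CPU and memory a node should have for the given number of OSDs."""
    return {
        "cpu_ghz": num_osds * GHZ_PER_OSD,
        "ram_gb": num_osds * GB_RAM_PER_OSD,
    }


def node_is_sufficient(num_osds, sockets, cores_per_socket, ghz_per_core, ram_gb):
    """Check a hardware configuration against the per-OSD guideline."""
    need = node_requirements(num_osds)
    have_ghz = sockets * cores_per_socket * ghz_per_core
    return have_ghz >= need["cpu_ghz"] and ram_gb >= need["ram_gb"]


if __name__ == "__main__":
    # 36-drive node, dual 8-core CPUs (2.4GHz is an assumed clock), 96GB RAM.
    print(node_requirements(36), node_is_sufficient(36, 2, 8, 2.4, 96))
    # The 128-disk, 64GB node from the question, with the same assumed CPUs.
    print(node_requirements(128), node_is_sufficient(128, 2, 8, 2.4, 64))
```

Under these assumptions the 36-drive node passes, while the 128-disk node with 64GB falls well short of the 256GB the guideline implies.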

#3 Webinar: DreamHost Case Study: DreamObjects with Ceph

This webinar discusses best practices and lessons learned in creating DreamObjects, including the need to manage scale, speed, monitoring, uptime, security and cost.

Recorded Webinar

YouTube: DreamHost Case Study: DreamObjects with Ceph

Audience’s Questions

Is there a way to limit how much people can upload to the rados gateway? A user limit?

There are no software limitations on the number of users that Ceph's radosgw can support; it boils down to your hardware and system configuration. That said, Ceph supports a scale-out strategy for radosgw nodes: if you need to support more users and/or traffic, you can simply add more.

What’s the least amount of nodes for a reasonable performance storage for hosting VMs?

The minimum number of nodes recommended for a production cluster is 3. This ensures that the cluster can maintain quorum across node failures and maintain data redundancy. It's difficult to comment on what's required for reasonable performance, because that depends on the VM demands, the power behind each cluster node, and what kinds of disks are used. Inktank provides professional services that can help tailor a solution to specific requirements, and optimize for goals such as cost or electrical limitations when recommending a solution.
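The quorum part of that recommendation can be illustrated with a few lines of Python. This is a sketch that assumes one monitor per node and a strict-majority quorum rule; the function names are made up for the example.

```python
def quorum_size(monitors):
    """Smallest strict majority of the given number of monitors."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """How many monitors can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    for count in (1, 2, 3, 5):
        print(count, quorum_size(count), tolerated_failures(count))
```

With one or two monitors no failure can be tolerated, which is why three is the smallest count that survives the loss of a node.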

How much storage pr. storage node? You mentioned 8-12 OSD pr. server - are those 2-3 TB disks?

Typically, yes. We often see deployments where the node hardware supports 12 spinning disks in the 2-4TB range. Sometimes these are enhanced with SSDs for journaling or a separate high performance pool of storage.

You guys run storage nodes with no SSD journals - doesn't that hurt performance?

For many applications, it's possible to get good performance without using SSDs for journals. The journal device is written sequentially and played out to the rest of the OSD at a later time. For data that isn't being aggressively updated, journals on the OSD itself can be sufficient. One simple optimization is to partition the disk so that a few GB of the outer cylinders are used for the journal, and the remaining surface for the OSD data.

What is the normal bandwidth that the Ceph cluster is providing to the servers?

That depends on the size of the cluster and configuration of the individual nodes. One example might be to consider a node with 24 OSDs and a 10GbE link to the application servers. It should be possible to saturate the 10GbE link for that node by reading data from the OSDs and feeding it to the servers. Scalability is expected to be linear, so adding more storage nodes and application servers should increase the aggregate bandwidth in the cluster. Check out the performance posts on the Ceph blog for more insight on sizing individual nodes.
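The 24-OSD example can be turned into a back-of-the-envelope calculation. The sketch below assumes 100 MB/s of sequential read throughput per disk, which is a hypothetical figure, and ignores protocol overhead; it only shows why the network link, not the disks, is the limit in that configuration. The function names are invented for this illustration.

```python
def node_read_bandwidth_mbps(num_osds, per_disk_mbps, link_gbps):
    """Deliverable read bandwidth of one node in MB/s.

    The node can deliver no more than the slower of its disks combined
    and its network link (converted from Gbit/s to MB/s).
    """
    disks_total = num_osds * per_disk_mbps
    link_mbps = link_gbps * 1000.0 / 8.0
    return min(disks_total, link_mbps)


def cluster_read_bandwidth_mbps(nodes, num_osds, per_disk_mbps, link_gbps):
    """Aggregate bandwidth under the linear-scaling expectation above."""
    return nodes * node_read_bandwidth_mbps(num_osds, per_disk_mbps, link_gbps)


if __name__ == "__main__":
    # 24 OSDs at an assumed 100 MB/s each behind one 10GbE link.
    print(node_read_bandwidth_mbps(24, 100, 10))
    # Four such nodes, assuming enough application servers to consume it.
    print(cluster_read_bandwidth_mbps(4, 24, 100, 10))
```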

How is the performance with Xen compared to KVM or others?

It should be similar, although we haven't performed any benchmarks to compare the two at this time. KVM and libvirt can talk to Ceph's block devices directly, while Xen currently requires first mapping the Ceph block device via the Linux kernel driver. Most of the IO stays in the kernel either way, but the KVM approach bypasses a few layers which may improve latency to some degree.

Have you compared XFS OSD to ext4? Any benchmark data for using XFS over ext4?

Take a look at the performance articles written by Mark Nelson on the blogs. There is a blog article comparing different OSD file system options.

#4 Webinar: Advanced Features of the Ceph Distributed Storage System Delivered by Sage Weil, Ceph Creator

Learn in this webinar:

  • Deploying Ceph
  • Enhance Deployment
  • Block Devices

Recorded Webinar

YouTube: Advanced Features of the Ceph Distributed Storage System Delivered by Sage Weil, Ceph Creator

Audience’s Questions

Is it possible to define storage domains that have more and less capability (e.g. SSD vs spinning disk) and define a migration scheme accordingly? Can migration be gated by API so that a user can select whether or not to migrate said data?

It's possible to create multiple pools of storage so that some pools are built on spinning disk and others on SSD. Then data can be stored in the appropriate pool. Currently there is no inherent mechanism or policy engine to migrate the data between pools. That would need to be implemented external to Ceph. The next release of Ceph, Cuttlefish, will include RESTful APIs for all functions which can be used to export data from one pool and import it into another as a way of implementing the migration.
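An external migration of this kind could look roughly like the sketch below. It does not use the RESTful APIs mentioned above; instead it uses the librados Python binding (`rados`) to copy objects from one pool to another. The pool names and the configuration path are assumptions, and the sketch copies object data only, not extended attributes or other per-object metadata.

```python
def migrate_pool(src_ioctx, dst_ioctx):
    """Copy every object from the source pool to the destination pool.

    Both arguments are open I/O contexts. Returns the number of objects
    copied. Each object is read whole, so this suits modest object sizes.
    """
    copied = 0
    for obj in src_ioctx.list_objects():
        size, _mtime = src_ioctx.stat(obj.key)
        data = src_ioctx.read(obj.key, size) if size else b""
        dst_ioctx.write_full(obj.key, data)
        copied += 1
    return copied


def main():
    """Run the copy against a live cluster (pool names are hypothetical)."""
    import rados  # librados Python binding

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        src = cluster.open_ioctx("spinning-pool")
        dst = cluster.open_ioctx("ssd-pool")
        print(migrate_pool(src, dst))
        src.close()
        dst.close()
    finally:
        cluster.shutdown()
```

Calling `main()` requires a reachable cluster and suitable credentials. A policy engine would decide which objects to pass through `migrate_pool` and when to delete the originals, and neither of those steps is shown here.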

What is about the performance of ceph when you compare it to a RAID/JBOD with the same number of disks in a raid 5? Is there a caching mechanism like in enterprise storage?

Check out the performance-related posts on the Ceph blog for detailed information. In general, we are able to reach the hardware limits of the storage and network components used on the individual Ceph nodes, and these are expected to scale linearly by adding more nodes. There is no specific data caching mechanism implemented in Ceph, but Ceph OSDs store objects within a Linux file system that has the usual file system caching mechanisms that help accelerate access.

Can you have high latency pools across data centers?

You can, but the currently implemented replication mechanisms are synchronous and don't take latency into account in placement decisions. That means all data accesses will slow down in the general case, with the latency on the WAN link determining the overall performance.
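A simplified model makes the effect visible. The sketch below assumes that a write goes to a primary OSD, which forwards it to the replicas and acknowledges the client only after the slowest replica has responded; disk service time is ignored, and the round-trip figures are hypothetical. The function name is invented for this illustration.

```python
def sync_write_latency_ms(client_to_primary_rtt_ms, primary_to_replica_rtts_ms):
    """Estimated network latency of one synchronously replicated write.

    The client waits for the primary, and the primary waits for its
    slowest replica, so a single distant replica dominates the result.
    """
    slowest = max(primary_to_replica_rtts_ms) if primary_to_replica_rtts_ms else 0.0
    return client_to_primary_rtt_ms + slowest


if __name__ == "__main__":
    # All replicas in one data center (assumed 0.5 ms round trips).
    print(sync_write_latency_ms(0.5, [0.5, 0.5]))
    # One replica behind an assumed 40 ms WAN link.
    print(sync_write_latency_ms(0.5, [0.5, 40.0]))
```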

Are there any known crashes and how were they fixed?

Take a look at for information on accessing the code base, reported bugs, and discussions of work in progress. That will give you an idea of how Ceph has been maturing and stabilizing over the last several years.

Can you give a rule of thumb of how many metadata servers one would need for a given number of OSDs?

Ceph provides a unified storage solution with object, block, and file access methods. In the general case, Ceph uses the CRUSH algorithm to compute data placement rather than managing metadata. For the file system component, Ceph uses metadata servers (MDS) to maintain and present a distributed parallel file system. There is no fixed rule for how many MDS to deploy for a given number of OSDs. CephFS clients access data directly from the OSDs and only need the MDS for metadata operations. The more metadata operations required, the higher the ratio of MDS-to-OSD. In addition, the MDS can be deployed on a wide range of hardware. If more powerful servers are used for the MDS, fewer of them will be needed. Inktank professional services can help recommend a good ratio based on the application.
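To illustrate what "computing placement rather than managing metadata" means, here is a toy example. It is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a much simpler stand-in, and it has no notion of failure domains or weights. What it shares with the idea described above is that any client holding the same list of OSDs derives the same locations from the object name alone, with no lookup table.

```python
import hashlib


def place(object_name, osd_ids, replicas=2):
    """Pick OSDs for an object by hashing its name with each OSD id.

    Every caller with the same inputs gets the same answer, so no
    central placement table is needed. (Toy stand-in, not CRUSH.)
    """
    def score(osd_id):
        token = ("%s:%s" % (object_name, osd_id)).encode("utf-8")
        return int(hashlib.sha256(token).hexdigest(), 16)

    ranked = sorted(osd_ids, key=score, reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = list(range(6))
    chosen = place("volume-42.chunk-0007", osds)
    print(chosen)
    # Removing an OSD that was not chosen leaves the placement unchanged.
    unused = [o for o in osds if o not in chosen][0]
    print(place("volume-42.chunk-0007", [o for o in osds if o != unused]))
```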

Any io/disk/performance penalties using mapped rbd for block devices instead of directly via KVM?

We haven't profiled any specific penalties, although we expect some slight degradation due to the additional layers. For large deployments or OEM appliance design, Inktank professional services can help characterize the performance difference and determine if the penalty is small enough to make the mapped device acceptable. In general, we would expect a very small increase in latency and CPU utilization.

Is it possible to get a real enterprise support including security patches and 24/7 support?

Yes. That’s a big part of what Inktank provides for enterprises that are interested in leveraging Ceph software.

Any big clusters in production beside Dreamhost?

Yes (details to follow soon).

Are there any plans to have geo-aware replication?

Yes. We're currently working to add geo-aware replication to the Object access method by implementing it in the RADOS Gateway (RGW). There are additional plans to provide geo-aware replication for the RADOS Block Device as well, although those are still in the early design stages. Eventually, geo-awareness and asynchronous replication are expected to be integrated directly into RADOS, becoming a native capability across all access methods.

Is it possible to change crush rules online? What will happen?

Yes, CRUSH rules can be changed dynamically at any time. The updated rules will be propagated throughout the cluster and clients, and put into use like any other changes.

What is a suggested OSD size in Ceph? Does it make more sense to build more smaller size OSDs or less larger OSDs to reach maximum performance? Is LVM for OSDs supported configuration?

The OSD term is a bit overloaded. The cleanest definition is the Object Storage Daemon, which is a process that manages some chunk of media and presents it as objects that are part of the cluster. Typically, a single node (server) holds many disks and many OSDs, with one OSD per disk. When deciding on the number of OSDs per node, we need to take multiple factors into account. One is the CPU and memory power available at an acceptable cost. This will determine how many OSDs can be managed in a single node without the CPU becoming a bottleneck. This partly depends on the application and access pattern. Another is the failure domain of a node. Even if the CPU is sufficient, if a node fails the data on all those disks will need to be rebuilt somewhere. Those recovery operations will consume resources in the cluster, and the ratio of failed to surviving OSDs will determine how long the recovery takes and how long the cluster operates with degraded performance. For example, if each node is very dense with 80 disks, that might be an acceptable failure domain in a cluster with hundreds of nodes but not in one with just 12 nodes. Typically, OSD nodes have 12-24 disks. This provides a good balance of density, performance, and costs given today's commodity hardware configurations and pricing.
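The failure-domain argument can be put into numbers. The sketch below assumes that the data of a failed node is rebuilt evenly across the surviving nodes and that each disk holds 2TB of data; both are simplifying assumptions for illustration, and the function name is invented for this example.

```python
def recovery_load(nodes, disks_per_node, used_tb_per_disk):
    """Estimate the rebuild work caused by losing one whole node.

    Returns the share of the cluster that failed, the amount of data to
    re-create, and the extra data each surviving node has to absorb.
    """
    if nodes < 2:
        raise ValueError("need at least two nodes to recover onto survivors")
    lost_tb = float(disks_per_node * used_tb_per_disk)
    return {
        "fraction_of_cluster_lost": 1.0 / nodes,
        "lost_tb": lost_tb,
        "tb_per_surviving_node": lost_tb / (nodes - 1),
    }


if __name__ == "__main__":
    # 80-disk nodes with an assumed 2TB used per disk.
    print(recovery_load(12, 80, 2.0))   # small cluster
    print(recovery_load(200, 80, 2.0))  # large cluster
```

The same 160TB of lost data means several TB of extra work per survivor in the 12-node cluster, but well under 1TB each when spread over 199 survivors.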

Is Ceph using erasure coding approach to save objects?

Not at this time. Ceph uses simple replication to place multiple copies of each object in locations that will survive different failure scenarios. We have discussed implementing erasure coding as an option in addition to replication, and that may become a Ceph feature at a later time.

Will there be Xen live migration for VMs specifically for OpenStack or CloudStack? I know it’s available for KVM but was just wondering for Xen.

We are currently working with Xen engineers to integrate Ceph with CloudStack and Xen to enable live migration and other capabilities. There is no firm date on when this work will be completed, but it's expected to be some time within 2013.

Is it possible to use the librados API while CephFS is in use?

Yes. All access methods, including librados, can operate in parallel within the same Ceph cluster. However, it's generally a bad idea to try to access the exact same objects through multiple methods. For example, if an object is part of a CephFS file then it should only be accessed via CephFS.
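One way to follow that advice is to keep application objects in their own pool and refuse to open the pools that back CephFS. The sketch below uses the librados Python binding; the pool names `data` and `metadata` for CephFS and `app-objects` for the application are assumptions that must be adapted to the actual cluster, and the helper names are invented for this illustration.

```python
CEPHFS_POOLS = ("data", "metadata")  # assumed names of the pools backing CephFS


def open_app_pool(cluster, pool_name, reserved=CEPHFS_POOLS):
    """Open an I/O context, refusing pools that belong to CephFS."""
    if pool_name in reserved:
        raise ValueError("pool %r is reserved for CephFS" % pool_name)
    return cluster.open_ioctx(pool_name)


def store_and_fetch(ioctx, key, payload):
    """Write one object through librados and read it back."""
    ioctx.write_full(key, payload)
    return ioctx.read(key, max(len(payload), 1))


def main():
    """Run against a live cluster (the pool name is hypothetical)."""
    import rados  # librados Python binding

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = open_app_pool(cluster, "app-objects")
        print(store_and_fetch(ioctx, "greeting", b"hello"))
        ioctx.close()
    finally:
        cluster.shutdown()
```

Calling `main()` requires a reachable cluster; the two helpers can be exercised with any object that offers the same methods.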

Article ID: SLN311833

Last Date Modified: 08/14/2018 06:32 AM
