Storage
Warning
Capacity usage above 85% should be treated as a warning state, and data unavailability may occur above 95% usage.
The primary goals for the storage layer in HyperCloud are resilience and scalability. Because it underpins a highly scalable compute layer, HyperCloud's storage needs to be architected to scale linearly as we scale our compute, and to make it very hard to lose data.
Additionally, the storage layer aims to be flexible enough to provide advanced storage features and interfaces to the layers above.
These include:
- Compute layer storage: Image and snapshot storage for the compute layer and marketplace.
- Object storage: A fully-featured S3-compliant object storage API consumable by apps.
- Guest layer block storage: Persistent storage for virtual machines and containers.
Fundamentals
Resilience and scalability are achieved with a few key approaches.
Calculated placement
All host libraries that communicate with the storage layer are cluster-aware, and are not bottlenecked by controller or journal/index daemons. Instead of the journal state found in traditional storage systems, a cluster and policy map is used. This map is consistently distributed to all clients by a set of highly available cluster monitors. The client host libraries then combine the map and policy with a hashing algorithm to deterministically calculate where data lives as they read and write.
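As a minimal sketch of the idea, the snippet below uses rendezvous (highest-random-weight) hashing as a stand-in for HyperCloud's actual placement algorithm, which is not specified here. The key property is that every client holding the same cluster map computes the same placement, so no central index sits in the data path.

```python
import hashlib

def placement(object_id: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Deterministically choose `replicas` distinct nodes for an object.

    Rendezvous hashing: every client computes the same per-node scores
    from the same cluster map, so no lookup service is needed.
    """
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{object_id}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(nodes, key=score, reverse=True)[:replicas]

# Any client with the same cluster map computes the same answer.
cluster_map = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(placement("volume-42/block-007", cluster_map))
```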
Stringent replication mechanisms
All data is triple-replicated by default. Maintaining three replicas at all times ensures data integrity when media fail outright, and each replica can also be hashed to verify consistency in case of partially failed or corrupted media.
At scale, triple replication ceases to be a cost-effective way to add storage to the cloud, but this can be resolved with erasure coding. With enough individual storage nodes, data can instead be split into a set of data and parity shards.
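The cost difference is easy to quantify by comparing the usable fraction of raw capacity under each scheme. The 4+2 layout below is purely illustrative, not necessarily HyperCloud's default:

```python
def usable_fraction(data_shards: int, parity_shards: int) -> float:
    """Fraction of raw capacity available for user data."""
    return data_shards / (data_shards + parity_shards)

# Triple replication is equivalent to 1 data shard plus 2 full copies.
print(f"3x replication:     {usable_fraction(1, 2):.0%} usable")
# A 4+2 erasure-coded layout tolerates the same two failures
# while returning twice the usable capacity.
print(f"4+2 erasure coding: {usable_fraction(4, 2):.0%} usable")
```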
Placement abstraction
All data placed in the storage layer is stored in an abstraction layer which ensures data is evenly spread across storage media and can be addressed with different rulesets. This also makes recovery from media failure very quick: replacement replicas are not written to a single other device, but re-created within the abstraction layer itself, and therefore spread across all media.
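A rough simulation of this effect, again using rendezvous-style hashing as a hypothetical placement function: when one device fails, the replacement replicas land across many surviving devices rather than on a single spare, so recovery bandwidth scales with the cluster.

```python
import hashlib
from collections import Counter

def pg_to_disks(pg: int, disks: list[str], replicas: int = 3) -> list[str]:
    """Map one placement group to `replicas` disks (toy placement rule)."""
    def score(disk: str) -> int:
        d = hashlib.sha256(f"pg-{pg}:{disk}".encode()).digest()
        return int.from_bytes(d[:8], "big")
    return sorted(disks, key=score, reverse=True)[:replicas]

disks = [f"disk-{i}" for i in range(12)]
failed = "disk-0"
survivors = [d for d in disks if d != failed]

# Re-map every placement group that lost a replica, and count where
# the replacement copies land.
targets = Counter()
for pg in range(256):
    before = pg_to_disks(pg, disks)
    if failed in before:
        after = pg_to_disks(pg, survivors)
        targets.update(set(after) - set(before))

print(targets)  # replacement replicas spread across many disks, not one
```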
Scrubbing and self-healing
The storage layer continually verifies its own consistency, checking that objects are located and sized as expected, and running frequent checksums against objects to ensure they are consistent.
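Conceptually, a light scrub validates an object's location and size, while a deep scrub recomputes its checksum. The sketch below assumes plain files and SHA-256 purely for illustration; the real scrubber operates on the storage layer's internal objects, not POSIX files.

```python
import hashlib
import os

def scrub(path: str, expected_size: int, expected_sha256: str) -> bool:
    """Return True if the object passes both light and deep scrub checks."""
    # Light scrub: is the object where we expect it, at the expected size?
    if not os.path.exists(path) or os.path.getsize(path) != expected_size:
        return False
    # Deep scrub: recompute the checksum and compare to the stored value.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```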
From a performance perspective, there are a few notable aspects about the storage layer in HyperCloud.
- Mix and match: With a diverse set of media available, including HDD, SSD, and NVMe, a number of different storage tiers can be mixed and matched within the same storage layer. These can be exposed to the compute layer as independent datastores.
- Journaling: At a cluster level, journaling is a bottleneck and a flawed architecture. However, it makes a lot of sense at a node level, so all HyperCloud HDD nodes have built-in journals on SSDs which accelerate the node-level performance of spinning rust.
- Caching: Within a single node, reads and writes are also cached using in-kernel libraries. This is particularly effective at accelerating workloads with small block sizes.
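The benefit of the caching tier can be estimated with simple expected-value arithmetic. The latencies below are hypothetical figures chosen for illustration, not measured HyperCloud numbers:

```python
def avg_latency_us(hit_rate: float, cache_us: float, backing_us: float) -> float:
    """Expected read latency with a fast cache in front of slower media."""
    return hit_rate * cache_us + (1 - hit_rate) * backing_us

# Hypothetical: a 100 µs flash cache in front of 5 ms HDD reads.
for hr in (0.5, 0.9, 0.99):
    print(f"hit rate {hr:.0%}: {avg_latency_us(hr, 100, 5000):.0f} µs")
```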
Block features
When provisioned, guests such as virtual machines and containers isolate and lock their own virtual block devices from the storage layer, and these virtual block devices must persist changes through events at both the compute and storage layers.
Resilience
The virtual block device will seamlessly support:
- Storage node and media failure: If related media or an entire storage node fails, the virtual block device will continue to operate without a hiccup, as it will serve the data from a replica on another storage node.
- Compute node failure: If a compute node fails, the guest will redeploy on another compute host, and the persistent storage will pick up where it left off.
- Live VM migration: If a guest is migrated from one compute node to another, the persistent guest volume is seamlessly transferred to the new compute node.
- Network or interconnect failure: All nodes are connected via redundant LACP links, but in addition, HyperCloud is resilient to full node or full interconnect failure.
Efficient clones
All guest volumes are clones of existing template images - every new guest volume is a COW (copy-on-write) image of its original template. This means that guest volumes only consume capacity for new data written to the storage - similar in effect to deduplication.
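A toy model of the idea: reads fall through to the shared template, and only written blocks consume new capacity. This is illustrative only, not HyperCloud's on-disk format.

```python
class COWVolume:
    """Toy copy-on-write volume: reads fall through to the base image,
    writes consume space only for the blocks actually modified."""

    def __init__(self, base: dict[int, bytes]):
        self.base = base    # shared, read-only template image
        self.delta: dict[int, bytes] = {}  # blocks written by this guest

    def write(self, block: int, data: bytes) -> None:
        self.delta[block] = data

    def read(self, block: int) -> bytes:
        return self.delta.get(block, self.base.get(block, b"\x00" * 4096))

template = {i: b"T" * 4096 for i in range(1024)}  # 4 MiB shared template
vol = COWVolume(template)
vol.write(3, b"G" * 4096)
# Only one block of new capacity is consumed, however large the template.
print(len(vol.delta), "modified block(s)")
```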
Snapshots
Thanks to COW clones, the storage layer enables guests to take snapshots at regular intervals while consuming relatively little additional storage (in some cases, close to none). This functions as a type of resilient backup system, and snapshots can also be cloned into newly provisioned guests. While the snapshots themselves are immutable, their clones become writable volumes in their own right - and are again COW clones.
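Snapshots extend the same layering: a snapshot freezes the current delta, and a clone starts a new, empty, writable layer on top of it. A minimal sketch, again purely illustrative:

```python
class Layer:
    """One layer in a copy-on-write chain. Snapshots are layers that take
    no further writes; the top layer of each volume stays writable."""

    def __init__(self, parent: "Layer | None" = None):
        self.blocks: dict[int, bytes] = {}
        self.parent = parent

    def write(self, block: int, data: bytes) -> None:
        self.blocks[block] = data

    def read(self, block: int) -> bytes | None:
        if block in self.blocks:
            return self.blocks[block]
        return self.parent.read(block) if self.parent else None

template = Layer()
template.write(0, b"base image")
guest = Layer(parent=template)
guest.write(1, b"guest data")
snapshot = guest                # freeze: the snapshot takes no more writes
clone = Layer(parent=snapshot)  # new writable volume; no data is copied
print(clone.read(1))            # b'guest data', served through the chain
```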
Object features
Object storage in HyperCloud is compliant with the S3 API, so it can be used just like any other S3 service. The API is served from a highly available service that is load-balanced across the control plane nodes.
Functionality includes:
- Bucket management: Creating, deleting, and listing buckets owned by a particular user within a particular tenancy.
- Object management: Putting, getting, and deleting objects in buckets.
- Advanced object operations: Multipart uploads and byte-range reads.
- Access control: Managing access with ACLs at both bucket and object level.
- Object and bucket tagging: Adding tags to buckets and objects for categorization.
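Because the API is S3-compliant, standard SDKs such as boto3 work unmodified. The sketch below exercises several of the features above; the endpoint URL and credentials are placeholders to be replaced with values from your tenancy.

```python
import boto3

# Placeholder endpoint and credentials for an S3-compliant service.
s3 = boto3.client(
    "s3",
    endpoint_url="https://hypercloud.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo")                            # bucket management
s3.put_object(Bucket="demo", Key="hello.txt", Body=b"hi")  # object management
s3.put_object_tagging(                                     # tagging
    Bucket="demo",
    Key="hello.txt",
    Tagging={"TagSet": [{"Key": "env", "Value": "test"}]},
)
part = s3.get_object(                                      # byte-range read
    Bucket="demo", Key="hello.txt", Range="bytes=0-0"
)
print(part["Body"].read())
```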
Instance backups
In addition to standard image snapshots, HyperCloud also supports external VM backups, over either HTTP or S3. Backups can be scheduled to run periodically.