Background
To properly understand why we've built what we've built and what we're trying to achieve, you first need to understand who we are, what our background is, and most importantly - where our collective scar tissue lies.
History
The team behind HyperCloud is diverse. Collectively, we bring to our work a vast amount of technical depth and centuries of experience, covering a great number of domains. Across the stack, we can speak to multiple major industry wins and successes - and some monumental industry failures.
Operations
Our team has architected, designed, deployed, and managed cloud infrastructure at scale for banks, government agencies, utility providers, large enterprises and research institutes.
Some of the highlights include:
- Operating private clouds with tens of thousands of cores and dozens of petabytes of storage for critical national infrastructure.
- Deploying and managing backup and recovery for global financial services, adhering to the data sovereignty and data retention legislation of 32 countries across many thousands of backup clients.
- Migrating legacy monolithic applications to containerized services with CI/CD (continuous integration and continuous delivery/deployment) pipelines. Building and dynamically maintaining custom container images and virtual machines for regulatory compliance.
- Delivering many OpenStack deployments, including integrating with and replacing proprietary solutions and repatriating IaaS (infrastructure as a service) from expensive public cloud providers.
Experience and scar tissue
Open source cloud orchestrators, proprietary cloud tooling (VMware, Nutanix), backup and recovery, disaster recovery, monitoring and observability, distributed systems, fault tolerance, OLAs and SLAs (operational-level agreements and service-level agreements), customer support, authorization and authentication, federated identity, tenant management.
Compute
We've also designed, deployed, and managed compute infrastructure at scale for banks, government agencies, large enterprises, and research institutes.
This has included:
- Going off the beaten path for architecture, building compute infrastructure with ARM, SPARC, POWER, MIPS, Itanium, and many others.
- Building bare-metal provisioning stacks, including greenfield open source projects and fully-fledged commercial products, with support for most major hardware vendors.
- Designing, commissioning and supporting Kubernetes and OpenShift deployments to manage hardware at many different scales and in multiple industry sectors, to provide PaaS (platform as a service) for internal and public consumption.
We were also the first to attack the ARM64 server compute problem head-on back in 2012, before a proper software ecosystem even existed for the architecture. Over the past decade we've worked with ARM, AMD, Linaro, Cavium, Calxeda, Marvell, Broadcom, Applied Micro, Ampere, Packet, Red Hat and Bloomberg to help make the architecture a reality in the data center.
Experience and scar tissue
Container runtimes (Docker, Podman, LXC, etc.), hypervisors and virtualization tooling (KVM, libvirt, QEMU), the Linux HA stack (Heartbeat, Corosync, Pacemaker, Keepalived), and others such as HAProxy, Guacamole, Prometheus, Grafana, ELK, and Splunk.
Storage
Over the last decade we've delivered our own turnkey appliances to over 100 organizations and enterprises, either as stand-alone storage solutions or as part of large-scale cloud infrastructure projects.
This has included:
- Storage to support large-scale, multi-region open source cloud deployments, powering tens of thousands of users.
- Multi-petabyte shared file systems, with SMB and NFS support layered on top.
- Emulating a traditional SAN with multipath iSCSI, for customers that still have a lot of catching up to do.
- Tiered storage with NVMe and spinning disk.
- Multi-region storage with asynchronous replication.
Experience and scar tissue
As part of delivering distributed storage solutions, we've had to deal with the ugliness of vendor inconsistency in the networking platforms found in the wild, and in many cases we've had to act as surrogate network engineers on behalf of our customers. No single networking platform is excluded from this. As a group we have hands-on experience with Ceph, ZFS, LVM, DRBD, AWS S3, MinIO, Veeam, VMware vSAN, Pure, EMC Isilon, Quobyte, StorPool, Nimble, GlusterFS, Lustre and BeeGFS.
Operating systems
Our team has been building firmware, boot loaders, compilers, drivers, and custom Linux platforms for a very long time.
Between us we have:
- Built many Linux distributions - both package-based and from scratch - in dozens of different ways and for many different environments. We know where the bodies are buried.
- Worked with the upstream kernel community on issues ranging from small bugs to maintaining entire trees of embedded code.
- Built our own BMC (baseboard management controller) from scratch, along with all the protocol and security work that comes with it.
- Maintained and built our own firmware, boot loaders, and drivers for both our own devices and third-party ones.
- Delivered our own repositories and package management systems as services to our customers.
In many cases we've done the above work under very stringent security requirements mandated by our customers, including financial services, government agencies, and security professionals.
Experience and scar tissue
Kernel, GCC, OpenSSL, Buildroot, OpenEmbedded, OpenBMC, IPMI, Redfish, UEFI, U-Boot, PXE boot, OS build and installation.
Hardware and electronics
Finally, we have a strong background in hardware design and electronics. Our team includes people who have been critical players at both the silicon and board level at organizations such as Intel, AMD, TI, National Semiconductor, Altera, Synopsys and many more.
Our team can claim:
- Many firsts around the ARM64 architecture, including the first ever ARM64 production server and SDS (software-defined storage) appliance.
- Development of several key development boards and reference designs for both ARM and AMD - including early work on AMD EPYC.
- Many automotive firsts, including in-car voice recognition, connected car navigation, and in-car voice browsing.
- Being part of the original team behind well-known industry designs such as the PandaBoard, MinnowBoard, and BeagleBoard.
- Designing and building numerous ASIC- and FPGA-based products.
Experience and scar tissue
Several chapters of our careers have involved building mission-critical hardware, including life support systems, where human lives depended on its reliability. We've worked across silicon design, SERDES, analog design, digital design, DFA, DSP, power engineering, and thermal engineering.
Lessons learned
Given we've seen where some of the bodies are buried - what have we learned along the way?
State is lethal
Distributed systems derive their resilience from their ability to shrink and grow, whereas state adds complexity and inconsistency, and makes systems impossible to manage at scale. A good chunk of service failures and delayed recoveries can be blamed on stateful systems and discrete points of failure. Building stateless systems means planning for failure from the outset, which is an operationally hygienic approach.
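To make "stateless" concrete, here is a minimal Python sketch under our own illustrative assumptions (the Session type, handle_request, and the stand-in store are inventions for this example, not part of HyperCloud). The service process holds no durable state of its own, so any replica can serve any request and a lost instance is simply replaced rather than recovered.

```python
# Minimal sketch of a stateless request handler. All durable state lives in an
# external store (here a dict standing in for a database or consensus-backed
# KV store), never in the memory of the service process itself.
from dataclasses import dataclass


@dataclass
class Session:
    user: str
    quota_used: int


# Stand-in for an external, replicated store. In a real deployment this would
# be shared by every replica; it is the only place state is allowed to live.
EXTERNAL_STORE: dict[str, Session] = {}


def handle_request(session_id: str, cores_requested: int) -> str:
    """Serve one request without relying on anything held in this process.

    Read what is needed from the store, act, and write the result back. If
    this replica dies, the next request is served by another replica with no
    recovery step.
    """
    session = EXTERNAL_STORE.get(session_id) or Session(user="anonymous", quota_used=0)
    session.quota_used += cores_requested
    EXTERNAL_STORE[session_id] = session
    return f"{session.user}: {session.quota_used} cores allocated in total"


if __name__ == "__main__":
    # These two calls could just as easily be handled by two different replicas.
    print(handle_request("abc123", 4))
    print(handle_request("abc123", 8))
```

The external store is itself stateful, of course; the point is that state is confined to the systems designed to hold it, while the fleet of service instances stays disposable.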
Hardware matters
The ugliest and most unexpected issues tend to come down to hardware faults where systems only partially fail - these are also the hardest to debug. Faulty error correction, memory corruption, CPU failure, and kernel panics can all lead to serious problems. We've learned the hard way that the right way to run infrastructure is to build fleets on consistent, task-specific hardware. Once that task-specific commitment is made, amazing things can be done to improve the operator experience.
Open source in production needs a strict ruleset
Open source software is the foundation of server infrastructure worldwide, and consistently produces higher-quality, more secure code. Used correctly, this code is extremely powerful, but most open source software doesn't come with usability guidelines, or in many cases even basic documentation. Running open source software to underpin production systems requires a strict approach to quality assurance.
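As a purely illustrative sketch of one rule such a ruleset might contain (the manifest, artifact names, and digests below are placeholders we invented, not our actual tooling), a build gate can refuse any upstream artifact that isn't pinned to a known version and checksum:

```python
# Sketch of a build gate: only open source artifacts that appear in a reviewed,
# version-controlled manifest, with matching SHA-256 digests, are admitted.
import hashlib
from pathlib import Path

# Placeholder manifest; real digests would be recorded when a release is vetted.
PINNED_ARTIFACTS = {
    "haproxy-2.8.5.tar.gz": "0" * 64,
    "prometheus-2.48.1.tar.gz": "1" * 64,
}


def verify_artifact(path: Path) -> bool:
    """Reject anything unpinned or whose digest doesn't match the manifest."""
    expected = PINNED_ARTIFACTS.get(path.name)
    if expected is None:
        print(f"REJECT {path.name}: not in the pinned manifest")
        return False
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected:
        print(f"REJECT {path.name}: digest mismatch")
        return False
    print(f"ACCEPT {path.name}")
    return True


if __name__ == "__main__":
    # Fail the build unless every downloaded artifact passes the gate.
    results = [verify_artifact(p) for p in Path("downloads").glob("*.tar.gz")]
    raise SystemExit(0 if results and all(results) else 1)
```

The hashing itself isn't the point; the point is that nothing reaches a production build unless it has been reviewed and pinned first.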
Ownership is critical
Cloud operators who do not have ownership of their own platforms are crippled by an inability to operate, support themselves, and make the changes that their tenants and customers request. Many private cloud acquisition models lead operators to surrender ownership of different facets of their stack, and mask that surrender as a boon. This is to be avoided at all costs.
Features must not compromise simplicity
Most cloud operators, even at massive scale, are just looking for a way to manage and provision resources and provide resilient services, backed by SLAs, to their stakeholders. At the end of the day, though, it's only software, and there will always be a way to boot from a NIC-mounted ARM chip, directly access memory addresses on remote machines, or use Raspberry Pis as compute hosts. These are all fun and viable things to do, but not at the expense of solution integrity and simplicity. In many cases they are over-engineered answers to commercial questions that were originally very different. Sensible operators need their cloud solutions to be an abstraction over the technology underneath.