
OpenShift for engineers

This article provides a brief overview of the technical implementation of the OpenShift cloud hosting platform. It assumes no knowledge of OpenShift or cloud computing in general. The article is written mostly for engineers who will develop applications for the OpenShift platform, or support applications running on that platform. There is a bit of a bias towards Java-based applications, because that's where I have most experience. However, the general principles should apply to other languages and technologies.

Cloud computing — an engineer's perspective

What is this cloud thing, anyway?

Cloud computing is all about providing managed services to subscribers. The idea itself is not new — businesses have been managing computing hardware and software for their customers for decades. What is new is that recent enormous increases in Internet bandwidth mean that a group of services can be hosted in some central location, perhaps distant not only from the end users of those services, but also from the businesses that will provision them with applications and data.

Cloud services do not usually provide complete, packaged applications. There are notable exceptions, of course — Google Docs is a service that provides a set of document management applications, with related document storage. Most cloud services, however, provide some sort of hosting environment, on which a subscriber can deploy custom applications. It has become common to divide cloud service offerings into two basic types: infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS). Broadly speaking, OpenShift is a PaaS offering.

Infrastructure-as-a-Service (IaaS)

Usually, IaaS offerings provide the subscriber with some well-defined share of an operating system, hardware, and network infrastructure. To the subscriber, the service might be nothing more complex than an operating system user account, with some way to log into it (SSH, perhaps) and some way to upload software to it (sFTP, perhaps). As a subscriber you'll be able to run an application that listens for clients on some TCP port, and the service provider's infrastructure will connect that port to some hostname with Internet presence. As a subscriber to a service of this type you might be able to see some sign of other subscribers' activity. You might be aware that your home directory is /home/user1234, for example, and draw the reasonable conclusion that there are at least 1233 other subscribers. You might see that there are operating system processes that have no connection with your own activities. In a well-designed IaaS offering you won't be able to disrupt these processes, or mess about with other users' files, but you'll know they're there.

A more sophisticated IaaS offering might take the form of a genuine virtual machine. As a subscriber you might have what appears to be complete, exclusive access to an operating system and hardware. You might get the impression that there is only one network interface on the system, and no processes except those you create. You might get administrator ('root') access to the virtual machine. True virtual machines make for better IaaS services than simple schemes like operating system user accounts, because subscribers are completely isolated from the activities of other subscribers, and can develop applications with little regard for the implementation of the service itself. However, these services have very high overheads, and do not necessarily make good platforms on top of which to offer Platform-as-a-Service schemes.

A compromise between the simplest, account-based services, and full virtual machines, is some sort of lightweight container, such as LXC or Docker. OpenShift currently uses plain Linux user accounts and SELinux security policies to isolate one application from another, but it is likely that it will transition to Docker containers eventually.

Platform-as-a-Service (PaaS)

A PaaS offering provides more than just an operating system and a network interface — it provides a managed runtime environment that can host applications of a certain type. Where development is to be in Java, the runtime environment may consist of some sort of Java-based Web server (Tomcat, for example) or application server (JBoss EAP), or a container for OSGi bundles (Fuse, Karaf). Naturally, for Java development the service will have to provide a Java Virtual Machine and probably other Java development tools as well. PaaS offerings are not limited to Java, of course — other programming languages have their own particular ways of bundling and deploying executables.

The user of a PaaS service will have a defined way to supply executables to the service. For Java, that might simply be a Web interface by which to upload a JAR or WAR file. In OpenShift, most subscribers will upload either compiled code or source code to a git repository hosted on OpenShift. The OpenShift infrastructure will read the git repository and provision the service with the executable code.

In general, users of a PaaS service get some infrastructure too — underneath your Java Virtual Machine, or Perl interpreter, or whatever, there will have to be an operating system, with a filesystem and all the usual bits and pieces. However, subscribers are generally shielded from the operating system in services of this sort, and access to the underlying operating system is not generally encouraged by service operators. As far as practicable, OpenShift subscribers are expected to interact with the service through an application's git repository, and through specific OpenShift management tools.

Public and private cloud services

An interesting feature of cloud technology is that the same infrastructure can be used to provide a public service to general subscribers, or an internal service within a particular business. Red Hat, for example, provides the public OpenShift Online service, to which anyone can subscribe. OpenShift Online is based on the OpenShift Origin project, which is an open-source PaaS implementation. But OpenShift Origin can also be used internally by organizations, perhaps to simplify and centralize their IT infrastructure. Red Hat provides a supported commercial offering, OpenShift Enterprise, based on Origin, that businesses can use to implement their own clouds.

The OpenShift platform

OpenShift is a PaaS provider — it provides a set of runtime environments onto which subscribers can deploy code developed using particular technologies and programming languages.

Brokers and nodes

An OpenShift platform installation consists of brokers and nodes. A broker provides the administrative interface to the service, by allowing subscribers to create, modify, and delete application containers called gears. The concept of a gear is a central one, and I will describe it in much more detail shortly.

OpenShift provides both a Web-based interface and a command-line tool (rhc) by which subscribers interact with the broker.

A node is any (real or virtual) machine on which subscribers' applications are deployed. In general, a subscriber's interactions will be mostly with the specific node or nodes which host the application's gears; access to the broker is generally only needed to set up new applications, or remove them.

A set of gears managed by a particular subscriber is known as a domain. There is a loose correspondence with a DNS domain, as by default each gear will have a DNS name based on the subscriber's username. The specific application runtime environment is provided by a cartridge — another important concept which is described later. This basic architecture is shown below.
OpenShift architecture

Gears

A gear is the basic unit of hosting in OpenShift. At present, a gear is essentially a Linux user account, with a set of SELinux security policies. It's not a true virtual machine, or even a lightweight container like LXC. This simplistic architecture is justified by the fact that subscribers do not usually interact directly with gears — their technical implementation should be invisible. To the subscriber a gear is a unit of resource — a certain amount of disk space, a certain amount of RAM, etc.

Although a gear is a user account on a particular OpenShift node, it is not a subscriber account. That is, it is not the case that a particular subscriber has a particular user account on a particular node. Instead, each application gets its own user account. A particular subscriber may own a large number of applications, each with its own gear, and therefore its own user account. The gear infrastructure therefore isolates not only one subscriber from another, but one application from other applications belonging to the same subscriber. Readers who are familiar with Android application development may be interested to know that this is almost exactly the same model that Android uses to isolate apps from one another.

When a gear is assigned to a particular node by the OpenShift broker, it will be allocated a machine-generated user ID and a network interface. Because all gear accounts are unprivileged, applications can bind only to IP ports numbered above 1024. However, the OpenShift infrastructure takes care of mapping user-friendly port numbers and Internet hostnames to the gear's ports. By convention, for example, Web browsers will use the HTTP port 80 or HTTPS port 443, and these will be proxied to (by default) port 8080 on the gear's network interface.
OpenShift proxy
Subscribers can log into gears using SSH, and will see a pretty conventional Linux filesystem. Each gear has a directory under /home, and the usual directories /usr, /proc, etc., are present. The actual layout of files in the gear directory will depend on the cartridge that was chosen to populate the gear (see below).

Note:
OpenShift gears use SSH public-key authentication, not username/password credentials. Since each gear has its own machine-generated Linux user identity, username/password logins would be unhelpful. Part of the preparation for using OpenShift is to create an SSH keypair and upload the public key to the subscriber's profile.
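The rhc setup command described later will generate and upload a key automatically, but a keypair can equally well be created by hand with the standard OpenSSH tools; the file name here is just the usual default:

$ ssh-keygen -t rsa -f ~/.ssh/id_rsa
# the public half, ~/.ssh/id_rsa.pub, is the part that is uploaded to the profile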

OpenShift uses the Cgroups ('Control Groups') system to allocate resources to gears. When a gear is created it will be allocated a particular amount of resource. For simplicity, the subscriber is typically offered the choice of 'small', 'medium', or 'large', each of which corresponds to a particular allocation of disk, CPU, RAM, etc. The exact specifications for each gear size will depend on the installation; the free public OpenShift service offers three 'small' gears, each with 1GB of storage and 512MB of RAM.

Although a gear is a unit of resource, subscribers generally cannot pool resources from multiple gears into the same application. It isn't possible, for example, to use the three small gears of the free service to create one application with 3GB of storage and 1.5GB of RAM. However, it's possible to replicate the application on multiple gears using OpenShift's built-in scaling mechanism, and share client load between the gears.
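For example, scaling can be requested when the application is created using the rhc command-line tool described later; the application and cartridge names below are only placeholders, and the cartridges actually available will depend on the installation:

$ rhc app create myapp tomcat-7 -s
# -s (--scaling) makes the application scalable: OpenShift adds a load balancer
# and can replicate the application cartridge onto further gears as load demands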

Applications

Within a particular subscriber's account, an application is the unit of administration. Using the OpenShift console or command-line tool, subscribers create applications; the infrastructure creates and populates the associated gears, according to the type of application selected.

An application does not correspond to an operating system process although, very often, an application will indeed consist of one process. A Tomcat-based application may, for example, have one instance of the Java runtime, executing the Tomcat container. There are no restrictions on the number of processes that an application can create, other than the obvious one of memory, although there are only a limited number of TCP ports available to connect the gear with the outside world.

Since the subscriber has SSH and sFTP access to the gear, the application can, in principle, be absolutely anything, so long as it will run on the Linux platform within the resources allocated. However, OpenShift is a PaaS offering, and developers are expected to base their applications on cartridges.

Cartridges

Cartridges are what make OpenShift a platform cloud, rather than an infrastructure cloud. When an application is created, using the console or the command-line tool, it will always be based on some sort of cartridge. In brief, the cartridge is a specification for OpenShift to populate the gear. If a subscriber creates an application based on the Tomcat cartridge, for example, OpenShift will create a gear on a particular node, and the cartridge will install Tomcat in that gear.

As well as populating the gear, the cartridge serves as a mediator between the OpenShift infrastructure and the developer. The diagram below illustrates this with respect to a Tomcat cartridge, but the same basic principles apply to most OpenShift cartridges.
OpenShift cartridge
A key feature of most cartridges is their support for build-on-deploy provisioning. All cartridges are provided with a git repository, and many cartridges will accept source code directly into this repository. In the Tomcat example, the source code is expected to be based on a Maven project (Maven is an automated build tool for Java applications). If a Maven project is pushed to the repository, code in the cartridge will invoke Maven to compile and package it, and then install the compiled code in Tomcat. There are also ways to push precompiled code and have that deployed, for situations where the use of Maven would be inappropriate.

From a Tomcat developer's perspective, therefore, OpenShift is pretty transparent. Java developers are generally used to working with git repositories and Maven, so deploying the application on OpenShift consists of little more than pushing the application's source code to the OpenShift git repository using an SSH URL.

If git is the primary interface between the cartridge and the developer, "hook scripts" are the primary interface to the OpenShift infrastructure. In practice, all cartridges will provide at least two scripts: setup and control. In fact, only the latter is mandatory, and there are a whole bunch of others that might be invoked at certain times in the life of the cartridge if they are provided. The function of the setup script should be fairly obvious: it is invoked by OpenShift as soon as the gear has been created, and is expected to install whatever software is appropriate to the cartridge (Apache Tomcat, in this case). The control script starts and stops the application; in practice it might simply invoke appropriate scripts or binaries in the software that was installed by the setup script.
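As a very rough sketch (not taken from any real cartridge, and using an invented MY_CART_DIR variable in place of the directory variables that OpenShift actually defines), a control script is little more than a shell script that dispatches on its first argument:

#!/bin/bash
# bin/control -- minimal sketch of a cartridge control script
case "$1" in
  start)
    # start whatever the setup script installed (Tomcat, in this example)
    "$MY_CART_DIR"/bin/startup.sh
    ;;
  stop)
    "$MY_CART_DIR"/bin/shutdown.sh
    ;;
  restart)
    "$0" stop
    "$0" start
    ;;
  status)
    # a real cartridge would report whether the server process is running
    echo "unknown"
    ;;
esac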

In implementation, a cartridge is simply a text file, whose name conventionally ends in .yml. The OpenShift broker will present a list of known cartridges whose YML files have been registered, but a subscriber can also provide the URL of a YML file outside the system. OpenShift will retrieve it and, provided it is formatted correctly, process it as a cartridge.

The YML file will have at least one crucial entry: a source URL. This is the URL of a bundle of software that will be retrieved and unpacked in the newly created gear. This software must contain at least the control script, as described above. Sometimes the bundle identified by the source URL will not necessarily be the complete package of software needed to build the gear. In such a case, the setup script will retrieve additional software as appropriate to that cartridge type. Note that there are no restrictions on outbound Internet connections from within gears although, of course, there are on inbound connections.

The YML file will also specify TCP port mappings needed by the cartridge, and a bunch of environment variables to which the hook scripts may refer, and which can be manipulated by the administration tools.
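To give a flavour of what such a file contains, a minimal manifest might look something like the sketch below. The field names and values are only illustrative; the authoritative format is described in the OpenShift cartridge documentation.

$ cat manifest.yml
Name: mycart                                    # illustrative cartridge name
Cartridge-Version: '1.0'
Source-Url: https://example.com/mycart.tar.gz   # software bundle to fetch and unpack into the gear
Endpoints:
  - Private-Port: 8080                          # port the application listens on inside the gear
    Public-Port-Name: PROXY_PORT                # name under which the front-end proxy port is exposed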

An application (and thus a gear) can be populated with more than one cartridge. In most cases, where multiple cartridges are applied, there will be a primary cartridge and one or more subsidiary cartridges. The subsidiary cartridges add features which are typically used by many different application technologies, such as relational databases.

Working with OpenShift as a developer

This section discusses some of the ways in which developers and deployers work with OpenShift that might differ from non-cloud development.

Administration and deployment of applications

OpenShift subscribers can interact with the service in a number of ways: via the broker's web interface, using the rhc command-line tool, uploading code or data using git, or logging into the gear directly over SSH.

The web interface

Most routine administration operations can be carried out using the web user interface to the OpenShift broker: create or delete an application, restart an application, check its status, and add features. The web user interface is clear and easy to use for these simple operations. The screenshot below shows that I have created an application called tomcat, that it is running, and that it was populated by two cartridges — one for Tomcat 6, and one for the MySQL database.
OpenShift web UI

The command-line tool

The command-line tool rhc is written in Ruby and is available for many different platforms (see the OpenShift documentation for details).

On most Red Hat Linux systems, you should just be able to do

$ sudo yum install rubygem-rhc
It's worth bearing in mind that the version of rhc in the standard repositories for Linux versions earlier than Fedora 19 is likely to be quite out-of-date, and you might be better off installing it manually as described in the documentation referenced earlier.
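The manual installation amounts to installing the rhc gem with Ruby's own package manager:

$ sudo gem install rhc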

Having obtained the tool, the usual first step is to run

$ rhc setup 
This will create and upload the SSH key needed for authentication, and record the OpenShift broker location for future use. rhc defaults to using the public OpenShift service, but this default can be overridden when running rhc setup.
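For a private OpenShift Origin or OpenShift Enterprise installation, the broker hostname can be given explicitly; the hostname below is just a placeholder:

$ rhc setup --server broker.example.com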

Thereafter, rhc can be used to create and delete applications, retrieve log files, and perform all the same operations as the web console. However, rhc potentially has the advantage that it can pass arguments to the cartridge at creation time, and thus configure it more appropriately to the developer's needs.
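For example, a typical create-and-delete cycle from the command line looks something like this; the cartridge name is only illustrative, and will depend on what the installation offers:

$ rhc cartridge list             # see which cartridges the broker knows about
$ rhc app create myapp tomcat-7  # create an application based on one of them
$ rhc app delete myapp           # remove the application and its gear(s)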

One use of rhc which deserves a special mention (because there is no equivalent in the web console) is its support for port tunnelling. Debugging on OpenShift can be tricky because there is very limited access to the gear that is hosting the application. You can't attach a Java debugger to it directly, for example. Running rhc port-forward will create SSH tunnels for all ports on which the cartridge's application is listening. On the workstation, rhc will open local ports corresponding to all the cartridge ports. It is therefore possible to tunnel debug traffic and related communications into the gear.
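A minimal example, assuming an application called myapp:

$ rhc port-forward -a myapp
# rhc reports the services it finds listening on the gear and opens corresponding
# ports on the workstation; a debugger, database client, and so on can then
# connect to those local ports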

Provisioning using git

As discussed earlier, OpenShift favours git for provisioning the application. The content pushed to the gear's git repository will vary according to the cartridge. In the Tomcat case discussed above, the developer was expected to push a Java Maven project. For other cartridges the developer may push Perl scripts, or HTML documents, or whatever else is meaningful to the cartridge. It is the cartridge's responsibility to interpret the git repository in an appropriate way.
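A typical cycle, assuming an application called myapp, looks something like this:

$ rhc git-clone myapp            # clone the gear's git repository to the workstation
$ cd myapp
# ...edit the source, or add whatever content the cartridge expects...
$ git add -A
$ git commit -m "Update application"
$ git push                       # the push triggers the cartridge's build and deploy steps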

SSH to the gear

The lowest level of access is to start an SSH session on the gear itself, and manipulate it using Linux utilities.
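The rhc tool will work out the gear's SSH URL; once logged in, the OpenShift-specific parts of the environment are easy to inspect (myapp is just a placeholder application name):

$ rhc ssh myapp
$ env | grep OPENSHIFT           # the gear's directories, addresses, and so on
$ ls                             # the gear's home directory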

OpenShift gotchas

It is hoped that an application that runs on a desktop system will run on OpenShift with the appropriate cartridge. Developers should not, in principle, need to know much about OpenShift, as the cartridge abstracts away the low-level details.

However, there are a few things to watch out for.

Environmental limitations

OpenShift applications run in an environment with strict resource controls. Your application won't be able to starve another of resources, even temporarily. Proper attention to scaling and sizing is therefore even more important in OpenShift than in conventional deployment.

Applications are expected to read and write files only within their specific gear (user account); ideally, they should read and write only files that can be provisioned using git. It's possible to use SSH/sFTP to provide arbitrary files to the gear, but this makes it hard to move an application to a different gear, should that become necessary.
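Where an application genuinely needs to create files at runtime, the usual advice is to write them under the directory that OpenShift sets aside for the purpose, exposed to the gear as an environment variable (OPENSHIFT_DATA_DIR at the time of writing; check the documentation for your version):

$ echo "some runtime state" > $OPENSHIFT_DATA_DIR/state.txt
# unlike files under the deployed code, this directory is preserved across redeployments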

Versioning

The OpenShift environment will run a particular version of Linux, and provide particular versions of common utilities like the Java JVM. Although various Java versions are provided, it's tricky to configure a Java-based cartridge to use a JVM other than its default. Moreover, all the JVMs provided are variants of OpenJDK, not the Sun/Oracle product; a Java application that relies on particular features of the Sun/Oracle JDK will not work. There are similar limitations with utilities like Perl and PHP. Ideally, applications should be developed to be reasonably independent of specific utility versions.

Idling

On the public OpenShift service, applications owned by free subscriptions are subject to idling — they will be shut down after a certain period of inactivity. Inactivity is generally taken to mean no HTTP(S) or SSH connections. Even where applications are hosted on a paid subscription, developers need to be aware that they don't necessarily have control of the cartridge's lifecycle, and applications should be robust about unplanned shutdowns (of course, that's not just true in the cloud environment, but problems of this kind are particularly troublesome here).

Not a virtual machine

An OpenShift gear is not a virtual machine — an application does not have exclusive access to any system resource. SELinux security policies (which are quite strict) act to restrict access to resources that cannot safely be shared. To take just one example, your application won't be able to bind a socket to localhost, because this interface cannot safely be shared in a multi-tenancy environment like OpenShift. Your application can bind only to the IP number which its gear has been assigned. An application can find its IP number using an environment variable if necessary, but it's best if applications don't need to do this, for obvious reasons.
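From a shell on the gear, the assigned address can be seen in the environment; the exact variable name depends on the cartridge, so the grep below is deliberately loose:

$ env | grep OPENSHIFT | grep _IP    # the address(es) assigned to this gear's cartridges
# listeners should bind to one of these addresses, not to 127.0.0.1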

There are many subtle differences of this kind between working on OpenShift and working in a non-cloud environment. While the developer who works with OpenShift directly will find out about them soon enough, and find ways to avoid their effects, the developers of third-party libraries may not have that experience.

Summary

OpenShift is a platform for deploying application code using particular technologies. So far as possible, low-level details of the platform and operating system are concealed from the developer, and development for OpenShift should be broadly the same as for any other environment. However, there are a number of limitations, and a knowledge of the technical implementation is often very useful when it comes to troubleshooting.