Part 7: Cloud Storage
Welcome back! If this is the first article in this series on the Google Cloud
Platform that you’re reading, you may want to check out the introduction to the series first,
but if you’re more of a "watch the Super Bowl at halftime because nothing
exciting happens in the first two quarters" kind of person, feel free to jump
right in. You can catch the highlights of the safety-on-the-first-play, the
pick-6 and the rest on the sports channels later.
As we mentioned last time, Google offers several different
ways of dealing with data storage: Google Cloud SQL is for those
applications that want to store data in the time-honored fashion of the
relational database and relational model. There’s also Google Cloud Datastore,
a non-relational "NoSQL" data storage approach, which we covered last time. And
lastly, there’s Google Cloud Storage, which is geared more toward "large binaries," like
images and videos, and which we’ll get into here, in just a moment.
Before we do so, however, a very serious point bears
repeating: Much as the various technical pundits and evangelists might want to
disagree, none of these is "superior" to the others. Those individuals who
prefer to slavishly follow whatever "best practice" industry pundits are going
on about will hate to hear me say this, but the fact is, each one solves a
different kind of problem, and sometimes the best approach is to use all of
them simultaneously, a technique sometimes called "polyglot persistence" or
"poly-store persistence". Or, as a famous writer once put it, "From each
database, according to its abilities, to each project, according to its needs."
Overview
First off, let’s make the point clear that Google Cloud Storage
is not really something that programmers will use to store the traditional
"object" or "customer data" that most other storage options (including Google
Cloud SQL and Google Cloud Datastore) will be storing. Google Cloud Storage is
more about storing larger, atomic (meaning we, as programmers, don’t look
inside of them, but treat them as entirely opaque blobs of storage) binary
data, like videos, sounds, large images, and what-not. It’s not a programmatic
API for being able to read parts of the file or for picking apart these files
into constituent parts. It’s more of a cloud-scale storage area network, with
APIs to upload, download, enumerate and remove. This is further reinforced by
the fact that for the duration of a given object’s lifetime on Google Cloud
Storage, it is entirely immutable: it cannot be appended to, trimmed, or
modified in any other way.
That doesn’t make it less useful than the other two forms of
storage, however. If, for example, the application under construction needs to
store anything larger than a few K in size, particularly if that facility needs
to be exposed to the end users of said application—such as allowing users to
attach arbitrary files as part of a bug report, or as part of a blog comment,
or as part of a forum post, or any of a dozen different possibilities—then
pretty clearly this is going to be a vastly superior solution, particularly
since it’s an "out of the box" solution, as opposed to having to write something
custom. (While the other two data storage options are certainly capable of
holding large data files, Google Cloud
SQL has the usual drawbacks every relational database has when trying to
store large BLOBs, and Google Cloud Datastore
is going to run into similar problems for similar reasons.)
And yes, if this facility sounds similar to Amazon’s S3
system, that’s because it is; not surprisingly, Google offers a helpful guide
for migrating data out of S3 and into Google Cloud Storage at https://developers.google.com/storage/docs/migrating.
I’ll leave reasons why one might look to migrate to the imagination of the
reader, but there’s also a reasonable discussion around using both systems as a
form of redundancy and backup—the chances of both major cloud platforms failing
are roughly equivalent to winning the lottery. In three states. On the same
day. While being struck by lightning. You get the idea.
Secondly, in marked contrast to the other Google Cloud
Platform discussions in this series, the Google Cloud Storage API is entirely
language-agnostic: it is a "RESTful" API, using HTTP and either XML or
JSON (the latter still experimental as of this writing) as the means by which a
developer interacts with Google Cloud Storage. This has both benefits and
drawbacks. It’s easier for Google to maintain, document, and improve the developer
API, because they have only one entry point to worry about. On top of that, an
HTTP API means that Web tooling natively knows how to "speak" Google Cloud
Storage; we’ll get into this in a moment. But it’s sometimes harder on the
individual developer, because now there isn’t a language-friendly API to lean
on, and (in the case of Java, at least) as a result we lose any compile-time
type safety to save us from stupid mistakes.
(For those whose HTTP skills are less than stellar, Google
does have some experimental client libraries that wrap the HTTP API calls, but
frankly, the benefit gained seems negligible, and developers will eventually
want to see the "raw" HTTP calls anyway, for debugging purposes if nothing
else. So this article is going to proceed under the assumption that the HTTP
APIs are the preferred option.)
APIs, URLs, HTTP, oh my!
Fundamentally, Google Cloud Storage
recognizes a few core concepts. First, all data within a given Google Cloud
Storage system is scoped to a "project", which corresponds to the projects that
we’ve been examining all along as part of this series. As far as developers are
concerned, the project is essentially the administrative unit for Google Cloud
Storage, so in order to start working with Google Cloud Storage, a developer
needs to go into the console (the same one we’ve been using for everything
else) and "turn on" Google Cloud Storage. This also implies that this is the
project that will be billed for the storage consumed.
Secondly, all data stored within Google Cloud Storage will
be stored within "buckets", which, as the name implies, are named containers
that will contain named "objects" that contain the actual data. In other words,
"buckets" are directories, "objects" are files. Pretty straightforward, no?
Buckets can be geographically located, so as to make the data live as close to
the target user as possible (to reduce the latency of download when the objects
are pulled down), and a given project has no limit to the number of buckets
that it can contain, so for projects that span worldwide boundaries, feel free
to create buckets with unique names that are essentially duplicates of one
another except for the geographic location. (In essence, this would be
duplicating the effects of a CDN, up to a point.) Buckets cannot nest, but
bucket names are essentially free-form, up to 63 characters in
length, so it’s not unusual to create naming schemes that mimic a directory
path ("images-accident-12232013-seattle", for example).
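To make that concrete, creating one of these geographically placed buckets in the XML API is a PUT against the new bucket’s host name, with an optional XML body specifying the location. Here’s a minimal sketch; the bucket name, project number, and token are placeholders, and the exact header names and location values should be double-checked against the reference documentation:
PUT / HTTP/1.1
Host: some-bucket-europe.storage.googleapis.com
Date: Sat, 20 Feb 2010 16:31:08 GMT
Content-Length: 102
x-goog-api-version: 2
x-goog-project-id: 123456789012
Authorization: OAuth <token>

<CreateBucketConfiguration>
  <LocationConstraint>EU</LocationConstraint>
</CreateBucketConfiguration>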
Objects, like files, have both data and metadata associated with them. The
metadata takes the form of name-value pairs, conceptually similar to the
metadata a filesystem associates with its files, and expressed much the way
HTTP expresses its headers.
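In fact, these pairs ride along as HTTP headers when an object is uploaded; in the XML API, custom metadata uses an "x-goog-meta-" name prefix. A quick sketch, where the names after the prefix (and the token) are made up purely for illustration:
PUT /someimage.jpg HTTP/1.1
Host: some-bucket.storage.googleapis.com
Content-Type: image/jpeg
Content-Length: 4096
x-goog-meta-artist: jane-doe
x-goog-meta-reviewed: true
Authorization: OAuth <token>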
The API itself is pretty straightforward; https://developers.google.com/storage/docs/reference-methods
has the complete list of all the APIs, but a quick summary is as follows:
- GET / (against the service host): Return a list of all the buckets this authenticated user can see
- PUT /: Create a bucket of the name specified in the Host HTTP header
- GET / (against a bucket’s host): Get the items in a bucket of the name specified in the Host HTTP header
- GET (bucket): Get the items in a bucket of the name specified in the URL request line
- GET (object): Get the object of the name specified in the URL request line, from a bucket of the name specified in the Host HTTP header
- PUT (object): Upload an object to the bucket specified in the Host HTTP header, giving it the name specified in the URL request line
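To make the first of those concrete, here’s a hedged sketch of a bucket-listing exchange; the response is heavily trimmed, the names are hypothetical, and the exact response element names should be verified against the reference documentation:
GET / HTTP/1.1
Host: storage.googleapis.com
x-goog-api-version: 2
x-goog-project-id: 123456789012
Authorization: OAuth <token>

HTTP/1.1 200 OK
Content-Type: application/xml

<ListAllMyBucketsResult>
  <Buckets>
    <Bucket><Name>some-bucket</Name></Bucket>
  </Buckets>
</ListAllMyBucketsResult>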
Note that most of these have a series of HTTP headers that
must also be included as part of the request, including an Authorization string
obtained prior to the Google Cloud Storage request, often done by carrying out
an OAuth request against the Google Cloud Platform.
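(That Authorization value is an OAuth 2.0 access token; a rough sketch of fetching one from Google’s token endpoint, with all credentials elided, looks something like the following, and the access_token field of the JSON response is what goes into the Authorization header:)
POST /o/oauth2/token HTTP/1.1
Host: accounts.google.com
Content-Type: application/x-www-form-urlencoded

client_id=<id>&client_secret=<secret>&refresh_token=<token>&grant_type=refresh_token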
Example
Imagine, for a moment, that the application being developed
is a trading card game, similar to Magic: The Gathering, except it’s entirely online
and Web-based. Trading card games are often characterized by (and bought for)
the artwork created for each card; downloading all of these images ahead of
time can be prohibitively expensive, and if the game wants to allow players to
upload custom-created cards (which would be a cool feature, you have to admit),
the images will need to be stored online somewhere and referenced as part of
the game. That somewhere, of course, is Google Cloud Storage.
Uploading a new image (whether by the developers or by a
player) would be a PUT operation to the "coolcardgame" bucket, like so:
PUT /meatshieldfighter.jpg HTTP/1.1
Host: coolcardgame.storage.googleapis.com
Date: Sat, 20 Feb 2010 16:31:08 GMT
Content-Type: image/jpeg
Content-MD5: iB94gawbwUSiZy5FuruIOQ==
Content-Length: 552346
Authorization: OAuth 1/zVNpoQNsOSxZKqOZgckhpQ
Notice the Host parameter to the HTTP request: "coolcardgame" is the name of
the bucket in which the object "meatshieldfighter.jpg" (obviously the artwork
for the Meat Shield Fighter card) would be stored. Then,
the game can later simply include a GET URL, perhaps directly as part of an
"img" tag, like so:
GET /meatshieldfighter.jpg HTTP/1.1
Host: coolcardgame.storage.googleapis.com
Content-Length: 0
Authorization: OAuth 1/zVNpoQNsOSxZKqOZgckhpQ
This is part of the power of using the HTTP API directly: by
using direct URLs into the Google Cloud Storage system, developers can let
users’ browsers take full advantage of the rest of the Web infrastructure,
including any and all caching servers in between their browser and the Google Cloud Platform servers. For this
same reason, Google Cloud Storage supports a POST form of the PUT command to
upload an object, specifically built to allow for Web forms to POST images (in
this case) directly into the cloud.
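For the curious, here is a hedged sketch of what the browser ends up putting on the wire for such a form-based upload; note that the real request also carries policy-document and signature fields, omitted here for brevity, which the POST Object documentation describes:
POST / HTTP/1.1
Host: coolcardgame.storage.googleapis.com
Content-Type: multipart/form-data; boundary=----cardupload
Content-Length: 552730

------cardupload
Content-Disposition: form-data; name="key"

meatshieldfighter.jpg
------cardupload
Content-Disposition: form-data; name="file"; filename="meatshieldfighter.jpg"
Content-Type: image/jpeg

(...binary JPEG data...)
------cardupload--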
There’s more, of course, including the ability to use HEAD
to obtain the metadata for an object and DELETE to delete objects and/or empty
buckets, but the HTTP requests for those are easy to infer from the examples
above. There’s also an access-control system that can be set on individual
objects and buckets, as well as a mechanism to mark certain objects as "public"
(meaning no Authorization header is necessary to retrieve them), all of which
is described in the Google Cloud Storage documentation online.
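As one concrete example, marking the card artwork as publicly readable can be done at upload time with a single additional header; this sketch assumes the "public-read" canned ACL described in that documentation:
PUT /meatshieldfighter.jpg HTTP/1.1
Host: coolcardgame.storage.googleapis.com
Content-Type: image/jpeg
Content-Length: 552346
x-goog-acl: public-read
Authorization: OAuth <token>

After that, the "img" tag mentioned earlier needs no Authorization header at all:
<img src="http://coolcardgame.storage.googleapis.com/meatshieldfighter.jpg" />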
Summary
Realistically, there’s not much more to say about Google
Cloud Storage: it supports upload, download, enumeration and deletion of large
binary objects. By using an HTTP-based API (what developers are increasingly
coming to call "Web APIs"), Google makes it
pretty trivial to do all of these things from within the browser as well as
from an application server if necessary.
More importantly, the Google Cloud Platform, as we’ve seen,
is a pretty full-featured, "batteries included" cloud platform, with all the
core capabilities that developers have come to expect from a cloud API
environment, as well as a few features that other cloud platforms lack.
Overall, it’s a powerful platform, and one that developers need to examine
carefully as an option for the next Web-based or mobile-based application
they’re asked to build.
As the Google Cloud
Platform evolves further, we’ll take a look at the various parts and pieces
that emerge, but for now, we’ve covered the core parts, so good luck, get
clouding, and happy coding!