At the end of September, AWS announced a big new feature for its Step Functions product, and my tweet noting the announcement got a shocking number of impressions for something way out at the geeky end of Cloud tech. In retrospect, a design choice we made back in 2016 turns out to be working very well, and there’s a lesson to be learned here: If you need to integrate an arbitrarily large and diverse set of software capabilities, URIs are the best integration glue.
Background ·
Step Functions launched in December 2016, and I did a whole lot of work on it. In particular, my fingerprints are all over the Amazon States Language, a JSON DSL that describes all the workflow stuff: What software to run, branches and loops, error handling, retrying, parallelism, and so on. In the States Language, each of the steps in the workflow is represented by a little blob of JSON called a “state”, and each has a Type field saying what it does. (I wanted to call the product “Amazon States” or “AWS State Machines” but Andy and Charlie puked at that and we ended up with Step Functions, which isn’t terrible.)
The argument ·
When we first cooked up the product, the only real target we had was Lambda functions, and so the suggestion was that we have a "Type": "Lambda" state, with another field that would give the name of the Lambda function. But I said “Long-term, we want to be able to orchestrate lots of other things, not just Lambdas, right?” Everyone agreed. So I said “OK then, let’s just have a Task state which identifies the worker with a URI. That way, everything we orchestrate has the same contract, you send it some JSON and you get some JSON back.”
People looked a bit puzzled and said “But Lambdas don’t have URIs.” I said “Sure they do, they have ARNs and ARNs are URIs.” (Well, they would be if Amazon registered the “arn:” URI scheme, which I should have while I was there and they should now. But close enough.) There was a little push-back on making people use the long klunky-looking ARN as opposed to the nice user-friendly function name, but I was pretty convinced and eventually won the argument.
I was remembering the dawn of the Web, quoting from someone (I think TimBL?) who said “On the Web, a resource is a unit of information or service.” Which I thought was a good fit here.
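To make the shape of that design concrete, here is a minimal sketch in Python that builds a one-state workflow whose Task state names its worker by ARN in the Resource field. The Type, Resource, StartAt, States, and End fields come from the States Language; the region, account number, function name, and the task_state helper are made-up placeholders for illustration.

```python
import json


def task_state(resource_uri, end=True):
    """Return a States Language Task state that addresses its worker by URI."""
    state = {"Type": "Task", "Resource": resource_uri}
    if end:
        # Mark this state as the last one in the workflow.
        state["End"] = True
    return state


# Placeholder ARN: the region, account number, and function name are invented.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:HelloWorld"

machine = {
    "StartAt": "SayHello",
    "States": {"SayHello": task_state(LAMBDA_ARN)},
}

print(json.dumps(machine, indent=2))
```

Nothing in the state says “Lambda”; swapping in a different worker only means swapping the Resource string.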
Flashing forward five years ·
Let’s just have a look at what Step Functions has been integrated with. Start here, and scroll down (there are duplicate anchors, grr) to the “Service Integrations” header, and look at the table. As I write this, there are 17 “Optimized” integrations, and then 200+ SDK-based integrations. And they all use the same Task state and address the target worker by URI (which at the moment is always an ARN).
The ARN for an “Optimized integration” looks like this (taking EMR as an example): arn:aws:states:::emr-containers:createVirtualCluster. “Optimized” means it’s smart about the way it calls the service and can operate in either fire-and-forget or wait-for-completion mode. Also, it can autogenerate IAM policies to make your life easier.
The recent announcement that kicked this discussion off made it possible to call more or less any API in the AWS SDK, addressing it with an ARN like: arn:aws:states:::aws-sdk:emrcontainers:createVirtualCluster. There’s a really excellent blog that walks you through the process.
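As a rough illustration of how much the Resource URI carries on its own, the sketch below splits the two ARNs quoted above into the usual colon-separated ARN fields. The names split_arn and integration_kind, and the labels they return, are assumptions made for this example, not anything Step Functions exposes.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource: str


def split_arn(arn: str) -> Arn:
    """Split an ARN into the five fields that follow the leading 'arn'."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


def integration_kind(arn: str) -> str:
    """Label a Task resource ARN (illustrative labels only)."""
    parsed = split_arn(arn)
    if parsed.service != "states":
        return "other"
    # SDK-based integrations carry an 'aws-sdk:' prefix in the resource part.
    return "sdk" if parsed.resource.startswith("aws-sdk:") else "optimized"


OPTIMIZED = "arn:aws:states:::emr-containers:createVirtualCluster"
SDK = "arn:aws:states:::aws-sdk:emrcontainers:createVirtualCluster"

for arn in (OPTIMIZED, SDK):
    print(integration_kind(arn), split_arn(arn).resource)
```

Both ARNs leave the region and account fields empty, so the whole difference between the two kinds of integration lives in the resource part of the URI.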
I’m happy ·
I’m feeling just the tiniest bit smug that they were able to add all these integrations, and in particular this latest huge one, without needing to make any major changes to the States Language.
But to be honest, all of that comes more or less for free once you decide that everything you might want to integrate is a resource and thus should be identified by a Uniform Resource Identifier. I recommend this design pattern.
The future ·
I’ve always thought that once you agree to address things by URI, well that includes HTTP URLs, so why shouldn’t a Step Functions Task state be able to include an arbitrary external Web endpoint? SNS can already do this. Now… it’s kind of scary making an AWS service take a runtime dependency on an uncontrolled external anything, so this would be tricky to implement.
But it’s another thing you could do with no language changes, just because you decided to do things the Web-native way.
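The “no language changes” point can be sketched as a dispatcher keyed on the URI scheme. Everything here is hypothetical: the handler functions, the HANDLERS registry, run_task, and the example HTTPS endpoint are invented for illustration, and this is not a claim about how Step Functions works internally. The only point is that a new kind of worker means a new registry entry, while the Task state keeps the same Resource field.

```python
from urllib.parse import urlsplit


def call_arn(uri, payload):
    # Stand-in for invoking an AWS resource; a real handler would call AWS.
    return {"via": "arn", "target": uri, "input": payload}


def call_https(uri, payload):
    # Stand-in for POSTing JSON to an external Web endpoint.
    return {"via": "https", "target": uri, "input": payload}


# Hypothetical registry: one handler per URI scheme.
HANDLERS = {"arn": call_arn}


def run_task(state, payload):
    """Send JSON to whatever the Task state's Resource URI identifies."""
    uri = state["Resource"]
    scheme = urlsplit(uri).scheme
    if scheme not in HANDLERS:
        raise ValueError(f"no handler for URI scheme {scheme!r}")
    return HANDLERS[scheme](uri, payload)


arn_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::emr-containers:createVirtualCluster",
}
print(run_task(arn_state, {"Name": "demo"}))

# Supporting Web endpoints is one more registry entry; the state shape is unchanged.
HANDLERS["https"] = call_https
web_state = {"Type": "Task", "Resource": "https://example.com/hook"}
print(run_task(web_state, {"Name": "demo"}))
```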