Filtering Lessons

Last week, AWS announced (docs, blog) event filtering for Lambdas reading from SQS, DynamoDB, and Kinesis. Lambda’s new filtering is consistent with EventBridge’s. Also announced was faster, slicker, cheaper S3/EventBridge integration. These features rely on software I helped write. The problem turns out to be more interesting than you might expect, by a factor of 10¹² or so. I think it’d be really great if the code were open-sourced.

The core idea is that in modern applications, streams of data flow around between services, and very often you want to select just some of the data out of the stream. For example, if you’ve written a few Lambdas, quite likely some of your code was about looking at the input events and deciding whether to process or ignore them. Or deciding which other software to dispatch them to. If you can achieve the same effect by writing pattern-matching rules, you can subtract code and save money: a win-win.
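To make that concrete, here’s a toy sketch in Go. It is not the Matcher, and the rule shape and field names are only illustrative (the real rule language covers far more than exact string matches); it just shows what a declarative rule looks like and the sort of hand-rolled check it replaces.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A toy, EventBridge-flavored rule: each field name maps to a list of
// acceptable values. (Illustrative only.)
var rule = map[string][]string{
	"source":      {"aws.s3"},
	"detail-type": {"Object Created"},
}

// matches is the kind of check you'd otherwise hand-code at the top of a
// Lambda: does every field named in the rule appear in the event with one
// of the allowed values?
func matches(event map[string]any, rule map[string][]string) bool {
	for field, allowed := range rule {
		got, ok := event[field].(string)
		if !ok {
			return false
		}
		found := false
		for _, want := range allowed {
			if got == want {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	raw := []byte(`{"source":"aws.s3","detail-type":"Object Created","detail":{"bucket":"my-bucket"}}`)
	var event map[string]any
	if err := json.Unmarshal(raw, &event); err != nil {
		panic(err)
	}
	fmt.Println(matches(event, rule)) // true
}
```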

Interesting why? · It’s not as though these filtering features are doing anything new or innovative or surprising; this sort of thing goes back as far as 1973, with grep. There are lots more landmarks along that road; for example a lot of smart people seem to like jq these days. So, why should anyone care about this software?

First of all, the stream it’s filtering might be carrying millions of events per second. Second, people will want to post a million rules against the stream. (Neither of those numbers is hypothetical.) So it’s a million times a million times trickier than you might think.

Bearing that in mind, we created a library which I’ll call the Matcher. Its query expressions are a little idiosyncratic, but people seem to like them; that blog I linked to earlier is probably the best introduction I’ve seen.

The performance envelope is, um, unusual. But pleasing. “Good enough for EventBridge” is decent testimony. Even stronger is “good enough for S3”. Check out this short thread from Usman Khalid, GM for EventBridge and some other stuff, a guy I really enjoyed working with at AWS. He explains that the event-matching logic has been pushed down into S3, which doesn’t even emit the events unless they match a rule. Now, “inside of S3” is a very special place. An almost unimaginable amount of customer data lives in there, subject to an almost unimaginable transaction rate, and everyone is very very latency-sensitive.

How it works · Well, that’d be a secret. At least I thought so until I sat down to write this, and discovered U.S. Patent 11,068,487: Event-stream searching using compiled rule patterns, filed in 2015 but just issued last July. Patent theory says that the disclosure should be sufficient to enable a competent practitioner to implement the technology. Heh, good luck with that. I know a better way than patents to share engineering progress.

Open-source why? · The Matcher would be useful for a lot of people, if the AWS experience is any guide. It’s easy to integrate. It’d be a significant and visible give-back from AWS to the software commons from which it’s profited so richly. I also think it’d be a useful teaching tool (see below).

Having said that, and to be brutally honest, my reasons are selfish. I enjoyed working on this so much, and would like to do it some more. So from that point of view, AWS would get some free time out of an engineer with highly relevant experience.

I might also like to make versions of it in other programming languages.

And courseware · I applied plenty of lessons hard-won in decades of programming to this software, and learned useful new ones. So there are deep software-tech blog pieces screaming to be written. In fact, I think I could probably put together a 400-level CompSci course on “Topics in high-performance software design” using the Matcher as a lab substrate.

I’ll support that claim with one interesting and counter-intuitive lesson. But first…

History · When I started at AWS on December 1, 2014, my hiring manager said “Our idea is we’re gonna have an event bus carrying notifications of everything that’s happening in everybody’s AWS account and you can subscribe to it, select the events you care about, and process them with Lambdas.” So, it wasn’t my idea; and that manager is now a VP who owns a huge swathe of AWS Networking.

But I helped write the PR/FAQ and was senior engineer on the dev team.

I couldn’t stop thinking about matching the rules that defined the subscriptions. My mind hadn’t stretched to millions times millions yet, but it seemed obvious that the biggest customers, who would have the most events, would also have the most rules. So if you coded up a simple-minded loop like “for each event, retrieve all the rules posted by the event’s account and check one by one to see if they match” that’d probably be bad.

So I coded up most of release 1.0 of the Matcher by January, with a couple of breakthroughs happening on my Mom’s couch when we visited her in Saskatchewan for Christmas.

EventBridge (launched as “CloudWatch Events”) worked out pretty well and soon other groups started using the Matcher. It grew plenty of features over the years, some only exposed internally. Naturally, like any other widely-used piece of software that started out lean and mean, it became fatter and more complex. While there’s a ton of my code there, I’m sure it’s a minority by now.

I met some wonderful people from other AWS and Amazon teams who said “Oh we need Matcher but it also needs to do X, mind if I code that up?” Working at AWS was a fun job, and Matcher was one of the most-fun parts.

Now, I promised a non-obvious finding that emerged during the Matcher work…

Lesson: JSON is faster than binary · Here’s a surprising statement: If you want to filter messages very, very quickly, you can do it much faster if they’re in JSON as opposed to a strongly-typed “binary” format like Protobufs or Avro.

No, really. Because with those formats, you have to unpack the whole event into a native data structure, which means looking at all the fields and allocating memory for them. You really want to minimize memory allocation for anything that happens millions of times per second.

JSON, however, allows stream parsing. Consider, for example, the Jackson library’s nextToken() or Go’s json.Decoder with its Token() method.

Which means you can skip through each event, keeping around only the fields you’re interested in filtering on, and in most event-processing scenarios that’s very few of them. So yeah, you do have to parse strings of digits into numbers, but only for the fields you care about.
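Here’s a minimal Go sketch of that skimming, using encoding/json’s streaming Decoder. It is not the Matcher’s code: it assumes the fields you filter on are top-level strings, and it skips uninteresting values by decoding them into a throwaway RawMessage, where a production version would avoid even that copy.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// skim walks a JSON event token by token and keeps only the top-level
// string fields named in wanted, never building a full parse tree.
func skim(raw string, wanted map[string]bool) map[string]string {
	dec := json.NewDecoder(strings.NewReader(raw))
	out := make(map[string]string, len(wanted))

	if _, err := dec.Token(); err != nil { // consume the opening '{'
		return out
	}
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return out
		}
		key, _ := keyTok.(string)

		if wanted[key] {
			// Assumes the interesting fields hold plain string values.
			valTok, err := dec.Token()
			if err != nil {
				return out
			}
			if s, ok := valTok.(string); ok {
				out[key] = s
			}
		} else {
			// Skip the value; a real implementation would do this
			// without copying the bytes at all.
			var skip json.RawMessage
			if err := dec.Decode(&skip); err != nil {
				return out
			}
		}
	}
	return out
}

func main() {
	event := `{"source":"aws.s3","detail-type":"Object Created","detail":{"bucket":"b","key":"k"}}`
	fmt.Println(skim(event, map[string]bool{"source": true, "detail-type": true}))
	// map[detail-type:Object Created source:aws.s3]
}
```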

And there are better structures for storing a small selection of data fields than JSON’s messy, weakly-typed tree. I’ve measured; the performance delta between parsing incoming events into an object and just skipping through their fields is freaking huge.

That’s just one take-away; there are plenty more. I’d sure like to share them.


