Using AWS? This is probably why you’re getting it wrong

At The Scale Factory, we’ve been an AWS Consulting Partner for over six years. Back in April 2018, we were invited to join their Well-Architected program, and you probably haven’t stopped hearing me bang on about it since.

Well, that’s because we’re a leading partner in this program, reviewing over 250 production AWS infrastructures. This gives us a unique and valuable perspective on how modern development teams are using the cloud.

From this vantage point, we’ve seen that a number of areas almost always need improvement. Today I’m going to talk a bit about the framework itself, and then share some of our findings in the hope that it will help you avoid making some of the common mistakes we’ve seen.

What is AWS Well-Architected and why should I care?

AWS Well-Architected is a set of design principles and architectural practices for building and running platforms in the cloud.

Provided as a set of whitepaper documents, the framework covers the five main pillars: 

  • Operational Excellence;
  • Security;
  • Reliability;
  • Performance Efficiency; and 
  • Cost Optimization (to which I upset my spell checker by reluctantly adding the ‘z’ because that’s how US-based AWS spell it).

There are also a number of lenses, which cover specific types of workload, including Financial Services, Analytics, Machine Learning, IoT, High Performance Computing (HPC), and Serverless Applications. Each lens adds workload-specific guidance to the five pillars.

The patterns found in the Well-Architected framework aren’t just opinions: they’ve been put together by AWS’ own Solutions Architects, informed by conversations with real customers of the platform – people like you.

With something like 176 individual services (a number likely to increase in December when the annual re:Invent conference kicks off), the AWS platform is vast and constantly evolving. Following this sort of guidance is important to ensure you’re making the right choices.

It’s worth noting that you can benefit from these resources even if you’re not using AWS; the guidance and recommendations themselves aren’t AWS specific. In fact, Microsoft provides a near-identical Microsoft Azure Well-Architected Framework, which is so clearly ‘borrowed’ from AWS’ work that it has exactly the same five pillars.

What are these reviews you’re talking about?

A Well-Architected review is a series of multiple choice questions, designed to assess your use of AWS against the recommendations made by the framework.

There’s a review tool in the AWS console if you want to work through those yourself, but you’ll get more value from engaging an AWS Certified Well-Architected partner to facilitate it. Did I mention that we’re one of those?

A partner-led review takes around half a day. We walk through the questions in the review tool with your infrastructure stakeholders, using it as a starting point for deeper discussion.

Often, the review works as a teaching tool, with our Solutions Architect who runs the session offering guidance based on their own experience. Some teams use the review as a gap analysis ahead of an infrastructure roadmapping exercise. For larger organisations, the review framework can help with aligning the practice of multiple AWS-using teams.

At the end of a review, the console tool provides a list of high and medium risks, highlighting areas for improvement. We usually provide a more detailed report, expanding on those recommendations and contextualising them within what we know about your business.

The good news is that our findings show you’ve probably done a good job of building your platform. We’ve found that the majority of teams choose appropriate compute, storage, and network services, based on the needs of the workload, taking into account the characteristics of that part of the platform. Nobody we’ve reviewed is using Route 53 as a database, or serving video from Lambda, which is honestly reassuring.

You’re probably not good enough at operations

Building a platform is only part of the story though. In terms of platform lifetime, unless something goes horribly wrong with your plans to hit profitability, you’ll be running your service for a much longer time than it takes to build. So why is operations so frequently overlooked?

The most common high risk issue we’ve found in our reviews relates to disaster recovery. 79% of customers either have no disaster recovery plan, or have a plan which isn’t adequate.

67% of teams don’t do enough resilience testing of their platforms – almost nobody practises failure injection or ‘game day’ style rehearsals for dealing with unanticipated failure.

My own theories on this topic are that either we’re all optimists and believe that nothing we’ve built could possibly go wrong, or we’re struggling to justify spending engineering hours on ‘maybe’ scenarios when there’s still a huge backlog of revenue-generating features to be built.

It might help to think in terms of revenue protection. What percentage of your revenues are you prepared to invest into making sure they keep flowing?

You’re probably not as secure as you should be

On the topic of incident response, 75% of customers either entirely or partially lack a response plan for security incidents. If a bad actor (not Ben Affleck) gets their hands on your security tokens and uses them to provision instances for mining Bitcoin, how will you respond? Half of the teams reviewed know who in their organisation they’d contact in that circumstance (also probably not Ben Affleck), but that person wouldn’t have access to adequate tools for containing and analysing the attack.

47% of teams had improvements to make in terms of access control for their human users. Although most teams (90%) provide unique user credentials to their human users, or make use of some kind of identity federation service (62%), many of these teams (42%) grant too much privilege to their users. You’ve long since grown out of logging into every Linux server as root; it’s time to approach your access to cloud resources the same way.
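To make ‘too much privilege’ a bit more concrete, here’s a minimal sketch of one way to spot it: scan an IAM policy document for Allow statements that use a wildcard action or resource. The function name and the example policy (including its Sids and bucket name) are my own inventions for illustration, and the check is deliberately crude: some actions legitimately need a wildcard resource, so treat a hit as a prompt for a conversation rather than a verdict.

```python
import json


def overly_broad_statements(policy_json):
    """Return the Allow statements that use a wildcard action or resource."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # IAM accepts a single statement object as well as a list of them.
    if isinstance(statements, dict):
        statements = [statements]

    def as_list(value):
        # Action and Resource may each be a string or a list of strings.
        return [value] if isinstance(value, str) else list(value or [])

    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action"))
        resources = as_list(statement.get("Resource"))
        # "*" means every action; "service:*" means every action in a service.
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        # Only a bare "*" counts here; "arn:...:bucket/*" is scoped to one bucket.
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            flagged.append(statement)
    return flagged


# Hypothetical policy, made up for this example.
EXAMPLE_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "TooBroad", "Effect": "Allow", "Action": "*", "Resource": "*"},
    {"Sid": "Scoped", "Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/*"}
  ]
}
"""

for statement in overly_broad_statements(EXAMPLE_POLICY):
    print("Review this statement:", statement.get("Sid"))
```

In this made-up policy only the first statement is flagged; the second grants a single action on a single bucket, which is much closer to what least privilege looks like.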

It’s not just human users of these APIs that we need to worry about. For software components that need access, 30% have too high a level of privilege, and a full 78% aren’t making use of AWS dynamic authentication features to grant temporary access tokens without having to bake credentials into configuration files.
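As a rough illustration of what temporary access tokens look like in practice, here’s a small sketch built around the STS AssumeRole call, which exchanges the caller’s identity for short-lived credentials. The helper name, the role ARN and the session name are placeholders of mine. It’s also only one route: workloads running on EC2, ECS or Lambda with a role attached get temporary credentials from the SDK automatically, with no code like this at all.

```python
def temporary_credentials(sts_client, role_arn, session_name, duration_seconds=900):
    """Ask STS for short-lived credentials for the given role.

    sts_client is expected to behave like boto3.client("sts").
    900 seconds is the shortest session AssumeRole will issue.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,
    )
    credentials = response["Credentials"]
    # These keys match the keyword arguments that boto3.client(...) accepts.
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Usage (needs boto3 and AWS access; the ARN and names are placeholders):
#
#   import boto3
#   kwargs = temporary_credentials(
#       boto3.client("sts"),
#       "arn:aws:iam::123456789012:role/example-report-reader",
#       "nightly-report",
#   )
#   s3 = boto3.client("s3", **kwargs)
```

Nothing here is written to a configuration file, and the credentials expire on their own, so a leaked token has a much shorter useful life than a baked-in access key.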

How do you know how to secure each service if you don’t have a good awareness of what data it holds? 75% of teams don’t do any kind of data classification. With considerations like GDPR and PCI-DSS, we all need to be more careful with what we store where, and who can access it.

Why are teams not great at security? Again, as in the case of operations, security is often an invisible consideration, rarely top of mind for a lot of businesses unless something awful happens, by which time it can be too late. The product landscape for security and compliance tooling in AWS has evolved substantially in the last couple of years. Maybe it’s time to take another look?

Your DevOps practices likely need work

We’ve been talking about DevOps since 2009, but these practices remain unevenly distributed.

Most teams are using version control (90%), configuration management (78%), and automated build and deployment systems (82%). However, only 63% of teams are making frequent, small, reversible changes; and only 52% can get code changes into production in a fully automated fashion.

With the Accelerate book providing us with quantitative research on DevOps practices, we know that high performing organisations typically have short lead times and high frequency of deployment. Investing in improvements here can really pay off.

We’ve found that monitoring is reasonably common, with 72% of teams capturing some kind of metric data about their workloads, but it seems that much of the time this data isn’t really being used. Only 51% of teams are establishing baselines for these metrics, and just 40% raise alerts when the business outcomes for the platform are at risk. Perhaps this is also a symptom of under-investment in operations?
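If ‘establishing a baseline’ sounds abstract, here’s a minimal sketch of the idea: summarise what normal looks like for a metric, then raise an alert when a new reading falls too far outside it. The metric (orders per minute), the sample values and the three-standard-deviation tolerance are all assumptions I’ve made up for the example; a real system would pick its own metric and threshold, and would most likely lean on its monitoring tool rather than hand-rolled code.

```python
from statistics import mean, stdev


def baseline(samples):
    """Summarise normal behaviour as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def breaches_baseline(value, samples, tolerance=3.0):
    """True when value sits more than `tolerance` deviations from the mean."""
    centre, spread = baseline(samples)
    return abs(value - centre) > tolerance * spread


# Hypothetical history of a business metric: orders placed per minute.
ORDERS_PER_MINUTE = [100, 104, 98, 101, 97, 102, 99, 103]

for reading in (101, 60):
    if breaches_baseline(reading, ORDERS_PER_MINUTE):
        print(f"ALERT: {reading} orders/min is well outside the baseline")
    else:
        print(f"OK: {reading} orders/min is within the baseline")
```

The point of choosing a business metric like orders rather than CPU load is that it tells you when the outcome you care about is at risk, whatever the underlying cause turns out to be.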

Time to act

Our findings from these reviews show that most teams are better at building platforms than they are at securing or running them. Maybe this is true of your team too?

If you’d like to find out for yourself, then take a Well-Architected review: you can find the review tool in your AWS console. Alternatively, give us a call. Our reviews are provided free of charge, and we have access to AWS funding programs to help pay for any improvements we identify as necessary.

(The data in this post comes from around 150 reviews, run up until January 2020. Most review recipients were small or medium-sized businesses, but based on prior experience I’d expect to see similar results in large or enterprise organisations too.)

***

Want to hear more from Jon Topper and other incredible speakers? Join us at our forthcoming CTO Craft Con 2020: The People One on 1 – 3 December. Find out more and get your tickets here!
