The four levels of automated remediation

Automated remediation of cloud misconfigurations was a big theme in 2018, and here at DivvyCloud we expect the trend to continue through 2019. One of the significant challenges customers face is putting automation into action, instead of just talking about it.

When enterprises evaluate Cloud Security Posture Management (CSPM) solutions, automated remediation is frequently the end goal. As with any enterprise system, it is critical to learn, plan, and prototype your automation capabilities until you fully understand their power. Our challenge is to help clients who are starting from a blank slate take a “crawl, walk, run” approach. Running aggressive automated remediation from day 1 risks causing more issues than it solves. A poor initial implementation will most likely leave your team averse to future remediation efforts and create organizational opposition to automation going forward.

Automation can range from basic notification and logging to fully automated remediation (the most advanced type of automation). You don’t need to start with 100% automated remediation from day 1. In fact, most organizations benefit greatly from working their way through the levels of automation to fully explore what approaches suit their environment best. In this paper, we’ll examine the different levels of automation so that, by the end, you’ll be able to choose the appropriate level for your environment.

First, let’s start by reviewing the benefits of automated remediation:

Time savings - humans no longer need to react and take action manually. The actions are performed automatically, allowing humans to work on higher-value tasks. At enterprise scale especially, the time savings can be significant.
Improved security - vulnerabilities and problems are addressed immediately upon discovery, preventing bad actors from capitalizing on issues.
Consistency - every action runs with the exact same workflow, so organizations can be sure that the prescribed procedures are always followed correctly.
Continuous compliance logging - real-time corrections are logged as they happen, providing ongoing proof that cloud environments stay compliant rather than relying on periodic audits.

Before Starting: Notification

Notification is the foundational building block of automated remediation, and something you will use on an ongoing basis. Because you’ll configure notifications to send reports of remediated events, this is a great place to start for testing.

During the initial rollout, using only notifications for the first tests is critical because it allows you to audit exactly what would be remediated without making changes. This is a great way to ensure any actions that would be invoked will do exactly what you want.

The old saying of “measure twice, cut once” applies here. Do not move out of the notification stage until you’re able to consistently validate that you are only receiving notifications for the defined resources.

When you’re ready, the steps are:

1. Decide which resource type you want to remediate and what check will trigger the automated remediation (exposed SSH, public buckets, etc.)
2. Perform several dry-runs with notification only, and make sure the reported results are exactly what you expect.
3. If the notifications and the resources marked as non-compliant align with expectations, move on to level 1.

Tip: After initial testing and the first stage of remediation have been performed, it is important to keep notifications turned on so that the ongoing results of your remediation are logged for audit.

This is critical for several reasons. If things are breaking and being fixed automatically, how will you know when something bad is happening? Also, if there’s someone who is making unintentionally incorrect or insecure changes, they won’t have any way of being notified that they need to change what they’re doing.
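The dry-run steps listed earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resource dictionaries, the `has_exposed_ssh` check, the `close_ssh` action, and the `evaluate` function are hypothetical names invented for this example, not part of any product API. In dry-run mode the function reports what it would remediate and changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Outcome of one evaluation pass."""
    notified: list = field(default_factory=list)    # resource IDs reported
    remediated: list = field(default_factory=list)  # resource IDs changed


def _is_open_ssh(rule: dict) -> bool:
    """True for an ingress rule exposing port 22 to the whole internet."""
    return rule["port"] == 22 and rule["cidr"] == "0.0.0.0/0"


def has_exposed_ssh(resource: dict) -> bool:
    """Hypothetical check: any ingress rule exposes SSH publicly."""
    return any(_is_open_ssh(rule) for rule in resource.get("ingress", []))


def close_ssh(resource: dict) -> None:
    """Hypothetical remediation: drop the offending ingress rules."""
    resource["ingress"] = [r for r in resource["ingress"] if not _is_open_ssh(r)]


def evaluate(resources, check, remediate, dry_run=True) -> Result:
    """Report every non-compliant resource; change it only if dry_run is False."""
    result = Result()
    for resource in resources:
        if not check(resource):
            continue
        result.notified.append(resource["id"])
        if not dry_run:
            remediate(resource)
            result.remediated.append(resource["id"])
    return result


inventory = [
    {"id": "i-aaa", "ingress": [{"port": 22, "cidr": "0.0.0.0/0"}]},
    {"id": "i-bbb", "ingress": [{"port": 443, "cidr": "0.0.0.0/0"}]},
]
report = evaluate(inventory, has_exposed_ssh, close_ssh, dry_run=True)
print(report.notified)  # only i-aaa is flagged; nothing was modified
```

Because notifications are recorded in both modes, the same code path keeps producing an audit trail after you switch `dry_run` off.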

It’s important that notifications don’t become “noise” or “spam” to the recipients. To that end, notifications should include as much contextual information as possible.

Sample notification, ticketing, and logging targets include:

• Email
• Splunk / Sumo Logic
• Slack
• ServiceNow
• Post to API
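To make “as much contextual information as possible” concrete, here is one possible shape for a notification body, sketched in Python. The field names and the `build_notification` helper are assumptions chosen for this example; each target (email, Slack, a ticketing system, a custom API) will expect its own format, and the delivery step is deliberately left out.

```python
import json
from datetime import datetime, timezone


def build_notification(resource_id, resource_type, account, region,
                       check_name, action, owner=None, when=None):
    """Assemble a notification dict with enough context to act on."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "account": account,
        "region": region,
        "resource_type": resource_type,
        "resource_id": resource_id,
        "check": check_name,
        "action": action,             # e.g. "notify-only" or "remediated"
        "owner": owner or "unknown",  # who to follow up with, if known
    }


message = build_notification(
    resource_id="sg-0123",
    resource_type="security-group",
    account="dev-account",
    region="us-east-1",
    check_name="exposed-ssh",
    action="notify-only",
    owner="jdoe",
    when=datetime(2019, 1, 15, 12, 0, tzinfo=timezone.utc),
)

# Serialized form, ready to hand to whichever delivery mechanism you use.
body = json.dumps(message, sort_keys=True)
print(body)
```

Including the owner and the action taken helps the recipient tell at a glance whether they need to do anything, which is what keeps notifications from turning into noise.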

Level 1: Ensure Visibility and Accountability through Logging

The next step in moving towards automated remediation should focus on locking down your account fundamentals. There are several initial configurations that every new cloud account should have, and most of them can be controlled with automation.

Sample automation for AWS can include:

CloudTrail - ensure there is one trail logging global services
CloudTrail - ensure that all regions are logging
CloudTrail - ensure all logs are being aggregated to a central bucket
IAM - enforce a complex password policy
S3 buckets - enable versioning
S3 buckets - enable logging
S3 buckets - enable server-side encryption
VPC - ensure all VPCs have VPC Flow Logging enabled
EC2 - auto-tag instances that are spun up without an owner tag, populating the value with the username of whoever created them
EBS - ensure all volumes associated with an instance are tagged
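A minimal sketch of how the S3 items in this list could be expressed as a check, assuming bucket settings have already been collected into plain dictionaries. The dictionary keys and the `missing_bucket_baseline` function are hypothetical, invented for this illustration; a real implementation would read these settings from the cloud provider's API.

```python
# Baseline settings every bucket is expected to have enabled (assumed key names).
REQUIRED_BUCKET_SETTINGS = ("versioning", "logging", "encryption")


def missing_bucket_baseline(bucket: dict) -> list:
    """Return the baseline settings that are not enabled on a bucket."""
    return [name for name in REQUIRED_BUCKET_SETTINGS if not bucket.get(name, False)]


buckets = [
    {"name": "app-logs", "versioning": True, "logging": True, "encryption": True},
    {"name": "scratch", "versioning": False, "logging": True},
]

# Map each non-compliant bucket to the settings automation should turn on.
todo = {}
for bucket in buckets:
    gaps = missing_bucket_baseline(bucket)
    if gaps:
        todo[bucket["name"]] = gaps
print(todo)
```

A setting that is absent from the dictionary is treated the same as one that is disabled, so an incomplete inventory errs on the side of flagging the bucket.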

These fundamentals will save valuable time and increase the security of your environment. None of these configurations should have a negative impact on your day-to-day operations or users.

Level 2: More Impactful Best Practices

The automation recommendations in levels 1 and 2 line up closely with the CIS AWS Foundations Benchmark. These are account fundamentals that any organization can employ to improve the security and overall hygiene of their cloud accounts.

The difference between levels 1 and 2 is that the level 2 automation takes some planning to ensure it won’t have an impact on your users. Even so, it will be easier to roll out than the remediation automation in levels 3 and 4.

Examples of AWS housekeeping automated remediation:

• Remove all unused security groups that start with launch-wizard*

• Delete the default rules on the default security groups

• Ensure that all custom AMIs are set to private
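The first housekeeping example above shows why level 2 needs planning: a group must match the name pattern and also be confirmed as unused before it is removed. Here is a minimal Python sketch, assuming security groups have been collected into dictionaries with a hypothetical `attached_to` list of the resources using them; the function name is likewise invented for this example.

```python
from fnmatch import fnmatchcase


def unused_launch_wizard_groups(groups):
    """Return IDs of launch-wizard* groups that have nothing attached."""
    return [
        group["id"]
        for group in groups
        if fnmatchcase(group["name"], "launch-wizard*") and not group["attached_to"]
    ]


groups = [
    {"id": "sg-1", "name": "launch-wizard-1", "attached_to": []},
    {"id": "sg-2", "name": "launch-wizard-2", "attached_to": ["i-aaa"]},
    {"id": "sg-3", "name": "web-tier", "attached_to": []},
]
candidates = unused_launch_wizard_groups(groups)
print(candidates)
```

In this sample data, `sg-2` is skipped because an instance still uses it, and `sg-3` is skipped because its name does not match the pattern.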

Level 3: Governance and Account Hygiene

Things get a bit more free-form in level 3. The goal here is to make this automation your own and add actions that bring your company the most value while affecting day-to-day operations as little as possible.

There are several use-cases that may be employed, and you’ll have to explore which of these best fits your company:

Sandbox environment enforcement (clean up every X hours, or in line with your software release cycles)
Notify / kill expensive instances that are spun up (either by cost, family type, or specs on the hardware)
Cost control (clean up unused databases, old snapshots, orphaned resources, etc.)
Port exposure cleanup (lock down SSH, RDP, etc.)
Keep non-main regions unused (kill off or alert for assets outside of primary regions)
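Two of these use cases, flagging expensive instances by family and keeping non-primary regions unused, could be sketched as a simple rule function. The region set, the family set, and the `governance_findings` name are assumptions made for this example; your own lists will depend on where and how your company runs its workloads.

```python
PRIMARY_REGIONS = {"us-east-1", "us-west-2"}  # assumed primary regions
EXPENSIVE_FAMILIES = {"p3", "x1"}             # assumed families worth flagging


def governance_findings(instance: dict) -> list:
    """Return the governance rules an instance violates, if any."""
    findings = []
    if instance["region"] not in PRIMARY_REGIONS:
        findings.append("outside-primary-region")
    family = instance["type"].split(".")[0]   # "p3.16xlarge" -> "p3"
    if family in EXPENSIVE_FAMILIES:
        findings.append("expensive-family")
    return findings


instances = [
    {"id": "i-1", "region": "us-east-1", "type": "t3.micro"},
    {"id": "i-2", "region": "eu-west-1", "type": "t3.micro"},
    {"id": "i-3", "region": "us-west-2", "type": "p3.16xlarge"},
]

# Collect only the instances that triggered at least one rule.
flagged = {}
for instance in instances:
    findings = governance_findings(instance)
    if findings:
        flagged[instance["id"]] = findings
print(flagged)
```

Returning findings rather than acting on them directly leaves the choice between notifying and killing to a separate step, which fits the free-form nature of this level.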

Level 4: “Classic” Automated Remediation

It’s not until level 4 that you’ll begin performing the kind of automated remediation most people think of when discussing the topic. That’s because these actions, while the most exciting form of automation, give you the most control and can also do the most damage. When you start rolling them out, you need to make sure that the organization is completely on board. If these types of actions are run in a vacuum, you’re going to have a lot of confused people and potentially a lot of broken systems.

Examples of level 4 automated remediations in AWS:

Deactivating old API keys
Deleting newly provisioned SG rules which do not have a description
Killing instances that aren't tagged properly, or aren't running a golden AMI
Locking down public S3 buckets (zeroing out ACLs and bucket policy)
- Adding in exceptions (by tag, name, etc.)
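Exceptions are what keep a level 4 action from breaking something that is public on purpose. The sketch below shows one way to apply exceptions by tag and by name before a lockdown runs. The tag key, the name pattern, and both function names are hypothetical and exist only for this illustration.

```python
from fnmatch import fnmatchcase

EXEMPT_TAG = "remediation-exempt"            # hypothetical tag key
EXEMPT_NAME_PATTERNS = ["public-website-*"]  # hypothetical name patterns


def is_exempt(bucket: dict) -> bool:
    """A bucket is exempt if it carries the tag or matches a name pattern."""
    if bucket.get("tags", {}).get(EXEMPT_TAG) == "true":
        return True
    return any(fnmatchcase(bucket["name"], pattern) for pattern in EXEMPT_NAME_PATTERNS)


def buckets_to_lock_down(buckets):
    """Return names of public buckets that have no exception."""
    return [b["name"] for b in buckets if b["public"] and not is_exempt(b)]


buckets = [
    {"name": "public-website-assets", "public": True, "tags": {}},
    {"name": "partner-share", "public": True, "tags": {"remediation-exempt": "true"}},
    {"name": "backups", "public": True, "tags": {}},
    {"name": "internal", "public": False, "tags": {}},
]
targets = buckets_to_lock_down(buckets)
print(targets)
```

Keeping the exception logic in one function makes it easier to review who is exempt and why, which matters when the organization needs to be on board with what the automation will touch.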

Other Considerations — Timing

It’s important to mention the concept of timing for the actions that are in levels 3 and 4. It’s appealing to initially roll out actions that have a lag between notification and remediation built into them, but that might not be the best approach.

For example, if you wanted to lock down SSH exposure in your development environment, you might design your remediation to send a Slack notification that an instance is out of compliance because it has SSH exposed and that it’ll be terminated in two hours if not fixed. Two hours later, the instance can be terminated.

It may seem counterintuitive, but this is actually a more disruptive workflow than tearing the instance down as soon as the issue is detected. In the first scenario, if a developer spins up a non-compliant instance and it goes away after two hours, they will have spent two hours of work on that instance. The code will have been loaded, the app might be running, and if it just goes away, that developer time is wasted, and they’ll probably be upset! If the instance is instead torn down as soon as it’s seen, the developer never has an opportunity to waste any time on it. It’ll go away a minute or two after it is created, and the supporting notifications will give them the context they need to avoid the same mistake again.

Different scenarios require different timing and you’ll always have to balance the risk of security exposure with the operational impact it will have on the organization.
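The two timing options discussed above differ only in the length of the grace period, which a small decision function can make explicit. This is a sketch under assumptions: the `next_action` name and its return values are invented for this example.

```python
from datetime import datetime, timedelta, timezone


def next_action(detected_at, now, grace_period):
    """Decide what to do with a non-compliant resource.

    Returns "remediate" once the grace period has elapsed, otherwise "notify".
    A zero grace period means the resource is remediated on first sight.
    """
    if now - detected_at >= grace_period:
        return "remediate"
    return "notify"


detected = datetime(2019, 1, 15, 12, 0, tzinfo=timezone.utc)
one_minute_later = detected + timedelta(minutes=1)

# Immediate teardown: no grace period.
immediate = next_action(detected, one_minute_later, timedelta(0))

# Delayed teardown: two-hour warning window.
delayed = next_action(detected, one_minute_later, timedelta(hours=2))
print(immediate, delayed)
```

Treating the grace period as a per-check parameter lets you choose a different balance between security exposure and operational impact for each scenario.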

To Sum Up…

Every organization will have a unique journey when implementing automation. For some, there’s no appetite for fully automated remediation, and just using automation for notifications will be enough. In other organizations, everything will be automated and completely locked down. Whatever level your organization strives to reach, working through these four levels lets you roll out automation gradually and get the most value out of the actions with the least amount of shock to the organization.

Chris DeRamus, CTO and co-founder, DivvyCloud