Over the past few weeks, I have been working on a small project to automate the creation of an AMI (Amazon Machine Image) for Ubuntu 20.04, hardened according to CIS benchmarks. I ended up using a combination of GitHub Actions, Terraform, and an existing Ansible role. The main motivation was to explore the features of these tools. Depending on your use case, you might find HashiCorp Packer easier to use for creating an AMI, or you can use the official CIS AMI from the AWS Marketplace directly.
This article gives a brief introduction to the project architecture and goes over a few design decisions and the challenges I faced. For source code and detailed instructions on how to replicate the project, check the GitHub repository here.
Architecture
- All steps run in GitHub Actions workflows.
- Terraform is used to create a temporary VPC, IAM Role, and EC2 instance.
- After the infrastructure has been deployed, an Ansible playbook is executed on the remote EC2 instance using AWS Systems Manager Run Command and the AWS-ApplyAnsiblePlaybooks document. The instance is in a private subnet and does not require any open inbound ports.
- AMI creation is triggered for the instance after playbook execution is completed.
- Once the AMI has been created, Terraform destroys the temporary infrastructure.
- Static resources like the Terraform state, the Ansible playbook zip, and the output of the Systems Manager Run Command invocation are stored in Amazon S3.
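The ordering between playbook execution and AMI creation can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the AMI name prefix, and the description are hypothetical. The status strings are the ones Systems Manager reports for a command invocation, and the returned dictionary holds keyword arguments that could be passed to the EC2 CreateImage API (for example through boto3's `create_image`, assuming boto3 is available).

```python
from datetime import datetime, timezone

# Invocation statuses after which the playbook run will never succeed.
TERMINAL_FAILURES = {"Cancelled", "TimedOut", "Failed"}


def ready_for_ami(status):
    """Return True only once the Run Command invocation has succeeded.

    Raises RuntimeError for terminal failures so that a broken playbook
    run never ends up baked into an image.
    """
    if status in TERMINAL_FAILURES:
        raise RuntimeError(f"playbook execution ended with status {status}")
    return status == "Success"


def build_create_image_request(instance_id, prefix="ubuntu-2004-cis", now=None):
    """Build keyword arguments for an EC2 CreateImage call.

    The prefix and description are hypothetical; the timestamp only
    serves to keep AMI names unique between runs.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "InstanceId": instance_id,
        "Name": f"{prefix}-{now:%Y%m%d-%H%M%S}",
        "Description": "Ubuntu 20.04 hardened with an Ansible CIS role",
    }


if __name__ == "__main__":
    if ready_for_ami("Success"):
        print(build_create_image_request("i-0123456789abcdef0"))
```

Raising on terminal failures, rather than returning False, keeps "still running" and "will never finish" apart, which matters when a workflow step polls the status.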
Design decisions and challenges
EC2 instance and private subnet connectivity
Initial design: Construct the VPC without an Internet Gateway, NAT Gateway, or public subnets. Instance connectivity could still be established using AWS Systems Manager Session Manager. Ansible has a connection plugin that uses Session Manager instead of SSH to connect to the instance. This would have required three VPC interface endpoints for Session Manager (AWS documentation) and one VPC gateway endpoint for S3. This solution is cheaper and more secure than using a NAT Gateway.
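To make the endpoint count concrete, the sketch below lists the endpoints this initial design would need in a given region. It assumes the usual `com.amazonaws.<region>.<service>` naming for endpoint services; the helper name and the example region are illustrative, and the project itself defines its infrastructure in Terraform rather than Python.

```python
from __future__ import annotations

# Interface endpoints used by Session Manager.
INTERFACE_SERVICES = ("ssm", "ssmmessages", "ec2messages")
# Gateway endpoint for the playbook zip and command output in S3.
GATEWAY_SERVICES = ("s3",)


def endpoint_plan(region: str) -> list[dict]:
    """Return one entry per VPC endpoint required in the given region."""
    plan = []
    for service in INTERFACE_SERVICES:
        plan.append(
            {"service_name": f"com.amazonaws.{region}.{service}", "type": "Interface"}
        )
    for service in GATEWAY_SERVICES:
        plan.append(
            {"service_name": f"com.amazonaws.{region}.{service}", "type": "Gateway"}
        )
    return plan


if __name__ == "__main__":
    for endpoint in endpoint_plan("eu-west-1"):
        print(endpoint["type"], endpoint["service_name"])
```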
First problem faced: I came across this issue while executing a simple Ansible playbook with only one task to gather facts. At the time of writing this article, the issue is still open. Since I did not want to risk facing this issue with a random step in the actual Ansible role, either now or in the future, I decided to abandon the Session Manager connection plugin.
Solution: I decided to use AWS Systems Manager Run Command with the AWS-ApplyAnsiblePlaybooks document to execute the Ansible playbook on the instance.
Second problem faced: AWS-ApplyAnsiblePlaybooks has a few dependencies that need to be installed. Since the VPC had no internet connectivity, there was no way of installing the dependencies directly using apt. With an S3 gateway endpoint, apt can still download packages if the repositories are hosted on S3. Unfortunately, Canonical doesn’t currently have an official S3 mirror, and AWS’s default repository is not hosted on S3.
Solution: I changed the architecture to use a NAT Gateway instead of VPC endpoints. In the long term, I am considering a temporary private apt repository hosted on S3, or some other solution that would not require internet connectivity.
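As a rough illustration of what such a Run Command request looks like, the sketch below builds the keyword arguments for an SSM SendCommand call. The parameter names follow my understanding of the AWS-ApplyAnsiblePlaybooks document and should be checked against the document in your own account; the helper name, bucket, keys, and output prefix are hypothetical and not taken from the project.

```python
import json


def build_run_command_request(instance_id, bucket, playbook_zip_key,
                              playbook_file, output_prefix="ssm-output"):
    """Build keyword arguments for SSM SendCommand.

    Every document parameter is passed as a list of strings, which is
    the shape the SendCommand API expects.
    """
    # The playbook zip is fetched from S3 (assumed URL form).
    source_info = {"path": f"https://s3.amazonaws.com/{bucket}/{playbook_zip_key}"}
    return {
        "DocumentName": "AWS-ApplyAnsiblePlaybooks",
        "InstanceIds": [instance_id],
        "Parameters": {
            "SourceType": ["S3"],
            "SourceInfo": [json.dumps(source_info)],
            # Let the document install Ansible and its dependencies.
            "InstallDependencies": ["True"],
            "PlaybookFile": [playbook_file],
            # "False" applies the changes instead of doing a dry run.
            "Check": ["False"],
        },
        # Command output is written to the same bucket.
        "OutputS3BucketName": bucket,
        "OutputS3KeyPrefix": output_prefix,
    }


if __name__ == "__main__":
    request = build_run_command_request(
        "i-0123456789abcdef0", "example-bucket", "ansible/playbook.zip", "playbook.yml"
    )
    print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, the dictionary could be passed on as `boto3.client("ssm").send_command(**request)`.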
GitHub Actions workflow design
The repository has two GitHub Actions workflows.
Code Checks workflow: It performs some basic linting and executes a terraform plan step to review the infrastructure being created. It is executed for internal pull requests and when code is pushed to the main branch.
Ubuntu 20.04 CIS AMI Baker workflow: It performs the main AMI creation, following the steps detailed in the Architecture section. Since we don’t want a separate AMI to be created for every code change pushed to the main branch, it is triggered manually using the Workflow Dispatch event. This also has the advantage of accepting input parameters that decide which part of the workflow is triggered.
- The first part of the workflow creates the infrastructure and starts Ansible playbook execution.
- The second part of the workflow creates the AMI and deletes the infrastructure.
Running the workflow in two parts:
- Allows manual changes to the instance before the AMI is created.
- Reduces GitHub Actions execution time since the second part doesn’t have to wait for the Ansible playbook execution to complete.
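The two-part split can be pictured as a simple lookup from a dispatch input to the steps that run. This is only a sketch: the input values `provision` and `bake` and the step descriptions are hypothetical, and the actual workflow expresses this logic in GitHub Actions YAML rather than Python.

```python
from __future__ import annotations

# Hypothetical input values mapped to the steps each part runs.
STEPS_BY_PART = {
    "provision": ["terraform apply", "start Ansible playbook via Run Command"],
    "bake": ["create AMI from instance", "terraform destroy"],
}


def steps_for(part: str) -> list[str]:
    """Return the ordered steps for the requested workflow part."""
    try:
        # Copy the list so callers cannot mutate the shared mapping.
        return list(STEPS_BY_PART[part])
    except KeyError:
        raise ValueError(f"unknown workflow part: {part!r}") from None


if __name__ == "__main__":
    for part in ("provision", "bake"):
        print(part, "->", ", ".join(steps_for(part)))
```

Because the second part starts from a separate manual trigger, no runner has to stay alive while the playbook executes on the instance.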