The Path to Microservices CI/CD Nirvana
What is CI/CD Nirvana?
Stackchat is a complicated platform from a DevOps perspective. It contains dynamically provisioned infrastructure, multiple AWS accounts per environment, dozens of Lambda-powered and Fargate-powered APIs and many other idiosyncrasies. For us, microservices CI/CD nirvana looks like the following:
- The power to scale infrastructure up and down quickly, easily and without downtime.
- The ability to easily spin up new platform instances for feature branches, partner testing and new geographic regions.
- Inexpensive infrastructure, utilizing automated cost saving mechanisms such as spot instances and scheduled cluster shutdowns and resizing.
- Fully-automated, platform-wide code deployments completed in under an hour.
- Automation of all repetitive ops processes such as certificate renewal, backups, version cleanups, security hardening and git repository maintenance.
- Empowered developers creating and publishing new API services without involving the DevOps team.
There is plenty more exciting work planned over the next 12 months and the journey is far from over, but we are really happy with the current setup and feel that it's a good time to share some of our learnings.
Why we chose Buildkite + Ansible + Cloudformation to deploy our Infrastructure
Over the course of 24 months, we have built a global chat platform, with a focus on security, scalability and auditability.
We were able to build our platform and scale it out with a single-person DevOps team, avoiding many of the common pitfalls. This was largely thanks to a new breed of excellent DevOps tooling that favours simplicity over features, and to strict adherence to good engineering principles.
In this post I will describe those engineering principles and lay out the strengths and weaknesses of the tools we chose.
I will also give some example code and screenshots at the end, with deep-dives on specific tasks and pipelines to follow in subsequent posts.
The engineering principles we live by
DRY
"Don't Repeat Yourself" or the DRY principle is stated as, "Every piece of knowledge or logic must have a single, unambiguous representation within a system."
Not only does DRY make code more efficient and readable, but it also encourages coders to observe other best practice conventions. For example, DRY motivates us to create reusable modules or templates which can be used and improved by other members of the team. The practice also encourages us to look for the most elegant solutions to any problems we encounter.
KISS
“Keep it simple, stupid” is thought to have been coined by the late Kelly Johnson, lead engineer at the Lockheed Skunk Works, the outfit responsible for the SR-71 Blackbird spy plane amongst many other notable achievements. Kelly explained the idea with a simple story: he told the designers at Lockheed that whatever they made had to be something that could be repaired by a man in a field with some basic mechanic’s training and simple tools.
To us, this translates directly to infrastructure design. All tooling should be easily maintained and modified by junior engineers with minimal assistance.
This can also be seen as a counterpoint to dogmatically adhering to DRY. Before creating a new module or script, we need to ask ourselves if it provides enough advantages to justify the increased complexity.
YAGNI
YAGNI stands for "You Aren't Gonna Need It". The YAGNI principle prevents over-engineering by limiting the development of speculative future features because, more likely than not, "you aren't gonna need it".
Designing for hypothetical future use cases (especially around performance optimization) is a common pitfall in DevOps that leads to wasted time and bloated code bases that are hard to maintain. Remember, features are easy to add but hard to remove, so only build what you need right now.
Principle Of Least Astonishment
This principle means that your code should be intuitive and obvious, and should not surprise another engineer reviewing it. If you hear your teammate muttering "what the f*!$%?" under their breath after you send them a PR, you are most likely in breach of this principle.
For variables, modules, roles, etc. your naming should always reflect the component's purpose, striking a balance between wordy and ambiguous, and the logic you create should be easy to follow.
Don't Over-Engineer
This one is especially close to our hearts and encapsulates most of the previous principles.
The pernicious effects of over-engineering can cripple an engineering team as a business scales, especially if it happens quickly.
Keeping your infrastructure code small and modular, and not reinventing functionality already present in your chosen tooling (RTM before writing a new module), will allow your infrastructure to scale exponentially without requiring your team to.
Some common over-engineering crimes and their outcomes:
Abstraction
The temptation to wrap wrappers in wrapped wrappers is often too strong to resist for engineers in our industry. Abstraction is essential and exists at every level of an application; however, it should be used sparingly to avoid the compounding costs involved.
It may save you some future typing to use the latest so-hot-right-now-on-hacker-news tool to avoid manually creating a new module. However, when it mysteriously breaks your integration pipeline six months down the track after a seemingly unrelated package update, and all traces of the tool and its author have disappeared off the internet, you may regret having added so many layers of indirection and abstraction to your stack.
State
Avoid state when you can design a system without it. Like abstraction, state is everywhere and is a core building block of applications; but, also like abstraction, too much of it increases complexity and adds to your cumulative technical debt and maintenance burden.
Microservices vs Monolith
Before we get into the tooling we chose, it's worth briefly addressing our decision to go with a microservices (vs. monolith) architecture. Both approaches have strengths and weaknesses, the finer points of which we will discuss in a future post. For the sake of brevity here, I'll just say that the added complexity of a microservices architecture was worth it for this project: it has given us the agility to quickly pivot and scale out the platform in a way that would be hard to imagine with a monolith, short of a significantly larger team.
If you do decide to adopt a microservices architecture, this post will hopefully give you an idea of the type of investment you can expect to make in your CI/CD processes in order to successfully build, scale and maintain your infrastructure.
Buildkite, Ansible, and Cloudformation
In this section I'll briefly describe our chosen tooling and will then deep dive into the strengths and weaknesses of each one, providing comparisons to other popular tooling.
As with all tooling assessments, the strengths and weaknesses I've outlined are highly subjective and may not ring true with you, but hopefully there is value in sharing our decision making process and subsequent outcomes.
Buildkite
Buildkite is a CI and build automation tool that combines the power of your own build servers with the convenience of a managed, centralized web UI. Buildkite allows us to automate complicated delivery pipelines and it gives us crazy levels of flexibility around custom checkout logic and dynamically building pipelines as part of a build.
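As one hedged illustration of that dynamic behaviour, a pipeline's only static step can generate the rest of the pipeline at run time and feed it back to the agent (`buildkite-agent pipeline upload` is Buildkite's documented mechanism for this; the generator script here is hypothetical):

```yaml
# A minimal sketch of a dynamically built pipeline. The hypothetical
# generate-pipeline.sh prints pipeline YAML to stdout, which the agent
# then appends to the running build.
steps:
  - label: ":pipeline: Generate pipeline"
    command: ./scripts/generate-pipeline.sh | buildkite-agent pipeline upload
```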
Ansible
Ansible is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced DevOps tasks such as continuous deployments or zero downtime rolling updates.
Ansible’s goals are foremost those of simplicity and maximum ease of use. This approach has resulted in it recently overtaking Chef to become the most popular configuration management tool in the world.
Cloudformation
AWS Cloudformation allows us to use programming languages or simple config files to model and automatically provision all the AWS resources needed for Stackchat across all our supported regions and accounts.
This gives us a single source of truth for our infrastructure that can live in an API's code base next to the application code, empowering developers to make infrastructure changes alongside their regular commits.
Buildkite
In a crowded market Buildkite distinguishes itself by being simple, fast and intelligently designed.
Strengths
Simple
Like Ansible, Buildkite pipelines are configured with YAML. A common configuration language is a big plus for us.
Compared to venerable stalwarts of CI/CD such as Jenkins and Bamboo, Buildkite has a fraction of the features and plugins. For us this is actually a huge plus, as we like to do everything with Ansible and use the CI solely to bootstrap it. This lack of bloat makes it a joy to use in this fashion.
Intelligently Designed
While Buildkite may be light on features, it isn't missing anything we need. This is no mean feat and is testament to a company with great engineering that listens to its customers.
They may not have a plugin to integrate with HP Operations Orchestration, or a Skype Notifier, or 1714 other plugins like Jenkins. They do, however, provide an excellent Elastic CI Stack for AWS codebase on GitHub that lets you easily run an autoscaling fleet of build agents on AWS Spot Instances, scaling all the way down to zero when not in use. This gives us near-infinite horizontal scale for large deploys, overnight maintenance, and integration and load testing, at very low cost and with very little engineering investment on our end.
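For a sense of how little configuration that takes, these are the kinds of template parameters involved when deploying the stack (illustrative values only; parameter names change between stack versions, so check the current elastic-ci-stack-for-aws template):

```yaml
# Illustrative Elastic CI Stack parameter values, not our real config.
MinSize: 0             # scale the agent fleet to zero when idle
MaxSize: 50            # burst out for large deploys and load tests
SpotPrice: "0.12"      # bid for EC2 Spot Instances instead of on-demand
InstanceType: c5.large
```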
Hosted
As a serverless company, a hosted CI service is essential for us, and Buildkite's speed and uptime have been flawless.
Open Source
Buildkite is also open source, which aligns with our company values and future goals. They host all of their code publicly on GitHub, even their own website!
Support
Buildkite's support desk is top notch.
There are no levels to work your way through to get to someone who can help, and no attempts to deflect blame. You reach a knowledgeable engineer straight away and, more often than not, get the solution first time. If there's no immediate solution, they go out of their way to help with workarounds. If it's a bug in their system, they tell you straight up and they fix it.
They also provide a weekly community summary via email containing announcements and features, as well as help requests from the forum. It has been surprisingly helpful to see other companies' issues and how they solved them.
Company Culture
While not a technical requirement, it's nice to work with companies whose values align with yours. Their values, from their website:
- Transparency
- Empathy
- Quality
- Collaboration
- Diversity
- Sustainable Growth
- Independence
Weaknesses
Lack of Features and Plugins
While this was a plus for us, people who prefer batteries-included solutions should look elsewhere.
Example Buildkite pipeline
This simple pipeline is the standard one we use for our TypeScript Node APIs. It is triggered on git commits to each individual microservice repository.
```yaml
---
steps:
  - label: ":ansible: Run stack_node.yml playbook"
    command:
      cd environment-automation/ansible/ &&
      ansible-playbook
      -e product=io
      stack_node.yml
```
Our shared automation and configuration code lives in the environment-automation repository, which is accessed as a git submodule in all our other repositories. This allows us to test changes to our automation code in a specific branch/environment of a microservice without affecting others. These changes can then either be merged into integration via a pull request, or discarded.
Our pipelines simply bootstrap Ansible with a single `product` fact, which is our classifier for the different products we build. The rest of the facts, such as the AWS account ID, the environment and the Cloudformation stack name, are then generated by Ansible based on the git branch and pipeline name.
This is demonstrated in the image below. The build was triggered by a pull request into the integration branch of our IO User API. Based on the product `io`, the branch `integration` and the `user-api` repo name, Ansible generates the IoIntegrationUserApi stack name and creates/updates that stack in the dev AWS account, where our Integration environment lives.
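As a rough sketch of how that fact generation can work (the task below is illustrative, not our actual get_facts role; BUILDKITE_BRANCH and BUILDKITE_PIPELINE_SLUG are standard Buildkite agent environment variables):

```yaml
# Illustrative only: deriving deployment facts from the Buildkite build
# context. The fact names and naming scheme are assumptions for this sketch.
- name: Derive environment and stack name from the build context
  set_fact:
    env: "{{ lookup('env', 'BUILDKITE_BRANCH') }}"
    stack_name: "{{ product|capitalize }}{{ lookup('env', 'BUILDKITE_BRANCH')|capitalize }}{{ lookup('env', 'BUILDKITE_PIPELINE_SLUG')|replace('-', ' ')|title|replace(' ', '') }}"
```

With `product=io`, branch `integration` and pipeline `user-api`, a scheme like this yields `IoIntegrationUserApi`.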
Ansible
The Swiss Army Knife of Continuous Integration, Continuous Deployment and Configuration Management.
Strengths
Simple to learn
Ansible playbooks are imperative, rather than declarative. Most automation and infrastructure management tools work by declaring a state of configuration. With Ansible you define a series of steps that are executed in order. This makes Ansible much easier to learn for engineers coming from scripting backgrounds.
Writing Ansible is similar in many ways to scripting, supporting popular imperative programming constructs such as conditionals, loops, and dynamic includes. Ansible playbooks are written in YAML, which is extremely popular, being one of the easiest configuration languages to use.
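To make that concrete, here is a generic task (not from our codebase) showing loop and conditional constructs executing in order, just as a script would:

```yaml
# Generic illustration of imperative constructs: a loop over packages
# guarded by a conditional, run step by step like a script.
- name: Install build dependencies on Debian-family agents
  package:
    name: "{{ item }}"
    state: present
  loop:
    - git
    - jq
  when: ansible_os_family == "Debian"
```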
Declarative Modules
While automation code written in Ansible is written in simple imperative playbooks and roles, the modules provided by Ansible work in a declarative fashion. This gives you the best of both worlds.
Agentless
Chef, Puppet, Saltstack and so on have a master and a client, and need to be installed and configured on both. Ansible only needs to be installed on the control machine; it communicates with the other nodes over SSH.
While this isn't a concern for us, since our infrastructure is serverless with Ansible running standalone (no master) on our build agents, it is worth mentioning.
Idempotent
When you write a playbook for configuring your nodes, Ansible first checks if the state defined by each step is different from the current state of the nodes and only makes changes if required. Therefore if a playbook is executed multiple times, it will still result in the same system state.
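A generic illustration: re-running this task is harmless, because the module inspects the file first and reports "ok" rather than "changed" when nothing needs doing:

```yaml
# Idempotent by design: lineinfile only edits the file if the line is
# missing or different, so repeated runs converge on the same state.
- name: Ensure root login over SSH is disabled
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^PermitRootLogin'
    line: 'PermitRootLogin no'
```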
Batteries Included
Ansible comes out-of-the-box ready to use, with everything you need to manage the infrastructure, networks, operating systems and services that you are already using via the 3000+ included modules.
These modules make it incredibly easy to perform complex tasks across virtually any Public Cloud and Private Infrastructure running any Operating System and Software Stack.
Multi Purpose
Unlike single-purpose orchestration tools like Terraform, Ansible supports orchestration and configuration management, as well as much more.
While one tool might do specific things better than another, applying the 80/20 rule and keeping your stack lean, at the expense of non-core functionality, pays huge dividends as your business scales.
Weaknesses
Speed
Ansible is slower than many other tools, possibly due to its serial nature and its "push" model, and this gets worse at scale. It isn't much of a problem for us, though, thanks to our masterless setup and the awesomeness of Buildkite.
Example Ansible playbook
This is the stack_node.yml Ansible playbook that was bootstrapped by Buildkite in the previous example.
```yaml
---
- hosts: all
  gather_facts: true
  roles:
    - add_groups
    - aws_sts_assume_role
    - get_facts
    - npm_token
    - node_test
    - role: node_build
      when: deploy == true
    - role: artifact_upload
      when: deploy == true
    - role: aws_cloudformation_deploy
      template: "{{bk_root}}/cloudformation_template.yml.j2"
      when: deploy == true
```
Lines 5-9 trigger the roles that set up the environment, gather all the variables based on the git branch, then run all the tests defined within the microservice codebase. These roles are executed every time there is a push to any branch in the codebase.
Lines 10-16 contain conditional roles, which only run on deploy-enabled git branches (e.g. dev, integration). These roles compile the code, upload it to S3, then deploy it to a Lambda function via Cloudformation.
The Cloudformation template for each microservice is collected by Ansible from the root of the git repository for that microservice. This allows us to use shared playbooks and pipelines and therefore add microservices to our stacks without our automation code sprawling.
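To sketch what a role like aws_cloudformation_deploy might boil down to (the internals below are assumptions for illustration, not our actual role; the template and cloudformation modules are standard Ansible):

```yaml
# A minimal sketch of an aws_cloudformation_deploy-style role. Variable
# names mirror the playbook above; the task bodies are assumed.
- name: Render the Jinja2 template collected from the microservice repo
  template:
    src: "{{ template }}"
    dest: "{{ bk_root }}/cloudformation_template.yml"

- name: Create or update the Cloudformation stack
  cloudformation:
    stack_name: "{{ stack_name }}"
    state: present
    template: "{{ bk_root }}/cloudformation_template.yml"
```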
Cloudformation
Amazon's infrastructure-as-code solution provides a feature-rich automation and deployment platform.
Strengths
Language Support
Cloudformation allows you to write templates in YAML or JSON. If you would prefer a full-fat programming language, the recently released Cloud Development Kit lets you define your application using TypeScript, Python, Java, or .NET.
Vendor Support
You can orchestrate infrastructure in AWS using external tooling such as Terraform, Salt, Puppet and even Ansible via modules. While this does work and is an approach preferred by many, it violates several of our engineering principles and can lead to significant problems:
Complexity
The level of abstraction that makes tools like Terraform attractive to many newcomers inevitably leads to sprawling codebases that are hard to maintain and even harder to uplift.
Many are also stateful. Terraform, for example, uses a state file (why???) and requires complex state surgery when importing resources or resolving conflicts.
Features
Newly released Amazon products are immediately available in Cloudformation. When using external tooling you have to wait for someone to write (and hopefully test) a new module to support the product.
Stability
When you deploy a change via Cloudformation that requires a new resource, Amazon creates the new resource and waits until it is available and healthy before seamlessly replacing and deleting the old one. Rollbacks are easy, because the old resources are kept until the cleanup stage of the deployment, and the approach reliably produces zero-downtime deploys. While some prefer tools like Terraform for the speed of their in-place modifications, the ability to avoid downtime and roll back automatically is more important to us.
Extensibility
The recently released AWS Cloudformation Registry means you can now define third-party resources in Cloudformation. For us, being able to define our Datadog alerts in the Cloudformation template of the microservice they monitor is a huge win.
Dependency Management
AWS Cloudformation automatically manages dependencies between your resources during stack management actions. You do not need to specify the order in which resources are created, updated, or deleted; Cloudformation determines the correct sequence of actions for each resource when performing stack operations.
This, along with the imperative nature of Ansible, means we get to completely avoid dependency management, which is a huge maintenance overhead in many other tools such as Puppet.
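As a generic illustration (not from our stack), referencing one resource from another is all the ordering information Cloudformation needs:

```yaml
# The !Ref creates an implicit dependency, so Cloudformation creates
# MyBucket before MyBucketPolicy without any explicit ordering.
Resources:
  MyBucket:
    Type: AWS::S3::Bucket
  MyBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref MyBucket
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action: s3:GetObject
            Resource: !Sub arn:${AWS::Partition}:s3:::${MyBucket}/*
```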
Weaknesses
Slow
Due to the deployment style of Cloudformation, which preferences zero-downtime and rollbacks over speed, it is slower than tools that do in-place updates.
(Mostly) Single Vendor
While AWS has recently introduced the Cloudformation Registry, it isn't for deploying to other clouds like Azure and GCP. For multi-cloud, look elsewhere.
Example Cloudformation template
This is the cloudformation_template.yml.j2 that lives in the root of the user-api repository and is executed via the aws_cloudformation_deploy role in the Ansible playbook. Keeping infrastructure code next to application code makes our platform easy to reason about and maintain.
We prefer a combination of Ansible Inventory and the AWS SSM Parameter store over the Cloudformation Parameter Mappings you will see in many example templates on the web.
Ansible uses Jinja2 for its templating, including all the default filters as well as many excellent Ansible-specific filters, which adds a lot of functionality and lets us simplify our templates. Before submitting the template to the Cloudformation SDK, Ansible evaluates the Jinja2 template and writes it out to cloudformation_template.yml, performing any transformations and replacing variable references (double curly braces such as `{{var}}`) with parameters from SSM or facts from Ansible.
Parameters pulled from SSM, such as `{{IoAppTableArn}}`, are scoped to the environment, which keeps environment permissions sandboxed and prevents cross-environment pollution.
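As an illustration of how such a value can reach the template (the environment-prefixed parameter path layout is an assumption for this example), Ansible's aws_ssm lookup can resolve environment-scoped parameters into facts before rendering:

```yaml
# Illustrative: resolving an environment-scoped SSM parameter into an
# Ansible fact prior to template rendering. The path layout is assumed.
- name: Pull the app table ARN for this environment from SSM
  set_fact:
    IoAppTableArn: "{{ lookup('aws_ssm', '/' + env + '/IoAppTableArn') }}"
```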
This is one of our most complex templates, as the stack contains multiple APIs on both Lambda, which is our default, and Fargate, which we use for the IO product APIs that require more speed.
```yaml
#jinja2: trim_blocks: True, lstrip_blocks: True
---
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  IoUserApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: Live
      MethodSettings:
        - HttpMethod: '*'
          ResourcePath: /*
          LoggingLevel: INFO
          DataTraceEnabled: true
          MetricsEnabled: true
      DefinitionBody:
        swagger: 2.0
        info:
          title: {{stack_name}}
        x-amazon-apigateway-policy:
          Version: 2012-10-17
          Statement:
            - Effect: Allow
              Principal: '*'
              Action:
                - execute-api:Invoke
              Resource: execute-api:/*
        securityDefinitions:
          sigv4:
            type: apiKey
            name: Authorization
            in: header
            x-amazon-apigateway-authtype: awsSigv4
          tenant-id-authorizer:
            type: apiKey
            name: Authorization
            in: header
            x-amazon-apigateway-authtype: custom
            x-amazon-apigateway-authorizer:
              authorizerResultTtlInSeconds: 300
              authorizerUri:
                !Sub arn:${AWS::Partition}:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/{{AuthorizerArn}}/invocations
              type: token
        paths:
          /admin/apps/{appId}/users:
            options:
              x-amazon-apigateway-integration:
                httpMethod: POST
                type: aws_proxy
                uri: !Sub arn:${AWS::Partition}:apigateway:{{cf_region}}:lambda:path/2015-03-31/functions/${IoUserAdminApiFn.Arn}/invocations
            x-amazon-apigateway-any-method:
              security:
                - tenant-id-authorizer: []
              x-amazon-apigateway-integration:
                httpMethod: POST
                type: aws_proxy
                uri: !Sub arn:${AWS::Partition}:apigateway:{{cf_region}}:lambda:path/2015-03-31/functions/${IoUserAdminApiFn.Arn}/invocations
          /admin/apps/{appId}/users/{proxy+}:
            options:
              x-amazon-apigateway-integration:
                httpMethod: POST
                type: aws_proxy
                uri: !Sub arn:${AWS::Partition}:apigateway:{{cf_region}}:lambda:path/2015-03-31/functions/${IoUserAdminApiFn.Arn}/invocations
            x-amazon-apigateway-any-method:
              security:
                - tenant-id-authorizer: []
              x-amazon-apigateway-integration:
                httpMethod: POST
                type: aws_proxy
                uri: !Sub arn:${AWS::Partition}:apigateway:{{cf_region}}:lambda:path/2015-03-31/functions/${IoUserAdminApiFn.Arn}/invocations
          /sdk/apps/{appId}/users/{proxy+}:
            options:
              parameters:
                - name: appId
                  in: path
                  required: true
                  type: string
                - name: proxy
                  in: path
                  required: true
                  type: string
              responses: {}
              x-amazon-apigateway-integration:
                uri: https://{{IoClusterServiceApiUrl}}/users/sdk/apps/{appId}/users/{proxy}
                responses:
                  default:
                    statusCode: 200
                requestParameters:
                  integration.request.path.appId: method.request.path.appId
                  integration.request.path.proxy: method.request.path.proxy
                passthroughBehavior: when_no_match
                httpMethod: ANY
                type: http_proxy
            x-amazon-apigateway-any-method:
              produces:
                - application/json
              parameters:
                - name: appId
                  in: path
                  required: true
                  type: string
                - name: proxy
                  in: path
                  required: true
                  type: string
              responses: {}
              security:
                - sigv4: []
              x-amazon-apigateway-integration:
                uri: https://{{IoClusterServiceApiUrl}}/users/sdk/apps/{appId}/users/{proxy}
                responses:
                  default:
                    statusCode: 200
                requestParameters:
                  integration.request.path.appId: method.request.path.appId
                  integration.request.path.proxy: method.request.path.proxy
                  integration.request.header.x-cognito-identity-id: context.identity.cognitoIdentityId
                  integration.request.header.x-cognito-identity-pool-id: context.identity.cognitoIdentityPoolId
                passthroughBehavior: when_no_match
                httpMethod: ANY
                type: http_proxy

  IoUserApiBasePathMapping:
    Type: AWS::ApiGateway::BasePathMapping
    DependsOn: IoUserApiLiveStage
    Properties:
      BasePath: users
      DomainName: {{IoSharedApiGatewayUrl}}
      RestApiId: !Ref IoUserApi
      Stage: Live

  IoUserAdminApiFnLambdaExecRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - lambda.amazonaws.com
      Path: /
      Policies:
        - PolicyName: AttachedPolicy
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: dynamodb:*
                Resource:
                  - {{IoAppTableArn}}
                  - {{IoAppTableArn}}/index/*
              - Effect: Allow
                Action:
                  - ssm:GetParametersByPath
                  - ssm:GetParameters
                  - ssm:GetParameter
                Resource: !Sub arn:${AWS::Partition}:ssm:{{cf_region}}:{{cf_account}}:parameter/{{env}}/*
              - Effect: Allow
                Action: ssm:DescribeParameters
                Resource: '*'
              - Effect: Allow
                Resource: '*'
                Action:
                  - xray:PutTelemetryRecords
                  - xray:PutTraceSegments
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

  IoUserAdminApiFn:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: {{code_uri}}
      Role: !GetAtt IoUserAdminApiFnLambdaExecRole.Arn
      Runtime: nodejs10.x
      Handler: adminHandler/adminHandler.adminHandler
      Timeout: 300
      MemorySize: 1024
      AutoPublishAlias: live

  InvokeIoUserAdminApiPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref IoUserAdminApiFn
      Principal: apigateway.amazonaws.com

  IoUserSdkApiRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
      Path: /
      Policies:
        - PolicyName: AttachedPolicy
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: dynamodb:*
                Resource:
                  - {{IoAppTableArn}}
                  - {{IoAppTableArn}}/index/*
              - Effect: Allow
                Action:
                  - xray:PutTraceSegments
                  - xray:PutTelemetryRecords
                  - ecr:*
                  - sts:AssumeRole
                  - iam:GetRole
                  - iam:PassRole
                Resource: '*'
              - Effect: Deny
                Action:
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
              - Effect: Allow
                Action:
                  - ssm:GetParameterHistory
                  - ssm:GetParametersByPath
                  - ssm:GetParameters
                  - ssm:GetParameter
                Resource: !Sub arn:${AWS::Partition}:ssm:{{cf_region}}:{{cf_account}}:parameter/{{env}}/*
              - Effect: Allow
                Action: ssm:DescribeParameters
                Resource: '*'
              - Effect: Allow
                Action: iot:Publish
                Resource:
                  !Sub arn:${AWS::Partition}:iot:${AWS::Region}:${AWS::AccountId}:topic/live-chat/*
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonEC2ContainerServiceAutoscaleRole

  IoUserSdkApiTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Cpu: '{{ecs_cpu_multiplier*256}}'
      Memory: '{{ecs_mem_multiplier*512}}'
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - {{ecs_launch_type}}
      ExecutionRoleArn: !GetAtt IoUserSdkApiRole.Arn
      TaskRoleArn: !GetAtt IoUserSdkApiRole.Arn
      ContainerDefinitions:
        - Name: IoUserSdkApi
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref IoUserSdkApiLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: IoUserSdkApiService
          Image:
            !Sub ${AWS::AccountId}.dkr.ecr.${AWS::Region}.${AWS::URLSuffix}/{{IoUserApiEcrRepositoryId}}:{{container_name}}
          PortMappings:
            - ContainerPort: {{cluster_service_port}}

  IoUserSdkApiService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: {{IoEcsClusterId}}
      LaunchType: {{ecs_launch_type}}
      DesiredCount: {{ecs_container_multiplier}}
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
      NetworkConfiguration:
        AwsvpcConfiguration:
          SecurityGroups:
            - {{IoContainerSecurityGroup}}
          Subnets:
            - {{PrivateSubnetA}}
            - {{PrivateSubnetB}}
            {% if PrivateSubnetC is defined %}
            - {{PrivateSubnetC}}
            {% endif %}
      TaskDefinition: !Ref IoUserSdkApiTaskDefinition
      LoadBalancers:
        - ContainerName: IoUserSdkApi
          ContainerPort: {{cluster_service_port}}
          TargetGroupArn: !Ref IoUserSdkApiLoadBalancerTargetGroup

  IoUserSdkApiLoadBalancerTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 6
      HealthCheckPath: /healthcheck
      HealthCheckProtocol: HTTP
      HealthCheckPort: '{{cluster_service_port}}'
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      TargetType: ip
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: '30'
      Port: {{cluster_service_port}}
      Protocol: HTTP
      UnhealthyThresholdCount: 2
      VpcId: {{VPC}}

  IoUserSdkApiLoadBalancerListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      Actions:
        - TargetGroupArn: !Ref IoUserSdkApiLoadBalancerTargetGroup
          Type: forward
      Conditions:
        - Field: path-pattern
          Values:
            - /users
            - /users/*
      ListenerArn: {{IoSharedSdkApiLoadBalancerListenerArn}}
      Priority: 1

{% if 'prd' not in acc %}
  AutoScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MinCapacity: {{ecs_container_multiplier}}
      MaxCapacity: {{ecs_container_multiplier*2}}
      ResourceId: !Sub service/{{IoEcsClusterId}}/${IoUserSdkApiService.Name}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt IoUserSdkApiRole.Arn
      ScheduledActions:
        - ScalableTargetAction:
            MinCapacity: 0
            MaxCapacity: 0
          Schedule: {{ecs_scale_in_schedule|default(omit)}}
          ScheduledActionName: ScaleIn
        - ScalableTargetAction:
            MinCapacity: {{ecs_container_multiplier}}
            MaxCapacity: {{ecs_container_multiplier*2}}
          Schedule: {{ecs_scale_out_schedule|default(omit)}}
          ScheduledActionName: ScaleOut
{% endif %}
```
Lines 7-122 define an API Gateway with both Lambda and Fargate powering different endpoints:
/sdk endpoints
- API key authorization via sigv4
- ECS Fargate cluster backend
/admin endpoints
- Cognito authorization via a Lambda authorizer function
- Lambda function backend
Lines 124-131 attach our API Gateway to a shared domain name for this environment, under the /users path
Lines 133-171 create a role for the admin api function, with limited access to the specific resources it needs.
Lines 173-182 create the /admin backend function
Lines 184-189 grant API Gateway permission to invoke the /admin Lambda function
Lines 191-244 create a role for the /sdk Fargate task
Lines 246-324 create a Fargate task definition, service and load balancer rule, and attach them to the shared ECS cluster and load balancer for the environment
Lines 326-347 create an autoscaling schedule (individual schedules are defined in Ansible inventory) to shut down non-prod Fargate services outside of business hours
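For completeness, the inventory side of those schedules might look like this (the cron expressions are examples only; the variable names match the template above):

```yaml
# Example inventory values for the non-prod scale-in/scale-out schedules.
# Application Auto Scaling expects six-field cron expressions (UTC).
ecs_scale_in_schedule: cron(0 9 * * ? *)      # scale to zero at 09:00 UTC
ecs_scale_out_schedule: cron(0 21 * * ? *)    # scale back up at 21:00 UTC
```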
Conclusion
While we are still on the journey to full CD, we are very happy with how the platform has been performing and with how easily non-DevOps developers have taken to the codebase. In most cases developers have been able to create and modify APIs and pipelines with little to no assistance. It's hard to overstate the velocity boost this gives our team, and the productivity bump that comes from empowering developers to better understand the infrastructure their code runs on.
Hopefully, if you've read this far, you found some value in this overview of our experiences. If you are looking for more details on any specific part of our stack, or just want to nerd out about tooling, hit me up on the LinkedIn link at the top of the post.