class: center, middle, title-slide, title-slide #
EGRESS ## without cold start penalties ### Elliot Murphy ### 2019/01/08 (updated: 2019-01-09) --- # GOALS * .large[Build/operate a fintech API with no EC2 servers (no patching, no SSH access)] * .large[Traffic pattern is relatively low frequency, each request is important (hitting a cold start on a request to transfer funds would be unacceptable)] * .large[Integrate with existing technical control schemes to build trust with other groups in the new system and technology] ??? The context for this was working in a large bank. These kinds of organizations not only have legacy technology, they often have grown by buying other companies, and so there are many different groups that may have different opinions, goals, and timescales. --- # TRANSLATION TIME - No EC2 Servers means Lambda functions. Also possible - ECS services running on Fargate - Since each request is important cold start times are crucial - Building an API translation layer with existing control schemes - IP based whitelist at backend service - Need IP based whitelist capability on our service frontend --- # MORE CONSTRAINTS ## VPC makes cold start worse Measuring with X-Ray, the cold start times are affecting us. Trying the heartbeat hack to minimize, but still vulnerable - each simultaneous requests still penalized for cold start - Yan Cui published excellent analysis at https://hackernoon.com/im-afraid-you-re-thinking-about-aws-lambda-cold-starts-all-wrong-7d907f278a4f Important detail: using Lambda VPC makes cold start much worse due to ENI attach times - extra 10 seconds --- # MORE CONSTRAINTS ## API Gateway problems API GW doesn't support IP based whitelisting WAF does support IP based whitelisting, but you have to connect to CloudFront. Edge-optimized API gateway doesn't provide a way to connect WAF to the underlying CloudFront distribution. --- # INGRESS WAF with ingress IP whitelist flows into Custom CloudFront, which connects to Regional API GW which connects to standard (non-VPC) Lambda warmed by our own heartbeat lambda functions driven by AWS CloudWatch Events with schedule expressions --- background-image: url(natgw.png) background-position: 10% 50% background-size: 80% class: right, top # STABLE EGRESS IP ??? So this is the standard diagram off the AWS docs for a NAT Gateway We don't have any web servers or database servers. The important thing to note is the Elastic IP assigned to the NAT GW When we drop traffic into the private subnet, the route tables mean that traffic goes to the NAT Gateway, which modifies the source IP on the outgoing packet, and keeps track of the connections to be able to correctly route return traffic. So if we can get our traffic into this VPC, it will "originate" from the Elastic IP. NAT Gateways are a per-availability zone thing, so we'll want to set these up in multiple AZs, distribute our ECS service across those AZs, and then rely on an Elastic Load Balancer to spread traffic across the AZs --- background-image: url(envoy.png) background-size: 100% background-position: bottom class: center # ENVOY --- # CONNECTING In the Egress VPC, we defined an ECS/Fargate proxy service, implemented as a container running https://www.envoyproxy.io/ from Lyft (alternatives: nginx, haproxy) ECS service control plane runs the service inside the Egress VPC, at least one instance in each of the 3 AZs Load balancer accepts traffic and distributes across 3 AZs, health checking and routing around down proxies. --- # CONFIGURING ## Authentication is important Lambda functions read from EC2 Systems Manager Parameter store, retrieve an authentication token and add it in a custom HTTP header Lambda functions configured to send traffic to the ELB fronting the proxy cluster instead of the real backend service proxy service reads from Parameter Store also, and authenticates incoming requests --- # Envoy http filter in lua ```lua function envoy_on_request(request_handle) local path = request_handle:headers():get(":path") if (path == "/healthcheck") then request_handle:respond({[":status"] = "200", ["Content-Type"] = "application/json"}, '{"healthy": "yep"}\n') end local basicauth1 = 'Basic SECRET_PLACEHOLDER' local basicauth2 = 'Basic PREVIOUS_SECRET_PLACEHOLDER' local authorization = request_handle:headers():get("X-Proxy-Auth") if ((basicauth1 == authorization) or (basicauth2 == authorization)) then request_handle:headers():remove("X-Proxy-Auth") else request_handle:respond({[":status"] = "403", ["Content-Type"] = "application/json"}, '{"allowed": "nope"}\n') end end ``` ??? One thing to note here is that we made a little wrapper script in go that reads the auth keys from AWS EC2 Parameter store (both the current and previous version) and then templates them into the code when the service boots. This means that credential rotation is as easy as updating Parameter Store and restarting the services. --- # NOTES ## things that were helpful httpbin and requestbin useful for verifying that the forwarded requests look the way we expect used a custom request header for proxy auth so we could leave the Authentication header for the backend service in place undisturbed no measurable slowdown is a much better result than adding 10s to every cold start deployment to both ECS & Lambda is CodePipeline driving CloudFormation stack updates --- # NOTES 2 Envoy was picked to power AWS App Mesh, and is a key component of Istio, reinforcing selection over nginx Initially CloudFormation support for `AWS::ApiGateway::DomainName` did not expose regionalDomainName attribute, so a custom CloudFormation resource enabled connecting Regional API GW to CloudFront+WAF. No longer needed. wrapper utility in go handles templating the Envoy config file with secrets retrieved from Parameter Store --- # GOALS * .large[Build/operate a fintech API with no EC2 servers (no patching, no SSH access)] * .large[Traffic pattern is relatively low frequency, each request is important (hitting a cold start on a request to transfer funds would be unacceptable)] * .large[Integrate with existing technical control schemes to build trust with other groups in the new system and technology] ??? Reviewing those original goals, we were able to deal with the surprise IP restriction requirement in about 2 weeks end to end, and the system has been delightfully stable. --- background-image: url(lines.png) background-position: bottom -135px center background-size: 40% class: center, middle # "Tell me about the last bug you filed against your cloud provider." elliot@kindlyops.com