EGRESS

without cold start penalties

Elliot Murphy

2019/01/08 (updated: 2019-01-09)

1 / 15

GOALS

  • Build/operate a fintech API with no EC2 servers (no patching, no SSH access)

  • Traffic pattern is relatively low frequency, each request is important (hitting a cold start on a request to transfer funds would be unacceptable)

  • Integrate with existing technical control schemes to build trust with other groups in the new system and technology

2 / 15

The context for this was working in a large bank. These kinds of organizations not only have legacy technology, they often have grown by buying other companies, and so there are many different groups that may have different opinions, goals, and timescales.

TRANSLATION TIME

  • No EC2 servers means Lambda functions (ECS services running on Fargate are also possible)

  • Since each request is important, cold start times are crucial

  • Building an API translation layer with existing control schemes

    • IP based whitelist at backend service

    • Need IP based whitelist capability on our service frontend

3 / 15

MORE CONSTRAINTS

VPC makes cold start worse

Measuring with X-Ray shows that cold start times are affecting us. We are trying the heartbeat hack to minimize them, but we are still vulnerable: simultaneous requests are each still penalized with a cold start. Yan Cui published an excellent analysis at https://hackernoon.com/im-afraid-you-re-thinking-about-aws-lambda-cold-starts-all-wrong-7d907f278a4f

Important detail: running Lambda inside a VPC makes cold starts much worse due to ENI attach times, adding an extra 10 seconds

4 / 15

MORE CONSTRAINTS

API Gateway problems

API GW doesn't support IP based whitelisting

WAF does support IP based whitelisting, but you have to connect to CloudFront.

Edge-optimized API gateway doesn't provide a way to connect WAF to the underlying CloudFront distribution.

5 / 15

INGRESS

WAF with ingress IP whitelist flows into

Custom CloudFront, which connects to

Regional API GW which connects to

standard (non-VPC) Lambda warmed by our own heartbeat lambda functions driven by AWS CloudWatch Events with schedule expressions
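
A minimal sketch of how a warmed function might recognize a heartbeat invocation and return before doing any real work. The `heartbeat` payload field and the `handle` entry point are assumptions for illustration; the deck does not show what the heartbeat functions actually send.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// invocation holds the only field this sketch inspects.
// The "heartbeat" marker is a hypothetical convention, not an AWS field.
type invocation struct {
	Heartbeat bool `json:"heartbeat"`
}

// isHeartbeat reports whether the raw event carries the warm-up marker.
// Payloads that are not valid JSON are treated as ordinary requests.
func isHeartbeat(raw []byte) bool {
	var inv invocation
	if err := json.Unmarshal(raw, &inv); err != nil {
		return false
	}
	return inv.Heartbeat
}

// handle is a stand-in for the function's entry point: heartbeats return
// immediately so they keep the container warm without touching the backend.
func handle(raw []byte) string {
	if isHeartbeat(raw) {
		return "warm"
	}
	return "processed"
}

func main() {
	fmt.Println(handle([]byte(`{"heartbeat": true}`)))
	fmt.Println(handle([]byte(`{"path": "/transfer"}`)))
}
```

Returning early keeps the warm-up invocations cheap and ensures they never trigger a real backend call.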

6 / 15

STABLE EGRESS IP

7 / 15

So this is the standard diagram off the AWS docs for a NAT Gateway. We don't have any web servers or database servers. The important thing to note is the Elastic IP assigned to the NAT GW.

When we drop traffic into the private subnet, the route tables mean that traffic goes to the NAT Gateway, which modifies the source IP on the outgoing packet, and keeps track of the connections to be able to correctly route return traffic.

So if we can get our traffic into this VPC, it will "originate" from the Elastic IP.

NAT Gateways are per-Availability Zone resources, so we'll want to set these up in multiple AZs, distribute our ECS service across those AZs, and then rely on an Elastic Load Balancer to spread traffic across the AZs.
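
One way to confirm that traffic really "originates" from the Elastic IPs is to call an endpoint that echoes the caller's address and compare it with the expected set. The sketch below assumes an httpbin-style `/ip` reply of the form `{"origin": "..."}`; the addresses are documentation-range placeholders, not real Elastic IPs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// originReply mirrors the JSON shape of an httpbin-style /ip response.
type originReply struct {
	Origin string `json:"origin"`
}

// egressAllowed reports whether the address seen by the remote side is one
// of the Elastic IPs attached to the NAT Gateways.
func egressAllowed(body []byte, elasticIPs []string) (bool, error) {
	var r originReply
	if err := json.Unmarshal(body, &r); err != nil {
		return false, fmt.Errorf("decode origin reply: %w", err)
	}
	seen := strings.TrimSpace(r.Origin)
	for _, ip := range elasticIPs {
		if seen == ip {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// One placeholder Elastic IP per AZ.
	eips := []string{"203.0.113.10", "203.0.113.11", "203.0.113.12"}
	ok, err := egressAllowed([]byte(`{"origin": "203.0.113.11"}`), eips)
	fmt.Println(ok, err)
}
```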

ENVOY

8 / 15

CONNECTING

In the Egress VPC, we defined an ECS/Fargate proxy service, implemented as a container running Envoy (https://www.envoyproxy.io/), originally from Lyft (alternatives: nginx, haproxy)

ECS service control plane runs the service inside the Egress VPC, at least one instance in each of the 3 AZs

Load balancer accepts traffic and distributes across 3 AZs, health checking and routing around down proxies.

9 / 15

CONFIGURING

Authentication is important

Lambda functions read from AWS Systems Manager Parameter Store (formerly EC2 Systems Manager), retrieve an authentication token, and add it in a custom HTTP header

Lambda functions are configured to send traffic to the ELB fronting the proxy cluster instead of the real backend service

The proxy service also reads from Parameter Store, and authenticates incoming requests
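
A sketch of the caller side of this arrangement: the request targets the proxy's load balancer, carries the proxy credential in `X-Proxy-Auth`, and leaves `Authorization` for the backend. The function name, the hostname, and the token values are hypothetical; in the real system the token would come from Parameter Store rather than being passed in as a literal.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// proxyAuthHeader is the custom header checked by the Envoy filter.
const proxyAuthHeader = "X-Proxy-Auth"

// newProxiedRequest builds a request aimed at the proxy's load balancer
// instead of the real backend. proxyToken is the shared secret for the
// proxy; backendAuth is passed through untouched for the backend service.
func newProxiedRequest(method, proxyBase, path, proxyToken, backendAuth string) (*http.Request, error) {
	base, err := url.Parse(proxyBase)
	if err != nil {
		return nil, fmt.Errorf("parse proxy URL: %w", err)
	}
	target := base.ResolveReference(&url.URL{Path: path})
	req, err := http.NewRequest(method, target.String(), nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set(proxyAuthHeader, "Basic "+proxyToken)
	req.Header.Set("Authorization", backendAuth)
	return req, nil
}

func main() {
	req, err := newProxiedRequest(http.MethodPost,
		"https://egress-proxy.example.internal", "/v1/transfers",
		"example-proxy-token", "Bearer example-backend-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```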

10 / 15

Envoy HTTP filter in Lua

function envoy_on_request(request_handle)
  local path = request_handle:headers():get(":path")
  -- Answer load balancer health checks before the auth check; return so nothing below runs.
  if path == "/healthcheck" then
    request_handle:respond({[":status"] = "200", ["Content-Type"] = "application/json"}, '{"healthy": "yep"}\n')
    return
  end
  -- Placeholders are replaced with the current and previous secrets when the service boots.
  local basicauth1 = 'Basic SECRET_PLACEHOLDER'
  local basicauth2 = 'Basic PREVIOUS_SECRET_PLACEHOLDER'
  local authorization = request_handle:headers():get("X-Proxy-Auth")
  if (basicauth1 == authorization) or (basicauth2 == authorization) then
    -- Strip the proxy credential so it never reaches the backend.
    request_handle:headers():remove("X-Proxy-Auth")
  else
    request_handle:respond({[":status"] = "403", ["Content-Type"] = "application/json"}, '{"allowed": "nope"}\n')
  end
end
11 / 15

One thing to note here is that we made a little wrapper script in Go that reads the auth keys from AWS Systems Manager Parameter Store (both the current and previous version) and then templates them into the code when the service boots.

This means that credential rotation is as easy as updating Parameter Store and restarting the services.
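
The templating step could look roughly like the sketch below. It is not the actual wrapper: the `secrets` type and `render` function are hypothetical, and fetching the values from Parameter Store is left out so the example stays standard-library only. The two placeholder names are the ones that appear in the filter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// secrets holds the two credential versions the filter accepts.
type secrets struct {
	Current  string
	Previous string
}

// render substitutes both placeholders in a single pass. A single-pass
// replacer matters here because "SECRET_PLACEHOLDER" is a substring of
// "PREVIOUS_SECRET_PLACEHOLDER"; sequential replace-all calls could
// corrupt the longer placeholder.
func render(tmpl string, s secrets) (string, error) {
	for _, v := range []string{s.Current, s.Previous} {
		// Reject values that would break out of a single-quoted Lua string.
		if v == "" || strings.ContainsAny(v, "'\\\n") {
			return "", errors.New("secret is empty or unsafe inside a Lua string literal")
		}
	}
	r := strings.NewReplacer(
		"PREVIOUS_SECRET_PLACEHOLDER", s.Previous,
		"SECRET_PLACEHOLDER", s.Current,
	)
	return r.Replace(tmpl), nil
}

func main() {
	tmpl := "local basicauth1 = 'Basic SECRET_PLACEHOLDER'\n" +
		"local basicauth2 = 'Basic PREVIOUS_SECRET_PLACEHOLDER'\n"
	out, err := render(tmpl, secrets{Current: "example-new", Previous: "example-old"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Accepting both the current and the previous value is what lets callers and proxies be restarted in any order during a rotation.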

NOTES

things that were helpful

httpbin and requestbin were useful for verifying that the forwarded requests look the way we expect

used a custom request header for proxy auth so we could leave the Authorization header for the backend service in place undisturbed

no measurable slowdown is a much better result than adding 10s to every cold start

deployment to both ECS & Lambda is CodePipeline driving CloudFormation stack updates
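
The httpbin check mentioned above can be automated. The sketch below assumes an httpbin-style `/headers` reply of the form `{"headers": {...}}` and verifies two things: the proxy credential was stripped, and the backend's `Authorization` header survived. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// echoed mirrors the JSON shape of an httpbin-style /headers response.
type echoed struct {
	Headers map[string]string `json:"headers"`
}

// headerValue looks a header up case-insensitively, since the echo service
// may not preserve the original capitalization.
func headerValue(h map[string]string, name string) (string, bool) {
	for k, v := range h {
		if strings.EqualFold(k, name) {
			return v, true
		}
	}
	return "", false
}

// checkForwarded returns an error if the request seen by the backend still
// carries the proxy credential or has lost its own Authorization header.
func checkForwarded(body []byte) error {
	var e echoed
	if err := json.Unmarshal(body, &e); err != nil {
		return fmt.Errorf("decode echo reply: %w", err)
	}
	if _, leaked := headerValue(e.Headers, "X-Proxy-Auth"); leaked {
		return errors.New("X-Proxy-Auth reached the backend")
	}
	if v, ok := headerValue(e.Headers, "Authorization"); !ok || v == "" {
		return errors.New("Authorization header missing at the backend")
	}
	return nil
}

func main() {
	body := []byte(`{"headers": {"Authorization": "Bearer example", "Host": "httpbin.org"}}`)
	fmt.Println(checkForwarded(body))
}
```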

12 / 15

NOTES 2

Envoy was picked to power AWS App Mesh, and is a key component of Istio, reinforcing selection over nginx

Initially, CloudFormation support for AWS::ApiGateway::DomainName did not expose the RegionalDomainName attribute, so a custom CloudFormation resource was used to connect the Regional API GW to CloudFront+WAF. This is no longer needed.

wrapper utility in Go handles templating the Envoy config file with secrets retrieved from Parameter Store

13 / 15

GOALS

  • Build/operate a fintech API with no EC2 servers (no patching, no SSH access)

  • Traffic pattern is relatively low frequency, each request is important (hitting a cold start on a request to transfer funds would be unacceptable)

  • Integrate with existing technical control schemes to build trust with other groups in the new system and technology

14 / 15

Reviewing those original goals, we were able to deal with the surprise IP restriction requirement in about 2 weeks end to end, and the system has been delightfully stable.

"Tell me about the last bug you filed against your cloud provider."

elliot@kindlyops.com

15 / 15
