6 Things to know about AWS Elastic Beanstalk

Elastic Beanstalk is AWS’ PaaS, and it’s a powerful platform for accelerating application delivery in AWS’ cloud.  That said, I have come across a few things that have been stumbling blocks for some users of the service, especially as they try to deploy production workloads.  To help you leverage Beanstalk for production environments, I’ll go through the most common issues and some workarounds.

Security Groups

Beanstalk creates its own security groups so you don’t have to.  That’s convenient if you don’t want to create security groups, and an important concept to keep in mind.  While you can certainly modify the ingress and egress rules of these security groups, what you can’t easily do is use your own.  Beanstalk does allow you to add security groups to your EC2 instances through its interface, but you cannot remove the security group it created for the EC2 instances.  This is in contrast to AWS’ OpsWorks service, which asks whether you want to use OpsWorks security groups or not; by saying no, you can use your own custom security groups.  To use custom security groups with Beanstalk, you would do the following (a rough CLI sketch follows the list):

  • Create your custom security groups for the application layers
  • Create your Beanstalk environment
  • Add your application security groups to the Beanstalk environment configuration
  • Modify the ELB manually to use your security group instead of the Beanstalk security group (replacement)
  • Manually remove all ingress and egress rules in the security group Beanstalk created for your EC2 instances
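
For reference, here is a minimal sketch of those last steps with the AWS CLI.  The environment, load balancer, and security group IDs are placeholders, and it assumes a classic ELB fronting the environment, so treat it as a starting point rather than a recipe.

```
# Attach your custom security group to the Beanstalk EC2 instances
aws elasticbeanstalk update-environment \
  --environment-name my-env \
  --option-settings Namespace=aws:autoscaling:launchconfiguration,OptionName=SecurityGroups,Value=sg-11111111

# Swap the load balancer over to your own security group (classic ELB shown)
aws elb apply-security-groups-to-load-balancer \
  --load-balancer-name my-env-elb \
  --security-groups sg-22222222

# Remove the ingress rules from the group Beanstalk created (repeat per rule)
aws ec2 revoke-security-group-ingress \
  --group-id sg-33333333 \
  --protocol tcp --port 80 --cidr 0.0.0.0/0
```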

Configuration Management

Beanstalk does provide a hook to modify the underlying EC2 instances on creation.  Things like installing a log aggregation agent can be done through a feature called .ebextensions in your application code.  And that’s the catch: .ebextensions lives in your application code, so you are now bundling configuration management into your application repository.  Where the development team is small, this might make sense (one place for all code).  Once different teams manage system configuration and application features, they are sharing a single application repository.  Further, to test and deploy a system configuration change, you need to do an application release and test.  If your application is Java, for example, something as trivial as modifying the log aggregation agent’s configuration file entry means an application compilation, test and release.
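
For context, this is roughly what an .ebextensions file looks like.  It’s a minimal sketch (the package name and command are illustrative, not any specific agent’s install procedure) that lives in an .ebextensions/ folder at the root of your application bundle:

```
# .ebextensions/01-logs.config  (deployed with the application bundle)
packages:
  yum:
    awslogs: []          # example: install a log agent package
commands:
  01_enable_awslogs:
    command: "chkconfig awslogs on"
```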

Configuration Changes

Beanstalk has a wide set of configuration settings that provide a single place to make changes like the EC2 instance type, Auto Scaling group size, etc.  However, not all changes just work.  For example, moving from a single instance to a load balanced setup works well; moving from a load balanced setup back to a single instance can generate errors.  In any case, it takes time to apply a change.  Expect a significant wait for some environment changes, or just make sure you get the setup right the first time (for those values you are able to set during initial creation).  Finally, most changes require the termination and recreation of the EC2 instances.  This is a core concept for Beanstalk users to understand: the EC2 instances can and will be replaced as needed by Beanstalk.  More on that later.
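
As an illustration, a change like the instance type is just another option setting, but applying it will typically replace the instances.  The environment name and instance type below are placeholders:

```
# Change the instance type; expect Beanstalk to terminate and recreate the EC2 instances
aws elasticbeanstalk update-environment \
  --environment-name my-env \
  --option-settings Namespace=aws:autoscaling:launchconfiguration,OptionName=InstanceType,Value=m4.large
```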

Tagging


Beanstalk supports adding custom tags during environment creation.  This is a great feature, especially for users who have set up Cost Allocation Tagging and want to accurately report on AWS fees by tag.  However, once you set your tags, you cannot change them.  There is no workflow to modify tags (keys or values) after the Beanstalk environment is created.  You would need to trace back all the underlying resources and modify the tagging manually (ELB, Auto Scaling Group, etc.).
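
So get the tags right up front.  A hedged sketch of supplying them at creation time (application, environment, solution stack, and tag values are all placeholders):

```
# Tags can only be supplied when the environment is created
aws elasticbeanstalk create-environment \
  --application-name my-app \
  --environment-name my-env \
  --solution-stack-name "64bit Amazon Linux running Tomcat 8" \
  --tags Key=CostCenter,Value=marketing Key=Environment,Value=production
```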

Custom AMI

Beanstalk supports using your own custom AMI (maybe you baked your server changes into it).  That said, you can’t use your custom AMI until after your environment is created.  You have to build off the base Beanstalk AMI, wait for the environment to build, then replace the AMI with your custom one (thus rebuilding the environment to use the new AMI).  Not terrible, but not terribly efficient either.  Because Beanstalk uses agent software, you also need to follow a specific procedure to build a custom AMI that will function as expected.
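
The swap itself is just another option setting once the environment exists.  A minimal sketch, assuming your custom AMI was built from the base Beanstalk AMI per the documented procedure (IDs are placeholders):

```
# Point the environment at the custom AMI; the instances are rebuilt from it
aws elasticbeanstalk update-environment \
  --environment-name my-env \
  --option-settings Namespace=aws:autoscaling:launchconfiguration,OptionName=ImageId,Value=ami-0abc1234
```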

Server Changes

One really critical thing to always remember: anything you do on an EC2 instance manually is a temporary change.  Any environment change or scaling activity can wipe your changes out, since they are not part of your application code (configuration management).  This replacement could be outside of your control, like maintenance by AWS.  Users who like to SSH onto the server to make changes or view logs should keep this in mind.  If, however, you loathe SSH-ing onto your application servers, you may find this to be a benefit rather than a hindrance.  In general, to benefit from Elastic Beanstalk, your application should be developed with the assumption that any configuration information lives in the Beanstalk environment, the application, or .ebextensions, and that any application state is kept somewhere other than the EC2 instance (Redis, DynamoDB, etc.).


Introducing FOG-OPS

The day has arrived when every company, not just those at Google-scale, can achieve high availability and scale without a linear increase in operational cost.  That’s what FOG-OPS is all about—leveraging Site Reliability Engineering, DevOps, automation and the cloud to achieve a truly scalable model for increasing site uptime through pro-active engineering instead of through reactive support operations.

I’ve heard from our customers for many years that they would love to leverage managed services in some way, but that the per server, per month (or % of IaaS) cost model just doesn’t make sense to them, especially as they grow.  This is why I’m so excited to be able to offer service packages that do pencil out and scale economically.  Not to mention that I love our new FOG-OPS logo 🙂 (thanks Justin!).

To explain a bit further, signing up for FOG-OPS means that Foghorn takes responsibility for engineering your site to reduce downtime by leveraging automation and self-healing technologies fearlessly hardened by early adopters over recent years.  Foghorn will respond to alerts and remediate issues, but we don’t want to be up all night either!  Our mission is to engineer the reactive support requirement out of your site and our pricing model ensures we are motivated to do that.  And we will continue to invest in IP and technology that allows us to do this effectively.

Give us a call or drop us a note to learn more about FOG-OPS!


When Security Best Practices Conflict – AWS KMS vs Whitelists

A while back, AWS EBS encryption moved to using KMS (Key Management Service).  This was a welcome change, as KMS is a great service that enables some interesting security models, such as different AWS customers sharing KMS keys so that one can encrypt items it holds on another’s behalf.  That said, we did find something interesting that alludes to what is going on behind the scenes.

tl;dr, if you are using KMS, you can’t whitelist by SourceIP in IAM policies (unless you whitelist all of AWS’ public IP space, including other AWS customers).

One of our customers was using custom AMIs with encrypted EBS volumes which were originally created prior to the release of KMS.  These AMIs were launched via knife-ec2 as this customer uses Chef.  We were provisioning new instances and the instances came up in Terminated state.  The State Transition Reason was Internal Server Error.

This led us through a series of follow-up tasks to see what changed and why an AMI that was widely used in dev/test/prod was failing.  Those tasks (in order) were:

  • Launch the AMI manually in the AWS console, this worked
  • Launch a community AMI via knife-ec2 (no EBS encryption), this worked
  • Create a new custom AMI using the same EBS encryption (built off the latest Ubuntu 14.04 AMI), this had the same failure condition in knife-ec2, but worked in the AWS console
  • Create a new custom AMI not using EBS encryption and launch via knife-ec2, this worked

At this point we started working with Chef support and the knife-ec2 team to see if this was an issue with the tool or with us.  Those teams were not able to reproduce our issues so we decided to start looking in less obvious places.

We started looking at security in more detail.  It was odd that the AWS console could launch the AMI but the knife-ec2 command, which uses AWS access keys instead of a console login, could not.  Furthermore, we had not made any changes to security settings in IAM for a long time, so it was not an obvious place to look.  For this customer, AWS console logins use on-premise Active Directory federation, with user logins assuming roles.  The IAM users (console login not enabled) that hold AWS access keys (for the sole purpose of knife-ec2 provisioning) belong to IAM groups.  Those IAM groups were more restricted than the IAM roles (for federated console logins), but in both cases, access to KMS was allowed in IAM.

So we launched the encrypted EBS AMI using the aws cli run-instances command instead of knife-ec2 to see if the problem followed the use of the AWS access keys or if it stayed with knife-ec2.  The aws cli replicated the same end state as knife-ec2 (I should have tried the CLI before reaching out to the Chef / knife-ec2 team!).

That finally led us to the one IAM policy that was unique to IAM users with AWS access keys.  This policy was a blanket deny if certain conditions were not met (a simplified sketch of this kind of policy follows the list).  Those conditions were:

  • The AWS access keys were being used from an instance inside our VPC
  • The AWS access keys were being used from a Source IP that is owned by the customer
  • The AWS access keys were being used from a Source IP that is one of our AWS EIPs used for NAT
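
Here’s a simplified, hypothetical version of that kind of deny policy (the group name, policy name, and IP ranges are placeholders, and the real policy also carried the VPC condition).  Because KMS makes the EBS decryption call on your behalf from AWS-owned IPs, those requests fall outside the whitelist and hit the deny:

```
cat > deny-outside-whitelist.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": ["203.0.113.0/24", "198.51.100.10/32"]
        }
      }
    }
  ]
}
EOF

aws iam put-group-policy \
  --group-name knife-provisioners \
  --policy-name deny-outside-whitelist \
  --policy-document file://deny-outside-whitelist.json
```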

I removed this deny condition and tested the encrypted EBS AMI again with the AWS CLI.  It worked.  I then tested the same AMI using knife-ec2; that worked too.

So back to what is going on behind the scenes at KMS…

We opened a ticket with AWS to confirm that the KMS service is acting on our behalf, which was confirmed.  This means that unless we open our whitelist policy to the entire AWS global IP range, the calls will fail.  We also asked if we could get the public IP range for KMS to add to our whitelist policy, which AWS was not able to provide.  Opening the whitelist policy is a compromise in our security posture, but it is required to leverage KMS.

Denying actions when conditions like Source IP or VPC ID aren’t met is very powerful.  But it also brings you back to the reality of the public service that is AWS.  While it would be great if AWS could group their public IP usage by service (like they do in some cases), it’s obvious why that would be difficult to do for all services (especially those that leverage EC2).  An alternate solution would be for AWS to offer a VPC endpoint for KMS, but this has not happened yet.  Keep this in mind when you start thinking about how to impose global security controls.

Need help with your AWS security?

Although this example is very specific, providing increased security layers can be a powerful tool when implemented correctly.  Foghorn is here to help you with your cloud security, don’t hesitate to reach out!

 


Optimizing Cloud Spend – When Context Counts

A little while ago, I was asked by a customer to help them lower their Amazon Web Services monthly spend.  They were spending about $8,000 per month, and had set an internal goal to get that down to $7,000 in 30 days, and to $6,000 in 60 days.

At first glance, there was no easy fix.  Cost optimization tools are great for finding under-utilized servers, unattached volumes, and other clear opportunities to save.  But this customer was in decent shape. No server sprawl driving up costs, and although there was a significant amount of high performance block storage, nothing looked completely out of bounds.

After digging through the monthly bill day by day and grouping by API call, we were able to find a great opportunity to save money.  Although the customer was utilizing all of the services they had provisioned, there was a better way.  By changing the storage type for a few of the volumes on some of their critical servers, we were able to decrease the bill by about $1,500 per month, achieving most of the 60-day cost optimization goal in about 15 minutes.
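
The post doesn’t spell out the exact commands, but today that kind of change can be as simple as the sketch below: find the provisioned-IOPS volumes, confirm with the application owner that they don’t need that performance, and convert the over-provisioned ones (volume IDs and types are placeholders).

```
# List provisioned-IOPS (io1) volumes as candidates for review
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=io1 \
  --query 'Volumes[].{Id:VolumeId,Size:Size,Iops:Iops}'

# Convert an over-provisioned volume to general-purpose SSD
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp2
```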

We do this all the time, for all of our customers. The reason this is possible is because we have context.  Analytics tools are critical for us to do our job, but do not stand alone.  We understand the application, the performance requirements, and the business impact to our customers if resources are overburdened. We can also offer alternative architectures which accomplish the same result for a lower price.

Foghorn offers monthly operations reviews for all of our platform customers at no additional cost. We take the best practices learned across our customer base, and apply them in context to your application and your business requirements.  

If you want more for less, talk to us about the benefits of working with a certified AWS Consulting Partner and Channel Reseller.


To NAT or to Proxy, That is the Question…

A Better Way to Manage Internet Access for VPC Resources

Anyone who has run OpsWorks stacks with private instances relying on NAT for Internet access may have seen firsthand what danger lurks beneath the surface.  These instances will show as unhealthy to OpsWorks when NAT dies, and if you have Auto-Healing turned on (who wouldn’t?), get ready for OpsWorks auto-destruction (and to think your app was running just fine the whole time). (Read my NAT write-up).

What about that AWS Managed NAT Gateway?

When I started writing this post, AWS’ Managed NAT Gateway had not been released.  Such is the speed of innovation at AWS that I can’t even get my blog post out before they improve the VPC NAT infrastructure.  Even with this new and very welcome product, this blog post is still valid.  While Managed NAT Gateway solves many of the pains of running your own NAT instances, it still adheres to the same paradigms.  So, how about addressing Internet access in an entirely different way than NAT and eliminating those limitations?  Enter the proxy.  Truthfully, a proxy is more complicated to manage than NAT (especially compared to a managed gateway), but the benefits may outweigh the complexities.  We solved these complexities by packaging up a proxy as a self-contained OpsWorks stack in a CloudFormation template to enable users to easily manage secure Internet access.  As a primer though, proxy?

Proxy?

Yes, proxy.  Foghorn Web Services (FWS) proxy is a caching proxy to the Internet.  In normal use cases, this type of proxy is designed to increase performance through caching.  In our application, we use it for extended access controls and domain-based whitelisting.  But the advantages don’t stop at security.

Everybody Gets Internet!

Transitive Relationships & VPC Peering

Many of our customers want unique VPCs for isolation, whether based on their end customers, their own segmentation desires, or other reasons like dedicated tenancy within one VPC (but not within all of their VPCs).  In any case, once an architecture extends to more than one VPC, you naturally find a reason to connect them.  This most commonly ends up being a shared services layer for things like VPN access, CI/CD tools, source code repositories, log aggregation, and monitoring.  It would be nice if you could manage Internet access in this same shared service tier, but VPC Peering does not support transitive relationships.  If you don’t feel like reading up on that, the basic premise is that traffic can go from one VPC to another when peered, but cannot go through that peered VPC to somewhere else (like a NAT instance, a NAT Gateway, a Virtual Gateway, an Internet Gateway, etc.).


This means that a NAT service layer in a shared services VPC cannot be leveraged by Peered VPCs for shared Internet access.  So each VPC now needs its own NAT service layer.  A proxy however does not behave like NAT.  The FWS proxy layer residing in a shared services VPC can enable Internet access for Peered VPCs.  Now you can manage Internet access like any other shared service.  Proxy 1, NAT 0.

You’re Trying to go Where?

Whitelisting

With NAT we can easily address port-based traffic control by simply modifying the security group used by NAT.  So instead of all traffic, we can limit it to just 22/80/443, for example.  But the ports used are only part of the equation.  What if we wanted to whitelist by destination domain?  Why allow access to anywhere on the Internet over 443 when all we really need is GitHub and AWS?  Few would argue that managing a whitelist by IP address is easier than managing it by domain name.  Again, this is where FWS proxy shines over NAT.  We can easily create whitelist files and have our OpsWorks stack pull them from S3 and update the proxy during a configure lifecycle recipe.  The inverse is true as well: we can also manage blacklists with our proxy.  But since we are going for a least-necessary access list, we prefer whitelisting.  Consider for a minute how great it is to just drop traffic you don’t need.  Proxy 2, NAT 0.
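
The post doesn’t say which proxy software FWS uses under the hood, but as a rough illustration of domain whitelisting, here is what it looks like with Squid (file paths and domains are placeholders):

```
# Domains allowed out (a leading dot matches subdomains)
cat > /etc/squid/whitelist.txt <<'EOF'
.github.com
.amazonaws.com
EOF

# Allow only whitelisted destinations, deny everything else
cat >> /etc/squid/squid.conf <<'EOF'
acl whitelist dstdomain "/etc/squid/whitelist.txt"
http_access allow whitelist
http_access deny all
EOF

squid -k reconfigure
```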

Caching

While the main purpose of this blog post is to address the network and security benefits of a proxy over NAT and how we manage FWS proxy via OpsWorks, it doesn’t have to stop there.  Our proxy can be configured as a DNS cache and/or a web cache.  Storage is inexpensive at AWS, so why not cache some static content and speed up deployment runs?  The same is true for DNS: while it can be risky to have long cache times on DNS results, if we pick something reasonable, we can speed things up.  Either way, these are options, not requirements.  We are extending our Internet access service tier features.  Proxy 3, NAT 0.

Who is Pulling from THAT repo? 

Easy to manage?  Surely you can’t be serious?

I am serious, and don’t call me Shirley.  We built a CloudFormation template that creates an Elastic Load Balancer, OpsWorks stack, FWS Proxy layer and Load Scaling instances (as well as its own security group needs and IAM requirements).  It adds our custom recipes to the setup and configure lifecycle events.  Lastly, it writes the common configuration elements of the proxy to the Custom JSON of the Stack, enabling users to easily manage parameters without needing to understand how Chef works.  The most common change is editing the domain whitelist for the proxy.  This is accomplished via a simple text file stored in S3.  The configure recipe retrieves that file, overwrites the current whitelist with the new one and reloads the proxy service.  This is significantly faster than deploying a new proxy instance (say through an autoscaling group with the whitelist as part of the launch configuration).
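
The refresh step described above boils down to something like this (bucket and paths are placeholders; in the real stack it runs inside a Chef configure recipe rather than by hand):

```
# Pull the latest whitelist from S3, replace the current one, reload the proxy
aws s3 cp s3://my-proxy-config/whitelist.txt /etc/squid/whitelist.txt
squid -k reconfigure
```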

But wait, there’s more…

Whitelisting is great but sometimes you need to figure out what traffic needs to be part of the whitelist.  We have a true/false configuration item in the OpsWorks Stack custom JSON which will either ignore (false) or enforce (true) the whitelist.  This along with CloudWatch logs integration (also part of the template) helps you determine what domains are currently being used.  You can go to the CloudWatch Logs interface and simply view the log streams.  They contain all the domains being accessed including the HTTP result code.  This can be used to build the whitelist, or audit Internet activity, or even possibly as a troubleshooting tool.


But I Need Persistent Source IPs

What if some of your destination services require whitelisting your addresses?  Not a problem: we have enabled Elastic IP addresses to provide persistent public IP addressing for a dynamic proxy fleet (this is the default behavior and can be disabled to save on EIP costs).

The default configuration even accounts for all private networks supported by AWS VPC.  You don’t have to worry about what network ranges are chosen, just that they have outbound security group access to the FWS proxy ELB.

So what’s the catch?

Like anything, you don’t know what you don’t know.  If you don’t know which Internet destinations your servers require, be prepared for some short-term pain.  Furthermore, adopting a proxy server varies based on the operating system you are using and how you build your servers.  You will need to adjust the core OS to use the proxy (if you plan to completely remove other Internet access).  Not only do you need to know how to configure each service to use your proxy, you also need to know where that service intends to go.  Domain-based whitelisting is a powerful security tool, but it is predicated on knowing the destination domains.  Refer to the earlier section on how we make that process easier with this solution.
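
The OS-level adjustment is usually just environment variables, though individual tools (package managers, language runtimes, and the like) often need their own proxy settings too.  A minimal sketch, with a made-up internal proxy endpoint:

```
# System-wide proxy settings picked up by most CLI tools on login
cat >> /etc/environment <<'EOF'
http_proxy=http://fws-proxy.internal.example.com:3128
https_proxy=http://fws-proxy.internal.example.com:3128
no_proxy=localhost,127.0.0.1,169.254.169.254
EOF
```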

On a closing note, while a proxy may be a great substitute for NAT in certain use cases, it is still simply an alternative tool.  So if the concepts discussed here are appealing, a proxy may be a better architecture than NAT.  But that is not to say that there is no place for NAT as there certainly still is (refer to my previous post on NAT).

AWS Marketplace Cluster Perhaps?

Not interested in the user-focused, OpsWorks-managed environment?  No problem.  Thanks to AWS’ new Marketplace support for Clusters, Foghorn is creating a cluster solution that encapsulates the core principles of this post, available to launch via the Marketplace Cluster feature set.  Stay tuned for details on that offering.

 

Need help?

Although this packaged service is user friendly, the key is understanding your environment and Internet usage before migrating over.  Foghorn is here to help you with your cloud initiative, don’t hesitate to give us a call!


Security Groups got you down? Get Security Flow!

Here at Foghorn Consulting, we’ve been designing, implementing, and managing point-to-point security with AWS security groups for years.  Security groups allow an amazingly granular method of controlling communications between instances without being bound to using networks as the filter.  This lets us design flat networks while still keeping communications granular and tightly controlled.

But with great power comes great responsibility.

We’ve seen many environments where numerous individuals have the ability to add and modify security groups.  After several years of organic change, it becomes difficult to really understand what all of those rules are doing.

Visualize your Rules with Security Flow


To help our customers overcome this challenge, we’ve recently released a new feature in our AWS reseller portal, available for free to all of our AWS direct customers.  It’s called Security Flow.  With Security Flow, you can instantly generate a visual representation of the security posture of your VPC.  It’s great for audits, design sessions, and even troubleshooting connectivity issues.

Learn more about the value added services available when you buy AWS directly from Foghorn Consulting.

Interested?  Contact us and get it for free!

 


Is your VMware humming, or is it leaking oil?

You’re paying good money for those VMware licenses, and the promise is to help you maximize resource utilization while decreasing management costs and providing high availability.  However, many VMware environments are configured sub-optimally, designed haphazardly, or over-subscribed, which may be putting your company at risk.  Is yours?

Common costly issues

I am sure every VMware administrator has experienced their share of vSphere issues, but these are the most common high-level areas that I often come across that could use improvement:

Design…what design?

I have seen many VMware environments that started with the free version. Then what happens? It grows and becomes a poorly designed production VMware installation. This causes the admins to spend more time managing it and typically creates configurations with a much higher chance of downtime.

No Growth plan

Many environments look at their current resources and provision ESXi hosts for an N+1 configuration. However, they don't plan for the massive growth usually experienced with virtualization. This often means that IT managers have to renegotiate VMware licensing and purchase new hardware that was not budgeted.

Lack of Standards and Planning for VM deployment

It is all too common for too many folks to have access to the VMware environment with no rules in place on how to use the environment. With this typical situation, bad things happen. For example, VMs end up being placed in the wrong locations on storage resources or on the wrong storage altogether.   Of course, this can be very bad if the VM was important but the directory it was created in was not backed up.

Poor Monitoring

Unfortunately in this age of modern IT, it is still common for infrastructure to be monitored poorly or not at all. It is very important to monitor all aspects of your VMware environment to ensure proper performance and avoid typical growth issues.

Over-subscribing your cluster

I have seen many environments with so many virtual machines provisioned that the cluster can no longer provide N+1 high availability. To make things worse, there are typically a number of virtual machines that are not being used anymore but are still taking up resources and putting production virtual machines at risk. This issue is often overlooked since there are no immediately obvious problems, but it becomes a huge issue when there is a hardware problem with one of the hosts.

Another common mistake is to incorrectly size the virtual machines. If you combine this mistake with over-subscription, you may run into CPU Ready issues. CPU Ready refers to the amount of time a virtual machine is ready to use CPU but is unable to be scheduled because all physical CPU resources are busy. Now your beefy virtual machine is suddenly not performing well even though you have provided it plenty of resources (in this case, too many for the available physical resources).
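
As a rough way to gauge this, vCenter reports CPU Ready as a millisecond summation per sampling interval; converting it to a percentage makes it easier to judge (a figure creeping past a few percent per vCPU is usually worth investigating). A quick sketch of the conversion, assuming the 20-second real-time chart interval:

```
# CPU Ready %  =  ready summation (ms) / (interval seconds * 1000) * 100
READY_MS=1400     # example summation value from the vCenter performance chart
INTERVAL_S=20     # real-time chart interval; substitute your chart's interval
echo "scale=2; $READY_MS / ($INTERVAL_S * 1000) * 100" | bc   # => 7.00 (%)
```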

Check your VMware Vitals

Here are some ways to check the vital signs.  If any of these are in the red, you can stand to save money and help avoid disasters by fixing them!

Review your VMware environment’s health

There are multiple ways to achieve this goal. There is a freeware script available from the VMware communities here: https://communities.vmware.com/docs/DOC-9842. VMware has an official health analyzer tool, but it is only available to the partner community. With this tool any VMware partner can collect a bunch of data in the VMware authorized manner.

However, collecting the data is the easy part. Once you have it, you need to analyze it and put together an action item list to get your environment healthy. I have yet to see a report where there was not something to fix.

Review your processes and procedures around VMware management

Even if your VMware environment is healthy, if your processes around usage are unhealthy, things can go belly-up quickly if unruly users with too much access make easy mistakes like using all of the storage or misconfiguring networking.

Ensure that your VMware environment is either locked down so only qualified administrators can make changes or deploy a self-service solution that will allow users to make changes without breaking the environment.

Ensure your design has been properly architected

This point is especially important for all companies that have deployments that were never really planned. Have you taken the time to examine resource allocation and usage? Examine your VM sizing and placement to ensure any virtual machines are configured to take advantage of your hardware resources. Ensure your virtual machine deployment is configured to avoid CPU Ready issues.

Has your licensing plan been upgraded, but your configuration does not take advantage of clusters and distributed switches? I have seen situations where the business has upgraded licensing, but has not made the investment to actually upgrade the design and configurations to take advantage of their licensing investment.

Examine your Monitoring Solution

If you have monitoring data, review all of the details of data collected. Sometimes important details are not red flagged but should still be addressed. Also, review the monitoring configuration to ensure all necessary data is being collected.

Take Action!

VMware vitals looking a bit sour? Don’t despair! This only means you have room to improve.  After remediation, you should have a more stable, manageable, and usable environment.

  • This may seem like a simple one, but I have seen companies sit on their health reports and not implement changes. Make the changes and ensure you have a well-configured VMware environment. If you are uncomfortable making the changes, get some experienced help. It is generally cheaper than production downtime.
  • Ensure your processes for managing your vSphere infrastructure are documented and all interested parties understand the process. If you are able to lock the environment down, it would be best to restrict access to the groups and provide only the access that they require.
  • Consider automation. With vRealize Automation and vRealize Orchestrator, virtualization self-service can be set up to help avoid most of the pitfalls of a wild-west environment.
  • Take a serious look at your current design and business requirements. It is usually cheaper to plan ahead and provision correctly than to try to fix it after your infrastructure has failed to meet demand. Collect data around the company's goals for virtualization, and make a plan to ensure the VMware environment can adapt to meet the business needs.
  • Ensure you have proper monitoring and/or properly configured monitoring. This is the only way that you will obtain any real insight as to what is happening in your environment, ensure you are notified when there are problems, and provide historical data to use for resource planning.

Tell us more.

Got your own tips and tricks for optimizing VMware? Post them here in the comments.  Have any questions? Give us a ring, we'd love to share more with you!

The Reality around Cloud Lock-in Risk

Recently I’ve heard many in the industry discussing the risks of cloud lock-in, so I thought I’d give my take on it.  I hope you find these helpful, please feel free to comment below if you have had similar or contrary experiences!

The Current Talking Points

The promise of cloud is infinite scale, commodity prices, and portable workloads.  As the cloud vendors compete for your business, they are competing on two fronts:

  1. Differentiated services
  2. Price

Differentiated services allow us to benefit not just from infinitely scalable hardware, but also from automation and services higher up the stack.  But if these differentiated services are offered by a single cloud vendor, then by definition you will be locked in to that vendor.  There is a clear trade off, and many purists will argue that in order to protect your company from being locked in to a single vendor, you need to avoid using any services that are not industry standard.  After being held hostage by various legacy hardware and software providers for years, many companies are implementing this type of infrastructure policy, promising not to let it happen again.

How do we get locked in?

Before determining your strategy, it’s good to look at exactly how lock in occurs with a cloud provider.

Developing to the vendor’s API

One of the main benefits of cloud infrastructure is that you can interact with it programmatically instead of physically.  You can spin up and down servers, create and destroy networks, all with a simple script.  You can even integrate this capability into your application, allowing your application or your monitoring system to take these actions on your behalf. This is a very powerful capability, allowing you to guarantee service levels in a way that was virtually impossible in the physical infrastructure world.   But each vendor offers their own API.  Often they are similar, even identical (many open source cloud initiatives have decided to adopt the AWS API). But there is no guarantee that this will persist.  If you have code and scripts that leverage a specific vendor, those may need to be rewritten if you elect to move to a different cloud.

Leveraging Proprietary Components

Another benefit of the public cloud is to leverage application components higher up the stack.  Why build and manage your own message queue infrastructure? In the time it took you to do so, you could have instead been developing new features that differentiate you from your competition.  Some of these components have no real risk of lock-in, as they are relatively standard (email, messaging, DNS, managed MySQL / PostgreSQL), and would only require finding a like service or building one prior to migration.  Other components may be proprietary, like Amazon’s DynamoDB, a horizontally scalable NoSQL database.  This type of offering has a higher level of risk from a lock-in perspective.  But there is also a benefit: building and managing your own highly scalable NoSQL database platform, and ensuring performance and availability across the board, is no easy task.  That said, no other providers currently offer a NoSQL database that is compatible with the DynamoDB API.  Migration would mean selecting another product, and rewriting any application code that speaks to the database.

What else are we locked in to?

Cloud lock-in is a reality, but I think it’s interesting that the topic is so hot.  From my perspective, lock-in is lock-in. So, what other things are we locked in to?  We write our apps in a development language, and we are locked in to that language.  That language often has a preferred application server (or a few options), and sometimes only a single operating system that can effectively run those application servers.  Those operating systems run on a single infrastructure architecture.  And if you are not using cloud infrastructure, that hardware physically sits with a colocation provider who requires term and volume commitments.

Some would argue that in order to avoid lock-in at all these levels, you need to limit your technology choices to those that comply with industry standards.  So what would that really look like?  Ten years ago, you’d have decided on a LAMP stack.  Then, after several years of development on MySQL, Oracle acquires Sun, and MySQL along with it.  Oops.  Now what? Migrate off of MySQL?  Fork the open source project and maintain it yourself?

Will my cloud vendor lock me in and raise my price?

This is a possibility.  We’ve covered the ways that you get locked in, and to my knowledge no cloud provider has promised never to raise prices.  That said, I don’t think I’ve seen a cloud infrastructure vendor raise their price.  What is our guarantee? Well, there are really no guarantees in life, but as an economics major I learned that competition breeds innovation and efficiency.  Although Amazon has bragged of lowering prices without the competitive pressure to do so, that has changed.  They are still lowering prices, but they now have competitive pressure. Their most recent price drop helped them stay on par (for the most part) with Google’s recent price drops for GCE.

From my perspective, the reality is that these vendors offer services which are similar enough that they will be directly competing with each other for new workloads for years to come.  That should continue to push these companies to be more efficient and enable them to lower prices and be more competitive.  These price drops to date have not been just for new workloads, they have been across the board.  So all customers, whether they are locked in or not, benefit from the competitive pressure that cloud infrastructure companies are generating on each other.

Is avoiding lock-in the goal?

We usually tolerate some level of lock in because the business benefit far outweighs the associated risks.  Any time we adopt a technology we are taking some risk of lock in, and we are benefiting from the value that technology is bringing our business.  In order to keep up with our competition, we need to constantly innovate and differentiate our products.  Time spent re-inventing the wheel or maintaining a completely vendor neutral technology environment is time that could otherwise have been spent innovating differentiated services.  The biggest risk to your business? Losing your competitive advantage.

Ok, so how should this affect my infrastructure strategy?

Like all business decisions, we will weigh the strengths and weaknesses of possible strategies, and pick those that benefit our business most.  For most of us, this will mean spending less time worrying about getting locked in to a cloud vendor, and spending more time evaluating the offerings, and choosing the one that most closely complements our workloads, technology stack, SLA requirements, and budget.  We will look at the vendor’s past actions, and make a realistic judgement about the likelihood of how they will act in the future.  Although some will limit their technology and vendor choices to only those that are truly portable, they will perhaps be suffering from a different form of lock in.  Locked in to a smaller feature set, and forced to recreate the wheel in many areas and at great cost, when a proprietary 3rd party solution would make much more business sense.  The rest of us will make sound decisions on where we should best spend our focus, time and capital, and accept some level of lock in for the business benefits that we receive in exchange.


Highly Available Network Address Translation, that friend you love to hate…

if you care about security, you should care about NAT

Who is NAT and what is HA?

As we outlined in our last blog, Amazon Web Services (AWS) introduced Virtual Private Cloud (VPC) years ago, and many advanced networking and security concepts are only available to VPC customers.  The push for VPC adoption has progressed to the point that “EC2 Classic” has been replaced for new accounts by a basic VPC with a public subnet set up for you by default.  Network Address Translation (NAT) is a basic concept you can associate with your home Internet router: the device takes your public IP address and shares it with a private network of machines.  In AWS, an instance in a VPC subnet with a route that includes an Internet Gateway attachment must have an Elastic IP address or public IP address associated with it in order to reach the Internet.  An instance in a VPC subnet without a route that includes an Internet Gateway (a private subnet) needs a way to get on the Internet, and that is where NAT comes in: a NAT instance in the public subnet can serve as the route to the Internet for private subnet instances.  High Availability, or HA (not so much HA-HA funny), is how you address the issue of single points of failure.  If NAT is required for availability of your workload, you need NAT to be Highly Available (HA) as well.  So what now?

I’m better off alone…

Before we get into making NAT HA (and the risks of some methods), it would seem like an easy solution to just use subnets with an Internet Gateway route in your VPC (public subnets).  This way, you can give your instances public or Elastic IP addresses and everything works great.  This is true, and even if we ignore security risks, this is the recommended approach for any workload that heavily uses Internet bandwidth (gets/puts to S3 immediately stand out).  That said, once you give your instance a public IP address, you are relying solely on the network ACL of that subnet and the security group of that instance to control access (and mitigate risk).  What if someone troubleshooting decides to open the SSH ingress rule to anyone instead of the private IP CIDR you previously had?  That instance is now exposed since it has a public IP address.  This may be outside your security policy (or even a compliance issue), and the reason NAT is worth bothering with is that private subnet instances have no public or Elastic IP address (and even if they did, it would not matter).  These instances are truly private, and as such are more secure and more tolerant of inadvertent security group changes.

you need to define your own health checks with your entire stack in mind, and not simply rely on default values

Maybe I’m better off with NAT in my life…

There are two well-documented approaches to creating highly available NAT.  The first is an often-referenced article by Jinesh Varia where High Availability for Amazon VPC NAT is documented.  This is a great article, and the last paragraph is the most important part to internalize: Appendix A outlines the risk of false positives.  While Jinesh refers to this as an “edge case”, I would argue that this edge case is more likely than a simple instance failure.  The nat_monitor.sh script that Jinesh uses has a couple of problems.  First, the script in its current state is syntactically wrong.  The script makes an AWS CLI call to describe the status of the other NAT instance in the other Availability Zone, and pipes the output to awk to get just the value of the status (“running”, for example).  The comments in the script even mention that you may need to modify this line; never ignore comments in code, in this case they are right.  You do need to change the print value, and I believe on the current AWS CLI it’s actually print $6 that you want.
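
Rather than counting awk fields that shift between CLI versions, a more robust way to pull the same state value is a JMESPath query.  A hedged sketch (the instance ID is a placeholder), not the script’s original line:

```
# Ask for the other NAT instance's state by name instead of by awk column
NAT_STATE=$(aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --output text --query 'Reservations[0].Instances[0].State.Name')
echo "$NAT_STATE"   # e.g. "running" or "stopped"
```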


That said, the issue with this script, as Jinesh points out, is that the default values for health check monitoring may be too aggressive and generate false positive results.  False positive is actually a nice way of saying HA NAT self-destruction.  Once this not-so-edge case triggers, the first step of the nat_monitor.sh script is to send an AWS CLI call to stop the other NAT instance.  Spoiler alert!  Once this happens, both NAT instances are stopped and can’t continue on to step 2, which is to fail over the NAT route (and subsequently reboot the failed NAT instance so it takes back over its route).  Highly Available becomes Highly Unavailable.  No running NAT means no Internet access for private instances, and when monitoring starts failing and auto-healing kicks in, get ready for cascading instance failures, all because of a NAT health check failure.  Now that pesky NAT false positive monitoring event has destroyed working instances, and someone is about to get in trouble.

If you were not running HA NAT with the nat_monitor.sh script, you would have a simple NAT instance in each Availability Zone (AZ).  In the edge case that this basic Linux instance, with iptables and route forwarding configured, should fail, you are only losing one AZ, and since you architected a highly available workload, losing one AZ does not mean you lost production service.  Impaired, but available.  You get your notification that NAT in AZ1 has failed; you reboot it and get on with life.  It almost seems like a more available approach than running the HA nat_monitor.sh script.

Ok, so you still must have NAT monitoring and route takeover; what then?  The script mentioned earlier is great (if you fix the syntax), but it needs some parameter changes from the defaults.  To find the right values, you need to determine health check settings that will persist across AWS network events you have no visibility into (also known as “why did my health check fail when the instances are fine?”).  For your respective region and Availability Zones, start logging pings and build the right health check values.  Do not accept default values as a known quantity.  Test for failure or be prepared for the consequences.

I don’t have time to nurture this relationship…

The other commonly used approach to HA NAT is to have an auto scaling group of one NAT instance per AZ.  A health check is used to auto-heal NAT should it really have a system failure, or should your instance be taken away from you, or should you ignore your events page when AWS schedules your NAT instance for restart.  This approach may be better than the nat_monitor.sh script, as false positives are not an issue (at least not one resulting in AZ-to-AZ network failures).  You can bake the AMI with NAT configured, or just do it on the fly with user data (which you might as well, so you can keep the OS up to date with security patches on boot).  Lastly, this instance will need to find the private subnet route table for its AZ and take over that route when it boots up (a new instance means a new instance ID, which means the previous route is black-holed until this script runs).
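
The boot-time route takeover amounts to a few CLI calls in user data.  A rough sketch, assuming the route table ID is passed in (e.g. via user data or tags); it is not the exact script from the article:

```
#!/bin/bash
# Runs at boot on the replacement NAT instance, after NAT/iptables are configured
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
ROUTE_TABLE_ID=rtb-0123456789abcdef0   # the private route table for this AZ (placeholder)

# A NAT instance must have source/destination checking disabled
aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" --no-source-dest-check

# Point the private subnets' default route at this new instance ID
aws ec2 replace-route --route-table-id "$ROUTE_TABLE_ID" \
  --destination-cidr-block 0.0.0.0/0 --instance-id "$INSTANCE_ID"
```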

This approach, however, does not solve the cascading auto-healing failure scenario outlined previously.  If your monitoring and auto-healing of workloads behind NAT is more aggressive than the auto-healing of NAT itself (and its route takeover), you can still find yourself with a single AZ of failing instances trying to repair themselves without Internet access.  Having a very clear understanding of recovery time is crucial in defining a holistic policy on monitoring-driven auto-repair (much like auto scaling in general).

architect for your workload and its complexities, but keep it simple 

What if the magic is dead?


If you think you need to scale NAT for performance for elastic private workloads, there is another, more applicable approach than those mentioned previously: a Squid proxy.  Again, Jinesh has a great write-up on creating a Squid Proxy Farm that uses an internal Elastic Load Balancer in front of an auto-scaling (based on network in) layer of Squid proxy instances.  This provides high availability, automatic scaling for performance, and resilience to a single Squid instance failure across Availability Zones.  Like any high-performance, highly available solution, it comes at a cost.  Not only are you running a minimum of 2 instances (like you would with a NAT per AZ), you also have the Elastic Load Balancer (not a significant cost) and, most importantly, you are now scaling automatically.  It is critical in any automatic scaling scenario that you understand your workload and how it is leveraging Internet resources so you are not surprised by the fees associated with your solution.  Not to mention we have now added a proxy service on top of our normal Linux instance.  Not overly complicated, but more layers involved in the solution means more to manage.

test your solution and have a remediation plan ready

Not everyone is looking for the same thing…

In summary, if your workload is already highly available across Availability Zones and you have placed heavily Internet-biased resources in public subnets, a simple approach to NAT is likely the best setup.  Make sure you are monitoring NAT instance status and have a plan to remediate a NAT instance failure.  Otherwise, let NAT do its job without failover in place and rely on the high availability of the workload, not NAT.

If you are required to have highly available NAT through health checks, ensure you have adjusted the monitoring and tested everything thoroughly.  Alternatively, and as an improvement, use a third monitor-and-control instance that determines NAT health and acts independently of NAT itself.  Of course, don’t forget to monitor the health of the monitor!

Lastly, if you are required to have a heavily Internet biased workload be truly private, and you want the best performance for scaling the NAT function, use a Squid Proxy layer with the right Network In value in place for auto-scaling based on your workload.  And don’t forget to test, test and test again.

Need help?

Although most NAT solutions are well documented, simply following the tutorial does not always ensure a highly available production ready setup.  You need to failure test.  Foghorn is here to help you with your cloud initiative, don’t hesitate to give us a call!


Get the most out of your VPC

A brief history of VPC

Amazon Web Services introduced the Virtual Private Cloud for general availability back in 2009, and VPC has undergone a major transformation since then. VPC was originally designed to meet the requirements of enterprise customers with legacy applications and hybrid operating environments.  It only supported a subset of AWS features, yet it opened AWS as an option for many workloads that previously could not run on cloud infrastructure.  Fast forward to January of 2014, when AWS stopped offering “EC2-Classic”, or EC2 outside of a VPC, for new accounts.  Now not only does VPC support all features of EC2 Classic, it offers many features not available otherwise.  Clearly VPC is the future of EC2.  It’s not hard to use; in fact, all new accounts come with a default VPC.  I’m not going to make this post a beginner’s guide to VPC.  Instead, I’d like to share a few different ways that you can get the most out of your VPC.

Plan before you build, document what you plan

Although the default VPC is a good start, you are severely limiting the benefits you can realize from the available tools.  As you begin to build, there is no tool in the AWS portal that allows you to visualize what you have designed.  If you are ‘building as you go’, it will become increasingly difficult to visualize your environment.  When designing VPCs for our customers, we always do so with network diagrams.  These diagrams can quickly illustrate the purpose for each component, and ensure that what we are building will meet our customers’ needs.  We diagram both at the infrastructure and at the data flow level.  These diagrams come in handy long after we build the environment, as references to allow any competent admin to quickly get up to speed and manage systems or troubleshoot issues.

Security vs Functionality

Many of the VPC tools can be used to increase security, but make sure you are using the tools as they are intended. With tools like network ACLs and security groups available to you, it might not make a lot of sense to try to ‘lock down’ your internal subnets by crippling the route tables between them.  Belt and suspenders are great, and security should be implemented in layers, but avoid crippling the flexibility of your environment in exchange for little (if any) additional security.

Minimize instances in public subnets

One of the best ways to minimize exposure is to minimize instances that run in public subnets. No public IP drastically reduces the threat vectors that your servers are exposed to.  There are great reasons to run instances in public subnets, but you should always be asking, “Does this box really need a public IP?”  We’ve built many environments where the only resources running in public subnets are elastic load balancers, NAT, and security devices.  Do you need more than that?  Challenge yourself and your team to decide.

NAT as a bottleneck

Ok, you have a great design, with most of your instances in private subnets.  Traffic begins picking up, and your workloads are starting to slow down.  The confusing part is that the boxes that are responsible for the tasks that are slowing down seem to be running just fine.  Memory, CPU, network, IO.  Every monitor is showing green.  Time to follow the data trail.  If your servers are using services like S3, or other region based services, your instance will be communicating with the public endpoint of those services, which means that the traffic will need to flow through your NAT.  Time to either ditch those T1 micros, or consider isolating the service and moving it to a public subnet. Also consider the potential for your NAT to be a single point of failure if your production servers need to initiate connections with public endpoints.

 Protecting you from … yourself

We all think of security groups as a layer of protection from hackers.  The same goes for network ACLs.  But many times the biggest threat to our production environment is accidental access.  Accidentally leaving a production hostname in a staging deployment, or vice versa, can cause downtime or irreversible damage to production data.  Consider using these tools to isolate production from staging, as well as from evildoers.  To make it easy to track, we often reference security groups within security group rules (e.g. only servers in the Production-Web security group can access the Production-DB security group on port 3306).
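
That Production-Web to Production-DB rule looks roughly like the sketch below (the group IDs are placeholders): the source is another security group rather than a CIDR block, so membership in the web group is what grants database access.

```
# Allow MySQL from the Production-Web group into the Production-DB group
# (sg-11111111 = Production-DB, sg-22222222 = Production-Web; placeholders)
aws ec2 authorize-security-group-ingress \
  --group-id sg-11111111 \
  --protocol tcp --port 3306 \
  --source-group sg-22222222
```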

 When one VPC just isn’t good enough

AWS offers many fine-grained controls to limit access to specific resources, but coverage is not universal.  Because of this, you may find yourself struggling to create seemingly simple permissions policies that allow users to freely experiment and test while still protecting production workloads from accidental portal clicks.  In many cases, the solution is multiple VPCs.  With the new VPC peering features, you can construct elegant multi-VPC infrastructures and more easily offer a flexible environment where developers are free to experiment without worrying about accidentally making the front page of Slashdot.

 Configuration Management via CloudFormation

We’ve found CloudFormation a great tool for delivering custom VPCs to our customers’ accounts.  We can build and test without interfering with their account, and once complete, we can deliver in a format that allows our customers to easily rebuild their environment from scratch without relying on us.  The speed at which a CloudFormation template builds allows these templates to double as a great DR tool, letting our customers quickly rebuild their environment in a different region should a regional outage occur.
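
That DR property is mostly a one-liner once the template exists.  A hedged sketch (stack name, region, and template file are placeholders):

```
# Rebuild the same VPC from its template in another region for DR
aws cloudformation create-stack \
  --region us-west-2 \
  --stack-name customer-vpc \
  --template-body file://customer-vpc.template.json
```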

How can you benefit?

Although most VPCs are about 80% alike, it’s the 20% that is unique to your environment that can really unlock the potential of AWS for your specific workloads.  Want to learn more about how Foghorn can help you with your cloud initiative?  Give us a call!
