VNet-in-a-box. Get your Azure workloads moving!

Azure for the masses

Microsoft has come a long way in the past two years with its cloud offerings, and Azure is now a legitimate IaaS option that many of our customers are interested in. But the same blockers that apply to other cloud providers still apply here.

Stuck in Neutral?

Many of our customers have come to us with a problem of inertia.  Everyone in the company wants to get to the cloud, but the hurdles to getting that first workload up and running are just too big. It's easier to just keep adding to the stuff in the datacenter.  Each of these decisions makes sense on its own, but imagine if you had taken the plunge a year ago: you wouldn't be dealing with a procurement nightmare right before the holidays to get that new 'urgent' project the resources it needs.  The time is never right; you have to jump in at some point.

Are these your blockers?

Foghorn has helped lots of companies get past these hurdles.  The biggest ones we see are security and network integration. Companies unfamiliar with Azure VNet features feel they need to make sure the VNet is configured both to protect their cloud workloads from the internet and to protect their corporate networks from those cloud workloads.  At the same time, they need to understand how best to integrate an Azure VNet with their corporate network.

You are closer than you think!

Foghorn has developed a process, a set of best practices, and a set of codified templates that allow us to help companies get over these hurdles in days instead of weeks or months.  We deliver it as a handy piece of code, enabling companies to extend their private network into a secure cloud environment and instantly benefit from new infrastructure available on demand. We call the offering VNet-in-a-box.  In a few days you can be spinning up servers, connecting to them from your corporate network, and configuring them for that urgent business need.  We have an easy-to-swallow fixed price for the entire engagement, and if you qualify, Microsoft might even pick up some, or all, of the tab.  Step 1?  Call Foghorn, or check out a few more details here.

Posted in General

TAC v TAM

We regularly get questions about the difference between the industry-standard Infrastructure as a Service (IaaS) provider Technical Account Manager (TAM) and the Foghorn-specific FogOps Technical Account Consultant (TAC).  The truth is that even though these acronyms are only one letter apart, they couldn't be further apart in terms of what they deliver and the benefits that are realized.  In fact, most organizations that are leveraging an application stack in the public cloud could likely benefit from both TAM and TAC services.  This article explains the differences between the two offerings and how to put in place a support and engineering model that ensures success for your application's full stack in the cloud.


TAM Backstory 

The traditional TAM offering was born from the need for additional, white-glove, manufacturer/provider support that allowed enterprise customers to get an increased level of assistance for the products and services they used in their information technology environment.  Familiar examples include operating system providers like Microsoft and Red Hat, hardware companies like NetApp, and cloud service providers like Amazon Web Services (AWS) and Google.  In the cloud world, common IaaS provider TAM services include escalating vendor support tickets, providing non-public insight into bug fixes, expediting root cause analysis (RCA), and sharing product and service roadmap insights.

Since the term TAM has been adopted by many technology companies for many different services, and since Foghorn is a cloud company, I'll focus my comparison on IaaS provider TAM services.

The Challenge

Like the support services they oversee, the cloud IaaS provider TAM is able to provide guidance, advice, and information.  But when it comes to leveraging those things to provide hands-on engineering, configuration, upgrades, etc., the TAM is usually not contractually able to assist.

Cloud vendor-provided professional services are usually available to provide hands-on expertise and attempt to pick up where TAMs leave off.  However, the many layers (and accompanying vendors) in most application stacks, along with the desire of more enterprises to leverage multiple cloud providers, result in the need for multiple TAMs and multiple vendor-based professional services groups.  This is far from ideal given the expense and the finger-pointing that commonly occurs among providers.  Best case, your project's velocity suffers while the various groups figure out how to work together towards a common objective.  Worst case, your site's environment suffers from availability, security, and/or performance issues caused by gaps among vendors.

Enter FogOps TAC


A FogOps TAC picks up where the cloud TAM leaves off, providing a named Consultant who bridges the gap between advice and execution in a multi-cloud, full stack environment.

Similar to a TAM, but with multi-cloud capabilities, a TAC can join customer work sessions, presentations, and meetings, and can ensure that financial commitments with pay-as-you-go services are continually optimized.  Additionally, the TAC can actually implement best-practice advice and solve engineering challenges.

Best of all, there is no finger-pointing or lost time due to gaps among providers, since there is one named resource looking out for issues site-wide and stack-deep.

The most visible and immediate benefits are increased velocity and interoperability, cloud-wide and stack-deep.  But easily taken for granted is the value realized from the TAC's leadership and technical project management capabilities: none of these projects matter if business value is not realized, and the TAC ensures that it is.

A few brief examples to help illustrate:

  • Resource Emergency! – Consider a cloud provider that imposes a resource limit its customers cannot exceed.  Theoretically, a cloud provider TAM may work to escalate an exception to this limit, but in some cases an exception may not be possible.  The TAM will explain the situation and possibly advise the customer on a strategy to work around the limit through best-practice, cloud-provider-recommended engineering and architecture.  The TAM's role likely ends there; that's where the FogOps TAC continues.  After confirming the strategy won't have an adverse impact, the FogOps TAC executes the required engineering, likely involving changes to automation code.  No finger-pointing, no delays, with the breadth and depth of the site taken into account.
  • Scaling is Broken! – Now consider an auto-scaling environment built according to IaaS provider best practice.  All works well except for the larger-than-anticipated IaaS usage charges.  Streamlining the entire system, from web server configuration to auto scaling rules to the server bootstrap process, would be a typical FogOps TAC activity.  The result?  Seamless scaling with more connections per server, and increased site performance and reliability while simultaneously reducing IaaS cost and usage.

Engineering and Support Models

A common and successful enterprise cloud support model includes IaaS provider enterprise support along with a FogOps TAC for hands-on, full stack architecture, engineering and execution.

But what happens when one named resource is not enough and your desired velocity exceeds your ability to execute?

That’ll be the topic of a future  blog post.  For now, I’m off to enjoy the next episode of Westworld!

Posted in Cloud, General, Public Cloud

Infrastructure as Code in Google Cloud

Why Foghorn Codes Infrastructure

At Foghorn, we manage lots of customer infrastructure via our FogOps offerings.  Code is a great way to help make sure that we deliver consistent infrastructure that is repeatable and reusable.  We can version our infrastructure. We can spin up new environments (staging, qa, dev) in minutes with the confidence that they are exact duplicates.  We can even ‘roll back’ to a degree, or enable blue/green deploys.

Our Favorite Tool

Each cloud provider has its own tool(s) for defining, provisioning, and managing infrastructure as code.  For us, since we work in so many different environments, we've chosen to standardize on Terraform by HashiCorp.  Terraform has great providers for all of the major clouds, and it has some really cool features that you won't see across the board from the cloud-specific options.  Although we are competent in all of the tools, Terraform gives us the unique opportunity to build multi-cloud infrastructure with a single deployment.

Example – Hello Google

Google Cloud Platform has come a long way in the last couple of years. My colleague Ryan Fackett recently put together a modified “hello world” example for Google Cloud, so I’ll use that to get us from the 10,000 foot view straight down to taking a peek at some code.  In order to spin up a workload in Google Cloud, we first need a network and a few other infrastructure dependencies. These include:

  • Network
  • Subnetwork
  • Firewall Ruleset

With these basics in place, we can spin up a server, configure it as a web server, and write our “hello Google” app.  To get traffic to it, we’ll write an IP forwarding rule, and note the IP address.  If all goes well, the code we write will create the resources, configure the server, and we’ll be able to hit the IP address with a web browser and see our site.

A Look at the Code

Let’s take a look at the code that creates the network.  We need a network and at least one subnet.  Since a subnet lives in a single region, we’ll create a few subnets in different regions to allow us to spin up a server in various parts of the world.  The code has been snipped for brevity, but it should give you a good idea of the code you may write to form your infrastructure.

The Network

resource "google_compute_network" "vpmc" {
name                    = "vpmc"
description             = "VPMC Google Cloud"
auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "vpmc-us-east1" {
name          = "vpmc-us-east1"
ip_cidr_range = "${var.cidr_block["net1"]}"
network       = "${google_compute_network.vpmc.self_link}"
region        = "${var.subnetworks["net1"]}"
}

resource "google_compute_subnetwork" "vpmc-us-central1" {
name          = "vpmc-us-central1"
ip_cidr_range = "${var.cidr_block["net2"]}"
network       = "${google_compute_network.vpmc.self_link}"
region        = "${var.subnetworks["net2"]}"
}

You might notice some variables (${var.foo["bar"]}) in here instead of hard-coded values. We are using variables for the subnet CIDR blocks as well as the regions. This allows us to leverage the same code across multiple workloads and set the variables accordingly.
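
For reference, the declarations behind those lookups might look something like this. This is a minimal sketch rather than the project's actual variables file; the net1 values match the plan output shown later, while the net2 CIDR is just an illustrative placeholder:

variable "cidr_block" {
  type = "map"

  default = {
    net1 = "10.1.0.0/18"  # matches the vpmc-us-east1 range in the plan output below
    net2 = "10.1.64.0/18" # illustrative value for the us-central1 subnet
  }
}

variable "subnetworks" {
  type = "map"

  default = {
    net1 = "us-east1"
    net2 = "us-central1"
  }
}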

You will also notice we use a reference (${foo.bar}) to associate the subnets with the network. This is a requirement because, at the time we write the code, the network does not exist yet, so there is no ID to hard-code.

Firewall Access

Next we will need a firewall policy to allow incoming connections to a server:

resource "google_compute_firewall" "http" {
name = "vpmc-http"
network = "${google_compute_network.vpmc.name}"

allow {
protocol = "tcp"
ports = ["80"]
}

source_ranges = ["0.0.0.0/0"]
target_tags = ["demo"]
}

By setting the target_tags, any instance with that tag will inherit the firewall policy.

Forwarding and Load Balancing

Next we need a front door with a public IP so we can hit our web site. Most web sites will be load balanced, so this code puts us in a position to run HA by spinning up multiple servers in multiple regions. I won’t go through it in detail, but it’s here for your reference:


resource "google_compute_http_health_check" "vpmc-healthcheck" {
name = "vpmc-healthcheck"
request_path = "/"
check_interval_sec = 30
healthy_threshold = 2
unhealthy_threshold = 6
timeout_sec = 10
}

resource "google_compute_target_pool" "vpmc-pool-demo" {
name = "vpmc-pool-demo"
health_checks = ["${google_compute_http_health_check.vpmc-healthcheck.name}"]
}

resource "google_compute_forwarding_rule" "vpmc-http-lb" {
name = "vpmc-http-lb"
target = "${google_compute_target_pool.vpmc-pool-demo.self_link}"
port_range = "80"
}

resource "google_compute_instance_group_manager" "vpmc-instance-manager-demo" {
name = "vpmc-instance-manager-demo"
description = "Hello Google Group"
base_instance_name = "vpmc-demo-instance"
instance_template = "${google_compute_instance_template.vpmc-template-demo.self_link}"
base_instance_name = "vpmc-instance-manager-demo"
zone = "${var.subnetworks["net1"]}-d"
target_pools = ["${google_compute_target_pool.vpmc-pool-demo.self_link}"]
target_size = 1

named_port {
name = "http"
port = 80
}
}

You’ll notice there is no IP address in here. That’s because google hasn’t assigned it yet. We’ll need to query it later to know where our site lives.

Our Web Server

Finally, we spin up a server and some associated resources. You'll see that we boot from a stock Ubuntu image and configure the instance with a bootstrap script:

resource "google_compute_instance_template" "vpmc-template-demo" {
name_prefix = "vpmc-template-demo-"
description = "hello google template"
instance_description = "hello google"
machine_type = "n1-standard-1"
can_ip_forward = false
tags = ["demo"]
disk { source_image = "ubuntu-1404-trusty-v20160406" auto_delete = true boot = true } network_interface { subnetwork = "${google_compute_subnetwork.vpmc-us-east1.name}" access_config { // Ephemeral IP } } metadata { name = "demo" startup-script = <<SCRIPT #! /bin/bash sudo apt-get update sudo apt-get install -y apache2 echo '<!doctype html>
<h1>Hello Google!</h1>
' | sudo tee /var/www/html/index.html SCRIPT }   scheduling { automatic_restart = true on_host_maintenance = "MIGRATE" preemptible = false } service_account { scopes = ["userinfo-email", "compute-ro", "storage-ro"] } lifecycle { create_before_destroy = true } }

By adding the “demo” tag to the instance, we automatically associate it with the firewall rule we created earlier.

Google Authentication

In order to actually spin up an environment, we'll need a Google account, and we'll need to give Terraform access to credentials. A simple test can be done with this code:


provider "google" {
credentials = "${file("/path/to/credentials.json")}"
project = "test-project-1303"
region = "us-east1"
}

Terraform Plan

Running terraform plan preps us for running an apply. Terraform will look at our code and compare the requested resources to the existing state file. New resources will be created; resources no longer in the code will be destroyed. Plan tells us which changes will be made when the next apply is executed, and that preview is exactly what makes plan so valuable. Let's take a look at a portion of the response:

+ google_compute_subnetwork.vpmc-us-east1
    gateway_address: ""
    ip_cidr_range:   "10.1.0.0/18"
    name:            "vpmc-us-east1"
    network:         "${google_compute_network.vpmc.self_link}"
    region:          "us-east1"
    self_link:       ""
 
+ google_compute_target_pool.vpmc-pool-demo
    health_checks.#: "1"
    health_checks.0: "vpmc-healthcheck"
    instances.#:     ""
    name:            "vpmc-pool-demo"
    project:         ""
    region:          ""
    self_link:       ""
 
Plan: 16 to add, 0 to change, 0 to destroy.


Terraform Apply

Ok, plan told us all the changes Terraform will make when we run an apply. We approve of the list, so we run apply to make the changes. The response:

google_compute_address.vpmc-ip-demo: Creating...
  address:   "" => ""
  name:      "" => "vpmc-ip-demo"
  self_link: "" => ""
google_compute_network.vpmc: Creating...
 
...
 
google_compute_instance_group_manager.vpmc-instance-manager-demo: Still creating... (10s elapsed)
google_compute_instance_group_manager.vpmc-instance-manager-demo: Creation complete
 
Apply complete! Resources: 16 added, 0 changed, 0 destroyed.
 
The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.
 
State path: terraform.tfstate

We can now go to our Google Console, find the IP address for the forwarding rule that we created, and hit it in a web browser:

Hello Google

Automatic CMDB

ITIL processes recommend tracking all of our IT configuration items in a Configuration Management Database. The usual method to ensure this happens is to put in place a change control process. All changes go through the change control process, which includes updating the CMDB. As time goes by, human error tends to create drift between what is in the CMDB and what actually exists. This can cause difficulties in troubleshooting, and can create time-bombs like hardware that falls out of support, or production changes that have not been replicated to the DR environment.

Consider the .tfstate file that Terraform creates/updates after running an apply. It includes all of the details of the infrastructure “as built” by Terraform. This is, in effect, your CMDB. If Terraform is the only deployment tool used (this can be enforced with cloud API permissions), the accuracy of your CMDB is effectively 100%. Add this to the fact that this code can, sometimes in minutes, spin up from scratch a complete DR site (less your stateful data), and you can see the benefits. You also protect against needing ‘tribal knowledge’ to maintain your site.

In a Nutshell

The whole reason we didn't treat infrastructure as code for many years is that we simply couldn't. Cloud infrastructure APIs have completely abstracted the operator from the hardware. Although this creates constraints for the operator, it also creates opportunities. Infrastructure as Code is one of the major potential benefits of cloud infrastructure. If you aren't doing it, you should be.

Foghorn Consulting offers FogOps, an alternative to managed services for cloud infrastructure. We build and manage our customer environments strictly with code, and our customers see the benefits in the form of lower ongoing management costs, higher availability, and more confidence in the infrastructure that powers their mission critical workloads.

Next Up – Pipelining your Infrastructure

There are a ton of additional benefits to treating infrastructure as code. In addition to having self-documented, versioned, and reusable infrastructure modules, we can extend our toolset to use CI/CD tools to build a full infrastructure test and deployment pipeline.  I'll give an example in a future post.  Stay tuned!

Posted in Cloud, General, Public Cloud

ChatOps – Contender or Pretender?

TL;DR: FogOps is built around ChatOps, and I thought I'd share why we chose it.

Every time a new word is born, I can’t wait to explain why we didn’t need that word.  Especially words that aren’t really words; they’re just two words, or parts of words, glommed together, and deemed the ‘next big thing’.   My first reaction to the term “ChatOps” was “You’ve got to be kidding me”.  Chat has been around since the IRC days, and now someone puts a pretty face on it, and suddenly it’s going to change the way we do things?  I’m not buying it.  But as a leader in our organization, I obviously had to do my due diligence.  After seeing how ChatOps was implemented at GitHub, it started to sink in.

We all need some place to live.  For years we used email as our digital home.  Newsgroups were available, but it was so much easier to just subscribe to an email list.  These were our ‘social networks’ before Facebook.  All of our forums were email enabled, to make sure that we were notified when something cool happened.  In business, Microsoft made Outlook our home.  We ‘live’ in Outlook so much, that some people send urgent, time sensitive requests via email, and assume they will be read in a timely manner.  We integrated our ticketing systems with email, to ensure everyone was notified of the important items.

The problems with email as a place to live are many, the largest being organization and control.  Many tools exist to try to control spam and clutter, but we still cannot control what initiatives and subjects grab our attention at any time.  The publish/subscribe model was made popular by the next generation of social networks, and really made a ton of sense in business to help fix the issues associated with email as our main communication and collaboration tool at the office.

Enter Chat

Operations teams need somewhere to live as well.  They are extremely reliant on collaboration, communication, and documentation.  They also have a penchant for doing things, not writing about them.  Who moved my cheese?  It's vital in an operations team to ensure that everyone can see change history.  Chat rooms are a great way to solve this.  Got a crazy ticket assigned to you when you get in to work? Just check the ops channel, and I bet reading the history of the last shift will give you context.

But we need to do more than just chat.  We need to do things.  And if we do things without recording them by telling people in the chat channel, we are back to where we started.  The only way to ensure that people record these events is to have them perform the actions in the channel itself!

ChatOps is Born

If I truly want to live in chat, and want to record and collaborate in one place, then I need to use the chat tool both to talk about things and as my operations interface.  Since our chat tools are made for writing text, they can be used pretty easily as a command line interface.  If we make all of the basic operations tasks into scripts that can be run from within the chat window, we can all see things happening in real time.

Meet the Chat Bots

ChatOps is not ChatOps without bots.  Bots enable us to receive information, like alerts, in push notification fashion.  They enable us to query for current status.  And they enable us to run common tasks without becoming experts in every tool.  This is what enables us to live in our chat. And since chat can be organized by function or initiative, and membership can be controlled, it allows us to focus on the task at hand.

DevOps without the DevOps

Everyone's got their opinion of DevOps, but I'll share mine.  It was a rudimentary organizational solution to the divide between development and operations: nix operations and make the developers do it.  It was simple but transformative and effective.  But it forces great developers to learn the full stack, which is less than optimal and can potentially lead to churn of the most talented developers.  With ChatOps, we can enable developers to see status, push code, and do other "ops-y" tasks, without forcing them to learn all the tools and engineer all the systems.

Long Live SRE

By creating a simplified toolset of commands specifically for development teams, our SRE teams can collaborate with developers and enable them, without forcing them to take pager duty and without slowing them down.  This delivers the benefits of DevOps while still letting team members contribute where they are strongest.  It enables an SRE organization to exist without recreating the issues that occur when there is a divide between Development and Operations.

ChatOps Example – Hello World

This is easier to understand by example, so I’ll mock a real world situation with the world famous Hello World example.  So, imagine you are running an SRE team responsible for keeping your company’s Hello World web site up and running, and working with the developers who need to push changes to font type, size, and color all day long.  How could ChatOps help you?  For this example I’ll use Foggy, our chat bot.

Well, we first want to make sure we know that our site is available to the world.  So we set up an external DNS-based health check against helloworld.fogops.io.  If the check fails, we are in an urgent crisis, so we make sure to alert on it.  We page the on-call SRE, but we also notify our chat bot, and our chat bot posts the alert in our #hello-world-ops Slack channel.  The event is effectively logged with a timestamp for the team to use later, without having to dig through log files or log in to the monitoring system.  Only #hello-world-ops alerts are posted here, so everything we see is in the context of the initiative, which is to maintain the best hello world site known to man.


The team begins troubleshooting the problem.  They check the status of the web server with a quick slack command and get the OK that the web server is up and running, but can’t establish a connection to the database.


The team checks status of the database server and finds the problem.


Looks like our SQL service died.  Foggy can handle this one for us…


The team completes the fix, the site comes back up, and the monitoring system posts the green light to confirm.


The dev team is happy their site is back up, and they immediately push a new font.


This is a pretty simple example, with the goal of helping get your creative juices going.

FogOps – Powered by ChatOps

Our FogOps operating model is built around the close collaboration between our SRE team and your Development team.  I could not find a better engagement methodology than ChatOps.  Foghorn has embraced ChatOps. We enable our customers’ development teams to push code, see alerts, and communicate with the SRE team, all while being able to focus on development and get out of the rut of trying to stay current on every cloud, integration, deployment, monitoring, and configuration management technology.    We bought into ChatOps.  Will you?

Posted in Cloud, General, Private Cloud, Public Cloud

6 Things to know about AWS Elastic Beanstalk

Elastic Beanstalk is AWS' PaaS, and it's a powerful platform for accelerating application delivery in AWS' cloud.  That said, I have come across a few things that have been stumbling blocks for some users of the service, especially as they try to deploy production workloads.  In an effort to help you leverage Beanstalk for production environments, I'll go through the most common issues and some workarounds.

Security Groups

Beanstalk creates its own security groups so you don't have to.  That's great if you don't want to create security groups, and an important concept to keep in mind.  While you can certainly modify the ingress and egress rules of these security groups, what you can't easily do is use your own security groups.  Beanstalk does allow you to add security groups to your EC2 instances through its interface, but you cannot remove the security group it created for the EC2 instances.  This is in contrast to AWS' OpsWorks service, which asks you whether you want to use OpsWorks security groups or not; by saying no, you can use your own custom security groups.  In order to use custom security groups with Beanstalk, you would:

  • Create your custom security groups for the application layers
  • Create your Beanstalk environment
  • In the Beanstalk configuration, add your application security groups to the configuration
  • Modify the ELB manually to use your security group instead of the Beanstalk security group (replacement)
  • Manually remove all ingress and egress rules in the security group Beanstalk created for your EC2 instances

Configuration Management

Beanstalk does provide a hook to modify the underlying EC2 instances on creation.  Things like installing a log aggregation agent, for example, can be done through a feature called .ebextensions in your application code.  Which brings us to the catch: you are now bundling configuration management in your application code.  Where the development team is small, this might make sense (one place for all code). Once you have different teams managing system configuration and application features, they are now sharing a single application repository.  Further, in order to test and deploy a system configuration change, you need to do an application release and test.  If, for example, your application is Java, something as trivial as modifying a log aggregation agent configuration entry means an application compilation, test, and release.

Configuration Changes

Beanstalk has a wide set of configuration settings that provide a single place to make changes like the EC2 instance type, auto scaling group size, etc.  However, not all changes just work.  For example, moving from a single instance to a load-balanced setup works well; moving from a load-balanced setup back to a single instance can generate some errors.  In any case, it takes time to apply a change.  Expect a significant amount of time for some environment changes, or just make sure you get the setup right the first time (for those values you are able to set during initial creation).  Finally, most changes require the termination and recreation of the EC2 instances.  This is a core concept for Beanstalk users to understand: the EC2 instances can and will be replaced as needed by Beanstalk.  More on that later.

Tagging


Beanstalk supports adding custom tags during environment creation.  This is a great feature, especially for users who have set up Cost Allocation Tagging and want to accurately report on AWS fees by tag.  However, once you set your tags, you cannot change them.  There is no workflow to modify tags (keys or values) once the Beanstalk environment is created.  You would need to trace back all of the underlying resources and modify the tagging manually (ELB, Auto Scaling Group, etc.).

Custom AMI

Beanstalk supports using your own custom AMI (perhaps you baked your server changes into it).  That said, you can't use your custom AMI until after your environment is created.  You have to build off the base Beanstalk AMI, wait for the environment to build, then replace the AMI with your custom one (thus rebuilding the environment to use the new AMI).  Not terrible, but not terribly efficient either.  Because Beanstalk uses agent software, you also need to follow a specific procedure to build a custom AMI that will function as expected in AWS.

Server Changes

One really critical thing to always remember: changes you make manually on an EC2 instance are temporary.  Any environment change or scaling activity can wipe them out, since they are not part of your application code (configuration management).  This replacement can be outside of your control, like maintenance performed by AWS.  Users who like to SSH onto a server to make changes or view logs should keep this in mind.  If, however, you loathe SSH-ing onto your application servers, you may find this to be a benefit, not a hindrance.  In general, to benefit from Elastic Beanstalk, your application should be developed with the expectation that any configuration information needs to live in the Beanstalk environment, the application, or .ebextensions, and any application state needs to be kept somewhere other than the EC2 instance (Redis, Dynamo, etc.).

Posted in Amazon Web Services, AWS, Cloud, General, Public Cloud

Introducing FOG-OPS

The day has arrived when every company, not just those at Google scale, can achieve high availability and scale without a linear increase in operational cost.  That's what FOG-OPS is all about: leveraging Site Reliability Engineering, DevOps, automation and the cloud to achieve a truly scalable model for increasing site uptime through proactive engineering instead of reactive support operations.

I've heard from our customers for many years that they would love to leverage managed services in some way, but that the per-server, per-month (or percentage-of-IaaS) cost model just doesn't make sense to them, especially as they grow.  This is why I'm so excited to be able to offer service packages that do pencil out and scale economically.  Not to mention that I love our new FOG-OPS logo 🙂 (thanks Justin!).

To explain a bit further, signing up for FOG-OPS means that Foghorn takes responsibility for engineering your site to reduce downtime by leveraging automation and self-healing technologies fearlessly hardened by early adopters over recent years.  Foghorn will respond to alerts and remediate issues, but we don’t want to be up all night either!  Our mission is to engineer the reactive support requirement out of your site and our pricing model ensures we are motivated to do that.  And we will continue to invest in IP and technology that allows us to do this effectively.

Give us a call or drop us a note to learn more about FOG-OPS!

Posted in Amazon Web Services, AWS, Cloud, Public Cloud

When Security Best Practices Conflict – AWS KMS vs Whitelists

A while back, AWS EBS encryption moved to using KMS (Key Management Service).  This was a welcome change, as KMS is a great service that enables some interesting security models, such as different AWS customers sharing KMS keys and allowing one another to encrypt items held on each other's behalf.  That said, we did find something interesting that alludes to what is going on behind the scenes.

TL;DR: if you are using KMS, you can't whitelist by source IP in IAM policies (unless you whitelist all of AWS' public IP space, including other AWS customers).

One of our customers was using custom AMIs with encrypted EBS volumes which were originally created prior to the release of KMS.  These AMIs were launched via knife-ec2 as this customer uses Chef.  We were provisioning new instances and the instances came up in Terminated state.  The State Transition Reason was Internal Server Error.

This led us through a series of follow-up tasks to see what had changed and why an AMI that was widely used across dev/test/prod was failing.  Those tasks (in order) were:

  • Launch the AMI manually in the AWS console, this worked
  • Launch a community AMI via knife-ec2 (no EBS encryption), this worked
  • Create a new custom AMI using the same EBS encryption (built off the latest Ubuntu 14.04 AMI), this had the same failure condition in knife-ec2, but worked in the AWS console
  • Create a new custom AMI not using EBS encryption and launch via knife-ec2, this worked

At this point we started working with Chef support and the knife-ec2 team to see if this was an issue with the tool or with us.  Those teams were not able to reproduce our issues so we decided to start looking in less obvious places.

We started looking at security in more detail.  It was odd that the AWS console could launch the AMI but the knife-ec2 command, which uses AWS access keys instead of an AWS console login, could not.  Furthermore, we had not made any changes to security settings in IAM for a long time, so it was not an obvious place to look.  For this customer, AWS console logins use on-premises Active Directory federation, with users assuming roles.  The IAM users (console login not enabled) who have AWS access keys (for the sole purpose of knife-ec2 provisioning) were using IAM groups.  Those IAM groups were more restricted than the IAM roles (for federated console logins), but in both cases, access to KMS was allowed in IAM.

So we launched the encrypted EBS AMI using the AWS CLI run-instances command instead of knife-ec2 to see if the problem followed the use of the AWS access keys or if it stayed with knife-ec2.  The AWS CLI replicated the same end state as knife-ec2 (I should have tried the CLI before reaching out to the Chef / knife-ec2 team!).

That finally led us to the one IAM policy that was unique to IAM users with AWS access keys.  This policy was a blanket deny that applied whenever certain conditions were not met (a rough sketch of such a policy follows the list below).  Those conditions were:

  • The AWS access keys were being used from an instance inside our VPC
  • The AWS access keys were being used from a source IP owned by the customer
  • The AWS access keys were being used from a source IP matching our AWS EIPs used for NAT
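
To illustrate the shape of such a policy, here is a rough sketch expressed as a Terraform aws_iam_policy_document (not the customer's actual JSON; the CIDR blocks are hypothetical, and the real policy also covered the in-VPC condition):

data "aws_iam_policy_document" "require_known_ips" {
  statement {
    effect    = "Deny"
    actions   = ["*"]
    resources = ["*"]

    # Deny everything unless the request originates from a whitelisted range.
    # KMS calls made on your behalf come from AWS-owned IPs, so they fail this test.
    condition {
      test     = "NotIpAddress"
      variable = "aws:SourceIp"
      values   = ["198.51.100.0/24", "203.0.113.10/32"]
    }
  }
}

resource "aws_iam_policy" "require_known_ips" {
  name   = "require-known-ips"
  policy = "${data.aws_iam_policy_document.require_known_ips.json}"
}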

I removed this deny condition and tested the encrypted EBS AMI using the AWS CLI.  This worked.  I then tested the same AMI using knife-ec2, and that worked too.

So back to what is going on behind the scenes at KMS…

We opened a ticket with AWS, and they confirmed that the KMS service acts on our behalf. This means that unless we open our whitelist policy to the entire AWS global IP range, the calls will fail. We also asked if we could get the public IP range for KMS so we could add it to our whitelist policy, which AWS was not able to provide.  Opening the whitelist policy in order to leverage KMS is a compromise in our security posture, but it is required to use the service.

Denying actions when conditions like source IP or VPC ID aren't met is very powerful.  But such controls also bring you back to the reality of the public service that is AWS.  While it would be great if AWS could group their public IP usage by service (as they do in some cases), it's obvious why that would be difficult to do for all services (especially those that leverage EC2). An alternate solution would be for AWS to offer a VPC endpoint for KMS, but this has not happened yet. Keep AWS itself in mind when you start thinking about how to impose global security controls.

Need help with your AWS security?

Although this example is very specific, adding security layers can be a powerful tool when implemented correctly.  Foghorn is here to help you with your cloud security; don't hesitate to reach out!


Posted in Amazon Web Services, AWS, Cloud, Security

Optimizing Cloud Spend – When Context Counts

A little while ago, I was asked by a customer to help them lower their Amazon Web Services monthly spend.  They were spending about $8,000 per month, and set an internal goal to get that down to $7,000 in 30 days, and to $6,000 in 60 days.

At first glance, there was no easy fix.  Cost optimization tools are great for finding under-utilized servers, unattached volumes, and other clear opportunities to save.  But this customer was in decent shape. No server sprawl driving up costs, and although there was a significant amount of high performance block storage, nothing looked completely out of bounds.

After digging through the monthly bill day by day and grouping by API call, we were able to find a great opportunity to save money.  Although the customer was utilizing all of the services they had provisioned, there was a better way.  By changing the type of storage for a few of the volumes on some of their critical servers, we were able to decrease the bill by about $1,500 per month, achieving most of the 60-day cost optimization goal in about 15 minutes.

We do this all the time, for all of our customers. This is possible because we have context.  Analytics tools are critical for us to do our job, but they do not stand alone.  We understand the application, the performance requirements, and the business impact to our customers if resources are overburdened. We can also offer alternative architectures which accomplish the same result for a lower price.

Foghorn offers monthly operations reviews for all of our platform customers at no additional cost. We take the best practices learned across our customer base and apply them in the context of your application and your business requirements.

If you want more for less, talk to us about the benefits of working with a certified AWS Consulting Partner and Channel Reseller.

Posted in Amazon Web Services, AWS, Public Cloud

To NAT or to Proxy, That is the Question…

A Better Way to Manage Internet Access for VPC Resources

Anyone who has run OpsWorks stacks with private instances relying on NAT for Internet access may have seen firsthand what danger lurks beneath the surface.  These instances will show as unhealthy to OpsWorks when NAT dies, and if you have Auto-Healing turned on (who wouldn’t?), get ready for OpsWorks auto-destruction (and to think your app was running just fine the whole time). (Read my NAT write-up).

What about that AWS Managed NAT Gateway?

When I started writing this post, AWS' Managed NAT Gateway had not been released.  Such is the speed of innovation at AWS that I can't even get my blog post out before they improve on VPC NAT infrastructure.  Even with this new and very welcome product, this blog post is still valid.  While the Managed NAT Gateway solves many of the pains of running your own NAT instances, it still adheres to the same paradigms.  So, how about addressing Internet access in an entirely different way than NAT and eliminating those limitations?  Enter the proxy.  Truthfully, a proxy is more complicated to manage than NAT (especially compared to a managed gateway), but the benefits may outweigh the complexities.  We solved these complexities by packaging up a proxy as a self-contained OpsWorks stack in a CloudFormation template, enabling users to easily manage secure Internet access.  As a primer though, proxy?

Proxy?

Yes, proxy.  The Foghorn Web Services (FWS) proxy is a caching proxy to the Internet.  In normal use cases, this type of proxy is designed to increase performance through caching.  In our application, we use it for extended access controls and domain-based whitelisting.  But the advantages don't stop at security.

Everybody Gets Internet!

Transitive Relationships & VPC Peering

Many of our customers want unique VPCs for isolation, whether based on their end customers, their own segmentation preferences, or other reasons like dedicated tenancy within a VPC (but not within all of their VPCs).  In any case, once an architecture extends to more than one VPC, you naturally find a reason to connect them.  This most commonly ends up being a shared services layer for things like VPN access, CI/CD tools, source code repositories, log aggregation, and monitoring.  It would be nice if you could manage Internet access in this same shared service tier, but VPC Peering does not support transitive relationships.  If you don't feel like reading up on that, the basic premise is that traffic can go from one VPC to another when peered, but cannot go through that peered VPC to somewhere else (like a NAT instance, a NAT Gateway, a Virtual Gateway, an Internet Gateway, etc.).


This means that a NAT service layer in a shared services VPC cannot be leveraged by peered VPCs for shared Internet access, so each VPC now needs its own NAT service layer.  A proxy, however, does not behave like NAT.  The FWS proxy layer residing in a shared services VPC can enable Internet access for peered VPCs.  Now you can manage Internet access like any other shared service.  Proxy 1, NAT 0.

You’re Trying to go Where?

Whitelisting

With NAT we can easily apply port-based traffic control by simply modifying the security group used by NAT.  So instead of all traffic, we can limit it to just 22/80/443, for example.  But the ports used are only part of the equation.  What if we wanted to whitelist by destination domain?  Why allow access to anywhere on the Internet over 443 when all we really need is GitHub and AWS?  Few would argue that managing a whitelist by IP address is easier than managing one by domain name.  Again, this is where the FWS proxy shines over NAT.  We can easily create whitelist files and have our OpsWorks stack pull them from S3 and update the proxy during a configure lifecycle recipe.  The inverse is true as well: we can also manage blacklists with our proxy.  But since we are going for a least-necessary access list, we prefer whitelisting.  Consider for a minute how great it is to just drop traffic you don't need.  Proxy 2, NAT 0.

Caching

While the main purpose of this blog post is to address the network and security benefits of a proxy over NAT and how we manage the FWS proxy via OpsWorks, it doesn't have to stop there.  Our proxy can be configured as a DNS cache and/or a web cache.  Storage is inexpensive at AWS, so why not cache some static content and speed up deployment runs?  The same is true for DNS.  While it can be risky to have long cache times on DNS results, if we pick something reasonable, we can speed things up.  Either way, these are options, not requirements.  We are extending our Internet access service tier features.  Proxy 3, NAT 0.

Who is Pulling from THAT repo? 

Easy to manage?  Surely you can’t be serious?

I am serious, and don’t call me Shirley.  We built a CloudFormation template that creates an Elastic Load Balancer, OpsWorks stack, FWS Proxy layer and Load Scaling instances (as well as its own security group needs and IAM requirements).  It adds our custom recipes to the setup and configure lifecycle events.  Lastly, it writes the common configuration elements of the proxy to the Custom JSON of the Stack, enabling users to easily manage parameters without needing to understand how Chef works.  The most common change is editing the domain whitelist for the proxy.  This is accomplished via a simple text file stored in S3.  The configure recipe retrieves that file, overwrites the current whitelist with the new one and reloads the proxy service.  This is significantly faster than deploying a new proxy instance (say through an autoscaling group with the whitelist as part of the launch configuration).

But wait, there’s more…

Whitelisting is great, but sometimes you need to figure out what traffic needs to be part of the whitelist.  We have a true/false configuration item in the OpsWorks stack custom JSON which will either ignore (false) or enforce (true) the whitelist.  This, along with CloudWatch Logs integration (also part of the template), helps you determine what domains are currently being used.  You can go to the CloudWatch Logs interface and simply view the log streams.  They contain all the domains being accessed, including the HTTP result code.  This can be used to build the whitelist, audit Internet activity, or even serve as a troubleshooting tool.


But I Need Persistent Source IPs

What if some of your destination services require whitelisting your addresses?  Not a problem: we have enabled Elastic IP addresses to provide persistent public IP addressing for a dynamic proxy fleet (the default behavior; this can be disabled to save on EIP costs).

The default configuration even accounts for all private networks supported by AWS VPC.  You don’t have to worry about what network ranges are chosen, just that they have outbound security group access to the FWS proxy ELB.

So what’s the catch?

Like anything, you don't know what you don't know.  If you don't know where on the Internet your servers need to go, be prepared for some short-term pain.  Furthermore, the adoption of a proxy server varies based on the operating system you are using and how you build your servers.  You will need to adjust the core OS to support the proxy server (if you plan to completely remove other Internet access).  Not only do you need to know how to configure each service to use your proxy, you also need to know where that service intends to go.  Domain-based whitelisting is a powerful security tool, but it is predicated on you knowing the destination domains.  Refer to the earlier section on how we make that process easier with this solution.

On a closing note, while a proxy may be a great substitute for NAT in certain use cases, it is still simply an alternative tool.  If the concepts discussed here are appealing, a proxy may be a better architecture for you than NAT.  But that is not to say that there is no place for NAT; there certainly still is (refer to my previous post on NAT).

AWS Marketplace Cluster Perhaps?

Not interested in the user-focused, OpsWorks-managed environment?  No problem: thanks to AWS' new Marketplace support for Clusters, Foghorn is creating a cluster solution encapsulating the core principles of this post, available to launch via the Marketplace Cluster feature set.  Stay tuned for details on that offering.


Need help?

Although this packaged service is user-friendly, the key is understanding your environment and Internet usage before migrating over.  Foghorn is here to help you with your cloud initiative; don't hesitate to give us a call!

Posted in Amazon Web Services, AWS, Cloud, Security

Security Groups got you down? Get Security Flow!

Here at Foghorn Consulting, we've been designing, implementing, and managing point-to-point security with AWS security groups for years.  Security groups allow an amazingly granular method of controlling communication between instances without being bound to using networks as the filter.  This allows us to design flat networks while still creating very granular and tightly controlled communication.

But with great power comes great responsibility.

We’ve seen many environments where many individuals have the capability of adding and modifying security groups.  After several years of organic change, it becomes difficult to really understand what all of those rules are doing.

Visualize your Rules with Security Flow


To help our customers overcome this challenge, we've recently released a new feature in our AWS reseller portal, available for free to all of our AWS direct customers.  It's called Security Flow.  With Security Flow, you can instantly generate a visual representation of the security posture of your VPC.  It's great for audits, design sessions, and even troubleshooting connectivity issues.

Learn more about the value added services available when you buy AWS directly from Foghorn Consulting.

Interested?  Contact us and get it for free!


Posted in Amazon Web Services, AWS, Cloud, Public Cloud, Security