Drift management in cloud infrastructure

Over the past few years, the number of infrastructure services has grown. More applications are released to production every day, and infrastructure has to be spun up, scaled, and torn down frequently. The adoption of CI/CD and DevOps practices makes it all the more important to keep runtime environments consistent with one another. Without an Infrastructure as Code (IaC) practice in place, it becomes increasingly difficult to manage the scale of today’s average infrastructure environment.

IaC safeguards the cloud provisioning process and ensures consistency across environments by codifying and documenting configuration specifications. IaC tools like Terraform help dev and ops teams align, because both use the same description of an application’s deployment. In an ideal world, everything would be managed by your IaC stack, but expectations do not always line up with reality: resources are still provisioned manually or through the cloud provider’s console, causing infrastructure drift and a growing number of untracked assets.

Identifying cloud resources that are not managed by IaC is a challenge, and checking whether managed resources still match the configuration defined in code is yet another task. This blog post explores the various tools available for detecting and managing infrastructure drift.

What’s Infrastructure Drift

Infrastructure drift occurs when the configuration of the infrastructure deviates from its intended or documented state. This deviation can have several causes, such as human error, lack of automation, manual intervention, applications making unwanted changes, or changes applied to some environments but not propagated to others, all of which lead to inconsistencies in the infrastructure. Additionally, failed CI/CD pipelines can leave the infrastructure state corrupted, leading to orphaned resources and drift.

One major cause of infrastructure drift is the creation of resources outside of the established IaC tools such as Terraform, CloudFormation, and Pulumi. When this happens, the infrastructure state is not adequately described or persisted, and the changes made to the infrastructure go unnoticed. This opens the door to security vulnerabilities, wasted costs, and compliance issues.

In some cases, production incidents or emergencies require quick action, and manual adjustments to the infrastructure via web consoles may be necessary to reach a better state as soon as possible (and keep customers satisfied). This becomes a problem when those changes are not backported to Terraform, which often stems from poor education on IaC best practices, loose access permissions, and a lack of proper communication about the infrastructure management process.

Why it’s bad

Infrastructure drift can have a significant impact on the reliability, security, and cost-effectiveness of the infrastructure. One major issue caused by infrastructure drift is the wastage of cloud resources, which can lead to increased costs. Drifting can result in the creation of duplicate resources or the failure to delete unused ones, leading to an unoptimized cloud environment.

Furthermore, infrastructure drift can pose a significant threat to the security of the infrastructure. Inconsistent configurations can make the infrastructure vulnerable to security breaches and data leaks. Such inconsistencies can lead to essential resources unintentionally being made publicly accessible, and unsecured resources may go unnoticed. However, if changes to infrastructure were made through IaC, it would be possible to set up compliance policies and security controls, preventing or mitigating issues such as an S3 bucket being accessible to the public and making sure all resources are properly tagged.
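As a rough illustration of such a policy, the sketch below checks that every taggable resource in a Terraform plan carries a set of required tags. It assumes the plan was exported with "terraform show -json" and parsed into a dict; the required tag names are hypothetical, and a dedicated policy tool would normally replace a hand-rolled check like this.

```python
# Hypothetical tagging policy; adjust to your organisation's conventions.
REQUIRED_TAGS = {"owner", "environment"}


def missing_tags(plan: dict, required=REQUIRED_TAGS) -> dict:
    """Map resource address -> sorted list of required tags it lacks.

    `plan` is assumed to be the parsed JSON plan representation, which
    lists planned changes under "resource_changes".
    """
    violations = {}
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        # Skip resources being deleted and resource types without tags.
        if after is None or "tags" not in after:
            continue
        tags = after.get("tags") or {}  # "tags" may be null when unset
        missing = set(required) - set(tags)
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations
```

A CI job could fail the pipeline whenever the returned dict is not empty, so that untagged resources never reach the cloud account.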

Infrastructure fragmentation is another problem that can arise from infrastructure drift. As the infrastructure becomes more complex, it becomes more difficult to track all resources and changes. This can lead to situations where development teams are unaware of production environment changes, which can cause applications to crash and deployment projects to fail unexpectedly. Moreover, when the IaC tool does not cover the entire infrastructure, it can cause discrepancies between the different environments, leading to inconsistent behavior. This inconsistency can be particularly problematic between the development, staging, and production environments.

Without a single, shared source of truth, intentional changes made to remediate an incident could be reverted by the next deployment, and temporary changes could be left in place unnoticed, potentially wasting thousands of dollars in monthly costs on unused resources.

Cloud workloads undergo frequent changes as more workloads and services are deployed to the infrastructure, resulting in more developers and authenticated services interacting with the infrastructure across various cloud environments and providers. Drift is inevitable, just like incidents, and is a part of the infrastructure’s life cycle. Therefore, it’s crucial to be able to easily and quickly detect and possibly revert drift.

Drift Management

Preventing and resolving infrastructure drift is crucial to maintain the stability and security of the infrastructure. Increasing the adoption of IaC is one of the most effective ways to prevent infrastructure drift. Teams should ensure that a greater percentage of the infrastructure is managed by IaC and leverage code versioning, code reviews, static analysis, automated tests, and so on.

When resources are created using IaC tools, drift can be detected and resolved promptly. For instance, running a command like “terraform plan” can reveal any drift in resources described in the Terraform files.
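A scheduled job can automate this check. The sketch below wraps "terraform plan" with the -detailed-exitcode flag, which makes the command exit with 0 when there is no diff, 1 on error, and 2 when a diff is present. The function names and result labels are illustrative. Note that exit code 2 also covers changes written in code but not yet applied, so it signals "plan is not empty" rather than drift alone.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
# The labels on the right are illustrative names, not Terraform terms.
PLAN_EXIT_MEANINGS = {
    0: "in_sync",
    1: "error",
    2: "drift_or_pending_changes",
}


def interpret_plan_exit_code(code: int) -> str:
    """Translate a plan exit code into a short status label."""
    return PLAN_EXIT_MEANINGS.get(code, "unknown")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether the plan is empty.

    Assumes the terraform binary is on PATH and that `terraform init`
    has already been run in the directory.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Calling check_drift on a Terraform working directory from a cron job or CI pipeline gives a simple signal that can be forwarded to an alerting channel.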

Terraform drift detection

In the screenshot above, we can see that the EC2 instance owner was changed outside of Terraform, which is drift.

CloudFormation has a built-in drift detection feature that can be used from either the AWS Console or the AWS CLI.
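The same feature is exposed through the CloudFormation API, so it can also be scripted. The sketch below starts a drift detection run and polls until it finishes. It assumes a boto3 CloudFormation client is passed in as cfn; the function name, polling interval, and timeout are illustrative choices.

```python
import time


def wait_for_stack_drift(cfn, stack_name, poll_seconds=5.0, max_polls=60):
    """Start drift detection on a stack and return its StackDriftStatus.

    `cfn` is expected to behave like boto3.client("cloudformation").
    The returned value is a status string such as "DRIFTED" or "IN_SYNC".
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)[
        "StackDriftDetectionId"
    ]
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection for {stack_name} did not finish")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(
            status.get("DetectionStatusReason", "drift detection failed")
        )
    return status["StackDriftStatus"]
```

With boto3 installed and credentials configured, this could be called as wait_for_stack_drift(boto3.client("cloudformation"), "my-stack"), where "my-stack" is a placeholder stack name.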

Regular testing and monitoring are also critical for detecting and resolving issues caused by infrastructure drift. Open-source tools like driftctl, Terrascan, and Cloud Custodian can also be leveraged to detect changes made outside of the regular IaC workflow and ensure prompt remediation.

In addition to tracking infrastructure changes, it is crucial to track who is provisioning what, where, and how often. Tracking those changes across multiple cloud providers and accounts is challenging, and manually checking provisioned resources is time-consuming. Tools like Komiser can be used to build a queryable asset inventory and get a clear picture of the cloud infrastructure. Komiser can detect drift in managed resources and surface unmanaged resources in multi-cloud environments, so that they can be brought under control to maintain consistency and prevent security risks.

Cloud asset inventory

After loading the cloud assets into the Komiser dashboard, teams can use filters and views to query the inventory and identify any unmanaged resources. This lets them manage their cloud infrastructure efficiently and ensure that all resources are tracked and accounted for through their IaC workflows.
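The underlying idea is a set difference between what exists in the cloud and what the IaC state knows about. The sketch below is not how Komiser works internally; it assumes you have a plain list of resource IDs from some inventory source and the parsed output of "terraform show -json" for the current state. The function names are illustrative.

```python
def managed_ids(state: dict) -> set:
    """Collect resource IDs from parsed `terraform show -json` output.

    Walks the root module and any nested child modules, keeping only
    managed resources (data sources are skipped).
    """
    ids = set()

    def walk(module: dict) -> None:
        for res in module.get("resources", []):
            if res.get("mode") == "managed":
                rid = (res.get("values") or {}).get("id")
                if rid:
                    ids.add(rid)
        for child in module.get("child_modules", []):
            walk(child)

    walk((state.get("values") or {}).get("root_module", {}))
    return ids


def unmanaged_resources(inventory_ids, state: dict) -> list:
    """Return inventory IDs that do not appear in the Terraform state."""
    return sorted(set(inventory_ids) - managed_ids(state))
```

Matching on the id attribute is a simplification: some providers use ARNs or names as identifiers, so a real comparison may need per-resource-type normalization.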

Cloud Resource Coverage

It is also important to schedule regular drift detection checks to identify any changes that may have occurred. For instance, an hourly check may be appropriate for detecting changes in IAM roles, while a daily check may suffice for less critical cloud services. To minimize the possibility of drift caused by manual changes, it is recommended to follow the principle of least privilege and grant cloud practitioners only the permissions their tasks require. This approach reduces the number of individuals who can make manual changes to the infrastructure.
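One way to express such a schedule is a table of per-service intervals that a periodic job consults. The sketch below is a minimal example: the service names and intervals are illustrative, mirroring the hourly IAM and daily examples above, and the last-run timestamps are assumed to be stored somewhere by the caller.

```python
from datetime import datetime, timedelta

# Illustrative intervals: critical services are checked more often.
CHECK_INTERVALS = {
    "iam": timedelta(hours=1),
    "s3": timedelta(days=1),
}


def due_checks(last_run: dict, now: datetime) -> list:
    """Return the services whose drift check is due at `now`.

    `last_run` maps service name -> datetime of the previous check.
    Services that have never been checked are always due.
    """
    due = []
    for service, interval in CHECK_INTERVALS.items():
        previous = last_run.get(service)
        if previous is None or now - previous >= interval:
            due.append(service)
    return sorted(due)
```

A job that runs every few minutes can call due_checks and trigger only the drift checks that are actually due, which keeps API usage low for less critical services.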

In summary, preventing infrastructure drift requires a proactive approach, and a combination of practices and tools can be leveraged to achieve this goal. By increasing IaC adoption, regularly testing and monitoring the infrastructure, and leveraging tools like driftctl and Komiser, teams can detect and resolve drift promptly, maintain consistency, and prevent security risks and bill shocks.

Whether you are a developer, a DevOps engineer, or a cloud engineer, dealing with the cloud can be tough at times, especially on your own. If you are using Tailwarden or Komiser and want to share your thoughts, doubts, and insights with other cloud practitioners, feel free to join our Tailwarden Discord server, where you will find tips, community calls, and much more.
