In this blog post, you will use AWS Controllers for Kubernetes (ACK) on an Amazon EKS cluster to put together a solution in which data from an Amazon SQS queue is processed by an AWS Lambda function and persisted to a DynamoDB table.

ACK leverages Kubernetes Custom Resources and Custom Resource Definitions to let you manage and use AWS services directly from Kubernetes, without needing to define resources outside of the cluster. The idea behind ACK is to enable Kubernetes users to describe the desired state of AWS resources using the Kubernetes API and configuration language. ACK then takes care of provisioning and managing the AWS resources to match the desired state. This is achieved by service controllers, each responsible for managing the lifecycle of the resources of a particular AWS service.

There is no single ACK container image. Instead, each ACK service controller is packaged into its own container image, published in a public repository for that controller, and manages resources for a particular AWS API.

This blog post will walk you through how to use the SQS, DynamoDB, and Lambda service controllers for ACK.

Prerequisites

To follow along step-by-step, in addition to an AWS account, you will need to have the AWS CLI, kubectl, and Helm installed.

There are a variety of ways in which you can create an Amazon EKS cluster. I prefer using the eksctl CLI because of the convenience it offers. Creating an EKS cluster using eksctl can be as easy as this:

eksctl create cluster --name my-cluster --region region-code

For details, refer to Getting started with Amazon EKS – eksctl.

Clone this GitHub repository and change to the right directory:

git clone https://github.com/abhirockzz/k8s-ack-sqs-lambda
cd k8s-ack-sqs-lambda

Ok, let's get started!
Set Up the ACK Service Controllers for AWS Lambda, SQS, and DynamoDB

Install ACK Controllers

Log into the Helm registry that stores the ACK charts:

aws ecr-public get-login-password --region us-east-1 | helm registry login --username AWS --password-stdin public.ecr.aws

Deploy the ACK service controller for AWS Lambda using the lambda-chart Helm chart:

RELEASE_VERSION_LAMBDA_ACK=$(curl -sL "https://api.github.com/repos/aws-controllers-k8s/lambda-controller/releases/latest" | grep '"tag_name":' | cut -d'"' -f4)
helm install --create-namespace -n ack-system oci://public.ecr.aws/aws-controllers-k8s/lambda-chart "--version=${RELEASE_VERSION_LAMBDA_ACK}" --generate-name --set=aws.region=us-east-1

Deploy the ACK service controller for SQS using the sqs-chart Helm chart:

RELEASE_VERSION_SQS_ACK=$(curl -sL "https://api.github.com/repos/aws-controllers-k8s/sqs-controller/releases/latest" | grep '"tag_name":' | cut -d'"' -f4)
helm install --create-namespace -n ack-system oci://public.ecr.aws/aws-controllers-k8s/sqs-chart "--version=${RELEASE_VERSION_SQS_ACK}" --generate-name --set=aws.region=us-east-1

Deploy the ACK service controller for DynamoDB using the dynamodb-chart Helm chart:

RELEASE_VERSION_DYNAMODB_ACK=$(curl -sL "https://api.github.com/repos/aws-controllers-k8s/dynamodb-controller/releases/latest" | grep '"tag_name":' | cut -d'"' -f4)
helm install --create-namespace -n ack-system oci://public.ecr.aws/aws-controllers-k8s/dynamodb-chart "--version=${RELEASE_VERSION_DYNAMODB_ACK}" --generate-name --set=aws.region=us-east-1

Now, it's time to configure the IAM permissions that the controllers need to manage Lambda, DynamoDB, and SQS resources.

Configure IAM Permissions

Create an OIDC Identity Provider for Your Cluster

For the steps below, replace the values of the EKS_CLUSTER_NAME and AWS_REGION variables with your cluster name and region.
export EKS_CLUSTER_NAME=demo-eks-cluster
export AWS_REGION=us-east-1

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --region $AWS_REGION --approve

OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f2- | cut -d '/' -f2-)

Create IAM Roles for Lambda, SQS, and DynamoDB ACK Service Controllers

ACK Lambda Controller

Set the following environment variables:

ACK_K8S_SERVICE_ACCOUNT_NAME=ack-lambda-controller
ACK_K8S_NAMESPACE=ack-system
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)

Create the trust policy for the IAM role:

read -r -d '' TRUST_RELATIONSHIP <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${ACK_K8S_NAMESPACE}:${ACK_K8S_SERVICE_ACCOUNT_NAME}"
        }
      }
    }
  ]
}
EOF
echo "${TRUST_RELATIONSHIP}" > trust_lambda.json

Create the IAM role:

ACK_CONTROLLER_IAM_ROLE="ack-lambda-controller"
ACK_CONTROLLER_IAM_ROLE_DESCRIPTION="IRSA role for ACK lambda controller deployment on EKS cluster using Helm charts"
aws iam create-role --role-name "${ACK_CONTROLLER_IAM_ROLE}" --assume-role-policy-document file://trust_lambda.json --description "${ACK_CONTROLLER_IAM_ROLE_DESCRIPTION}"

Attach IAM policy to the IAM role:

# we are getting the policy directly from the ACK repo
INLINE_POLICY="$(curl https://raw.githubusercontent.com/aws-controllers-k8s/lambda-controller/main/config/iam/recommended-inline-policy)"

aws iam put-role-policy \
  --role-name "${ACK_CONTROLLER_IAM_ROLE}" \
  --policy-name "ack-recommended-policy" \
  --policy-document "${INLINE_POLICY}"

Attach ECR permissions to the controller IAM role. These are required since Lambda functions will be pulling images from ECR.
aws iam put-role-policy \
  --role-name "${ACK_CONTROLLER_IAM_ROLE}" \
  --policy-name "ecr-permissions" \
  --policy-document file://ecr-permissions.json

Associate the IAM role with a Kubernetes service account:

ACK_CONTROLLER_IAM_ROLE_ARN=$(aws iam get-role --role-name=$ACK_CONTROLLER_IAM_ROLE --query Role.Arn --output text)
export IRSA_ROLE_ARN=eks.amazonaws.com/role-arn=$ACK_CONTROLLER_IAM_ROLE_ARN
kubectl annotate serviceaccount -n $ACK_K8S_NAMESPACE $ACK_K8S_SERVICE_ACCOUNT_NAME $IRSA_ROLE_ARN

Repeat the steps for the SQS controller.

ACK SQS Controller

Set the following environment variables:

ACK_K8S_SERVICE_ACCOUNT_NAME=ack-sqs-controller
ACK_K8S_NAMESPACE=ack-system
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)

Create the trust policy for the IAM role:

read -r -d '' TRUST_RELATIONSHIP <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${ACK_K8S_NAMESPACE}:${ACK_K8S_SERVICE_ACCOUNT_NAME}"
        }
      }
    }
  ]
}
EOF
echo "${TRUST_RELATIONSHIP}" > trust_sqs.json

Create the IAM role:

ACK_CONTROLLER_IAM_ROLE="ack-sqs-controller"
ACK_CONTROLLER_IAM_ROLE_DESCRIPTION="IRSA role for ACK sqs controller deployment on EKS cluster using Helm charts"
aws iam create-role --role-name "${ACK_CONTROLLER_IAM_ROLE}" --assume-role-policy-document file://trust_sqs.json --description "${ACK_CONTROLLER_IAM_ROLE_DESCRIPTION}"

Attach IAM policy to the IAM role:

# for sqs controller, we use the managed policy ARN instead of the inline policy (unlike the Lambda controller)
POLICY_ARN="$(curl https://raw.githubusercontent.com/aws-controllers-k8s/sqs-controller/main/config/iam/recommended-policy-arn)"
aws iam attach-role-policy --role-name "${ACK_CONTROLLER_IAM_ROLE}" --policy-arn "${POLICY_ARN}"

Associate the IAM role with a Kubernetes service account:

ACK_CONTROLLER_IAM_ROLE_ARN=$(aws iam get-role --role-name=$ACK_CONTROLLER_IAM_ROLE --query Role.Arn --output text)
export IRSA_ROLE_ARN=eks.amazonaws.com/role-arn=$ACK_CONTROLLER_IAM_ROLE_ARN
kubectl annotate serviceaccount -n $ACK_K8S_NAMESPACE $ACK_K8S_SERVICE_ACCOUNT_NAME $IRSA_ROLE_ARN

Repeat the steps for the DynamoDB controller.

ACK DynamoDB Controller

Set the following environment variables:

ACK_K8S_SERVICE_ACCOUNT_NAME=ack-dynamodb-controller
ACK_K8S_NAMESPACE=ack-system
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)

Create the trust policy for the IAM role:

read -r -d '' TRUST_RELATIONSHIP <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${ACK_K8S_NAMESPACE}:${ACK_K8S_SERVICE_ACCOUNT_NAME}"
        }
      }
    }
  ]
}
EOF
echo "${TRUST_RELATIONSHIP}" > trust_dynamodb.json

Create the IAM role:

ACK_CONTROLLER_IAM_ROLE="ack-dynamodb-controller"
ACK_CONTROLLER_IAM_ROLE_DESCRIPTION="IRSA role for ACK dynamodb controller deployment on EKS cluster using Helm charts"
aws iam create-role --role-name "${ACK_CONTROLLER_IAM_ROLE}" --assume-role-policy-document file://trust_dynamodb.json --description "${ACK_CONTROLLER_IAM_ROLE_DESCRIPTION}"

Attach IAM policy to the IAM role:

# for dynamodb controller, we use the managed policy ARN instead of the inline policy (unlike the Lambda controller)
POLICY_ARN="$(curl https://raw.githubusercontent.com/aws-controllers-k8s/dynamodb-controller/main/config/iam/recommended-policy-arn)"
aws iam attach-role-policy --role-name "${ACK_CONTROLLER_IAM_ROLE}" --policy-arn "${POLICY_ARN}"

Associate the IAM role with a Kubernetes service account:

ACK_CONTROLLER_IAM_ROLE_ARN=$(aws iam get-role --role-name=$ACK_CONTROLLER_IAM_ROLE --query Role.Arn --output text)
export IRSA_ROLE_ARN=eks.amazonaws.com/role-arn=$ACK_CONTROLLER_IAM_ROLE_ARN
kubectl annotate serviceaccount -n $ACK_K8S_NAMESPACE $ACK_K8S_SERVICE_ACCOUNT_NAME $IRSA_ROLE_ARN

Restart ACK Controller Deployments and Verify the Setup

Restart the ACK service controller Deployments using the following commands. This updates the service controller Pods with the IRSA environment variables.

Get the list of ACK service controller Deployments:

export ACK_K8S_NAMESPACE=ack-system
kubectl get deployments -n $ACK_K8S_NAMESPACE

Restart the Lambda, SQS, and DynamoDB controller Deployments:

DEPLOYMENT_NAME_LAMBDA=<enter deployment name for lambda controller>
kubectl -n $ACK_K8S_NAMESPACE rollout restart deployment $DEPLOYMENT_NAME_LAMBDA

DEPLOYMENT_NAME_SQS=<enter deployment name for sqs controller>
kubectl -n $ACK_K8S_NAMESPACE rollout restart deployment $DEPLOYMENT_NAME_SQS

DEPLOYMENT_NAME_DYNAMODB=<enter deployment name for dynamodb controller>
kubectl -n $ACK_K8S_NAMESPACE rollout restart deployment $DEPLOYMENT_NAME_DYNAMODB

List the Pods for these Deployments. Verify that the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables exist for each controller Pod using the following commands:

kubectl get pods -n $ACK_K8S_NAMESPACE

LAMBDA_POD_NAME=<enter Pod name for lambda controller>
kubectl describe pod -n $ACK_K8S_NAMESPACE $LAMBDA_POD_NAME | grep "^\s*AWS_"

SQS_POD_NAME=<enter Pod name for sqs controller>
kubectl describe pod -n $ACK_K8S_NAMESPACE $SQS_POD_NAME | grep "^\s*AWS_"

DYNAMODB_POD_NAME=<enter Pod name for dynamodb controller>
kubectl describe pod -n $ACK_K8S_NAMESPACE $DYNAMODB_POD_NAME | grep "^\s*AWS_"

Now that the ACK service controllers have been set up and configured, you can create AWS resources!

Create SQS Queue, DynamoDB Table, and Deploy the Lambda Function

Create SQS Queue

In the file sqs-queue.yaml, replace the us-east-1 region with your preferred region and AWS_ACCOUNT_ID with your AWS account ID.
This is what the ACK manifest for the SQS queue looks like:

apiVersion: sqs.services.k8s.aws/v1alpha1
kind: Queue
metadata:
  name: sqs-queue-demo-ack
  annotations:
    services.k8s.aws/region: us-east-1
spec:
  queueName: sqs-queue-demo-ack
  policy: |
    {
      "Statement": [{
        "Sid": "__owner_statement",
        "Effect": "Allow",
        "Principal": {
          "AWS": "AWS_ACCOUNT_ID"
        },
        "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:AWS_ACCOUNT_ID:sqs-queue-demo-ack"
      }]
    }

Create the queue using the following command:

kubectl apply -f sqs-queue.yaml

# list the queue
kubectl get queue

Create DynamoDB Table

This is what the ACK manifest for the DynamoDB table looks like:

apiVersion: dynamodb.services.k8s.aws/v1alpha1
kind: Table
metadata:
  name: customer
  annotations:
    services.k8s.aws/region: us-east-1
spec:
  attributeDefinitions:
    - attributeName: email
      attributeType: S
  billingMode: PAY_PER_REQUEST
  keySchema:
    - attributeName: email
      keyType: HASH
  tableName: customer

You can replace the us-east-1 region with your preferred region.

Create a table (named customer) using the following command:

kubectl apply -f dynamodb-table.yaml

# list the tables
kubectl get tables

Build Function Binary and Create Docker Image

GOARCH=amd64 GOOS=linux go build -o main main.go

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

docker build -t demo-sqs-dynamodb-func-ack .
Create a private ECR repository, then tag and push the Docker image to ECR:

AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

aws ecr create-repository --repository-name demo-sqs-dynamodb-func-ack --region us-east-1

docker tag demo-sqs-dynamodb-func-ack:latest $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/demo-sqs-dynamodb-func-ack:latest

docker push $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/demo-sqs-dynamodb-func-ack:latest

Create an IAM execution role for the Lambda function and attach the required policies:

export ROLE_NAME=demo-sqs-dynamodb-func-ack-role

ROLE_ARN=$(aws iam create-role \
  --role-name $ROLE_NAME \
  --assume-role-policy-document '{"Version": "2012-10-17","Statement": [{ "Effect": "Allow", "Principal": {"Service": "lambda.amazonaws.com"}, "Action": "sts:AssumeRole"}]}' \
  --query 'Role.[Arn]' --output text)

aws iam attach-role-policy --role-name $ROLE_NAME --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

Since the Lambda function needs to write data to DynamoDB and read messages from SQS, let's add the following policies to the IAM role:

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "dynamodb-put" \
  --policy-document file://dynamodb-put.json

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "sqs-permissions" \
  --policy-document file://sqs-permissions.json

Create the Lambda Function

Update the function.yaml file with the following info:

imageURI - The URI of the Docker image that you pushed to ECR, e.g., <AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/demo-sqs-dynamodb-func-ack:latest
role - The ARN of the IAM role that you created for the Lambda function, e.g., arn:aws:iam::<AWS_ACCOUNT_ID>:role/demo-sqs-dynamodb-func-ack-role

This is what the ACK manifest for the Lambda function looks like:

apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: demo-sqs-dynamodb-func-ack
  annotations:
    services.k8s.aws/region: us-east-1
spec:
  architectures:
    - x86_64
  name: demo-sqs-dynamodb-func-ack
  packageType: Image
  code:
    imageURI: AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/demo-sqs-dynamodb-func-ack:latest
  environment:
    variables:
      TABLE_NAME: customer
  role: arn:aws:iam::AWS_ACCOUNT_ID:role/demo-sqs-dynamodb-func-ack-role
  description: A function created by ACK lambda-controller

To create the Lambda function, run the following command:

kubectl create -f function.yaml

# list the function
kubectl get functions

Add SQS Trigger Configuration

Add an SQS trigger, which will invoke the Lambda function when a message is sent to the SQS queue. Here is an example using the AWS Console:

Open the Lambda function in the AWS Console and click on the Add trigger button. Select SQS as the trigger source, select the SQS queue, and click on the Add button.

Now you are ready to try out the end-to-end solution!

Test the Application

Send a few messages to the SQS queue.
For the purposes of this demo, you can use the AWS CLI:

export SQS_QUEUE_URL=$(kubectl get queues/sqs-queue-demo-ack -o jsonpath='{.status.queueURL}')

aws sqs send-message --queue-url $SQS_QUEUE_URL --message-body user1@foo.com --message-attributes 'name={DataType=String, StringValue="user1"}, city={DataType=String,StringValue="seattle"}'

aws sqs send-message --queue-url $SQS_QUEUE_URL --message-body user2@foo.com --message-attributes 'name={DataType=String, StringValue="user2"}, city={DataType=String,StringValue="tel aviv"}'

aws sqs send-message --queue-url $SQS_QUEUE_URL --message-body user3@foo.com --message-attributes 'name={DataType=String, StringValue="user3"}, city={DataType=String,StringValue="new delhi"}'

aws sqs send-message --queue-url $SQS_QUEUE_URL --message-body user4@foo.com --message-attributes 'name={DataType=String, StringValue="user4"}, city={DataType=String,StringValue="new york"}'

The Lambda function should be invoked, and the data should be written to the DynamoDB table. Check the DynamoDB table using the CLI (or the AWS console):

aws dynamodb scan --table-name customer

Clean Up

After you have explored the solution, you can clean up the resources by running the following commands.

Delete the SQS queue, the DynamoDB table, and the Lambda function:

kubectl delete -f sqs-queue.yaml
kubectl delete -f function.yaml
kubectl delete -f dynamodb-table.yaml

To uninstall the ACK service controllers, run the following commands:

export ACK_SYSTEM_NAMESPACE=ack-system

helm ls -n $ACK_SYSTEM_NAMESPACE

helm uninstall -n $ACK_SYSTEM_NAMESPACE <enter name of the sqs chart>
helm uninstall -n $ACK_SYSTEM_NAMESPACE <enter name of the lambda chart>
helm uninstall -n $ACK_SYSTEM_NAMESPACE <enter name of the dynamodb chart>

Conclusion and Next Steps

In this post, we have seen how to use AWS Controllers for Kubernetes to create a Lambda function, an SQS queue, and a DynamoDB table and wire them together to deploy a solution. All of this (almost) was done using Kubernetes!
I encourage you to try out other AWS services supported by ACK. Here is a complete list. Happy building!
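For readers who want a feel for what happens to each test message between the queue and the table, here is a minimal Python sketch of the mapping step. It is an illustration only: the function in the repository is written in Go and remains the authority, and the names record_to_item and event_to_items are hypothetical. The sketch assumes the message body is used as the email key (matching the table's key schema) and that each String message attribute (name, city) becomes an extra string attribute on the item.

```python
from typing import Any, Dict, List

# Partition key of the `customer` table, as declared in dynamodb-table.yaml.
KEY_ATTRIBUTE = "email"


def record_to_item(record: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Map one SQS record (in the shape Lambda receives) to a DynamoDB item.

    Assumption: the message body holds the email address, and every String
    message attribute is copied onto the item as a string attribute.
    """
    item = {KEY_ATTRIBUTE: {"S": record["body"]}}
    for name, attr in record.get("messageAttributes", {}).items():
        # Skip non-string attributes and anything that would clobber the key.
        if attr.get("dataType") == "String" and name != KEY_ATTRIBUTE:
            item[name] = {"S": attr["stringValue"]}
    return item


def event_to_items(event: Dict[str, Any]) -> List[Dict[str, Dict[str, str]]]:
    """Convert a whole SQS batch event into a list of DynamoDB items."""
    return [record_to_item(r) for r in event.get("Records", [])]


if __name__ == "__main__":
    # Mirrors the first `aws sqs send-message` call from the test step.
    sample_event = {
        "Records": [
            {
                "body": "user1@foo.com",
                "messageAttributes": {
                    "name": {"dataType": "String", "stringValue": "user1"},
                    "city": {"dataType": "String", "stringValue": "seattle"},
                },
            }
        ]
    }
    for item in event_to_items(sample_event):
        print(item)
```

Each resulting item is in the attribute-value form that a DynamoDB PutItem call expects, which is also the shape that aws dynamodb scan prints back.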
Fargate vs. Lambda has recently been a trending topic in the serverless space. Fargate and Lambda are two popular serverless computing options available within the AWS ecosystem. While both offer serverless computing, they differ in use cases, operational boundaries, runtime resource allocations, price, and performance. This blog takes a deeper look at the Fargate vs. Lambda battle.

What Is AWS Fargate?

AWS Fargate is a serverless compute engine offered by Amazon that enables you to run containers without the hassle of provisioning servers and the underlying infrastructure. With cluster capacity management, infrastructure management, patching, and resource provisioning taken off your plate, you can focus on delivering faster and better-quality applications. AWS Fargate works with Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS), supporting a range of container use cases such as machine learning applications, microservices architecture apps, on-premises app migration to the cloud, and batch processing tasks.

Without AWS Fargate
- Developers build container images
- Define EC2 instances and deploy them
- Provision memory and compute resources and manage them
- Create separate VMs to isolate applications
- Run and manage applications
- Run and manage the infrastructure
- Pay EC2 instance usage charges

When AWS Fargate Is Implemented
- Developers build container images
- Define compute and memory resources
- Run and manage apps
- Pay compute resource usage charges

In the Fargate vs. Lambda context, Fargate is the serverless compute option in AWS used when you already have containers for your application and simply want to orchestrate them more easily and quickly. It works with Elastic Kubernetes Service (EKS) as well as Elastic Container Service (ECS).

EKS and ECS have two types of computing options:

1. EC2 type: With this option, you need to deal with the complexity of configuring instances/servers.
This can be a challenge for inexperienced users. You must set up your EC2 instances and place containers on those servers with some help from the ECS or EKS configurations.

2. Fargate type: This option allows you to reduce the server management burden while easily updating and increasing the configuration limits required to run Fargate.

What Is Serverless?

Before delving deep into the serverless computing battle of Lambda vs. Fargate, it’s important to gain a basic understanding of the serverless concept. Serverless computing is a technology that enables developers to run applications without needing to provision server infrastructure. The cloud provider supplies the backend infrastructure on demand and charges you according to a pay-as-you-go model.

The term “serverless” might be misleading for some people. Serverless technology doesn’t imply the absence of servers. Rather, the cloud provider manages the server infrastructure, allowing developers to concentrate their efforts on an app’s code and logic. Resources are spun up when the code executes a function and are released when the function stops. Billing is based on the execution time of those resources. Therefore, operational costs are optimized because you don’t pay for idle resources.

With serverless technology, you can say goodbye to capacity planning, administrative burdens, and maintenance. Furthermore, you get high availability and disaster recovery at no additional cost. Auto-scaling to zero is also available. Finally, you pay only for the resources you actually use, and billing is granular, metered in small units of time (AWS Lambda, for example, bills by the millisecond).

What Is AWS Lambda?

AWS Lambda is an event-driven serverless computing service. Lambda runs predefined code in response to an event or action, enabling developers to perform serverless computing. The service was developed by Amazon and first released in 2014.
It supports major programming languages such as C#, Python, Java, Ruby, Go, and Node.js. It also supports custom runtimes. Some of the popular use cases of Lambda include updating a DynamoDB table, uploading data to S3 buckets, and running code in response to IoT sensor data. The pricing is based on milliseconds of usage, rounded up to the nearest millisecond. Moreover, Lambda allows you to run Docker container images of up to 10 GB via ECR.

When you compare Fargate vs. Lambda, Fargate is for containerized applications running for days, weeks, or years. Lambda is designed specifically to handle small portions of an application, such as a function. For instance, a function that clears the cache every 6 hours and runs for 30 seconds can be executed using Lambda.

A Typical AWS Lambda Architecture

AWS Lambda is a Function-as-a-Service (FaaS) that helps developers build event-driven apps. Lambda sits in the app’s compute layer, where AWS events trigger it.

What are the three core components of Lambda architecture?

1) Function: A function is a piece of code written by developers to perform a task. The code also contains the details of the runtime environment of the function. The runtime environments are based on Amazon Linux AMI and contain all required libraries and packages. Capacity and maintenance are handled by AWS.
   a. Code Package: The packaged code containing the binaries and assets required for the code to run. The maximum size is 250 MB, or 50 MB in a compressed version.
   b. Handler: The starting point of the invoked function, running a task based on parameters provided by event objects.
   c. Event Object: A parameter provided to the Handler to perform the logic for an operation.
   d. Context Object: Facilitates interaction between the function code and the execution environment. The data available for Context Objects include:
      i. AWS Request ID
      ii. Remaining time for the function to time out
      iii. Logging statements to CloudWatch

2) Configuration: Rules that specify how a function is executed.
   a. IAM Roles: Assign permissions for functions to interact with AWS services.
   b. Network Configuration: Specifies rules to run functions inside a VPC or outside a VPC.
   c. Version: Lets you revert functions to previous versions.
   d. Memory Dial: Controls resource allocations to functions.
   e. Environment Variables: Values injected into the code at runtime.
   f. Timeout: The maximum time a function is allowed to run.

3) Event Source: The event that triggers the function.
   a. Push Model: Functions triggered via S3 objects, API Gateway, and Amazon Alexa.
   b. Pull Model: Lambda pulls events from DynamoDB or Kinesis.

A Typical AWS Fargate Architecture

What are the four core components of the AWS Fargate architecture?

1) Task Definition: A JSON file that describes at least one of the application containers.
2) Task: An instantiation of a task definition at the cluster level.
3) Cluster: Tasks or services logically grouped in Amazon ECS.
4) Service: A process that runs tasks in an Amazon ECS cluster based on task definitions.

Fargate vs Lambda: Performance

As far as performance is concerned in the AWS Fargate vs. Lambda debate, AWS Fargate is the winner, as it runs on dedicated resources. Lambda has certain limitations when it comes to allocating computing and memory resources. Based on the selected amount of RAM, AWS allocates the corresponding CPU resources, meaning that the user cannot customize CPU resources. Moreover, the maximum available memory for Lambda functions is 10 GB, whereas Fargate allows for 120 GB of memory. Furthermore, Fargate allows you to choose up to 16 vCPUs. Another notable issue is that a Lambda function can run for at most 15 minutes per invocation. On the other hand, with no runtime limit, the Fargate environment is always in a warm state.
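The limits quoted above can be folded into a quick first-pass check of which platform a workload could run on at all. The following is a minimal sketch, not an official sizing tool: the Workload class and candidate_platforms function are hypothetical names, and the constants simply restate the figures given in this article, which may change over time.

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted in this article; verify against current AWS documentation.
LAMBDA_MAX_MEMORY_GB = 10
LAMBDA_MAX_RUNTIME_MINUTES = 15
FARGATE_MAX_MEMORY_GB = 120
FARGATE_MAX_VCPU = 16


@dataclass
class Workload:
    memory_gb: float        # peak memory the workload needs
    runtime_minutes: float  # longest single run
    vcpu: float = 1.0       # only checked for Fargate (see below)


def candidate_platforms(w: Workload) -> List[str]:
    """Return the platforms whose quoted hard limits the workload fits within.

    vCPU is not checked for Lambda because, as the text notes, Lambda derives
    CPU from the selected memory and does not let you set it directly.
    """
    candidates = []
    if (w.memory_gb <= LAMBDA_MAX_MEMORY_GB
            and w.runtime_minutes <= LAMBDA_MAX_RUNTIME_MINUTES):
        candidates.append("lambda")
    if w.memory_gb <= FARGATE_MAX_MEMORY_GB and w.vcpu <= FARGATE_MAX_VCPU:
        candidates.append("fargate")
    return candidates


if __name__ == "__main__":
    print(candidate_platforms(Workload(memory_gb=2, runtime_minutes=5)))
    print(candidate_platforms(Workload(memory_gb=64, runtime_minutes=5)))
    print(candidate_platforms(Workload(memory_gb=2, runtime_minutes=60)))
```

A check like this only rules options out. Whether a workload that fits both platforms should run on one or the other still depends on start-up time, cost, and operational factors.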
Fargate workloads must be packaged into containers, which increases the load time to around 60 seconds. This is a very long time compared to Lambda functions, which can get started within 5 seconds. Fargate allows you to launch 20 tasks per second using the ECS RunTask API. Moreover, you can launch 500 tasks per service in 120 seconds with the ECS Service Scheduler. That said, scaling the environment during unexpected request spikes and health monitoring tend to add some delay to start-up time.

Lambda Cold Starts

When Lambda receives a request to execute a task, it starts by downloading the code from S3 buckets and creating an execution environment based on the predefined memory and its corresponding compute resources. If there is any initialization code, Lambda runs it outside the handler and then runs the handler code. The time required for downloading the code and preparing the execution environment is counted as the cold start duration. After executing the code, Lambda freezes the environment so that the same function can run quickly if invoked again. If you run the function concurrently, each concurrent invocation that needs a new environment gets a cold start. There will also be a cold start if the code is updated. The typical time for cold starts falls between 100 ms and 1 second. In light of the foregoing, Lambda falls short in the Lambda vs. Fargate race regarding cold starts. However, Provisioned Concurrency is a solution to reduce cold starts. The runtime choice also has an impact on Lambda cold starts. For instance, the Java runtime needs additional resources to start the JVM, which delays the start. On the other hand, the C# and Node.js runtime environments offer lower latencies.

Fargate Cold Starts

Fargate takes time to provision resources and start a task. Once the environment is up and running, containers get dedicated resources and run the code as defined.

Fargate vs. Lambda: Support

AWS Fargate works as an operational layer of a serverless computing architecture to manage Docker-based ECS or Kubernetes-based EKS environments. For ECS, you can define container tasks in text files using JSON. There is support for other runtime environments as well. Fargate offers more capacity deployment control than Lambda, as Lambda is limited to 10 GB of space, 10 GB of package size for container images, and 250 MB for deployments to S3 buckets. Lambda supports all major programming languages, such as Python, Go, Ruby, PHP, C#, Node.js, and Java, and code compilation tools, such as Maven and Gradle. That said, Lambda only supports Linux-based container images. With Fargate, you can develop Docker container images locally using Docker Compose and run them in Fargate without worrying about compatibility issues. Since development and architecture are independent of Fargate, it outperforms Lambda in this particular category. When more control over the container environment is the key requirement, AWS Fargate is definitely the right choice.

Fargate vs. Lambda: Costs

When comparing Fargate vs. Lambda costs, it is important to note that the two tools serve different purposes. While Lambda is a Function-as-a-Service, Fargate is a serverless computing tool for container-based workloads. Lambda costs are billed in milliseconds. AWS Lambda charges $0.20 per 1 million requests plus $0.0000166667 for every GB-second of duration for the first 6 billion GB-seconds per month. The duration costs vary based on the allocated memory. For instance, 128 MB of memory costs you $0.0000000021 per ms, and 10 GB of memory costs you $0.0000001667 per ms.

For example, consider 10 GB of memory with 6 vCPUs and a concurrency of one, always running. The monthly cost for the foregoing would be $432.50. If the concurrency is two, the price is doubled. If the environment runs half the day, the price gets divided by two. If it’s running for 10 minutes per day, the cost would be $9.10 per month.
If you consider the same configuration in Fargate, the prices are drastically lower. Fargate charges a flat rate of:

- $0.04048 per vCPU per hour ($29.145 per month)
- $0.004445 per GB per hour ($3.20 per month)

So, a configuration with 10 GB of memory and 6 vCPUs running continuously for a month with a concurrency of one would cost $206.87 (6 × $29.145 for the vCPUs plus 10 × $3.20 for the memory). Moreover, Fargate separates CPUs from memory, allowing you to choose the right-sized configuration. Therefore, you can save costs by reducing the CPUs depending on your needs. When you consider a concurrency of 10, both bills scale linearly, so the gap between them grows tenfold. Another advantage of Fargate is the spot pricing, which offers an additional 30% savings.

Notice that Lambda costs are lower than Fargate when the idle time is greater. In light of the foregoing, we can conclude that Lambda is more suitable for workloads that are idle for long periods. Lambda is cost-effective if the resources are in use for a quarter of the time or less. Lambda is the best choice to scale fast or to isolate security from app code. In contrast, Fargate suits cloud environments with minimally idle workloads. We think the best option is to implement Infrastructure as Code (IaC) and begin with Lambda. When workloads increase, you can seamlessly switch to Fargate.

Fargate vs. Lambda: Ease of Use

Lambda is easy to set up and operate, as there are fewer knobs to adjust than with Fargate. More abstraction implies less operational burden. However, it also implies limited flexibility. Lambda comes with a rich ecosystem that offers fully automated administration. You can use the management console or the API to call and control functions synchronously or asynchronously, including their concurrency. The runtime supports a common set of functionalities and allows you to switch between different frameworks and languages. As far as operational burden goes, Lambda is easier than EC2. Fargate stands between Lambda and EC2 in this category, leaning closer towards Lambda.
That said, EC2 offers more flexibility in configuring and operating the environment, followed by Fargate and Lambda. Fargate vs Lambda: Community Both AWS Fargate and Lambda are a part of the AWS serverless ecosystem. As such, both tools enjoy the same level of community support. Both services offer adequate support for new and advanced users, from documentation and how-to guides to tutorials and FAQs. Fargate vs Lambda: Cloud Agnostic Each cloud vendor manages serverless environments differently. For instance, C# functions written for AWS will not work on the Google Cloud. In light of the foregoing, developers need to consider cloud-agnostic issues if multi-cloud and hybrid-cloud architectures are involved. Moving between different cloud vendors involves considerable expenses and operational impacts. As such, vendor lock-in is a big challenge for serverless functions. To overcome this, we suggest using an open-source serverless framework offered by Serverless Inc. Moreover, implementing hexagonal architecture is a good idea because it allows you to move code between different serverless cloud environments. Fargate vs Lambda: Scalability In terms of Lambda vs Fargate scalability, Lambda is known as one of the best scaling technologies available in today’s market. Rapid scaling and scaling to zero are the two key strengths of Lambda. The tool instantly scales from zero to thousands and scales down from 1000 to 0, making it a good choice for low workloads, test environments, and workloads with unexpected traffic spikes. As far as Fargate is concerned, container scaling depends on resizing the underlying clusters. Furthermore, it doesn’t natively scale down to zero. Therefore, you’ll have to shut down Fargate tasks outside business hours to save on operational costs. Tasks such as configuring auto-scaling and updating base container images add up when it comes to maintenance. Fargate vs. 
Lambda: Security Lambda and Fargate are inherently secure as part of the AWS ecosystem. You can secure the environment using the AWS Identity and Access Management (IAM) service. Similarly, both tools abstract away the underlying infrastructure, which means the security of the infrastructure is managed by other services. The difference between the two tools lies in the IAM configuration. Lambda allows you to customize IAM roles for each function or service, while Fargate customizes each container and pod. Fargate tasks run in an isolated computing environment wherein CPU or memory is not shared with other tasks. Similarly, Lambda functions run in a dedicated execution environment. Also, Fargate offers more control over the environment and more secure touchpoints than Lambda. When to Use Fargate or Lambda? AWS Lambda Use Cases: Operating serverless websites Massively scaling operations Real-time processing of high volumes of data Predictive page rendering Scheduled events for every task and data backup Parse user input and cleanup backend data to increase a website’s rapid response time Analyzing log data on-demand Integrating with external services Converting documents into the user-requested format on-demand Real-Life Lambda Use Cases Serverless Websites: Bustle One of the best use cases for Lambda is operating serverless websites. By hosting frontend apps on S3 buckets and using CloudFront content delivery, organizations can manage static websites and take advantage of the Lambda pricing model. Bustle is a news, entertainment, and fashion website for women. The company was having difficulties scaling its application. In addition, server management, monitoring, and automation was becoming an important administrative burden. The company, therefore, decided to move to AWS Lambda with API Gateway and Amazon Kinesis to run serverless websites. Now, the company doesn’t have to worry about scaling, and its developers can deploy code at an extremely low cost. 
Event-driven Model for Workloads With Idle Times: Thomson Reuters Companies that manage workloads that are idle most of the time can benefit from the Lambda serverless feature. A notable example is Thomson Reuters, one of the world’s most trusted news organizations. The company wanted to build its own analytics engine. The small team working on this project desired a lessened administrative burden. At the same time, the tool needed to scale elastically during breaking news. Reuters chose Lambda. The tool receives data from Amazon Kinesis and automatically loads this data in a master dataset in an S3 bucket. Lambda is triggered with data integrations with Kinesis and S3. As such, Reuters enjoyed high scalability at the lowest cost possible. Highly Scalable Real-time Processing Environment: Realtor.com AWS Lambda enables organizations to scale resources while instantly cost-effectively processing tasks in real-time. Realtor.com is a leader in the real estate market. After the move to the digital world, the company started experiencing exponential traffic growth. Furthermore, the company needed a solution to update ad listings in real-time. Realtor.com chose AWS for its cloud operations. The company uses Amazon Kinesis Data Streams to collect and stream ad impressions. The internal billing system consumes this data using Amazon Kinesis Firehose, and the aggregate data is sent to the Amazon Redshift data warehouse for analysis. The application uses AWS Lambda to read Kinesis Data Streams and process each event. Realtor.com is now able to massively scale operations cost-effectively while making changes to ad listings in real-time. AWS Fargate Use Cases AWS Fargate is the best choice for managing container-based workloads with minimal idle times. 
Build, run, and manage APIs, microservices, and applications using containers to enjoy speed and immutability Highly scalable container-based data processing workloads Migrate legacy apps running on EC2 instances without refactoring or rearchitecting them Build and manage highly scalable AI and ML development environments Real-Life Use Cases Samsung Samsung is a leader in the electronics category. The company operates an online portal called “Samsung Developers,” which consists of SmartThings Portal for the Internet of Things (IoT), Bixby Portal for voice-based control of mobile services, and Rich Communication Services (RCS) for mobile messaging. The company was using Amazon ECS to manage the online portal. After the re:Invent 2017 event, Samsung was inspired to implement Fargate for operational efficiency. After migrating to AWS Fargate, the company no longer needed dedicated operators and administrators to manage the web services of the portal. Now, geographically distributed teams simply have to create new container images, which are uploaded to ECR and moved to the test environment on Fargate. Developers can therefore focus more on code and frequent deployments, and administrators can focus more on performance and security. Compute costs were reduced by 44.5%. Quola Insurtech Startup Quola is a Jakarta-based insurance technology startup. The company developed software that automates claim processing using AI and ML algorithms to eliminate manual physical reviews. Quola chose AWS cloud and Fargate to run and manage container-based workloads. Amazon Simple Queue Service (SQS) is used for the message-queuing service. With Fargate, Quola is able to scale apps seamlessly. When a new partner joined the network, data transactions increased from 10,000 to 100,000 in a single day. Nevertheless, the app was able to scale instantly without performance being affected. Vanguard Financial Services Vanguard is a leading provider of financial services in the US. 
The company moved its on-premise operations to the AWS cloud in 2015 and now manages 1000 apps that run on microservices architecture. With security being a key requirement in the financial industry, Vanguard operates in the secure environment of Fargate. With Fargate, the company could offer seamless computing capacity to its containers and reduce costs by 50%. Considerations When Moving to a Serverless Architecture Inspired by the amazing benefits of serverless architecture, many businesses are aggressively embracing the serverless computing model. Here are the steps to migrate monolith and legacy apps to a serverless architecture. a) Monolith to Microservices: Most legacy apps are built using a monolith architecture. When such is the case, the first step is to break the large application into smaller, modular microservices, after which each microservice will perform a specific task or function. b) Implement each Microservice as a REST API: The next step is identifying the best fit within these microservices. Implement each microservice as a REST API with API endpoints as resources. Amazon API Gateway is a fully managed service that can help you. c) Implement a Serverless Compute Engine: Implement a serverless compute engine such as Lambda or Fargate and move the business logic to the serverless tool such that AWS provisions resources every time a function is invoked. d) Staggered Deployment Strategy: Migrating microservices to the serverless architecture can be done in a staggered process. Identify the right services and then build, test, and deploy them. Continue this process to smoothly and seamlessly move the entire application to the new architecture. Considerations for Moving to Amazon Lambda Migrating legacy apps to Lambda is not a difficult job. If your application is written in any Lambda-supported language, you can simply refactor the code and migrate the app to Lambda. 
You simply need to make some fundamental changes, such as changing the dependency on local storage to S3 or updating authentication modules. When Fargate vs. Lambda security is considered, Lambda has fewer touchpoints to secure than Fargate. If you are using the Java runtime, keep in mind that the size of the runtime environment and resources can result in longer cold starts than with Node.js or C#. Another key point to consider is memory allocation. Currently, Lambda’s maximum memory allocation is 10 GB. If your application requires more computing and memory resources, Fargate is a better choice. Considerations for Moving to AWS Fargate While AWS manages resource provisioning, customers still need to handle network security tasks. For instance, when a task is created, AWS creates an Elastic Network Interface (ENI) in the VPC and automatically attaches each task ENI to its corresponding subnet. Therefore, managing the connectivity between the ENI and its touch points is the customer’s sole responsibility. More specifically, you need to manage ENI access to AWS EC2, CloudWatch, apps running on-premise or in other regions, egress, ingress, etc. Moreover, audit and compliance aspects must be carefully managed, which is why Fargate is not preferred for highly regulated environments. Conclusion The Fargate vs Lambda battle is getting more and more interesting as the gap between container-based and serverless systems is getting smaller with every passing day. There is no silver bullet when deciding which service is the best. With the new ability to deploy Lambda functions as Docker container images, more organizations seem to lean towards Lambda. On the other hand, organizations that need more control over the container runtime environment are sticking with Fargate.
Docker has revolutionized the way we build and deploy applications. It provides a platform-independent environment that allows developers to package their applications and dependencies into a single container. This container can then be easily deployed across different environments, making it an ideal solution for building and deploying applications at scale. Building Docker images from scratch is a must-have skill for any DevOps engineer working with Docker. It allows you to create custom images tailored to your application's specific needs, making your deployments more efficient and reliable. In this blog, we'll explore Docker images, their benefits, the process of building Docker images from scratch, and the best practices for building a Docker image. What Is a Docker Image? A Docker image is a lightweight, standalone, executable package that includes everything needed to run the software, including code, libraries, system tools, and settings. Docker images are built using a Dockerfile, which is a text file that contains a set of instructions for building the image. These instructions specify the base image to use, the packages and dependencies to install, and the configuration settings for the application. Docker images are designed to be portable and can be run on any system that supports Docker. They are stored in a central registry, such as Docker Hub, where they can easily be shared and downloaded. By using Docker images, developers can quickly and easily deploy their applications in a consistent and reproducible manner, regardless of the underlying infrastructure. This makes Docker images an essential tool for modern software development and deployment. Benefits of Building a Docker Image By building a Docker image, you can improve the consistency, reliability, and security of your applications. 
In addition, Docker images make it easy to deploy and manage applications, which helps to reduce the time and effort required to maintain your infrastructure. Here are some major benefits of building a Docker image: Portability: Docker images are portable and can run on any platform that supports Docker. This makes moving applications between development, testing, and production environments easy. Consistency: Docker images provide a consistent environment for running applications. This ensures that the application behaves the same way across different environments. Reproducibility: Docker images are reproducible, which means you can recreate the same environment every time you run the image. Scalability: Docker images are designed to be scalable, which means that you can easily spin up multiple instances of an application to handle increased traffic. Security: Docker images provide a secure way to package and distribute applications. They allow you to isolate your application from the host system and other applications running on the same system. Efficiency: Docker images are lightweight and take up minimal disk space. This makes it easy to distribute and deploy applications quickly. Versioning: Docker images can be versioned, which allows you to track changes and roll back to previous versions if necessary. Structure of a Docker Image A Docker image is a read-only template that contains the instructions for creating a Docker container. Before you learn how to build a Docker image, let's read about its structure first. The structure of a Docker image includes the following components: Base Image A Docker image is built on top of a base image, which is the starting point for the image. The base image can be an official image from the Docker Hub registry or a custom image created by another user. Filesystem The filesystem of a Docker image is a series of layers that represent the changes made to the base image. 
Each layer contains a set of files and directories that represent the differences from the previous layer. Metadata Docker images also include metadata that provides information about the image, such as its name, version, author, and description. This metadata is stored in a file called the manifest. Dockerfile The Dockerfile is a text file that contains the instructions for building the Docker image. It specifies the base image, the commands to run in the image, and any additional configuration needed to create the image. Before learning how to build the docker image using the Docker build command from Dockerfile, knowing how dockerfile works will be helpful. Configuration Files Docker images may also include configuration files that are used to customize the image at runtime. These files can be mounted as volumes in the container to provide configuration data or environment variables. Runtime Environment Finally, Docker images may include a runtime environment that specifies the software and libraries needed to run the application in the container. This can include language runtimes such as Python or Node.js or application servers such as Apache or Nginx. The structure of a Docker image is designed to be modular and flexible, allowing technology teams to create images tailored to their specific needs while maintaining consistency and compatibility across different environments. How to Build a Docker Image? To build a Docker image, you need to follow these steps: Create a Dockerfile A Dockerfile is a script that contains instructions on how to build your Docker image. The Dockerfile specifies the base image, dependencies, and application code that are required to build the image. After creating a Dockerfile and understanding how Dockerfile works, move to the next step. Define the Dockerfile Instructions In the Dockerfile, you need to define the instructions for building the Docker image. 
These instructions include defining the base image, installing dependencies, copying files, and configuring the application. Build the Docker Image To build a Docker image, you need to use the docker build command. This command takes the Dockerfile as input and builds the Docker image. When running docker build, you can also specify the name and tag for the image using the -t option. Test the Docker Image Once the Docker image is built, you can test it locally using the docker run command. This command runs a container from the Docker image and allows you to test the application. Push the Docker Image to a Registry Once you have tested the Docker image, you can push it to a Docker registry such as Docker Hub or a private registry. This makes it easy to share the Docker image with others and deploy it to other environments. Let's look at a docker build command example. Once you've created your Dockerfile, you can use the "docker build" command to build the image. Here's the basic syntax for the docker build command: docker build -t <image-name> <path-to-build-context> The last argument is the build context, typically the directory that contains the Dockerfile. For example, if your Dockerfile is located in the current directory and you want to name your image "my-app," you can use the following command: docker build -t my-app . This command builds the Docker image using the current directory (the trailing dot) as the build context and names the image "my-app"; because no tag is given, Docker applies the default tag, latest. Best Practices for Building a Docker Image Here are some best practices to follow when building a Docker image: Use a small base image: Use a small base image such as Alpine Linux or BusyBox when building a Docker image. This helps to reduce the size of your final Docker image and improves security by minimizing the attack surface. Use a .dockerignore file: Use a .dockerignore file to exclude files and directories that are not needed in the Docker image. 
This helps to reduce the size of the context sent to the Docker daemon during the build process. Use multistage builds: Use multistage builds to optimize your Docker image size. Multistage builds allow you to build multiple images in a single Dockerfile, which can help reduce the number of layers in your final image. Minimize the number of layers: Minimize the number of layers in your Docker image to reduce the build time and image size. Each layer in a Docker image adds overhead, so it's important to combine multiple commands into a single layer. Use specific tags: Use specific tags for your Docker image instead of the latest tag. This helps to ensure that you have a consistent and reproducible environment. Avoid installing unnecessary packages: Avoid installing unnecessary packages in your Docker image to reduce the image size and improve security. Use COPY instead of ADD: Use the COPY command instead of ADD to copy files into your Docker image. The COPY command is more predictable and has fewer side effects than the ADD command. Avoid using root user: Avoid using the root user in your Docker image to improve security. Instead, create a non-root user and use that user in your Docker image. Docker Images: The Key to Seamless Container Management By following these steps and practices outlined in this blog, you can create custom Docker images tailored to your application's specific needs. This will not only make your deployments more efficient and reliable, but it will also help you to save time and resources. With these skills, you can take your Docker knowledge to the next level and build more efficient and scalable applications. Docker is a powerful tool for building and deploying applications, but it can also be complex and challenging to manage. Whether you're facing issues with image compatibility, security vulnerabilities, or performance problems, it's important to have a plan in place for resolving these issues quickly and effectively.
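Some of the best practices listed earlier (specific tags, COPY instead of ADD, a non-root user) can be checked mechanically. The following Python sketch is a deliberately simplified illustration, not a real linter: the function name `check_dockerfile` and the sample image names are hypothetical, and it does not handle line continuations or build arguments.

```python
def check_dockerfile(text):
    """Return warnings for a few of the practices listed above (not exhaustive)."""
    warnings = []
    stage_names = set()   # aliases declared with "FROM ... AS <name>"
    final_user = None     # last USER seen in the most recent stage

    for number, raw in enumerate(text.splitlines(), start=1):
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        instruction, args = parts[0].upper(), parts[1:]

        if instruction == "FROM":
            final_user = None  # a new stage starts with its base image's user
            args = [arg for arg in args if not arg.startswith("--")]
            if not args:
                continue
            image = args[0]
            # Look for the tag only after the last "/", so a registry port
            # such as "localhost:5000/app" is not mistaken for a tag.
            tag_part = image.rsplit("/", 1)[-1]
            pinned = "@" in image or (":" in tag_part and not tag_part.endswith(":latest"))
            if image != "scratch" and image not in stage_names and not pinned:
                warnings.append(f"line {number}: pin a specific tag instead of '{image}'")
            if len(args) >= 3 and args[1].upper() == "AS":
                stage_names.add(args[2])
        elif instruction == "ADD":
            warnings.append(f"line {number}: prefer COPY over ADD")
        elif instruction == "USER" and args:
            final_user = args[0].split(":")[0]

    if final_user in (None, "root", "0"):
        warnings.append("final stage runs as root; add a non-root USER")
    return warnings


# Hypothetical Dockerfile that ignores three of the practices.
sample = """\
FROM python:latest
ADD app.py /app/app.py
CMD ["python", "/app/app.py"]
"""

for warning in check_dockerfile(sample):
    print(warning)
```

A check like this could run in a CI pipeline before `docker build`, although dedicated Dockerfile linters cover far more cases than this sketch.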
As organizations migrate to the cloud, they desire to exploit this on-demand infrastructure to scale their applications. But such migrations are usually complex and need established patterns and control points to manage. In my previous blog posts, I covered a few of the proven designs for cloud applications. In this article, I’ll introduce the Orchestration Pattern (also known as the Orchestrator Pattern) to add to the list. This technique allows the creation of scalable, reliable, and fault-tolerant systems. The approach can help us manage the flow and coordination among components of a distributed system, predominantly in a microservices architecture. Let’s dive into a problem statement to see how this pattern works. Problem Context Consider a legacy monolith retail e-commerce website. This complex monolith consists of multiple subdomains such as shopping baskets, inventory, payments etc. When a client sends a request, the website performs a sequence of operations to fulfil the request. In this traditional architecture, each operation can be described as a method call. The biggest challenge for the application is scaling with the demand. So, the organisation decided to migrate this application to the cloud. However, the monolithic approach that the application uses is too restricted and would limit scaling even in the cloud. Adopting a lift and shift approach to perform migration would not reap the real benefits of the cloud. Thus, a better migration would be to refactor the entire application and break it down by subdomains. The new services must be deployed and managed individually. The new system comes with all the improvements of distributed architecture. These distributed and potentially stateless services are responsible for their own sub-domains. But the immediate question is how to manage a complete workflow in this distributed architecture. Let us try to address this question in the next section and explore more about Orchestration Patterns. 
Monolithic application migration to the cloud What Is Orchestration Pattern We have designed an appropriate architecture where all services operate within their bounded context. However, we still need a component that is aware of the entire business workflow. The missing element is responsible for generating the final response by communicating with all of the services. Think of it like an orchestra with musicians playing their instruments. In an orchestra, a central conductor coordinates and aligns the members to produce a final performance. The Orchestration Pattern also introduces a centralized controller or service known as the orchestrator, similar to a central conductor. The orchestrator does not perform business logic but manages complex business flows by calling independently deployed services, handling exceptions, retrying requests, maintaining state, and returning the final response. Orchestrator Pattern The figure above illustrates the pattern. It has three components: the orchestrator or central service, business services that need coordination, and the communication channel between them. It is an extension of the Scatter Gather pattern but involves a sequence of operations instead of executing a single task in parallel. Let’s examine a use case to understand how the pattern works. Use Case Many industries, such as e-commerce, finance, healthcare, telecommunications, and entertainment, widely use the orchestrator pattern with microservices. By now, we also have a good understanding of the pattern. In this section, I will talk about payment processing, which is relevant in many contexts, to detail the pattern in action. Consider a payment gateway system that mediates between a merchant and a customer bank. The payment gateway aims to facilitate secure transactions by managing and coordinating multiple participating services. 
When the orchestrator service receives a payment request, it triggers a sequence of service calls in the following order: Firstly, it calls the payment authorization service to verify the customer’s payment card, the amount going out, and bank details. The service also confirms the merchant’s bank and its status. Next, the orchestrator invokes the Risk Management Service to retrieve the transaction history of the customer and merchant to detect and prevent fraud. After this, the orchestrator checks for Payment Card Industry (PCI) Compliance by calling the PCI Compliance Service. This service enforces the mandated security standards and requirements for cardholder data. Credit card companies need all online transactions to comply with these security standards. Finally, the orchestrator calls another microservice, the Transaction Service. This service converts the payment to the merchant’s preferred currency if needed. The service then transfers funds to the merchant’s account to settle the payment transaction. Payment Gateway System Flow After completing all the essential steps, the Orchestrator Service responds with a transaction completion status. At this point, the calling service may send a confirmation email to the buyer. The complete flow is depicted in the above diagram. It is important to note that this orchestration service is not just a simple API gateway that calls the APIs of different services. Instead, it is the only service with the complete context and manages all the steps necessary to finish the transaction. If we want to add another step, for example, the introduction of new compliance by the government, all we need to do is create a new service that ensures compliance and add this to the orchestration service. It’s worth noting that the new addition may not affect the other services, and they may not even be aware of it. Implementation Details The previous section has demonstrated a practical use case for managing service using an orchestrator. 
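The payment flow described in the use case can be sketched in a few lines of Python. This is only an illustration of the pattern under simplifying assumptions: the four services are local stub functions rather than independently deployed services, and all names (`PaymentRequest`, `PaymentOrchestrator`, `StepFailed`, the amount limit) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PaymentRequest:
    card_number: str
    amount: float
    currency: str
    merchant_id: str


class StepFailed(Exception):
    """Raised by a service stub when its check does not pass."""


# Stand-ins for independently deployed services; real ones would be remote calls.
def authorize_payment(request):
    if request.amount <= 0:
        raise StepFailed("amount must be positive")


def assess_risk(request):
    if request.amount > 10_000:  # hypothetical fraud threshold
        raise StepFailed("amount exceeds risk limit")


def check_pci_compliance(request):
    if not request.card_number.isdigit():
        raise StepFailed("card number is malformed")


def settle_transaction(request):
    pass  # currency conversion and fund transfer would happen here


class PaymentOrchestrator:
    """Owns the workflow order; holds no business logic of its own."""

    def __init__(self, steps):
        self.steps = steps  # ordered list of (name, callable)

    def process(self, request):
        completed = []
        for name, step in self.steps:
            try:
                step(request)
            except StepFailed as error:
                return {"status": "FAILED", "failed_step": name,
                        "reason": str(error), "completed": completed}
            completed.append(name)
        return {"status": "COMPLETED", "completed": completed}


orchestrator = PaymentOrchestrator([
    ("authorization", authorize_payment),
    ("risk", assess_risk),
    ("pci", check_pci_compliance),
    ("transaction", settle_transaction),
])

result = orchestrator.process(
    PaymentRequest("4111111111111111", 25.0, "EUR", "merchant-1"))
print(result)
```

Adding a new compliance step, as mentioned above, amounts to appending one more `(name, callable)` pair to the list; the existing stubs do not change and are unaware of the addition.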
However, below are a few tactics that can be used while implementing the pattern: Services vs Serverless: Mostly following this pattern means having a business logic that spreads across many services. However, there are specific situations when not all the business steps require execution or only a few steps are necessary. Should these steps be deployed as functions instead of services in these scenarios? Events usually trigger functions, which shut down once they complete their job. Such an infrastructure can save us money compared to a service that remains active continuously and performs minimal tasks. Recovery from Transient Failures: The orchestration pattern implementation can be challenging because it involves coordinating multiple services and workflows, which requires a different approach to designing and managing software systems than traditional monolithic architectures. The implementation must be able to handle potential transient failures, such as network failure, service failure, or database failure. Below are a few ways to cater to such issues: Retry Mechanism: Implementing a retry mechanism can improve resiliency when a service operation fails. The retry mechanism should configure the number of retries allowed, the delay between retries, and the conditions to attempt retries. Circuit Breaker Pattern: In case a service fails, the orchestrator must detect the failure, isolate the failed service, and give it a chance to recover. It can help the service heal without disruption and avoid complete system failure. Graceful Degradation: If a service fails and becomes unavailable, the rest of the services should continue to operate. The orchestrator should look for fallback options to minimize the impact on end-users, such as previously cached results or an alternate service. Monitoring and Alerting: The entire business flow is distributed among various services when we operate with the Orchestration Pattern. 
Therefore, an effective monitoring and alerting solution is mandatory to trace and debug any failures. The solution must be capable of detecting any issues in real-time and taking appropriate actions to mitigate the impact. It includes implementing auto-recovery strategies, such as restarting failed services or switching to a backup service, and setting up alerts to notify the operations team when exceptions occur. The logs generated by the orchestrator are also valuable for the operations team to troubleshoot errors. We can operate smoothly and meet user needs by proactively identifying and resolving issues. Orchestration Service Failure: Finally, we must prepare for scenarios where the orchestrator itself fails while processing requests. For instance, in our payment gateway example, imagine a scenario where the orchestrator calls the Transaction service to transfer the funds but crashes or loses connection before getting a successful response for that transaction. It could lead to a frustrating user experience, with the risk of the customer being charged twice for the same product. To prevent such failure scenarios, we can adopt one of the following solutions: Service Replication: Replicate the orchestration service across multiple nodes. The service can automatically fail over to the backup node when needed. With a load balancer that can detect and switch to the available node, the replication guarantees seamless service and prevents disruptions to the user. Data Replication: Not only should we replicate the service, but we should also replicate the data to ensure data consistency. It enables the backup node to take over seamlessly without any data loss. Request Queues: Implement queues as a buffer for requests when the orchestration service is down. The queue can hold incoming requests until the service is available again. Once the backup node is up and running, it can retrieve the requests from the queue buffer and process them in the correct order. 
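The retry mechanism and circuit breaker described earlier in this section can be combined as in the following Python sketch. It is a minimal illustration rather than production code: the class and function names, the thresholds, and the simulated `flaky_service` are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected so that a failing service can recover."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(func, retries=3, delay=0.1, backoff=2.0,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Retry func on the given exceptions, waiting longer after each failure."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= retries:
                raise
            sleep(delay * backoff ** attempt)
            attempt += 1


# Simulated downstream service that fails twice before succeeding.
attempts = {"count": 0}


def flaky_service():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "ok"


breaker = CircuitBreaker(failure_threshold=5)
result = call_with_retry(lambda: breaker.call(flaky_service), delay=0.01)
print(result, "after", attempts["count"], "attempts")
```

The retry helper exposes the three settings named above (number of retries, delay between retries, and the conditions that qualify for a retry). `CircuitOpenError` is intentionally not retried, so once the breaker opens the orchestrator can move straight to a fallback.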
Why Use Orchestration Pattern The pattern comes with the following advantages: Orchestration makes it easier to understand, monitor and observe the application, resulting in a better understanding of the core part of the system with less effort. The pattern promotes loose coupling. Each downstream service exposes an API interface and is self-contained, without any need to know about the other services. The pattern simplifies the business workflows and improves the separation of concerns. Each service participates in a long-running transaction without any need to know about it. The orchestrator service can decide what to do in case of failure making the system fault-tolerant and reliable. Important Considerations The primary goal of this architectural pattern is to decompose the entire business workflow into multiple services, making it more flexible and scalable. And due to this, it’s crucial to analyse and comprehend the business processes in detail before implementation. A poorly defined and overly complicated business process will lead to a system that would be hard to maintain and scale. Secondly, it’s easy to fall into the trap of adding business logic into the orchestration service. Sometimes it’s inevitable because certain functionalities are too small to create their separate service. But the risk here is that if the orchestration service becomes too intelligent and performs too much business logic, it can evolve into a monolithic application that also happens to talk to microservices. So, it’s crucial to keep track of every addition to the orchestration service and ensure that its work remains within the boundaries of orchestration. Maintaining the scope of the orchestration service will prevent it from becoming a burden on the system, leading to decreased scalability and flexibility. Summary Numerous organizations are adopting microservice patterns to handle their complex distributed systems. 
The orchestration pattern plays a vital role in designing and managing these systems. By centralizing control and coordination, the orchestration pattern enhances agility, scalability, and resilience, making it an essential tool for organizations looking to modernize their infrastructure.
Lately, I've come across a lot of discussions and articles about Spring's Profiles feature that promote profiles as a way to separate environment-specific configurations, which I consider a bad practice. Common Examples The typical way profiles are presented is by having multiple configuration files within the resources folder that will be bundled within the application artifact, with an application-prod.yml like: YAML some-resource.address: prod-address some-resource.username: prod-user some-resource.password: prod-password Issues I hope one can immediately see some of the issues: The application's production credentials are committed and available to everyone with access to the repository, which is a very serious security issue. Changing a configuration value on a given environment would require recompiling and the creation of a new artifact. The introduction of a new environment would require recompiling and the creation of a new artifact. Recompilation and release of a new application version without really changing any application logic feels stupid. Solution How can this issue be solved? Well, config values have to be put outside of the application's artifact and VCS repository, as recommended by the Twelve-Factor App. There are at least two ways I have experience with: Have a config file beside application.jar (or specify spring.config.additional-location) on a given environment that overrides only specific keys. Use environment variables. In the latter case, config keys are bound to environment variables, e.g., some-resource.username <=> SOMERESOURCE_USERNAME. If a custom key name is needed, an "alias" can be made as: YAML some-resource.username: ${OTHER_ENV_KEY} In either case, what's the need for config files per environment? All that is needed is a single application.yml file with both internal and external properties required by the application. These properties can have either empty or default/local values. 
YAML some-resource.address: some-resource.username: username some-resource.password: ${OTHER_ENV_KEY:123456} When to Use Profiles So far, I have never had a need for profiles per environment. However, there is one "special" profile that I would not consider an environment configuration file, and that is the test profile located in src/test/resources/. This profile and its corresponding configuration file allow overriding only the keys that are present in it. Having an application.yml file in that folder instead would require providing all config properties defined in the main file (if that's not an issue, go for it). To activate this profile, use the @ActiveProfiles annotation on test classes. The only other usage of profiles I can think of is grouping the configuration of some optional feature. One advantage of the profile mechanism in configuration files is its ability to merge/override config properties (for details, check Piotr's TechBlog and his GitHub playground project). If we have some optional feature that requires a separate set of config attributes, which we want to be added only if this feature is active, we could have application-{feature-name}.yml, which can be activated in the main config file via the spring.profiles.active/include properties. For simple feature flagging, I would use the @ConditionalOnProperty annotation. Conclusion To conclude, don't store environment-specific configuration inside the application; instead, use externalized configuration managed/injected on a given environment. Note: Please feel free to share your opinion and experience on the usage of profiles for environment-specific configuration files or in general. I might be missing something or have made a mistake, and I would like to broaden my knowledge.
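As a footnote to the environment-variable binding mentioned in the Solution section, here is a minimal sketch of how a property key maps to a variable name. It assumes Spring Boot's documented relaxed-binding rules for environment variables (replace periods with underscores, remove dashes, convert to uppercase). The helper name to_env_name and the value prod-user are illustrative only and are not part of Spring.

```shell
#!/bin/sh
# Hypothetical helper: derive the environment variable name that would
# bind to a given property key, assuming the three relaxed-binding rules
# named above (dots -> underscores, dashes removed, uppercase).
to_env_name() {
  printf '%s' "$1" | tr '.' '_' | tr -d '-' | tr '[:lower:]' '[:upper:]'
}

# Map the key from the article to its environment variable name.
var_name=$(to_env_name "some-resource.username")
echo "$var_name"

# Illustrative usage: export the value on the target environment before
# starting the application (the java command is left as a comment because
# no artifact exists in this sketch).
export SOMERESOURCE_USERNAME="prod-user"
# java -jar application.jar
```

With this approach the artifact stays identical across environments; only the exported variables differ.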
Git is one of the most popular version control systems used by developers worldwide. As a software developer, you must be well-versed in Git and its commands to manage code efficiently, collaborate with other team members, and keep track of changes. While there are many Git commands available, not all are equally important. In this article, I’ll cover the top Git commands that every senior-level developer should know. “A Git pull a day keeps the conflicts away” Git Init The git init command initializes a new Git repository. This command is used to start tracking changes in your project. As soon as you run this command, Git creates a new .git directory, which contains all the necessary files to start using Git for version control. Once the repository is initialized, you can start recording changes to your code and build up a history of commits. Shell $ git init Initialized empty Git repository in /path/to/repository/.git/ Git Clone The git clone command creates a copy of a remote Git repository on your local machine. This is a great way to start working on a new project or to collaborate with others on existing projects. When you run this command, Git downloads the entire repository, including all branches and history, to your local machine. Shell $ git clone https://github.com/username/repository.git Git Add The git add command adds new or modified files to the staging area, which prepares them to be committed. The staging area is a temporary storage area where you can prepare your changes before committing them. You can specify individual files or directories with this command. Git tracks changes in three stages: modified, staged, and committed. The add command moves changes from the modified stage to the staged stage. To add a single file to the staging area, run: Shell $ git add file.txt To add all changes in the current directory to the staging area, run: Shell $ git add . Git Commit The git commit command creates a new commit with the changes you've added to the staging area. 
A commit is a snapshot of your repository at a specific point in time, and each commit has a unique identifier. The git commit command records changes to the repository. A commit includes a commit message that describes the changes made. To commit changes to the repository, run the following command: Shell $ git commit -m "Added new feature" Git Status The git status command shows you the current state of your repository, including any changes that have been made and which files are currently staged for commit. It tells you which files are modified, which files are staged, and which files are untracked. Shell $ git status On branch master Changes to be committed: (use "git reset HEAD <file>..." to unstage) modified: file.txt Git Log The git log command shows you the history of all the commits that have been made to your repository. You can use this command to see who made changes to the repository and when those changes were made along with their author, date, and commit message. Shell $ git log commit 5d5b5e5dce7d1e09da978c8706fb3566796e2f22 Author: John Doe <john.doe@example.com> Date: Tue Mar 23 14:39:51 2023 -0400 Added new feature git log --graph: Displays the commit history in a graph format. git log --oneline: Shows the commit history in a condensed format. git log --follow: Follows the history of a file beyond renames. Git Diff The git diff command shows you the differences between the current version of a file and the previous version. This command is useful when you want to see what changes were made to a file. When you run this command, Git shows you the changes that were made between two commits or between a commit and your current working directory. This will show you the differences between the current branch and the “feature_branch” branch. Shell $ git diff feature_branch Git Branch The git branch command shows you a list of all the branches in the Git repository. 
You can use this command to see which branch you are currently on and to create new branches. This command is used to create, list, or delete branches. Shell $ git branch feature-1 Git Checkout The git checkout command is used to switch between branches. You can use this command to switch to a different branch or to create a new branch. Shell $ git checkout feature-1 Git Merge The git merge command is used to merge changes from one branch into another. This command is useful when you want to combine changes from different branches. When you run this command, Git combines the changes from two branches and creates a new commit. Shell $ git merge feature-1 Git Pull The git pull command is used to update your local repository with changes from a remote repository. This command is used to download changes from a remote repository and merge them into your current branch. When you run this command, Git combines the changes from the remote repository with your local changes and creates a new commit. Shell $ git pull origin master Git Push The git push command is used to push your changes to a remote repository. This command is useful when you want to share your changes with others. When you run this command, Git sends your commits to the remote repository and updates the remote branch. Shell $ git push origin master Git Remote This command is used to manage the remote repositories that your Git repository is connected to. It allows you to add, rename, or remove remote repositories. git remote rm: Removes a remote repository. git remote show: Shows information about a specific remote repository. git remote rename: Renames a remote repository. Git Fetch This command is used to download changes from a remote repository to your local repository. When you run this command, Git updates your local repository with the latest changes from the remote repository but does not merge them into your current branch. 
Shell $ git fetch origin Git Reset This command is used to unstage changes in the staging area or undo commits. When you run this command, Git removes the changes from the staging area or rewinds the current branch to a previous commit. Shell $ git reset file.txt Git Stash This command is used to temporarily save changes that are not yet ready to be committed. When you run this command, Git saves your changes in a temporary storage area and restores the working directory to the state of the last commit. (git stash save is deprecated in favor of git stash push.) Shell $ git stash push -m "Work in progress" “Why did the Git user become a magician? Because they liked to git stash and make their code disappear” Git Cherry-Pick This command is used to apply a specific commit to a different branch. When you run this command, Git copies the changes from the specified commit and applies them to your current branch. Shell $ git cherry-pick 5d5b5e5dce7d1e09da978c8706fb3566796e2f22 Git Rebase This command is used to move the commits of one branch onto the tip of another. When you run this command, Git replays the commits from your current branch on top of the specified branch, creating new commits and a linear history. Shell $ git rebase feature-1 Git Tag This command is used to create a tag for a specific commit. A tag is a label that marks a specific point in your Git history. Shell $ git tag v1.0.0 Git Blame This command is used to see which commit last modified each line of a file. When you run this command, Git shows you the commit, author, and date for each line of code in the file. Shell $ git blame file.txt “Git: making it easier to blame others since 2005” Git Show This command is used to view the changes made in a specific commit. When you run this command, Git shows you the files that were changed and the differences between the old and new versions. Shell $ git show 5d5b5e5dce7d1e09da978c8706fb3566796e2f22 Git Bisect This command is used to find the commit that introduced a bug in your code. 
When you run this command, Git helps you narrow down the range of commits to search through to find the culprit. Shell $ git bisect start $ git bisect bad HEAD $ git bisect good 5d5b5e5dce7d1e09da978c8706fb3566796e2f22 Git Submodule This command is used to manage submodules in your Git repository. A submodule is a separate Git repository included as a subdirectory of your main Git repository. Shell $ git submodule add https://github.com/example/submodule.git Git Archive This command is used to create a tar or zip archive of a Git repository. When you run this command, Git creates an archive of the repository that you can save or send to others. Shell $ git archive --format=zip --output=archive.zip master Git Clean This command is used to remove untracked files from your working directory. When you run this command with -f, Git removes files that are not tracked by Git; add -d to also remove untracked directories. Shell $ git clean -f Git Reflog This command is used to view the record of when the tips of branches and HEAD were updated in your local repository. When you run this command, Git shows you a list of the actions that moved HEAD, such as commits, checkouts, resets, and merges. Git Config This command is used to configure various aspects of the Git system. It is used to set or get configuration variables that control various Git behaviors. Here are some examples of how to use git config: Set user information: Shell git config --global user.name "Your Name" git config --global user.email "your.email@example.com" The above commands will set your name and email address globally so that they will be used in all your Git commits. Show configuration variables: Shell git config --list The above command will display all the configuration variables and their values that are currently set for the Git system. Set the default branch name: Shell git config --global init.defaultBranch main The above command will set the default branch name to “main”. 
This is the branch name that Git will use when you create a new repository. Git Grep This command searches the contents of a Git repository for a specific text string or regular expression. It works similarly to the Unix grep command but is optimized for searching through Git repositories. Here's an example of how to use git grep: Let’s say you want to search for the word “example”, ignoring case, in all the files in your Git repository. You can use the following command: Shell git grep -n -i example This will search for the word “example” in all tracked files in the current directory and its subdirectories. The -i option makes the match case-insensitive, and -n prints line numbers. If the word is found, git grep will print the name of the file, the line number, and the line containing the matched text. Here's an example output: Shell README.md:5:This is an example file. index.html:10:<h1>Example Website</h1> In this example, the word “example” was found in two files: README.md and index.html. Each line of the output shows the file name, followed by the line number and the line containing the matched text. You can also use regular expressions with git grep. For example, if you want to search for all instances of the word "example" that occur at the beginning of a line, you can use the following command: Shell git grep '^example' This will search for all instances of the word “example” that occur at the beginning of a line in all tracked files in the repository. Note that the regular expression ^ represents the beginning of a line. Git Revert This command is used to undo a previous commit. Unlike git reset, which removes the commit from the branch history, git revert creates a new commit that undoes the changes made by the previous commit. Here's an example of how to use git revert. Let’s say you have a Git repository with three commits: Plain Text commit 1: Add new feature commit 2: Update documentation commit 3: Fix bug introduced in commit 1 You realize that the new feature you added in commit 1 is causing issues and you want to undo that commit. 
You can use the following command to revert commit 1: Shell git revert <commit-1> This will create a new commit that undoes the changes made by commit 1. If you run git log after running git revert, you'll see that the repository now has four commits: Plain Text commit 1: Add new feature commit 2: Update documentation commit 3: Fix bug introduced in commit 1 commit 4: Revert "Add new feature" The fourth commit is the one created by git revert. It contains the changes necessary to undo the changes made by commit 1. Git RM This command is used to remove files from a Git repository. It can be used to delete files that were added to the staging area, as well as to remove files that were previously committed. Here are some examples of how to use git rm: Remove a file that was added to the staging area but not yet committed: Shell git rm -f filename.txt This will delete filename.txt and remove it from the staging area. Without -f, Git refuses to remove a file whose staged content has never been committed. Remove a file that was previously committed: Shell git rm filename.txt git commit -m "Remove filename.txt" The first command will delete filename.txt from the working directory and stage the deletion for the next commit. The second command will commit the deletion. Remove a file from the repository but keep it in the working directory: Shell git rm --cached filename.txt This will remove filename.txt from the index but keep it in the working directory. After the next commit, the file will no longer be tracked by Git, but it will still exist on your local machine. Remove a directory and its contents from the repository: Shell git rm -r directoryname This will remove directoryname and its contents from the repository and stage the deletion for the next commit. In conclusion, these frequently used Git commands are essential for every software professional who works with Git repositories regularly. 
Knowing how to use these commands effectively can help you streamline your workflow, collaborate with your team more effectively, and troubleshoot issues that may arise in your Git repository.
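The Git Reflog section above describes the command without an example, so here is a minimal sketch of one common use: recovering a commit after a hard reset. It assumes a throwaway repository created in a temporary directory; the file name, identity, and commit messages are illustrative only.

```shell
#!/bin/sh
set -eu

# Create a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # illustrative identity
git config user.email "user@example.com"

# Make two commits.
echo "one" > file.txt
git add file.txt
git commit -q -m "First commit"
echo "two" >> file.txt
git commit -q -a -m "Second commit"

# Rewind the branch by one commit; "Second commit" is no longer in git log.
git reset -q --hard HEAD~1

# The reflog still records every position HEAD has had, including the
# commit that was just dropped from the branch.
git --no-pager reflog

# HEAD@{1} is where HEAD pointed before the reset, so this restores it.
git reset -q --hard 'HEAD@{1}'
git --no-pager log --oneline
```

Reflog entries are local to your clone and expire eventually, so this is a safety net for recent mistakes rather than a backup.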
Cloud-native technologies like Kubernetes enable companies to build software quickly and scale effortlessly. However, debugging these Kubernetes-based applications can be quite challenging due to the added complexity of building service-oriented architectures (microservices) and operating the underlying Kubernetes infrastructure. Bugs are inevitable and typically occur as a result of an error or oversight made during the software development process. So, in order for a business to keep pace with app delivery and keep its end users happy, developers need an efficient and effective way to debug. This involves finding, analyzing, and fixing these bugs. This article highlights five Kubernetes debugging challenges and how to tackle them. #1. Slow Dev Loop Due To Building and Re-Deploying Containers When a development team adopts a cloud-native technology like Kubernetes, their developer experience is significantly altered, as they’ll now be expected to carry out extra steps in the inner dev loop. Instead of coding and seeing the result of their code changes immediately, as they used to when working with monolithic environments, they now have to manage external dependencies, build containers, and implement orchestration configuration (e.g., Kubernetes YAML) before they can see the impact of their code changes. There are several ways to tackle this Kubernetes debugging challenge: The first one is to develop services locally and focus on unit tests over end-to-end tests, but this is painful when a service or web application has authentication requirements and dependencies on databases. Another way to solve this is to use a tool called DevSpace, which will automate your build and deployment steps, thereby making the loop faster. 
And finally, you can also utilize a CNCF tool called Telepresence to connect your local development environment to a remote Kubernetes cluster, thereby making it possible to access these external dependencies in the remote Kubernetes cluster and test them against the service being developed locally for an instant feedback loop. #2. Lack of Visibility in the End-to-End Flow of a Distributed Application Another debugging challenge when working with Kubernetes is having full visibility of the end-to-end flow of your application because there are often just too many services. And without full visibility, it’s difficult to identify and fix a bug. Ideally, you should be able to get cross-service visibility into what is calling what, what is timing out, etc. To tackle this, you need to utilize tools that make observability and tracing more seamless. For example, tools such as OpenTelemetry, Jaeger, and Grafana Tempo can help you get the necessary information to reproduce errors. The goal here is to get as much information as possible; when you do, you’ll be able to fix bugs in real time and ultimately improve the overall performance of your application. #3. Inability To Attach a Debugger to the Code One of the most important things a developer needs is the ability to attach a debugger to their code, and working with Kubernetes doesn’t make this easy. Yes, things like print/log statements work, but they are nowhere near as good as being able to put a debugger on something and step through the code, especially if it’s a new code base that a developer isn’t familiar with. Two possible ways to tackle this Kubernetes debugging issue are to: Develop locally and find ways to mock or spin up local instances of dependencies. Ensure code is unit testable and focus on unit tests, because they are easier to write and easy to throw a debugger on. #4. 
Complicated Setup for Performing Integration Testing With a Local Change Cloud-native applications are often composed of various microservices. More often than not, these microservices work interdependently and communicate with each other to process larger business requests. As an example, a timeline service for a social media application may need to talk to a user profile service to determine a user's followers and, at the same time, may need to talk to an authentication service to determine the authentication state of a user. Because of this multi-directional, service-to-service communication that happens between microservices, it is crucial to perform integration testing on microservices before deploying any changes because unit testing alone doesn't always provide guarantees about the behavior of the application in the target environment. Performing integration testing in this context naturally involves running multiple services and connecting to (potentially remote) middleware and data stores. This requires techniques and tooling that present multiple challenges. These challenges include having limited resources and inconsistent data between production and non-production environments; managing distinct configurations for separate environments; and difficulties associated with managing service versioning, releases, and deployment cycles. #5. Reproducing an Issue That Only Happens in Prod/Staging Sometimes, it can be very complex to locally reproduce a bug that happened in production or staging. At this point, your mocks or existing values will not be sufficient. You’d think to yourself, how can I actually reproduce this issue? How can I get to the root of the problem faster? Well, an open-source tool called Telepresence is usually my go-to when facing this K8s debugging challenge. The tool allows you to access remote dependencies as if they were running locally and reroute traffic from remote to local services. 
This means you’d get to debug them in real-time, reproduce these issues, and push a fix to your preferred version control and CI/CD pipeline faster. Conclusion Most organizations insist that any important delivery of software goes through multiple iterations of testing, but it’s important to remember that bugs are inevitable. Having the ability to debug applications effectively is one of the best techniques for identifying, understanding, and fixing bugs. Container technology, such as Kubernetes, provides many benefits for software developers but also introduces app debugging challenges. Fortunately, there are multiple ways to address these challenges easily. If there are other Kubernetes debugging techniques that you’d like to share, please mention them in the comment section.
Steel Threads are a powerful but obscure software design approach. Learning about Steel Threads will make you a better engineer. You can use them to avoid common problems like integration pain, and you can use them to cut through the complexity of system design. So Obscure It Was Deleted From Wikipedia in 2013 How unknown are Steel Threads? The concept was deleted from Wikipedia in 2013 because “the idea is not notable within Software Engineering, and hasn’t received significant coverage from notable sources.” Let’s add to the coverage, and also talk through why it is such a useful approach. What Are Steel Threads? A Steel Thread is a very thin slice of functionality that threads through a software system. They are called “threads” because they weave through the various parts of the software system and implement an important use case. They are called “steel” because the thread becomes a solid foundation for later improvements. With a Steel Thread approach, you build the thinnest possible version that crosses the boundaries of the system and covers an important use case. Example of Conventional, Problematic Approach Let’s say you’re building a new service to replace a part of your monolithic codebase. The most common way to do this would be to: Look at the old code, and figure out the needs of the new system. Design and build out the APIs that provide the capabilities you need. Go into the old code, and update references to use the new APIs. Do it behind a feature flag. Cut over using the feature flag. Fix any issues that come up until it’s working, turning off the feature flag if necessary to go back to the old code path. When it’s stable, remove the old code paths. Sounds reasonable, right? Well, this is the most common way software engineers operate, but this approach has a lot of landmines. What problems would I expect in this project? It may be appealing to build the new service in a way disconnected from the old system. After all, the design might feel purer. 
But you’re also introducing significantly more structural change and you’re making these changes without any integration into the old system. This increases integration pain significantly. My expectation would be that all the estimates for the project are unrealistic. And I’d expect the project to be considered a failure after it is completed, even if the resulting service has a generally good design. I would expect the switchover to the new system to be problematic. There will be a series of problems uncovered as you switch over, that will require switching back to the old code paths or working intensely to fix problems in the final stages of the project. Both of these things are avoidable, by not having a huge cutover. Note that even cutting over one percent of traffic to the new service with a feature flag is a cutover approach. Why? You’re cutting over all that one percent of traffic to all the changes at the same time. I still would not expect it to go well. You are taking steps that are too large. Example Using a Steel Thread Contrast that approach with the Steel Thread way of doing it. Think about the new system you’re building. Come up with some narrow use cases that represent Steel Threads of the system – they cover useful functionality into the system, but don’t handle all use cases, or are constrained in some ways. Choose a starting use case that is as narrow as possible, that provides some value. For example, you might choose one API that you think would be part of the new service. Build out the new API in a new service. Make it work for just that narrow use case. For any other use case, use the old code path. Get it out to production, and into full use. (Tip: you could even do both the new AND old code path, and compare!) Then you gradually add the additional use cases, until you’ve moved all of the functionality you need to, to the new service. Each use case is in production. Once you’re done, you rip out the old code and feature flags. 
This isn’t risky at all, since you’re already running on the new system. Steel Threads Avoid Integration Pain, and Give You Higher Confidence Integration pain is one of the bigger causes of last-minute problems in projects. When you cut over to a new system, you always find problems you don’t expect. You should be suspicious of anything that involves a cut-over. Do things in small increments. Steel Threads integrate from the beginning, so you never have a lot of integration pain to wade through. Instead, you have small integration pain, all along the way. Also, your service never needs to be tested before it goes live, because you’ve tested it incrementally, along the way. You know it can handle production loads. You’ve already added network latency, so you know the implications of that. All the surprises are moved forward, and handled incrementally, as just part of the way you gradually roll out the service. The important thing is that you have a working, integrated system, and as you work on it, you keep it working. And you flesh it out over time. Steel Threads Can Help Cut Through Complexity When you’re designing a system, you have a LOT of complexity. Building a set of requirements for the new system can be a challenging endeavor. When using a Steel Thread approach, you choose some of the core requirements and phrase them in a way that cuts through the layers of the system, and exercises your design. It provides a sort of skeletal structure for the whole system. The implementation of that Steel Thread then becomes the bones upon which further requirements can be built. Thus, Steel Threads are a subset of the requirements of a system. For example, let’s say you’re implementing a clone of Slack. Your initial Steel Thread might be something like: “Any unauthenticated person can post a message in a hardcoded #general room in a hardcoded account. Messages persist through page refreshes.” Note how limited this initial Steel Thread is. 
It doesn’t handle authentication, users, or accounts. It does handle writing messages, and persisting them. Your second Steel Thread can move the system towards being more useful. You could, for example, have a Steel Thread that allows the message poster to choose the name they post under. This second Steel Thread hasn’t actually done much. You still don’t have authentication, accounts, or even a concept of a user. But you have made a chat room that works enough that you can start using it. Steel Threads Provide Early Feedback Note that in this Slack clone example, you can get early feedback on the system you’re building, even though you haven’t built that much yet. This is another powerful reason for using Steel Threads. After just those two Steel Threads, your team could start using the chat room full-time. Think about how much your team will learn from using your system. It’s a working system. Compare that to what you would have learned building out the User and Account systems, hooking everything up, and finally building out a chat room. Start With Steel Threads Steel Threads are often a good place to start when designing your projects. They create a skeleton for the rest of the work to come. They nail down the core parts of the system so that there are natural places to flesh out. I encourage you to try a Steel Threaded approach. I think you’ll find it can transform your projects. Let me know your experiences with it! Steel Threads Are Closely Related To Vertical Slices You may have heard of the term “vertical slicing." I describe the concept in my post on Milestones. Steel Threads are a software design technique that results in delivering your software in vertical slices. The term tends to be used to describe the initial vertical slices of a system. They’re closely related concepts, but not completely the same. I’ve also heard of Steel Threads being referred to as “tracer bullets."
Three Hard Facts First, the complexity of your software systems is through the roof, and you have more external dependencies than ever before. 51% of IT professionals surveyed by SolarWinds in 2021 selected IT complexity as the top issue facing their organization. Second, you must deliver faster than the competition, which is increasingly difficult as more open-source and reusable tools let small teams move extremely fast. Of the 950 IT professionals surveyed by Red Hat, only 1% indicated that open-source software was “not at all important.” And third, reliability is slowing you down. The Reliability/Speed Tradeoff In the olden days of software, we could just test the software before a release to ensure it was good. We ran unit tests, made sure the QA team took a look, and then we’d carefully push a software update during a planned maintenance window, test it again, and hopefully get back to enjoying our weekend. By 2023 standards, this is a lazy pace! We expect teams to constantly push new updates (even on Fridays) with minimal dedicated manual testing. They must keep up with security patches, release the latest features, and ensure that bug fixes flow to production. The challenge is that pushing software faster increases the risk of something going wrong. If you took the old software delivery approach and sped it up, you’d undoubtedly have constantly broken releases. To solve this, modern tooling and cloud-native infrastructure make delivering software more reliable and safer, all while reducing the manual toil of releases. According to the 2021 State of DevOps report, more than 74% of organizations surveyed have a Change Failure Rate (CFR) greater than 16%. In other words, for organizations seeking to speed up software changes (see DORA metrics), a meaningful share of updates cause issues that require additional remediation, like a hotfix or rollback. If your team hasn’t invested in improving the reliability of software delivery tooling, you won’t be able to achieve reliable releases at speed. 
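Change Failure Rate is simple arithmetic: the share of deployments that needed remediation. Here is a minimal sketch, assuming a hypothetical `Deployment` record with a `needed_remediation` flag (your delivery tooling will have its own shape for this data, and the version numbers are made up):

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    version: str
    needed_remediation: bool  # True if a hotfix or rollback followed


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that required a hotfix or rollback."""
    if not deployments:
        return 0.0
    failed = sum(d.needed_remediation for d in deployments)
    return failed / len(deployments)


if __name__ == "__main__":
    history = [
        Deployment("1.4.0", False),
        Deployment("1.4.1", True),   # rolled back
        Deployment("1.4.2", False),
        Deployment("1.5.0", False),
        Deployment("1.5.1", True),   # hotfixed
    ]
    print(f"CFR: {change_failure_rate(history):.0%}")  # CFR: 40%
```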
In today’s world, all your infrastructure, including dev/test infrastructure, is part of the production environment. To go fast, you also have to go safely. Smaller incremental changes, automated release and rollback procedures, high-quality metrics, and clearly defined reliability goals make fast and reliable software releases possible. Defining Reliability With clearly defined goals, you will know if your system is reliable enough to meet expectations. What does it mean to be up or down? You have hundreds of thousands of services deployed in clouds worldwide in constant flux. Developers no longer coordinate releases and push software. Dependencies break for unexpected reasons. Security fixes force teams to rush updates to production to avoid costly data breaches and cybersecurity threats. You need a structured, interpreted language to encode your expectations, the limits of your systems, and automated corrective actions. Today, definitions are in code. Anything less is undefined. The alternative is manual intervention, which will slow you down. You can’t work on delivering new features if you’re constantly trying to figure out what’s broken and fix releases that have already gone out the door. The most precious resource in your organization is attention, and the only way to create more is to reduce distractions. Speeding Up Reliably Service level objectives (SLOs) are reliability targets that are precisely defined. SLOs include a pointer to a data source, usually a query against a monitoring or observability system. They also have a defined threshold and targets that clearly define pass or fail at any given time. SLOs include a time window (either rolling or calendar aligned) to count errors against a budget. OpenSLO is an open specification for declaring your reliability targets as code. Once you have SLOs to describe your reliability targets across services, something changes. 
While SLOs don’t improve reliability directly, they shine a light on the disconnect between expectations and reality. There is a lot of power in simply clarifying and publishing your goals. What was once a rough shared understanding becomes explicitly defined. We can debate the SLO and decide to raise, lower, redefine, split, combine, and modify it with a paper trail in the commit history. We can learn from failures as well as successes. Whatever other investments you’re making, SLOs help you measure and improve your service. Reliability is engineered; you can’t engineer a system without understanding its requirements and limitations. SLOs-as-code defines consistent reliability across teams, companies, implementations, clouds, languages, etc.
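As a rough illustration of the SLO anatomy described above (a data source, a threshold, a target, and a time window with an error budget), here is a minimal sketch. The `Slo` class, its field names, and the sample numbers are assumptions made for this example; it is not OpenSLO syntax or any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    query: str            # pointer to the data source (hypothetical query string)
    threshold_ms: float   # a request is "good" if it is at least this fast
    target: float         # fraction of requests that must be good, e.g. 0.99
    window_days: int      # rolling time window the budget is counted over

    def error_budget(self, total_requests: int) -> float:
        """Number of bad requests the window can absorb before the SLO fails."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, latencies_ms: list[float]) -> float:
        """Fraction of the error budget still unspent (negative means blown)."""
        bad = sum(1 for latency in latencies_ms if latency > self.threshold_ms)
        budget = self.error_budget(len(latencies_ms))
        if budget == 0:
            return 0.0 if bad == 0 else float("-inf")
        return 1.0 - bad / budget


if __name__ == "__main__":
    slo = Slo("checkout-latency", "p99 latency of checkout", 300.0, 0.99, 28)
    # 1,000 requests in the window, 4 of them slower than the threshold.
    latencies = [120.0] * 996 + [450.0] * 4
    remaining = slo.budget_remaining(latencies)
    print(f"{slo.name}: {remaining:.0%} of error budget left")
```

Because the definition lives in code, raising or lowering `target` or `threshold_ms` becomes a reviewable change with a paper trail in the commit history.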
What Is Distributed Tracing? The rise of microservices has enabled users to create distributed applications that consist of modular services rather than a single functional unit. This modularity makes testing and deployment easier while reducing the risk of a single point of failure in the application. But as applications scale and distribute their resources among multiple cloud-native services, tracing a single transaction becomes tedious and nearly impossible. Hence, developers need to apply distributed tracing techniques. Distributed tracing allows a single transaction to be tracked from the front end to the backend services while providing visibility into the systems’ behavior. How Distributed Tracing Works The distributed tracing process operates on a fundamental concept of being able to trace every transaction through multiple distributed components of the application. To achieve this visibility, distributed tracing technology uses a unique identifier, namely the Trace ID, to tag each transaction. The system then puts together each trace from the various components of the application by using this unique identifier, thus building a timeline of the transaction. Each trace consists of one or more spans, each representing a single operation within that trace. It is essential to understand that a span can be the parent of another span, indicating that the parent span triggered the child span. Implementing Distributed Tracing Setting up distributed tracing depends on the selected solution. However, every solution will consist of these common steps. These three steps ensure developers have a solid base to start their distributed tracing journey: Setting up a distributed tracing system. Instrumenting code for tracing. Collecting and storing trace data. 1. Setting Up a Distributed Tracing System Selecting the right distributed tracing solution is crucial. Key aspects, such as compatibility and scale, must be addressed. 
Many distributed tracing tools support various programming languages, including Node.js, Python, Go, .NET, Java, etc. These tools allow developers to use a single solution for distributed tracing across multiple services. 2. Instrumenting Code for Tracing Depending on the solution, the method of integration may change. The most common approach many solutions provide is using an SDK that collects the data during runtime. For example, developers using Helios with Node.js install the latest Helios OpenTelemetry SDK by running the following command: npm install --save helios-opentelemetry-sdk Afterward, define the following environment variables, which enable the SDK to collect the necessary data from the service: export NODE_OPTIONS="--require helios-opentelemetry-sdk" export HS_TOKEN="{{HELIOS_API_TOKEN}}" export HS_SERVICE_NAME="<Lambda01>" export HS_ENVIRONMENT="<ServiceEnvironment01>" 3. Collecting and Storing Trace Data In most distributed tracing systems, trace data collection occurs automatically at runtime. Then, this data makes its way to the distributed tracing solution, where the analysis and visualization occur. The collection and storage of the trace data depend on the solution in use. For example, if the solution is SaaS-based, the solution provider will take care of all trace data collection and storage aspects. However, if the tracing solution is self-hosted, the responsibility of taking care of these aspects falls on the administrators of the solution. Analyzing Trace Data Analyzing trace data can be tedious. However, visualizing the trace data makes it easier for developers to understand the actual transaction flow and identify anomalies or bottlenecks. A trace visualization shows the flow of the transaction through the various services and components of the application. An advanced distributed tracing system may highlight errors and bottlenecks that each transaction runs into. 
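The trace model described earlier (one Trace ID shared by every span in a transaction, with parent spans triggering child spans) is what makes this kind of analysis possible. Here is a minimal sketch, assuming a simplified, hypothetical `Span` record rather than OpenTelemetry's or any vendor's actual data format; the service names and timings are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def group_by_trace(spans: list[Span]) -> dict[str, list[Span]]:
    """Stitch spans reported by different services into per-trace timelines."""
    traces: dict[str, list[Span]] = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    for timeline in traces.values():
        timeline.sort(key=lambda s: s.start_ms)
    return dict(traces)


def self_time_ms(span: Span, timeline: list[Span]) -> int:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially; overlapping children would need
    interval merging.
    """
    children = [s for s in timeline if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def bottleneck(timeline: list[Span]) -> Span:
    """Return the span with the largest self time in one trace."""
    return max(timeline, key=lambda s: self_time_ms(s, timeline))


if __name__ == "__main__":
    spans = [
        Span("t1", "a", None, "frontend", 0, 100),
        Span("t1", "b", "a", "orders-api", 10, 90),
        Span("t1", "c", "b", "database", 20, 80),
        Span("t2", "d", None, "frontend", 5, 15),
    ]
    for trace_id, timeline in group_by_trace(spans).items():
        slowest = bottleneck(timeline)
        print(trace_id, "->", slowest.service, self_time_ms(slowest, timeline), "ms")
```

Measuring self time rather than total duration keeps the root span from always looking like the culprit, since a parent's duration includes everything its children did.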
Since the trace data contains the time it takes for each service to process the transaction, developers can analyze the latencies and identify abnormalities that may impact the application’s performance. Identifying an issue using the distributed tracing solution can provide insight into the problem that has taken place. However, to gain further details regarding the issue, developers may need additional tools that provide further observability or can correlate traces with logs to identify the cause. Distributed tracing solutions, such as Helios, offer insight into the error’s details, which eases the developer’s burden. Best Practices for Distributed Tracing A comprehensive distributed tracing solution empowers developers to respond to crucial issues swiftly. The following best practices set the fundamentals for a successful distributed tracing solution. 1. Ensuring Trace Data Accuracy and Completeness Collecting trace data from services enables developers to identify the performance and latency of all the services each transaction flows through. However, when the trace data does not contain information from a specific service, it reduces the accuracy of the entire trace and its overall completeness. To ensure developers obtain the most out of distributed tracing, it is vital that the system collects accurate trace information from all services to reflect the original data. 2. Balancing Trace Overhead and Detail Collecting all trace information from all the services will provide the most comprehensive trace. However, collecting all of this trace information comes at the cost of added overhead on the overall application or the individual service. The tradeoff between the amount of data collected and the acceptable overhead is crucial. Planning for this tradeoff ensures that the overhead of distributed tracing does not outweigh the benefits it brings. 
Another way to balance these aspects is to filter and sample the trace information so that only what is required is collected. However, this requires additional planning and a thorough understanding of the requirements to collect valuable trace information. 3. Protecting Sensitive Data in Trace Data Collecting trace information from transactions includes collecting payloads of the actual transaction. This information is usually considered sensitive since it may contain personally identifiable information of customers, such as driver’s license numbers or banking information. Regulations worldwide clearly define what information may be stored during business operations and how to handle this information. Therefore, it is essential that the collected information undergoes data obfuscation. Helios enables its users to easily obfuscate sensitive data from the payloads collected, thereby enabling compliance with regulations. In addition to obfuscation, Helios provides other techniques to enhance and filter out the data sent to the Helios platform. Distributed Tracing Tools Today, numerous distributed tracing tools are available to help developers resolve issues more quickly. 1. Lightstep Lightstep is a cloud-agnostic distributed tracing tool that provides full-context distributed tracing across multi-cloud environments or microservices. It enables developers to integrate the solution with complex systems with little extra effort. It also provides a free plan with the features required for developers to get started on their distributed tracing journey. In addition, the free plan offers many helpful features, including data ingestion, analysis, and monitoring. Source: Lightstep UI 2. Zipkin Zipkin is an open-source solution that provides distributed tracing with easy-to-use steps to get started. It enhances its distributed tracing efforts by enabling the integration with Elasticsearch for efficient log searching. 
Source: Zipkin UI Zipkin was developed at Twitter to gather crucial timing data needed to troubleshoot latency issues in service architectures, and it is straightforward to set up with a simple Docker command: docker run -d -p 9411:9411 openzipkin/zipkin 3. Jaeger Tracing Jaeger Tracing is another open-source solution that provides end-to-end distributed tracing and the ability to perform root cause analysis to identify performance issues or bottlenecks across each trace. It also supports Elasticsearch for data persistence and exposes Prometheus metrics by default to help developers derive meaningful insights. In addition, it allows filtering traces based on duration, service, and tags using the pre-built Jaeger UI. Source: Jaeger Tracing 4. SigNoz SigNoz is an open-source tool that enables developers to perform distributed tracing across microservices-based systems while capturing logs, traces, and metrics and later visualizing them within its unified UI. It also provides insightful performance metrics such as the p50, p95, and p99 latency. Some key benefits of using SigNoz include the consolidated UI that showcases logs, metrics, and traces while supporting OpenTelemetry. Source: SigNoz UI 5. New Relic New Relic is a distributed tracing solution that can observe 100% of an application’s traces. It provides compatibility with a vast technology stack and support for industry-standard frameworks such as OpenTelemetry. It also supports alerts to diagnose errors before they become major issues. New Relic has the advantage of being a fully managed, cloud-native solution with support for on-demand scalability. In addition, developers can use a single agent to automatically instrument the entire application code. Source: New Relic UI 6. Datadog Datadog is a well-recognized solution that offers cloud monitoring as a service. 
It provides distributed tracing capabilities with Datadog APM, including additional features to correlate distributed tracing, browser sessions, logs, profiles, network, processes, and infrastructure metrics. In addition, Datadog APM allows developers to easily integrate the solution with the application. Developers can also use the solution’s capabilities to seamlessly instrument application code to monitor cloud infrastructure. Source: Datadog UI 7. Splunk Splunk offers a distributed tracing tool capable of ingesting all application data while enabling an AI-driven service to identify error-prone microservices. It also adds the advantage of correlating between application and infrastructure metrics to better understand the fault at hand. You can start with a free tier that brings in essential features. However, it is crucial to understand that this solution will store data in the cloud; this may cause compliance issues in some industries. Source: Splunk UI 8. Honeycomb Honeycomb brings in distributed tracing capabilities in addition to its native observability functionalities. One of its standout features is that it uses anomaly detection to pinpoint which spans are tied to bad user experiences. It supports OpenTelemetry so that developers can instrument code without being locked in to a single vendor, and it offers a pay-as-you-go pricing model so you only pay for what you use. Source: Honeycomb UI 9. Helios Helios brings advanced distributed tracing techniques that enhance the developer’s ability to get actionable insight into the end-to-end application flow by adapting OpenTelemetry’s context propagation framework. The solution provides visibility into your system across microservices, serverless functions, databases, and third-party APIs, thus enabling you to quickly identify, reproduce, and resolve issues. 
Source: Helios Sandbox Furthermore, Helios provides a free trace visualization tool based on OpenTelemetry that allows developers to visualize and analyze a trace file by simply uploading it. Conclusion Distributed tracing has seen many iterations and feature enhancements that allow developers to easily identify issues within the application. It reduces the time taken to detect and respond to performance issues and helps developers understand the relationships between individual microservices. The future of distributed tracing is likely to incorporate multi-cloud tracing, enabling developers to troubleshoot issues across various cloud platforms. Such platforms would also consolidate each trace, removing the need for developers to manually trace transactions across each cloud platform, which is time-consuming and nearly impossible to achieve. I hope you have found this helpful. Thank you for reading!