Similarweb Engineering - Medium

Resilience in Code: Lessons Learned After 97 Days at War

Dor Amram — Tue, 23 Jan 2024 11:09:09 GMT

Introduction

Three months ago, life as I knew it was upended. The onset of war in my country led to a swift and unexpected transition for many of us, including myself, from my role as software engineer to fulfilling civic duty in the reserves. This abrupt change left a profound impact, not just on me but on several of my colleagues who found themselves in similar situations, with some still serving. This blog post is a reflection on this shared experience, focusing on the unique challenges and learnings that come with such sudden transitions in my professional life.

Understanding the Process

The experience of leaving a familiar work environment for an uncertain future brought about a profound realization of the challenges faced in such transitions. The first lesson was clear: the importance of robust, adaptable processes in a tech team. When a key member steps away suddenly, the team’s ability to maintain momentum hinges on these established practices. In my case, and for many of my colleagues, the disruption highlighted the crucial role of clear communication channels and flexible workflows, which had been set up but not fully appreciated until tested by these circumstances.

The second lesson was about the resilience of our team dynamics. The absence of one or more team members can strain a team, yet it can also reveal the strength of the collective. It’s in these moments that the team’s adaptability and cohesiveness are truly tested. For us, it meant redistributing responsibilities, relying more on remote collaboration tools, and finding new ways to support each other, both professionally and emotionally.

The Essence of Good Coding Practices During Absences

In the whirlwind of our sudden departures, the underlying principles of good coding practices stood out as a beacon. These practices, often summed up as ‘clean code’, go beyond mere neatness. They encompass writing code that is not just functional but also clear, well-organized, and easily maintainable. In our absence, this approach to coding proved to be a critical asset.

The true power of these practices became apparent when we returned and seamlessly re-engaged with our projects. The code we revisited wasn’t a puzzle to be solved; it was a clear map, guiding us through the logic and decisions made in our absence. This clarity in code meant less time deciphering and more time contributing effectively, facilitating a smoother transition back into the team.

Moreover, these coding principles fostered a sense of ongoing collaboration. Despite not being physically present, the well-structured code acted as a continuous thread of communication. It bridged the gaps left by our absence, ensuring that projects didn’t just survive but thrived. This experience underscored the fact that good coding practices are more than technical necessities; they are vital for sustaining team momentum and adaptability in times of change.

Personal Insights on Team Communication

During my time away and upon my return, the critical role of efficient communication within the team became increasingly apparent. It wasn’t just about staying in touch; it was about ensuring that the communication was effective and facilitated a smooth transition.

The essence of good communication in our team was rooted in clarity and precision. Every interaction, whether it was a quick update or a detailed discussion, was approached with a focus on clear, concise, and meaningful exchanges. This approach reduced misunderstandings and streamlined our collaborative efforts, making it easier for me and others who were away to reintegrate seamlessly.

Additionally, the importance of working cleanly and methodically played a significant role in our communication efficiency. When code, documentation, and project plans are structured and well-organized, they communicate as effectively as a well-crafted email or meeting. This clarity in our work products meant that a significant portion of our communication was already embedded in the work itself, reducing the need for lengthy explanations and clarifications.

This combination of efficient communication and clean working practices created an environment where information flowed smoothly and collaboration was effortless. It highlighted that effective communication is not just about the frequency or methods used but is deeply interconnected with how we approach our work. The efforts made by my team in maintaining these standards significantly eased the challenges of my reintegration, making it a cohesive and productive experience.

This experience taught a crucial lesson: the symbiotic relationship between effective communication and orderly work practices is key to team resilience and adaptability, particularly during times of transition.

Adapting to Task Nature During Transition Periods

There was a brief period during my absence when I was able to return to the team for a few days. This short stint back in the office offered a unique insight into how task assignments adapt during such transition periods. Instead of diving back into the deep end with mission-critical tasks, I was assigned to handle smaller, infrastructure-related tasks. This strategic choice by our team leadership proved to be incredibly beneficial for several reasons.

Firstly, these tasks, while less critical, were vital for the smooth running of our systems. They included minor coding tasks, system updates, and small-scale project management. This focus allowed me to contribute meaningfully without the pressure of immediately tackling complex, high-stakes projects.

Secondly, handling these tasks provided me with the opportunity to reacquaint myself with the team’s current workflow and technological stack. Changes, however minor, had occurred during my absence, and these tasks acted as a gentle reintroduction to these new elements.

Lastly, this approach had a positive impact on team dynamics. It eased the burden of my readjustment on my colleagues, ensuring that their workflow wasn’t disrupted by my need to catch up. It also demonstrated the team’s understanding and support, acknowledging that a gradual reintegration was more effective than an immediate deep dive.

This experience underscored a valuable lesson: the importance of thoughtful task allocation during transitional periods. It not only aids in the smooth reintegration of returning team members but also maintains team stability and productivity. It’s a strategy that not only benefits the returning team member but also the entire team, fostering a more supportive and adaptable work environment.

Transitioning Back to Civilian Life

Returning to civilian life after serving in the reserves is a process, one that unfolds over time and requires patience and understanding. This transition is not just a change in routine; it’s a shift in mindset and environment. It’s essential to recognize and accept that it’s okay for this adjustment to take time.

My own journey back to the civil world was significantly eased by the unwavering support I received from my manager. Their understanding and empathy during this period were invaluable. They recognized the challenges of this transition and provided the necessary space and support for me to gradually readjust to the civilian work environment. This support was not just about easing back into work; it was about reorienting myself to a life that had momentarily taken a backseat.

The patience and guidance from my manager and colleagues reminded me that it’s not about rushing to ‘get back to normal’ but allowing oneself the time and space to adapt at a comfortable pace. Their support was a critical element in my smooth transition, reinforcing the idea that the journey back to civilian life is a gradual process, one that benefits greatly from a supportive and understanding work environment.

Conclusion

The journey of stepping away and then returning to a tech team, especially under extraordinary circumstances like a national crisis, is fraught with challenges. Yet, it also brings invaluable lessons and opportunities for growth. The experiences shared in this blog post — from adapting to changes in team dynamics to embracing efficient communication and orderly work practices — demonstrate that we are equipped with the tools and resilience needed to navigate whatever lies ahead.

As I reflect on the myriad of challenges and transitions we’ve faced, one thing becomes clear: the future is uncertain, but our preparation for it is not. While we cannot predict the future, our journey thus far has armed us with adaptability, empathy, and innovative thinking. These are the tools that will guide us through unknown territories. It’s with this mindset that we can approach the future, not with apprehension, but with the confidence that comes from knowing we are prepared to face and adapt to whatever challenges it may bring.

Resilience in Code: Lessons Learned After 97 Days at War was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Navigating Rough Waters: Shedding Technical Debt

Dor Amram — Mon, 04 Sep 2023 10:47:14 GMT

If you’ve been in the software engineering field for even a short period, you’ve likely encountered the beast we all know as “technical debt.” I’ve been there, too — staring at a screen filled with spaghetti code, wondering how we got here. Over the years, I’ve learned that technical debt isn’t just an annoying byproduct of development; it’s a reality that, if not managed well, can cripple even the most promising projects. In this blog post, I want to share my personal experiences and the strategies I’ve found effective for fighting technical debt. I’ll also talk about how I’ve been working on creating what I like to call a “technical immune system” to keep this debt in check.

The Reality of Technical Debt

Technical debt is like that credit card bill you keep saying you’ll pay off next month but never do. It accumulates interest, and before you know it, you’re stuck in a cycle of just paying the minimum amount, never really reducing the principal. In software terms, this means your team is spending more time fixing bugs and navigating through complex, outdated code than actually building new features.

Strategies for Fighting Tech Debt: A Deeper Dive

Regular Audits: The Health Check-Ups

Regular audits are akin to the health check-ups we all know we should be getting but often neglect. In the context of software engineering, these audits serve as a diagnostic tool to identify areas of the codebase that have accumulated technical debt. I’ve found that setting aside time for these audits at least once a quarter has been invaluable.

But the audit doesn’t stop at identification; it extends to action. Once the audit is complete, we categorize the issues based on their severity and impact. We then create actionable tickets and prioritize them in our development backlog. This ensures that the identified issues don’t just sit there but are actively addressed in subsequent sprints.
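As a rough illustration of the categorization step, the sketch below ranks audit findings by severity and impact. The `Finding` fields, the 1-to-5 scales, and the severity-times-impact score are assumptions made for this example, not a description of our actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One audit finding; the 1-5 scales are a hypothetical convention."""
    title: str
    severity: int  # 1 (cosmetic) .. 5 (blocks development)
    impact: int    # 1 (one module) .. 5 (whole product)


def prioritize(findings):
    """Return findings ordered from most to least urgent.

    The score is severity * impact. sorted() is stable, so findings
    with equal scores keep their original relative order.
    """
    return sorted(findings, key=lambda f: f.severity * f.impact, reverse=True)


backlog = prioritize([
    Finding("Duplicate date-parsing helpers", severity=2, impact=3),
    Finding("No tests around billing module", severity=4, impact=5),
    Finding("Outdated HTTP client dependency", severity=3, impact=4),
])
for finding in backlog:
    print(finding.severity * finding.impact, finding.title)
```

Any scoring rule works as long as it is applied consistently from one audit to the next, so that the resulting tickets can be compared over time.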

The key to making audits effective is consistency and follow-through. It’s easy to conduct an audit once and forget about it, but the real value comes from making it a recurring activity. This allows us to track our progress over time and ensures that we’re moving in the right direction in terms of code quality and maintainability.

Prioritize Refactoring: The Diet Plan

Refactoring is the “diet plan” of the software world. We all know it’s good for us, but it’s often the first thing to be sacrificed when deadlines loom. I’ve been guilty of this more times than I’d like to admit. However, I’ve come to realize that consistent, small-scale refactoring is far more manageable and effective than occasional, large-scale overhauls.

To make refactoring a priority, I’ve started allocating a fixed percentage of each quarter solely for these tasks. This ensures that refactoring becomes an integral part of our development cycle rather than a one-off activity that happens “when we have time.” The trick is to make it non-negotiable, just like you would with a diet plan.

The benefits of this approach are twofold. First, it helps in gradually reducing the existing technical debt. Second, it prevents the accumulation of new debt by ensuring that we continuously improve the codebase. It’s a proactive approach that pays dividends in the long run.

Automated Testing: The Exercise Regimen

Automated testing is the exercise regimen that keeps your codebase fit and healthy. I’ve found that a robust automated testing framework is an invaluable asset in the fight against technical debt. We use a combination of unit tests, integration tests, and end-to-end tests to cover as much ground as possible. These tests are run automatically as part of our CI/CD pipeline, ensuring that any new code or changes to existing code are thoroughly vetted before being deployed.

The beauty of automated testing is that it provides immediate feedback. If a piece of code doesn’t meet the expected standards or if it breaks existing functionality, we know right away. This allows us to catch issues early in the development cycle, making them easier and less costly to fix.

Moreover, a strong testing framework acts as a safety net, giving developers the confidence to refactor and make changes to the codebase. This is crucial for reducing technical debt, as it allows us to improve the code quality without the fear of breaking existing functionality.

Code Reviews: The Personal Trainer

Code reviews are the personal trainers of the software development world. They provide an external perspective, catch potential issues, and push you to do better. I’ve made it a rule in my team that no code gets merged into the main branch without undergoing a peer review. This practice serves multiple purposes.

First, it acts as a quality gate, ensuring that the code meets the team’s standards both in terms of functionality and readability. Second, it fosters a culture of collective code ownership. When multiple eyes scrutinize every line of code, it’s less likely that technical debt will slip through the cracks.

Lastly, code reviews are an excellent platform for knowledge sharing and mentorship. More experienced team members can provide insights and best practices, while less experienced members get the opportunity to learn and improve. It’s a win-win situation that not only helps in reducing technical debt but also contributes to the team’s overall growth and development.

Building a Technical Immune System

Monitoring: The Vital Signs

Monitoring is the heartbeat of a technical immune system. Imagine walking into a hospital room where the patient’s vital signs are continuously displayed on a monitor. The doctors and nurses can instantly see if something goes wrong. Similarly, in the realm of software engineering, monitoring tools act as our eyes and ears, continuously scanning the codebase for “vital signs” like code complexity, dependency vulnerabilities, and performance metrics.

Monitoring is not just about collecting data; it’s about making sense of it. We use tools like DataDog to visualize this data in real-time dashboards. We set up alerts that notify us via Slack or email if certain thresholds are crossed. For instance, if the build fails or if some data is missing, the team is immediately alerted.
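The threshold idea can be sketched independently of any monitoring vendor. The metric names and limits below are hypothetical, and the functions only decide what should be alerted on; delivering the message to Slack or email is left out.

```python
def breached_thresholds(metrics, thresholds):
    """Return {name: (value, limit)} for every metric above its limit.

    Metrics without a configured threshold are ignored.
    """
    return {
        name: (value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }


def format_alert(breaches):
    """Build a plain-text alert body, one line per breached metric."""
    return "\n".join(
        f"{name}: {value} exceeds {limit}"
        for name, (value, limit) in sorted(breaches.items())
    )


# Hypothetical readings and limits, for illustration only.
current = {"failed_builds": 1, "p95_latency_ms": 180, "open_vulnerabilities": 0}
limits = {"failed_builds": 0, "p95_latency_ms": 250}

breaches = breached_thresholds(current, limits)
if breaches:
    print(format_alert(breaches))
```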

The beauty of monitoring is that it allows for proactive rather than reactive measures. Instead of waiting for a system to fail or for a user to report an issue, we can identify potential problems before they escalate. This is akin to catching a disease in its early stages, making it much easier to treat. Monitoring is the cornerstone of our technical immune system, providing the data we need to make informed decisions.

Automated Workflows: The Auto-Healing Mechanism

Automated workflows are the auto-healing mechanisms of our technical immune system. Just like white blood cells in our body rush to the site of an infection to combat pathogens, automated workflows kick in when they detect issues in the codebase. We use Continuous Integration/Continuous Deployment (CI/CD) pipelines to automate a series of checks and balances. These pipelines run a battery of tests, perform code quality assessments, and even auto-refactor code where possible.

The power of automation lies in its consistency and speed. Manual processes are prone to human error and can be time-consuming. Automated workflows, on the other hand, execute the same set of tasks with machine-like precision, and they do it fast. This ensures that any new code or changes to existing code meet the predefined quality standards before they are deployed.

Moreover, these workflows are not set in stone; they evolve. As we identify new types of issues or adopt new technologies, we update our workflows to include checks for them. This adaptability makes our technical immune system resilient and up-to-date, capable of dealing with new “pathogens” as they emerge.

Knowledge Sharing: The Collective Immunity

Knowledge sharing is what I like to call the “collective immunity” of our technical ecosystem. Just as herd immunity protects a community from the spread of diseases, a shared knowledge base safeguards the team from repeating past mistakes and poor practices. We maintain an internal wiki that serves as a repository for all things technical — best practices, coding guidelines, lessons learned from past incidents, and even architectural decisions.

This knowledge base is a living document, continuously updated and enriched by contributions from team members. New hires are encouraged to go through this repository as part of their onboarding process. This not only brings them up to speed but also instills a culture of knowledge sharing right from the start.

The impact of this collective knowledge is exponential. It not only helps in reducing the introduction of new technical debt but also aids in faster problem-solving. When faced with a challenge, team members can refer to the knowledge base to see if a similar issue has been tackled before, saving time and effort.

Feedback Loops: The Body’s Response System

Feedback loops are akin to the body’s nervous system, sending signals from various parts to the brain for interpretation and action. In our technical environment, these loops are channels of continuous feedback from all stakeholders — developers, QA teams, product managers, and even end-users. We use tools like Jira and Slack to facilitate this communication, and we hold regular retrospectives to discuss what’s working and what’s not.

Feedback loops serve multiple purposes. First, they help us gauge the impact of technical debt on the product and user experience. Second, they provide insights into areas that may not be on our radar but are causing pain points for others. For example, a feature that we consider “done” might be causing usability issues for the end-users.

By closing the feedback loop, we not only improve the product but also fine-tune our technical immune system. We learn from our mistakes and successes alike, making necessary adjustments to our strategies, tools, and workflows. This iterative learning process is what makes our technical immune system robust and effective, capable of adapting to new challenges and complexities.

Conclusion

Navigating the dangerous waters of software development — filled with tight deadlines, rapid changes, and high stakes — can often lead to accumulating technical debt. Fighting it is a continuous journey, not a destination. It’s about making conscious choices and trade-offs.
By implementing the right strategies and building a resilient technical immune system, I’ve found a sustainable way to manage technical debt without stifling innovation. It’s all about being prepared and proactive, so you can not only survive but thrive in this challenging environment.

Navigating Rough Waters: Shedding Technical Debt was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Reduce NAT Gateway Charges By Identifying Missing VPC Endpoints

Liav Shabtai — Wed, 23 Aug 2023 12:10:38 GMT

When architecting a network, it often happens that all traffic is routed via a NAT Gateway. This can stem from network architecture habits inherited from the traditional data center, combined with a lack of awareness of the costs involved in using a NAT Gateway. As a result, the service can easily accumulate high charges while offering little visibility into the traffic traversing it.

To reduce costs for customers, AWS introduced VPC Endpoints. They enable customers to privately connect to supported AWS services through VPC endpoints powered by AWS PrivateLink, at a much lower price than the bytes-transferred charges of NAT Gateways.

This blog post will guide you through the essential steps to set up VPC Flow Logs in the most cost-efficient manner and to query them for missing VPC Endpoints. This gives FinOps and DevOps engineers crucial visibility into their NAT Gateway traffic, which may lead to substantial cost reductions.

At Similarweb, by adopting this optimization method, we successfully identified and configured missing VPC Endpoints, slashing our NAT traffic expenses by over 35%.

An added perk? If you follow the step-by-step guide below, the entire setup can be completed in under 30 minutes.

NAT Gateway VS VPC Endpoints Architecture and Pricing

NAT Gateway

NAT Gateways allow instances in a Virtual Private Cloud (VPC) to initiate traffic to the internet and receive the response, without allowing the internet to initiate a connection with the requesting instances.
They are typically used so that instances in a private subnet can reach the internet (for updates, patches, etc.) while remaining unreachable from it.

VPC Endpoints

Interface Endpoints: These enable a private connection to supported AWS services and VPC endpoint services, powered by AWS PrivateLink.
Gateway Endpoints: These can be created for Amazon S3 and DynamoDB and route traffic to these services.

Benefits of VPC Endpoints:

Security: Your traffic does not traverse the public internet, which reduces exposure to threats such as data breaches and data loss.
Performance: They provide reliable, and often faster, connections to AWS services.
Cost-Efficiency: Data processed through VPC Endpoints is less expensive than data processed through NAT Gateways. Specifically, VPC Gateway Endpoints for S3 and DynamoDB incur no additional charges. They should therefore be included in every network architecture, effectively eliminating current and future bytes-transferred charges associated with these services.

Traffic to AWS Services with VPC Endpoints Configured

Pricing:

The pricing information is accurate for US East (N. Virginia) at the time of this article’s publication.

Routing traffic to supported services via VPC Endpoints can be significantly more cost-effective than using the default NAT alternative, potentially reducing costs by over 75%. Beyond the direct cost savings, NAT Gateway traffic also incurs standard data transfer fees: additional charges for internet outbound and cross-AZ traffic (see the detailed pricing for Data Transfer Charges). VPC Endpoints remove these additional charges completely.
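To make the "over 75%" figure concrete, here is a small back-of-the-envelope calculation. The per-GB rates are assumptions on our part, based on the US East (N. Virginia) list prices as we understand them (0.045 USD per GB processed by a NAT Gateway, 0.01 USD per GB processed by an Interface Endpoint, nothing for Gateway Endpoints). Hourly charges and data transfer fees are left out, so check the current AWS pricing pages before relying on the numbers.

```python
# Assumed list prices in USD per GB processed (verify against AWS pricing).
NAT_GB_RATE = 0.045
INTERFACE_ENDPOINT_GB_RATE = 0.01
GATEWAY_ENDPOINT_GB_RATE = 0.0  # S3 and DynamoDB Gateway Endpoints


def processing_cost(gigabytes, rate_per_gb):
    """Data processing cost for a given monthly volume."""
    return gigabytes * rate_per_gb


def savings_percent(gigabytes, endpoint_rate, nat_rate=NAT_GB_RATE):
    """Percentage saved by sending the same volume through an endpoint."""
    nat = processing_cost(gigabytes, nat_rate)
    endpoint = processing_cost(gigabytes, endpoint_rate)
    return (nat - endpoint) / nat * 100


monthly_gb = 10_000  # hypothetical monthly traffic to one AWS service
print(f"NAT Gateway:        {processing_cost(monthly_gb, NAT_GB_RATE):.2f} USD")
print(f"Interface Endpoint: {processing_cost(monthly_gb, INTERFACE_ENDPOINT_GB_RATE):.2f} USD")
print(f"Savings:            {savings_percent(monthly_gb, INTERFACE_ENDPOINT_GB_RATE):.1f}%")
```

Under these assumed rates the saving on data processing is roughly 78% for Interface Endpoints, independent of volume, and 100% for Gateway Endpoints.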

Step By Step: How to Identify Missing VPC Endpoints in Your Network Architecture

Prerequisites

Cost and Usage Reports delivered to an Athena-configured S3 bucket, with resource IDs included for cost allocation (see the AWS guide).
An S3 bucket or prefix to deliver the VPC Flow Logs into. When working across accounts, a bucket policy must be in place to allow log delivery (see How to publish VPC Flow Logs to a different account).

Step 1: Focus on your top spending NAT Gateways

If you are unsure which NAT Gateways account for the highest Bytes transferred usage, execute the following Athena query. This will help identify the top NAT Gateways based on their Bytes transferred charges, allowing you to prioritize optimization efforts on them.

-- Results display NAT Gateway ARN costs in descending order
-- for the months of June, July, and August in the year 2023.
-- The database and table names around the dot are left blank; use your own CUR table.
select line_item_resource_id,
       sum("line_item_unblended_cost") as "unblended_cost"
from .
where line_item_usage_type like '%NatGateway-Bytes%'
  and year = '2023'
  and month in ('6','7','8')
group by 1
order by 2 desc;

Afterwards, using the AWS console, find the Elastic Network Interfaces (ENIs) attached to these NAT Gateways and their CIDR ranges.

Step 2: Enable VPC Flow Logs on chosen ENIs

The most cost-effective method of delivering VPC Flow Logs is to deliver them to S3 and store the files in compressed Parquet format. Be warned: the alternative of using CloudWatch to store and query the VPC Flow Logs can accumulate high charges.

Use the following Python script to create the flow logs automatically:

import boto3


def create_vpc_flow_logs(s3_location, eni_list, region):
    """Create S3-delivered, Parquet-formatted flow logs for the given ENIs."""
    # Initialize the EC2 client
    ec2 = boto3.client('ec2', region_name=region)

    # Create VPC flow logs
    create_response = ec2.create_flow_logs(
        DryRun=False,
        ResourceIds=eni_list,
        ResourceType='NetworkInterface',
        TrafficType='ALL',
        LogDestinationType='s3',
        LogDestination=s3_location,
        LogFormat='${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${pkt-srcaddr} ${pkt-dstaddr} ${pkt-src-aws-service} ${pkt-dst-aws-service} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${flow-direction}',
        TagSpecifications=[
            {
                'ResourceType': 'vpc-flow-log',
                # Illustrative tag (an assumption, rename freely); a tag
                # specification without any tags may be rejected by EC2.
                'Tags': [{'Key': 'Purpose', 'Value': 'nat-gateway-investigation'}]
            }
        ],
        MaxAggregationInterval=600,  # aggregate records over 10 minutes
        DestinationOptions={
            'FileFormat': 'parquet',
            'HiveCompatiblePartitions': False,
            'PerHourPartition': True
        }
    )

    # Return the creation response
    return create_response


# Configuration settings (fill in your own bucket ARN, ENI IDs, and region)
s3_location = 'arn:aws:s3:::/'
eni_list = ['', '']
region = ''

# Call the function and print results
response = create_vpc_flow_logs(s3_location, eni_list, region)
print(response)
if response['FlowLogIds']:
    print(f"Flow Logs successfully created, Flow Log ID: {response['FlowLogIds'][0]}")

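A call to `create_flow_logs` can succeed for some ENIs and fail for others; in that case the failures are reported under the `Unsuccessful` key of the response instead of being raised as an exception. Below is a minimal sketch of a helper that separates the two. It assumes the response shape documented for boto3 (`FlowLogIds`, plus `Unsuccessful` items carrying `ResourceId` and `Error`); the helper name and the sample response are made up for illustration.

```python
def summarize_flow_log_response(response):
    """Split a create_flow_logs response into created IDs and failures.

    Returns (created_ids, failures), where failures maps each ENI ID
    to the error message reported for it.
    """
    created = list(response.get("FlowLogIds", []))
    failures = {
        item.get("ResourceId", "unknown"): item.get("Error", {}).get("Message", "")
        for item in response.get("Unsuccessful", [])
    }
    return created, failures


# Made-up response, shaped like the boto3 output, for demonstration.
sample_response = {
    "FlowLogIds": ["fl-0123456789abcdef0"],
    "Unsuccessful": [
        {
            "ResourceId": "eni-0fedcba9876543210",
            "Error": {"Code": "ExampleError", "Message": "example failure"},
        }
    ],
}

created, failures = summarize_flow_log_response(sample_response)
print("Created:", created)
for eni_id, message in failures.items():
    print(f"Failed for {eni_id}: {message}")
```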
Manual Configuration

Click on the ENI in the AWS Console and choose Create flow log.
Configuration of Flow Logs:
Destination: Send to an S3 bucket
Log record format: change the default to the following custom format (see VPC Flow Logs Attributes):
${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${pkt-srcaddr} ${pkt-dstaddr} ${pkt-src-aws-service} ${pkt-dst-aws-service} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${flow-direction}
Log file format: Parquet
Partition logs by time: Every 1 hour (60 mins)

Step 3: Create VPC Flow Logs Table in Athena
CREATE EXTERNAL TABLE `vpc_flow_logs`(
  `version` int,
  `account_id` string,
  `interface_id` string,
  `srcaddr` string,
  `dstaddr` string,
  `pkt_srcaddr` string,
  `pkt_dstaddr` string,
  `pkt_src_aws_service` string,
  `pkt_dst_aws_service` string,
  `srcport` int,
  `dstport` int,
  `protocol` bigint,
  `packets` bigint,
  `bytes` bigint,
  `start` bigint,
  `end` bigint,
  `action` string,
  `log_status` string,
  `flow_direction` string,
  `vpc_id` string,
  `subnet_id` string,
  `instance_id` string,
  `tcp_flags` int,
  `type` string,
  `az_id` string,
  `sublocation_type` string,
  `sublocation_id` string,
  `traffic_path` int)
PARTITIONED BY (
  `region` string,
  `datehour` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3:///'
TBLPROPERTIES (
  'projection.datehour.format'='yyyy/MM/dd/HH',
  'projection.datehour.interval'='1',
  'projection.datehour.interval.unit'='HOURS',
  'projection.datehour.range'='2021/01/01/00,NOW',
  'projection.datehour.type'='date',
  'projection.enabled'='true',
  'projection.region.type'='enum',
  'projection.region.values'='us-east-1,',
  'skip.header.line.count'='1',
  'storage.location.template'='s3:////vpcflowlogs/${region}/${datehour}'
)
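The partition projection settings mean Athena derives each partition's S3 location from the `region` and `datehour` values, using the `yyyy/MM/dd/HH` format and the `storage.location.template` suffix. The sketch below mirrors that mapping in Python, which can help when checking that log files land where the table expects them. The base prefix is a hypothetical placeholder, and the function name is ours.

```python
from datetime import datetime, timezone


def partition_location(base_prefix, region, hour):
    """Return the S3 prefix holding flow logs for one region and hour.

    Mirrors '<base>/${region}/${datehour}' with datehour formatted as
    yyyy/MM/dd/HH (strftime equivalent: %Y/%m/%d/%H).
    """
    return f"{base_prefix.rstrip('/')}/{region}/{hour:%Y/%m/%d/%H}"


# Hypothetical base prefix, for illustration only.
base = "s3://example-bucket/example-prefix/vpcflowlogs"
hour = datetime(2023, 8, 23, 9, tzinfo=timezone.utc)

print(partition_location(base, "us-east-1", hour))
# The same hour expressed as a datehour partition value for a WHERE clause:
print(f"datehour = '{hour:%Y/%m/%d/%H}'")
```

Filtering on `region` and `datehour` in the queries keeps the amount of data scanned, and therefore the Athena cost, down.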
Step 4: Query the Traffic to Find Missing VPC Endpoints

The pkt_dst_aws_service and pkt_src_aws_service columns hold the name of the AWS service you are communicating with. However, most AWS services are still not mapped and receive the generic value “AMAZON”.

Mapped AWS service values for the pkt_src_aws_service and pkt_dst_aws_service columns

Query for Missing Endpoints for Mapped AWS Services

Calculate the total bytes transferred, grouped by the pkt_dst_aws_service and pkt_src_aws_service columns. This helps identify which mapped AWS services are sending data through the NAT Gateway rather than through a VPC Endpoint.

-- Uploads to AWS services
-- x.x.x.x is the NAT Gateway's IP address
-- y.y.%.% matches all addresses within the NAT Gateway's IP range
select pkt_dst_aws_service, sum(bytes)/(1000*1000) as "MB"
from finops.test_table_vpclogs
where srcaddr = 'x.x.x.x' and dstaddr not like 'y.y.%.%'
group by 1
order by 2 desc
limit 1000;

-- Downloads from AWS services
-- x.x.x.x is the NAT Gateway's IP address
-- y.y.%.% matches all addresses within the NAT Gateway's IP range
-- The AWS service is the packet source here, so group by pkt_src_aws_service
select pkt_src_aws_service, sum(bytes)/(1000*1000) as "MB"
from finops.test_table_vpclogs
where dstaddr = 'x.x.x.x' and srcaddr not like 'y.y.%.%'
group by 1
order by 2 desc
limit 1000;
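The two WHERE clauses can be read as a classification of each flow record by direction. The sketch below expresses the same logic in Python with the standard `ipaddress` module, which also shows what the `y.y.%.%` pattern stands for: membership in the VPC's CIDR range. The addresses are made-up examples, and the function is an illustration of the filter, not part of the Athena setup.

```python
import ipaddress


def classify_flow(srcaddr, dstaddr, nat_ip, vpc_cidr):
    """Classify one flow record seen on the NAT Gateway's ENI.

    'upload'   : the NAT Gateway sends to an address outside the VPC
    'download' : the NAT Gateway receives from an address outside the VPC
    'internal' : anything else (traffic between the NAT and VPC resources)
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    src_in_vpc = ipaddress.ip_address(srcaddr) in vpc
    dst_in_vpc = ipaddress.ip_address(dstaddr) in vpc
    if srcaddr == nat_ip and not dst_in_vpc:
        return "upload"
    if dstaddr == nat_ip and not src_in_vpc:
        return "download"
    return "internal"


# Hypothetical addresses, for illustration only.
NAT_IP = "10.0.1.25"
VPC_CIDR = "10.0.0.0/16"

print(classify_flow("10.0.1.25", "52.94.1.1", NAT_IP, VPC_CIDR))
print(classify_flow("52.94.1.1", "10.0.1.25", NAT_IP, VPC_CIDR))
print(classify_flow("10.0.2.7", "10.0.1.25", NAT_IP, VPC_CIDR))
```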
Query for missing Endpoints for Remaining AWS Services
It is very likely you will find that most traffic is for unmapped services which receive the value “AMAZON” for the pkt_dst_aws_service column.
Here is how to inspect them:
-- Uploads to AWS services
-- x.x.x.x is the NAT Gateway's IP address
-- y.y.%.% matches all addresses within the NAT Gateway's IP range
select dstaddr, pkt_dstaddr, pkt_dst_aws_service, sum(bytes)/(1024*1024*1024) as "GB Transferred"
from .vpc_flow_logs
where srcaddr = 'x.x.x.x' and dstaddr not like 'y.y.%.%'
  and "pkt_dst_aws_service" = 'AMAZON'
group by 1,2,3
order by 4 desc  -- largest transfers first
limit 1000;

-- Downloads from AWS services
-- x.x.x.x is the NAT Gateway's IP address
-- y.y.%.% matches all addresses within the NAT Gateway's IP range
-- The AWS service is the packet source here, so use the source columns
select srcaddr, pkt_srcaddr, pkt_src_aws_service, sum(bytes)/(1024*1024*1024) as "GB Transferred"
from .vpc_flow_logs
where dstaddr = 'x.x.x.x' and srcaddr not like 'y.y.%.%'
  and "pkt_src_aws_service" = 'AMAZON'
group by 1,2,3
order by 4 desc  -- largest transfers first
limit 1000;
The destination address column will display the IP addresses of the unmapped AWS services that the NAT Gateway is communicating with.

From here, we chose to cherry-pick specific IP addresses, starting with those that transferred the largest amounts of data, and did the following:

Open a web browser and navigate to http:// followed by the IP address.
Click Advanced.
The searched IP address was communicating with ECR through the NAT Gateway, not through a VPC Endpoint.

In our case, individually reviewing the list of destination IP addresses with the highest traffic was sufficient. Nonetheless, we acknowledge that third-party tools can enhance and automate this phase of the procedure.
Step 5: Create the Missing VPC endpoints
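As one possible way to script this step, the sketch below builds the arguments for boto3's `create_vpc_endpoint` call for a Gateway Endpoint (S3 or DynamoDB). The resource IDs are hypothetical placeholders, and the helper name is ours. Interface Endpoints (for ECR, for example) take subnet and security group IDs instead of route tables and are not covered here.

```python
def gateway_endpoint_request(vpc_id, region, service, route_table_ids):
    """Build keyword arguments for ec2.create_vpc_endpoint (Gateway type)."""
    if service not in ("s3", "dynamodb"):
        raise ValueError("Gateway Endpoints are available only for s3 and dynamodb")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "RouteTableIds": list(route_table_ids),
    }


# Hypothetical IDs, for illustration only.
request = gateway_endpoint_request(
    vpc_id="vpc-0123456789abcdef0",
    region="us-east-1",
    service="s3",
    route_table_ids=["rtb-0123456789abcdef0"],
)
print(request)

# With AWS credentials configured, the request could then be sent like this:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.create_vpc_endpoint(**request)
```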
Step 6: Terminate all VPC Flow logs after investigation to not incur further charges for Log Delivery.
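The cleanup can be scripted as well. This is a minimal sketch built around boto3's `delete_flow_logs` call; the function takes the EC2 client as a parameter so it can be exercised without AWS access, and the flow log ID in the usage comment is a placeholder.

```python
def delete_flow_logs(ec2_client, flow_log_ids):
    """Delete the given flow logs and return any reported failures.

    Returns the 'Unsuccessful' list from the response (empty when every
    deletion succeeded or when there was nothing to delete).
    """
    ids = list(flow_log_ids)
    if not ids:
        return []
    response = ec2_client.delete_flow_logs(FlowLogIds=ids)
    return response.get("Unsuccessful", [])


# With AWS credentials configured, usage could look like this:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   failures = delete_flow_logs(ec2, ["fl-0123456789abcdef0"])
#   print(failures or "All flow logs deleted")
```

Deleting a flow log stops further delivery; log files already written to S3 remain there until removed separately or expired by a lifecycle rule.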
Conclusion:
By identifying and leveraging VPC Endpoints, organizations can not only secure their traffic but also avoid unnecessary expenses associated with NAT Gateways and potentially reduce their NAT Traffic charges drastically. This article has highlighted the foundational steps required for any FinOps or DevOps engineer to gain a clear insight into their NAT Gateway traffic, enabling them to identify and fill the gaps where VPC Endpoints are missing.
We managed to decrease our NAT Gateway traffic costs drastically, by over 35%. How much will you save?
Supporting Sources:
Mastering AWS Cost Optimization: A comprehensive guide on AWS costs, covering services, pricing models, and cost-reduction strategies.
Overview of Data Transfer Costs for Common Architectures: An article from the AWS Architecture Blog detailing data transfer costs for various AWS architectures and best practices for cost optimization.
Reduce NAT Gateway Charges By Identifying Missing VPC Endpoints was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Demystifying Advanced Git Commands for More Effective Version Control
Dor Amram — Wed, 02 Aug 2023 08:21:37 GMT
Version control is a crucial part of any software development process, allowing developers to track and manage changes to their codebase over time. One of the most popular systems for version control is Git. While most developers are familiar with basic commands like `git commit` or `git push`, Git also has many advanced commands that can help you work more efficiently and manage complex code changes. Today, we are going to dive into some of these advanced Git commands to help you supercharge your workflow.
Before proceeding, ensure you have a basic understanding of Git and you’ve used it to manage version control. If you’re still new to Git, you might want to start with some of the basic commands first.
1. git bisect
`git bisect` is a binary search command in Git that enables you to find the commit that introduced a bug in your code. This can be incredibly useful when you know that your code was working at one point and has since broken, but you aren’t sure exactly when the bug was introduced.
The process starts by marking the last known good commit as `good` and the current bad commit as `bad`. Git will then check out a commit in the middle, and you have to test your program. Depending on whether the bug appears or not, you mark this commit as `good` or `bad`. Git will then keep bisecting until it finds the exact commit that introduced the bug.
git bisect start
git bisect good {LAST_KNOWN_GOOD_COMMIT}
git bisect bad {CURRENTLY_BAD_COMMIT}
2. git rebase -i (Interactive Rebasing)
Git’s interactive rebasing feature allows you to modify previous commits by combining, altering, or even removing them. This can be particularly useful for cleaning up your commit history before merging a feature branch into the main branch.
git rebase -i HEAD~n
In this command, `n` is the number of commits you want to include in the rebase.
The command will open a text editor with a list of the last `n` commits, each prefixed with the word `pick`. You can replace `pick` with commands such as `reword` to change a commit’s message, `squash` to combine a commit with the one before it, or `drop` to remove a commit entirely.
3. git stash
`git stash` is a command that allows you to save changes that you don’t want to commit immediately. It’s very useful when you want to switch branches but you aren’t ready to commit your changes.
You can use `git stash push -m "message"` to save your changes with a descriptive message (the older `git stash save "message"` form still works but is deprecated). The changes are saved into a stack, and you can retrieve them later with the `git stash pop` command.
git stash push -m "work in progress for feature X"
4. git cherry-pick
`git cherry-pick` is a powerful command that allows you to apply the changes from an existing commit to your current working branch. This is useful when you want to grab a specific change from another branch without merging all the changes from that branch.
git cherry-pick {COMMIT_HASH}
In the above command, replace `{COMMIT_HASH}` with the hash of the commit you want to cherry-pick.
5. git reflog
If you’ve ever been in a situation where you lost a commit, `git reflog` is the superhero command that can help you recover it. `git reflog` maintains a log of where your HEAD and branch references have pointed in the past. This makes it possible to recover lost commits, or even lost branches.
git reflog
Handy Git Aliases for a Streamlined Workflow
As an added bonus, I’d love to share with you some of my favorite Git aliases that have supercharged my own workflow. These are shortcuts that I’ve devised over time, designed to save keystrokes and make your Git experience more seamless. Let’s take a look:
1. The Cleaning Tool
This bash function, f, is intended to automatically delete all local Git branches that have been merged into the default or specified branch, excluding that branch itself. This is useful for repository maintenance and reducing clutter, as it automates the cleanup of outdated branches that are no longer needed after their changes have been integrated. Note that `git default` is not a built-in Git command: the function assumes a separate alias, not shown here, that prints the name of the default branch.
!f() { DEFAULT=$(git default); git branch --merged ${1-$DEFAULT} | grep -v ${1-$DEFAULT}$ | xargs git branch -d; }; f
2. Synchronize with the Master
This alias is the perfect one-click solution to synchronize your feature branch with the master. It reduces a multi-step process to a single command.
git checkout master && git pull && git checkout - && git rebase master
3. Detect Frequently Modified Files
This command provides an overview of how often each file in your Git repository has been modified, helping you understand the evolution and high-change areas of your codebase.
Frequently modified files in a Git repository might indicate areas of the codebase that are complex, unstable, or subject to regular updates. Such areas are often more prone to errors or bugs due to the continuous changes, potentially making them a source of recurring issues in the software. Therefore, the given command can help developers identify these “hotspots” and allocate more time for thorough code review, testing, or refactoring in these areas to improve code quality and stability.
git log --all --find-renames --find-copies --name-only --format=format: | sort | egrep -v '^$' | uniq -c | sort -n | awk 'BEGIN {print "Count\tFile"} {print $1 "\t" $2}'
4. Safeguard Changes
This command offers a safety net when you want to throw away experimental work. It records all your changes, including untracked files, in a 'WIPE SAVEPOINT' commit and then hard-resets past that commit. The working tree is wiped clean, while the savepoint commit stays recoverable through `git reflog`.
git add -A && git commit -qm 'WIPE SAVEPOINT' && git reset HEAD~1 --hard
Me after using these aliases
Adding these aliases to your toolkit is like getting your favorite superhero powers. Use them wisely, and they will make you an even more effective and efficient Git user.
Git is a powerful tool, and like any powerful tool, it takes time to understand and use it to its full potential. The commands we discussed in this post are only the tip of the iceberg when it comes to Git’s capabilities. As you become more comfortable with Git, I encourage you to explore the documentation and discover other advanced commands and options that can further enhance your workflow.
Remember that while it’s important to understand and be able to use these advanced commands, it’s equally important to use them judiciously. Git is a tool designed to aid development, not to create unnecessary complexity. Happy Gitting!
Demystifying Advanced Git Commands for More Effective Version Control was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

PySpark Pitfalls: A Comedy of Errors and How to Dodge Them
Dor Amram — Mon, 10 Jul 2023 15:57:25 GMT
Me waiting for my query to finish
If you’ve ever danced with PySpark, you know it can be like tangoing with a hungry bear. While it can be a powerful partner, if you step on its toes, you’re in for a wild ride. Buckle up, as we traverse the “sparkling” landscape of PySpark mishaps, and learn how not to end up as the comedic relief in your own coding journey.
Misusing Collect — A Memoir of Lost Memory
Picture this: It’s late at night, and you’ve just run your PySpark job. Suddenly, the silence is shattered by the howling of your computer, begging for mercy. You’ve used ‘collect()’ to get all the elements of a DataFrame, only to realize that you’ve just tried to cram a terabyte-sized monster into your laptop’s memory.
Avoid this disaster with actions like ‘take()’, ‘first()’, or ‘count()’. Only use ‘collect()’ when necessary, and when you’re sure your machine can handle it. Here’s an example:
Ignoring Data Partitioning — A Tale of Endless Shuffling
Not partitioning your data in PySpark is like trying to find your favorite book in a library where books are randomly scattered. You’ll end up running around (or in Spark’s case, shuffling data) until you’re out of breath.
Do your Spark job a favor and arrange those ‘books’ with data partitioning:
Now, Spark knows exactly where to find the data it needs, saving you time and computational resources.
Overusing Python UDFs — The Tortoise and the Hare Redux
PySpark allows you to use Python User Defined Functions (UDFs), which feels like home for Pythonistas. But remember the tale of the Tortoise and the Hare? In this version, Python UDFs play the slow-and-steady tortoise. However, unlike the classic fable, the hare (PySpark’s built-in functions) gets the job done faster and doesn’t nap on the job.
Consider using PySpark SQL’s built-in functions over Python UDFs, like so:
Neglecting Broadcast Variables — Sharing is Caring
In PySpark’s world, sharing variables is akin to handing out flyers. By default, PySpark hands out a flyer (copies of a variable) to every worker for each task. If you’re dealing with a hefty variable, that’s a lot of wasted paper (network bandwidth).
Broadcast variables come to the rescue like a Spark superhero, giving each worker one copy of the ‘flyer’, saving on resources:
The Misadventures of Window Functions — A Window with A View
Window functions in PySpark are a fantastic tool, offering insights on data with respect to a specific frame or ‘window’ of data. But just like that tempting open window on a summer day, it can let in a swarm of bugs if not used properly.
Suppose you’re working with a DataFrame of daily sales and you want to calculate a running total. You might decide to use a window function to get the job done. However, if you neglect to specify the window frame, you’ll get unexpected results.
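At first glance, this looks okay. But there’s a catch! The ‘orderBy()’ in the window definition sorts the data, but without a specified frame, Spark falls back to a range-based frame that runs from the start of the partition to the current row. A range frame treats rows that share the same ordering value as peers, so every row in a tie receives the total for the whole tie. Not exactly a “running” total, more like a “stumbling” total.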
To get a proper running total, you need to specify the frame. In this case, the frame is all rows between the start of the DataFrame and the current row:
Now that’s a running total that would make Barry Allen proud!
Conclusion
Remember folks, PySpark is like a wild horse — majestic and powerful, but it’ll buck you off if you’re not careful. Navigate through the PySpark wilderness with caution, respecting its unique quirks and features. When in doubt, remember these comedic tales and their lessons. After all, you wouldn’t want to become the next comic strip in the PySpark universe, would you? Happy Sparking and avoid the pratfalls!
PySpark Pitfalls: A Comedy of Errors and How to Dodge Them was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deploy AWS MWAA (Airflow) environments in scale using Terraform
Eliezer Yaacov — Wed, 29 Mar 2023 08:44:56 GMT
For years now, Airflow has been the standard platform for developing and scheduling batch workflows.
If you have ever tried to manage the infrastructure for Airflow, you probably had to tailor your own solution: create the metastore DB, a queue or key-value store for the scheduler, servers to host the web server, worker and scheduler components, and maybe other components, to end up with a fully functional, production-ready Airflow environment.
While it might sound complex, we used to run Airflow on various infrastructures such as Nomad and Kubernetes, and it was good enough. The actual problem started when we wanted to scale up the creation of Airflow environments. The ability to create an environment in an hour, with all its components, lets us develop, test and deploy changes to production quickly, work on a few parallel data sets in separate environments, and more.
If you are looking to create Airflow environments at scale, quickly, the following solution worked for us.
Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed service that makes it easy to run open-source versions of Apache Airflow on AWS. With AWS MWAA, you can easily build, run, and scale your workflows without having to manage the underlying infrastructure. In this article, we’ll guide you through the steps to create an AWS MWAA environment in Terraform, including an IAM execution role and an S3 bucket.
The entire infrastructure at Similarweb is managed in Terraform code. We created a Terraform module with the following parts:
MWAA environment
S3 bucket
IAM role (execution role)
IAM user (CI user)
The purpose of the S3 bucket is to hold all the code relevant to the Airflow environment (DAGs, requirements, plugins, etc.).
The execution role is required by the MWAA environment to invoke actions on the different services in the environment, such as logs and metrics, access to the bucket, etc.
The purpose of the IAM user is to complete the solution with automatic code deployment. Whether the code is managed in GitHub, GitLab or any other version control system, we want a deployment process that uploads the code to the S3 bucket once it is verified. For that, we use awscli with this user to manage the code sync.
Baseline for Terraform — All code tested on Terraform 0.14.11 and up, AWS provider 4.9 and up.
MWAA Environment —
resource "aws_mwaa_environment" "managed_airflow" {
airflow_version = "2.2.2"
airflow_configuration_options = {
.....
"core.dag_file_processor_timeout" = 150
"core.dagbag_import_timeout" = 90
....
}
dag_s3_path = "dags/"
execution_role_arn = module.execution_role.role_arn
name = "airflow-env-name"
environment_class = "mw1.small"

network_configuration {
security_group_ids = [aws_security_group.managed_airflow_sg.id]
subnet_ids = ["A", "B"] # 2 subnets required for high availability
}

source_bucket_arn = aws_s3_bucket.managed-airflow-bucket.arn
weekly_maintenance_window_start = "SUN:19:00"

logging_configuration {
dag_processing_logs {
enabled = true
log_level = "WARNING"
}

scheduler_logs {
enabled = true
log_level = "WARNING"
}

task_logs {
enabled = true
log_level = "WARNING"
}

webserver_logs {
enabled = true
log_level = "WARNING"
}

worker_logs {
enabled = true
log_level = "WARNING"
}
}

tags = {
name = "airflow-env-name"
.....
}

lifecycle {
ignore_changes = [
requirements_s3_object_version,
plugins_s3_object_version,
]
}

}

##### Security group
resource "aws_security_group" "managed_airflow_sg" {
name = "managed_airflow-sg"
vpc_id =

tags = {
Name = "managed-airflow-sg"
}
}

##### Security group rules
resource "aws_security_group_rule" "allow_all_out_traffic_managed_airflow" {
type = "egress"
from_port = 0
to_port = 0
protocol = -1
cidr_blocks = ["0.0.0.0/0"]
security_group_id = aws_security_group.managed_airflow_sg.id
}

resource "aws_security_group_rule" "allow_inbound_internal_traffic" {
type = "ingress"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = data.terraform_remote_state.network_remote_state.outputs.internal_subnet_cidrs
security_group_id = aws_security_group.managed_airflow_sg.id
}

resource "aws_security_group_rule" "self_reference_sgr" {
type = "ingress"
from_port = 0
to_port = 65535
protocol = "tcp"
self = true
security_group_id = aws_security_group.managed_airflow_sg.id
}
S3 Bucket —
resource "aws_s3_bucket" "managed-airflow-bucket" {
bucket = "airflow-bucket-sw"
force_destroy = false

tags = {
Name = "airflow-bucket-sw"
.....
}
}

resource "aws_s3_bucket_versioning" "managed-airflow-bucket-versioning" {
bucket = aws_s3_bucket.managed-airflow-bucket.id
versioning_configuration {
status = "Enabled"
}
}
The bucket must have versioning enabled, as required by the MWAA environment. File versions help manage changes to requirements, plugins and more in production environments.
IAM execution role and policy —
data "aws_iam_policy_document" "execution_role_policy" {
version = "2012-10-17"
statement {
effect = "Allow"
actions = [
"airflow:PublishMetrics"
]
resources = [
"arn:aws:airflow:${var.region}:${var.account_id}:environment/${var.name}*"
]
}
statement {
effect = "Deny"
actions = ["s3:ListAllMyBuckets"]
resources = [
"arn:aws:s3:::${var.bucket_name}",
"arn:aws:s3:::${var.bucket_name}/*"
]
}
statement {
effect = "Allow"
actions = [
"s3:GetObject*",
"s3:GetBucket*",
"s3:List*"
]
resources = [
"arn:aws:s3:::${var.bucket_name}",
"arn:aws:s3:::${var.bucket_name}/*"
]
}
statement {
effect = "Allow"
actions = [
"s3:GetAccountPublicAccessBlock"
]
resources = ["*"]
}
statement {
effect = "Allow"
actions = [
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:PutLogEvents",
"logs:GetLogEvents",
"logs:GetLogRecord",
"logs:GetLogGroupFields",
"logs:GetQueryResults"
]
resources = [
"arn:aws:logs:${var.region}:${var.account_id}:log-group:airflow-${var.name}-*"
]
}
statement {
effect = "Allow"
actions = [
"logs:DescribeLogGroups"
]
resources = [
"*"
]
}
statement {

effect = "Allow"
actions = [
"cloudwatch:PutMetricData"
]
resources = [
"*"
]
}
statement {
effect = "Allow"
actions = [
"sqs:ChangeMessageVisibility",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl",
"sqs:ReceiveMessage",
"sqs:SendMessage"
]
resources = [
"arn:aws:sqs:${var.region}:*:airflow-celery-*"
]
}
statement {
effect = "Allow"
actions = [
"kms:Decrypt",
"kms:DescribeKey",
"kms:GenerateDataKey*",
"kms:Encrypt"
]
resources = var.kms_key_arn != null ? [
var.kms_key_arn
] : []
not_resources = var.kms_key_arn == null ? [
"arn:aws:kms:*:${var.account_id}:key/*"
] : []
condition {
test = "StringLike"
values = var.kms_key_arn != null ? [
"sqs.${var.region}.amazonaws.com",
"s3.${var.region}.amazonaws.com"
] : [
"sqs.${var.region}.amazonaws.com"
]
variable = "kms:ViaService"
}
}
}

resource "aws_iam_role" "role" {
name = "airflow-execution-role"
path = "/"
tags = local.tag_list
}

resource "aws_iam_role_policy" "role-policy" {
name = "airflow-execution-role-policy"
role = aws_iam_role.role.id
policy = data.aws_iam_policy_document.execution_role_policy.json
}
The execution role has access to KMS for a given key. We can of course extend this policy if we need Airflow to access other services, but this is the baseline recommended by AWS.
IAM User (CI user) —
data "aws_iam_policy_document" "ci_user_policy" {
statement {
effect = "Allow"
actions = [
"s3:GetObject*",
"s3:GetBucket*",
"s3:List*",
"s3:PutObject",
"s3:DeleteObject",
"s3:GetEncryptionConfiguration",
]
resources = [
"arn:aws:s3:::${var.bucket_name}",
"arn:aws:s3:::${var.bucket_name}/*"
]
}
}
resource "aws_iam_user" "app_user" {
name = "appusr_airflow_ci"
path = "/"
tags = local.tag_list

}

resource "aws_iam_access_key" "access_key" {
user = aws_iam_user.app_user.name
}

resource "aws_iam_user_policy" "user_policy" {
name = "airflow_ci_policy"
user = aws_iam_user.app_user.name
policy = data.aws_iam_policy_document.ci_user_policy.json
}

output "ci_access_key" {
value = aws_iam_access_key.access_key.id
}

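output "ci_secret_key" {
value = aws_iam_access_key.access_key.secret
sensitive = true # hides the secret from CLI output; newer Terraform versions require this for sensitive values
}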
The user has permissions to sync the source code containing the DAGs, requirements and plugins to the S3 bucket. The user needs permission to add, update and delete files as necessary, and its actions are limited to that bucket only. Warning: applying this code as written writes the secret key to the Terraform state and exposes it as an output.
Eventually, the CI user is used to sync the code to the S3 bucket, as follows:
AWS_ACCESS_KEY_ID= \
AWS_SECRET_ACCESS_KEY= \
aws s3 sync ./ s3://$AWS_S3_BUCKET/ \
--delete \
--exclude '.git/*' \
--exclude \
Just to emphasize how simple it is to create an Airflow environment with the solution suggested here, the following is a call to a Terraform module that contains all of the above parts:
locals {
airflow_configuration_options = {
....
"core.dag_file_processor_timeout" = 150
"core.dagbag_import_timeout" = 90
....
}
}

module "managed_airflow_web_platform" {
source = "terraform-registry-url/airflow-managed/aws"
version = "~> 2.0"

subnet_cidrs = ["10.10.X.Y/27", "10.10.X.Z/27"]
dag_s3_path = "dags/"
name = "sw-airflow"
bucket_name = "sw-airflow"
cicd_user = "sw-airflow-ci-user"
requirements_s3_path = "requirements.txt"
plugins_s3_path = "plugins.zip"
environment_class = "mw1.large"
airflow_configuration_options = local.airflow_configuration_options
dag_processing_logs_level = "INFO"
}
All we have to do now is run terraform apply on this example, connect a git repository that manages the DAGs and Airflow code to it, and we have a full Airflow environment set up in no time. :)
Deploy AWS MWAA (Airflow) environments in scale using Terraform was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we healed our AWS MWAA (aka airflow) env
Lior Mor — Wed, 08 Feb 2023 08:40:39 GMT
tl;dr
In order to save your Airflow scheduler’s CPU:
1. Use imports only where you need them. Separate code into files to cut redundant dependencies and CPU consumption.
2. Remove network and DB calls from DAG processing.
3. Use the scheduler environment variables to tune the scheduler’s work.
4. Use an .airflowignore file.
Apache Airflow is a great product for arranging, configuring and orchestrating our data pipelines. In a data company such as Similarweb, it is essential to maintain a single system that lets us update and monitor our ETLs, and Airflow gives us such a solution.
In recent months we moved from an on-premise cluster to MWAA, a managed version of Airflow running in AWS, which simplifies usage by letting AWS, instead of the developers, handle things such as monitoring and reporting, auto scaling and other integrations.
A key part of the Airflow architecture is the scheduler, a microservice that handles DAGs (a DAG is a directed acyclic graph that represents the execution of tasks and their dependencies on each other) and task execution, with respect to their dependencies, i.e. the time schedule and the execution of other tasks.
Over time, as we added more and more DAGs and expanded the infrastructure for DAG processing, we found that CPU consumption kept climbing.
The problem started back when we used the on-prem environment. When moving to MWAA, we hoped that the managed environment would solve it through its auto-optimization. However, when we started migrating and adding the DAGs to MWAA, we found that the problem was still there and the scheduler’s CPU was constantly at 100%. As long as most of the DAGs were not running we barely felt it, but when lots of DAGs were triggered the scheduler just couldn’t trigger them fast enough and many tasks failed to start, which caused very high latency and even throttled actions. DAGs could not render.
The first thing we did, together with MWAA support, was to better understand the accessible environment variables that control the scheduler’s actions and load. Controlling those variables in MWAA is simple and can be done through the page of your MWAA environment in the AWS console. For those who run Airflow on-premise, it can be done in the airflow.cfg file. These are the main values we changed:
1. scheduler.min_file_process_interval: the number of seconds after which a DAG file is parsed again, i.e. how often the scheduler re-processes each DAG file. As you can guess, a lower number means higher CPU. We increased it from 30 seconds to 300.
2. core.min_serialized_dag_update_interval: the minimum interval in seconds after which a serialized DAG is updated in the Airflow database. Here we also increased it from 30 to 300 seconds.
3. core.sql_alchemy_pool_size: the maximum number of connections in the database pool. We raised it from 5 to 25 to make up for the longer scheduler intervals and to put more load on the network than on the CPU.
4. scheduler.scheduler_idle_sleep_time: since the scheduler sleeps only 1 second between loops by default, we raised it to 5.
For more info on how to control the Airflow core services, go here.
The second thing we did was to add an .airflowignore file to the DAGs S3 bucket.
With MWAA you define the S3 bucket where your DAGs live, and MWAA promises to update the environment with any change without the need to tear down and redeploy the environment over and over. The scheduler constantly scans this bucket to process DAGs and/or update them. Since we hold other files in this bucket that are not DAG files, it improves performance to let the scheduler know which files it should process.
Before explaining the next step, let’s understand a key principle in Airflow: the difference between DAG processing and DAG execution. Airflow’s scheduler endlessly scans the DAG files, creates DAGs, updates them and checks whether a DAG can start running.
This is DAG processing, and it is done inside the scheduler.
When a DAG can be triggered, it starts a DAG run, and the scheduler triggers its execution on one of the available workers.
Hence, the next step was removing ALL network calls from the DAG processing stage. Since Airflow scans DAGs endlessly, any long workload might cause a dramatic performance change for all DAGs. We moved some calls to the scheduler DB and to other external resources into DAG execution, where the DAG really runs.
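The difference can be sketched in plain Python, without importing Airflow. Here `fetch_table_names` is a hypothetical stand-in for a network or database call, and `export_tables` stands in for a task callable; neither name comes from our code base:

```python
# Plain-Python sketch: module-level code models DAG processing,
# calling export_tables() models DAG execution.
call_log = []


def fetch_table_names():
    """Hypothetical stand-in for a network or database call."""
    call_log.append("fetch")
    return ["visits", "keywords"]


# Anti-pattern (left commented out): a module-level call like this would run
# every time the scheduler parses the file.
# TABLES = fetch_table_names()


def export_tables():
    """Task callable: the expensive call happens only when the task executes."""
    tables = fetch_table_names()
    return [f"exported {name}" for name in tables]


calls_after_parse = len(call_log)  # parsing the file triggered no external call
result = export_tables()           # the call happens at execution time
calls_after_run = len(call_log)
```

In a real DAG file, a function like `export_tables` would be handed to an operator, for example as the `python_callable` of a `PythonOperator`, so the external call runs on a worker rather than in the scheduler.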
Last, but very much not least (actually this was the most important change), we declared our imports in the Python files in a very economical way. This means:
1. Separating long files into smaller modules, where each file has its own imports.
2. When possible, importing inside the function that uses the import and not in the header of the file.
This Datadog graph shows the change after the imports refactor.
Since Python is interpreted, importing a file makes Python run that file and all its dependencies recursively. This, along with the non-stop scheduler work, means the load of the Python files has to be kept very strict.
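A minimal sketch of a function-level import. The standard-library `statistics` module is used here only as a stand-in for a heavy dependency, and `summarize` is a hypothetical task function:

```python
def summarize(values):
    """Hypothetical task function that needs a dependency only at run time."""
    # Imported inside the function instead of at the top of the file: the module
    # is loaded when the function first runs, not when the file is parsed.
    import statistics

    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


summary = summarize([1, 2, 3, 10])
```

Parsing a file that contains only this definition costs the scheduler almost nothing; the import cost is paid once, by the worker that actually runs the function.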
When investigating a bit more, we found that some open source packages were written with a call to the Airflow DB in the constructor, which is a bad practice, so we had to use them carefully, in a lazier way.
More than that, the Airflow documentation does not say loudly enough how important an economical import strategy is. I would say it is the first principle you need to adopt when developing on Airflow. We found it out the hard way, but I believe we could have been spared that at an earlier stage :)
How we healed our AWS MWAA (aka airflow) env was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How We Cut Our Databricks Costs by 50%
Ran Sasportas — Mon, 09 Jan 2023 07:01:00 GMT
One of the main drivers of R&D cost is the use of cloud resources, particularly when it comes to big data processing tools like Databricks.
2.5 years ago we decided to use Databricks clusters as the compute for our Batch API (if you are interested in what Databricks is and how we use it, you can check out this blog post). Since then it has been our primary tool for generating Similarweb data reports for our clients, and today we generate 70K reports a month.
In this post, I’ll share how we were able to reduce our monthly Databricks costs from $25,000 to just $12,500 by making a few key changes to our setup.
Below is a graph describing our Databricks and AWS costs over 5 months.
Databricks + AWS operational costs
It’s worth mentioning that during those months the demand for our service has increased and we have served more data to more clients.
💰 Analyzing Costs
Before making any changes to our Databricks setup, we first needed to understand where our costs were coming from. The total is determined by both Databricks and AWS costs. Let's take a look at how these two bills are calculated:
AWS Monthly Cost = (Number of Worker Nodes * Cost per Worker Node Hour * Active Hours) + (Number of Driver Nodes * Cost per Driver Node Hour * Active Hours) + (Storage Cost) + (Data Transfer Cost)
Databricks Monthly Cost = (Number of Nodes * DBUs per Node per Hour * Active Hours * Price per DBU)
* A Databricks Unit (DBU) is a normalized unit of processing power on the Databricks Lakehouse Platform used for measurement and pricing purposes. DBU pricing varies and can be found on the official website.
So as you can see, both monthly costs are derived mainly from the number of nodes and their active hours, which means that if we optimize the AWS compute costs, the Databricks costs will follow.
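The two formulas can be written as a small calculator. This is only a sketch: active time is expressed in hours so that it matches the hourly rates, the function and parameter names are made up for the example, and the numbers are invented rather than our real rates or usage:

```python
def aws_monthly_cost(workers, worker_hourly_rate, drivers, driver_hourly_rate,
                     active_hours, storage_cost=0.0, data_transfer_cost=0.0):
    """AWS side: node-hours for workers and drivers, plus storage and data transfer."""
    compute = (workers * worker_hourly_rate + drivers * driver_hourly_rate) * active_hours
    return compute + storage_cost + data_transfer_cost


def databricks_monthly_cost(nodes, dbus_per_node_hour, active_hours, price_per_dbu):
    """Databricks side: DBUs consumed by all nodes, multiplied by the DBU price."""
    return nodes * dbus_per_node_hour * active_hours * price_per_dbu


# Invented numbers, for illustration only.
aws_total = aws_monthly_cost(workers=100, worker_hourly_rate=0.25,
                             drivers=1, driver_hourly_rate=1.0,
                             active_hours=200, storage_cost=300.0)
dbx_baseline = databricks_monthly_cost(nodes=100, dbus_per_node_hour=1.0,
                                       active_hours=200, price_per_dbu=0.5)
dbx_half_nodes = databricks_monthly_cost(nodes=50, dbus_per_node_hour=1.0,
                                         active_hours=200, price_per_dbu=0.5)
```

Because the number of nodes and the active hours are plain multipliers in both formulas, halving the node count in this toy example halves the Databricks bill as well.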
The first step of this optimization is analyzing the cost breakdown. We used the AWS Cost Explorer to get visibility into our AWS bill.
Our costs in the US EAST 1 region (AWS Costs explorer)
In order to simplify the cost breakdown chart, I filtered it to show data for only 1 of the 2 regions we operate in.
Here is a little bit of translation:
BoxUsage — On-demand instances
SpotUsage — Spot instances
c5d.2xlarge — the workers' node type we use
i3.4xlarge — driver’s node type we use
EBS: VolumeUsage — EC2 Storage
The On-demand Fiasco
One thing that immediately stood out was the price of the on-demand instances. The nodes in our cluster are configured to prefer spot instances and fall back to on-demand. This was the most prominent chunk of our bill, and it was clear that we needed to optimize it if we wanted to bring our costs down.
Spot instances are a cost-effective way to use spare capacity in the cloud, but they can be unpredictable. If the demand for spot instances exceeds the available supply, they can be terminated and replaced with on-demand instances. This can result in higher costs if it happens frequently.
We dealt with it with two courses of action:
Reduce the number of needed nodes, fewer nodes = fewer fallbacks.
Optimize the availability of spot instances.
🤖 1. Re-configuring the aggressive Auto-scaler
An optimizer's best friend is his monitoring tool. In Databricks’ case, it's the Ganglia UI.
Ganglia UI’s cluster’s CPU and Memory Monitoring graphs.
As you can see in the graphs above, once the cluster gets work, the auto-scaler kicks in and up-scales the cluster as it sees fit.
Auto-scaling is a great feature that allows you to automatically add or remove worker nodes as needed to meet the demands of your workload. However, it can also be a major contributor to costs if you’re not careful.
When analyzing our cluster via the Ganglia UI, we saw that the auto-scaler had been upscaling aggressively: most of the time after an upscale, our cluster’s CPU and memory were oversized and underutilized.
This cluster usually deals with infrequent short jobs (1–2 minutes average execution time). This means that by the time a scale-up finished, the job had often already completed, and the new nodes were never used.
Based on this analysis, we decided to reduce the maximum number of workers that Databricks is allowed to scale up to from 500 to 250, which made a big difference. Here is why:
Reducing the number of active nodes — Fewer nodes = less money.
Reducing the number of fallbacks to on-demand — fewer spot instances = fewer on-demand fallbacks.
🔋 2. Utilizing the power of Multi-AZ
Another course of action was using the Multi-AZ feature in Databricks. This allowed us to automatically switch to the availability zone with the most available spot instances, which helped us reduce costly fallbacks to on-demand.
📈 Results
Costs Breakdown for December 2022 in US EAST 1 region
As you can see, those 2 actions helped us reduce the cost of BoxUsage: c5d nodes by 80% (from $2,800 to $500 in this region), while the SpotUsage: c5d cost did not change significantly.
Bonus Round
☠️ ️Terminating the Driver
Those of you with keen eyes might have noticed that we also managed to bring driver costs (BoxUsage:i3.4xlarge) down significantly, by around 50%.
This was due to a decision to terminate the driver after 30 minutes of inactivity (also a great Databricks feature).
This decision has an undeniable upside: when the cluster is not being used, it is terminated and we stop paying for it. The tradeoff is that cluster initialization usually takes 3–5 minutes in our case, which means that our response time will be impacted from time to time.
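The three cluster-level changes described so far (a lower autoscale ceiling, automatic availability-zone selection, and auto-termination) can be sketched as a partial cluster spec in the shape of the Databricks Clusters API. This is an illustration rather than our actual configuration: the worker instance size and `min_workers` are assumptions, and it assumes the Multi-AZ behaviour corresponds to setting `zone_id` to `"auto"`.

```python
import json

# Partial cluster spec in the shape of the Databricks Clusters API.
# Values marked "assumption" are illustrative and not taken from the post.
cluster_spec = {
    "driver_node_type_id": "i3.4xlarge",  # driver type named in the cost breakdown
    "node_type_id": "c5d.2xlarge",        # assumption: the post only says "c5d"
    "autoscale": {
        "min_workers": 2,                 # assumption: not stated in the post
        "max_workers": 250,               # lowered from 500
    },
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # spot first, fall back to on-demand
        "zone_id": "auto",                     # let Databricks choose the zone
    },
    "autotermination_minutes": 30,        # terminate after 30 idle minutes
}

print(json.dumps(cluster_spec, indent=2))
```

A real request would also need fields such as the runtime version, which are left out here on purpose.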
🔀 The EBS switch·er·oo
We recognized one more opportunity for optimization.
EBS volumes are used to store data that is persisted beyond the life of an EC2 (Elastic Compute Cloud) instance, and they can be a significant contributor to costs if not adequately managed. Databricks provisions EBS volumes for every worker node to support operations such as shuffles.
Initially, we were using gp2 volumes for our EBS storage. After some research, it was clear that switching to gp3 volumes, which offer higher performance and more cost-effective pricing, was the right decision, and it’s just a click of a button away.
On the Databricks Admin Console Page
This little checkbox reduced the EBS costs by almost 50%.
Bottom lines
We managed to drive down EC2 costs and usage on AWS, which directly reduced our Databricks costs and resulted in a monthly saving of 12,500$. Yearly, that’s 150,000$. Significant, right?
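For anyone who wants to reproduce the arithmetic with their own numbers, here is a tiny sketch. The helper names are made up for this example; the inputs are the figures quoted in this post.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


def yearly_saving(monthly_saving: float) -> float:
    """Project a monthly saving over 12 months."""
    return monthly_saving * 12


# c5d on-demand cost in this region, before and after the changes.
print(f"c5d on-demand reduction: {percent_reduction(2800, 500):.0f}%")  # 82%
# Overall monthly saving projected over a year.
print(f"Yearly saving: {yearly_saving(12_500):,.0f}$")  # 150,000$
```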
What should you take from here?
Analysis: an optimizer’s best friend is his monitoring tool. Master it.
Initiative: as a Software Engineer, it is within your power to impact the price of the company’s software. Do it.
Patience: cost optimization takes time for research, experiments, waiting for results, and then doing it all over again. As you’ve seen in my case, it took 5 months of cycling, so be patient!
Click on the damn gp3 checkbox.
I would like to thank the brilliant Oded Fried for working with me on this.
How We Cut Our Databricks Costs by 50% was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Taming the Legacy Beast — A Refactoring Algorithm In 5 Steps
Dor Amram — Wed, 14 Dec 2022 13:06:50 GMT
Taming the Legacy Beast In 5 Steps — A Refactoring Algorithm
You’ve been there. At some point in your career you were probably tasked with changing something in a feature when you could not understand what exactly it was doing.
The documentation was pretty basic, raising even more questions.
After reading the code itself, you still weren’t sure.
When you got to the tests, they were naive and had insufficient coverage.
As you debugged, you didn’t understand the nature of the side effects you were causing.
In a state of frustration, you turned to *git blame*, hoping to find someone who could shed some light on this code, only to realize that the author is YOU.
A while ago I was tasked with working on one of our team’s core projects. We wanted to add support for a new use case, but the entanglement of the code made it a very difficult task.
The codebase I’m working on started as a clean project. Thorough research was conducted, the design was simple, concerns were separated and things were looking quite good.
Unfortunately, as often happens, somewhere along the line things changed. Feature requests started to pile up and “minor” compromises were made.
The implementation details became part of the business logic, and the separation between the abstraction layers became a bit fuzzy. Not to mention the code itself, which now had very specific business logic conditions in very unexpected places.
Clearly some refactoring had to be done, but this was a bit like opening Pandora’s box…
How do you change one of the team’s main projects while it’s still running in production? How do you modify it without affecting existing performance? How do you approach code when you’re not familiar with all its bits and bytes?
In this post I will do my best to answer these questions using the Legacy Code Algorithm.
The Process
As mentioned, for this task I used the Legacy Code Algorithm, described in Michael Feathers’ book ‘Working Effectively with Legacy Code’. This algorithm provides a few simple steps you can take to handle legacy code as smoothly and cleanly as possible.
But before jumping into it, let’s keep in mind what it is that we’re aiming for.
The main goal of refactoring is to make adding or altering features easier. It’s kind of hard to do so without understanding how exactly that domain behaves. In order to do that, we’ll simply need to see how it handles itself in different scenarios, or, to put it simply, have a bunch of tests around it.
The Legacy Code Algorithm
Back to the algorithm you can follow in order to achieve that goal:
1. Identify Change Points
The first thing you need to do is understand exactly what your new feature requires and how it interacts with the existing code base.
2. Find Test Points
Once you understand what parts of the code you need to alter, you’ll want to add tests around those parts. You need to do this in order to make sure your changes are only doing what they’re meant to do.
You’ll want to do so in the smallest granularity possible. This will help you understand the existing flows of your code and where the road to writing those tests is easy. One of the techniques we use, when trying to prioritize testing areas in the code, is thinking about it as a network.
Say that you’re interested in understanding how the data flows in your code.
Imagine each function could point to where it’s getting its data from. Now imagine giving each function a score based on how many other functions point to it.
The functions with the highest scores would be considered good candidates for data sources.
Luckily, Google’s PageRank algorithm can provide us with this exact knowledge. Without going too deep into its implementation, let’s look at the following example:
Here we can see that we have a function that our data is coming from (`query_db`) and the data flows are affected by it. In this small example, it’s quite easy to figure it out, but in real life this might not be the case.
Today, most languages have tracing mechanisms built into them.
In this case I’m using Python’s `trace` package. When running it on the following code and formalizing it into a graph (code snippet available here) we get the following image:
By looking at the graph attributes, we see that `query_db` received a high PageRank score, meaning we’ll want to start testing the types of data there.
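To make the idea concrete, here is a minimal, self-contained sketch of the approach. It is not the snippet linked above: it records calls with `sys.setprofile` from the standard library instead of the `trace` package, and it uses a hand-rolled power-iteration PageRank instead of a graph library. Every function name except `query_db` is made up for the example.

```python
import sys


# Hypothetical application code: every data path ends in query_db.
def query_db(table):
    return [{"table": table, "id": 1}]


def get_users():
    return query_db("users")


def get_orders():
    return query_db("orders")


def build_report():
    return {"users": get_users(), "orders": get_orders()}


def collect_call_edges(entry_point, names):
    """Run entry_point and record caller -> callee edges between named functions."""
    edges = set()

    def profiler(frame, event, arg):
        # Only Python-level calls that have a calling frame are of interest.
        if event != "call" or frame.f_back is None:
            return
        caller = frame.f_back.f_code.co_name
        callee = frame.f_code.co_name
        if caller in names and callee in names:
            edges.add((caller, callee))

    sys.setprofile(profiler)
    try:
        entry_point()
    finally:
        sys.setprofile(None)
    return edges


def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank; nodes without out-links spread their score evenly."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    scores = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        new_scores = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_scores[dst] += damping * scores[src] / len(targets)
        scores = new_scores
    return scores


names = {"build_report", "get_users", "get_orders", "query_db"}
edges = collect_call_edges(build_report, names)
scores = pagerank(edges, sorted(names))
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {score:.3f}")
```

Each edge points from a caller to the function it gets its data from, so the function that the most data paths lead to (here `query_db`) ends up with the highest score.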
3. Breaking (Bad) Dependencies
After identifying all the areas in your code that you wish to test, you may discover that testing them is not as simple as you imagined. Some of the functions might be way too long and/or perform multiple operations, and as a result they’re too difficult to simulate.
You’ll first need to divide those functions into smaller chunks of code, based on the distinct steps each one performs. There’s a lot to be said on what the guidelines are for this kind of modification, but that’s beyond the scope of this post.
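As a rough illustration of what such a split can look like, here is a made-up example (the order-handling functions are hypothetical and not taken from our codebase). The first function parses, calculates and formats in one go; the refactored version keeps the same behaviour but exposes each step separately.

```python
# Before: one function parses, calculates and formats, so no step can be tested alone.
def order_summary(raw_line):
    name, quantity, unit_price = raw_line.split(",")
    total = int(quantity) * float(unit_price)
    if total > 100:
        total *= 0.9
    return f"{name.strip()}: {total:.2f}"


# After: the same behaviour, split into steps that can each be tested in isolation.
def parse_order(raw_line):
    name, quantity, unit_price = raw_line.split(",")
    return name.strip(), int(quantity), float(unit_price)


def order_total(quantity, unit_price):
    total = quantity * unit_price
    return total * 0.9 if total > 100 else total


def format_summary(name, total):
    return f"{name}: {total:.2f}"


def order_summary_refactored(raw_line):
    name, quantity, unit_price = parse_order(raw_line)
    return format_summary(name, order_total(quantity, unit_price))


print(order_summary("widget, 3, 50.0"))
print(order_summary_refactored("widget, 3, 50.0"))
```

Keeping the old function around while you compare outputs is one cheap way to check that the split did not change behaviour.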
4. Tests, Tests and Some More Tests
Once you’ve separated your functions into smaller parts, you can start writing your tests. You will find that writing an isolated test has become an easier task.
You still have your integration tests that check the entire flow end to end (E2E), but now you can introduce the relevant unit tests that cover all possible use cases.
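Here is a small sketch of what such isolated tests might look like, using Python’s built-in `unittest`. The `order_total` function and its discount rule are hypothetical; the point is that a small, single-purpose function lets you pin down existing behaviour, including edge cases, before you change anything.

```python
import unittest


def order_total(quantity, unit_price):
    """Hypothetical unit under test: 10% discount on totals above 100."""
    total = quantity * unit_price
    return total * 0.9 if total > 100 else total


class OrderTotalTests(unittest.TestCase):
    def test_small_order_has_no_discount(self):
        self.assertEqual(order_total(2, 10.0), 20.0)

    def test_boundary_total_is_not_discounted(self):
        # Pins down the current behaviour at exactly 100 before anyone changes it.
        self.assertEqual(order_total(4, 25.0), 100.0)

    def test_large_order_gets_discount(self):
        self.assertAlmostEqual(order_total(3, 50.0), 135.0)


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderTotalTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` runs the three tests; in a real project you would more likely let `python -m unittest` discover them.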
5. Make Changes and Refactor
Once you’ve done all of that, you’re finally ready to start the work you wanted to do all along.
Your code looks a bit different from its initial state. It is separated in a better way, less coupled, and perhaps even more readable. Not to mention that you now have your tests to alert you if anything unexpected happens.
The Insights
Once you start refactoring and making your changes, you might notice that modifications to the code don’t feel as risky as they felt in the beginning.
But you should keep the following in mind:
This process should be done in baby steps. You’ll have multiple iterations in which you’ll change and test a bit of code each time, but it’s a necessary phase
The “healthier” your codebase is — the quicker this process will be
Some of these steps may take more or less time on different projects
Refactoring production code is a complex task. Although this scaffolding process may add some additional work to an already long process, sticking with these principles will significantly increase your chances of doing it right.
What’s next? Expecting the Unexpected
So you’ve reached a point where you can now add your new desired functionality, but where do you go from here? How do you make sure that the same process that brought you here won’t repeat itself? The urgency of delivery is a feature, not a bug, and as such it will follow developers through every step of the development process.
Our code must be written in a way that enables adding changes to it at a minimal cost, without having to rewrite parts that are not relevant to the change. We must maintain a clear separation of concerns, so that when changes come, and they will, they will be isolated to the specific domain they are related to.
Once you’ve located the players on the field, and the types of interactions they have, formalizing it into an API becomes an easy task. Fortunately, we have tools to help us face these exact types of challenges.
I found that, when tackling these kinds of challenges, sticking with Domain-Driven Design and SOLID principles, among other things, can be extremely useful.
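As a small taste of what that separation can look like, here is a hedged sketch of one SOLID idea, dependency inversion, using `typing.Protocol`. All the names (`OrderSource`, `InMemoryOrderSource`, `lifetime_value`) are invented for this example; the business rule depends only on a narrow interface, so swapping the storage does not touch it.

```python
from typing import Protocol


class OrderSource(Protocol):
    """The only thing the domain logic needs from a data source."""

    def fetch_totals(self, customer_id: str) -> list[float]: ...


class InMemoryOrderSource:
    """Stand-in implementation; a database-backed class would expose the same method."""

    def __init__(self, totals_by_customer: dict[str, list[float]]) -> None:
        self._totals = totals_by_customer

    def fetch_totals(self, customer_id: str) -> list[float]:
        return self._totals.get(customer_id, [])


def lifetime_value(source: OrderSource, customer_id: str) -> float:
    """Business rule that depends on the interface, not on a concrete storage."""
    return sum(source.fetch_totals(customer_id))


source = InMemoryOrderSource({"c-1": [20.0, 35.5]})
print(lifetime_value(source, "c-1"))  # 55.5
```

The same in-memory stand-in doubles as a test fixture, which ties back to the testing steps above.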
In my next post, I’ll elaborate on how we use these methods in practice to minimize development efforts and deliver quick & clean code.
Taming the Legacy Beast — A Refactoring Algorithm In 5 Steps was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Do you feel your Sprint Retrospective could be a lot better?
Liora Korni — Sun, 23 Oct 2022 07:43:41 GMT
If you are part of a scrum team you must agree that sprint retrospective meetings are sometimes challenging to make successful.
I am not a Scrum Master, but I am an experienced engineer who believes that the sprint retrospective is the best way to maintain your team’s continuous improvement. If that’s the case, you really want to ensure you do it right in order to get the best out of it.
Photo by Jeffrey F Lin on Unsplash
The purpose of the sprint retrospective, according to the scrum guide, is an “Opportunity for the Scrum Team to inspect itself and create a plan for improvements to be enacted during the next Sprint.”
In the retrospective, the team members share their honest feedback on what’s going well and what could be improved, discuss doable solutions, and document them as action items.
It’s a consistent cycle of inspection and adaptation that creates a high-performing team. But if you don’t do it right, it can easily get off track and become a blame game, resulting in declining Scrum team spirit (oh no, not again…) and decreasing team performance.
Retrospective meetings can easily become a blame game.
You don’t need to be a Scrum Master to make your retrospective meetings impactful. Every scrum participant has the ability and responsibility to do so.
In this post I’d like to share the valuable lessons I’ve learned that can help you create effective retrospective meetings.
Lesson No 1: Don’t rush
We used to leave 5–10 minutes for a retrospective at the end of each Sprint Planning meeting. “Does anyone have anything to say about the last sprint?” Not surprisingly, no one had…
I’m sure you can agree that this equals having no retro…
My suggestion: set aside at least 60 minutes for the meeting, so that even the most reticent participants have the time and space to share their opinions.
Photo by Kari Shea on Unsplash
Lesson No 2: Make it interesting
The retrospective does have the potential to become boring.
Keep in mind that the retro facilitator doesn’t have to be the Scrum Master. Any of the team members can be a facilitator and lead the retro, bringing their own ideas and structure. Try giving each participant a turn to lead a retro — ownership often leads to increased engagement.
You can also suggest playing games, adapting different exercises and questions. Here is a list of some interesting ideas.
Lesson No 3: Make it safe
We once had line managers participating in our retrospective meetings.
Would you expect an open discussion among the team members in such an environment? Not surprisingly, there was no open discussion, and team members did not feel safe sharing issues.
You’re probably familiar with meetings of endless cycles of blame and finger pointing. This doesn’t facilitate improvement. If your retro feels like a blame game, what can you do about it?
A great approach is going back to scrum values: The team wins together, the team fails together. Together with your team, create a Scrum Value radar (see interesting example) and focus on Commitment, Courage, Focus, Openness, and Respect.
Scrum Value Radar
Lesson No 4: Make it a must
You may be familiar with some of these:
“We’re under a lot of pressure this Sprint, there’s no time for retro.” Well, if there is a lot of pressure and Sprint goals aren’t being met, doesn’t that sound an alarm that a retro is needed here?
“There was nothing special this Sprint. No need for retro.”
“If there will be a need then we will do one.”
Remember that one of the most important values in Scrum is constant improvement, and what is a better engine for that than retro meetings?
Not long ago in our Scrum team, there was great tension building up between the Product Manager, the developers and the graphic designer. No one was speaking, but everybody was angry and frustrated. The retro gave us the opportunity to clear the air, and created better communication and a team that, once again, loved working together.
So, please, don’t skip the retro!
Lesson No 5: Check action items
We used to hold retrospective meetings by the book. Great retros, held regularly at the end of each Sprint. With enough time, with everybody participating and great issues raised in a safe environment with respect for others. We’d also assign valuable action items to the participants.
But with no one checking up on the status of the previous retrospective’s action items, what was the point? Where is the continuous improvement?
Make sure you schedule your retrospective before the Sprint Planning of the following Sprint, so you can add the action items to the upcoming Sprint.
Lesson No 6: Retrospective the retrospective
Is it the same procedure every time, ritualized and boring? Does the team think the retro is a waste of time?
I suggest that from time to time you consider having a meta-retrospective on the retrospective itself in order to make it great again.
Lesson No 7: Make sure everybody is participating
Sometimes the team members are present but aren’t participating, or one or two team members are dominating the retrospective.
According to Google, equally distributed speaking time is a great sign of a high-performing team (see this reference).
Make sure the speaking time is distributed equally. You shouldn’t be surprised that introverts can also provide great feedback to the team, if the meeting is facilitated properly.
Photo by Shane Rounce on Unsplash
Wrapping up
There are a lot more lessons to learn, and plenty of general recommendations.
The ones I’ve shared are, in my experience, the most important ones.
At the end of the day, it’s up to you and your team to identify what works best for you.
It’s definitely worth investing time and effort in making your team retrospective meetings as effective and productive as possible.
You will soon see fruitful results and continuous improvement of every aspect of your team.
Do you have your own lessons you want to share? I’d love to hear about them.
Do you feel your Sprint Retrospective could be a lot better? was originally published in Similarweb Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.