Hi there πŸ‘‹ I’m

Varshitha Gudimalla

Data Engineer | AWS | Python | Databricks | Airflow

Email  |  LinkedIn  |  My Portfolio  |  GitHub

About

Data Engineer with 3 years of experience designing, developing, and optimizing large-scale ETL/ELT pipelines, real-time streaming, and cloud-native lakehouse architectures across AWS, Azure, and Databricks, with a track record of improving performance, scalability, and data quality for enterprise analytics and machine learning workloads.

At FedEx, I build and optimize PySpark-based pipelines processing 1B+ daily records, leveraging Databricks, Airflow, Delta Lake, and AWS Glue to achieve 99.9% SLA compliance, with additional focus on multi-tenant lakehouse design, schema evolution, and CI/CD automation through Jenkins and Git-based workflows. Previously, at Knowledge Solutions and CloudEnd Platform, I engineered robust data ecosystems on Azure Data Factory, Synapse Analytics, and Snowflake, enabling customer churn prediction, forecasting models, and governed data lakes that improved analytics efficiency by 40%+.

I work primarily in Python, SQL, and PySpark, with hands-on expertise in Airflow, Databricks, Glue, ADF, Redshift, and Synapse, and I am adept at implementing data quality frameworks (Great Expectations, Delta constraints), managing governance (Unity Catalog, IAM), and automating CI/CD deployments for scalable, high-reliability data solutions.

Skills

Programming & Scripting: Python, PySpark, SQL, Scala
Big Data & Processing: Apache Spark, Databricks, Delta Lake, Apache Airflow, AWS Glue, Azure Data Factory
Cloud Platforms: AWS (S3, Redshift, Glue, Step Functions, SNS), Azure (ADLS, Synapse Analytics, DevOps, Monitor)
Data Warehousing: Snowflake, Azure Synapse Analytics, Amazon Redshift, SQL Server
ETL/ELT Tools: SSIS, Databricks Workflows, AWS Glue, Azure Data Factory
Data Quality & Governance: Great Expectations, Delta Lake Constraints, Unity Catalog, IAM, Hive Metastore
Orchestration & Automation: Apache Airflow, Jenkins, Azure DevOps, AWS Step Functions
Visualization & Analytics: Power BI, Tableau
Machine Learning: Azure Machine Learning, Prophet, Scikit-learn, Feature Engineering
Version Control & CI/CD: Git, GitHub, Databricks Repos, Jenkins, Azure DevOps Pipelines
Databases: SQL Server, PostgreSQL, MySQL, NoSQL

Education

University at Albany, SUNY β€” Master of Science in Data Science

Albany, NY β€’ Aug 2023 – May 2025

Relevant Coursework: Advanced Statistics, Machine Learning, Big Data Analytics, Data Mining, Business Intelligence, Statistical Computing

CMR Institute of Technology, Hyderabad β€” B.Tech in Computer Science

Hyderabad, India β€’ Aug 2018 – May 2022

Relevant Coursework: Data Structures & Algorithms, DBMS, Software Engineering, OOP, Web Technologies

Professional Experience

Data Engineer β€” FedEx (Remote)

Jan 2025 – Present β€’ USA

  • Developed scalable ETL and ELT pipelines in Databricks using PySpark, Delta Lake, and Python, implementing advanced transformation logic, schema inference, and error-handling frameworks to support end-to-end ingestion from logistics and customer systems into curated data layers.
  • Designed and orchestrated data workflows using Apache Airflow, Databricks Workflows, and AWS Step Functions for dependency management, automated scheduling, and failure recovery, ensuring continuous data availability across environments.
  • Integrated key AWS components including S3, Glue Catalog, Redshift, and SNS to build a modular data lake architecture. Leveraged parameterized Databricks notebooks and reusable Python modules to support multi-tenant project pipelines and environment-driven configurations.
  • Implemented data quality and observability frameworks using Great Expectations, Delta Lake constraints, and custom validation logic embedded within Airflow DAGs and Databricks jobs, ensuring accuracy, schema consistency, and traceability across all ingestion layers.
  • Collaborated with data scientists, analysts, and DevOps engineers to optimize Spark cluster configurations, improve job parallelism, and enforce governance policies using IAM roles, Unity Catalog, and workspace access controls for sensitive PII data.
  • Enhanced performance and scalability of both streaming and batch data pipelines through dynamic partitioning, broadcast joins, and optimized Parquet/Delta file layouts, improving reliability and supporting analytical workloads and ML feature generation (a minimal PySpark sketch follows this list).
  • Integrated CI/CD and version control by connecting Jenkins and Git-based workflows with Databricks Repos and Airflow deployments, enabling continuous delivery, rollback capability, and environment synchronization across development and production setups.
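
To give a concrete flavor of these pipelines, here is a minimal PySpark sketch of the ingestion pattern described above: read raw events, cleanse and deduplicate, enrich via a broadcast join, and append to a date-partitioned Delta table with additive schema evolution. The bucket paths, column names, and `dim_facility` table are hypothetical placeholders, not FedEx systems.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shipments-etl").getOrCreate()

# Hypothetical raw landing zone: JSON events from logistics/customer systems.
raw = spark.read.json("s3://example-bucket/raw/shipments/")

# Basic cleansing: derive a partition column, drop nulls and duplicates.
clean = (
    raw
    .withColumn("event_date", F.to_date("event_ts"))
    .filter(F.col("shipment_id").isNotNull())
    .dropDuplicates(["shipment_id", "event_ts"])
)

# Enrich with a small dimension table; broadcasting avoids a shuffle join.
dim = spark.read.format("delta").load("s3://example-bucket/curated/dim_facility/")
enriched = clean.join(broadcast(dim), on="facility_id", how="left")

# Append to a date-partitioned Delta table; mergeSchema permits additive
# schema evolution when upstream systems introduce new fields.
(
    enriched.write.format("delta")
    .mode("append")
    .partitionBy("event_date")
    .option("mergeSchema", "true")
    .save("s3://example-bucket/curated/shipments/")
)
```

In production, the same null and uniqueness rules would also be enforced at the table level (Delta constraints) and in validation suites (Great Expectations), per the quality framework described above.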

Data Engineer β€” Knowledge Solutions (Hyderabad, India)

Jun 2022 – Jul 2023

  • Developed PySpark-based ETL frameworks in Azure Databricks to process and transform large-scale e-commerce datasets sourced from customer surveys, transactions, and logistics platforms. Designed modular notebooks for feature engineering, data cleansing, and enrichment, enabling advanced analytics use cases like NPS prediction and churn modeling.
  • Implemented end-to-end orchestration and scheduling using Azure Data Factory, integrating on-premise data sources, APIs, and message-based feeds into Azure Data Lake Storage (ADLS) for centralized, governed analytics. Configured dynamic pipelines, linked services, and trigger-based workflows for reliable and automated data movement.
  • Built CI/CD deployment pipelines in Azure DevOps to automate the delivery of Databricks notebooks, SQL scripts, and SSIS packages across development, test, and production environments. Incorporated Git branching strategies, automated testing, and job-cluster provisioning to ensure consistent deployment and environment synchronization.
  • Integrated Azure Data Lake Storage with Synapse Analytics for unified metadata management and structured warehousing. Optimized PySpark workloads with adaptive partitioning, broadcast joins, and caching techniques to improve query responsiveness and end-to-end pipeline throughput.
  • Collaborated with analytics and ML teams to deliver curated feature tables for Azure Machine Learning models used in customer churn and satisfaction prediction. Designed reusable Snowflake queries and Databricks transformations with clustering and caching for faster data access during model training and validation (see the feature-table sketch after this list).
  • Developed Power BI dashboards connected to Azure Synapse and curated data layers to visualize customer satisfaction, NPS, and churn KPIs. Configured Azure Monitor, Log Analytics, and alert rules to provide real-time observability for pipeline performance, ensuring traceability and proactive issue detection across the ecosystem.
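
As a sketch of the curated feature tables mentioned above, the following PySpark snippet aggregates per-customer behavioral features from a hypothetical curated orders table in ADLS; the storage paths and column names are illustrative, not actual Knowledge Solutions schemas.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("churn-features").getOrCreate()

# Hypothetical curated orders table stored as Delta in ADLS.
orders = spark.read.format("delta").load(
    "abfss://curated@examplelake.dfs.core.windows.net/orders"
)

# Per-customer behavioral features for churn/satisfaction models.
features = (
    orders.groupBy("customer_id")
    .agg(
        F.count("*").alias("order_count"),
        F.sum("amount").alias("total_spend"),
        F.max("order_date").alias("last_order_date"),
        F.avg("nps_score").alias("avg_nps"),
    )
    .withColumn(
        "days_since_last_order",
        F.datediff(F.current_date(), F.col("last_order_date")),
    )
)

# Cache while the table is read repeatedly during model training,
# then persist it as a Delta table for Azure Machine Learning to consume.
features.cache()
features.write.format("delta").mode("overwrite").save(
    "abfss://curated@examplelake.dfs.core.windows.net/churn_features"
)
```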

Data Engineer β€” CloudEnd Platform Pvt Ltd (Hyderabad, India)

Jun 2021 – May 2022

  • Designed and implemented scalable ETL pipelines using Databricks, AWS Glue, and PySpark to process and transform healthcare therapy logs and patient survey data into Delta Lake for analytics and compliance reporting. Applied optimized Spark transformations and modular notebook logic aligned with HIPAA data protection standards.
  • Developed ingestion frameworks for structured and semi-structured data sourced from SQL Server, REST APIs, and flat files using SSIS and PySpark, with schema registration and metadata standardization in AWS Glue Catalog and Hive Metastore. Integrated validation rules to manage null, duplicate, and out-of-range values for improved governance and audit readiness.
  • Built and deployed predictive forecasting pipelines using Python and Prophet within Databricks to support patient adherence and treatment outcome modeling. Integrated feature extraction logic with Delta tables to generate ML-ready datasets for downstream analytics and clinical insights (a minimal Prophet sketch follows this list).
  • Automated end-to-end scheduling and monitoring through Apache Airflow DAGs and Jenkins CI/CD pipelines, orchestrating nightly refreshes and automated alerting via AWS SNS and email triggers. Embedded retry policies, SLA monitoring, and lineage tracking for seamless recovery and observability.
  • Developed and optimized SQL Server stored procedures, views, and analytical functions to support patient segmentation, treatment analytics, and operational dashboards in Power BI. Leveraged query tuning and partitioned views to enable responsive clinical data exploration and real-time monitoring.
  • Implemented data governance and security best practices including IAM-based access control, data encryption, Git version control, and automated job execution policies to align with healthcare compliance requirements. Established reusable workflow templates and configuration-driven orchestration to enhance maintainability and lower operational overhead.
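
A minimal sketch of the Prophet forecasting step referenced above, assuming daily adherence counts in a pandas DataFrame; the history values and 30-day horizon are toy placeholders. Prophet requires the history as `ds` (date) and `y` (value) columns.

```python
import pandas as pd
from prophet import Prophet  # older installs expose this package as fbprophet

# Toy daily adherence history with a weekly pattern (placeholder data).
history = pd.DataFrame({
    "ds": pd.date_range("2021-06-01", periods=90, freq="D"),
    "y": [100 + (i % 7) * 5 for i in range(90)],
})

model = Prophet(weekly_seasonality=True)
model.fit(history)

# Forecast 30 days ahead; yhat is the point forecast, with uncertainty bounds.
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```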

Projects

Global Inflation Monitor

Spark, Snowflake, Tableau, ETL

Built end-to-end Spark and Snowflake ETL pipelines and delivered a Tableau dashboard enabling analysts to track real-time inflation and wage data across 190+ countries, reducing reporting delays by 80%.
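
A hedged sketch of the Spark-to-Snowflake load step, using the Snowflake Spark connector; the connection parameters, staging path, and table name below are placeholders rather than the project's real configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inflation-etl").getOrCreate()

# Transformed inflation/wage data produced earlier in the pipeline (placeholder path).
inflation = spark.read.parquet("s3://example-bucket/staged/inflation/")

# Placeholder connection options for the Snowflake Spark connector.
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "ETL_USER",
    "sfPassword": "********",
    "sfDatabase": "ECON",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Write the curated table; the Tableau dashboard reads it from Snowflake.
(
    inflation.write.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "INFLATION_BY_COUNTRY")
    .mode("overwrite")
    .save()
)
```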

Customer Churn Prediction

Python, Machine Learning, Tableau, Predictive Modeling

Developed predictive churn models and Tableau dashboards to surface high-risk telecom customers and behavioral patterns, empowering business analysts to drive targeted retention strategies and policy actions.
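
A minimal scikit-learn sketch of the churn-modeling step, using synthetic stand-in features; the real project's feature set and model selection are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for telecom usage features and a churn label.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # e.g., tenure, monthly charges, usage stats
y = (X[:, 0] + rng.normal(size=1000) < -0.5).astype(int)  # toy churn signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Customers with elevated predicted churn probability form the "high-risk"
# segment surfaced in the dashboards.
print(classification_report(y_test, model.predict(X_test)))
risk_scores = model.predict_proba(X_test)[:, 1]
```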

Connect

πŸ“ USA    πŸ“§ varshithag1908@gmail.com