Challenging project experience
Our front-end application relied on multiple curated datasets that had to be processed daily from raw material and operational data. These datasets were business-approved and formed the backbone of downstream analytics and reporting.

I was responsible for developing and maintaining the back-end batch pipelines that transformed raw data into reliable, production-grade datasets, applying complex business logic around material consumption, part replacements, failure codes, etc. I designed modular pipelines in Foundry using Code Workbook, SQL transforms, and PySpark-based logic in the Code Repository.

The biggest challenges were optimizing batch processing times for large datasets, which came down to two things:
- deciding when and how to repartition data to balance processing speed with resource usage
- implementing comprehensive health checks and schema validation

Since the data was high-volume (several GBs processed daily), I had to optimize batch performance by introducing repartitioning at key transformation steps.
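As a rough illustration of the repartitioning decision, one common heuristic is to size partitions to roughly 128 MB and derive the partition count from the estimated input size before a wide transform. This is a minimal sketch in plain Python; the function name, the 128 MB target, and the example data size are my own assumptions, not details from the project.

```python
def target_partitions(total_bytes: int,
                      target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Return a partition count that keeps partitions near the target size.

    A hypothetical helper: in a PySpark transform the result would feed
    something like df.repartition(n) ahead of a heavy join or aggregation.
    """
    # Ceiling division so the last partition is never oversized.
    return max(1, -(-total_bytes // target_partition_bytes))

# e.g. ~5 GiB of daily data at a 128 MiB target -> 40 partitions
print(target_partitions(5 * 1024**3))
```

Choosing the count up front (rather than relying on whatever partitioning the source dataset happens to have) is what lets a batch balance parallelism against per-partition overhead.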
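The schema-validation side of the health checks can be sketched as a fail-fast gate that runs before a batch is published. The column names and types below are hypothetical placeholders, and the check is written in plain Python rather than against the actual pipeline's API.

```python
# Hypothetical expected schema for a curated dataset (illustrative names).
EXPECTED_SCHEMA = {"material_id": str, "quantity": float, "failure_code": str}

def validate_schema(rows, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable errors; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, typ in expected.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} expected {typ.__name__}, "
                              f"got {type(row[col]).__name__}")
    return errors

good = [{"material_id": "M1", "quantity": 2.0, "failure_code": "F9"}]
bad = [{"material_id": "M1", "quantity": "two"}]
print(validate_schema(good))  # no errors
print(validate_schema(bad))   # missing column reported
```

Running a check like this before publishing means a malformed upstream drop fails loudly in the pipeline instead of silently corrupting downstream analytics.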