Challenging Project Experience
Our front-end application relied on multiple curated datasets that needed to be processed daily from raw material and operational data. The datasets were business-approved and formed the backbone of downstream analytics and reporting.
I was responsible for developing and maintaining the back-end batch pipelines that transformed raw data into reliable, production-grade datasets — applying complex business logic around material consumption, part replacements, failure codes, etc.
Designed modular pipelines in Foundry using Code Workbook, SQL transforms, and PySpark-based logic in the Code Repository.
The biggest challenge was optimizing batch processing times for large datasets, which meant deciding:
- When and how to repartition data to balance processing speed with resource usage
- How to implement comprehensive health checks and schema validation
Since the data was high-volume (several GBs processed daily), I had to optimize the batch performance by introducing repartitioning at key transformation steps to reduce shuffling and improve Spark performance.
We were able to reduce compute-hour usage by ~14 percent.
Implemented schema validation, data health checks, and null handling logic to ensure consistency across evolving source systems.
Maintained clean separation between raw, staging, and final datasets, making sure every business-validated dataset could be traced.
Worked closely with the product owner and business users to update logic as the business evolved, including controlled deployments and lineage tracking.
✅ 1. Validation at Ingestion Stage
Goal: Catch issues early before data flows downstream
Strategies:
- ✅ Schema validation: Used type safety and schema checks in Code Workbook or PySpark scripts to ensure all required fields were present.
- ✅ Null checks & mandatory fields: Ensured critical business fields (e.g., material ID, maintenance date) were non-null.
- ✅ Drop corrupt or malformed records: Flagged records that didn't meet quality thresholds and logged them separately for review.
✅ 2. Transformation-Phase Quality Checks
Goal: Ensure logic preserves integrity and business correctness
Strategies:
- ✅ Reference checks: Validated foreign key relationships (e.g., every part ID maps to a known part in the master table).
- ✅ Duplicate handling: Deduplicated records using row_number()/rank() or Spark window functions based on business rules (e.g., most recent event per asset).
- ✅ Repartitioning: Applied repartitioning around joins to avoid data skew during heavy transformations.
✅ 3. End-to-End Auditability
Goal: Make output datasets trustworthy and explainable
Strategies:
- ✅ Column-level data profiling: Performed summary stats (min, max, unique counts) after transformations to spot anomalies.
- ✅ Assertions & test transforms: Added lightweight validation nodes in Foundry (e.g., "no nulls in the status field", "at least 1000 rows must be present").
- ✅ Backfill comparison: When backfilling data, compared summary metrics with previous versions to detect gaps or spikes.
✅ 4. Monitoring and Feedback
Goal: Prevent issues in production and fix regressions quickly
Strategies:
- ✅ Object health checks in Foundry: Enabled data health visualizations to track failure patterns.
- ✅ Alerting: Used Foundry alerts to notify on sudden row count drops or schema mismatches.
- ✅ Feedback loop with business: Partnered with analysts to get quick feedback on anomalies in curated datasets, especially when business logic changed.
In summary, I followed the following strategy:
1. At ingestion, I performed schema checks to validate structure and applied null checks for critical fields. I also filtered out corrupt or malformed records and flagged them for further investigation.
2. During transformation, I implemented deduplication logic using Spark window functions — for example, keeping the latest record per key while retaining the earlier records for audit. I also performed business-rule validations.
3. At the output stage, I added assertions and row count checks in Foundry to catch anomalies like missing records or schema mismatches, with unit tests at each level.
4. Finally, I scheduled the pipelines, enabled Foundry health checks on key object sets, and configured lineage and documentation so that analysts could trace issues back to source easily.
I ran into a tricky issue where our logic to calculate the part family was breaking. It turned out that the replaced_part_number field was coming in as null in some records, and the transformation wasn't handling that properly, which caused the entire family-mapping step to fail. I also noticed that some part numbers were malformed, with fewer digits or extra spaces, which led to incorrect results downstream. I fixed this by adding proper null checks and fallback logic, and I cleaned and validated part number formats using regex. After that, I put in a validation layer upstream so we could catch these issues earlier and notify the data source team.