Challenging Project Experience

Our front-end application relied on multiple curated datasets that needed to be processed daily from raw material and operational data. These datasets were business-approved and formed the backbone of downstream analytics and reporting.

I was responsible for developing and maintaining the back-end batch pipelines that transformed raw data into reliable, production-grade datasets, applying complex business logic around material consumption, part replacements, failure codes, and more.

I designed modular pipelines in Foundry using Code Workbook, SQL transforms, and PySpark-based logic in the Code Repository.

The biggest challenge was optimizing batch processing times for large datasets, which came down to deciding:

  • When and how to repartition data to balance processing speed with resource usage

  • How to implement comprehensive health checks and schema validation

Since the data was high-volume (several GBs processed daily), I optimized batch performance by introducing repartitioning at key transformation steps to reduce shuffling and improve Spark performance.

This reduced our compute-hour usage by roughly 14 percent.
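The repartitioning decision can be sketched as a simple heuristic; the 128 MB-per-partition target and the `material_id` join key are illustrative assumptions, not values from the project:

```python
# Heuristic for choosing a partition count before a heavy transformation.
# Assumes ~128 MB per partition, a common Spark rule of thumb (illustrative).
def target_partitions(total_bytes: int,
                      bytes_per_partition: int = 128 * 1024 * 1024) -> int:
    """Return a partition count aiming at ~128 MB per partition (min 1)."""
    return max(1, -(-total_bytes // bytes_per_partition))  # ceiling division

# Usage inside a transform (illustrative; `df` is a Spark DataFrame):
#   n = target_partitions(estimated_size_bytes)
#   df = df.repartition(n, "material_id")  # co-locate rows by join key to cut shuffle
```

Repartitioning on the join key before a wide join co-locates matching rows, which is what reduces shuffle at the expense of one extra exchange up front.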

Implemented schema validation, data health checks, and null handling logic to ensure consistency across evolving source systems.

Maintained clean separation between raw, staging, and final datasets, making sure every business-validated dataset could be traced.

Worked closely with the product owner and business users to update logic as the business evolved, including controlled deployments and lineage tracking.

✅ 1. Validation at the Ingestion Stage

Goal: Catch issues early before data flows downstream

Strategies:

  • Schema validation: Used Type Safety and schema checks in Code Workbook or PySpark scripts to ensure all required fields were present.

  • Null checks & mandatory fields: Ensured critical business fields (e.g., material ID, maintenance date) were non-null.

  • Drop corrupt or malformed records: Flagged records that didn’t meet quality thresholds and logged them separately for review.


✅ 2. Transformation-Phase Quality Checks

Goal: Ensure logic preserves integrity and business correctness

Strategies:

  • Reference checks: Validated foreign key relationships (e.g., every part ID maps to a known part in the master table).

  • Duplicate handling: Deduplicated records using row_number()/rank() or Spark window functions based on business rules (e.g., most recent event per asset).

  • Repartitioning: Applied partitioning on join keys to avoid skew and excessive shuffle spill during heavy transformations.


✅ 3. End-to-End Auditability

Goal: Make output datasets trustworthy and explainable

Strategies:

  • Column-level data profiling: Performed summary stats (min, max, unique counts) after transformations to spot anomalies.

  • Assertions & test transforms: Added lightweight validation nodes in Foundry (e.g., “No Nulls in status field”, or “>= 1000 rows must be present”).

  • Backfill comparison: When backfilling data, compared summary metrics with previous versions to detect gaps or spikes.
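The backfill comparison can be sketched as a plain-Python diff of summary metrics; the 10 percent tolerance and metric names are illustrative assumptions:

```python
def compare_metrics(previous: dict, current: dict,
                    tolerance: float = 0.10) -> list:
    """Return metric names whose relative change exceeds `tolerance`.

    `previous`/`current` map a metric name (e.g. "row_count",
    "distinct_parts") to a numeric summary computed after each run.
    """
    anomalies = []
    for name, old in previous.items():
        new = current.get(name)
        if new is None:
            anomalies.append(name)      # metric disappeared entirely
            continue
        baseline = abs(old) or 1        # avoid division by zero
        if abs(new - old) / baseline > tolerance:
            anomalies.append(name)      # gap or spike beyond tolerance
    return anomalies
```

Any non-empty result would fail the backfill and route it for manual review before the curated dataset is promoted.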


✅ 4. Monitoring and Feedback

Goal: Prevent issues in production and fix regressions quickly

Strategies:

  • Object health checks in Foundry: Enabled data health visualizations to track failure patterns.

  • Alerting: Used Foundry alerts to notify on sudden row count drops or schema mismatches.

  • Feedback loop with business: Partnered with analysts to get quick feedback on anomalies in curated datasets — especially when business logic changed.

In practice, I followed this strategy:

1. At ingestion, I performed schema checks to validate structure and applied null checks for critical fields. I also filtered out corrupt or malformed records and flagged them for further investigation.

2. During transformation, I implemented deduplication logic using Spark window functions, for example selecting the latest record per key while keeping earlier records available for audit. I also performed business-rule validations.

3. At the output stage, I added assertions and row-count checks in Foundry to catch anomalies like missing records or schema mismatches. I also wrote unit tests at each level and scheduled the pipelines to run daily.

4. Finally, I enabled Foundry health checks on key object sets and configured lineage and documentation so that analysts could trace issues back to source easily.
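The unit tests mentioned in step 3 can be sketched like this; `map_failure_code` and its categories are hypothetical stand-ins for the real business rules:

```python
def map_failure_code(code):
    """Map a raw failure code to a business category (hypothetical rules)."""
    if code is None:
        return "UNKNOWN"
    code = code.strip().upper()
    if code.startswith("E"):
        return "ELECTRICAL"
    if code.startswith("M"):
        return "MECHANICAL"
    return "OTHER"

# Lightweight unit tests, runnable with pytest or plain `python`:
def test_map_failure_code():
    assert map_failure_code("E101") == "ELECTRICAL"
    assert map_failure_code(" m42 ") == "MECHANICAL"
    assert map_failure_code(None) == "UNKNOWN"

test_map_failure_code()
```

Keeping the business logic in small pure functions like this makes it testable without spinning up a Spark session.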


I ran into a tricky issue where our logic to calculate the part family was breaking. It turned out that the replaced_part_number field was coming in as null in some records, and the transformation wasn't handling it, which caused the entire family-mapping step to fail. I also noticed that some part numbers were malformed (fewer digits, extra spaces), which led to incorrect results downstream. I fixed this by adding proper null checks and fallback logic, and by cleaning and validating part-number formats using regex. After that, I put a validation layer upstream so we could catch these issues earlier and notify the data source team.
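The null handling and regex cleanup can be sketched as a plain-Python helper (usable directly or wrapped in a Spark UDF); the 8-digit format is a hypothetical rule standing in for the real part-number convention:

```python
import re

# Hypothetical format: exactly 8 digits. The real rule would come from
# the part master data, not from this sketch.
PART_NUMBER_RE = re.compile(r"^\d{8}$")

def normalize_part_number(raw):
    """Trim and strip inner spaces, then validate; return None if malformed."""
    if raw is None:
        return None                        # null check: fallback logic decides
    cleaned = re.sub(r"\s+", "", str(raw)) # remove extra spaces anywhere
    return cleaned if PART_NUMBER_RE.match(cleaned) else None
```

Records that normalize to `None` are exactly the ones the upstream validation layer flags and reports back to the data source team.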

