Posts

Challenging project experience

Our front-end application relied on multiple curated datasets that had to be processed daily from raw material and operational data. The datasets were business-approved and formed the backbone of downstream analytics and reporting. I was responsible for developing and maintaining the back-end batch pipelines that transformed raw data into reliable, production-grade datasets, applying complex business logic around material consumption, part replacements, failure codes, etc. I designed modular pipelines in Foundry using Code Workbook, SQL transforms, and PySpark-based logic in the Code Repository.

The biggest challenge was optimizing batch processing times for large datasets:

- deciding when and how to repartition data to balance processing speed with resource usage
- implementing comprehensive health checks and schema validation

Since the data was high-volume (several GBs processed daily), I had to optimize batch performance by introducing repartitioning at key transformati…
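The schema-validation part of those health checks can be illustrated outside Foundry with a small pure-Python sketch. The `REQUIRED_SCHEMA` columns and `validate_batch` helper below are hypothetical, not the actual pipeline's names: the idea is just that a batch row is rejected before it reaches downstream datasets if required columns are missing or carry the wrong type.

```python
# Minimal schema-validation sketch for a daily batch (hypothetical column names).
REQUIRED_SCHEMA = {
    "material_id": str,   # which raw material was consumed
    "quantity": float,    # amount consumed in this batch
    "failure_code": str,  # business failure classification
}

def validate_batch(rows):
    """Return (good_rows, errors); flag rows that break the schema."""
    good, errors = [], []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_SCHEMA if c not in row]
        bad_type = [c for c, t in REQUIRED_SCHEMA.items()
                    if c in row and not isinstance(row[c], t)]
        if missing or bad_type:
            errors.append((i, missing, bad_type))
        else:
            good.append(row)
    return good, errors

rows = [
    {"material_id": "M-1", "quantity": 2.5, "failure_code": "F01"},
    {"material_id": "M-2", "quantity": "a lot", "failure_code": "F02"},  # wrong type
    {"material_id": "M-3", "quantity": 1.0},                             # missing column
]
good, errors = validate_batch(rows)
# only the first row passes; the other two are reported with a reason
```

In a real pipeline the same check would run as a gate transform, with the error list driving alerting rather than silently dropping rows.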
Interview questions for a senior engineer role (10 yrs experience)

- OOP basics (for modular ETL jobs)
- Generators & iterators (for memory efficiency)
- Write complex queries (joins, window functions, CTEs)
- Understand indexing, partitioning, performance tuning
- Partitioning & bucketing
- Shuffles, joins, broadcast variables
- Cache vs persist
- Avoiding wide transformations early
- Airflow or Azure Data Factory
- Understand DAGs, triggers, retries, dependencies
- Parquet, ORC, Avro vs CSV, JSON
- Compression (snappy, gzip)
- Partitioning strategies on cloud storage (S3, ADLS)
- Handling nulls, duplicates, schema mismatch
- Type casting, filtering bad rows
- Column-level transformations (e.g., timestamp to epoch, JSON flattening)
- Normalization/standardization
- Skew handling techniques (salting keys)
- Idempotency (no duplicates if rerun)
- Incremental loads (using watermark, last updated timestamp)
- Data partitioning (by date or id for scale)
- Logging, monitoring, alerting
- Backfill handling
- Retry strategies…
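One topic from the list above, skew handling via salting keys, can be sketched in plain Python without Spark. The data and the `NUM_SALTS` constant are made up for illustration: a hot key is spread across several synthetic sub-keys so that no single partition or reducer receives all of its rows.

```python
from collections import Counter
from itertools import count

NUM_SALTS = 4  # how many synthetic sub-keys to spread a hot key across

def salt_keys(records):
    """Yield (salted_key, value): 'hot' becomes 'hot_0'..'hot_3' round-robin."""
    counters = {}
    for key, value in records:
        n = counters.setdefault(key, count())
        yield (f"{key}_{next(n) % NUM_SALTS}", value)

# 8 rows share the skewed key 'hot'; only 2 rows use 'cold'
records = [("hot", i) for i in range(8)] + [("cold", i) for i in range(2)]
distribution = Counter(k for k, _ in salt_keys(records))
# the hot key is split into hot_0..hot_3 with 2 rows each; cold stays small
```

After a salted aggregation, the partial results are combined by stripping the salt suffix in a second, much smaller aggregation.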

Python Scripts

 https://medium.com/@yashwanthnandam/12-time-saving-python-automation-scripts-you-didnt-know-you-needed-bc400ad28d0a

Performance factors affecting Spark

Spark Memory

Performance is sensitive to application code, configuration settings, data layout and storage, multi-tenancy, and resource allocation and elasticity in cloud deployments like Amazon EMR, Microsoft Azure, Google Dataproc, Qubole, etc.

Tuning memory usage comes down to three considerations:

- the amount of memory used by your objects,
- the cost of accessing those objects,
- and the overhead of garbage collection.
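The first point, the amount of memory used by your objects, is easy to underestimate in driver-side Python code. This stdlib-only check (sizes will vary by interpreter, so treat the numbers as indicative) compares a list of boxed Python ints with a compact `array` of the same values:

```python
import sys
from array import array

n = 10_000
boxed = list(range(n))          # list of individual Python int objects
packed = array("q", range(n))   # contiguous 8-byte signed ints

# list overhead = list container + one object header per element
list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
array_bytes = sys.getsizeof(packed)
# the boxed list costs several times more memory than the packed array
```

The same per-object overhead exists for JVM objects on Spark executors, which is one reason compact, serialized storage formats are preferred over collections of plain objects.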

Spark Context, Spark Session, and JVM

Why do we have one Spark context per JVM?

Handling a single data-intensive application is painful enough (tuning GC, dealing with leaking resources, communication overhead). Multiple Spark applications running in a single JVM would be impossible to tune and manage in the long run. If one of the processes hangs, fails, or has its security compromised, the others are not affected. Having separate runtimes also helps GC, because each one has fewer references to handle than if everything ran together.

What is the JVM?

The Java Virtual Machine (JVM) is the virtual machine that runs Java bytecodes. The JVM doesn't understand Java source code; that's why you need to compile your *.java files to obtain *.class files containing the bytecodes the JVM understands. It's also what allows Java to be a "portable language" (write once, run anywhere). Indeed, there are specific implementations of the JVM for different systems (Windows, Linux, macOS,…
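The one-context-per-JVM rule is also why PySpark exposes `SparkSession.builder.getOrCreate()` rather than a plain constructor: a second call hands back the session that already exists. A minimal pure-Python analogue of that get-or-create singleton pattern (the `Session` class here is illustrative, not Spark's implementation):

```python
class Session:
    """Illustrative get-or-create singleton: one active session per process."""
    _active = None

    def __init__(self, app_name):
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name="default"):
        # Reuse the existing session instead of building a second runtime
        if cls._active is None:
            cls._active = cls(app_name)
        return cls._active

a = Session.get_or_create("etl-job")
b = Session.get_or_create("another-name")  # returns the same instance
# a and b are the same object, and the app name stays "etl-job"
```

In real Spark, builder options passed on the second call are likewise applied to (or ignored by) the already-running session rather than creating a new one.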