Posts

Challenging project experience

Our front-end application relied on multiple curated datasets that had to be processed daily from raw material and operational data. The datasets were business-approved and formed the backbone of downstream analytics and reporting. I was responsible for developing and maintaining the back-end batch pipelines that transformed raw data into reliable, production-grade datasets, applying complex business logic around material consumption, part replacements, failure codes, etc. I designed modular pipelines in Foundry using Code Workbook, SQL transforms, and PySpark-based logic in the Code Repository.

The biggest challenge was optimizing batch processing times for large datasets:

- deciding when and how to repartition data to balance processing speed with resource usage
- implementing comprehensive health checks and schema validation

Since the data was high-volume (several GBs processed daily), I had to optimize batch performance by introducing repartitioning at key transformati…
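The schema-validation part of those health checks can be illustrated outside Foundry with a small pure-Python sketch. The `REQUIRED_SCHEMA` columns and `validate_batch` helper below are hypothetical, not the actual pipeline's names: the idea is just that a batch row is rejected before it reaches downstream datasets if required columns are missing or carry the wrong type.

```python
# Minimal schema-validation sketch for a daily batch (hypothetical column names).
REQUIRED_SCHEMA = {
    "material_id": str,   # which raw material was consumed
    "quantity": float,    # amount consumed in this batch
    "failure_code": str,  # business failure classification
}

def validate_batch(rows):
    """Return (good_rows, errors); flag rows that break the schema."""
    good, errors = [], []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_SCHEMA if c not in row]
        bad_type = [c for c, t in REQUIRED_SCHEMA.items()
                    if c in row and not isinstance(row[c], t)]
        if missing or bad_type:
            errors.append((i, missing, bad_type))
        else:
            good.append(row)
    return good, errors

rows = [
    {"material_id": "M-1", "quantity": 2.5, "failure_code": "F01"},
    {"material_id": "M-2", "quantity": "a lot", "failure_code": "F02"},  # wrong type
    {"material_id": "M-3", "quantity": 1.0},                             # missing column
]
good, errors = validate_batch(rows)
# only the first row passes; the other two are reported with a reason
```

In a real pipeline the same check would run as a gate transform, with the error list driving alerting rather than silently dropping rows.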
Interview questions for a senior engineer role (10 yrs experience)

- OOP basics (for modular ETL jobs)
- Generators & iterators (for memory efficiency)
- Write complex queries (joins, window functions, CTEs)
- Understand indexing, partitioning, performance tuning
- Partitioning & bucketing
- Shuffles, joins, broadcast variables
- Cache vs persist
- Avoiding wide transformations early
- Airflow or Azure Data Factory
- Understand DAGs, triggers, retries, dependencies
- Parquet, ORC, Avro vs CSV, JSON
- Compression (snappy, gzip)
- Partitioning strategies on cloud storage (S3, ADLS)
- Handling nulls, duplicates, schema mismatch
- Type casting, filtering bad rows
- Column-level transformations (e.g., timestamp to epoch, JSON flattening)
- Normalization/standardization
- Skew handling techniques (salting keys)
- Idempotency (no duplicates if rerun)
- Incremental loads (using watermark, last updated timestamp)
- Data partitioning (by date or id for scale)
- Logging, monitoring, alerting
- Backfill handling
- Retry strategies…
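One topic from the list above, skew handling via salting keys, can be sketched in plain Python without Spark. The data and the `NUM_SALTS` constant are made up for illustration: a hot key is spread across several synthetic sub-keys so that no single partition or reducer receives all of its rows.

```python
from collections import Counter
from itertools import count

NUM_SALTS = 4  # how many synthetic sub-keys to spread a hot key across

def salt_keys(records):
    """Yield (salted_key, value): 'hot' becomes 'hot_0'..'hot_3' round-robin."""
    counters = {}
    for key, value in records:
        n = counters.setdefault(key, count())
        yield (f"{key}_{next(n) % NUM_SALTS}", value)

# 8 rows share the skewed key 'hot'; only 2 rows use 'cold'
records = [("hot", i) for i in range(8)] + [("cold", i) for i in range(2)]
distribution = Counter(k for k, _ in salt_keys(records))
# the hot key is split into hot_0..hot_3 with 2 rows each; cold stays small
```

After a salted aggregation, the partial results are combined by stripping the salt suffix in a second, much smaller aggregation.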

Python Scripts

 https://medium.com/@yashwanthnandam/12-time-saving-python-automation-scripts-you-didnt-know-you-needed-bc400ad28d0a

Performance factors affecting Spark

Spark Memory

Performance is sensitive to application code, configuration settings, data layout and storage, multi-tenancy, and resource allocation and elasticity in cloud deployments like Amazon EMR, Microsoft Azure, Google Dataproc, Qubole, etc.

Tuning memory usage comes down to three considerations:

- the amount of memory used by your objects,
- the cost of accessing those objects,
- and the overhead of garbage collection.
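The first point, the amount of memory used by your objects, is easy to underestimate in driver-side Python code. This stdlib-only check (sizes will vary by interpreter, so treat the numbers as indicative) compares a list of boxed Python ints with a compact `array` of the same values:

```python
import sys
from array import array

n = 10_000
boxed = list(range(n))          # list of individual Python int objects
packed = array("q", range(n))   # contiguous 8-byte signed ints

# list overhead = list container + one object header per element
list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
array_bytes = sys.getsizeof(packed)
# the boxed list costs several times more memory than the packed array
```

The same per-object overhead exists for JVM objects on Spark executors, which is one reason compact, serialized storage formats are preferred over collections of plain objects.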

Spark Context, Spark Session, and JVM

Why do we have one Spark context per JVM?

Handling a single data-intensive application is painful enough (tuning GC, dealing with leaking resources, communication overhead). Multiple Spark applications running in a single JVM would be impossible to tune and manage in the long run. If one of the processes hangs, fails, or has its security compromised, the others are not affected. Having separate runtimes also helps GC, because each one has fewer references to handle than if everything ran together.

What is the JVM?

The Java Virtual Machine (JVM) is the virtual machine that runs Java bytecodes. The JVM doesn't understand Java source code; that's why you need to compile your *.java files to obtain *.class files containing the bytecodes the JVM understands. It's also what allows Java to be a "portable language" (write once, run anywhere). Indeed, there are specific implementations of the JVM for different systems (Windows, Linux, macOS,…
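The one-context-per-JVM rule is also why PySpark exposes `SparkSession.builder.getOrCreate()` rather than a plain constructor: a second call hands back the session that already exists. A minimal pure-Python analogue of that get-or-create singleton pattern (the `Session` class here is illustrative, not Spark's implementation):

```python
class Session:
    """Illustrative get-or-create singleton: one active session per process."""
    _active = None

    def __init__(self, app_name):
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name="default"):
        # Reuse the existing session instead of building a second runtime
        if cls._active is None:
            cls._active = cls(app_name)
        return cls._active

a = Session.get_or_create("etl-job")
b = Session.get_or_create("another-name")  # returns the same instance
# a and b are the same object, and the app name stays "etl-job"
```

In real Spark, builder options passed on the second call are likewise applied to (or ignored by) the already-running session rather than creating a new one.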