Decorators in Python

June 28, 2025

Interview questions for 10 yrs experienced for senior engineer role
OOP basics (for modular ETL jobs
Generators & iterators (for memory efficiency)
Write complex queries (joins, window functions, CTEs)
Understand indexing, partitioning, performance tuning
Partitioning & bucketing
Shuffles, joins, broadcast variables
Cache vs persist
Avoiding wide transformations early
Airflow or Azure Data Factory
Understand DAGs, triggers, retries, dependencies
Parquet, ORC, Avro vs CSV, JSON
Compression (snappy, gzip)
Partitioning strategies on cloud storage (S3, ADLS)
Handling nulls, duplicates, schema mismatch
Type casting, filtering bad rows
Column-level transformations (e.g., timestamp to epoch, JSON flattening)
Normalization/standardization
Skew handling techniques (salting keys)
Idempotency (no duplicates if rerun)
Incremental loads (using watermark, last updated timestamp)
Data partitioning (by date or id for scale)
Logging, monitoring, alerting
Backfill handling
Retry strategies for failed jobs
Build a project on this theme:
Ingest logs or IoT data (multi-GB or synthetic multi-TB)
Clean, transform, and aggregate (e.g., per user/day stats)
Store in Parquet, partitioned by date
Load into data warehouse (BigQuery, Snowflake, Redshift)
Use:
Spark on Databricks or EMR or Azure Synapse
Schedule with Airflow / ADF
Monitor and log job status
🧪 6. Mock Interview Questions
How would you process a 10 TB log file every day?
How do you handle skewed joins in Spark?
What happens when your job runs out of memory?
Design an ETL pipeline to clean and load 5 TB of sales data into a data lake.
Compare Parquet and JSON for storing big data.
how to build pipelines using: S3 → Glue → Redshift
how to build pipelines using: S3 → Glue → Redshift
Understand IAM, security, cost optimization
Understand IAM, security, cost optimization
LeetCode "Database" problems + optimize queries (EXPLAIN plans).
BI Role)
Strong SQL & data manipulation
ETL pipeline experience (multi-TB scale)
Scripting in Python
AWS technologies (especially S3, Redshift, Glue, Athena)
Query Optimization
Q:
You have a query that joins a 100 million row transaction table with a 10K row dimension table. It's slow.
How would you optimize it?
🧠 Look for: indexing, broadcast joins, filtering early, subquery materialization, partitioning.
🔷 2. Window Function Deep Dive
Q:
Find the 2nd highest salary for each department, and include department name and employee name.
(Handle ties properly.)
🧠 Use DENSE_RANK() or ROW_NUMBER() over partition.
🔷 3. Gap and Island Problem
Q:
Given a table of login timestamps for users, find continuous login streaks.
🧠 Advanced use of LAG, LEAD, and GROUPING.
🔷 4. Running Totals with Resets
Q:
Calculate a running total of sales for each customer, reset every month.
🧠 Use PARTITION BY customer_id, month and ORDER BY date.
🔷 5. Schema Design Question
Q:
You need to store customer addresses that change over time.
How would you model the table to track address history and ensure only one is "current"?
🧠 Design a slowly changing dimension (SCD Type 2) in SQL.
🔷 6. Cumulative Distinct Count
Q:
Count the number of unique users seen up to each day, in a time series.
🧠 Use DISTINCT with windowing or correlated subqueries.
🔷 7. Recursive CTE
Q:
Given a table of employee → manager hierarchy, find the full chain of command for each employee.
🧠 Use a recursive CTE.
🔷 8. SQL Anti-Patterns
Q:
Why is SELECT * in production reports discouraged?
What are dangers of correlated subqueries in WHERE clauses?
🔷 9. Handling Skewed Joins
Q:
A report query joining sales and products is stuck. Products has 1 row for "Unknown", but it matches 1B rows.
How do you fix it?
🧠 Talk about skew, broadcasting small tables, filtering before join, salting techniques in big data.
🔷 10. Pivoting and Unpivoting
Q:
Convert a sales table with columns product_id, month, amount into columns Jan, Feb, Mar etc.
🧠 Use CASE WHEN or database-specific PIVOT.
🔷 11. Null Logic Edge Case
Q:
You join two tables on nullable foreign keys. Why are some matches missing even though values look the same?
🧠 Understand how NULL behaves in joins and conditions.
🔷 12. Complex Aggregation Logic
Q:
For each product, compute its average weekly sales over the last 6 weeks, excluding the week with the highest and lowest sales.
🧠 Use NTILE, ROW_NUMBER, subqueries to filter min/max.
🔷 13. Backfill-Safe Query Design
Q:
You want to re-run your ETL for a month of data. How do you ensure idempotency?
🧠 Explain merge (upsert) patterns, deduplication logic, checksums.
🔷 14. Windowed Aggregations with Gaps
Q:
Show rolling 7-day average sales per user, even for days without sales.
🧠 Use a date spine / calendar table + LEFT JOIN + WINDOW functions.
🔷 15. Materialized View Management
Q:
You have a large, frequently accessed report table. When do you choose a materialized view, and how do you refresh it?
🧠 Pros/cons of materialized views vs. temp tables vs. cached results.
Time Series & Date Logic
Key Skills:
Date arithmetic
Rolling aggregates
CASE WHEN logic
PIVOT (manual or using SQL function)
Recursive queries
Idempotent queries
Upserts, slowly changing dimensions (SCD)
Partitioned inserts
Tree/graph structures
GROUP BY ROLLUP/CUBE
Generating missing dates

Search This Blog

Decorators in Python

Comments

Post a Comment

Popular posts from this blog

Read and Navigate XML - Beautiful Soup

difference-between-stream-processing-and-message-processing

WordNet in Python