• Interview questions for 10 yrs experienced for senior engineer role

     
  • OOP basics (for modular ETL jobs
  • Generators & iterators (for memory efficiency)
  • Write complex queries (joins, window functions, CTEs)
  • Understand indexing, partitioning, performance tuning
  • Partitioning & bucketing
  • Shuffles, joins, broadcast variables
  • Cache vs persist
  • Avoiding wide transformations early
  • Airflow or Azure Data Factory
  • Understand DAGs, triggers, retries, dependencies
  • Parquet, ORC, Avro vs CSV, JSON
  • Compression (snappy, gzip)
  • Partitioning strategies on cloud storage (S3, ADLS)
  • Handling nulls, duplicates, schema mismatch

  • Type casting, filtering bad rows

  • Column-level transformations (e.g., timestamp to epoch, JSON flattening)

  • Normalization/standardization

  • Skew handling techniques (salting keys)
  • Idempotency (no duplicates if rerun)

  • Incremental loads (using watermark, last updated timestamp)

  • Data partitioning (by date or id for scale)

  • Logging, monitoring, alerting

  • Backfill handling
  • Retry strategies for failed jobs
  • Build a project on this theme:
  • Ingest logs or IoT data (multi-GB or synthetic multi-TB)

  • Clean, transform, and aggregate (e.g., per user/day stats)

  • Store in Parquet, partitioned by date

  • Load into data warehouse (BigQuery, Snowflake, Redshift)

  • Use:

  • Spark on Databricks or EMR or Azure Synapse

  • Schedule with Airflow / ADF

  • Monitor and log job status

  • ๐Ÿงช 6. Mock Interview Questions
  • How would you process a 10 TB log file every day?

  • How do you handle skewed joins in Spark?

  • What happens when your job runs out of memory?

  • Design an ETL pipeline to clean and load 5 TB of sales data into a data lake.

  • Compare Parquet and JSON for storing big data.
  • how to build pipelines using: S3 → Glue → Redshift

  • how to build pipelines using: S3 → Glue → Redshift

  • Understand IAM, security, cost optimization
  • Understand IAM, security, cost optimization
  •  LeetCode "Database" problems + optimize queries (EXPLAIN plans).
  • BI Role)
  • Strong SQL & data manipulation

  • ETL pipeline experience (multi-TB scale)

  • Scripting in Python

  • AWS technologies (especially S3, Redshift, Glue, Athena)
  • Query Optimization
  • Q:
  • You have a query that joins a 100 million row transaction table with a 10K row dimension table. It's slow.
  • How would you optimize it?
  • ๐Ÿง  Look for: indexing, broadcast joins, filtering early, subquery materialization, partitioning.

  • ๐Ÿ”ท 2. Window Function Deep Dive
  • Q:
  • Find the 2nd highest salary for each department, and include department name and employee name.
  • (Handle ties properly.)
  • ๐Ÿง  Use DENSE_RANK() or ROW_NUMBER() over partition.

  • ๐Ÿ”ท 3. Gap and Island Problem
  • Q:
  • Given a table of login timestamps for users, find continuous login streaks.
  • ๐Ÿง  Advanced use of LAG, LEAD, and GROUPING.

  • ๐Ÿ”ท 4. Running Totals with Resets
  • Q:
  • Calculate a running total of sales for each customer, reset every month.
  • ๐Ÿง  Use PARTITION BY customer_id, month and ORDER BY date.

  • ๐Ÿ”ท 5. Schema Design Question
  • Q:
  • You need to store customer addresses that change over time.
  • How would you model the table to track address history and ensure only one is "current"?
  • ๐Ÿง  Design a slowly changing dimension (SCD Type 2) in SQL.

  • ๐Ÿ”ท 6. Cumulative Distinct Count
  • Q:
  • Count the number of unique users seen up to each day, in a time series.
  • ๐Ÿง  Use DISTINCT with windowing or correlated subqueries.

  • ๐Ÿ”ท 7. Recursive CTE
  • Q:
  • Given a table of employee → manager hierarchy, find the full chain of command for each employee.
  • ๐Ÿง  Use a recursive CTE.

  • ๐Ÿ”ท 8. SQL Anti-Patterns
  • Q:
  • Why is SELECT * in production reports discouraged?
  • What are dangers of correlated subqueries in WHERE clauses?

  • ๐Ÿ”ท 9. Handling Skewed Joins
  • Q:
  • A report query joining sales and products is stuck. Products has 1 row for "Unknown", but it matches 1B rows.
  • How do you fix it?

  • ๐Ÿง  Talk about skew, broadcasting small tables, filtering before join, salting techniques in big data.

  • ๐Ÿ”ท 10. Pivoting and Unpivoting
  • Q:
  • Convert a sales table with columns product_id, month, amount into columns Jan, Feb, Mar etc.

  • ๐Ÿง  Use CASE WHEN or database-specific PIVOT.

  • ๐Ÿ”ท 11. Null Logic Edge Case
  • Q:
  • You join two tables on nullable foreign keys. Why are some matches missing even though values look the same?

  • ๐Ÿง  Understand how NULL behaves in joins and conditions.

  • ๐Ÿ”ท 12. Complex Aggregation Logic
  • Q:
  • For each product, compute its average weekly sales over the last 6 weeks, excluding the week with the highest and lowest sales.

  • ๐Ÿง  Use NTILE, ROW_NUMBER, subqueries to filter min/max.

  • ๐Ÿ”ท 13. Backfill-Safe Query Design
  • Q:
  • You want to re-run your ETL for a month of data. How do you ensure idempotency?

  • ๐Ÿง  Explain merge (upsert) patterns, deduplication logic, checksums.

  • ๐Ÿ”ท 14. Windowed Aggregations with Gaps
  • Q:
  • Show rolling 7-day average sales per user, even for days without sales.

  • ๐Ÿง  Use a date spine / calendar table + LEFT JOIN + WINDOW functions.

  • ๐Ÿ”ท 15. Materialized View Management
  • Q:
  • You have a large, frequently accessed report table. When do you choose a materialized view, and how do you refresh it?

  • ๐Ÿง  Pros/cons of materialized views vs. temp tables vs. cached results.
  • Time Series & Date Logic
  • Key Skills:

  • Date arithmetic

  • Rolling aggregates

  • CASE WHEN logic

  • PIVOT (manual or using SQL function)

  • Recursive queries

  • Idempotent queries

  • Upserts, slowly changing dimensions (SCD)

  • Partitioned inserts

  • Tree/graph structures
  • GROUP BY ROLLUP/CUBE
  • Generating missing dates

Comments

Popular posts from this blog

Read and Navigate XML - Beautiful Soup

difference-between-stream-processing-and-message-processing

WordNet in Python