Interview questions for 10 yrs experienced for senior engineer role
- OOP basics (for modular ETL jobs
- Generators & iterators (for memory efficiency)
- Write complex queries (joins, window functions, CTEs)
- Understand indexing, partitioning, performance tuning
- Partitioning & bucketing
- Shuffles, joins, broadcast variables
- Cache vs persist
- Avoiding wide transformations early
- Airflow or Azure Data Factory
- Understand DAGs, triggers, retries, dependencies
- Parquet, ORC, Avro vs CSV, JSON
- Compression (snappy, gzip)
- Partitioning strategies on cloud storage (S3, ADLS)
- Handling nulls, duplicates, schema mismatch
- Type casting, filtering bad rows
- Column-level transformations (e.g., timestamp to epoch, JSON flattening)
- Normalization/standardization
- Skew handling techniques (salting keys)
- Idempotency (no duplicates if rerun)
- Incremental loads (using watermark, last updated timestamp)
- Data partitioning (by date or id for scale)
- Logging, monitoring, alerting
- Backfill handling
- Retry strategies for failed jobs
- Build a project on this theme:
- Ingest logs or IoT data (multi-GB or synthetic multi-TB)
- Clean, transform, and aggregate (e.g., per user/day stats)
- Store in Parquet, partitioned by date
- Load into data warehouse (BigQuery, Snowflake, Redshift)
- Use:
- Spark on Databricks or EMR or Azure Synapse
- Schedule with Airflow / ADF
- Monitor and log job status
- ๐งช 6. Mock Interview Questions
- How would you process a 10 TB log file every day?
- How do you handle skewed joins in Spark?
- What happens when your job runs out of memory?
- Design an ETL pipeline to clean and load 5 TB of sales data into a data lake.
- Compare Parquet and JSON for storing big data.
- how to build pipelines using: S3 → Glue → Redshift
- how to build pipelines using: S3 → Glue → Redshift
- Understand IAM, security, cost optimization
- Understand IAM, security, cost optimization
- LeetCode "Database" problems + optimize queries (EXPLAIN plans).
- BI Role)
- Strong SQL & data manipulation
- ETL pipeline experience (multi-TB scale)
- Scripting in Python
- AWS technologies (especially S3, Redshift, Glue, Athena)
- Query Optimization
- Q:
- You have a query that joins a 100 million row transaction table with a 10K row dimension table. It's slow.
- How would you optimize it?
- ๐ง Look for: indexing, broadcast joins, filtering early, subquery materialization, partitioning.
- ๐ท 2. Window Function Deep Dive
- Q:
- Find the 2nd highest salary for each department, and include department name and employee name.
- (Handle ties properly.)
- ๐ง Use DENSE_RANK() or ROW_NUMBER() over partition.
- ๐ท 3. Gap and Island Problem
- Q:
- Given a table of login timestamps for users, find continuous login streaks.
- ๐ง Advanced use of LAG, LEAD, and GROUPING.
- ๐ท 4. Running Totals with Resets
- Q:
- Calculate a running total of sales for each customer, reset every month.
- ๐ง Use PARTITION BY customer_id, month and ORDER BY date.
- ๐ท 5. Schema Design Question
- Q:
- You need to store customer addresses that change over time.
- How would you model the table to track address history and ensure only one is "current"?
- ๐ง Design a slowly changing dimension (SCD Type 2) in SQL.
- ๐ท 6. Cumulative Distinct Count
- Q:
- Count the number of unique users seen up to each day, in a time series.
- ๐ง Use DISTINCT with windowing or correlated subqueries.
- ๐ท 7. Recursive CTE
- Q:
- Given a table of employee → manager hierarchy, find the full chain of command for each employee.
- ๐ง Use a recursive CTE.
- ๐ท 8. SQL Anti-Patterns
- Q:
- Why is SELECT * in production reports discouraged?
- What are dangers of correlated subqueries in WHERE clauses?
- ๐ท 9. Handling Skewed Joins
- Q:
- A report query joining sales and products is stuck. Products has 1 row for "Unknown", but it matches 1B rows.
- How do you fix it?
- ๐ง Talk about skew, broadcasting small tables, filtering before join, salting techniques in big data.
- ๐ท 10. Pivoting and Unpivoting
- Q:
- Convert a sales table with columns product_id, month, amount into columns Jan, Feb, Mar etc.
- ๐ง Use CASE WHEN or database-specific PIVOT.
- ๐ท 11. Null Logic Edge Case
- Q:
- You join two tables on nullable foreign keys. Why are some matches missing even though values look the same?
- ๐ง Understand how NULL behaves in joins and conditions.
- ๐ท 12. Complex Aggregation Logic
- Q:
- For each product, compute its average weekly sales over the last 6 weeks, excluding the week with the highest and lowest sales.
- ๐ง Use NTILE, ROW_NUMBER, subqueries to filter min/max.
- ๐ท 13. Backfill-Safe Query Design
- Q:
- You want to re-run your ETL for a month of data. How do you ensure idempotency?
- ๐ง Explain merge (upsert) patterns, deduplication logic, checksums.
- ๐ท 14. Windowed Aggregations with Gaps
- Q:
- Show rolling 7-day average sales per user, even for days without sales.
- ๐ง Use a date spine / calendar table + LEFT JOIN + WINDOW functions.
- ๐ท 15. Materialized View Management
- Q:
- You have a large, frequently accessed report table. When do you choose a materialized view, and how do you refresh it?
- ๐ง Pros/cons of materialized views vs. temp tables vs. cached results.
- Time Series & Date Logic
- Key Skills:
- Date arithmetic
- Rolling aggregates
- CASE WHEN logic
- PIVOT (manual or using SQL function)
- Recursive queries
- Idempotent queries
- Upserts, slowly changing dimensions (SCD)
- Partitioned inserts
- Tree/graph structures
- GROUP BY ROLLUP/CUBE
- Generating missing dates
Read and Navigate XML - Beautiful Soup
### How to read and navigate XML There is a Python library called BeautifulSoup, which makes reading in and parsing XML data easier. Here is the link to the documentation: [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/) The find() method will find the first place where an xml element occurs. For example using find('record') will return the first record in the xml file: ```xml <record> <field name="Country or Area" key="ABW">Aruba</field> <field name="Item" key="SP.POP.TOTL">Population, total</field> <field name="Year">1960</field> <field name="Value">54211</field> </record> ``` The find_all() method returns all of the matching tags. So find_all('record') would return all of the elements with the `<record>` tag. Run the code cells below to get a basic idea of how to navigate XML with BeautifulSoup. To naviga...
Comments
Post a Comment