SQL for data analysts is the set of queries and patterns you'll use to retrieve, clean, summarize, and combine data to answer business questions. In practice, mastering a core set of 10 queries — selects with filters, aggregations, joins, window functions, CTEs, dedup checks, CASE logic, date-time grouping, existence checks, and unions — will let you handle 80% of day-to-day tasks. You'll see short examples, sample results (counts, percentages, rank positions), and concrete steps to run and validate each query against tables like orders, users, and events. By the end you'll know when to use each pattern, how to avoid common pitfalls (e.g., incorrect joins or over-aggregating), and how to convert raw query results into insights: top 10 customers, monthly revenue trends, churn candidates, and cleaned datasets ready for visualization. This guide uses friendly examples, two comparison tables, and actionable tips so you can apply each query on your own datasets right away.
1. SELECT with WHERE, ORDER BY and LIMIT
Want recent high-value transactions? Use SELECT with WHERE, ORDER BY and LIMIT to filter and get the top rows efficiently. Example: SELECT id, user_id, amount FROM orders WHERE amount > 100 ORDER BY amount DESC LIMIT 10; returns the top 10 orders over $100. Practical data point: if your orders table has 1,200,000 rows, adding an index on amount or created_at can reduce query time from minutes to seconds. Actionable tips: always apply selective WHERE first, then ORDER BY on indexed columns, and use LIMIT during exploration to inspect samples before running full reports. Quick checklist:
- Filter early to reduce scanned rows.
- Use LIMIT while iterating to avoid long jobs.
- Prefer indexed columns for ORDER BY.
Case: a retail analyst found a 7x speedup by adding an index on created_at when pulling recent high-value orders.
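A minimal runnable sketch of this pattern, assuming a Postgres-style orders table with id, user_id, amount, and created_at columns (the index statement is illustrative, not required):

```sql
-- Optional supporting index for the filter/sort columns (illustrative)
CREATE INDEX IF NOT EXISTS idx_orders_created_at ON orders (created_at);

-- Top 10 recent high-value orders
SELECT id,
       user_id,
       amount,
       created_at
FROM orders
WHERE amount > 100
  AND created_at >= '2025-01-01'
ORDER BY amount DESC
LIMIT 10;
```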
2. GROUP BY with aggregate functions (+ HAVING)
To summarize metrics by category, use GROUP BY with aggregates like SUM, COUNT, AVG, and filter groups with HAVING. Example: SELECT product_id, COUNT(*) AS orders, SUM(amount) AS revenue FROM orders WHERE created_at >= '2025-01-01' GROUP BY product_id HAVING COUNT(*) > 50 ORDER BY revenue DESC; returns products with more than 50 orders and their revenue. Data point: grouping a 500k-row sales table by product might yield 3,200 groups; choose GROUP BY keys that match your analysis granularity. Actionable insight: use HAVING for conditions on aggregates (e.g., remove low-volume groups) and compute percent contribution with window sums to identify the top 20% of products driving 80% of revenue.
| Aggregate | Use case |
|---|---|
| SUM | Total revenue per region |
| COUNT | Orders per customer |
| AVG | Average session length |
| MIN | First purchase date |
| MAX | Last activity timestamp |
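Here is a sketch of the grouping example plus the percent-contribution idea, using a window sum over the aggregated groups (column names and the 2025 cutoff are assumptions):

```sql
-- Revenue per product for products with more than 50 orders,
-- plus each product's share of revenue among the groups that pass HAVING
SELECT product_id,
       COUNT(*)    AS orders,
       SUM(amount) AS revenue,
       100.0 * SUM(amount) / SUM(SUM(amount)) OVER () AS pct_of_revenue
FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY product_id
HAVING COUNT(*) > 50
ORDER BY revenue DESC;
```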
3. INNER and LEFT JOINs to combine tables
Combine related data with JOINs: INNER JOIN keeps only matching rows; LEFT JOIN preserves left-side rows even when the right side has no match. Example: SELECT u.id, u.email, o.total FROM users u LEFT JOIN orders o ON u.id = o.user_id AND o.created_at >= '2025-01-01'; returns all users, with order columns populated for recent orders and NULLs for users without them. Note that the date filter belongs in the ON clause: putting it in WHERE would drop the NULL rows and silently turn the LEFT JOIN into an INNER JOIN. Concrete numbers: joining 100k users to 1.2M orders produces up to 1.2M rows for the INNER JOIN and at least 100k rows for the LEFT JOIN. Actionable tips: choose INNER when you only need matched records; use LEFT to detect missing relationships (e.g., users with no purchases). Avoid accidental row multiplication by joining on unique keys or aggregating orders first.
| Join Type | Returns | When to use |
|---|---|---|
| INNER JOIN | Only matched rows | Matched records only (orders with users) |
| LEFT JOIN | All left rows + matches | Keep all users, even without orders |
| RIGHT JOIN | All right rows + matches | Rare, when right-side anchor is primary |
| FULL JOIN | All rows from both sides | Union of datasets with nulls |
| CROSS JOIN | Cartesian product | Multiplication of sets (use with care) |
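A sketch contrasting the row-level LEFT JOIN with a pre-aggregated join that avoids row multiplication (table and column names are assumptions):

```sql
-- Row-level LEFT JOIN: users with no matching 2025 orders get NULL order columns
SELECT u.id,
       u.email,
       o.id     AS order_id,
       o.amount
FROM users u
LEFT JOIN orders o
  ON o.user_id = u.id
 AND o.created_at >= '2025-01-01';

-- Aggregate orders first so each user appears exactly once
SELECT u.id,
       u.email,
       COALESCE(o.total_2025, 0) AS total_2025
FROM users u
LEFT JOIN (
  SELECT user_id, SUM(amount) AS total_2025
  FROM orders
  WHERE created_at >= '2025-01-01'
  GROUP BY user_id
) o ON o.user_id = u.id;
```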
4. Window functions (ROW_NUMBER, RANK, LAG, SUM OVER)
Window functions let you compute row-level metrics without collapsing rows. Use ROW_NUMBER() to deduplicate, RANK() to handle ties, LAG() to compute previous values, and SUM(...) OVER(...) for rolling totals. Example: SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn FROM orders; then filter rn=1 to get last order per user. Data point: computing a 7-day rolling revenue with SUM(amount) OVER (PARTITION BY product_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) provides fast time-series smoothing for 12 months of daily data. Actionable insights: use window functions to build cohorts, compute retention, and create running totals without extra joins or CTEs; they often outperform self-joins.
- ROW_NUMBER: pick latest record per group.
- LAG: compute day-over-day change.
- SUM OVER: cumulative or moving sums.
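A minimal sketch covering the three bullets; the daily_revenue table with (day, revenue) columns is an assumed pre-aggregated rollup:

```sql
-- Latest order per user: rank within each user, keep rn = 1
WITH ranked AS (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM orders o
)
SELECT * FROM ranked WHERE rn = 1;

-- Day-over-day change (LAG) and 7-day rolling revenue (SUM OVER)
SELECT day,
       revenue,
       revenue - LAG(revenue) OVER (ORDER BY day) AS dod_change,
       SUM(revenue) OVER (ORDER BY day
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d
FROM daily_revenue;  -- assumed table: one row per day with total revenue
```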
5. Common Table Expressions (CTEs) and subqueries
CTEs and subqueries structure complex logic into readable steps. Use WITH to stage intermediate results and improve maintainability. Example: WITH recent_orders AS (SELECT * FROM orders WHERE created_at >= '2025-01-01') SELECT u.id, COUNT(ro.id) AS orders_2025 FROM users u LEFT JOIN recent_orders ro ON u.id = ro.user_id GROUP BY u.id; This separates filtering from aggregation and counts 2025 orders per user while keeping users with zero orders. Data point: splitting a 2M-row pipeline into staged CTEs lets you test each step individually, and some databases can cache or materialize intermediate results. Actionable tips: use CTEs for clarity and testing; prefer materialized intermediate tables for heavy repeated computations; avoid deeply nested correlated subqueries when a CTE + join is simpler and faster.
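The same example staged as a runnable sketch, with the filter isolated in one CTE so you can test it on its own (the 2025 cutoff and column names are assumptions):

```sql
WITH recent_orders AS (
  SELECT id, user_id, amount
  FROM orders
  WHERE created_at >= '2025-01-01'
)
SELECT u.id,
       COUNT(ro.id)                AS orders_2025,
       COALESCE(SUM(ro.amount), 0) AS revenue_2025
FROM users u
LEFT JOIN recent_orders ro ON ro.user_id = u.id
GROUP BY u.id
ORDER BY revenue_2025 DESC;
```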
6. DISTINCT and COUNT to find uniques & duplicates
Use DISTINCT, COUNT(DISTINCT ...), and GROUP BY to measure uniques and surface duplicates. Example: SELECT email, COUNT(*) AS hits FROM users GROUP BY email HAVING COUNT(*) > 1; finds duplicate emails. For unique counts: SELECT COUNT(DISTINCT user_id) FROM events WHERE event_name='signup'; Data point: on a 3M-event table, COUNT(DISTINCT user_id) gives the number of unique users (e.g., 245,321). Actionable insights: to dedupe, identify the earliest or latest id per key with ROW_NUMBER() and then delete; to estimate distincts on very large sets, consider approximate functions like hyperloglog when supported (e.g., approx_count_distinct) to save time and memory.
| Problem | Query |
|---|---|
| Find duplicates | GROUP BY ... HAVING COUNT(*)>1 |
| Unique users | COUNT(DISTINCT user_id) |
| Sample unique emails | SELECT DISTINCT email FROM users LIMIT 100 |
| Approx distinct | approx_count_distinct(user_id) |
| Deduplicate | ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) |
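A sketch of the dedupe workflow from the table above, keeping the newest row per email; the DELETE assumes a unique id column and Postgres-style syntax, so inspect before you run it:

```sql
-- Step 1: inspect the duplicates you are about to remove
WITH ranked AS (
  SELECT id,
         email,
         ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
  FROM users
)
SELECT * FROM ranked WHERE rn > 1;

-- Step 2: delete everything but the most recent row per email
DELETE FROM users
WHERE id IN (
  SELECT id
  FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
    FROM users
  ) t
  WHERE t.rn > 1
);
```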
7. CASE expressions for conditional transformations
Use CASE to map raw values to categories or create flags. Example: SELECT user_id, amount, CASE WHEN amount >= 500 THEN 'High' WHEN amount >=100 THEN 'Medium' ELSE 'Low' END AS value_bucket FROM orders; This converts numeric amounts into buckets for easy reporting. Data point: bucketing can reduce cardinality from thousands of distinct amounts to 3 actionable segments, improving aggregated visuals. Actionable tips: prefer deterministic buckets, document boundary logic, and test edge cases (NULLs, negative amounts). Also combine CASE with SUM(...) to compute conditional aggregates, e.g., SUM(CASE WHEN amount >=500 THEN amount ELSE 0 END) AS high_value_revenue.
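A sketch of both uses, bucketing and conditional aggregation, with illustrative boundaries:

```sql
-- Bucket individual orders into three segments
SELECT user_id,
       amount,
       CASE
         WHEN amount >= 500 THEN 'High'
         WHEN amount >= 100 THEN 'Medium'
         ELSE 'Low'
       END AS value_bucket
FROM orders;

-- Conditional aggregate: high-value revenue vs. total revenue in one pass
SELECT SUM(CASE WHEN amount >= 500 THEN amount ELSE 0 END) AS high_value_revenue,
       SUM(amount)                                         AS total_revenue
FROM orders;
```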
8. Date/time queries and DATE_TRUNC for time-series analysis
DATE_TRUNC and similar functions let you roll timestamps into time periods for trend analysis. Example: SELECT DATE_TRUNC('month', created_at) AS month, COUNT(*) AS orders, SUM(amount) AS revenue FROM orders WHERE created_at >= '2024-01-01' GROUP BY month ORDER BY month; This gives monthly orders and revenue. Data point: converting timestamps to daily or weekly buckets can reveal seasonality; e.g., Monday orders running 18% higher than Sunday in a sample dataset. Actionable tips: align timezones, use DATE_TRUNC for consistent buckets, and pre-aggregate to daily rollups for dashboards to speed up queries. For moving averages, combine DATE_TRUNC with a window SUM or AVG OVER partitioned by product or region.
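A sketch of the monthly rollup with a 3-month moving average on top, assuming Postgres-style DATE_TRUNC (other engines expose similar functions under different names):

```sql
WITH monthly AS (
  SELECT DATE_TRUNC('month', created_at) AS month,
         COUNT(*)                        AS orders,
         SUM(amount)                     AS revenue
  FROM orders
  WHERE created_at >= '2024-01-01'
  GROUP BY 1
)
SELECT month,
       orders,
       revenue,
       AVG(revenue) OVER (ORDER BY month
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS revenue_3mo_avg
FROM monthly
ORDER BY month;
```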
9. EXISTS / IN for existence checks and semi-joins
Use EXISTS for efficient existence checks and IN for small lists; semi-joins avoid pulling full rows. Example using EXISTS: SELECT u.id FROM users u WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.created_at >= '2025-01-01'); returns users with recent orders without duplicating rows. Data point: with 10M users and 2M orders, EXISTS with proper indexing often outperforms JOIN + DISTINCT. Actionable insights: prefer EXISTS when you only need presence/absence, and use IN for static small lists (e.g., IN ('US','CA')). For semi-joins returning unique left-side rows, EXISTS expresses intent and can aid optimizer performance.
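A sketch of the semi-join and its anti-join counterpart (handy for churn candidates); the country column and the 2025 cutoff are assumptions:

```sql
-- Semi-join: users with at least one 2025 order
SELECT u.id, u.email
FROM users u
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.user_id = u.id
    AND o.created_at >= '2025-01-01'
);

-- Anti-join: users with no 2025 orders, limited to a small static country list
SELECT u.id, u.email
FROM users u
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.user_id = u.id
    AND o.created_at >= '2025-01-01'
)
  AND u.country IN ('US', 'CA');  -- assumed column; IN suits short static lists
```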
10. UNION / UNION ALL to combine result sets
UNION merges result sets vertically. Use UNION ALL to keep duplicates and run faster; use UNION to de-duplicate. Example: SELECT email FROM users_2024 UNION ALL SELECT email FROM users_2025; simply stacks results. Data point: combining two datasets of 1M rows each with UNION ALL returns 2M rows quickly; using UNION adds a sort/dedup step which can be costly. Actionable tips: choose UNION ALL when you want full append, and use UNION when dedupe is required; consider adding a source column to track origin: SELECT email, '2024' AS src ... UNION ALL SELECT email, '2025' AS src ...
- UNION ALL: faster, keeps duplicates.
- UNION: slower, removes duplicates.
- Add source tags to track dataset origin.
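A sketch of both variants with source tags, assuming two year-partitioned tables that share an email column:

```sql
-- Full append, duplicates kept, origin tracked
SELECT email, '2024' AS src FROM users_2024
UNION ALL
SELECT email, '2025' AS src FROM users_2025;

-- Distinct emails across both years (UNION adds the sort/dedup step)
SELECT email FROM users_2024
UNION
SELECT email FROM users_2025;
```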
Conclusion
These 10 queries form the backbone of SQL for data analysts: selection and sampling, aggregation with GROUP BY, joining datasets, window analytics, CTEs for clarity, distinct/duplicate checks, CASE transformations, time-series grouping, existence checks, and set unions. Practice each pattern on real tables — orders, users, and events — and measure performance: row counts, execution time, and index usage. Next steps: pick one query pattern per day and apply it to a real use case (e.g., compute monthly churn with DATE_TRUNC + window functions), add indexes where queries are slow, and convert repeatable CTE logic into materialized tables or scheduled ETL jobs. Use the two tables and examples above as templates; over time you'll reduce exploratory time and deliver faster, more accurate insights to stakeholders. Keep this guide handy as a checklist while building dashboards and reports with SQL for data analysts.