Fundamentals of Databases: What Every CS Student Must Know

a couple of pieces of luggage sitting on top of each other

Every application you’ve ever used—from Instagram to your university’s course registration system—sits on top of a database. Yet somehow, database fundamentals get glossed over in many CS curricula. Students learn to write a SELECT statement but can’t explain why an index speeds up queries or what ACID actually means. This guide fills those gaps. Whether you’re preparing for technical interviews or building your first production application, these are the database concepts that separate novices from engineers.

“Data is the new oil. But like oil, it’s only valuable if you can store it, refine it, and retrieve it efficiently.”

1. The Relational Model: Why Tables Rule Everything

Before NoSQL and graph databases dominated tech blogs, E.F. Codd introduced the relational model in 1970—and it still powers the vast majority of business-critical systems. Understanding why it works is foundational.

Relations, Tuples, and Attributes

Relational Algebra Schema Design Primary Keys

The core idea: Data is organized into relations (tables) consisting of tuples (rows) and attributes (columns). Each table represents a single concept—Students, Courses, Enrollments. Relationships between concepts are represented by shared keys, not by nesting data.

Why this matters: This separation of concerns is what makes relational databases flexible. You can query data in ways the original designer never anticipated because the structure is based on mathematical set theory, not application-specific hierarchies.

Key terminology:

  • Primary Key: A column (or combination) that uniquely identifies each row. Every table should have one.
  • Foreign Key: A column that references a primary key in another table, establishing relationships.
  • Candidate Key: Any column(s) that could serve as primary key. Choose the simplest, most stable one.
  • Composite Key: A primary key consisting of multiple columns.

2. SQL: The Lingua Franca of Data

SQL (Structured Query Language) is the standard language for interacting with relational databases. It’s declarative—you describe what you want, not how to get it. The database engine figures out the optimal execution plan.

The Four Pillars of SQL

DDL DML DCL TCL
Category Purpose Key Commands
DDL (Data Definition Language) Define and modify database structure CREATE, ALTER, DROP, TRUNCATE
DML (Data Manipulation Language) Query and modify data SELECT, INSERT, UPDATE, DELETE
DCL (Data Control Language) Manage permissions GRANT, REVOKE
TCL (Transaction Control Language) Manage transactions COMMIT, ROLLBACK, SAVEPOINT

JOINs: The Heart of Relational Queries

JOINs combine rows from two or more tables based on related columns. Understanding the different types is non-negotiable:

-- INNER JOIN: Only rows with matches in both tables
SELECT students.name, courses.title
FROM students
INNER JOIN enrollments ON students.id = enrollments.student_id
INNER JOIN courses ON enrollments.course_id = courses.id;

-- LEFT JOIN: All rows from left table, matched rows from right (NULL if no match)
SELECT students.name, enrollments.grade
FROM students
LEFT JOIN enrollments ON students.id = enrollments.student_id;

-- FULL OUTER JOIN: All rows from both tables (not supported in MySQL)
SELECT *
FROM table1
FULL OUTER JOIN table2 ON table1.id = table2.id;

Beyond basic joins, master these essential SQL patterns:

  • Aggregation with GROUP BY: COUNT(), SUM(), AVG(), MAX(), MIN() combined with GROUP BY for summary statistics.
  • Filtering groups with HAVING: WHERE filters rows before grouping; HAVING filters groups after aggregation.
  • Subqueries and CTEs (Common Table Expressions): Use WITH clauses to break complex queries into readable, reusable pieces.
  • Window Functions: ROW_NUMBER(), RANK(), LAG(), LEAD() for analytics without collapsing rows.
SQL query results displayed in database management interface

3. Database Design and Normalization

Bad database design leads to three nightmares: data redundancy (same info stored multiple times), update anomalies (changing data in one place but not others), and delete anomalies (losing data you wanted to keep). Normalization is the systematic cure.

The Normal Forms (1NF through BCNF)

1NF 2NF 3NF BCNF
Normal Form Requirement Problem It Solves
1NF (First) Atomic values; no repeating groups or arrays in a single cell Basic structure; each cell holds one value
2NF (Second) 1NF + no partial dependencies (non-key attributes depend on entire primary key) Redundancy in tables with composite keys
3NF (Third) 2NF + no transitive dependencies (non-key attributes don’t depend on other non-key attributes) Most common design target; eliminates most anomalies
BCNF (Boyce-Codd) 3NF + every determinant is a candidate key Edge cases where 3NF still has anomalies

The mnemonic: “Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.”

⚠️ Denormalization: When to Break the Rules

Normalization reduces redundancy but increases the number of tables—and therefore the number of JOINs required for queries. In read-heavy analytical systems (data warehouses), some denormalization is intentional to improve query performance. The key is knowing that you’re making a deliberate tradeoff, not a design mistake.

4. Indexing: How Databases Find Your Data Fast

Without indexes, a SELECT query with a WHERE clause scans every row in the table sequentially. With millions of rows, this is catastrophic. Indexes are the difference between millisecond and minute-long queries.

B-Trees: The Default Index Structure

B-Tree Clustered vs Non-Clustered Query Optimization

How B-Trees work: A B-Tree is a balanced tree structure where all leaf nodes are at the same depth. Each node contains multiple keys and pointers to child nodes. This structure enables O(log n) lookups, range scans, and sorted retrievals. Most database indexes (PostgreSQL, MySQL InnoDB) use B-Trees by default.

Clustered vs. Non-Clustered Indexes:

  • Clustered Index: The table data itself is physically organized according to the index key. Only one per table. In MySQL InnoDB, the primary key is always the clustered index.
  • Non-Clustered Index: A separate structure that stores index keys and pointers (row IDs or clustered keys) to the actual data rows. Multiple allowed per table.

When indexes hurt: Every index speeds up SELECT but slows down INSERT, UPDATE, and DELETE because the index must be maintained. Indexing every column is a classic beginner mistake.

Creating Effective Indexes

-- Single-column index on frequently filtered column
CREATE INDEX idx_students_email ON students(email);

-- Composite index for queries filtering on multiple columns
-- Order matters: most selective column first
CREATE INDEX idx_enrollments_student_course ON enrollments(student_id, course_id);

-- Unique index ensures no duplicates (automatically created for PRIMARY KEY)
CREATE UNIQUE INDEX idx_users_username ON users(username);

-- Partial index for queries that target a subset of rows
CREATE INDEX idx_active_users ON users(last_login) WHERE status = 'active';

5. Transactions and ACID Properties

Imagine transferring $100 from savings to checking. The system debits savings, then—before crediting checking—the server crashes. Your money just vanished. Transactions prevent this nightmare.

ACID Explained

Atomicity Consistency Isolation Durability
Property What It Means Real-World Analogy
Atomicity All operations in a transaction succeed or none do. No partial updates. You can’t be “half-married.” Either the whole ceremony happens or it doesn’t.
Consistency Transactions bring the database from one valid state to another, preserving all constraints. After transferring money, the total balance across all accounts remains unchanged.
Isolation Concurrent transactions don’t interfere with each other. Each appears to run alone. Two people booking the last concert ticket—only one should succeed.
Durability Once committed, transaction results survive system failures. Your bank balance doesn’t reset after a power outage.

Isolation Levels: The Performance-Consistency Tradeoff

Full isolation is expensive. Databases offer weaker isolation levels for better performance. Understanding the anomalies each prevents is critical:

  • Read Uncommitted: Lowest isolation. Allows dirty reads (seeing uncommitted changes). Rarely used.
  • Read Committed: Default in PostgreSQL. Prevents dirty reads but allows non-repeatable reads (same query returns different results within transaction).
  • Repeatable Read: Default in MySQL. Prevents dirty and non-repeatable reads but allows phantom reads (new rows appearing).
  • Serializable: Highest isolation. Transactions execute as if serially. Most expensive.

6. Database Paradigms Beyond Relational

Relational databases aren’t always the right tool. Different data shapes demand different storage models.

NoSQL Family Tree

Document Stores Key-Value Column-Family Graph
Type Examples Best For Avoid For
Document Store MongoDB, Couchbase, Firebase Flexible schemas, nested data, rapid iteration Complex JOINs, strict ACID across documents
Key-Value Store Redis, DynamoDB, etcd Caching, session storage, real-time counters Complex queries, relationships between values
Wide-Column Store Cassandra, HBase, ScyllaDB Massive write throughput, time-series data Ad-hoc queries, frequent schema changes
Graph Database Neo4j, Amazon Neptune, ArangoDB Highly connected data (social networks, recommendations) Simple tabular data, reporting/aggregation

7. Query Optimization and EXPLAIN

Writing correct SQL is step one. Writing fast SQL requires understanding how the database executes your query. The EXPLAIN command is your window into the query planner’s decisions.

Reading an EXPLAIN Plan

EXPLAIN ANALYZE Seq Scan vs Index Scan Cost Estimation
EXPLAIN ANALYZE
SELECT * FROM students WHERE email = 'arj@example.com';

Key signals to watch for:

  • Seq Scan (Sequential Scan): Reading every row. Fine for small tables; disaster for large ones without WHERE clause selectivity.
  • Index Scan: Using an index to find rows. Good, but watch for large result sets where bitmap index scan or sequential scan might be faster.
  • Nested Loop Join: For each row in outer table, scan inner table. Good when one table is small.
  • Hash Join: Build hash table of smaller table, probe with larger. Good for medium-to-large equi-joins.
  • Merge Join: Both inputs sorted on join key, merged efficiently. Requires indexes or explicit sorts.

Common optimization strategies: Add missing indexes, rewrite correlated subqueries as JOINs, use EXISTS instead of IN for subqueries, and ensure statistics are up-to-date with ANALYZE.

✅ The CS Student’s Database Checklist

Before you graduate (or interview), make sure you can:

  • Design a normalized schema for a given domain (e.g., library checkout system)
  • Write JOIN queries across 3+ tables with proper filtering and aggregation
  • Explain the difference between WHERE and HAVING
  • Create appropriate indexes for a given query workload
  • Describe ACID properties and isolation levels
  • Read an EXPLAIN plan and identify performance bottlenecks
  • Write a transaction with proper COMMIT and ROLLBACK handling
  • Choose between relational and NoSQL for a given use case

Databases Are Forever

Frameworks come and go. JavaScript libraries rise and fall. But database fundamentals—the relational model, SQL, indexing, transactions—have remained remarkably stable for decades. Mastering these concepts now pays dividends throughout your entire career, regardless of which technologies dominate the hype cycle.

The best way to internalize these concepts? Build something. Create a small application with a real database backend. Watch your queries slow down as data grows, then fix them with indexes. Experience a partial failure and appreciate what transactions give you. That’s how database theory becomes database intuition.

Learn Python for Backend → CS Database Notes →
Scroll to Top