SQL databases are reliable for complex queries, partial updates, transactions, and decoupling data modeling from application specific contexts. In this post we’ll cover how to normalize tables in 1NF, 2NF, and 3NF.
Image may be NSFW.
Clik here to view.
We have a table.
Name | Position | Team | Jersey No. | Home Stadium | Stadium City | Salary & Contract |
---|---|---|---|---|---|---|
Aaron Rodgers | QB | Packers | 12 | Lambeau Field | Green Bay | $22M, 5 years |
J.J. Watt | DL | Texans | 99 | NRG Stadium | Houston | $16.7M, 6 years. |
Earl Thomas | DB | Seahawks | 29 | Century Link Field | Seattle | $10M, 4 years |
Andrew Luck | QB | Colts | 12 | Lucas Oil Stadium | Indianapolis | $5.5M, 4 years |
Some issues off the cuff, right? What if I wanted to add a new team but had no player? What if a player plays multiple positions? Or two teams share the same home stadium? All these deal with data redundancy. And what’s the primary key, anyways?
1NF #
1NF defines the shape of a table.
Concepts
- Unique rows - A required primary key makes each row unique
- Atomacity - In every row, each column has a single value. Data in each cell won’t be decomposed further[1]
Benefits
Atomacity provides general purpose queries where selects and updates occur to single values. It contrasts with queries that need to parse, substring, and concatenate values, leaving room for data integrity errors.
Let’s break Salary
and Contract
into their own columns.
Name | Position | Team | Jersey No. | Stadium | Stadium City | Salary | Contract (yrs) |
---|---|---|---|---|---|---|---|
Aaron Rodgers | QB | Packers | 12 | Lambeau Field | Green Bay | $22M | 5 |
J.J. Watt | DL | Texans | 99 | NRG Stadium | Houston | $16.7M | 6 |
Earl Thomas | DB | Seahawks | 29 | Century Link Field | Seattle | $10M | 4 |
Andrew Luck | QB | Colts | 12 | Lucas Oil Stadium | Indianapolis | $5.5M | 4 |
In our table we could have many players with many same values– same name, same position, same team, or others. So the minimal fields to identify a unique row is:
{Team}, {Jersey_No}
We’re in 1NF.
2NF #
Both 2NF and 3NF deal with functional dependencies between key and non-key fields.
Concepts
A field Y
is “functionally dependent” on field(s) X
if X
‘s value always produces the same Y
value[2]. One example is how Stadium City
functionally depends on Home_Stadium
, also said Home_Stadium
determines Stadium City
. Lambeau Field is always in Green Bay, WI.
2NF is met when a table is in 1NF and all non-key fields depend on the entire key-ed fields. Tables with only one key field are automatically in 2NF[2].
Benefits
2NF minimizes data redundancy. Recall our primary key from 1NF,
{Team}, {Jersey_No}
Non-key fields Name
,Position
, Salary
and Contract
depend on the entire primary key, but Home_Stadium
and Stadium_City
only depend on Team
! Every time we insert a new row where Team
is the same, Home_Stadium
and Stadium_City
become re-inserted as duplicates. So say we insert Marshawn Lynch, a second Seahawk.
Name | Position | Team | Jersey No. | Home Stadium | Stadium City | Salary | Contract (yrs) |
---|---|---|---|---|---|---|---|
Earl Thomas | DB | Seahawks | 29 | CenturyLink Field | Seattle | $10M | 4 |
Marshawn Lynch | RB | Seahawks | 24 | CenturyLink Field | Seattle | $12M | 2 |
Aha! CenturyLink Field
and Seattle
are duplicate data.
More abstractly, fields relying on a subset of the primary key produce redundancy in each row where the subset is the same and the non-subset is different.
To acheive 2NF we move these into tables where non-key fields depend on the entire key field.
Team
Team | Home Stadium | Stadium_City |
---|---|---|
Packers | Lambeau Field | Green Bay |
Texans | NRG Stadium | Houston |
Seahawks | Century Link Field | Seattle |
Colts | Lucas Oil Stadium | Indianapolis |
primary key: {Team}
Roster
Team | Jersey_No | Position | Name | Salary | Contract |
---|---|---|---|---|---|
Packers | 12 | QB | Aaron Rodgers | $22M | 5 |
Texans | 29 | DL | J.J. Watt | $16.7M | 6 |
Seahawks | 99 | DB | Earl Thomas | $10M | 4 |
Colts | 12 | QB | Andrew Luck | $5.5M | 4 |
Seahawks | 24 | RB Marshawn Lynch | $12M | 24 |
composite primary key: {Team}, {Jersey_No}
Better. Level up.
3NF #
Concept
3NF furthers the principle that tables serve a single purpose. It prevents a non-key field C
from being functionally dependent on another non-key field B
, making C
have transitive-dependence on primary key A
.
- In our
Roster
tableName
determinesSalary
andContract
making them transitive dependencies on primary keysTeam
andPosition
. - Also, in the
Teams
tableHome_Stadium
determinesStadium_City
making it transitively dependent onTeam
.
Benefits
Non-transitive dependencies minimize data redundancy. 3NF is achieved by moving all non-key to non-key functional dependencies to their own table[2].
Our four normalized tables.
Roster
Team | Jersey_No | Position | Name |
---|---|---|---|
Packers | 12 | QB | Aaron Rodgers |
Texans | 29 | DL | J.J. Watt |
Seahawks | 99 | DB | Earl Thomas |
Colts | 12 | QB | Andrew Luck |
Seahawks | 24 | RB | Marshawn Lynch |
composite primary key: {Team} {Jersey_No}
Player
Name | Salary | Contract (yrs) |
---|---|---|
Aaron Rodgers | $22M | 5 |
J.J. Watt | $16.7M | 6 |
Earl Thomas | $10M | 4 |
Andrew Luck | $5.5M | 4 |
primary key: {Name}
Team
Team | Stadium |
---|---|
Packers | Lambeau Field |
Texans | NRG Stadium |
Seahawks | Century Link Field |
Colts | Lucas Oil Stadium |
primary key: {Team}
Stadium
Stadium | Stadium_City |
---|---|
Lambeau Field | Green Bay |
NRG Stadium | Houston |
Century Link Field | Seattle |
Lucas Oil Stadium | Indianapolis |
primary key: {Stadium}
We can now add more teams, players, positions and stadiums with much less redundancy.
What about Serial PKs? #
Serial primary keys prevent collisions and allow for more variable data as they remain unchanged. But they offer no value in the logical data model and organization, so they shouldn’t act as primary keys when normalizing tables. First create normalized tables, then add Serial PKs where necessary.
In this post we covered steps to create tables free of modification anomalies, minimizing the likelihood of redesign. It’s important to note these are guidelines and at times it’s worth the engineering trade off of less tables to improve performance at the expense of some data redundancy[4].
Normalized SQL tables have primary keys, atomic values (1NF), non-key fields that depend on the entire key fields (2NF), and with non-keys that only depend on key fields (3NF). Our NFL example used small tables but think of how much data sites like ESPN holds. If you have questions or comments, please reach out on
LinkedIn.
References #
[1] Date, C. (2006). Date on database: Writings 2000-2006. Berkeley, CA: Apress, Chapter 8.
[2] William Kent, “A Simple Guide to Five Normal Forms in Relational Database Theory”, Communications of the ACM 26(2), Feb. 1983, 120-125.
[3] Date, C. J. (2000), An Introduction to Database Systems (8th ed.). Addison-Wesley Longman
[4] Obasanjo, D., When Not To Normalize Your SQL Database. http://www.25hoursaday.com