Normalization in Databases

Normalization is the process of organizing data in a database to reduce redundancy and avoid anomalies. It uses functional dependencies to decide how tables should be structured.

The goal is simple:

Store each fact once, in the right place, with clear dependencies.

Normalization is about logical correctness of design, not performance tuning.


Why Do We Need Normalization?

Poorly designed tables may work initially, but they fail as data grows.

Common problems include:

Insertion Anomaly - You can’t insert some data without inserting unrelated data

Update Anomaly - Same data must be updated in multiple rows

Deletion Anomaly - Deleting one fact accidentally removes another

Normalization provides a systematic way to eliminate these problems.


First Normal Form (1NF)

A relation is in 1NF if:

  • Each attribute contains atomic (indivisible) values
  • No repeating groups or arrays inside a column

Example (Not in 1NF)

student_idcourses
1Math, Physics

After 1NF

student_idcourse
1Math
1Physics

1NF forces databases to behave like proper relations, not nested structures.


Second Normal Form (2NF)

A relation is in 2NF if:

  • It is already in 1NF
  • No partial dependency exists on a composite key

Partial Dependency - When a non-key attribute depends on part of a composite key, not the whole key.

Example

student_idcourse_idstudent_name
1101Alice
1102Alice
  • Here the primary key is the composite (student_id, course_id).
  • But student_name depends only on student_id, not on the full key.
  • That’s a partial dependency, which violates 2NF.

After 2NF

Split into separate tables:

  • Student

    student_idstudent_name
    1Alice
  • Enrollment

    student_idcourse_id
    1102

Third Normal Form (3NF)

A relation is in 3NF if:

  • It is in 2NF
  • No transitive dependency exists

Transitive Dependency


A → B
B → C
-----
So  A → C

Example

Key: (student_id, course_id)

student_idcourse_idcourse_namedepartment
1101CalculusMath dept
1102PhysicsScience dept
  • Here the primary key is the composite (student_id, course_id).
  • department depends on course_id (through course_name), not directly on (student_id, course_id).

After 3NF

Split into separate tables:

  • Student

    student_idstudent_name
    1Alice
  • Course

    course_idcourse_namedepartment
    101CalculusMath dept
    102PhysicsScience dept
  • Enrollment

    student_idcourse_id
    1101
    1102

3NF ensures non-key attributes depend only on the key.


Boyce–Codd Normal Form (BCNF)

BCNF is a stronger version of 3NF.

A relation is in BCNF if:

For every functional dependency X → Y, X must be a super key

BCNF removes edge cases where 3NF still allows anomalies.

Example

Suppose we have a table:

student_idcourse_idinstructor
1101Prof. Arjit
2101Prof. Arjit
3102Prof. Nini
  • Here the primary key is the composite (student_id, course_id).

  • Dependencies:

    • (student_id, course_id) → instructor is fine.
    • But also: course_id → instructor.

    Even though course_id is not a super key, yet it determines instructor, this violates BCNF even though it satisfies 3NF

After BCNF

Split into separate tables:

  • Enrollment

    student_idcourse_id
    1101
    2101
    3102
  • CourseInstructor

    course_idinstructor
    101Prof. Arjit
    102Prof. Nini

Conclusion

The journey from 1NF to BCNF shows how each step removes a deeper class of redundancy or dependency. In practice, most databases stop at 3NF or BCNF, striking a balance between correctness and usability.