Database Normalization and the First Three Normal Forms

Blog Post Published on:	26th October 2022
Title:	Database Normalization and the First Three Normal Forms
Lead Author:	Fred M. Beshears
Type of Blog Post:	mycirclepines_website

Database Normalization

https://en.wikipedia.org/wiki/Database_normalization

“””
Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model.

Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).
"""

Objectives

A basic objective of the first normal form defined by Codd in 1970 was to permit data to be queried and manipulated using a “universal data sub-language” grounded in first-order logic.[1] (SQL is an example of such a data sub-language, albeit one that Codd regarded as seriously flawed.)[2]

The objectives of normalization beyond 1NF (first normal form) were stated as follows by Codd:

1. To free the collection of relations from undesirable insertion, update and deletion dependencies;

2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;

3. To make the relational model more informative to users;

4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

—E.F. Codd, “Further Normalization of the Data Base Relational Model”[3]

An update anomaly. Employee 519 is shown as having different addresses on different records.

An insertion anomaly. Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, his or her details cannot be recorded.

A deletion anomaly. All information about Dr. Giddens is lost if he or she temporarily ceases to be assigned to any courses.

When an attempt is made to modify (update, insert into, or delete from) a relation, the following undesirable side-effects may arise in relations that have not been sufficiently normalized:

Update anomaly. The same information can be expressed on multiple rows; therefore updates to the relation may result in logical inconsistencies. For example, each record in an “Employees’ Skills” relation might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee may need to be applied to multiple records (one for each skill). If the update is only partially successful – the employee’s address is updated on some records but not others – then the relation is left in an inconsistent state. Specifically, the relation provides conflicting answers to the question of what this particular employee’s address is. This phenomenon is known as an update anomaly.

Insertion anomaly. There are circumstances in which certain facts cannot be recorded at all. For example, each record in a “Faculty and Their Courses” relation might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code. Therefore we can record the details of any faculty member who teaches at least one course, but we cannot record a newly hired faculty member who has not yet been assigned to teach any courses, except by setting the Course Code to null. This phenomenon is known as an insertion anomaly.

Deletion anomaly. Under certain circumstances, deletion of data representing certain facts necessitates deletion of data representing completely different facts. The “Faculty and Their Courses” relation described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member, unless we set the Course Code to null. This phenomenon is known as a deletion anomaly.

“””

Functional Dependency

https://en.wikipedia.org/wiki/Functional_dependency

“””
A functional dependency is a constraint that describes the relationship between attributes in a relation.

In other words, a dependency FD: X → Y means that the values of Y are determined by the values of X.
“””

For example, an employee’s social security number (i.e. attribute X) determines the name of the employee (i.e. attribute Y). So, if you know an employee’s SSN, then you can look up the employee’s name.

Note that just because an employee’s name is functionally dependent on their social security number, this does not mean that social security numbers are functionally dependent on employee names. You may have two employees with the same name (e.g. John Doe), but they should have two different social security numbers.

Database Keys

https://www.geeksforgeeks.org/dbms-keys-candidate-super-primary-alternate-and-foreign/

“””

Candidate Key: The minimal set of attribute which can uniquely identify a tuple is known as candidate key. For Example, STUD_NO in STUDENT relation.

The value of Candidate Key is unique and non-null for every tuple.

There can be more than one candidate key in a relation. For Example, STUD_NO as well as STUD_PHONE both are candidate keys for relation STUDENT.

The candidate key can be simple (having only one attribute) or composite as well. For Example, {STUD_NO, COURSE_NO} is a composite candidate key for relation STUDENT_COURSE.

Super Key: The set of attributes which can uniquely identify a tuple is known as Super Key. For Example, STUD_NO, (STUD_NO, STUD_NAME) etc.

Adding zero or more attributes to candidate key generates super key.

A candidate key is a super key but vice versa is not true.

Primary Key: There can be more than one candidate key in a relation out of which one can be chosen as primary key. For Example, STUD_NO as well as STUD_PHONE both are candidate keys for relation STUDENT but STUD_NO can be chosen as primary key (only one out of many candidate keys).

Alternate Key: The candidate key other than primary key is called as alternate key. For Example, STUD_NO as well as STUD_PHONE both are candidate keys for relation STUDENT but STUD_PHONE will be alternate key (only one out of many candidate keys).

Foreign Key: If an attribute can only take the values which are present as values of some other attribute, it will be foreign key to the attribute to which it refers. The relation which is being referenced is called referenced relation and corresponding attribute is called referenced attribute and the relation which refers to referenced relation is called referencing relation and corresponding attribute is called referencing attribute. Referenced attribute of referencing attribute should be primary key. For Example, STUD_NO in STUDENT_COURSE is a foreign key to STUD_NO in STUDENT relation.
“””

First Normal Form

https://en.wikipedia.org/wiki/First_normal_form

“””
First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain.[1] The first definition of the term, in a 1971 conference paper by Edgar Codd, defined a relation to be in first normal form when none of its domains have any sets as elements.[2]

First normal form is an essential property of a relation in a relational database. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement.

First normal form enforces these criteria:

Eliminate repeating groups in individual tables.
Create a separate table for each set of related data.
Identify each set of related data with a primary key

“””

Second Normal Form

https://en.wikipedia.org/wiki/Second_normal_form

“””
Second normal form (2NF) is a normal form used in database normalization. 2NF was originally defined by E.F. Codd in 1971.[1]

A relation that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a relation is in 2NF if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the relation.

A non-prime attribute of a relation is an attribute that is not a part of any candidate key of the relation.

Put simply, a relation is in 2NF if it is in 1NF and every non-prime attribute of the relation is dependent on the whole of every candidate key.

A functional dependency on part of any candidate key is a violation of 2NF. In addition to the primary key, the relation may contain other candidate keys; it is necessary to establish that no non-prime attributes have part-key dependencies on any of these candidate keys.

2NF and candidate keys

Multiple candidate keys occur in the following relation:

Even if the designer has specified the primary key as {Model Full Name}, the relation is not in 2NF because of the other candidate keys. {Manufacturer, Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset of it: Manufacturer. To make the design conform to 2NF, it is necessary to have two relations:

“””

Third Normal Form

https://en.wikipedia.org/wiki/Third_normal_form

“””
Third normal form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that (1) the entity is in second normal form, and (2) all the attributes in a table are determined only by the candidate keys of that relation and not by any non-prime attributes. 3NF was designed to improve database processing while minimizing storage costs. 3NF data modeling was ideal for online transaction processing (OLTP) applications with heavy order entry type of needs.[1]

The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd in 1971.[2] Codd’s definition states that a table is in 3NF if and only if both of the following conditions hold:

The relation R (table) is in second normal form (2NF)
Every non-prime attribute of R is non-transitively dependent on every key of R.

A non-prime attribute of R is an attribute that does not belong to any candidate key of R.[3]

A transitive dependency is a functional dependency in which X → Z (X determines Z) indirectly, by virtue of X → Y and Y → Z (where it is not the case that Y → X).[4]

Nothing but the key

An approximation of Codd’s definition of 3NF, paralleling the traditional pledge to give true evidence in a court of law, was given by Bill Kent: “[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.”[7] A common variation supplements this definition with the oath: “so help me Codd”.[8]

Requiring existence of “the key” ensures that the table is in 1NF; requiring that non-key attributes be dependent on “the whole key” ensures 2NF; further requiring that non-key attributes be dependent on “nothing but the key” ensures 3NF. While this phrase is a useful mnemonic, the fact that it only mentions a single key means it defines some necessary but not sufficient conditions to satisfy the 2nd and 3rd Normal Forms. Both 2NF and 3NF are concerned equally with all candidate keys of a table and not just any one key.

Chris Date refers to Kent’s summary as “an intuitively attractive characterization” of 3NF, and notes that with slight adaptation it may serve as a definition of the slightly stronger Boyce–Codd normal form: “Each attribute must represent a fact about the key, the whole key, and nothing but the key.”[9] The 3NF version of the definition is weaker than Date’s BCNF variation, as the former is concerned only with ensuring that non-key attributes are dependent on keys. Prime attributes (which are keys or parts of keys) must not be functionally dependent at all; they each represent a fact about the key in the sense of providing part or all of the key itself. (This rule applies only to functionally dependent attributes, as applying it to all attributes would implicitly prohibit composite candidate keys, since each part of any such key would violate the “whole key” clause.)

An example of a 2NF table that fails to meet the requirements of 3NF is:

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

“””

Artificial Keys

Artificial Keys are keys that have no business meaning. Some data modeling experts believe that many, if not all, tables should have an artificial primary key. Further, they believe that these keys should not be exposed to the end user. They see this as one aspect of good information hiding practice.

Note that the use of Artificial Primary Keys may violate the principles of data normalization.

Some object-relational mapping systems, such as the one used by Ruby on Rails, require that all tables have an artificial primary key. Most relational database systems can automatically generate a new, unique artificial key each time a row is added to a table. Artificial keys are also know as Surrogate Keys.

Relational model

https://en.wikipedia.org/wiki/Relational_model

“””
The relational model (RM) for database management is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by Edgar F. Codd, where all data is represented in terms of tuples [rows], grouped into relations [tables]. A database organized in terms of the relational model is a relational database.

Diagram of an example database according to the relational model

In the relational model, related records are linked together with a “key”.
The purpose of the relational model is to provide a declarative method for specifying data and queries: users directly state what information the database contains and what information they want from it, and let the database management system software take care of describing data structures for storing the data and retrieval procedures for answering queries.
“””

References

Boyce-Codd Normal Form
https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form

Fourth Normal Form
https://en.wikipedia.org/wiki/Fourth_normal_form

Fifth Normal Form
https://en.wikipedia.org/wiki/Fifth_normal_form

Sixth Normal Form
https://en.wikipedia.org/wiki/Sixth_normal_form

Domain-key Normal Form
https://en.wikipedia.org/wiki/Domain-key_normal_form

Denormalization
https://en.wikipedia.org/wiki/Denormalization

Object-relational mapping
https://en.wikipedia.org/wiki/Object-relational_mapping

Object-relational impedance mismatch
https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch

Surrogate Key
https://en.wikipedia.org/wiki/Surrogate_key

Information Hiding
https://en.wikipedia.org/wiki/Information_hiding