Demystified: Database Normalization

I still remember those days at the university when I had to learn a lot of database related concepts, like composite primary keys, foreign keys, normalization, ERD (Entity-Relationship Diagram), referencial integrity, functional dependencies, and so on. At that time, most of these concepts seemed to be rubbish.

image

I started figuring out on my own how you design databases, and after some years of designing lots of databases I got myself thinking “darn I’m good at this!”. But there always comes that time when you have to do more than just design databases. You have to maintain them, to modify them for new data, and these “simple” tasks were driving me crazy. Why ? Because my databases were poorly designed, contained a lot of flaws. I started to feel very frustrated, so I chose to get some books on database design and started to read about those once useless concepts, and then I saw that I was very bad at designing databases.

The most difficult concept in database design is normalization. Basically, normalization means organizing your data into optimized table structures by following one or more sets of rules. To do this, you first have to analyze your data very carefully, mainly focusing on the relationships between all your data. By organizing your data you optimize your database, so you gain more speed for your queries and your application’s response time, you reduce the storage space needed by avoiding repetitive entries (redundancy), maintenance becomes a simple task, and the chance of breaking your database’s integrity goes from slim to even none.

So if you have a database that becomes increasingly difficult to maintain or your data gets all messed up and altered unwittingly, then it’s time for you to learn about normalization.

image

Databases=Complicated

Before I move on to normalization and the normal forms, I have to introduce (or remind) you some basic key concepts that will help you better understand the normalization process:

  • entity – every table in your database stores information about something, and that something is the entity; so if you want to create a database with all the employees of a company and their individual activities, you would create a table for the employees (the Employees entity), and a table for all the activities of that company (the Activities entity).
  • primary key - every table is structured into columns (or fields), each field containing a piece of information about the table’s data; for example, in the Employees table we have several fields like the employee’s name, address, the activity he’s assigned to, and so on. But how could we differentiate between two employees with the same name ? The solution is to have another field that uniquely identifies each employee, and this field is called a primary key. Most of the time, the primary key is nothing but a field of numbers, so in the above example, employees would be identified by their EmployeeID. But primary keys can be more than just one field in our tables. There are cases when primary keys consist of several fields from the table, when a simple ID wouldn’t be enough to uniquely identify each row of information (or record) within the table.
  • relationship – like the name suggests, a relationship is a link between entities, but it’s also a link between an entity’s fields. There are three possible relationships you could have in a database:
    • one-to-one (1-1) – this assumes that a record of one entity relates to just a single record of another entity, and vice versa. Lets take an example: a programmer has to do a specific task for a big project, and that task is assigned only to him and nobody else.
    • one-to-many (1-*) – in this case, a record of one entity can relate to one or more records of another entity. A good example would be of a teacher who teaches i several classrooms. So a teacher isn’t related to just one classroom, while a classroom is related to just one teacher.
    • many-to-many (*-*) – this relationship is a combination between a 1-* and a *-1 relationship. So we would say that many records of an entity relate to many records of another entity. As an example, a teacher teaches more classes, and a class is held by more teachers.
  • foreign key - a foreign key is often misunderstood as being a primary key within a related table in a 1-* relationship, but it’s definitely not a primary key. Lets take an example to better illustrate what a foreign key actually means. We have two tables in the image below: Teachers and Classrooms. Each teacher is assigned at least one classroom, and each classroom is assigned to at least one teacher.

clip_image002[4]

As you can see, the highlighted class field in the Teachers table is not a primary key, but a foreign key used to identify a class by its ID (the cID primary key) within the Classrooms table. So a foreign key consists of one or more fields within an entity whose value(s) match a primary key in another entity.

  • composite key - a composite key is a key (generally a primary key) composed of at least two fields within a table.
  • candidate key – a candidate key is a field that can be part of the primary key to uniquely identify a record; each entity can have more candidate keys, but it’s not necessary to use them all as part of the primary key.
  • alternate key – an alternate key is a candidate key that was not used as part of the primary key.
  • ERD (or Entity-Relationship Diagram) – an ERD is just a visual representation (a diagram) of a database’s structure (generally a representation of the tables within the database and the relationships between them). This diagram helps the database designer to better visualize his or her data, and can be drawn with the help of a specialized software or even by hand. The above image is a simple example of an ERD.
  • “thorough” – that’s actually not a key concept in database design, but I really have to remind you that normalizing a database involves thorough analysis of the data stored in it so that you can break it down into smaller pieces of interrelated information.

Now it’s time to move forward and talk about more serious things. Normalization is made by following a set of specific rules that apply to table structures, and these rules are grouped into so called Normal Forms. Excepting the first normal form (1NF), each normal form depends on the previous one. So you can’t take a table into the third NF without going with it through the 2NF and 1NF. How many of these normal forms are there ? Well, there are quite a few, but most of the database designers tend to stop at the 3NF (including it). We have the 1NF, 2NF, 3NF, BCNF (Boyce-Codd Normal Form), 4NF, 5NF, DKNF (Domain-Key Normal Form), 6NF, ONF (Optimal Normal Form), 7NF, N1NF (Non-first Normal Form), Denormalization, and it is said that there’s also a 8NF. With each new NF applied to your tables, the normalization process becomes so complex that you could get headaches. So let’s step through some normal forms and see how they make our lives easier.

image

First Normal Form (1NF)

Normalization isn’t mandatory, but it can make your life so much easier. Most of the database designers say that a database is normalized if you got it through at least the 1NF. So let’s start with this 1NF. The 1NF is the easiest one. First of all, it tells you to break all the fields of a table into atomic fields. An atomic field is a field whose value can be no longer broken into smaller parts. For example, if you had a table of employees with one of the fields being the employee’s name, this name field can be broken into two smaller parts: the first name and the last name. Another example is that of an address that is made of the streetName, buildingNumber, floor, apartmentNumber, city, country, and so on.

After we’ve done that, 1NF tells us to get rid of all repeating groups of data (groups of columns whose values remain the same for at least two records) by breaking our table into more interrelated tables. Below are two table structures, the left one that is completely messed up (there are fields that can be transformed to atomic fields), and the right one that solved the atomic fields issue but it still has groups of repeating data.

clip_image004[4]

Lets look now at the data they store. Below is the contents of left table from above, books1. We can see that the field author is not an atomic one.

clip_image006[4]

So we make it an atomic field by eliminating the author field and introducing two other fields, the authorFirstName and authorLastName fields:

clip_image008[4]

So we got past the first step of getting all our fields to atomic fields. But there’s still the repetition problem. Each book can have more than just one author, so for each of the authors we end up repeating 5 fields: title, category, price, items, and published. I have to admit that for a database with just under 100 books this is not a big problem, but what if we had a database with over a million books ? Things could go way beyond our hands. So let’s listen to the all mighty 1NF and break our table into smaller tables so that we avoid repetitive entries. First of all we have to review the information stored in this table. It contains information about some books (title, price, items in the store, published date) and the books’ authors (their names and the type of books they write). So we have to create a table for the books (without mentioning the authors in it), then one for the authors, and finally we have to create a table that links the books to the authors, like so:

clip_image010[4]

And here is how the information looks like in each table:

clip_image012[4]

clip_image014[4]

clip_image016[4]

As you can notice, every table contains atomic fields and there are no more repetitions among them, except the last table. The last table contains the links between each book and its authors, and there still are repetitive entries, but they come as numbers and not as text, and we also reduced the number of repetitive fields from 5 to 2.

Now imagine how it could be to have a table with lots of large-text fields that would repeat themselves for some hundred of times! So this is the lowest level you can get to.
Let’s review the rules of the 1NF:

  1. the table must contain only atomic fields
  2. eliminate redundant information by breaking your table into more smaller tables

Second Normal Form (2NF)

The 2NF says that all fields except the primary key’s fields have to depend on the whole primary key. So 2NF has two paths: the “easy” path, and the “hard” path. The easy path is when you have a table with a simple primary key (that is, a primary key composed of only one field). Why is it the “easy” path ? Because a table with a simple primary key that is in the 1NF is automatically in the 2NF! Yep, your heard me! So you have nothing to do to bring your table in the 2NF, because it’s already in the 2NF. Isn’t that great ? But I have to spoil your enthusiasm, because I haven’t told you what the “hard” path is.

Well, the hard path is only for those who have tables with composite primary keys. Why is it hard ? Because you have to make all your non-key fields fully dependent on every part of the primary key. So let’s take an example, shall we ?

clip_image018[4]

On the left we have a table for the employees of a company. Each employee is described by his or her name and skill. For the sake of this normal form, I chose to make a composite primary key composed of the employee’s ID and skill. So why is this table not in the 2NF ? If we check our single non-key field (that is, the eName field), we can easily notice that it depends only on the ID field. As a person, you cannot depend on your skill to identify yourself among the others, but if the ID was a social number, you could easily identify yourself with it. So here it is: the non-key fields are not fully dependent on the whole primary key. But how could you normalize this situation ? Either you change the primary key’s components, or you make it a simple primary key, or even break this table into two tables (a table for skills, and one for the employees, like we did at the 1NF).

Let’s review the rules of the 2NF:

  1. the table must be in the 1NF
  2. all the non-key fields are fully dependent on the whole primary key

Third Normal Form (3NF)

The 3NF is going to be the last normal form I’ll discuss in this article, since the rest of the normal forms could possibly get you all confused, and I have to admit that I, myself, don’t go past the 3NF when I design my databases.

The 3NF says that every non-key field must depend only on the primary key. Easier said, if you have a non-key field that depends on the whole primary key but also depends on an alternate key (that is, another non-key field), then your table is not in the 3NF. Below is another table illustrating this situation:

clip_image020[4]

This table contains information about the students from a highschool that are also hired at different companies. What’s the purpose of this table ? I really don’t know, but it explains very well this situation. So, we have a simple primary key (sID), and for each student we store vital information related to the company he’s or she’s working at. But why isn’t it in the 3NF ? Let’s assume that every non-key field depends on the primary key, ok ? But the companyNumber also depends on the company, so this is called a transitive dependency. To normalize this problem, you could easily break things into smaller things, that means you could split your table in two tables, one for the companies and the other for the students. Yep, normalization is always about splitting tables into more smaller tables, but you should also make some compromise that will eventually make your life easier, even if not at first sight.

Let’s review the rules of the 3NF:

  1. the table must be in the 2NF
  2. every non-key field must depend only on the whole primary key, in other words non-key fields must depend non-transitively on the primary key.

So, are you ready to take the next step in designing your databases?

Article written by Ursan-Mihalcea Razvan.

About these ads
3 comments
  1. Alexander said:

    Very good article. Keep in mind that like anything else good database design goes hand in hand with knowing the use cases that will use them and designing accordingly. That said, sometimes full normalization brings with it performance penalties. In my experience databases supporting business intelligence activities (BI a.k.a. "data warehouses" in previous lives) require some degree of denormalization to perform optimally. Test, test, test and denormalize if necessary.

  2. Alexander said:

    Sorry, English is not my native language… I think this version of my previous post is more understandable.

    Very good article. Like anything else good database design must take into account the nature of the applications that will use it. That said, sometimes full normalization brings with it performance penalties. In my experience databases supporting Business Intelligence activities (BI a.k.a. "data warehouses" in previous lives) require some degree of denormalization to perform optimally. Test, test, test and denormalize if necessary.

  3. ROCK said:

    Good Article.

Follow

Get every new post delivered to your Inbox.