Advertisements
Posted by: jsonmez | March 24, 2010

Naming Database Columns, When to Violate DRY

DRY = Don’t Repeat Yourself.

For the most part this is a good principle to follow.  But, there is one particular instance where I feel repeating myself makes things more clear.

It is very important for us to remember that principles are important, but ultimately they are very strong guidelines, not absolutes.  (It is hard to imagine that I just said that, since I am usually arguing on the merit of principles, but I digress.)

Traditions

In particular, when creating database tables, most people will create a table structure that looks like:

Looks perfectly normal and good.  But, I don’t like it.  Imagine you have to join these tables together to create a query.  (A pretty common occurence.)

SELECT * FROM Book
JOIN Author ON Book.author_id = Author.id
JOIN Publisher ON Book.id = Publisher.publisher_id

Did you catch my mistake above?  Probably not.  Most people would overlook it.  With the ER diagram above, and not very many columns on the tables, it is a little bit easier to see.

It is very easy to mix up which table is the parent and which is the child.  It is very easy to put the .id on the wrong side of the join clause.  The correct SQL should have been:

SELECT * FROM Book
JOIN Author ON Book.author_id = Author.id
JOIN Publisher ON Book.publisher_id = Publisher.id

Bucking the trend

Now take a look at this ER diagram:

The corresponding SQL for selecting from this database is:

SELECT * FROM Book
JOIN Author ON Book.author_id = Author.author_id
JOIN Publisher ON Book.publisher_id = Publisher.publisher_id

It is a small change, but it makes a huge difference.  Writing SQL to join the tables is much easier and much more difficult to get wrong.  Yes, I am repeating myself here, but I am doing it for a very good reason.  I am making the SQL much more clear and I am making it much easier to talk about these tables.

Think about the last time you had a conversation about database tables.  Did you say id?  Did you then have to qualify which id you were talking about?  If you never have “id”, you eliminate a large amount of ambiguity.

The major argument against what I am suggesting is that you don’t need to name the id for the book, book_id because it is a column on book, so you know that id refers to the id of the book.  I do not disagree with this.  For the most part that principle is sound.  The question though is “which one is easier to read, write and talk about?”

Advertisements

Responses

  1. Your article has a reasonably put argument, but then you spoil it in the last paragraph by claiming it would be acceptable to leave the ID of the Book table simply left as ID.

    It seems to me you’ve fallen into a very well used trap of failing to consider the future scope of the application.

    What if some future requirement necessitates the introduction of another table that has a foreign key to the Book table. That table is then going to suffer from having to join on Book.ID in the same manner as your Author and Publisher did initially.

    If you argument is to use fully qualified column names for foriegn keys, it should be applied consistently.

    • You are correct. I didn’t mean my last paragraph to say that we should do that. I was intending to indicate that some people may argue against what I had said, by saying “you don’t need to name the id for the book, book_id because it is a column on book, so you know that id refers to the id of the book”

      I don’t disagree with that logic, but overall I think the option I present here is better.

      Thanks for pointing that out though. I wasn’t very clear in what I was trying to say.

  2. I agree, prefix your id columns with the name of the table for clarity… it’s so easy to your JOINs mixed up otherwise.

    BTW I’m not sure this is against the spirit of DRY, which is mostly about not writing the same code over and over. Using slighter longer column names for clarity would probably be encouraged.

    • It could be considered a violation of DRY because you are repeating the name of the table in the column, but I agree with you DRY is mostly applied to writing the same code over again.

  3. I agree. Clarity in queries can almost eliminate a lot of thought regarding looking at code. Especially if you are new to a schema, that little prefix helps to identify the behavior the original dev went for.

  4. Hmm. I basically agree with your convention for explicitly naming “surrogate” keys like this. “Natural” keys tend not to exhibit this problem. For the book example, wouldn’t “ISBN” be a better choice of key?
    As a general principle, never use a surrogate when a natural key is available.
    This brings to light a broader problem: What’s to stop us putting garbage like “1234” or “Fred” in our “ISBN” column? ISBN numbers have a distinct and validatable structure. Lack of effective type (or “domain” in the relational parlance) support in SQL databases is the *real* problem here. I should be able to define an “ISBN” data type and write declarative constraints such that it must take the format “nnn-n-nnn-nnnnn-n”. We can then define a book table with, for example, a title and an ISBN-typed “ISBN” column and place a unique constraint upon that column (actually an ideal candidate for a primary key with nonclustered index). Now, the database will actually prohibit the matching of ISBNs with surrogate author IDs (or whatever).
    MS SQL Server does feature a user-defined type which will help solve the problem you cite above, although it is sadly lacking in being able to constrain what data you can place in a type.
    It’s worth getting “SQL and Relational Theory” by C.J.Date, and “Practical Issues in Database Management” by Fabian Pascal for true enlightenment in database kung-fu.
    All the best

    • Great points! Thanks, and thanks for the book recommendations.

  5. Just revisiting this article as a result of some recent comment activity. How come your argument has only been applied to the ID columns and not the Name columns in the Author and Publisher tables?

    • Oh, because id is used as the foreign key, so when you do joins, it is more confusing for multiple id columns.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Categories

%d bloggers like this: