Relational Databases

This is a quick overview of relational databases to accompany the lectures run by Dr. Bing Wang for 08119 - IT Skills and Professionalism - at The University of Hull. As usual, this isn't intended to be an instructional write-up, it's designed to help my fellow students in tandem with the lecture materials.

What is a database?

Database is a very broad term: it's any collection of organised data. A library could be seen as a form of database, if you take a book to be a record and consider that books are ordered by various attributes on shelves, then you can very quickly arrive at an appropriate analogy for a database. However, a library is more aptly classed as a data store, but that's beyond the scope of this write up.

Relational databases

For the sake of this module, we're talking about relational databases. A relational database can be represented in a tabular form, where every row in the table is a record and every column is an attribute. Relational refers to how we can link these tables together, and how individual attributes are related to others within a single record.

Terms

Bold terms are the correct terms when talking about databases in theory, italicised terms are more common and are used in practical applications.

  • DBMS stands for Database management system
  • A database is a collection of relations.
  • A relation is a table, or a set of records, that share the same defining schema.
  • A tuple, or record, is a row representing a single item of data. Every tuple must be unique.
  • An attribute or field, is a column that defines the label for that piece of data. For example: all "Address" attributes would be in the same column.
  • A view is a result set which is returned from the DBMS based on a query.

Advantages

  • Data is stored in a standardised format: attributes are identical over multiple records, even if the content is different.
  • Data is not duplicated: Data can be reused for different purposes by linking tables. For example: The HR department can share their personnel records with the management department and the data will be the same between the two.
  • Data can be freely reorganised: A list of personnel could be retrieved by their age, salary or name, depending on what's required at the time.
  • Data is interchangeable: Multiple applications can use the same data source.

Keys

Candidate Keys

A candidate key is any attribute that meets the following criteria: 1. There are no repeated values for that attribute (uniqueness) 2. There is no subset for which the first criteria holds. (minimality)

Candidate keys allow us to identify all the possible ways in which we can refer to records individually.

Primary Keys

One of the candidate keys must be chosen to be the primary key for a relation. This is because, realisticly, we only ever need to refer to individual records in one manner, not multiple.

However, designating any of the attributes in a traditional dataset as "primary" is difficult, as in many cases you can not guarentee that the attribute will always be a candidate key. For this reason, we typically create an additional attribute that will always satisfy the requirements of a candidate key, this is known as a surrogate key.

Surrogate Keys

A surrogate key provides no additional information to a record, other than to ensure it is unique within that relation. An example of a surrogate key would be a Student ID, or Bank Account Number.

Relationships

Relationships are able to link the data in multiple tables together. Relationships can be one of three types:

  • One-to-one
    • Each car has exactly one registration number
    • Each registration number refers to exactly one car
  • One-to-many
    • Each car has exactly one manufacturer
    • Each manafacturer makes many cars
  • Many-to-many
    • Each driver may have many cars
    • Each car may have many drivers

ANSI-SPARC Architecture

The American National Standards Institute, Standards Planning And Requirements Committee architecture presents a standardised three-level design for databases.

This approach is beneficial because:

  • It allows users to be presented with different, customised views based on the same underlying data set.
  • The data is agnostic to its physical storage requirements. The data could be moved from one physical disk to another and function identically.
  • If the database design is changed, the user's experience should remain identical.

Most modern database management systems (DBMS) are based on this architecture, despite the fact it never became a formal standard.

  • The external level describes how the data may be presented in different ways for different users. That is, it provides a way of defining "views" of the data.
  • The conceptual level describes how individual attributes are related to one another on a record-by-record basis and defines any constraints between rows.
  • The internal level describes how the data is stored on a physical level depending on the implementation. This includes storing data in memory, or on a physical disk.

Structured Query Language (SQL)

SQL is probably the simplest language you'll ever learn, using it in practice however can easily be the most complicated and brain-numbingly frustrating programming you've ever done.

What is it?

Structured Query Language is a language used for managing data within databases. Originally conceived by IBM in the early 1970s, it's now the most common data definition language in existence, with most DBMSs supporting it. However, very few of these DBMSs follow the SQL standard to the letter, and many have additional extensions or functions not covered by the original standard.

SQL is commonly used in a client-server basis, where the server is remote from the program creating the queries and requesting the data. In an application, users may be aware that their actions are creating SQL queries under the hood: many data breaches in large-scale web applications have been caused by "SQL injections" in the past few years.

SQL is not a traditional programming language. You can't write real, useful, programs in it, nor should you try. Instead SQL is known as a transformational language: it describes the transformations required between a set of input data and the output required.

SELECT Queries

SELECT Queries are the bread and butter of SQL. They are used to retrieve and transform data from a database for use in your programs. A simple SELECT query might look something like this:

SELECT "Hello World";

This would return a single row with a single field: a string containing "Hello World".

This is simple, trivial and honestly: useless. We could spice this up a little:

SELECT DAYNAME(NOW());

Still pretty trivial, but this does a few more things: the NOW() function returns the current time and date, which DAYNAME() uses to return the name of the current day. It's "Wednesday" as I write this apparently. We're still not working on a database though, that's probably important.

The FROM Keyword

Let's introduce a new keyword: FROM - The FROM keyword allows us to set which table we'd like to retrieve data from. This gives us queries in the form of:

SELECT [fields] FROM [tables];

A practical example of this could be:

SELECT name, date_of_birth, account_number FROM banking_records;

Which, as you can imagine, retrieves multiple rows each filled with a person's name, date of birth and their bank account number. Hopefully this shows you that SQL really just is plain English, with some constraints to help the computer understand you.

The WHERE keyword

We know now how we can pull data on a large scale, however this isn't always useful: If a user is viewing their bank statement, they don't need to know the names and account numbers of every other person in the bank, do they? So how do we filter data? Enter the WHERE keyword, which is used for filtering the result set with a set of conditionals.

The general format for the WHERE keyword is:

SELECT [fields] FROM [tables] WHERE [condition];

It's also worth noting that you can use * to select all the fields in a row. In practice a WHERE clause could be used as follows:

SELECT * FROM banking_records WHERE account_number=667;

This literally translates as: "Select all the rows in the banking records where the account number is 667". Super Simple Stuff.

WHERE clauses can be combined together through the use of the OR and the AND keywords. They can also be prefixed with the NOT keyword to invert the condition. The conditionals in WHERE clauses are very similar to most languages:

  • = - Equal
  • <> - Not Equal
  • > - Greater than
  • < - Less than
  • >= - Greater than or equal to
  • <= - Less than or equal to

The ORDER BY Keyword

Staying with the banking system, let's say we want to write a query that returns all the people in our bank, ordered by their balances from highest to lowest. This is where the ORDER BY keyword comes in.

The general form for ORDER BY is:

SELECT [fields] FROM [tables] WHERE [condition] ORDER BY [field] [DESC/ASC];

(DESC stands for DESCending, ASC stands for ASCending)

So, for our example we'd want something like:

SELECT * FROM banking_records ORDER BY balance DESC;

Hopefully, you don't even need a translation at this stage - it really is that straightforward.

The ROWCOUNT query

But what if our bank had 5 million customers? A query that returns 5 million rows is going to return a lot of data, and we don't really care about the 4999990 people with the lowest balances. Conveniently, we can set a limit of the maximum number of records to include in the next query with ROWCOUNT.

SET ROWCOUNT [amount];
SELECT [fields] FROM [tables] WHERE [condition] ORDER BY [field] [DESC/ASC];

So, the top 10 richest people in bank could be queried efficiently with:

SET ROWCOUNT 10;
SELECT * FROM banking_records ORDER BY balance DESC;

Functions

SQL supports many functions for modifying data that is returned from the server. This can be classified as row-level functions, aggregate functions and string functions.

Aggregates

Aggregate functions iterate over an entire result set an return a single field as a result. For example the COUNT(*) function returns the number of rows that would be in a result set.

SELECT COUNT(*) FROM banking_records;

Would return the number of bank accounts in our banking_records table.

There are many aggregation functions, for example: SUM(field), AVG(field), MIN(field), MAX(field).

SELECT SUM(balance) FROM banking_records; -- Selects the amount of money the bank has
SELECT AVG(balance) FROM banking_records; -- Selects the average balance
SELECT MIN(balance) FROM banking_records; -- Selects the lowest balance in the bank
SELECT MAX(balance) FROM banking_records; -- Selects the highest balance in the bank

Row-level functions

You've already seen a few row-level functions from where I used DAYNAME() and NOW() at the start of this section. Row-level functions can also be used to manipulate data that has been returned from the database. It is important to note that they do not change the content of the database when used in a SELECT query. Arithmetic is an example of a row-level function, for example:

SELECT balance - overdraft AS real_balance FROM banking_records;

There are also functions such as ROUND() which work this way. For example:

SELECT ROUND(balance/25) as rounded_balance_share FROM banking_records;

Would round a 25th of every customer's balance to the nearest whole number.

Summary

Don't worry if you can't remember the exact syntax of a SELECT query, or the name of every individual function - you can look them up if you need to! Many of these things are ordered in this way because of the underlying nature of relational databases. To best understand query syntax & design you must first understand your database fully, which will be covered as we move through the module.