Notes on Databases

From Ggl's wiki

Jump to: navigation, search

Contents

How to choose a database?

  • access pattern:
    • mostly read or write?
    • multiple writes? (e.g. concurrency) Need locking?
  • availability
  • consistency
  • transactions
  • relations? primary key only? foreign key? multi-dimension?
  • type of requests
  • limits in storage space
  • persistent or temporary data?
  • structured data? constraints? Schema may change?

This year might see the rise of NoSQL databases. I don't think the problem lies in SQL. SQL in itself is only a language though it represents the underlying relational model. It is a solution to store some types of data, not all. As in any problem, one needs to reformulate and understand what is to be solved. The questions above aims to drive a designer in understanding what data she manipulates.

After reading some documentation about PostgreSQL, I see it as good candidate for data that require consistency and are bound with relations. I also studied Cassandra with is great to access denormalized data that needs to be always available. If I need to walk or enumerate connections between people or things I may use a graph database. And for simple persistence a structured file like JSON or YAML should fit great. I may also store data in a key/value store like tokyocabinet. Each of these type of database addresses a pattern. The difficulty belongs to the understanding of the pattern.

Find the pattern and you will find the best database.

Then you will accept that your applications will access different databases.

One of the biggest concern with RDBMS is availability. Replication addresses this concern. Master/slave mode with a single master is a simple architecture, enough for many projects. However it is not scalable in itself because an increase of writes implies adding master servers and then change the architecture for a multi-master that is not supported by all RDBMS servers. Even if the RDBMS provides multi-master, does it ensure replication is always synchronized?

One of the main difference in access pattern: "scarce writes, lots of read", with reads that imply relations between objects naturally leads to a RDMS. Lots of write will lead to a distributed architecture and maybe eventual consistency. BigTable architecture tries to maintain consistency by ensuing read/write atomicity at the row level but is it really necessary?

As you see, the pattern emphasizes on scalability. Most "small" applications don't need massive scalability. It needs to allow its design to evolute. With loosely-coupled modules one may replace the data backend later. Then designer should concentrate on a programming interface that does not stuck the programmer with a static data model.

Concepts

Properties

ACID

  • Atomicity
  • Consistency
  • Isolation
  • Durability

CAP theorem: only two properties of shared-data systems among (data) Consistency, (system) Availability and Partition-tolerance (e.g. tolerance to network partition) can be achieved simultaneously.

BASE, Basically Available, Soft-state and Eventually consistent

Distributed systems

Consistent Hashing

Consistent hashing can be implemented with two hashes. A first hash that associates an unique value to the key, like md5 or sha1. Then another hash that reduces the target keyspace to a ring. It could be a truncation of the hash to 8 bytes.

Vector Clocks

(for version reconciliation)

Technique to implement a partial order in a set. Vector clocks establish a partial order of objects in a distributed system because it is not possible to have exactly synchronized systems.

A partially ordered set (poset) defines a binary relation "≤" over a set P that is:

  • Reflexive: a ≤ a;
  • Antisymmetric: if a ≤ b and b ≤ a then a = b;
  • Transitive: if a ≤ b and b ≤ c then a ≤ c.

Inspired by Lamport timestamps that adds partial ordering with minimum overhead.

Database Models

Normalization

  1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
  2. To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
  3. To make the relational model more informative to users;
  4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

—E.F. Codd, Further Normalization of the Data Base Relational Model[1]

Denormalization

When one stores data in a relational database, she usually uses the normal model.

Relational Databases

PostgreSQL

  • type constraints
  • type inheritance
  • transactions
  • aggregates
  • custom data types, "domain" (CREATE DOMAIN)
  • composite data types (postgresql >= 8.0) e.g. type with multiple fields (CREATE TYPE <typename> AS)
  • CURSOR

Notes from PostgreSQL second edition by Korry Douglas and Susan Douglas, Sams editions:

  • create an unconstrained domain for primary and foreign keys
  • use SEQUENCE to provide unique id as primary key
  • add indexes

Key/Value Stores

Definition

Key/Value Store (kvs) are system that stores data identified by a key and provides an interface to get the data from the key. They provides a simple interface like put/get/out e.g. insert, retrieve and delete.

Data may be stored ordered or unordered depending on the underlying representation. A commonly used kvs is berkeleydb (created by Sleepycat and then acquired by Oracle). Our examples are based on Tokyocabinet which implements a similar interface.

Tokyocabinet

Tokyocabinet is a database engine written in C. It stores data in:

  • Hash
  • Array
  • B-Tree
  • Table (column-oriented)

Examples in haskell:

Add two (key, value) pairs in the Hash database /tmp/test.hdb:

import Data.Maybe ( fromJust )
import Database.TokyoCabinet
import qualified Data.ByteString.Char8 as BS
 
hdb = new :: TCM HDB 

-- |Function type could be infered.
test :: TCM (Maybe BS.ByteString, Maybe BS.ByteString)
test = do
    db <- hdb
    open db "/tmp/test.hdb" [OWRITER, OCREAT]
    put db (BS.pack "a") (BS.pack "1")
    put db (BS.pack "b") (BS.pack "2")
    sync db
    v1 <- get db (BS.pack "a")
    v2 <- get db (BS.pack "b")
    return (v1, v2) 

main = do
    values <- runTCM test
    printValues values
    where
        printValues (v1, v2) =
            BS.putStrLn (fromJust v1) >>
            BS.putStrLn (fromJust v2)

Add a single user entry in the Table Database /tmp/test.tdb:

module Main
    where
import Control.Monad ( unless )
import Database.TokyoCabinet.TDB
import Database.TokyoCabinet.TDB.Query hiding (new)
import qualified Database.TokyoCabinet.Map as TCMap
import qualified Database.TokyoCabinet.TDB.Query as TCQuery (new)

tdb = new

data Profile = Profile {
      name :: String
    , age  :: Int
    }
    deriving Show

insertProfile :: TDB -> Profile -> IO Bool
insertProfile db profile =
      do m <- TCMap.new
         TCMap.put m "name" (name profile)
         TCMap.put m "age" (show . age $ profile)
         Just pk <- genuid db
         put db (show pk) m
 
main = do
    db <- tdb
    open db "/tmp/test.tdb" [OWRITER, OCREAT] >>= showDbError db
    insertProfile db (Profile "foobar" 42)
    query <- TCQuery.new db
    addcond query "name" QCSTREQ "foobar"
    proc query $ \pk cols -> do
        Just name <- TCMap.get cols "name"
        Just age <- TCMap.get cols "age"
        putStrLn $ "name: " ++ name
        putStrLn $ "age: " ++ age
        return $ QPPUT cols
    close db >>= showDbError db
    return ()
    where
        showDbError db = flip unless $ ecode db >>= error . show

Distributed Databases

Amazon Dynamo

Amazon Dynamo is a distributed eventually-consistent key/value store. The two main requirements for Amazon Dynamo are to be able to alway write and provide low-latency (low-latency is related to alway write at the application level). Always write is related to availability. If we think again about the CAP theorem, here we emphasize on Availability and partition-tolerance (to handle lots of requests with low-latency). Then it decreases Consistency in order to maximize the two other criteria. It leads to eventual-consistency. If the set of replicas is partitioned, some nodes may not get some versions of writes.

"Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%." -- Latency is Everywhere and it Costs You Sales - How to Crush it Date and the quote was from a document cited in Radar Theme: Web Ops on August, 8th of 2008.

Werner Vogel clearly explains and summarizes the trade-offs:

A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems must be part of the same environment; they fail as a whole under certain scenarios, and as such, clients cannot observe partitions. An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, then the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data, and they are served well under this model.

In principle the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is often the responsibility of the developer writing the transaction but can be assisted by the database managing integrity constraints.
  • Always able to write:
    • Distributed => consistent hashing (O(1) routing for low latency)
    • Eventual consistency => multi-version reconciliation on read => vector clocks or "last write wins"
  • always able to write
  • data available across multiple datacenters
  • reliability and scalability depends on how its application state is managed
  • tradeoffs between:
    • availability
    • consistency
    • cost-effectiveness
    • performance
  • primary-key access to datastore
  • simple key/value interface (get()/put())
  • data partitioned and replicated using consistent hashing
  • object versioning using vector clocks
  • consistency among replicas:
    • quorum-like technique (minimum of replies from Read (R) and Write (W) nodes)
    • decentralized replica
    • synchronization protocol
  • gossip based distributed failure detection and membership protocol
  • SLA on 99.9% percentile
  • limit latency with O(1) (zerop-hop DHT) routing
  • must scale incrementally

Requests are sent to a coordinator node. The coordinator node is localized at the first position after the key position on the consistent hashing ring.

The coordinator node manages replication on the N following node in the preference list.

Google BigTable

  • Keyspace/Row/Column-family data model
  • atomic read/write at the row level
  • Tablets
  • SSTables (Sorted String Table)
  • Distributed filesystem (GFS)
  • Distributed Lock Manager (Chubby)

Cassandra

Cassandra is a distributed eventually-consistent (distributed architecture and replication as in Amazon Dynamo) column-family based store (as in Google BigTable data model).

Easy to install on a Debian system.

$ svn checkout https://svn.apache.org/repos/asf/cassandra/trunk cassandra-svn
$ svn cassandra-svn
$ sudo apt-get install ant jsvc
$ dpkg-buildpackage -rfakeroot
$ sudo dpkg -i ../cassandra_0.6.0~rc1-1_all.deb

That's all. It's running now :).

$ cassandra-cli -host localhost -port 9160
$ show keyspaces

Hmmm, actually it was not a good idea to get the source from svn ;). With a stable version (0.5.1) tarball from http://cassandra.apache.org/download/ , it works:

$ cassandra -f # as root

-f keeps the process in background to see the logs on stdout

Now I connect the cli to the local instance:

$ cassandra-cli -host localhost -port 9160
cassandra> show keyspaces
Keyspace1
system
cassandra> set Keyspace1.Standard2['jsmith']['first'] = 'John'
Value inserted.
cassandra> set Keyspace1.Standard2['jsmith']['last'] = 'Smith'
Value inserted.
cassandra> set Keyspace1.Standard2['jsmith']['age'] = '42'
Value inserted.
cassandra> get Keyspace1.Standard2['jsmith']              
=> (column=last, value=Smith, timestamp=1270739577358)
=> (column=first, value=John, timestamp=1270739567861)
=> (column=age, value=42, timestamp=1270739584095)
Returned 3 results

I kept the default testing configuration and keyspaces definition:

<Keyspaces>
    <Keyspace Name="Keyspace1">
       <ColumnFamily CompareWith="BytesType" Name="Standard1"/>
       <ColumnFamily CompareWith="UTF8Type" Name="Standard2"/>
       <ColumnFamily CompareWith="TimeUUIDType" Name="StandardByUUID1"/>
       <ColumnFamily ColumnType="Super"
                   CompareWith="UTF8Type"
                   CompareSubcolumnsWith="UTF8Type"
                   Name="Super1"
                   Comment="A column family with supercolumns, whose column and subcolumn names are UTF8 strings"/>
    </Keyspace>
</Keyspaces>

The keyspace Keyspace1 contains 4 ColumnFamily and I added rows in Standard2.

Simple Test Application

Using Python gunicorn (wsgi server), gevent (event handling library), paste (wsgi applications), tenjin (template engine) and pycassa (cassandra module).

Install gunicorn, gevent, paste, webob, authkit and tenjin from pypi:

$ easy_install -U gevent
$ easy_install -U gunicorn
$ easy_install -U paste
$ easy_install -U webob
$ easy_install -U authkit
$ easy_install -U tenjin

Install pycassa from github:

$ git://github.com/vomjom/pycassa.git
$ cd pycassa
$ python setup.py build
$ python setup.py install

Now test pycassa:

>>> import pycassa as cassandra
>>> cassandra.connect(['localhost:9160'])
>>> cf = cassandra.ColumnFamily(client, 'Keyspace1', u'Standard2')
>>> cf.get('jsmith')
{'age': '42', 'last': 'Smith', 'first': 'John'}

Let create a foobar entry:

>>> cf.insert('foobar', {'age': '30', 'last': 'Bar', 'first': 'Foo'})
1270826275772411

The simple application is inspired by Stack Overflow. It is a simple forum-like web application:

  • Questions
    • question (key)
    • author
    • tags
    • answers
    • nr_answers
    • nr_votes
    • nr_views
  • Users
    • email
    • name
    • age
    • location
    • date_created
    • date_lastvisit
  • Tags
    • name
    • count
  • Answers
    • author
    • content (key)
    • date_created
    • nr_votes

I add the following ColumnFamily (currently don't use Super Columns):

    <Keyspace Name="Forum">
      <KeysCachedFraction>0.01</KeysCachedFraction>
      <columnFamily CompareWith="UTF8Type" Name="Posts" />
      <columnFamily CompareWith="UTF8Type" Name="Users" />
      <columnFamily CompareWith="UTF8Type" Name="Tags" />
      <ColumnFamily CompareWith="UTF8Type" Name="Answers" />
    </Keyspace>

We create a first user:

>>> cf = cassandra.ColumnFamily(client, 'Forum', 'Users')
>>> cf.insert('foobar', {
...     'email': 'foobar@example.com',
...     'name': 'foobar',
...     'age': '30',
...     'location': 'New York',
...     'date_created': '2010/04/09',
...     'date_lastvisit': '2010/04/09'})
1270828509523568

To remove a key:

>>> cf = cassandra.ColumnFamily(client, 'Forum', 'Questions')
>>> for i in cf.get_range(): pprint(i)
... 
("What's best between Cassandra and Riak?",
 {'answers': '[]',
  'author': 'foobar',
  'nr_answers': '0',
  'nr_views': '0',
  'nr_votes': '0',
  'tags': "['nosql', 'db']"})
('Is it working?',
 {'answers': '[]',
  'author': '?',
  'nr_answers': '0',
  'nr_views': '0',
  'nr_votes': '0',
  'tags': '[]'})
>>> cf.remove('Is it working?')
1271065788629521
>>> for i in cf.get_range(): pprint(i)
... 
( "What's best between Cassandra and Riak?",
 {'answers': '[]',
  'author': 'foobar',
  'nr_answers': '0',
  'nr_views': '0',
  'nr_votes': '0',
  'tags': "['nosql', 'db']"})
('Is it working?', {})

An application based on user interaction should work better with authentication and a way to know who is the current user. I use AuthKit from the Pylons project to manage authentication.

Cassandra does not provide a query language like SQL. Then to avoid sorting and processing the data retrieve from the database in the code, we should denormalize the data in the keyspace. To denormalize means storing data very closed to the expected result. As Cassandra stores data in a sorted string table (SSTable) we can take advantage of this property to get an result sorted as we want. For example, if we want the list of questions ordered by date, it is a good practice to have a table well structured for that.

WTF is a SuperColumn? An Intro to the Cassandra Data Model, an article written by a Digg developer who actually uses Cassandra, explains this in a clear fashion. Then we need to know which queries we will make in order to define the structure of the tables.

The first query is "give me the last 10 questions ordered by date".

We can access an single entry with its key:

>>> cf.get('What-is-a-Super-Column-')
{'nr_views': '0', 'author': 'foobar', 'tags': 'cassandra, db', 'question': 'What is a Super Column?', 'nr_votes': '0', 'answers': '[]', 'nr_answers': '0'}

Before inserting an entry we should check if it already exists with cf.get_count(key).

Graph database

Neo4j

Rest API

References

  1. Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34
Personal tools