Introduction
This article introduces a method of synchronizing two PostgreSQL databases. Although this seems to be an easy task, no existing product (Slony, Londiste, ...) satisfied the needs of the maps.bremen.de project: either they had special prerequisites that did not apply to our setup, or they did not support synchronizing large objects.
Large objects are used to store the tiles of a street/aerial map within PostgreSQL. My GIS server queries the database and retrieves the tiles. This construction gives us a flexible infrastructure for updating and maintaining different versions of the maps.
Everything worked fine until the service needed to be spread over three servers. How could we easily synchronize the databases? I found no working solution that was clean and easy to use.
psync
As I use rsync on a regular basis, I wondered why there is no comparable tool for databases - one that does nothing more than synchronize two databases on demand, without sophisticated triggers or additional changes to the database. psync is a simple application that synchronizes two PostgreSQL databases. The concept can be adapted or even encapsulated so that other databases can be used as well.
The idea of psync is to use two database connections and update the tables and their content like this:
- fetch all tables of the source and destination databases
- for each table in the source:
  - create the table on the destination if it does not exist
  - sync the contents of the table
- for each table in the destination:
  - delete the table if it does not exist on the source
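The first step, fetching the table lists, can be done with a plain query against the information schema. A minimal sketch, assuming the tables of interest live in the default `public` schema:

```sql
-- List all user tables; run this against both the source
-- and the destination connection and compare the results.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;
```
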
Although the algorithm looks simple, there are many problems to overcome.
- Duplicating tables
- Handling of special columns, e.g., oid for large objects
- Constraints on tables
- Special structures of extensions
Duplicating Tables
In order to duplicate a table, we need to find out which columns exist, which primary keys are defined, and so on. With this information, we can assemble a CREATE TABLE command that creates the table. In PostgreSQL, the column information is available in the view information_schema.columns (columns column_name, is_nullable, and data_type). Primary keys are a little more complicated, as we need to join two views, information_schema.table_constraints and information_schema.key_column_usage. See PSync.PrimaryKeys for the complete SQL statement.
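The statement behind PSync.PrimaryKeys is not reproduced in the article; a sketch of what such a query looks like, built only on the standard information-schema views (the table name `some_table` is a placeholder):

```sql
-- Primary-key columns of a given table, in key order.
SELECT kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = tc.constraint_name
 AND kcu.table_name = tc.table_name
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_name = 'some_table'
ORDER BY kcu.ordinal_position;
```
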
Large Object Identifiers
The current implementation of psync focuses on the oid data type for handling large objects. Large objects store binary data and are identified by an integer. Tables with an oid column store only references to the large objects. Most replication solutions can't handle large objects.
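For illustration, a tile table of this kind might look as follows. This is a hypothetical schema, not the one used by maps.bremen.de; lo_import stores a file as a large object on the server and returns its OID:

```sql
-- Hypothetical tile table: the image column holds only a
-- large-object OID, the binary data lives in pg_largeobject.
CREATE TABLE tiles (
    x     integer NOT NULL,
    y     integer NOT NULL,
    zoom  integer NOT NULL,
    image oid,
    PRIMARY KEY (x, y, zoom)
);

INSERT INTO tiles VALUES (0, 0, 1, lo_import('/tmp/tile_0_0_1.jpg'));
```
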
Certainly, we can't just copy the identifiers from the source to the destination; we need to compare the data itself. Always copying the whole object is not a solution either, as objects may be big, and this wastes network bandwidth and time when they are in fact equal. The solution used by psync is hash values (currently the MD5 algorithm). To keep the traffic low, the hashes are calculated inside the database by a custom function, so no object data is transferred. psync compares these values and copies an object only when needed.
The PostgreSQL function for calculating an MD5 hash value is (original source):
CREATE OR REPLACE FUNCTION md5(id oid)
RETURNS text
AS $$
DECLARE
    fd       integer;
    size     integer;
    hashval  text;
    INV_READ constant integer := 262144;  -- read mode for lo_open
    SEEK_SET constant integer := 0;
    SEEK_END constant integer := 2;
BEGIN
    IF id IS NULL THEN
        RETURN NULL;
    END IF;
    fd := lo_open(id, INV_READ);
    size := lo_lseek(fd, 0, SEEK_END);  -- seek to the end to get the size
    PERFORM lo_lseek(fd, 0, SEEK_SET);  -- rewind to the beginning
    hashval := md5(loread(fd, size));
    PERFORM lo_close(fd);
    RETURN hashval;
END;
$$
LANGUAGE plpgsql STABLE STRICT;

COMMENT ON FUNCTION md5(id oid) IS 'Calculates the md5 sum of a large object.';
The process gets even more complex, as there may be more than one oid column in a table. The following code segment shows how the SQL query for retrieving the hash values is constructed:
bool first;
string _select = "select";
string _from = " from ";
string _where = " where";

// Build the column list: one md5() call per large-object column.
first = true;
foreach (int oid in oids) {
    if (!first)
        _select += ",";
    _select += string.Format(" md5({0})", result.ColumnName(oid));
    first = false;
}

_from += table;

// Restrict the query to the current row via its primary key(s).
first = true;
foreach (string key in primaryKeys) {
    if (!first)
        _where += " and";
    _where += string.Format(" {0}='{1}'", key, result[row, result.ColumnIndex(key)]);
    first = false;
}

command = _select + _from + _where;
After building the SQL command, the results are compared:
Result r1, r2;
try {
    r1 = dst.Exec(command);
    if (r1.Rows > 0) {
        // The row exists on both sides: compare the hash values
        // and copy only the large objects that differ.
        r2 = src.Exec(command);
        for (int c = 0; c < r1.Columns; c++) {
            if (r1[0, c] != r2[0, c]) {
                int id = CopyOid(result.GetInt(row, oids[c]));
                command = string.Format("update {0} set {1}='{2}'{3}",
                    table, result.ColumnName(oids[c]), id, _where);
                dst.Exec(command).Dispose();
            }
        }
        r1.Dispose();
        r2.Dispose();
    }
    else {
        // The row is missing on the destination: copy every
        // large object and insert the row.
        r1.Dispose();
        command = string.Format("insert into {0} values(", table);
        first = true;
        for (int col = 0; col < result.Columns; col++) {
            if (!first)
                command += ",";
            if (oids.Contains(col)) {
                int id = CopyOid(result.GetInt(row, col));
                command += string.Format(" '{0}'", id);
            }
            else
                command += string.Format(" '{0}'", result[row, col]);
            first = false;
        }
        command += ")";
        dst.Exec(command).Dispose();
    }
} catch (PostgreSQLException e) {
    // On failure, fall back to re-copying all large objects of
    // the row and updating the references.
    foreach (int oid in oids) {
        int id = CopyOid(result.GetInt(row, oid));
        command = string.Format("update {0} set {1}='{2}'{3}",
            table, result.ColumnName(oid), id, _where);
        dst.Exec(command).Dispose();
    }
}
Using the Code
The code is implemented in two parts. The npq namespace implements a low-level wrapper around libpq (class PG) together with a high-level interface for easier use.
Example use of the library:
PostgreSQL db = new PostgreSQL("hostaddr='127.0.0.1' port='5432' requiressl='1' " +
    "user='XXX' password='XXX' dbname='XXX'");
Console.WriteLine("Server {0} Protocol {1}", db.Version, db.Protocol);

Result result = db.Exec("select * from some_table");
Console.WriteLine("Result {0}", result.Valid);

// Print the header: column names and types.
Console.Write("{0,5} ", "Nr");
for (int f = 0; f < result.Columns; f++)
    Console.Write("| {0,15}:{1,8}", result.ColumnName(f), result.ColumnType(f));
Console.WriteLine();

// Print all rows.
for (int n = 0; n < result.Rows; n++) {
    Console.Write("{0,5} ", n);
    for (int f = 0; f < result.Columns; f++)
        Console.Write("| {0,24}", result[n, f]);
    Console.WriteLine();
}

result.Dispose();
db.Dispose();
Handling of large objects is done like this:
// Export a large object to a file. The argument of Open() is the
// OID of an existing large object.
LargeObject lo = new LargeObject(db);
lo.Open(7);
byte[] tmp = new byte[1024];
int s;
FileStream fs = new FileStream(result[n, 0] + ".jpg", FileMode.Create);
while ((s = lo.Read(tmp, 1024)) > 0)
    fs.Write(tmp, 0, s);
fs.Close();
lo.Close();
The code was developed under Linux using Mono and MonoDevelop. It should also work under Windows, but this has not been tested. Linux users must add the following configuration (npq.dll.config) so that the library is mapped correctly:
<configuration>
<dllmap dll="libpq.dll" target="libpq.so" os="!windows" />
</configuration>
History
- 19.08.2009: Initial version.