Introduction
This article introduces a method of synchronizing two PostgreSQL databases. Although this seems to be an easy task, no existing product (Slony, Londiste, ...) satisfied the needs of the maps.bremen.de project: either they had special prerequisites that did not apply to our setup, or they did not support synchronizing large objects.
Large objects are used to store the tiles of a street/aerial map within PostgreSQL. My GIS server queries the database and retrieves the tiles. This construction gives us a flexible infrastructure for updating and maintaining different versions of the maps.
Everything worked fine until the service needed to be spread over three servers. How could we easily synchronize the databases? I found no working solution that was clean and easy to use.
psync
As I use rsync on a regular basis, I wondered why there is no comparable tool for databases - one that does nothing more than synchronize two databases on demand, without sophisticated triggers or additional changes to the database. psync is a simple application that synchronizes two PostgreSQL databases. The concept can be adapted or even encapsulated so that other databases can be used as well.
The idea of psync is to use two database connections and update the tables and their content like this:
- fetch all tables of the source and destination databases
- for each table in the source:
  - create the table on the destination if it does not exist
  - sync the contents of the table
- for each table in the destination:
  - delete the table if it does not exist on the source
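The first step, fetching the table lists, can be done with a plain query against the information schema. A minimal sketch, assuming the tables of interest live in the default `public` schema:

```sql
-- List all user tables; run this against both the source
-- and the destination connection and compare the results.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;
```
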
Although the algorithm looks simple, there are many problems to overcome.
- Duplicating tables
- Handling of special columns, e.g., oid for large objects
- Constraints on tables
- Special structures of extensions
Duplicating Tables
In order to duplicate a table, we need to find out which columns exist, which primary keys are defined, and so on. With this information, we can assemble a CREATE TABLE command that creates the table. In PostgreSQL, the column information is available in the view information_schema.columns (columns column_name, is_nullable, and data_type). Primary keys are a little more complicated, as we need to join two views, information_schema.table_constraints and information_schema.key_column_usage. See PSync.PrimaryKeys for the complete SQL statement.
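The statement behind PSync.PrimaryKeys is not reproduced in the article; a sketch of what such a query looks like, built only on the standard information-schema views (the table name `some_table` is a placeholder):

```sql
-- Primary-key columns of a given table, in key order.
SELECT kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = tc.constraint_name
 AND kcu.table_name = tc.table_name
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_name = 'some_table'
ORDER BY kcu.ordinal_position;
```
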
Large Object Identifiers
The current implementation of psync focuses on the oid data type for handling large objects. Large objects store binary data and are identified by an integer. Tables with an oid column store only references to the large objects. Most replication solutions can't handle large objects.
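For illustration, a tile table of this kind might look as follows. This is a hypothetical schema, not the one used by maps.bremen.de; lo_import stores a file as a large object on the server and returns its OID:

```sql
-- Hypothetical tile table: the image column holds only a
-- large-object OID, the binary data lives in pg_largeobject.
CREATE TABLE tiles (
    x     integer NOT NULL,
    y     integer NOT NULL,
    zoom  integer NOT NULL,
    image oid,
    PRIMARY KEY (x, y, zoom)
);

INSERT INTO tiles VALUES (0, 0, 1, lo_import('/tmp/tile_0_0_1.jpg'));
```
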
Certainly, we can't just copy the identifiers from the source to the destination; we need to compare the data itself. Always copying the whole object is not a solution either, as objects may be big, and this wastes network bandwidth and time when they are in fact equal. The solution used by psync is hash values (currently the MD5 algorithm). To keep the traffic low, the hashes are calculated inside the database by a custom function, so no object data is transferred. psync compares these values and copies an object only when needed.
The PostgreSQL function for calculating an MD5 hash value is (original source):
CREATE OR REPLACE FUNCTION md5(id oid)
RETURNS text
AS $$
DECLARE
    fd       integer;
    size     integer;
    hashval  text;
    INV_READ constant integer := 262144;  -- read mode for lo_open
    SEEK_SET constant integer := 0;
    SEEK_END constant integer := 2;
BEGIN
    IF id IS NULL THEN
        RETURN NULL;
    END IF;
    fd := lo_open(id, INV_READ);
    size := lo_lseek(fd, 0, SEEK_END);  -- seek to the end to get the size
    PERFORM lo_lseek(fd, 0, SEEK_SET);  -- rewind to the beginning
    hashval := md5(loread(fd, size));
    PERFORM lo_close(fd);
    RETURN hashval;
END;
$$
LANGUAGE plpgsql STABLE STRICT;

COMMENT ON FUNCTION md5(id oid) IS 'Calculates the md5 sum of a large object.';
The process gets even more complex, as there may be more than one oid column in a table. The following code segment shows how the SQL query for retrieving the hash values is constructed:
bool first;
string _select = "select";
string _from = " from ";
string _where = " where";

// Build the column list: one md5() call per large-object column.
first = true;
foreach (int oid in oids) {
    if (!first)
        _select += ",";
    _select += string.Format(" md5({0})", result.ColumnName(oid));
    first = false;
}

_from += table;

// Restrict the query to the current row via its primary key(s).
first = true;
foreach (string key in primaryKeys) {
    if (!first)
        _where += " and";
    _where += string.Format(" {0}='{1}'", key, result[row, result.ColumnIndex(key)]);
    first = false;
}

command = _select + _from + _where;
After building the SQL command, the results are compared:
Result r1, r2;
try {
    r1 = dst.Exec(command);
    if (r1.Rows > 0) {
        // The row exists on both sides: compare the hash values
        // and copy only the large objects that differ.
        r2 = src.Exec(command);
        for (int c = 0; c < r1.Columns; c++) {
            if (r1[0, c] != r2[0, c]) {
                int id = CopyOid(result.GetInt(row, oids[c]));
                command = string.Format("update {0} set {1}='{2}'{3}",
                    table, result.ColumnName(oids[c]), id, _where);
                dst.Exec(command).Dispose();
            }
        }
        r1.Dispose();
        r2.Dispose();
    }
    else {
        // The row is missing on the destination: copy every
        // large object and insert the row.
        r1.Dispose();
        command = string.Format("insert into {0} values(", table);
        first = true;
        for (int col = 0; col < result.Columns; col++) {
            if (!first)
                command += ",";
            if (oids.Contains(col)) {
                int id = CopyOid(result.GetInt(row, col));
                command += string.Format(" '{0}'", id);
            }
            else
                command += string.Format(" '{0}'", result[row, col]);
            first = false;
        }
        command += ")";
        dst.Exec(command).Dispose();
    }
} catch (PostgreSQLException e) {
    // On failure, fall back to re-copying all large objects of
    // the row and updating the references.
    foreach (int oid in oids) {
        int id = CopyOid(result.GetInt(row, oid));
        command = string.Format("update {0} set {1}='{2}'{3}",
            table, result.ColumnName(oid), id, _where);
        dst.Exec(command).Dispose();
    }
}
Using the Code
The code is implemented in two parts. The npq namespace implements a low-level wrapper around libpq (class PG) together with a high-level interface for easier use.
Example use of the library:
PostgreSQL db = new PostgreSQL("hostaddr='127.0.0.1' port='5432' requiressl='1' " +
    "user='XXX' password='XXX' dbname='XXX'");
Console.WriteLine("Server {0} Protocol {1}", db.Version, db.Protocol);

Result result = db.Exec("select * from some_table");
Console.WriteLine("Result {0}", result.Valid);

// Print the header: column names and types.
Console.Write("{0,5} ", "Nr");
for (int f = 0; f < result.Columns; f++)
    Console.Write("| {0,15}:{1,8}", result.ColumnName(f), result.ColumnType(f));
Console.WriteLine();

// Print all rows.
for (int n = 0; n < result.Rows; n++) {
    Console.Write("{0,5} ", n);
    for (int f = 0; f < result.Columns; f++)
        Console.Write("| {0,24}", result[n, f]);
    Console.WriteLine();
}

result.Dispose();
db.Dispose();
Handling of large objects is done like this:
// Export a large object to a file. The argument of Open() is the
// OID of an existing large object.
LargeObject lo = new LargeObject(db);
lo.Open(7);
byte[] tmp = new byte[1024];
int s;
FileStream fs = new FileStream(result[n, 0] + ".jpg", FileMode.Create);
while ((s = lo.Read(tmp, 1024)) > 0)
    fs.Write(tmp, 0, s);
fs.Close();
lo.Close();
The code was developed under Linux using Mono and MonoDevelop. It should also work under Windows, but this has not been tested. Linux users must add the following configuration (npq.dll.config) so that the library is mapped correctly:
<configuration>
<dllmap dll="libpq.dll" target="libpq.so" os="!windows" />
</configuration>
History
- 19.08.2009: Initial version.