Wednesday, 24 February 2010

Database compatibility

At the moment, SAINT only supports Microsoft Access as a database to store your data. Because most of the tools are tightly bound to this data structure, you are effectively locked into Access. That is becoming a real problem.

First of all, Access itself is a problem. Access limits the amount of data it can handle to about 2 GB. That limit includes the temporary space needed to execute some complex queries, so in practice you will hit it much earlier, usually in the form of a vague error message. Access has other problems too, but this is the main issue. It does have two big strengths. First, its databases are easy to handle: you can simply copy the files around, and Access is already installed in many professional environments. Second, it has quite a good visual query designer, which, in the context of using it with SAINT, makes Access less a database than an analysis environment.

The second problem with Access has to do with Qt, the toolkit that SAINT is built on. Qt supplies a set of database drivers that make it possible to interface with different databases, but it does not have a driver for Access/JET. SAINT therefore accesses the database through the generic ODBC layer, which has an impact both on the speed of the connection and on the supported features. A solution would of course be to write a Qt database driver myself, but that is not a trivial task.

The third problem is cross-platform compatibility. Access only runs on Microsoft Windows, but in the scientific world, Macs and Linux machines are not at all uncommon. It would be very good if SAINT could work on those systems as well. Because Qt itself is cross-platform, that would not be hard, were it not that SAINT is bound too tightly to Access, and thus to Windows.

So, it was time to come up with a solution.
Like I said, Qt supports database drivers for different databases, and these work cross-platform. However, they do not entirely abstract away the differences between the databases themselves. Different databases support different data types, for instance, and they use slightly different query languages: the base is the same (SQL), but the dialects differ. That is easy to understand. The databases are different for a reason, they support different features, and that means you sometimes have to talk to them in a slightly different way. Qt helps with this, but not enough.
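To make the dialect issue concrete, here is a small sketch in Python (illustrative only, not SAINT or Qt code): even a simple request like "give me the first five rows" has to be phrased differently for Access/JET than for most other databases.

```python
# Illustrative sketch of an SQL dialect difference.
# Access/JET uses a TOP clause, while SQLite, MySQL and
# PostgreSQL use LIMIT for the same logical query.

def limit_query(dialect, table, n):
    """Return a SELECT for the first n rows of table in the given dialect."""
    if dialect == "access":
        return f"SELECT TOP {n} * FROM {table}"
    if dialect in ("sqlite", "mysql", "postgresql"):
        return f"SELECT * FROM {table} LIMIT {n}"
    raise ValueError(f"unknown dialect: {dialect}")

print(limit_query("access", "Articles", 5))  # SELECT TOP 5 * FROM Articles
print(limit_query("sqlite", "Articles", 5))  # SELECT * FROM Articles LIMIT 5
```

The queries mean the same thing, but neither form runs unchanged on the other database, which is exactly why a generic driver layer cannot simply pass SQL strings through.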

This issue makes it hard to simply exchange one database for another, and doing so would also create other problems for users. What to do with existing data sets that are already in Access? An optimal solution would be to build in support for several data back-ends and let the user choose, in an easy way, which one to use. This way, the tools can be used for small, simple sets that work perfectly with Access or some other small local database, but also for huge sets that run on a remote server.

To achieve that, I need to extend Qt's abstraction of the database back-end. Currently, it supports basic manipulation and querying of the data in the database, but it does not support changing the database structure. Many tools need that feature, though: many parts of SAINT create new tables or extend existing ones. The solution I have decided on is to extend the Qt SQL driver model to support these operations, and then to refactor the existing SAINT code to use this abstraction, removing the explicit SQL code that is currently used everywhere. All SQL will then be generated by the abstraction layer, making it possible to support a different database as a back-end for SAINT by just replacing the database driver. Of course, there is also some work to be done on the interface, as connecting to a remote database over the Internet needs a different setup than just pointing to a file on your local file system.
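The shape of such an abstraction layer can be sketched as follows. This is a hypothetical Python sketch of the idea, not the actual SAINT 2.0 API: application code describes tables in generic terms, and the layer generates dialect-specific DDL, so no explicit SQL lives in the tools themselves.

```python
# Hypothetical sketch: generating schema-changing SQL per back-end.
# The type names and dialects are assumptions for illustration only.

TYPE_MAP = {
    "access": {"int": "LONG",    "text": "TEXT(255)"},
    "sqlite": {"int": "INTEGER", "text": "TEXT"},
}

def create_table_sql(dialect, table, columns):
    """columns is a list of (name, generic_type) pairs."""
    types = TYPE_MAP[dialect]
    cols = ", ".join(f"{name} {types[t]}" for name, t in columns)
    return f"CREATE TABLE {table} ({cols})"

def add_column_sql(dialect, table, name, generic_type):
    """Extend an existing table with one new column."""
    return f"ALTER TABLE {table} ADD COLUMN {name} {TYPE_MAP[dialect][generic_type]}"

print(create_table_sql("sqlite", "Articles", [("code", "text"), ("year", "int")]))
# CREATE TABLE Articles (code TEXT, year INTEGER)
```

Swapping the back-end then only means selecting a different entry in the type map (in practice, a different driver), while the tool code that asks for a new table stays unchanged.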

I am developing this code in a separate branch for now; it will become SAINT version 2.0. So far, progress is promising, but it is still far from production-ready. Expect a first beta release of this code somewhere in April.

Wednesday, 10 February 2010

Changes in ISI data importer: increased compatibility with other sources

Recently, I was contacted by somebody who tried to use SAINT on a dataset converted from PubMed to the ISI text format. That did not work. It turned out that the ISI data importer relied on a really ISI-specific field to ensure that records were unique. That field does not exist in the data the converter tool created, so the application regarded all records as the same: the code was empty for all of them.

A small update fixed that. Instead of relying only on the ISI code field, the program now tries other options. First, it checks whether there is a valid DOI; a DOI is unique too, if it is available, so in that case it is used as the unique identifier. If there is no DOI, an artificial code is generated from the journal name, the ISSN, the year, the full name of the first author and the title of the article. That should yield a reasonably unique identifier. The code is prefixed with an identifier that tells you where it came from: "isi", "doi" or "saint".
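The fallback scheme can be sketched like this. This is an illustrative Python sketch, not the importer's actual code: the field names are assumptions, and the hashing step is my own illustration of "generated based on" those fields; the real importer may combine them differently.

```python
import hashlib

def unique_code(record):
    """Pick a unique identifier for an article record (a dict).
    Preference order: ISI field, then DOI, then an artificial code."""
    if record.get("isi_code"):
        return "isi:" + record["isi_code"]
    if record.get("doi"):
        return "doi:" + record["doi"]
    # No ISI code or DOI: derive an artificial code from descriptive
    # fields (journal, ISSN, year, first author, title).
    parts = [str(record.get(k, "")) for k in
             ("journal", "issn", "year", "first_author", "title")]
    digest = hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()
    return "saint:" + digest

print(unique_code({"doi": "10.1000/xyz"}))  # doi:10.1000/xyz
```

Note how the prefix ("isi", "doi" or "saint") records which branch produced the identifier, which matters for the compatibility issues described below.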

As a side effect of this change, I have also added a new field to the articles table output: the DOI is now included. That also creates an interesting matching opportunity with the cited references.

All very nice perhaps, but... the changes can lead to two problems:
1) If you import data from multiple sources, you may run into duplicate records. This can happen when an article occurs in both an ISI source and another source. In the ISI source there is an ISI identifier, so that is used as the unique code, even if a DOI field is present too. If the same article appears in another data source that lacks the ISI identifier, we run into a problem: there, the DOI may be used as the identifier, and the two records are no longer recognized as the same article.

2) If you use this newest version of the ISI data importer to augment data imported with an earlier version, old articles will no longer be recognized as the same, because in the previous version there was no "isi" prefix on the code. If you plan to do that, you should first prepend the string "isi:" (without the quotes) to every code field. You can do that by running the query below:

UPDATE Articles SET Articles.code = "isi:"+[Articles].[code];