SAINT toolkit

dinsdag 14 juni 2011

Release!

Yes, it is finally there: a new release of SAINT!

The version I have put online is version 2011.01 beta 1. It contains all the basic tools, and supports both Access and MySql as database backends. Many issues were fixed, but it is still a beta. Please let me know of any problems, especially of regressions.

Get it while it is fresh!

dinsdag 15 maart 2011

Saint as Qt Ambassador

Recently, SAINT was added as a Nokia Qt Ambassador program. That means that it being used as a showcase for the capabilities of Qt and the different areas Qt is used in.
Check SAINT on as a Qt showcase here!

maandag 14 maart 2011

Database layer working: ISI parser to proof it

Moving away from Access as the only database backend supported by SAINT has been on the TODO list for quite a long time. As it turned out, creating a system to allow for multiple database backends was not all that easy. It required changes on many levels, which I'll detail a bit below. But don't let that spoil the big news: it works!

Yes, that's right, you can now actually use different databases than just Access with SAINT. At this moment, I have it working with just the ISI parser, but that is already a very important step. I use it as a testbed for all the other tools, which basically need the same infrastructure that ISI parser does, though usually to a lesser extend. That comes down to: if it works for the ISI parser, it should (mostly) work with the rest of the tools too. I will release a beta to demonstrate the working ISI parser as soon as possible.

Nitty-gritty stuff
So... what kind of changes were needed in the SAINT software stack to get all this work? Quite a lot, as it turned out.

Driver backend
The first thing that needed to be done, was to come up with a driver backend that would support all the operations we need on databases and that will work for all of them. The standard Qt database drivers only support abstractions for manipulating the data (Data Manipulation Language or DML), not the data structure (Data Definition Language or DDL). Since SAINT manipulates the data structure all the time (creating tables to store results, for instance), we need support for all of these operations. But: not all databases do that in the same way! SQL is a standard only to point... To further complicate things, not all database support the same data types, nor do these types map to the datatypes used inside SAINT and Qt in a trivial way. To store a value like a date or a long text string, you need different data types in Access and MySql, for instance.

I choose to extend the standard Qt Sql database drivers with additional methods, and in that way create new drivers that can be used with the standard Qt database system, but that offer support for far more operations. Currently, I have drivers implemented for Access (on top of the Qt ODBC driver) and MySql (using the standard Qt MySql driver).

Abstracting database access inside the tools
In order to able to actually use these shiny new drivers, all access to the database needed to be abstracted out of the code. That is: nowhere in the code could direct SQL be used anymore, nor could I use things like type names in the form as strings any more. All SQL statements needed to be generated by the current driver based on a common format. I used Qt's own QSqlRecord for that, in combination with a hugely extended list of supported query types for QSqlDriver::sqlStatement(). Where this last statement is barely documented in the Qt documentation, is turns out to be essential if you want to make your application database agnostic.

User interface
Creating a user interface to select a file on your local file system is not all that complicated. But coming up with a good interface to select or create a generic database, either as a file or running on a server, is a different matter. It took quite a bit of effort to design a build a UI that seems to be easy enough to use, and that can be embedded into the different tools with ease. It will probably need some more tweaking, but here is an idea of the current UI:

The widget that can be embedded in the different tools to select the database:

It features a drop down box with the 10 most recently used databases (with an icon representing the type of database), and a button on the right to open up a dialog box that allows to select another database currently not in the list. This widget will allow drag & drop in the near future, so you can simply drop your exising database file or .sdbd file (more on that later) on it.

The Database Selection Dialog that you get when clicking the button:

This dialog lists the databases that are known to SAINT. For each of them, a small file is stored in your documents directory (under SAINT/database descriptions/, to be exact) containing all the information that is needed to connect with the database, along with some meta information. You can even send these files to your college to easily share access to a server-based database if you want.

Select or create another database:

This is the page that allows you to select or create a database that has not been used before. The drop down box on the top displays a list with available database types. Currently on Windows, you can choose from Microsoft Access and MySql. The rest of the dialog will adapt itself accordingly. Showing on this screenshot is what you get when selecting an Access database.

Dialog for MySql databases:

The above displays the same dialog, but with the MySql database type selected. Once you enter the relevant data on the server and your credentials, available databases on the server will automatically be displayed.

All database types will insist on an Alias to be created for the database. This alias is the name for the database that you defined by selecting a file name, or by entering a server address, port, user credentials and database name. It is used in the file name for the .sdbd (Saint DataBase Description), in the drop down for the database selector widget and in the list of existing databases. A suitable alias will be suggested automatically.

dinsdag 30 maart 2010

Word stemming

One of the issues with working with (bibliometric) data is data cleaning. While we already have an experimental tool called Record Grouper that is being refactored at the moment, we also needed a quick-and-dirty way to do some basic cleaning on words.

To facilitate this need, I have added a new feature to the Word Splitter tool. The tool now builds an additional table called Wordstems, and it adds an additional field to the Words column with a reference to a record in this Wordstems table. Each word is now being stemmed by using the Porter stemming algorithm. I am looking into replacing this algorithm by another, more accurate one, but the idea will stay the same if that happens.

So now, each word will get it's stem (according to this Porter algorithm) associated with it using the stem's ID number. This makes it easy to treat words that have the same stem as the same word, so we can get rid of the difference between "robot", "robots", "robotic" and "robotics" easily if we want.

On a note that relates to the last posting on database compatibilities. While the Word Splitter still only works with MS Access, under the hood a lot has changed. I have implemented a driver system with extends the database drivers that are available in Qt by default, and works around a couple of bugs on the side. Based on this, I have created an Access Driver that is now being used by the Word Splitter tool. While it is not completely database independent yet, it is a good start and a good test for the new driver.

woensdag 24 februari 2010

Database compatabilities

At the moment, SAINT only supports Microsoft Access as a database to store your data. Because most of the tools are quite tightly bound to accessing this data structure, you are really bound to Access. That is really becoming a problem.

First of all, Access itself is a problem. Access has some limitations that limit the amount of data that it can handle to about 2GB. However, that limit includes temporary space needed for the execution of some complex queries. In practice, you will thus run into this limit much earlier in the form of a vague error message. Access has other problems too, but this is the main issue. Access does have two big strengths: first of all, it is easy to handle it's databases. You can just copy over files, and it is already installed in many professional environments. Second, it has a quite good visual query designer, making Access less of a database than it is an analysis environment in the context of using it with SAINT.

The second problem with Access has to do with Qt, the toolkit that is used as the base of SAINT. Qt supplies a set of database drivers, that makes it possible to interface with different databases. It does not have a driver for Access/JET however. SAINT accesses the databases using a driver for the ODBC layer. That has an impact on the speed of the connection, as well as on the supported features. A solution would be of course to write a database driver for Qt myself, but that is not a trivial task.

The third problem is that of cross platform compatibility. Access only runs on Microsoft Windows, but in the scientific world, macs and linux machines are not all that uncommon. It would be very good if SAINT would be able to work on those systems as well. Because Qt itself is cross platform, that is not so hard, were it not that SAINT is bound (too) tightly with Access and thus with Windows.

So, it was time to come up with a solution.
Like I said, Qt supports database drivers for different databases. These work cross platform. However, they don't entirely abstract away the differences between the databases themselves. Different databases support, for instance, different data types. They also use a slightly different query languages. The base is the same (SQL), but the dialects differ. That is easy to understand: the databases are different for a reason, and they support different features. That also means that you sometimes have to talk to them in a slightly different way. Qt helps with this, but not enough.

This issue makes it hard to simply exchange one database for another. That would also create other problems for the users. What to do with existing data sets that are in Access already? An optimal solution would be to build in support for several data back-ends, and let the user choose in an easy way which one to use. This way, the tools can be used for small, simple sets that work perfectly on Access or some other small local database, but also on huge sets that run on a remote server.

To achieve that, I need to extend Qt's abstraction from the database back-end. Currently, it supports basic manipulation and querying of the data in the database, but it does not support changing the database structure. That feature is needed in many tools though. Many parts of SAINT create new tables, or extend existing ones. The solution I have decided on focuses on extending the Qt SQL driver model to support these operations, and then re-factoring the existing SAINT code to use this abstraction and to remove the current explicit SQL code that is used everywhere. All SQL will then be generated by the abstraction layer, making it possible to support different databases as a back-end for SAINT by just replacing the database driver. Of course, there is also some work to be done in the interface, as connecting to a remote database over the Internet needs a different setup than just pointing to a file on your local file system.

I am developing this code in a separate branch of the code for now. It will be for SAINT version 2.0. So far, the progress is promising, but it is still far from production ready. Expect a first beta release of this code somewhere in April.

woensdag 10 februari 2010

Changes in ISI data importer: increased compatibility with other sources

Recently, I was contacted by somebody who tried to use SAINT on a dataset converted from PubMed to the ISI text format. That did not work. It turns out that the ISI data importer relied on the existence of a really ISI specific field to ensure records were unique. That field does not exist in the data the converter tool created, and thus the application regarded all of them as the same: the code was empty for all of them.

A small update fixed that. Now, instead of only relying on the ISI code field, the program will try other options. The first option tried is to see if there is a valid DOI code. That code is unique too, if it is available. If it is, that code will be used as the unique identifier. If there is no DOI code, an artificial code will be generated based on the journal name, the ISSN number, the year, the full name of the first author and the title of the article. That should yield a pretty unique identifier. The code will be prepended with an identifier that tells you where the code came from. These can be "isi", "doi" or "saint".

As a side-effect of this change, I have also added a new field in the articles table output. The DOI field is now included. That also creates an interesting matching opportunity with the cited references.

All very nice perhaps, but... The changes can lead to two problems:
1) If you import data from multiple sources, you may run into double values. This can be the case if an article occurs for instance in both an ISI and another source. In the ISI source, there will be an ISI identifier, so that will be used as the unique code, even if there is a DOI field too. If the same article appears in another data source that does not have the ISI identifier, we'll run into a problem. The DOI field may be used as the identifier, and the articles are no longer identified as the same one.

2) If you use this newest version of the ISI data importer to augment data from an earlier version, old articles will not be recognized as the same anymore, because in the previous version, there was no "isi" prefix to the code. If you plan to do that, you should append the string "isi:" (w/o the quotes) before every code field. You can do that by running the query below:

UPDATE Articles SET Articles.code = "isi:"+[Articles].[code];

maandag 25 januari 2010

ISI importer demonstration

After some good responses on my video demonstrating the Matrix Builder tool, I decided to make a video demonstrating the ISI Data Importer tool. Again, the video is available at Youtube.

The video demonstrates the complete process from selecting your input file to starting the actual parsing, and highlights the new features of the parser that help you identify the next step. I hope it will be useful to you!

Update (January 27, 2010):
OK, I am still learning how to deal with Youtube, and how to optimize the videos properly. I have changed the video for an HD version. If all goes well, that should work now. Watch in full screen at 720p quality for optimal viewing. It may take a little time for Youtube to process the video though.