vrijdag 17 juli 2009

Building a new tool: Relation Calculator

As announced in the previous posting about development prioritites, I will be working on a tool to calculate scores for relations between objects in the database. That sounds very general, and it actually is. Allow me to elaborate...

background
The basic way of creating relations in our relational-database based data store is to create SQL queries that express the relation (e.g. correlation or distance measure) you are after. You can for instance relatively easily express a bibliometric coupling measure in SQL. It is also possible, but already more complicated, to express a Jaccard index over, say, title word similarities between articles in your database.

This approach, while very flexible in theory, still has some limitations in actual practice. The first is user-related: not every researcher wanting to do this kinds of analysis is a hero in creating SQL queries. That limits the usefulness of the tool set, not because what people want can not be done, but because it has a too high learning curve to actually do so. Providing standard database views for standard analyzes only solves this to a limited level.

Another limitation is that database engines are not always as efficient as they could be in evaluating the expressions that you need to construct often used measures. Also, they tend to use only a single thread to do a single query, thus making limited use of the resources of our modern multi-core computers. Using GPGPU techniques to speed up calculations is completely out of the scope of SQL for the foreseeable future. All this means that our calculations take a lot longer then they need to take, and sometimes run into arbitrary limits that they need not run into.

As stated above, providing standard views can only work up to a point. We want to be able to calculate relationships between all kinds of items in the database (articles, journals, authors, ...), and we want to be able to use different measures for them as well (Jaccard, cosine, Salton, ...). Providing standard solutions for all of them is simply not doable. It would result in an exponential increase of pre-defined views for everything you want to add, and would basically create a mess in what is embedded in the database. That's not a very inviting prospect. What we need is something that is flexible enough to allow calculating relations between all kinds of items in all kinds of ways we can think of.

So, we need a tool that can make these calculations:
  1. more easy to use (at least for the standard analyzes), and
  2. faster to compute, while still
  3. be as flexible as possible.
introducing Relation Calculator
The Relation Calculator tool should become that tool for you. It uses a concept of building blocks, that can be put together to create a calculation that makes sense. Each building block will be responsible for a small part of the chain, like selecting which database to operate on, selecting views and fields, loading the data from the database, calculating a Jaccard index, etc. Each building block has inputs and/or outputs, that can be connected to each other. In this way, a calculation for a relation can be defined. Some building blocks will present a UI to the user during execution of the calculation, for instance to allow selecting a database or some parameter like a threshold. Other building blocks will just perform some service, or maybe even one simple logical operation. Of course, these configurations of blocks can be loaded and saved, to be re-used later on. A set of pre-defined configurations can then be presented to the user.

New building blocks can be added as plugins later on, making the tool extensible. Another option to extend the functionality is to use building blocks that execute a script as their payload. For instance, you would be able to define a function in JavaScript that expresses a relationship between two authors based on some input data. That script can then be used in a configuration. The possibilities are virtually endless.

I envision a graphical environment where a user would be able to drag and drop building blocks on a canvas and graphically connect the input and outputs of the blocks. This would create a very simple way to define new configurations to calculate new relations.

relation to existing code
There already is some code that does the kind of work I described here. Like I mentioned in my previous blog, the current Record Grouper basically calculates such relations already. There is also already some code available that lays the basis for a script-based relation calculator. These existing pieces will of course not be just thrown away. They will have to be refactored to be able to use them as building blocks in the new Relation Calculator.

status
I am now very bussy implementing the infrastructure for all this. Though it is a lot of work, I am confident that it will work. Some details still have to be filled in, but I don't expect major obstacles in the near future. I hope to get a basic model working soon.

The idea is that