Wednesday 9 December 2009

Big changes in Matrix Builder

More news on updated tools: the Matrix Builder.

While simple in concept, the Matrix Builder is a very important tool in our toolbox. And it has seen some important changes! Let me take you through them.

Just like the ISI parser, the matrix compiler has had some subtle interface changes. The window can now be minimized, and the close button has been removed. The old OK button has been renamed to Run. This should reduce confusion.

More important is the introduction of properties. That is: more data can be attached to the output than just connection strengths and vertex labels. Many such properties are possible, and we have a wish list for almost all of them. The basics are implemented; all that remains now is to implement more 'plugins' to support more properties.

Quite a few nice properties have already been implemented. The most useful will probably be the vertex size property (scaling node sizes according to some value) and vertex colouring.

The new version will bear version number 1.3.

ISI parser updates

It has been a long time since my last blog post. That does not mean that development has stopped! I'll try to fill you in on what happened in the meantime. I'll start with the oldest tool in the box: the ISI data importer.

Recently, I made an update for the ISI data importer. Most changes are pretty small, but still...
The first issue that was addressed is a limitation that existed with selecting files. If you selected lots of files, especially files with long names deep in your directory hierarchy, you could run over 32 thousand characters for the file names. That resulted in some files being left out, without any warning! The issue was addressed by changing the way multiple selected files are displayed in the file selection widget. Instead of just listing all the file names including their paths (which you will never read anyway for 50+ files), you now get a listing like
'file_0.txt' and 999 other files in directory 'C:/Documents and Settings/andre/Desktop/data'
Much more readable. The way the display is formatted depends on the number of files selected.

Another issue that was addressed is the laggy display of progress. If you hid or obscured the progress window during a parsing operation, it would not be updated again until the next new file. This is now fixed. What's more, the window can be minimized, and the main window is hidden during the parsing process. The progress bars have also changed: the files progress bar has been replaced by a write progress bar, which displays the number of parsed records (articles) being written to the database.

A new feature has also been introduced: it is now possible to add data to an existing database! If you select an existing database as your output file, you now have to choose what to do with it. You can either augment the existing data (no duplicates will be created) or overwrite the existing database completely.

A user interface change ensures that you actually make that choice: the OK button has been renamed to a Run button, and the Close button has been removed; you close the tool by simply closing the window. The Run button will only be available if no problems have been detected. If there are problems that prevent running, hovering your mouse over a small warning-sign icon will tell you what they are.

This new version of the ISI data importer will get version number 1.2.

Thursday 6 August 2009

Progress with Relation Calculator

The relation calculator tool that I introduced in my last blog post is shaping up nicely. Sure, it is not all smooth sailing, but it is really starting to look like something that could be very useful indeed. Let's use a screenshot of the current version to illustrate where things are heading:

What do you see?
The main thing you'll notice is the area on the right, where you see boxes connected by lines. Each of these boxes represents one simple step in the process of doing an analysis. In this case: calculating a Jaccard coefficient based on co-word/cited-reference combinations between journal papers. A nice selection of such basic steps - components, in the terminology of the application - is already there, though not fully implemented yet. The list of components can be extended later on using plugins. The available components are visible in the list on the top left.

Each of the components has inputs on its left and/or outputs on its right. Outputs can be connected to one or more inputs of other components, thus creating a graph. Note that no circular connections are allowed. The user can use drag & drop to put components on the screen, and to connect inputs and outputs together.

So, how do these separate components become an analysis?
On the bottom left of the screen, you can see the Execution order window. Here you can see the order in which the components will be executed. This is determined by their connections, by whether they present an interface at run time, and by their positions on the screen. You can open and save analysis sequences for reuse and distribution.

Once you have hooked up every component in the right order, you can run it. You will be presented with a wizard-type interface that guides you through the UI elements that each of the components presents (if any), and that presents information about the progress of your analysis if long run times are involved.

Friday 17 July 2009

Building a new tool: Relation Calculator

As announced in the previous posting about development priorities, I will be working on a tool to calculate scores for relations between objects in the database. That sounds very general, and it actually is. Allow me to elaborate...

background
The basic way of creating relations in our relational-database-based data store is to create SQL queries that express the relation (e.g. a correlation or distance measure) you are after. You can, for instance, relatively easily express a bibliographic coupling measure in SQL. It is also possible, but already more complicated, to express a Jaccard index over, say, title word similarities between articles in your database.
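To give an idea of what such a query looks like, here is a minimal sketch of a bibliographic coupling count in SQL. The table and field names (a Citations table with citing_article and cited_reference fields) are made up for this example and do not reflect our actual schema:

    -- Illustrative only: count the shared cited references for every pair of articles.
    SELECT c1.citing_article AS article_a,
           c2.citing_article AS article_b,
           COUNT(*)          AS shared_references
    FROM Citations AS c1
    INNER JOIN Citations AS c2
            ON c1.cited_reference = c2.cited_reference
           AND c1.citing_article < c2.citing_article
    GROUP BY c1.citing_article, c2.citing_article;

The '<' condition simply makes sure that every pair of articles is counted only once.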

This approach, while very flexible in theory, still has some limitations in actual practice. The first is user-related: not every researcher wanting to do this kind of analysis is an expert at writing SQL queries. That limits the usefulness of the tool set, not because what people want cannot be done, but because the learning curve to actually do it is too steep. Providing standard database views for standard analyses only solves this to a limited extent.

Another limitation is that database engines are not always as efficient as they could be in evaluating the expressions needed to construct often-used measures. Also, they tend to use only a single thread for a single query, thus making limited use of the resources of our modern multi-core computers. Using GPGPU techniques to speed up calculations is completely out of the scope of SQL for the foreseeable future. All this means that our calculations take a lot longer than they need to, and sometimes run into arbitrary limits that they need not run into.

As stated above, providing standard views can only work up to a point. We want to be able to calculate relationships between all kinds of items in the database (articles, journals, authors, ...), and we want to be able to use different measures for them as well (Jaccard, cosine, Salton, ...). Providing standard solutions for all of them is simply not doable. It would result in an explosion of pre-defined views for everything you want to add, and would basically create a mess of what is embedded in the database. That's not a very inviting prospect. What we need is something that is flexible enough to allow calculating relations between all kinds of items in all the kinds of ways we can think of.

So, we need a tool that can make these calculations:
  1. easier to use (at least for the standard analyses), and
  2. faster to compute, while still
  3. being as flexible as possible.

introducing Relation Calculator
The Relation Calculator should become that tool for you. It uses a concept of building blocks that can be put together to create a calculation that makes sense. Each building block will be responsible for a small part of the chain, like selecting which database to operate on, selecting views and fields, loading the data from the database, calculating a Jaccard index, etc. Each building block has inputs and/or outputs that can be connected to each other. In this way, a calculation for a relation can be defined. Some building blocks will present a UI to the user during execution of the calculation, for instance to allow selecting a database or a parameter like a threshold. Other building blocks will just perform some service, or maybe even one simple logical operation. Of course, these configurations of blocks can be loaded and saved, to be re-used later on. A set of pre-defined configurations can then be presented to the user.

New building blocks can be added as plugins later on, making the tool extensible. Another option to extend the functionality is to use building blocks that execute a script as their payload. For instance, you would be able to define a function in JavaScript that expresses a relationship between two authors based on some input data. That script can then be used in a configuration. The possibilities are virtually endless.

I envision a graphical environment where a user would be able to drag and drop building blocks onto a canvas and graphically connect the inputs and outputs of the blocks. This would create a very simple way to define new configurations to calculate new relations.

relation to existing code
There already is some code that does the kind of work I described here. As I mentioned in my previous blog post, the current Record Grouper basically calculates such relations already. There is also some code available that lays the basis for a script-based relation calculator. These existing pieces will of course not just be thrown away. They will have to be refactored so they can be used as building blocks in the new Relation Calculator.

status
I am now very busy implementing the infrastructure for all this. Though it is a lot of work, I am confident that it will work. Some details still have to be filled in, but I don't expect major obstacles in the near future. I hope to get a basic model working soon.

Tuesday 30 June 2009

Development priorities

We have set some development priorities after the first release that we did recently. These are partly reflected in the Ticket tracker, but to be honest that doesn't quite do it justice yet.

A small but important update is to change the output format of the Matrix Compiler tool. Currently, it outputs a full matrix in DL format. This is a bit inflexible, as it does not allow attributes other than a label for the nodes, and it outputs a full matrix even for a large, sparse one. That results in bigger output files, and thus longer processing/IO time. In the (hopefully near) future, it will allow attributes to be attached to nodes as well as connections. You can track the progress on this issue here. We're already testing it...

A larger task has to do with reworking the Record Grouper and the Relation Calculator. The first part of that job is to specialize the Record Grouper to do only that: group objects based on some kind of relation between them. This means ripping out a large piece of complicated code (but not throwing it away, see later) and focusing on a good UI that makes it easier to work with. That means fine-tuning the layout, but also adding options to undo, store and re-load your work, etc. That is a lot of work in itself, but it will simplify the code considerably, making it easier to maintain.

Another large task is to get the Relation Calculator into a usable shape. This is a complex tool. The basic idea is that it will become a specialized tool to calculate a similarity or distance measure between any two objects in the database, and to be as flexible as possible about how to calculate that. Currently, we only use SQL queries to calculate such scores, but that is sometimes limited, often complex, and usually relatively slow, because most SQL backends don't use multi-threading for single queries, let alone utilize things like letting the video card do work for you.

You can express a lot of such measures in SQL, but it is often complex to do, especially if you are not that used to working with databases. That makes the current way of working harder to use for novices, but also for seasoned researchers who just are not that into SQL. It is, however, more flexible than being stuck with defaults. In this tool, we want to make the standard things easy, and yet be as flexible as possible to enable more advanced use.

The goal is to make the standard analyses that you would normally run on a database generated by the ISI Data Importer and processed by the Word Splitter and optionally the Record Grouper as easy as selecting them and optionally setting some time slices or thresholds, after which ready-to-analyze output is generated. We aim to release a first working combination of these tools at the end of August.

Monday 29 June 2009

New name: SAINT

We did it: we came up with a nice acronym as a name for the toolkit. Since Science Research Tools is a bit generic, and SciSA toolkit is too closely linked to our department name, we came up with a new name: SAINT. That is an acronym for Science Assessment (or Analysis, at your choice) Integrated Network Toolkit. This name will be used to refer to the toolkit from now on, though it will take a bit of time before it is used everywhere consistently. Note that our URLs will not change, so there is no need to update your bookmarks.

SAINT will save you (time).
That's just one of the many possible catchphrases of course... We'll come up with some new ones in due time to advertise the toolkit to the outside world. If you have any suggestions, please let us know!

Friday 26 June 2009

Mini-demo during coffee breaks

The official demonstration session at the e-Social Science conference today is scheduled for the last 20 minutes of the lunch break. Because of the nature of the lunches (a three-course affair that seems to run over the allotted time every time so far), I have decided to cancel this demonstration. The experience of others so far is that no one shows up for these sessions.

For those interested, I will be giving mini demonstrations, just on my laptop, during the coffee breaks in the morning and/or afternoon. Simply find me and ask! I am excited to open up my laptop and show you our work.

Introducing: Matrix Compiler

In this third instalment of the 'Introducing...' series, I will be talking about the Matrix Compiler tool. While you can already see a lot of interesting things just by looking at the tables and views in the database that contains your data, visualization can be a huge help in recognizing patterns. That means that you will need to somehow get your data out of the database and into a format that you can use for visualization.

The Matrix Compiler is a tool that can do this. It can read the database and translate its data into a format that you can load into Pajek, a well-known visualization and network analysis package.

Analysing and visualizing networks means that you will be dealing with two kinds of things. First, there are the objects that are connected, which we* will call the nodes. Next, there are the connections between those nodes. At least the latter should be available as a table or query/view in your database; for brevity, I will simply call them views from here on. Depending on how you built up your database, your connections view may contain complete labels (or other attributes), but it may also contain only an ID for each node, with the label defined in some other view. The Matrix Compiler supports both modes of operation: the labels for the nodes can either be retrieved from the connections view, or be looked up from an external view.

After opening the database and indicating the name of the output file, the matrix can be set up. When constructing the matrix, you start by selecting the connection view: place the cursor in the corresponding box and either type the name or select the view by double-clicking on it. You then select which fields in that view represent, respectively, the value for the relationship, the row and the column. A connection view thus needs to have at least three fields: two to represent the nodes you are connecting, and one to indicate the strength of that connection.
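To make this concrete, a connection view for the co-authorship example used below could be defined along these lines. This is only a sketch in generic SQL with made-up table and field names, not the schema the tool expects:

    -- Illustrative sketch: a connection view with exactly three fields.
    CREATE VIEW CoAuthorships AS
    SELECT aa1.author_id AS row_node,
           aa2.author_id AS column_node,
           COUNT(*)      AS strength      -- number of co-authored articles
    FROM ArticleAuthors AS aa1
    INNER JOIN ArticleAuthors AS aa2
            ON aa1.article_id = aa2.article_id
           AND aa1.author_id <> aa2.author_id
    GROUP BY aa1.author_id, aa2.author_id;

Here row_node and column_node identify the two nodes, and strength is the value of their connection.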

The next step is to define the structure of your matrix. There are many options for that. If the types of objects in the rows and columns are the same, you may want to create a square matrix where all the nodes appear as both a row and a column. This makes sense if you are, for instance, constructing a matrix to represent co-authorships, but not if you want to display in which journals authors publish.

If you choose to create a square matrix, you can also choose whether the matrix should be symmetric. In a symmetric matrix, the value of M(a,b) is identical to M(b,a). Again, in the case of co-authorships, author a sharing a co-authorship with author b means the same as author b sharing a co-authorship with author a. But for a citation relationship, a citation from a to b is different from one from b to a.
In both cases, you can also choose whether to set the diagonal, that is M(a,a), to a fixed value, or to use the values occurring in your data. This can be useful, for instance, to filter out things like self-citations from your data.
If your data contains values for the same relation more than once, you can choose how to deal with that. The options are to use the first occurrence, use the last, add the occurring values, or multiply all the occurring values.
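For example, if the pair (a, b) occurs twice in your connection view, once with value 2 and once with value 3, these four options would put 2, 3, 5 or 6 in the corresponding cell of the matrix, respectively.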

The last step of creating your matrix is to choose where the labels for the nodes should come from. As explained above, there are two basic options. If the connection view already contains the labels, just select the appropriate option and you are done. If that view only contains references to nodes, you can now select where to get the actual labels. To continue our co-authorship example, it would make sense to select the Authors table as the table to find the labels in, and to use the full author name as the label for your nodes in the network.

Note that if you use an external source of labels, you can choose how to deal with nodes in your label view that do not appear in the connection view. For instance, authors in your Authors table may not have any co-authorships, which means they will not show up in your connection view. You may or may not want to include these unconnected nodes. The choice is yours.
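In database terms, including the unconnected nodes roughly corresponds to a LEFT JOIN from the label view to the connection view. A small sketch, re-using the made-up names from the example above, shows which authors would otherwise be left out:

    -- Illustrative only: authors that have a label but no co-authorships.
    -- (The CoAuthorships sketch above lists every pair in both directions,
    --  so checking row_node is enough.)
    SELECT au.author_id, au.full_name
    FROM Authors AS au
    LEFT JOIN CoAuthorships AS c
           ON au.author_id = c.row_node
    WHERE c.row_node IS NULL;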

Note that outputting very big matrix files can take some time, as the output size is O(n²). We are planning to change the output format shortly, from a matrix form to a list form. That will result in smaller output files for big matrices, and will also allow the inclusion of attributes other than a label for both the nodes and the connections.



* Pajek itself uses a different terminology. It instead talks about Vertices for the nodes, and Arcs and Edges for the connections between these nodes.

Wednesday 24 June 2009

Introducing: Word Splitter

In the second installment of the "Introducing" series, I will tell you something about our small utility called Word Splitter. The idea of the Word Splitter is simple. You point it to a field in your database that contains text, like a title or an abstract. The Word Splitter then builds up a table with all the words that occur in that field, and a 'couple' table with pointers between the record identifier that the text came from and the record identifier in the words table, so you can find which words belong to which text record. To make it possible to reconstruct the text from its words, the position of each word is also stored in that pointer table.
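As a rough sketch, the resulting structure could look something like the following. The table and column names are made up for illustration and may differ from what the tool actually creates; as explained further on, the tool stores two position variants:

    -- Illustrative schema only.
    CREATE TABLE Words (
        word_id  INTEGER PRIMARY KEY,
        word     TEXT
    );

    CREATE TABLE WordCouples (
        record_id          INTEGER,  -- the text record (e.g. article) the word came from
        word_id            INTEGER,  -- the word in the Words table
        position_original  INTEGER,  -- position of the word in the original text
        position_filtered  INTEGER   -- position after stopword removal (see below)
    );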

Our tool can do more than that, though. First of all, it can process multiple text fields from your database at the same time, making it more efficient for you to work with. So, you can split both that title and that abstract simultaneously.

Furthermore, the Word Splitter can use stopwords. Stopwords are, in their basic form, lists of words that are ignored when splitting the text, for instance because they are too common. That means that if you use stopwords, not all words from the text will be stored in the Words table, nor will pointers for them occur in the couple table. However, depending on your purpose, you may want to think of the procedure in two ways. One option is to first split the complete text, then remove the stopwords, and then store the remaining words and their positions in the database. This will result in consecutive word positions in the couple table, even if there used to be one or more stopwords between two words in the original text. Alternatively, you can split the complete text, note each word's position, remove the stopwords and only then store the remaining words in the database. This will result in word positions that reflect the original position of each word in the text, but leaves them non-consecutive.
To provide maximum flexibility in the analysis, both these positions are stored in the couple table.
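For example, take the title 'analysis of the science system' with 'of' and 'the' as stopwords. The words 'analysis', 'science' and 'system' are stored with filtered positions 1, 2 and 3, and with original positions 1, 4 and 5.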

The Word Splitter can use several stop word lists at once, and furthermore knows three kinds of lists. First, it can use simple text files that contain lists of words. Second, it can use a field in an existing database that also contains a list of words. And last, it can use lists of regular expressions. These expressions are patterns that each word is matched against, and if it matches, it is regarded as a stop word. That allows you to, for instance, filter out numbers or dates without having to write them all out.
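For instance, a simple pattern like '^[0-9]+$' (one or more digits, and nothing else) would treat every plain number in the text as a stop word, without you having to list them all.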

To make it easy to use these stop word lists, you can create sets of these lists, and store such a set as a file again. This way, you can easily review the stop words you used for your analysis, and you can re-use the same set later on. You can also use one such set as your default set of stop words.

This was a basic introduction to our Word Splitter tool. I hope you will like using it!

Installer for Windows online

With the official launch of the toolkit only a night away, I have just uploaded an installer for the Windows platform to the file storage we have on Assembla. Of course, our website has been updated to reflect that. Other files you can find there include documentation, but also test cases to reproduce bugs.

Tomorrow at 11 AM, I will do the first demonstration (note: the time has changed from the earlier announcement) at the e-Social Science conference in Cologne. I will give a little bit of background, and then quickly move on to actually showing the attendees the tools on some real-life data. Of course, I will also show some pretty pictures that we made using the tools, courtesy of Edwin (thanks!)

I hope everything will go all right. There will be another demonstration on Friday, so there are plenty of opportunities to see the tools in action!

Friday 19 June 2009

Open source repository and issue tracker online

Yesterday, we reached an important milestone in the project. We have put a public repository online that contains the complete source code for all the tools. That's right: you can download the sources, tinker with them, and use them however you want.

We selected Assembla as our hosting for this project. It supports the Git distributed source code repository system, and nicely integrates that with an issue tracker. So far, it seems to be pretty flexible and works nicely. While the institute is developing its new website, we have put up a temporary website for the toolkit as well. The address will not change once the new site is up, so bookmark away!

If you are familiar with C++, or want to learn it: try your hand at helping develop these tools. It's really not all that difficult. Of course, just reporting issues, suggesting documentation updates or giving ideas for improvements and extensions are also very valuable contributions!

Introducing: the ISI Data Importer

This is the first of what is to become a series of postings to introduce all the tools in the toolkit. I hope it will give a clear overview of what kind of tools we offer, and what they do.

The ISI Data Importer is aimed at importing bibliographic data that you downloaded from ISI/Web of Knowledge. You can download data on the articles resulting from your searches in a text format. The ISI Data Importer tool can read these files and output them to a structured database format. The use of structured databases is one of the basic ideas of the Science Research Toolkit. Using structured, standard databases to house the data allows us to use standard tools. Databases have been in development for decades, and are quite efficient for many of the tasks involved in the kind of work we do with the data. Also, getting the data in a form that is as structured as possible gives us maximum flexibility.

The interface of the ISI Data Importer is quite simple:
On the first tab, you select the input file or files. You can select as many files as you want, as long as they are located in a single directory. As Web of Knowledge only allows you to download a maximum of 500 records in one go, you can end up with lots of separate files that each contain a fraction of your data. Simply select them all, and they will all be imported in a single run. Duplicate records will automatically be filtered out, so if you have created several sets that overlap in their results, you will end up with a single, unified set without duplicate data points that could ruin your similarity measures later on.

On the output tab, you can select an output file. Currently, the only supported database backend is Microsoft Access files, but we are working on extending that to include other and better database backends, as Access can be a bit limiting and slow, especially if you work with large datasets. The file you select does not need to exist yet; it will simply be created for you if it doesn't.

Optionally, you can filter the data on the document types. Some of the more frequently occurring document types are included in the list on the Filter page. If you are missing an option, let me know, and I'll add it. Better yet: simply patch the list yourself, the sources are available!

Wednesday 17 June 2009

Demonstration on e-Social Science conference

As announced in the introductory posting, we will be launching our toolkit at the e-Social Science conference in Cologne, Germany. We will be doing that in a 20-minute demonstration session, where we will demonstrate how you can easily use a set of data downloaded from ISI/Web of Knowledge to create some maps of a research field, using a couple of database queries and our tools.

As soon as I know the exact time, date and location of this session, I will post it here.

Update, June 18:
There will be three demonstration sessions:

11:00 – 11:30: Thursday 25 June
16:00 – 16:30: Thursday 25 June
13:40 – 14:00: Friday 26 June.

All demos have been allocated to take place in the main foyer of Maternushaus.

If you happen to be at this conference, don't hesitate to join in for this demonstration!

Friday 12 June 2009

First post

Every blog needs an introductory posting, and this one is no different. What is it about? What can we expect to hear? Why even bother blogging? Those are the kinds of questions both you and I would like to see answered. "You and I", I hear you wonder? Yes, because it is not completely clear to me either what exactly I will and will not write about yet. So let's start with giving some idea about what I am doing, and why that is interesting to keep a weblog about.

The Science System Assessment department of the Rathenau Instituut deals, among other things, with applying bibliometrics and patentometrics to map the dynamics of science and knowledge transfer. The problem our department quickly ran into was that the available tools that can deal with this kind of data are few and far between, require many manual, error-prone and labour-intensive steps, and don't fit together well. Worse, we soon ran into limitations in the amount of data we could handle with them, and that started to affect our research.

So, we decided to build some tools ourselves. Seeing that the tools that were (and are) available are not open, we had to start from scratch. That presented both a challenge and an opportunity, because it meant we could also rethink basic issues of how these tools should work. We decided to go for a design where all tools work against standard relational databases in which we structure the available data as well as possible. We also wanted the tools to be easy to use, so a clear graphical interface was a must. Since I have experience developing software using the excellent C++-based Qt toolkit, I chose to use that as the environment to build these tools in. As an added bonus, cross-platform compatibility as well as database-backend independence come practically for free.

As the first tools became available in early versions, more and more ideas about what else we could do and needed began to pop up, and soon the plan to build a few tools grew into a complete toolkit that is still growing. Now the time has come to make these tools available to you too. The toolkit will officially be launched at the 5th conference of the National Centre for e-Social Science in Cologne. The first three tools will be released in their "1.0" or "ready to use" versions, while the rest of them are made available "as is". Because we would have liked to contribute to the existing tools but could not, we have decided to avoid the same issue with our own initiative.

We would love to hear from you, and even better, to work with you to improve these tools! We expressly invite you to use them, test them, and improve on them. To make that possible, we are making all source code available under a liberal open source licence. We will also make a public issue tracker available, as well as a forum and other collaboration tools.

And that brings us to the why of this blog: we feel that it is important to keep you up to date with what is happening, what we are planning, and what others are doing with these tools. This blog is one of the ways in which we will do that. We are also working on a nice website, and a temporary website will be up soon. If you have other ideas about how to communicate, want to aggregate this blog on your own site, or have any other comments: don't hesitate to contact me. I'd love to hear from you!