
Friday, 26 June 2009

Introducing: Matrix compiler

In this third installment of the 'Introducing...' series, I will talk about the Matrix Compiler tool. While you can already see a lot of interesting things just by looking at the tables and views in the database that contains your data, visualization can be a huge help in recognizing patterns. That means you will somehow need to transform your data out of the database into a format that you can use for visualization.

The Matrix Compiler is a tool that can do this. It reads the database and translates its data into a format that you can load into Pajek, a well-known visualization and network analysis package.

Analysing and visualizing networks means that you will be dealing with two kinds of things. First, there are the objects that are connected, which we* will call the nodes. Next, there are the connections between those nodes. At least the latter should be available as a table or query/view in your database; for brevity, I will simply call all of these a view from here on. Depending on how you build up your database, your connections view may contain complete labels (or other attributes), but it may well contain only an ID for each node, with the label defined in some other view. The Matrix Compiler supports both modes of operation: the labels for the nodes can be retrieved either from the connection view itself, or looked up from an external view.

After opening the database and indicating the name of the output file, the matrix can be set up. When constructing the matrix, you start by selecting the connection view, by placing the cursor in the corresponding box and either typing the name or double-clicking it in the list. You then select which field in that view represents, respectively, the value of the relationship, the row and the column. A connection view thus needs at least three fields: two to represent the nodes you are connecting, and one to indicate the strength of that connection.
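To give an impression of what such an export involves, here is a minimal sketch (not the actual Matrix Compiler code, and all names are made up) of turning (row, column, value) triples from a connection view into the lines of a Pajek .net file:

```python
# Illustrative sketch: build Pajek .net lines from (row, column, value)
# triples, such as rows fetched from a connection view.

def pajek_lines(connections):
    """Return the lines of a Pajek .net file for the given triples."""
    # Collect every node and give it a 1-based Pajek vertex number.
    labels = sorted({a for a, b, v in connections} | {b for a, b, v in connections})
    index = {label: i + 1 for i, label in enumerate(labels)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[label]} "{label}"' for label in labels]
    lines.append("*Arcs")  # one line per connection: source target weight
    lines += [f"{index[a]} {index[b]} {v}" for a, b, v in connections]
    return lines
```

Joining these lines with newlines and saving them as a .net file gives something Pajek can open directly.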

The next step is to define the structure of your matrix. There are many options for that. If the types of objects in the rows and columns are the same, you may want to create a square matrix where all the nodes appear as both a row and a column. This makes sense if you are, for instance, constructing a matrix to represent co-authorships, but not if you want to display in which journals authors publish.

If you choose to create a square matrix, you may also choose whether you want the matrix to be symmetric. In a symmetric matrix, the value of M(a,b) is identical to M(b,a). Again, in the case of co-authorships, saying that author a shares a co-authorship with author b means the same as saying that author b shares one with author a. For a citation relationship, however, a citation from a to b is different from one from b to a.
In both cases, you can also choose whether to set the diagonal, that is M(a,a), to a fixed value, or to use the values occurring in your data. This can be useful, for instance, to filter self-citations out of your data.
If your data contains the same relation more than once, you can choose how to deal with that. The options are to use the first occurrence, use the last, add the occurring values, or multiply them.
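The choices above can be sketched in a few lines of Python. This is an illustration of the idea only, not the tool's actual code, and all names are made up:

```python
# Sketch of the post-processing choices described above, applied to
# (row, column, value) triples from a connection view.
from functools import reduce

def combine(values, mode):
    """Combine duplicate values found for the same cell."""
    return {"first": values[0],
            "last": values[-1],
            "add": sum(values),
            "multiply": reduce(lambda a, b: a * b, values)}[mode]

def build_cells(connections, symmetric=False, diagonal=None, mode="add"):
    grouped = {}
    for a, b, v in connections:
        # In a symmetric matrix, (a, b) and (b, a) are the same cell.
        key = tuple(sorted((a, b))) if symmetric else (a, b)
        grouped.setdefault(key, []).append(v)
    cells = {key: combine(vals, mode) for key, vals in grouped.items()}
    if diagonal is not None:
        # Force every M(a,a) to a fixed value, e.g. 0 to drop self-citations.
        for key in list(cells):
            if key[0] == key[1]:
                cells[key] = diagonal
    return cells
```

For example, with a symmetric matrix the values for (a,b) and (b,a) end up in the same cell and are combined according to the chosen mode.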

The last step in creating your matrix is to choose where the labels for the nodes should come from. As explained above, there are two basic options. If the connection view already contains the labels, just select the appropriate option and you are done. If that view contains references to nodes, though, you can now select where to get the actual labels. To continue our co-authorship example, it would make sense to select the Authors table as the table in which to find the labels, and use the full author name as the label for your nodes in the network.

Note that if you use an external source of labels, you can choose how to deal with nodes in your label view that do not appear in the connection view. For instance, authors in your Authors table may not have any co-authorships, which means they will not show up in your connection view. You may or may not want to include these unconnected nodes. The choice is yours.
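A minimal sketch of that label lookup, with an imaginary Authors table as the external label view (again illustrative names only, not the tool's schema):

```python
# Illustrative sketch: resolve node IDs from the connection view to labels
# in an external view, optionally keeping nodes that never appear in any
# connection (e.g. authors without co-authorships).

def collect_nodes(connections, labels, include_unconnected):
    """connections: (row_id, column_id, value) triples; labels: id -> label."""
    connected = {n for a, b, v in connections for n in (a, b)}
    ids = set(labels) if include_unconnected else connected
    return {i: labels[i] for i in sorted(ids)}
```

With `include_unconnected` set, every entry of the label view becomes a node in the network, connected or not.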

Note that outputting very big matrix files can take some time, as the output size is O(n²). We are planning to change the output format shortly, from a matrix form to a list form. That will result in smaller output files for big matrices, and will also allow the inclusion of attributes other than a label for both the nodes and the connections.



* Pajek itself uses a different terminology. It instead talks about Vertices for the nodes, and Arcs and Edges for the connections between these nodes.

Wednesday, 24 June 2009

Introducing: Word Splitter

In the second installment of the 'Introducing...' series, I will tell you something about our small utility called Word Splitter. The idea of the Word Splitter is simple. You point it to a field in your database that contains text, like a title or an abstract. The Word Splitter then builds up a table with all the words that occur in that field, and a table with pointers between the record identifier the text came from and the record identifier in the words table, so you can find which words belong to which text record. To make it possible to reconstruct the text from these words, the position each word had in the text is also stored in that pointer table.
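A minimal sketch of these two tables, with Python dicts and lists standing in for the database tables (the names here are illustrative, not the tool's actual schema):

```python
# Illustrative sketch of the Word Splitter's core idea: a words table
# (word -> word_id) and a couple table of (record_id, word_id, position).

def split_records(records):
    """records: {record_id: text}. Returns (words, couples)."""
    words = {}        # the Words table: word -> word_id
    couples = []      # the couple table: (record_id, word_id, position)
    for rec_id, text in records.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            # Re-use an existing word_id, or assign the next free one.
            word_id = words.setdefault(word, len(words) + 1)
            couples.append((rec_id, word_id, pos))
    return words, couples
```

Following the pointers back from the couple table, ordered by position, reproduces the original sequence of words for each record.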

Our tool can do more than that, though. First of all, it can process multiple text fields from your database at the same time, making it more efficient for you to work with. So, you can split both that title and that abstract simultaneously.

Furthermore, the Word Splitter can use stopwords. Stopwords are, in their basic form, lists of words that are ignored when splitting the text, for instance because they are too common. That means that if you use stopwords, not all words from the text will be stored in the Words table, nor will pointers for them occur in the couple table. However, depending on your purpose, you can think of the procedure in two ways. One option is to first split the complete text, then remove the stopwords, and only then store the remaining words and their positions in the database. This results in consecutive word positions in the couple table, even if there used to be one or more stopwords between two words in the original text. Alternatively, you can split the complete text, note each word's position, remove the stopwords and only then store the rest in the database. This results in word positions that reflect the original position of each word in the text, but leaves them non-consecutive.
To provide maximum flexibility in the analysis, both these positions are stored in the couple table.
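The two position schemes can be illustrated in a few lines (an illustrative sketch, not the tool's code):

```python
# Sketch of the two position schemes when stop words are removed:
# positions renumbered after removal vs. original positions in the text.

def split_with_stopwords(text, stopwords):
    kept = []
    new_pos = 0
    for orig_pos, word in enumerate(text.lower().split(), start=1):
        if word in stopwords:
            continue
        new_pos += 1
        # Store both the renumbered and the original position.
        kept.append((word, new_pos, orig_pos))
    return kept

rows = split_with_stopwords("the dynamics of science", {"the", "of"})
# rows == [("dynamics", 1, 2), ("science", 2, 4)]
```

Here "dynamics" and "science" are consecutive in the renumbered scheme (1 and 2), while their original positions (2 and 4) still show that stopwords stood between them.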

The Word Splitter can use several stop word lists at once, and supports three kinds of lists. First, it can use simple text files that contain lists of words. Second, it can use a field in an existing database that contains a list of words. Last, it can use lists of regular expressions. These expressions are patterns that each word is matched against; if a word matches, it is regarded as a stop word. That allows you to, for instance, filter out numbers or dates without having to write them all out.
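As an illustration of the regular-expression kind of list (the patterns below are just examples, not ones shipped with the tool):

```python
# Sketch of a regular-expression stop word list: each pattern is matched
# against the whole word, so e.g. plain numbers and four-digit years can
# be filtered without listing every one of them.
import re

patterns = [re.compile(p) for p in (r"\d+", r"(19|20)\d\d")]

def is_stopword(word):
    return any(p.fullmatch(word) for p in patterns)
```

So "2009" or "42" would be dropped, while ordinary words pass through.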

To make it easy to use these stop word lists, you can create sets of lists and store such a set as a file again. This way, you can easily review the stop words you used for your analysis, and re-use the same set later on. You can also make one such set your default set of stop words.

This was a basic introduction to our Word Splitter tool. I hope you will like using it!

Friday, 12 June 2009

First post

Every blog needs an introductory post, and this one is no different. What is it about? What can we expect to hear? Why even bother blogging? Those are the kinds of questions both you and I would like to see answered. "You and I", I hear you wonder? Yes, because it is not completely clear to me either what exactly I will and will not write about yet. So let's start by giving some idea of what I am doing, and why that is interesting to keep a weblog about.

The Science System Assessment department of the Rathenau Instituut deals, among other things, with applying bibliometrics and patentometrics to map the dynamics of science and knowledge transfer. The problem our department quickly ran into was that the available tools that can deal with this kind of data are few and far between, require many manual, error-prone and labour-intensive steps, and don't fit together well. Worse, we soon ran into limitations in the amount of data we could handle with them, which started to affect our research.

So, we decided to build some tools ourselves. Seeing that the tools that were (and are) available are not open, we had to start from scratch. That presented both a challenge and an opportunity, because this way we could also rethink basic issues of how these tools should work. We decided on a design in which all tools work against standard relational databases, in which we structure the available data as well as possible. We also wanted the tools to be easy to use, so a clear graphical interface was a must. Since I have experience developing software with the excellent C++-based Qt toolkit, I chose that as the environment in which to build these tools. As an added bonus, cross-platform compatibility as well as database back-end independence come practically for free.

As the first tools became available in early versions, more and more ideas popped up about what else we could do and what we needed, and soon the idea of building some tools led to a complete toolkit that is still growing. Now the time has come to make these tools available to you too. The toolkit will officially be launched at the 5th conference of the National Centre for e-Social Science in Cologne. The first three tools will be released in their "1.0" or "ready to use" versions, while the rest are made available "as is". Because we would have liked to contribute to the existing tools but could not, we have decided to avoid the same issue with our initiative.

We would love to hear from you, and even better, to work with you to improve these tools! We expressly invite you to use them, test them, and improve on them. To make that possible, we are making all source code available under a liberal open source licence. We will also make a public issue tracker available, as well as a forum and other collaboration tools.

And that brings us to the why of this blog: we feel that it is important to keep you up to date with what is happening, what we are planning, and what others are doing with these tools. This blog is one of the ways in which we will do that. We are also working on a nice website, and a temporary website will be up soon. If you have other ideas about how to communicate, want your own blog aggregated, or have any other comments: don't hesitate to contact me. I'd love to hear from you!