Tuesday, 30 March 2010

Word stemming

One of the issues with working with (bibliometric) data is data cleaning. While we already have an experimental tool called Record Grouper that is being refactored at the moment, we also needed a quick-and-dirty way to do some basic cleaning on words.

To meet this need, I have added a new feature to the Word Splitter tool. The tool now builds an additional table called Wordstems, and it adds a field to the Words table that references a record in this Wordstems table. Each word is stemmed using the Porter stemming algorithm. I am looking into replacing this algorithm with another, more accurate one, but the idea will stay the same if that happens.

So now, each word gets its stem (according to the Porter algorithm) associated with it through the stem's ID number. This makes it easy to treat words that share a stem as the same word, so we can get rid of the difference between "robot", "robots", "robotic" and "robotics" if we want.
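To make the idea concrete, here is a minimal sketch of the Words/Wordstems relationship. It is not the Word Splitter's actual code: it uses Python with an in-memory SQLite database instead of MS Access, the column names (id, word, stem, stem_id) are my own assumptions, and toy_stem is a crude suffix stripper standing in for the real Porter algorithm.

```python
import sqlite3


def toy_stem(word):
    """Crude suffix stripper. A stand-in for the Porter algorithm, NOT an implementation of it."""
    word = word.lower()
    for suffix in ("ics", "ic", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_tables(conn, words):
    """Create hypothetical Wordstems and Words tables and link each word to its stem by ID."""
    conn.execute(
        "CREATE TABLE Wordstems (id INTEGER PRIMARY KEY, stem TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE Words (id INTEGER PRIMARY KEY, word TEXT, "
        "stem_id INTEGER REFERENCES Wordstems(id))"
    )
    for word in words:
        stem = toy_stem(word)
        # Each distinct stem is stored once; repeated stems are ignored.
        conn.execute("INSERT OR IGNORE INTO Wordstems (stem) VALUES (?)", (stem,))
        stem_id = conn.execute(
            "SELECT id FROM Wordstems WHERE stem = ?", (stem,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO Words (word, stem_id) VALUES (?, ?)", (word, stem_id)
        )


def counts_by_stem(conn):
    """Count words per stem, treating all words with the same stem ID as one word."""
    rows = conn.execute(
        "SELECT s.stem, COUNT(*) FROM Words w "
        "JOIN Wordstems s ON s.id = w.stem_id "
        "GROUP BY s.id ORDER BY s.stem"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
build_tables(conn, ["robot", "robots", "robotic", "robotics", "sensor", "sensors"])
print(counts_by_stem(conn))
```

With this sample list, the four "robot" variants end up under a single stem ID and the two "sensor" variants under another, so grouping on the stem ID counts them as two words instead of six. Swapping in a different stemmer only changes toy_stem; the table layout stays the same.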

On a note related to the last posting on database compatibility: while the Word Splitter still only works with MS Access, a lot has changed under the hood. I have implemented a driver system that extends the database drivers available in Qt by default and works around a couple of bugs along the way. Based on this, I have created an Access Driver that is now used by the Word Splitter tool. While it is not completely database independent yet, it is a good start and a good test for the new driver.