woensdag 24 juni 2009

Introducing: Word Splitter

In the second installment of the "Introducing" series, I will tell you something about our small utility called Word Splitter. The idea of the word splitter is simple. You point it to a field in your database that contains text, like a title or an abstract. The word splitter then builds up a table with all the words that occur in the database in that field, and a table with pointers between the record identifier that the text came from and the record identifier in the words table, so you can find which words belong to which text-record. To make it possible to re-construct the text based on the sentence, the position the word was in is also stored in that pointer table.

Our tool can do more than that though. First of all, it can process multiple text fields from your database at the same time, making it more efficient to work with for you. So, you can split both that title as well as that abstract simultaniously.

Furthermore, the Word Splitter can use stopwords. Stopwords are, in their basic form, lists of words that are ignored when splitting the text, for instance because they are too common. That means that if you use stopwords, not all words from the text will be stored in the Words table, nor will pointers occur in the couple table. However, for different purposes, you may want to think of the procedure in two ways. One option is to first split the complete text, then remove the stopwords, and then store the words and their positions after removing the stopwords to the database. This will result in consecutive word positions in the couple table, even if there used to be one or more stopwords between two words in the original text. Alternatively, you can split the complete text, note each word's position, remove the stopwords and only then store them into the database. This will result in word positions that reflect the original position of the word in the text, but leaves them non-consecutive.
To provide maximum flexibility in the analysis, both these positions are stored in the couple table.

The Word Splitter can use several stop word lists at once, and furthermore knows three kinds of lists. First, it can use simple text files that contain lists of words. Second, it can use a field in an existing database that also contains a list of words. And last, it can use lists of regular expressions. These expressions are patterns that each word is matched against, and if it matches, it is regarded as a stop word. That allows you to, for instance, filter out numbers or dates without having to write them all out.

To make it easy to use these stop word lists, you can create sets of these lists, and store such a set as a file again. This way, you can easily review the stop words you used for your analysis, and you can re-use the same set later on. You can also use one such a set as your default set of stop words.

This was a basic introduction to our Word Splitter tool. I hope you will like using it!

Geen opmerkingen:

Een reactie posten