Friday, 19 June 2009

Introducing: the ISI Data Importer

This is the first in a series of posts introducing the tools in the toolkit. I hope the series will give a clear overview of the kinds of tools we offer and what they do.

The ISI Data Importer imports bibliographic data that you have downloaded from ISI/Web of Knowledge. Web of Knowledge lets you download data on the articles returned by your searches in a text format. The ISI Data Importer reads these files and writes them to a structured database. Using structured databases is one of the basic ideas of the Science Research Toolkit. Housing the data in structured, standard databases allows us to use standard tools. Databases have been in development for decades and are quite efficient at many of the tasks we perform on the data. Getting the data into as structured a form as possible also gives us maximum flexibility.
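To make the idea of turning the text export into a structured database a bit more concrete, here is a minimal Python sketch. It is not the importer's actual code. It assumes the tagged export layout (a two-letter field tag at the start of a line, indented continuation lines, `ER` closing each record), and it uses SQLite as a stand-in for the Access backend. The sample records, table names and column names are all made up for illustration.

```python
import sqlite3

# Invented sample in the tagged export layout; not real bibliographic data.
SAMPLE = """\
FN ISI Export Format
VR 1.0
PT J
AU Smith, J
   Jones, A
TI An example title
DT Article
UT ISI:000000000000001
ER

PT J
AU Brown, B
TI Another title
DT Review
UT ISI:000000000000002
ER

EF
"""


def parse_records(text):
    """Yield one dict per record, mapping each field tag to a list of values."""
    record, tag = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, value = line[:2], line[3:].strip()
        if head == "ER":                    # end of record
            yield record
            record, tag = {}, None
        elif head in ("FN", "VR", "EF"):    # file header and footer lines
            continue
        elif head == "  ":                  # continuation of the previous tag
            if tag is not None:
                record[tag].append(value)
        else:                               # a new field starts here
            tag = head
            record.setdefault(tag, []).append(value)


def store(records, conn):
    """Write records into two hypothetical tables: article and author."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article "
        "(ut TEXT PRIMARY KEY, title TEXT, doctype TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS author (ut TEXT, position INTEGER, name TEXT)"
    )
    for rec in records:
        ut = rec["UT"][0]
        conn.execute(
            "INSERT INTO article VALUES (?, ?, ?)",
            (ut, " ".join(rec.get("TI", [])), " ".join(rec.get("DT", []))),
        )
        for pos, name in enumerate(rec.get("AU", []), start=1):
            conn.execute("INSERT INTO author VALUES (?, ?, ?)", (ut, pos, name))
    conn.commit()
```

Once the data is in tables like these, a question such as "which articles did this author contribute to" becomes a plain SQL query instead of a text-parsing exercise, which is the flexibility I mean.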

The interface of the ISI Data Importer is quite simple:
On the first tab, you select the input file or files. You can select as many files as you want, as long as they are located in a single directory. Because Web of Knowledge only allows you to download a maximum of 500 records at a time, you can end up with many separate files that each contain a fraction of your data. Simply select them all, and they will be imported in a single run. Duplicate records are filtered out automatically, so if you have created several sets whose results overlap, you will end up with a single, unified set without duplicate data points that could distort your similarity measures later on.
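The merging of several download files into one set without duplicates can be sketched roughly as follows. This is only an illustration in Python, not the importer's own logic: I am assuming here that records are matched on their `UT` identifier, and the function name and the sample records are invented.

```python
def merge_unique(batches, key="UT"):
    """Merge several batches of records, keeping the first copy of each key."""
    seen = set()
    merged = []
    for batch in batches:
        for record in batch:
            ident = record[key]
            if ident in seen:       # already imported from an earlier file
                continue
            seen.add(ident)
            merged.append(record)
    return merged


# Two invented, overlapping result sets: record "B" appears in both.
file_1 = [{"UT": "A", "TI": "First"}, {"UT": "B", "TI": "Second"}]
file_2 = [{"UT": "B", "TI": "Second"}, {"UT": "C", "TI": "Third"}]

merged = merge_unique([file_1, file_2])
```

Keeping the first copy and preserving the input order means the result does not depend on anything but the order in which you selected the files.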

On the output tab, you select an output file. Currently the only supported database backend is Microsoft Access, but we are working on adding other and better database backends. Access can be a bit limiting and slow, especially if you work with large datasets. The file you select does not need to exist yet; if it doesn't, it will be created for you.

Optionally, you can filter the data by document type. Some of the more frequently occurring document types are included in the list on the Filter page. If you are missing an option, let me know and I'll add it. Better yet: simply patch the list yourself; the source code is available!
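For those who prefer to see it in code, filtering on document type boils down to something like the Python sketch below. It is an assumption-laden illustration rather than the tool's implementation: it assumes the document type sits in a `DT` field, that a record can carry several types separated by semicolons, and that a record is kept when at least one of its types is selected. The function names and sample records are made up.

```python
def doctypes(record):
    """Return the set of document types listed in a record's DT field."""
    return {part.strip() for part in record.get("DT", "").split(";") if part.strip()}


def filter_by_doctype(records, allowed):
    """Keep records that have at least one document type in `allowed`."""
    allowed = set(allowed)
    return [r for r in records if doctypes(r) & allowed]


# Invented sample records.
records = [
    {"UT": "A", "DT": "Article"},
    {"UT": "B", "DT": "Editorial Material"},
    {"UT": "C", "DT": "Article; Proceedings Paper"},
]

kept = filter_by_doctype(records, {"Article", "Review"})
```

Adding a document type to the selection is then just a matter of adding one more string to the set.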

