Wednesday, 10 February 2010

Changes in ISI data importer: increased compatibility with other sources

Recently, I was contacted by somebody who tried to use SAINT on a dataset converted from PubMed to the ISI text format. That did not work. It turns out that the ISI data importer relied on an ISI-specific field to ensure records were unique. That field does not exist in the data the converter tool created, so the code was empty for every record and the application regarded all of them as the same article.

A small update fixed that. Instead of relying only on the ISI code field, the program now tries other options when that field is missing. It first checks whether the record has a valid DOI. A DOI is unique too, so if one is available, it is used as the unique identifier. If there is no DOI, an artificial code is generated from the journal name, the ISSN, the year, the full name of the first author and the title of the article. That combination should be unique in nearly all cases. Each code is prefixed with an identifier that tells you where it came from: "isi", "doi" or "saint".
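
The fallback order can be sketched roughly as follows. This is only an illustration in Python, not the importer's actual code: the dictionary keys (`isi_code`, `doi`, `journal`, `issn`, `year`, `first_author`, `title`), the simple DOI check and the use of a SHA-1 digest for the artificial code are all assumptions made for the example.

```python
import hashlib
import re


def looks_like_doi(value):
    """Rough validity check (assumption): '10.', a registrant code, '/', a suffix."""
    return bool(re.match(r"^10\.\d{4,9}/\S+$", value.strip()))


def make_unique_code(record):
    """Return a prefixed identifier for one record (a dict with hypothetical keys)."""
    # 1) Prefer the ISI-specific code when the record has one.
    isi_code = str(record.get("isi_code") or "").strip()
    if isi_code:
        return "isi:" + isi_code

    # 2) Otherwise fall back to the DOI, if it looks valid.
    doi = str(record.get("doi") or "").strip()
    if looks_like_doi(doi):
        return "doi:" + doi.lower()

    # 3) Otherwise build an artificial code from the bibliographic fields.
    #    Hashing is an assumption; it just keeps the code short and stable.
    keys = ("journal", "issn", "year", "first_author", "title")
    parts = [" ".join(str(record.get(k) or "").lower().split()) for k in keys]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:16]
    return "saint:" + digest
```

With this sketch, a record that has both an ISI code and a DOI gets an "isi:" code, while the same article from a source without the ISI code gets a "doi:" code.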

As a side effect of this change, I have also added a new field to the articles table output: the DOI is now included. That also creates an interesting matching opportunity with the cited references.

All very nice perhaps, but... The changes can lead to two problems:
1) If you import data from multiple sources, you may end up with duplicate records. This can happen if an article occurs in both an ISI source and another source. In the ISI source there will be an ISI identifier, so that is used as the unique code, even if there is a DOI field too. If the same article appears in another data source that does not have the ISI identifier, the DOI may be used as the identifier instead, and the two records are no longer recognized as the same article.

2) If you use this newest version of the ISI data importer to augment data imported with an earlier version, old articles will no longer be recognized as the same, because the previous version did not add the "isi" prefix to the code. If you plan to do that, you should prepend the string "isi:" (without the quotes) to every code field in the old data. You can do that by running the query below:

UPDATE Articles SET Articles.code = "isi:"+[Articles].[code];
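
Note that the query prefixes every row, so running it a second time would turn "isi:..." into "isi:isi:...". A guarded variant is sketched below in Python, using an SQLite database as a stand-in for the real one (that substitution is an assumption, as is the function name `add_isi_prefix`). Only the `Articles` table and its `code` column come from the query above.

```python
import sqlite3


def add_isi_prefix(conn):
    """Prefix old, unprefixed codes with 'isi:' and return the number of rows changed."""
    cur = conn.execute(
        "UPDATE Articles SET code = 'isi:' || code "
        "WHERE code IS NOT NULL "
        "AND code NOT LIKE 'isi:%' "   # already migrated
        "AND code NOT LIKE 'doi:%' "   # written by the new importer
        "AND code NOT LIKE 'saint:%'"  # written by the new importer
    )
    conn.commit()
    return cur.rowcount
```

Because rows that already carry one of the three prefixes are skipped, the function can be run repeatedly without damaging codes written by the new importer.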
