Tuesday, March 17, 2009

Perl script that generates CSV and BerkeleyDB versions of LTWA list

UPDATE: If you're wondering why I just didn't save the LTWA database and search it every time I needed to abbreviate a journal, it's because I wanted to optimize for speed downstream. That is, I did all the possible processing now to speed things up later. I've also augmented the script to do the opposite (and save the results in other smaller files) so that I can tell the downstream process to take longer (and possibly have better results with less spurious entries in the hash table). I've also updated the downstream script to save any successful lookups locally to speed up successive runs. Again, contact me if you want more details.
I'm sick of looking up ISO 4 standard journal abbreviations from the List of Title Word Abbreviations (LTWA) hosted at ISSN's LTWA online. The most annoying thing about LTWA online is that you can't get one big list unless you have them mail you a paper copy (for a price). So you have to resort to clicking each letter and waiting for the list for that letter to come up.

So I wrote a Perl script that automatically cURLs each LTWA online page down, processes it, and generates both CSV and BerkeleyDB (BDB) hash files containing a list of words and their associated official LTWA abbreviation. I use the BDB file in another script to automatically generate BibTeX database files for each of my journal papers (that script first checks a list of known-good journal abbreviations before trying to generate the abbreviation itself).

There were several challenges to such a task, and the list isn't perfect. I focused on one-word entries. For more complicated abbreviations, I figured I'd lean on my list of known-good journal abbreviations. That still left LTWA entries like "psycholog-" and "bulletin-" which use "-" to imply "and any other character." So I used a typical /usr/share/dict/words list to generate a list of English words that matched each pattern. Because such lists don't usually include plurals, I used Lingua::EN::Inflect to generate plurals and then took all of the plurals that included the singular (i.e., that would also match the LTWA pattern).

So that works well for me. Someday I might put the script and/or the files it produces on-line. For the moment, if you want any of these, contact me and let me know. I'll share.

No comments: