ISFDB:Author Names Cleanup

=Project Description=
The Author Names Cleanup project aims to find and eliminate mistyped, duplicate, poorly formatted or otherwise erroneous Author records.

=Sub-projects=

Questionable Suffixes
"Authors.pl" is a Perl script which searches a flat file of all ISFDB Author names extracted from the MySQL database for unusual suffixes. "Suffixes" are defined as any characters following a comma. "Usual" suffixes are defined as:


 * Sr.
 * Jr.
 * II
 * III
 * IV
 * Ph.D.
 * M.D.

Script
use strict;

my $mainfile = "c:/ISFDB/Authors.txt";
open(AUTHORS, $mainfile) || die("can't open file $mainfile");
#foreach (@lines) {
while (<AUTHORS>) {
    my $string = $_;
    # Put the suffix (anything after ",") into $suffix[1]
    my @suffix = split /,\s*/, $string;
    # If there is more than 1 comma, then there is an error
    if ($#suffix > 1) {
        print $string;
        next;
    }
    # If there is a suffix, check if it's in the list of approved suffixes
    # (every alternative is anchored at both ends and every "." is escaped)
    if ($#suffix == 1) {
        if (!($suffix[1] =~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/)) {
            print "$string";
        }
        next;
    }
    next;
}
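
For anyone without Perl handy, here is a rough Python 3 sketch of the same rule. It is not a line-for-line port of the script above: the function name `suffix_problem` and the labels it returns are made up for illustration, and the sample names are taken from the lists on this page.

```python
import re

# The "usual" suffixes listed above.
USUAL_SUFFIXES = {"Sr.", "Jr.", "II", "III", "IV", "Ph.D.", "M.D."}


def suffix_problem(name):
    """Return a short label if the name looks questionable, else None.

    Hypothetical helper: anything after the first comma is treated as the
    suffix, and more than one comma is always reported.
    """
    parts = re.split(r",\s*", name.strip())
    if len(parts) > 2:
        return "more than one comma"
    if len(parts) == 2 and parts[1] not in USUAL_SUFFIXES:
        return "unusual suffix"
    return None


if __name__ == "__main__":
    for sample in ["Francis M. Nevins, Jr.", "Neal Barrett, Jr",
                   "Hugh J., Jr. Luke",
                   "Arlan Andrews, Kris Andrews, Joe Giarratano"]:
        print(sample, "->", suffix_problem(sample))
```

Comparing against a set of exact strings avoids the regexp escaping problems altogether, at the cost of being stricter about spacing inside the suffix.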

Identified questionable suffixes
Results as of 09/10/06 when run against the 08/27/06 ISFDB backup file using ActivePerl/Windows XP: (solved cases moved below)

John Pierce, M.S. Epaminondas T. Snooks, D.T.G. Ronald V. Dorn, Jr. M.D. Arthur W. Weir, D.Sc. Zuprik-Curtis Enterprises, Inc. Universal City Studios, Inc. Ben R., Ph.D. Games Neal Barrett, Jr Yvonne, Fern Solow Peter, Dr. Beckmann Charles, Waugh Arlan Andrews, Kris Andrews, Joe Giarratano Jenifer, A. Ruth O. David, Dr West Riley W., Jr Sanson Hiccup Horrendous, III Haddock Louis, Jr. Porter Joseph S., Jr. Nye Rob, Jr Potchak Jr, Bill Martin Normand, R. Bernier Wilson, Tortosa Robert S., Jr. Sanders Douglas M., Sir Price Joseph, Jr Covino Lovelee, I. Dagum John, A. Hall James, Sir Knowles Mark, Edward Hall MJ Studios, cover art Jim Seward Seton Hall University, Dr. Dermot Quinn David, Niall Wilson Jimmie E., Jr. Cain Todd, F. Davis Hugh J., Jr. Luke

Cleaned up Suffixes

 * 1) Neal, Jr. Barrett - just merged into  There seem to remain also  and  although they are empty; so - merge them too?
 * 2) Mishima, Yukio - just merged into
 * 3) Gordon, R. Dickinson - just removed from a stray pub with Gordon R. Dickson
 * 4) Chesterton scholar, Aidan Mackey - merged into  several weeks ago
 * 5) Richard Gilliam, Wendy Webb, Edward E. Kramer, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Thomas R. Hanlon; Richard Gilliam, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Kathleen M. Massie-Ferch; Janet Berliner, Uwe Luserke, Martin H. Greenberg; Richard Gilliam, Martin H. Greenberg, Edward E. Kramer - seem to have been split a long time ago as they are no longer present
 * 6) Brenda, W. Clough; Michael, Moorcock - long gone

Is there a use for documenting this so meticulously, or would it be enough just to delete from the list above? --JVjr 06:50, 10 Mar 2007 (CST)


 * "PhD, C. Malcolm Trowbridge" changed to "C. Malcolm Trowbridge, Ph.D." and "Ph.D, Sandra Eubanks" merged with "Sandra Eubanks, Ph.D." BLongley 11:20, 31 Mar 2007 (CDT)


 * I checked a few more: we've lost these and I suspect in these ways:

Michael, W. Perry -> Michael W. Perry
Arlan Keith Andrews, Sr -> pseudonym
John C. Wright, Esq. -> pseudonym
Roscoe Clark, F.R.C.S. -> pseudonym
Evelyn A. Archer, P.I. -> pseudonym
The 1992 James Tiptree, Jr Award Judges -> pseudonym
The 1995 James Tiptree, Jr Award Judges -> pseudonym
J. A, Lawrence -> J. A. Lawrence
Rockwell, Carey -> Carey Rockwell
Mary, H Herbert -> Mary H. Herbert
James, White -> James White
Zora, N. Hurston -> Zora Neale Hurston
Francis M., Jr. Nevins -> Francis M. Nevins, Jr.
Dora and MacGregor,Eleanor Pantell -> Dora Pantell, Ellen MacGregor
Stuart, Gordon -> Stuart Gordon
Stanley Grauman, Weinbaum -> Stanley Grauman Weinbaum, pseudonym
Walter, Jr. Wangerin -> Walter Wangerin, Jr.
Roberta Carter, Rogers, Jacqueline Clark -> Roberta Carter Clark, Jacqueline Rogers
William, F. Nolan -> William F. Nolan
W., Rev Awdry -> Reverend W. Awdry
Emmett O., III Saunders -> Emmett O. Saunders, III
Esme Nichola Author Winter, Barbara Illustrator Shilletto -> Esme Nichola Shilletto, Barbara Winter
Mike, Jr. Deodato -> Gone?
Kenneth, Jr. Faig -> Kenneth W. Faig Jr.
Philip Harbottle, Editor -> Philip Harbottle
Michael Simon Bodner, PhD -> Michael Simon Bodner, Ph.D.
Richard, J. O'brien -> Richard J. O'Brien
D., M. Brown -> D. M. Brown
 * BLongley 16:52, 12 Apr 2007 (CDT)

Suspected Duplicate Author Names
Need to develop a specification and write a script, possibly multiple scripts depending on what kinds of requirements we will come up with. Also, we will need a way of finding out whether, e.g., Nancy Farmer the popular children's author is the same person as Nancy Farmer the co-author of Update One - Federal Fisheries Management: A Guidebook to the Magnuson Fishery Conservation and Management Act (1987). Please discuss on the Talk page.

Is this the right place to list authors I think are probably duplicates? I suspect that Robert Boyer is the same as Robert H. Boyer, likewise the two Zahorskies. WimLewis 18:01, 15 Mar 2007 (CDT)
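
As a starting point for discussion, one possible heuristic (an assumption, not an agreed specification) is to group names that share the same first and last word once any comma suffix is dropped. That would pair up Robert Boyer with Robert H. Boyer. The Python 3 sketch below only produces candidates for a human to review; it cannot tell whether two identically named records, such as the two Nancy Farmers, are the same person. The names `name_key` and `duplicate_candidates` are hypothetical.

```python
from collections import defaultdict


def name_key(name):
    """Reduce a name to (first word, last word), ignoring case.

    Anything after the first comma (e.g. ", Jr.") is dropped, so middle
    names, initials and suffixes do not affect the key.
    """
    words = name.split(",")[0].split()
    if not words:
        return ("", "")
    return (words[0].lower(), words[-1].lower())


def duplicate_candidates(names):
    """Return sorted groups of distinct names that share a key."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


if __name__ == "__main__":
    sample = ["Robert Boyer", "Robert H. Boyer", "Nancy Farmer", "James White"]
    for group in duplicate_candidates(sample):
        print(" / ".join(group))
```

A key this loose will also lump together people who really are different (any two authors sharing a first and last name), so the output is a list to check by hand, not a merge list.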

Anonymous, uncredited, etc.
Originally compiled by Marc Kupper 14:10, 16 Nov 2006 (CST):

Malformed URLs
Some Author records, e.g. Tad Williams', have bad URLs in the "Web page" field, which were possibly created by the ISFDB1-to-ISFDB2 conversion. We need to find and fix them. Ahasuerus 19:07, 27 Dec 2006 (CST)
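
The exact damage is not described here, so the following Python 3 sketch is only a guess at a first-pass filter. It flags a non-empty "Web page" value that contains whitespace, lacks an http/https scheme, or has no dot in its host part. The function name `looks_malformed` and the example.com URLs are placeholders, not real ISFDB data.

```python
from urllib.parse import urlsplit


def looks_malformed(url):
    """Heuristic check for a suspicious "Web page" value.

    Assumptions (not taken from the ISFDB schema): an empty field is
    acceptable, and only http/https links are expected.
    """
    url = url.strip()
    if not url:
        return False          # nothing to check
    if any(ch.isspace() for ch in url):
        return True           # embedded whitespace
    try:
        parts = urlsplit(url)
    except ValueError:
        return True           # urlsplit rejects e.g. unbalanced "[" in the host
    if parts.scheme not in ("http", "https"):
        return True
    return "." not in parts.netloc


if __name__ == "__main__":
    for sample in ["http://www.example.com/",
                   "www.example.com",
                   "http://http://www.example.com"]:
        print(sample, "->", looks_malformed(sample))
```

Once a few real bad URLs have been collected, the checks can be tightened to match whatever the conversion actually did to them.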

Questionable First/Middle/Last Names
The following script was developed in a hurry and needs to be cleaned up. It assumes that "c:/ISFDB/Authors.txt" is a dump of the MySQL Author table. Ahasuerus 22:07, 9 Apr 2007 (CDT)

use strict;

my $mainfile = "c:/ISFDB/Authors.txt";
my $first  = '[A-Z]{1}[-\'a-z]{1,25}\s';
my $last   = "[A-Z]{1}[-a-z']{1,25}";
my $init   = '[A-Z]{1}\. ';
my $middle = '[A-Z]{1}[-a-z]{1,20}\s';
#my $space = '\s';

open(AUTHORS, $mainfile) || die("can't open file $mainfile");
while (<AUTHORS>) {
    my $string = $_;
    chomp $string;    # drop the newline so every name prints the same way
    # Put the suffix (anything after ",") into $suffix[1]
    my @suffix = split /,\s*/, $string;
    # If there is more than 1 comma, then there is an error
    if ($#suffix > 1) {
        next;
    }
    # If there is a suffix, check if it's in the list of approved suffixes
    if ($#suffix == 1) {
        if (!($suffix[1] =~ /^(?:Sr\.|Jr\.|II|III|IV|Ph\.D\.|M\.D\.)$/)) {
            next;
        }
    }
    $_ = $suffix[0];
    # Skip names that follow one of the expected shapes
    next if /^($first)($last)$/;                 # FirstName LastName
    next if /^($init)($last)$/;                  # Initial. LastName
    next if /^($init)($middle)($last)$/;         # Initial. MiddleName LastName
    next if /^($first)($init)($last)$/;          # FirstName Initial. LastName
    next if /^($first)($init)($init)($last)$/;   # FirstName Initial1. Initial2. LastName
    next if /^($init)($init)($init)($last)$/;    # Initial1. Initial2. Initial3. LastName
    next if /^($init)($init)($last)$/;           # Initial1. Initial2. LastName
    next if /^($first)($middle)($last)$/;        # FirstName MiddleName LastName
    my @word = split /\s+/;
    my $lname = $word[$#word];
    my $fname = $word[0];
    # if ($lname =~ /      (the rest of this test is missing from the page)
    print "$_\n";
    next;
}
# Disabled checks (wrapped in a =pod block in the original):
#     print "$_" if / II$/;
#     print "$_" if / III$/;
#     print "$_" if / Jr.$/;

Here's my stab at a Python version of an author names script. It requires you to have the Python MySQL module installed so that it can retrieve the author names from the db. You'll need to edit the first few lines as appropriate for your setup.

It's short because most of the complexity has been compressed into that one big regexp. Essentially it's just flagging any name that doesn't follow a simple pattern of (several-names-or-initials optional-byname-particle lastname optional-suffix-with-comma). It flags plenty of perfectly valid names, but it narrows things down enough that it's easy to scan the list by eye. --WimLewis 02:59, 10 Apr 2007 (CDT)


# If your mysql module isn't installed in the system path, include the path here
# import sys
# sys.path.append('/Users/Shared/wiml/pmysql/MySQL-python-1.2.1_p2/build/lib.darwin-8.9.0-Power_Macintosh-2.3')

import MySQLdb
import re

conn = MySQLdb.connect( user='root', db='isfdb' )
sess = conn.cursor()

sess.execute('SELECT a.author_id, a.author_canonical FROM authors a;')

# names-or-initials, optional byname particle, last name, optional ", suffix"
auname = re.compile(r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )[A-Z][A-Za-z]+(?:, [A-Za-z\.]+)?")

# Print the id and name of every author whose name does not match the pattern
while 1:
    row = sess.fetchone()
    if row is None:
        break
    (oid, name) = row
    if not auname.match(name):
        print("%s %s" % (oid, name))
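
The pattern can also be tried out without a database. The Python 3 sketch below copies the `auname` regexp from the script above and runs it over a few names that appear on this page. The helper `is_flagged` and its `strict` switch are additions for illustration only. One thing worth knowing: `re.match` anchors the pattern only at the start of the string, so a name with a well-formed beginning and extra text after it is not flagged, whereas `fullmatch` (the `strict` variant) requires the whole name to fit.

```python
import re

# Same pattern as the script above, split across lines for readability.
AUNAME = re.compile(
    r"(?:(?:[A-Z]\.|[A-Z][a-z]+) )*"            # names or initials
    r"(?:|[Dd]e ?|[Dd]u ?|O'|Mac|Mc|[Vv]on )"   # optional byname particle
    r"[A-Z][A-Za-z]+"                           # last name
    r"(?:, [A-Za-z\.]+)?"                       # optional suffix with comma
)


def is_flagged(name, strict=False):
    """Return True if the name would be reported.

    strict=False mirrors the script's use of match(); strict=True is a
    hypothetical variant that requires the entire name to match.
    """
    matcher = AUNAME.fullmatch if strict else AUNAME.match
    return matcher(name) is None


if __name__ == "__main__":
    for sample in ["Robert H. Boyer", "Neal Barrett, Jr.",
                   "Walter, Jr. Wangerin", "W., Rev Awdry"]:
        print(sample, is_flagged(sample), is_flagged(sample, strict=True))
```

With these samples, "Walter, Jr. Wangerin" slips through the default check because "Walter, Jr." alone satisfies the pattern, but it is caught by the strict variant.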