ISFDB talk:Data Consistency/Serial Mismatches

Some of these don't look like problems. "New Destinies" IS (well, WAS) a magazine, but to allow for the multiple printings, we've had to class it as "Anthology". No consistency problem there really. BLongley 15:50, 18 Sep 2007 (CDT)

Using "Serial" to group several pieces within the same book looks wrong, though - abuse "Series" for that, I'd say, unless they really are reprints of magazine entries. BLongley 15:50, 18 Sep 2007 (CDT)

"(Excerpt)" looks wrong too, unless they're numbered excerpts leading to a whole work. BLongley 15:50, 18 Sep 2007 (CDT)

"(Complete Novel)" is a known problem: I know that's what it says in many magazines, but reprints of the exact same work don't cut the mustard by today's standards. Create "Novella"/"Novelette" variants of the magazine entries before inclusion in works where they're considered short fiction, maybe? Or make the "(Complete Novel)" titles variants of what we'd call them now? BLongley 15:50, 18 Sep 2007 (CDT)

I'm not sure where this is going, but I'd sort this lot by title. "Journal from Ellipsia" still confuses me with double entries here. BLongley 15:50, 18 Sep 2007 (CDT)


 * Posting the Python code here as per Bill's request:

 bad_title = 0
 bad_pub = 0
 bad_xref = 0
 mismatches = {}
 exceptions = {}
 
 import MySQLdb
 db = MySQLdb.connect('localhost', username, password, 'ISFDB')
 xref_cursor = db.cursor()
 title_cursor = db.cursor()
 pubs_cursor = db.cursor()
 xref = []
 titles = {}
 pubs = {}
 
 response = xref_cursor.execute("select * from pub_content;")
 i = 0
 while 1:
     # retrieve one row of pub_content data
     row = xref_cursor.fetchone()
     # break out of the loop when we finish the last retrieved row
     if row == None:
         break
     i = i + 1
     if i % 25000 == 0:
         print "Imported ", i, " cross-reference records"
     #xref_id = int(row[0])
     if type(row[1]) is not long:
         bad_xref = bad_xref + 1
         continue
     title_id = int(row[1])
     if type(row[2]) is not long:
         bad_xref = bad_xref + 1
         continue
     pub_id = int(row[2])
     xref.append([title_id, pub_id])
 print "Retrieved ", len(xref), " cross-reference records"
 * 1) initialize the variables that will keep track of different type mismatches
 * 2) import the MySQLdb module that will allow Python to access MySQL
 * 3) establish a connection to MySQL and call it 'db'
 * 4) create a cursor for the pub_content table
 * 5) create a cursor for the titles table
 * 6) create a cursor for the pubs table
 * 7) initialize the list that will hold xref information
 * 8) initialize the dictionary that will hold title information
 * 9) initialize the dictionary that will hold publication information
 * 10) retrieve all rows in the pub_content (cross-reference) table
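The cursor/execute/fetchone pattern described in the steps above follows the standard Python DB-API (note that cursor(), fetchone() and close() are method calls). A self-contained sketch of the same loop shape using the standard library's sqlite3 module, since it speaks the same DB-API; MySQLdb behaves analogously, and the table and rows here are made up purely for illustration:

```python
import sqlite3

# In-memory stand-in for the pub_content cross-reference table
# (illustrative schema, not the real ISFDB one).
db = sqlite3.connect(":memory:")
cur = db.cursor()                  # cursor() is a method call
cur.execute("create table pub_content (title_id integer, pub_id integer)")
cur.executemany("insert into pub_content values (?, ?)", [(1, 10), (2, 11)])

cur.execute("select title_id, pub_id from pub_content")
xref = []
while True:
    row = cur.fetchone()           # fetchone() returns None after the last row
    if row is None:
        break
    xref.append([row[0], row[1]])
db.close()                         # close() is also a method call
print(xref)                        # [[1, 10], [2, 11]]
```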

 response = title_cursor.execute("select title_id,title_ttype,title_title from titles;")
 i = 0
 while 1:
     # retrieve one row of titles data
     row = title_cursor.fetchone()
     # break out of the loop when we finish the last retrieved row
     if row == None:
         break
     i = i + 1
     if i % 25000 == 0:
         print "Imported ", i, " title records"
     title_id = int(row[0])
     titles[title_id] = row[1]
 
 response = pubs_cursor.execute("select pub_id,pub_ctype,pub_title from pubs;")
 i = 0
 while 1:
     # retrieve one row of publications data
     row = pubs_cursor.fetchone()
     # break out of the loop when we finish the last retrieved row
     if row == None:
         break
     i = i + 1
     if i % 25000 == 0:
         print "Imported ", i, " publications records"
     pub_id = int(row[0])
     pubs[pub_id] = row[1]
 * 1) now retrieve all rows in the titles table
 * 2) now retrieve all rows in the pubs table

 i = 0
 for index in range(len(xref)):
     i = i + 1
     if i % 25000 == 0:
         print "Analyzed ", i, " cross-reference records"
     xref_record = xref[index]
     title_id = xref_record[0]
     pub_id = xref_record[1]
     title_type = titles.get(title_id, "bad key")
     pub_type = pubs.get(pub_id, "bad key")
     if title_type == "bad key":
         bad_xref = bad_xref + 1
         continue
     if pub_type == "bad key":
         bad_xref = bad_xref + 1
         continue
     if title_type == "CHAPTERBOOK":
         mismatches["chapterbooks"] = mismatches.get("chapterbooks", 0) + 1
         continue
     if pub_type == title_type:
         continue
     if (pub_type == "None") or (pub_type == ""):
         mismatches["missing publication type"] = mismatches.get("missing publication type", 0) + 1
         continue
     if title_type in ("ESSAY", "REVIEW", "COVERART", "INTERIORART", "BACKCOVERART", "INTERVIEW"):
         continue
     if title_type == "OMNIBUS":
         if pub_type != "OMNIBUS":
             mismatches["omnibuses"] = mismatches.get("omnibuses", 0) + 1
         continue
     if (title_type == "ANTHOLOGY") and (pub_type != "ANTHOLOGY"):
         mismatches["anthologies"] = mismatches.get("anthologies", 0) + 1
         continue
     if title_type == "EDITOR":
         if (pub_type != "MAGAZINE") and (pub_type != "FANZINE"):
             mismatches["editor records"] = mismatches.get("editor records", 0) + 1
         continue
     if title_type == "SERIAL":
         if (pub_type != "MAGAZINE") and (pub_type != "FANZINE"):
             mismatches["serials"] = mismatches.get("serials", 0) + 1
             exceptions["serials", (title_id, pub_id)] = ""
         continue
     if (title_type == "COLLECTION") and (pub_type != "COLLECTION"):
         mismatches["collections"] = mismatches.get("collections", 0) + 1
         continue
     if title_type == "NOVEL":
         if pub_type == "NOVEL":
             continue
         if pub_type == "OMNIBUS":
             continue
         if (pub_type == "MAGAZINE") or (pub_type == "FANZINE"):
             mismatches["novel/magazines"] = mismatches.get("novel/magazines", 0) + 1
             continue
         mismatches["novels (other)"] = mismatches.get("novels (other)", 0) + 1
         continue
     if title_type == "SHORTFICTION":
         if (pub_type == "NONFICTION") or (pub_type == "NOVEL"):
             mismatches["short fiction"] = mismatches.get("short fiction", 0) + 1
         continue
     if title_type == "POEM":
         if pub_type == "NONFICTION":
             mismatches["poems"] = mismatches.get("poems", 0) + 1
         continue
     if title_type == "NONFICTION":
         if pub_type != "NONFICTION":
             mismatches["nonfiction"] = mismatches.get("nonfiction", 0) + 1
         continue
     if title_type == "NONGENRE":
         if pub_type != "NOVEL":
             mismatches["nongenre"] = mismatches.get("nongenre", 0) + 1
         continue
     mismatches["uncategorized"] = mismatches.get("uncategorized", 0) + 1
     #print title_data[0][1], title_type, " vs. ", pub_data[0][1], pub_type
 * 1) iterate over the xref list and compare title/publication type

 print "Bad title records: ", bad_title
 print "Bad publication records: ", bad_pub
 print "Bad cross-references: ", bad_xref
 
 # print the SERIAL exceptions as rows of a wiki table
 for key in exceptions:
     title_id = key[1][0]
     pub_id = key[1][1]
     response = title_cursor.execute("select title_ttype,title_title from titles where title_id = " + str(title_id) + ";")
     row1 = title_cursor.fetchall()
     response = pubs_cursor.execute("select pub_ctype,pub_title,pub_tag from pubs where pub_id = " + str(pub_id) + ";")
     row2 = pubs_cursor.fetchall()
     print "|-"
     print "|" + str(title_id) + " " + str(row1[0][1])
     print "|" + str(row1[0][0])
     print "|" + str(row2[0][2]) + " " + str(row2[0][1])
     print "|" + str(row2[0][0])

 for key in mismatches:
     print "Mismatched", key, ":", mismatches.get(key, 0)

 db.close()
 * Ahasuerus 23:45, 18 Sep 2007 (CDT)
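 * The long chain of per-type ifs in the comparison pass can also be expressed as a lookup table mapping each title type to the publication types it may legitimately appear in. A minimal modern-Python sketch, not the script above - the ALLOWED table here is an illustrative subset of the rules, and the sample rows are made up:

```python
from collections import Counter

# Illustrative subset of the rules the script encodes as nested ifs:
# for each title type, the publication types it may appear in.
ALLOWED = {
    "NOVEL": {"NOVEL", "OMNIBUS", "MAGAZINE", "FANZINE"},
    "SERIAL": {"MAGAZINE", "FANZINE"},
    "ANTHOLOGY": {"ANTHOLOGY"},
    "COLLECTION": {"COLLECTION"},
}

def tally_mismatches(xref, titles, pubs):
    """Count cross-references whose title type is not allowed in the pub type."""
    mismatches = Counter()
    for title_id, pub_id in xref:
        title_type = titles.get(title_id)
        pub_type = pubs.get(pub_id)
        if title_type is None or pub_type is None:
            mismatches["bad cross-reference"] += 1
        elif pub_type not in ALLOWED.get(title_type, {pub_type}):
            # unknown title types default to "anything goes" in this sketch
            mismatches[title_type.lower()] += 1
    return mismatches

# Toy data: one SERIAL title sitting in an ANTHOLOGY publication.
titles = {1: "SERIAL", 2: "NOVEL"}
pubs = {10: "ANTHOLOGY", 11: "NOVEL"}
print(tally_mismatches([(1, 10), (2, 11)], titles, pubs))
# -> Counter({'serial': 1})
```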


 * Some of the titles appear to be correct. The titles have appeared in serial form in collections and anthologies first, but may not have been published as complete novels. So their listing as a serial may be correct; they just have not been published in the usual magazine form. Kraang 06:12, 20 Sep 2007 (CDT)


 * The point about sorting by title is well taken; I will add it when I have a moment. I will also add a Status column, where we could enter "FIXED" as we update the data. Ahasuerus 23:44, 19 Sep 2007 (CDT)


 * The table has been resorted by title and a "Fixed?" column has been added. However, my attempts to teach myself Python turned out to be more painful than expected (something to do with old dogs and new tricks), so I ended up converting the MySQL database to a more familiar format and parsing it using vintage 1960s technologies. I'll attack other type mismatches shortly. Ahasuerus 23:27, 29 Sep 2007 (CDT)