Book List Processing

A series of processing steps was employed to generate the book entries in the initial database for the Thomas Jefferson’s Libraries project. These steps are summarized below.

1. Book entries found in manuscript book lists, print bibliographies, and sale catalogues were transcribed in word-processing software (Microsoft Word).

2. These page transcriptions in Word were converted into XML (Extensible Markup Language) by a script, creating XML versions of each of the book lists and catalogues, including the five-volume Sowerby Catalogue.

3. Key fields from the Sowerby Catalogue (for example, title, author, and publisher) were harvested programmatically to form a Provisional Master List of book entries in PubMan, a content management system used to organize and store all of the XML data generated for the project.

4. Custom-built tools were developed in PubMan to link the individual entries in each book list to their corresponding book entries in the Provisional Master List, where such entries existed. Where no corresponding book entry was found, a new record was created in the Provisional Master List to represent that book entry.

5. Once an initial linking process was completed, a merge process was developed to combine all of the book entries from the various book lists and catalogues to form a Master List of book entries in PubMan. This Master List is made up of Master Records. Each Master Record consists of key bibliographic information that identifies the book, together with machine-generated URLs (uniform resource locators or links) to wherever this title is found in each of the book lists and catalogues.

6. Another custom script was built to harvest the chapter headings or subject classifications Jefferson assigned to titles in his manuscript book lists. These headings were then used to automatically populate the Jefferson classification or subject fields found in the database.

7. MARC (Machine-Readable Cataloging) records, consisting of bibliographic descriptions and Library of Congress Subject Headings for extant copies of Jefferson’s books at the Library of Congress and the University of Virginia, or for imprints similar to books we know Jefferson owned, were matched against the Master Record entries. These MARC records were downloaded from WorldCat and other online catalogs and converted into MODS (Metadata Object Description Schema) XML records using MarcEdit, an open source bibliographic utility. The MODS records were then processed through a custom-built MODS Processor tool and added to our PubMan content management system. Once the MODS records were available in PubMan, they were linked to and then merged with their matching Master Records, enriching those records with copy-specific information obtained from WorldCat and other online catalogs. The resultant database was made available to the public in early 2008.

8. Since late 2008, efforts have focused on migrating the book entries from the initial database into LibraryThing. Book records for the portion of Jefferson's library that was sold to Congress in 1815 were first created in LibraryThing in 2007 by a group of LibraryThing volunteers using the Sowerby Catalogue and Trist Catalogue. These records are now being expanded and enriched to include links to page images and transcriptions of Jefferson's manuscript book lists, information on when Jefferson owned a title and the various editions he owned, and information on the location of the extant copy, if known. Additional records have been, and are being, added to the LibraryThing database to cover books Jefferson owned and knew about throughout his lifetime, not just the books sold to Congress in 1815. The project database currently includes over 5,000 book entries and is expected to grow to between 8,000 and 9,000 entries as the project continues.
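
The linking and merging described in steps 4 and 5 can be illustrated with a minimal Python sketch. It is not the project's PubMan tooling: the ListEntry and MasterRecord shapes, the match_key normalization, and the example.org base URL are all assumptions made for illustration, and exact matching on a normalized title and author stands in for whatever matching the custom tools actually performed.

```python
from dataclasses import dataclass, field


@dataclass
class ListEntry:
    """One transcribed entry in a single book list (hypothetical shape)."""
    list_id: str   # which list or catalogue the entry comes from
    entry_id: str  # identifier of the entry within that list
    title: str
    author: str = ""


@dataclass
class MasterRecord:
    """Key bibliographic data plus links back to every matching list entry."""
    title: str
    author: str
    urls: list = field(default_factory=list)


def match_key(title, author):
    """Crude matching key (an assumption): lowercase, drop punctuation,
    collapse whitespace."""
    def norm(text):
        kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
        return " ".join(kept.split())
    return (norm(title), norm(author))


def entry_url(entry, base="https://example.org/lists"):
    """Machine-generated link to an entry; the URL pattern is hypothetical."""
    return f"{base}/{entry.list_id}#{entry.entry_id}"


def build_master_list(entries):
    """Link each entry to an existing master record, or create a new one
    when no match is found, and record a URL for every linked entry."""
    master = {}
    for entry in entries:
        key = match_key(entry.title, entry.author)
        record = master.get(key)
        if record is None:
            # No corresponding book entry yet: create a new master record.
            record = MasterRecord(title=entry.title, author=entry.author)
            master[key] = record
        record.urls.append(entry_url(entry))
    return master
```

Passing entries from two lists that describe the same title yields a single master record carrying two URLs, one per list, which mirrors the structure of a Master Record described in step 5.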

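The programmatic harvesting of key fields described in step 3 might look something like the following Python sketch. The element names (catalogue, entry, title, author, publisher) and the sample XML are invented for illustration and do not reflect the project's actual XML schema or data; only the standard-library xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

# Made-up sample in a hypothetical schema, not the project's real markup.
SAMPLE_XML = """
<catalogue id="sowerby">
  <entry id="s1">
    <title>Notes on the State of Virginia</title>
    <author>Jefferson, Thomas</author>
    <publisher>John Stockdale</publisher>
  </entry>
  <entry id="s2">
    <title>The Federalist</title>
    <author>Hamilton, Alexander</author>
  </entry>
</catalogue>
"""

# Fields to harvest from each entry (assumed names).
KEY_FIELDS = ("title", "author", "publisher")


def harvest_key_fields(xml_text):
    """Return one dict per <entry>, holding its id and the key fields.
    A field missing from the XML is stored as an empty string."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for entry in root.iter("entry"):
        record = {"id": entry.get("id")}
        for name in KEY_FIELDS:
            record[name] = (entry.findtext(name) or "").strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for record in harvest_key_fields(SAMPLE_XML):
        print(record)
```

Storing missing fields as empty strings, rather than omitting them, keeps every harvested record the same shape, which simplifies any later matching against other lists.
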