Monday, December 19, 2011

The Challenge of Digital Formats

Did you realize that our online PDF family currently has fifteen separate descriptions, including entities like versions 1.3, 1.4, 1.6, and 1.7, and the multiple flavors of PDF/A (three now, four more on the way)? See Digital Formats, part 1: Lots of ‘Em and More to Come from The Signal, Digital Preservation, the blog of the digital preservation section of the Library of Congress. The real concern about the multiplicity of file formats is the viability of any one format.

As genealogists, we run into this problem frequently with people who did some "genealogy" on their computer a few years or more ago and now expect to be able to resurrect the files. I recently dealt with a situation involving an old Personal Ancestral File backup file from a damaged hard disk. The file would not open with Ancestral Quest or any of the current programs, but I was able to open it as a text file and thereby make the information in the older file available. An alternative would have been to find an older program that might have retrieved the data.
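For anyone curious what "opening the file as a text file" can look like in practice, here is a minimal sketch in Python. It is not what I actually did with the PAF file, and it knows nothing about the PAF format; it simply pulls out every run of readable characters from a file, assuming the names and places were stored as plain ASCII text. The function names and the file name in the usage note are my own inventions for illustration.

```python
import re
from pathlib import Path


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_len bytes long.

    Assumption: the readable content (names, dates, places) is stored as
    plain ASCII somewhere inside the otherwise binary file.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def extract_text_from_file(path: str, min_len: int = 4) -> list[str]:
    """Read a file as raw bytes and return its readable text runs."""
    return extract_text_runs(Path(path).read_bytes(), min_len)
```

Calling `extract_text_from_file("old_backup.bak")` (a hypothetical file name) would give a list of readable fragments to sort through by hand. This only works because the format is transparent enough to read; a compressed or encrypted file would yield nothing useful.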

As time passes, this issue becomes more and more pertinent to what we do every day. Is my particular genealogy program's file format going to survive? And if so, how long? But if you do not know, even with your present programs, what kind of file format you are using, then what chance have you to know if the format is going to persist?
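If you do not know what kind of file you have, the first few bytes often give it away, since many formats begin with a fixed signature. The sketch below is a rough illustration only: the short table of signatures is my own selection, the function name is made up, and a real identification tool checks far more than this. It assumes a PDF begins with "%PDF-" followed by its version number and that a GEDCOM file begins with "0 HEAD" (some GEDCOM files have extra bytes in front, which this sketch would miss).

```python
# A small, hand-picked table of well-known file signatures (not exhaustive).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP container (for example .docx)"),
    (b"0 HEAD", "GEDCOM"),
]


def guess_format(header: bytes) -> str:
    """Guess a file's format from its first few bytes."""
    if header.startswith(b"%PDF-"):
        # The three bytes after "%PDF-" normally hold the version, e.g. "1.4".
        version = header[5:8].decode("ascii", errors="replace")
        return f"PDF {version}"
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def guess_file_format(path: str) -> str:
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return guess_format(handle.read(16))
```

A result of "unknown" is itself useful information: it tells you the file is in a format that may need its original program to open.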

In another example, when I use Adobe Photoshop to edit my photos, I can save the images in twenty different image formats, and I have no idea what some of them are or why I would use them. What if I use a format that is not recognized by any other program?

For yet another example, if you have the latest version of Microsoft Word, take a look at the file formats available for saving a file. Open any Word document, go to the File menu, and choose Save As. There is a box you may not have focused on called Format; it is a pull-down menu. Check out the list. I looked at my copy of Microsoft Word:Mac 2011 and found 14 different formats in addition to the now standard .docx. To get an idea of the overall problem, look at Wikipedia:List of file formats (alphabetical).

The Library of Congress discusses a number of what it calls sustainability factors. These include the following:
  • Disclosure. This is the degree to which the specifications and software tools are available to access the digital content.
  • Adoption. How much the format is already being used.
  • Transparency. Whether or not the format is open to human readability. In the case of the file from PAF this saved the day.
  • Self-documentation. Does the format include its own description?
  • External dependencies. How closely is the format allied to a certain hardware or operating system? 
  • Impact of patents. Can the format be used without a license?
  • Technical protection mechanisms. Think DVD and VCR copy protection schemes.
This is only a brief summary of a very complex subject. When we digitize a document, do we automatically think that we have ended the process of preservation? If so, we are sadly mistaken. But it is comforting to know that someone is thinking about the problems.

If you know a genealogist who has an older computer or computer system, or who is still using a very old program (more than 7 or 8 years old), then please initiate a discussion about migrating their data to a newer system or hardware, at least for a backup.

