System-independent archiving of project files with PDF/A-3

05 Dec 2018 By Dietrich von Seggern


Established file formats were no longer able to meet the constantly growing requirements for reliable long-term digital storage of important documents. In addition, there was a desire for a searchable, compact format that was robust and widely usable across platforms. In 2005, an ISO standard based on the PDF format opened a new chapter in the history of digital document archiving: PDF/A. In this blog post, I describe how system-independent archiving of project files can be achieved with PDF/A-3, and I illustrate the advantages of PDF/A-3 with application scenarios and the different points of view on the format.

The first part of the standard, PDF/A-1, can be regarded as the classic variant. I will not discuss its rules in detail here; what matters in this context is that attachments, i.e. embedded files, are not permitted. The second part, PDF/A-2, allows embedding, but only of PDF/A-compliant files. PDF/A-3, on the other hand, allows attachments in any format. This feature caused astonishment among users. After all, it is a common opinion in the IT industry that native formats are not suitable for long-term archiving. Doesn't this open a back door to exactly what PDF/A set out to avoid: the "format zoo" in the archives and thus the loss of control? Could all the work that went into PDF/A have been spared?

First of all, PDF/A-3 standardizes the way in which files are embedded in PDFs and thus ensures that they can be reliably found within the internal PDF structure. This matters because PDF offers many different ways of embedding files; with a standardized location, files can be extracted without having to read the complete PDF file. In the case of electronic invoices, for example, where a machine-readable XML part is embedded, the processing program is not an application specializing in PDF but rather, for example, business software. In addition, the ISO standard requires certain metadata that defines the file type of the embedded file, the type of relationship between it and the main PDF (e.g. source file), and the reference point (e.g. document, page, or page component).
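To make the three pieces of metadata more tangible, here is a minimal sketch in Python of how an archiving tool might model them before writing the PDF. The relationship values are the ones PDF/A-3 defines for associated files; the class names, the `AttachPoint` enum and the `problems()` check are my own illustrative assumptions, not part of the standard or of any product.

```python
from dataclasses import dataclass
from enum import Enum


class AFRelationship(Enum):
    """Relationship between an embedded file and the main PDF (PDF/A-3)."""
    SOURCE = "Source"            # the file the PDF was created from
    DATA = "Data"                # data behind the visual representation
    ALTERNATIVE = "Alternative"  # alternative representation of the content
    SUPPLEMENT = "Supplement"    # supplementary material
    UNSPECIFIED = "Unspecified"  # relationship not known


class AttachPoint(Enum):
    """Hypothetical names for the reference points mentioned in the text."""
    DOCUMENT = "document"
    PAGE = "page"
    PAGE_COMPONENT = "page component"


@dataclass(frozen=True)
class EmbeddedFileInfo:
    """Metadata an archiving tool would record for one embedded file."""
    filename: str
    mime_type: str
    relationship: AFRelationship
    attach_point: AttachPoint = AttachPoint.DOCUMENT

    def problems(self) -> list[str]:
        """Return a list of obvious metadata problems (empty if none)."""
        issues = []
        if not self.filename:
            issues.append("missing filename")
        # A MIME type has the form "type/subtype".
        type_, sep, subtype = self.mime_type.partition("/")
        if not (type_ and sep and subtype):
            issues.append("invalid MIME type")
        return issues


spreadsheet = EmbeddedFileInfo(
    filename="report.xlsx",
    mime_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    relationship=AFRelationship.SOURCE,
)
print(spreadsheet.problems())
```

The sketch only describes the metadata; actually writing it into the PDF structure is the job of a PDF library or conversion product.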

Let me come back to the question asked at the beginning about the archive suitability of PDF/A-3: The fact is that PDF/A-1 and PDF/A-2 are "frozen", robust archive formats based on PDF (PDF 1.4 and ISO 32000-1, respectively), while PDF/A-3 adds only specifications for the "envelope" and the type of embedding. It therefore cannot be used for "blind" archiving in the same way as PDF/A-1 and PDF/A-2, since the long-term suitability of the embedded files is not guaranteed per se. These justified concerns are countered by the fact that PDF/A-3 was already in practical use shortly after its publication. So here I will demonstrate the usefulness of PDF/A-3 using three practical examples.

One characteristic of PDF that is both a strength and a weakness is that, as "digital paper", it cannot reproduce multidimensional content. A typical example is a spreadsheet, where one dimension, the one that contains the formulas, is lost after conversion to PDF. With PDF/A-3, the Excel spreadsheet can be embedded in the PDF, so that a robust archive file is available permanently and the complete spreadsheet with formulas remains usable for as long as the application exists. The main advantage of hybrid archiving of Office documents is the robust, unambiguous representation in the form of a PDF file, which allows several parties to exchange the document without any doubt about its content. In addition, the archival copy is created at an early stage in the life cycle of the document, so that problems during a later conversion (e.g. fonts that are no longer available) are avoided. Finally, the source file can still be "worked with".

The next use case is the hybrid archiving of machine-readable and human-readable documents. The most common example is the electronic invoice, in which the visual image of the invoice is available as PDF/A and the invoice data is additionally embedded as an XML data set. In Germany, this is defined in the ZUGFeRD data model. ZUGFeRD stands for "Zentraler User Guide des Forum elektronische Rechnung Deutschland". The format is intended to supplement the existing EDI standards nationwide and replace paper-based processes. Similar initiatives can be found in Mexico with "Comprobante Fiscal Digital a través de Internet" (CFDI) and in Brazil with "Nota Fiscal Eletrônica 2.0" (NF-e). The common advantage of this hybrid approach based on PDF/A-3 is that the embedded file can be easily processed by machine while the PDF guarantees a robust visualization. The combination of PDF and XML thus enables data exchange and automated processing by the recipient without sacrificing the reliability of the paper substitute, the PDF.
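What "easily processed by machine" means on the recipient's side can be sketched in a few lines of Python. Please note that the XML below is a hypothetical, heavily simplified payload that I made up for illustration; it is not the ZUGFeRD schema, and the function name `read_invoice` is likewise an assumption. The sketch also assumes the XML has already been extracted from the PDF.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical, heavily simplified invoice payload; NOT the ZUGFeRD schema.
SAMPLE_XML = """<invoice>
  <number>2018-0042</number>
  <currency>EUR</currency>
  <line><description>Sealant</description><amount>120.50</amount></line>
  <line><description>Primer</description><amount>79.50</amount></line>
</invoice>"""


def read_invoice(xml_text: str) -> dict:
    """Read invoice number, currency and total from the simplified XML."""
    root = ET.fromstring(xml_text)
    # Decimal avoids floating-point rounding errors in monetary amounts.
    total = sum(
        (Decimal(line.findtext("amount")) for line in root.iter("line")),
        Decimal("0"),
    )
    return {
        "number": root.findtext("number"),
        "currency": root.findtext("currency"),
        "total": total,
    }


invoice = read_invoice(SAMPLE_XML)
print(invoice["number"], invoice["total"], invoice["currency"])
```

The business software never has to interpret the page content of the PDF; it works on the structured data only, while the human reader keeps the visual invoice.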

The third application example is the archiving of project files, as currently practiced by Sika, among others. The globally active, independent Swiss specialty chemicals company is subject to strict documentation requirements for new and further development processes for quality assurance reasons. The company's products remain in use for many years. In the event of complaints, Sika must be able to document its product development processes at any time by means of test results and laboratory reports. This is done by means of project files in which development processes and release reports are documented. The associated workflows, which used to be paper-based, are now largely carried out electronically. For this reason, the resulting documents must also be stored digitally. The chemical manufacturer thus builds a valuable long-term knowledge base in which formulation and product knowledge relating to its manufacturing processes is preserved.

The documentation, mainly office files, is to be stored system-independently in such a way that it remains bundled as an electronic project file. For this reason, Sika opted for PDF/A-3. Before project files are archived, it must be clarified how to deal with files that cannot be transferred to PDF/A, such as audio or image files. Using so-called blacklisting, for example, conversion software can be prevented from attempting to convert them to PDF/A at all. Furthermore, rules can define how such files are handled instead.
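Such a blacklisting rule could look roughly like the following Python sketch. The extension list, the action names and the functions `plan_action` and `plan_project` are purely illustrative assumptions on my part; they do not describe Sika's actual configuration or any particular conversion product.

```python
from pathlib import PurePath

# Hypothetical blacklist: extensions that are never sent to the converter.
NO_CONVERT = frozenset({".mp3", ".wav", ".png", ".jpg", ".tif"})


def plan_action(filename: str, blacklist: frozenset = NO_CONVERT) -> str:
    """Decide how one file of a project file is archived."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in blacklist:
        # Rule for blacklisted files: embed the original, no conversion attempt.
        return "embed-only"
    # All other files are converted to PDF/A and the source is embedded as well.
    return "convert-and-embed"


def plan_project(filenames: list[str]) -> dict[str, str]:
    """Map every file name in the project file to its archiving action."""
    return {name: plan_action(name) for name in filenames}


for name, action in plan_project(["report.docx", "Interview.MP3", "photo.png"]).items():
    print(f"{name}: {action}")
```

Keeping the rule in one small, explicit table makes it easy to review with the archive's stakeholders and to extend when new file types appear in the project files.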

PDF/A-3 therefore requires additional rules to ensure that the file can be archived. With such specifications in place, however, archivists and their clients can ensure the reproducibility of documents in the future and take advantage of the benefits of PDF/A-3. What do you think?
