
How Adobe Trapped Us All with PDF
The PDF format, ubiquitous for documents like reports, invoices, and contracts, is in theory open and freely available. In practice, Adobe has achieved near-monopoly status and turned the format into a significant revenue stream. Guillaume, who developed his own open-source PDF generator, delves into the format's inner workings, revealing a system as ingenious as it is intricate, shaped by decisions that date back to its origins.
The story of PDF begins not with Adobe, but with Xerox in the United States. Primarily known for photocopiers, Xerox was a hub of innovation in the early days of Silicon Valley. Its research center, Xerox PARC (Palo Alto Research Center), developed or popularized groundbreaking technologies like the computer mouse, object-oriented programming (with languages like Smalltalk), and graphical user interfaces (GUIs). Interestingly, GUIs and PDFs share a common ancestor in this era. The ambitions of the period were already visible in Douglas Engelbart's legendary 1968 demonstration, which showcased live video conferencing and precursors to hypertext links.
Among the brilliant minds at Xerox were John Warnock and Charles Geschke. In the 1980s, as personal computers began entering businesses, they recognized a critical need for a standardized printing system. Existing printers were highly diverse, ranging from text-only machines resembling automated typewriters to dot-matrix printers that relied on pixel grids. This fragmentation made it very difficult to ensure that a document printed consistently across different devices. The challenge was to create a format that could describe not just text, but also images and vector graphics (curves and lines), and communicate these instructions to any printer.
Their solution was PostScript, a page description language that defined what should be printed on a sheet of paper. Warnock and Geschke founded Adobe to develop this technology. Their business model involved selling PostScript and the associated software to printer manufacturers, enabling them to integrate this language into their hardware.
The link between printing and GUIs becomes apparent when considering Steve Jobs's encounter with Adobe's PostScript demonstration. Jobs recognized its potential not just for printing, but for drawing graphical interfaces on computer screens. He envisioned operating systems that moved beyond text-based interfaces to visual ones, and PostScript, or something similar, could provide the language to describe how elements should appear on screen. While a partnership with Adobe didn't fully materialize as Jobs might have hoped, the ideas influenced the development of NeXTSTEP and, later, macOS. This shared origin underscores the deep connection between GUIs and the PDF format.
However, PostScript, while powerful, was designed primarily for printing, and its files were large and complex, which made it a poor fit for easy exchange and storage. As the 1990s approached with the dream of a "paperless office," Adobe sought a format that could define page layouts like PostScript but was optimized for transfer, use, and storage. This led to Project Camelot at Adobe, which aimed to create a format based on PostScript but with better compression, making files smaller and more efficient to transfer. The goal was a format that could be read on any machine, regardless of hardware, and displayed consistently on both paper and screens, irrespective of resolution. This initiative evolved into PDF.
The promise of PDF was to offer a stable, universally readable output format, in stark contrast to other document formats. Image formats, for example, typically store pixel grids, which makes text search difficult and files large. Word processing documents, like Microsoft Word's .doc files, are designed for editing and information storage, but their visual output can vary significantly across software versions and platforms, leading to formatting issues. PDF, on the other hand, prioritizes the exact rendering of a document. Its core strategy is to store the visual representation, ensuring fidelity across devices, much like PostScript's original intent for printing.
Adoption of PDF was not immediate. A significant hurdle was that Adobe's early PDF reader software was not free, costing around $50, and creating PDFs required even more expensive software. Adobe published the PDF specification but kept the format proprietary, and in 1994, alongside version 1.1 of PDF, it made Acrobat Reader free. Widespread adoption then became possible: businesses could create PDFs with paid software and distribute them to customers, who could read them at no cost. Revenue now came from those who *created* PDFs rather than those who merely consumed them, a strategy aimed at establishing PDF as the de facto standard for document exchange.
The open specification allowed other companies to develop software for reading and writing PDFs. However, Adobe maintained a significant advantage. As the creators of the format, they possessed a deeper understanding of its intricacies and nuances. When new versions of the PDF specification were released, Adobe's software was invariably the first and most accurate implementation. If ambiguities arose in the specification, Adobe's interpretation was often considered the definitive one, giving them a technological edge and effectively controlling the format's evolution.
The complexity of the PDF specification itself is a major factor. It's a substantial document, exceeding a thousand pages, requiring significant time and expertise to fully comprehend and implement. This technical challenge, coupled with the need for robust tools to verify PDF correctness, has meant that developing high-quality PDF software has been a years-long endeavor for many. The specification itself is also distributed in PDF format, a somewhat ironic and inconvenient choice for those working with it. The responsibility for the specification has since transitioned from Adobe to the International Organization for Standardization (ISO).
The business landscape around PDF has evolved. When Adobe transitioned the specification to ISO, a consortium of companies formed the PDF Association to participate in its development. While the specification became more openly developed, ISO charges for access to its standards, making them a paid commodity. This presents a challenge for open-source developers who often work on their own time and may not have the resources to purchase these specifications.
The PDF Association actively promotes PDF, emphasizing its ongoing evolution beyond simple text and line drawings. Modern PDFs can incorporate audio, video, 3D models, and even JavaScript, blurring the lines with web pages. Features like interactive forms, digital signatures, and accessibility tagging for users with disabilities have become integral. The format has expanded significantly from its initial conception.
Participation in the ISO working groups and the PDF Association involves significant financial commitment. While individual contributions might be a few hundred dollars annually, corporate participation can range into the tens of thousands of dollars per year, reflecting the political and strategic importance of having a voice in the specification's development. Large corporations benefit from having a stake in defining the future of PDF.
Today, PDF continues to generate revenue, primarily through creation and specialized tools. While readers are widely available for free across browsers and operating systems, Adobe's Acrobat suite remains a leading solution for advanced PDF creation. Companies also pay for tools that perform tasks like optical character recognition (OCR) for scanned documents, convert standard PDFs into print-ready or accessible formats, or validate PDF compliance for legal or archival purposes. For instance, creating archival-quality PDFs that ensure long-term readability and accessibility requires specialized software to add metadata. Adobe's solutions are prominent in these areas.
Developers can also face licensing costs for PDF generation software. Many free tools exist, including browser "print to PDF" functions and text-based solutions, but automating complex document generation often requires dedicated software, ranging from open-source libraries to expensive commercial products licensed per machine. This ecosystem means that while end-users may not directly pay to read PDFs, various entities in the creation and processing chain often incur costs.
Guillaume's personal journey into the intricacies of PDF generation began when he worked in IT for pharmacies. The need to automate the creation of thousands of documents every month (invoices, course materials, drug information) highlighted the limitations of manual methods. Initial attempts with formats like LaTeX proved too complex for graphic designers to style, and solutions like OpenOffice/LibreOffice were difficult to automate and did not produce consistent output. Because the problem was so widespread, Guillaume set out to find a more robust solution.
His company, focused on open-source software, explored transforming HTML and CSS into PDFs. This approach leveraged the widespread familiarity with web technologies. CSS, even in its early versions, contained specifications for printing, allowing for distinct styling for different output media. The challenge was that standard web browsers, optimized for scrolling web pages, were not ideal for generating print-ready PDFs with proper page breaks and layout management.
This led to the development of a custom rendering engine. Guillaume and a student intern, Simon, embarked on building a tool that could take HTML and CSS input and produce PDF output. This project, named WeasyPrint, became a significant open-source library. The core idea was to treat PDF generation as a rendering process, similar to how browsers render web pages, but with PDF as the output format.
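To make the HTML-and-CSS-in, PDF-out idea concrete, here is a minimal sketch in Python. The invoice layout, the `build_invoice_html` and `render_pdf` helper names, and the specific CSS rules are illustrative assumptions, not taken from Guillaume's project. The `@page` and `@media print` rules are standard CSS print features, and the final step assumes the third-party `weasyprint` package is installed.

```python
from html import escape

# Print-oriented CSS: @page sets paper size, margins, and a page counter;
# @media print applies rules only to printed (paged) output.
PRINT_CSS = """
@page {
  size: A4;
  margin: 2cm;
  @bottom-center { content: "Page " counter(page) " / " counter(pages); }
}
@media print {
  nav { display: none; }
  tr { break-inside: avoid; }
}
"""


def build_invoice_html(customer, items):
    """Return an HTML invoice; items is a list of (label, quantity, unit_price)."""
    rows = "".join(
        f"<tr><td>{escape(label)}</td><td>{qty}</td><td>{qty * price:.2f}</td></tr>"
        for label, qty, price in items
    )
    total = sum(qty * price for _, qty, price in items)
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{PRINT_CSS}</style></head><body>"
        "<nav>Screen-only menu</nav>"
        f"<h1>Invoice for {escape(customer)}</h1>"
        f"<table>{rows}</table><p>Total: {total:.2f}</p>"
        "</body></html>"
    )


def render_pdf(html, path):
    """Render HTML to a PDF file (assumes the weasyprint package is installed)."""
    from weasyprint import HTML  # imported lazily so the rest runs without it

    HTML(string=html).write_pdf(path)
```

With the package installed, `render_pdf(build_invoice_html("ACME", [("Aspirin", 2, 3.5)]), "invoice.pdf")` would produce a paginated invoice. The designer only ever touches HTML and CSS, which is the appeal of the approach.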
The internal structure of a PDF file, as demonstrated in the video, consists of a series of objects and drawing commands. A simple example shows instructions for drawing a rectangle, including its coordinates, dimensions, and styling (such as a dashed outline). These commands describe the visual elements to be rendered. Because the PDF specification relies on transformation matrices and intricate rules, it can be hard to predict the exact visual output without specialized tools.
WeasyPrint has evolved significantly over its 15-year history, becoming a robust engine that implements many features of the PDF specification and of paged layout beyond basic rendering, such as metadata handling, color management, footnotes, page numbering, and page layout variations. Its success is evident in its widespread adoption. While direct usage statistics are hard to track, millions of downloads per month indicate its significant impact. It is written in Python and ranks within the top 1% of Python libraries. Large corporations, including tech giants like Google and Microsoft, as well as SAP and Mercedes-Benz, are known to use WeasyPrint or its dependencies, all within the open-source framework.