
Why You're Still Better Than AI at Editing Documents - CS50 Tech Talk
AI Summary
This CS50 tech talk features Nick Bernal, Director of Engineering at Superdoc.dev, discussing the complexities of file formats, particularly DocX, and how modern AI tools interact with them. Bernal highlights that while simple file formats like JPEG are easily identifiable by their initial bytes, DocX files are significantly more intricate.
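The point about initial bytes can be sketched in a few lines of Python. This is a minimal illustration, not a full file-type detector: the function names `sniff` and `sniff_file` are invented here, and only two well-known signatures are checked (JPEG files begin with the bytes FF D8 FF, and ZIP containers begin with "PK" followed by 03 04).

```python
# Leading "magic bytes" for two formats. A DocX file is a ZIP container,
# so its first bytes are the generic ZIP signature, not something
# specific to Word documents.
JPEG_MAGIC = b"\xff\xd8\xff"
ZIP_MAGIC = b"PK\x03\x04"


def sniff(header: bytes) -> str:
    """Guess a file's container type from its first few bytes."""
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    if header.startswith(ZIP_MAGIC):
        # Could be DocX, but also XLSX, PPTX, EPUB, JAR, or a plain ZIP.
        return "zip-based (possibly docx)"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of a file on disk and classify them."""
    with open(path, "rb") as f:
        return sniff(f.read(8))
```

The signature identifies a JPEG outright, but for a DocX file it only says "this is a ZIP archive". Deciding that it is a Word document, and what it means, requires opening the archive and reading its parts.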
Bernal introduces Superdoc.dev as an open-source developer toolset for building DocX applications, aiming to bring Microsoft Word-like functionality into custom applications and automation pipelines. He emphasizes that documents, especially DocX, are more akin to software than mere text, containing underlying structure, logic, and rules.
The core of the discussion revolves around the challenges of using Large Language Models (LLMs), like ChatGPT and Claude, to edit DocX documents. Bernal demonstrates with examples that current LLMs often struggle with precise document manipulation. For instance, asking an LLM to add a paragraph using "track changes" might result in the paragraph being added as ordinary text, with no tracked-change record, so a reviewer cannot accept or reject the modification. Similarly, requests to restructure parts of a document, like splitting definitions into numbered lists, can lead to incomplete or incorrect results, potentially rendering the document incoherent or invalid. These failures stem from the LLMs lacking both the necessary interface and an understanding of the underlying DocX structure.
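To make the track-changes failure concrete, here is a rough sketch of the difference at the markup level. In WordprocessingML, the XML vocabulary used inside DocX files, a tracked insertion is recorded by wrapping the inserted run in a `w:ins` element that carries an author and date. The two fragments below are hand-written and heavily simplified for illustration (a real document has more wrapping elements and properties), and the helper name `tracked_insertions` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:r, w:t, w:ins.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# What a naive edit produces: the text is there, but nothing marks it as a change.
PLAIN = f'''<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>New paragraph.</w:t></w:r></w:p>
</w:body>'''

# What a tracked edit looks like: the same run wrapped in w:ins.
TRACKED = f'''<w:body xmlns:w="{W}">
  <w:p>
    <w:ins w:id="1" w:author="Reviewer" w:date="2024-01-01T00:00:00Z">
      <w:r><w:t>New paragraph.</w:t></w:r>
    </w:ins>
  </w:p>
</w:body>'''


def tracked_insertions(xml_text: str) -> list[str]:
    """Return the text of every tracked insertion found in the fragment."""
    root = ET.fromstring(xml_text)
    found = []
    for ins in root.iter(f"{{{W}}}ins"):
        # Join the text of all w:t elements inside this insertion.
        found.append("".join(t.text or "" for t in ins.iter(f"{{{W}}}t")))
    return found
```

Both fragments show the same visible words, but only the second gives a reviewer something to accept or reject. A model that writes the first form has added the paragraph without tracking it.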
Bernal then delves into the nature of DocX files, revealing that they are essentially zip archives containing multiple XML files and other components. He explains that the user-friendly interface of applications like Microsoft Word or Google Docs is an abstraction layer that hides this complexity. This abstraction allows users to focus on content creation rather than the intricate internal workings of the file format. However, this abstraction can break down when documents move between different applications or systems, leading to loss of functionality and meaning, as demonstrated by examples of fields not rendering correctly in Google Docs compared to Word, or endnotes being converted to footnotes.
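The "zip archive of XML files" structure is easy to inspect with Python's standard library. The sketch below builds a tiny in-memory package and lists its parts. The three part names are ones found in a typical DocX file, but their contents here are empty placeholders, so the result is not a document that Word would open. The helper names `build_package` and `list_parts` are illustrative only.

```python
import io
import zipfile

# Placeholder stand-ins for three parts found in a typical DocX package.
# Real parts contain far more markup than these stubs.
PARTS = {
    "[Content_Types].xml": "<Types/>",
    "_rels/.rels": "<Relationships/>",
    "word/document.xml": "<document/>",
}


def build_package(parts: dict[str, str]) -> bytes:
    """Write the given parts into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in parts.items():
            zf.writestr(name, body)
    return buf.getvalue()


def list_parts(data: bytes) -> list[str]:
    """List the names of the files stored inside a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()
```

Calling `list_parts` on the bytes of a real .docx file typically shows many more entries (styles, numbering, settings, relationships, media), which is the complexity the editor's interface hides.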
The evolution of the DocX format is traced back to the need for openness and interoperability. Initially, Microsoft Word's .doc format was a proprietary binary format that other applications could not easily read or write. The move in 2006 to .docx, which is based on Office Open XML (OOXML), and the format's subsequent standardization made it more open and allowed for broader implementation. This openness matters because documents are now considered infrastructure, and antitrust concerns pushed for formats that weren't locked into a single vendor.
Bernal likens document editing to interacting with any other complex software system. Just as we don't build our own cellular networks to send text messages, we use APIs (Application Programming Interfaces) to interact with complex systems. APIs provide a defined, predictable way to perform actions, ensuring determinism: the same input always yields the same output. He argues that current LLMs struggle with DocX because they are not given the right interface, which is akin to expecting a dog to communicate by pressing random buttons rather than giving it specific, labeled buttons for each command.
Superdoc.dev provides this necessary interface, a "document API," which allows LLMs to interact with DocX files reliably. Bernal demonstrates this by using an LLM with Superdoc's tools to perform tasks that previously failed. Examples include inserting a new clause with correct formatting, changing text like "Series A" to "Series B" using track changes in real-time collaboration, adding footnotes, and changing monetary amounts. In these instances, the LLM, empowered by the API, not only performs the requested action but also respects the document's existing formatting and context, maintaining coherence.
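The idea of a "document API" for an LLM can be sketched as a small set of named, deterministic operations that the model calls instead of rewriting raw XML. Everything below is hypothetical and written only to illustrate the pattern: the `Document` model, the `replace_text` tool, and the `dispatch` function are assumptions made for this sketch and are not Superdoc's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    kind: str    # "insert" or "delete"
    text: str
    author: str


@dataclass
class Paragraph:
    text: str
    changes: list[Change] = field(default_factory=list)


@dataclass
class Document:
    paragraphs: list[Paragraph]


def replace_text(doc: Document, find: str, replace: str,
                 author: str = "assistant") -> int:
    """Hypothetical tool: replace text and record each edit as a tracked change."""
    count = 0
    for p in doc.paragraphs:
        n = p.text.count(find)
        if n:
            p.text = p.text.replace(find, replace)
            for _ in range(n):
                # One deletion plus one insertion per occurrence replaced.
                p.changes.append(Change("delete", find, author))
                p.changes.append(Change("insert", replace, author))
            count += n
    return count


# The "labeled buttons": the only operations the model is allowed to invoke.
TOOLS = {"replace_text": replace_text}


def dispatch(doc: Document, call: dict) -> int:
    """Run one tool call shaped like {"name": ..., "arguments": {...}}."""
    tool = TOOLS[call["name"]]  # an unknown tool name raises KeyError
    return tool(doc, **call["arguments"])
```

In this arrangement the model only has to emit a structured call such as `{"name": "replace_text", "arguments": {"find": "Series A", "replace": "Series B"}}`. The tool, not the model, is responsible for producing a valid, reviewable edit, and the same call on the same document always gives the same result.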
The key takeaway is that while LLMs possess intelligence, they require the right tool, an API, to interact effectively with structured data formats like DocX. With that interface in place, humans and AI can collaborate on document editing, much as a well-trained dog knows which specific buttons to press. Bernal concludes by asking where else humans and machines collaborate without adequate interfaces, suggesting that such opportunities might be hiding in plain sight.