We created an AI-based document digitisation pipeline to convert Word and PDF documents into a user-friendly web format. We designed a well-researched, robust and accessible web front end.
In this article, we discuss a recent project which delivered:
The client had a large number of Word and PDF documents, dating back many years. Many documents had references to other documents. It was difficult for users to find the information they needed.
Challenges with converting the documents into a user-friendly web format included:
Our multidisciplinary team included technical, design, and quality assurance specialists, as well as content specialists from True North Content.
We started with a discovery phase. Following a kick-off and immersion workshop, we conducted desktop research, journey mapping, backstage process mapping, technical research, solution design, and a full content audit. We held workshops with internal stakeholders to understand the documents and related communication channels, as well as architecture and security needs and constraints. We interviewed a balanced group of users to identify their needs and validate our hypotheses. Based on this research, we created detailed scenario maps and a clickable prototype for usability testing. At the same time, we developed and tested a technical proof of concept to show that Artificial Intelligence (AI) could be used to ingest documents and enrich them with appropriate context and metadata.
Our team used the Agile Scrum method for its iterative and collaborative qualities. The client's product owners were closely involved in the project through daily scrum meetings and regular showcases, and gave feedback throughout. We worked on the digitisation pipeline, the CMS, and the front end in parallel. The site was designed with a focus on user needs and accessibility. Quality assurance was integrated into every step of the project.
The digitisation pipeline was built in Azure and integrated into the Drupal CMS. Documents are uploaded in batches via the AI-based digitisation pipeline and automatically enriched with metadata to make them easily findable via search and the website’s information architecture (IA). A human then checks the results before they are uploaded and published in Drupal. Potential errors and warnings in the documents, such as images with missing alt text or unsupported image formats, are easy to see. Because the source documents are inconsistent, the pipeline provides a range of automated classifications and fixes, such as correcting footnote numbering and unifying list items into a single list. Automated fixes are highlighted for human review.
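As an illustration only, the kind of automated checks and fixes described above could be sketched in Python as follows. This is not the pipeline's actual code. The `Block` and `Report` types, the set of supported image formats, and every function name are assumptions made for this example. The sketch merges adjacent lists, renumbers footnotes sequentially, and raises warnings for images, recording each fix so a human reviewer can see what changed.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed set of formats; the real pipeline's supported formats are not stated.
SUPPORTED_IMAGE_FORMATS = {"png", "jpg", "jpeg", "gif", "svg"}


@dataclass
class Block:
    """Hypothetical, simplified unit of an ingested document."""
    kind: str  # "paragraph", "list", "image" or "footnote"
    items: list[str] = field(default_factory=list)  # used when kind == "list"
    number: int | None = None  # used when kind == "footnote"
    alt: str = ""  # used when kind == "image"
    fmt: str = ""  # image file format, used when kind == "image"


@dataclass
class Report:
    """Collects warnings and automated fixes for human review."""
    warnings: list[str] = field(default_factory=list)
    fixes: list[str] = field(default_factory=list)


def merge_adjacent_lists(blocks: list[Block], report: Report) -> list[Block]:
    """Unify consecutive list blocks into a single list."""
    merged: list[Block] = []
    for block in blocks:
        if block.kind == "list" and merged and merged[-1].kind == "list":
            merged[-1].items.extend(block.items)
            report.fixes.append(
                f"merged {len(block.items)} list item(s) into the preceding list"
            )
        else:
            merged.append(block)
    return merged


def renumber_footnotes(blocks: list[Block], report: Report) -> None:
    """Renumber footnotes 1, 2, 3, ... in document order."""
    expected = 1
    for block in blocks:
        if block.kind != "footnote":
            continue
        if block.number != expected:
            report.fixes.append(f"footnote {block.number} renumbered to {expected}")
            block.number = expected
        expected += 1


def check_images(blocks: list[Block], report: Report) -> None:
    """Warn about missing alt text and unsupported image formats."""
    for index, block in enumerate(blocks):
        if block.kind != "image":
            continue
        if not block.alt.strip():
            report.warnings.append(f"block {index}: image is missing alt text")
        if block.fmt.lower() not in SUPPORTED_IMAGE_FORMATS:
            report.warnings.append(
                f"block {index}: unsupported image format '{block.fmt}'"
            )


def process(blocks: list[Block]) -> tuple[list[Block], Report]:
    """Apply the fixes, then run the checks, and return both results."""
    report = Report()
    blocks = merge_adjacent_lists(blocks, report)
    renumber_footnotes(blocks, report)
    check_images(blocks, report)
    return blocks, report
```

Keeping fixes and warnings in separate lists reflects the distinction drawn above: fixes have already been applied and only need confirmation, while warnings need a person to act before publishing.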
Document authors can continue to use their existing document authoring workflow, involving Word and SharePoint. They then upload those documents through the pipeline. The option to move to authoring directly in Drupal is there if it’s needed in the future.
Great care was taken in designing the IA and search experience for the website. Our research found that the site needed to serve both expert users who are familiar with the content, and new users. The site therefore offers multiple ways to find relevant information, to suit both groups. This kind of user experience was made possible by the automated enrichment done in the digitisation pipeline.
We successfully created a digital product that uses AI to convert documents into a user-friendly website experience. Our solution provided a seamless experience for the document publishing team, which suits their needs now and into the future. This project greatly improved the user experience of accessing information.
Get in touch with Annex to see how we can help you make your information easier to access.