From disconnected documents to a user-friendly website

In this article, we discuss a recent project which delivered:

  • An efficient AI-based document digitisation pipeline that converts hundreds of Word and PDF documents into a user-friendly web format.
  • An easy-to-use content management system (CMS) for managing the review and publishing of documents and other web content.
  • A well-researched, thoughtfully designed, robust, and accessible web front end.

The challenge

The client had a large number of Word and PDF documents, dating back many years. Many documents had references to other documents. It was difficult for users to find the information they needed.

Challenges with converting the documents into a user-friendly web format included:

  • Making document layout friendly for the web
  • Making the documents easily findable through search
  • Handling different image formats
  • Converting complex tables
  • Linking related documents
  • Handling inconsistent styles and formatting
  • Identifying and converting footnotes and definitions to a user-friendly web format

What we did

Our process

Our multidisciplinary team included technical, design, and quality assurance specialists, as well as content specialists from True North Content.

We started with a discovery phase. Following a kick-off and immersion workshop, we conducted desktop research, journey mapping, backstage process mapping, technical research, solution design, and a full content audit. We held workshops with internal stakeholders to understand the documents and related communication channels, as well as architecture and security needs and constraints. We conducted user interviews with a balanced group of users to identify their needs and validate our hypotheses. Based on this research, we created detailed scenario maps and a clickable prototype for usability testing. At the same time, we developed and tested a technical proof of concept to show that AI could be used to ingest documents and enrich them with appropriate context and metadata.

Our team used the Agile Scrum method for its iterative and collaborative qualities. The product owners from the client were deeply involved in the project, giving feedback through daily scrum meetings and regular showcases. We worked on the digitisation pipeline, the CMS, and the front end in parallel. The site was designed with a focus on user needs and accessibility. Quality assurance was integrated into every step of the project.

Digitisation pipeline

The digitisation pipeline was built in Azure and integrated into the Drupal CMS. Documents are uploaded in batches via the AI-based digitisation pipeline and automatically enriched with metadata to make them easily findable via search and the website’s information architecture (IA). A human then checks the results before uploading and publishing them to Drupal. It’s easy to see potential errors or warnings in the documents, such as images with missing alt text or unsupported image formats. Because the source documents are inconsistent, the pipeline provides a range of automated classifications and fixes, such as correcting footnote numbering and unifying list items into a single list. Automated fixes are highlighted for human review.

Figure: Diagram of the digitisation pipeline. There are four main sequential stages: ingest, enrich, manage, and explore. Ingest consists of documents, form recogniser, and blob and table storage. Enrich consists of language studio, function apps, custom text classification, key phrase extraction, and custom named entity recognition. Manage consists of review, refine, and publish. Explore consists of the website. The pipeline makes it simple, efficient, and reliable to update the website so people can explore the documents in a user-friendly web format.
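To illustrate the kind of checks and fixes described above, here is a minimal Python sketch. It is not the project's actual code: the names `Image`, `ReviewReport`, `check_images`, and `renumber_footnotes`, and the list of supported image formats, are all assumptions made for this example. It shows how warnings and automated fixes might be collected so a human reviewer can see them before publishing.

```python
from dataclasses import dataclass, field

# Hypothetical list of web-friendly formats; the real pipeline's list is not stated.
SUPPORTED_IMAGE_FORMATS = {"png", "jpg", "jpeg", "gif", "svg"}


@dataclass
class Image:
    filename: str
    alt_text: str = ""


@dataclass
class ReviewReport:
    """Collects everything a human reviewer should look at."""
    warnings: list[str] = field(default_factory=list)
    fixes: list[str] = field(default_factory=list)


def check_images(images, report):
    """Record a warning for each unsupported format or missing alt text."""
    for image in images:
        extension = image.filename.rsplit(".", 1)[-1].lower()
        if extension not in SUPPORTED_IMAGE_FORMATS:
            report.warnings.append(f"{image.filename}: unsupported image format")
        if not image.alt_text.strip():
            report.warnings.append(f"{image.filename}: missing alt text")


def renumber_footnotes(markers, report):
    """Renumber footnote markers sequentially, in order of first appearance.

    Repeated markers keep pointing at the same footnote. If anything
    changed, the fix is recorded so it can be highlighted for review.
    """
    mapping = {}
    for marker in markers:
        if marker not in mapping:
            mapping[marker] = len(mapping) + 1
    renumbered = [mapping[marker] for marker in markers]
    if renumbered != markers:
        report.fixes.append(f"footnotes renumbered: {markers} -> {renumbered}")
    return renumbered
```

For example, footnote markers `[1, 3, 3, 7]` would become `[1, 2, 2, 3]`, and the report would carry one entry in `fixes` for the reviewer to confirm. Keeping warnings and fixes in a single report object is one simple way to keep the automated step transparent to the person who approves the document.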

Content authoring

Document authors can continue to use their existing document authoring workflow, involving Word and SharePoint. They then upload those documents through the pipeline. The option to move to authoring directly in Drupal is there if it’s needed in the future.

Findability enabled by AI

Great care was taken in designing the IA and search experience for the website. Our research found that the site needed to serve both expert users who are familiar with the content and new users who are not. Serving both groups was made possible by the automated enrichment done in the digitisation pipeline. There are multiple ways to find relevant information, to suit the needs of both new and experienced users.

  • Landing pages and related documents
    There are landing pages for each of the major topics, and the ability to see related documents. This makes it easy for new users to find the information they need and familiarise themselves with the website.
  • Intuitive search experience
    The team designed a search experience that lets users find documents, as well as crucial excerpts within documents, with ease. Users can apply filters to drill down to the information relevant to them. Search results include context and metadata, much like Google's results, so users can confidently identify the relevant result and navigate to it with a single click.
    Figure: Example search result from Google showing context, including a link, text snippet, rating, price range, and thumbnail image. The team followed the Google pattern of providing search results with context and metadata.
  • Thoughtful document-viewing experience
    The document-viewing experience makes it easy to find relevant information within each document. A table of contents is generated for each document based on the headings found in the source document. Some terms have a definition at the beginning of a document or even in another related document. These terms are linked within the document so that the definition is available right where it’s needed. Footnotes from the original documents are linked in the same way.
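The document-viewing features described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's implementation: the function names `slugify`, `build_toc`, and `link_terms`, the anchor format, and the HTML output are all assumptions. It shows one way a table of contents could be derived from headings, and how defined terms could be linked to their definitions.

```python
import re


def slugify(text):
    """Turn a heading into a URL-friendly anchor, e.g. 'Scope & purpose' -> 'scope-purpose'."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"


def build_toc(headings):
    """Build table-of-contents entries from (level, title) pairs in document order.

    Anchors are made unique by appending a counter when a heading repeats.
    """
    used = set()
    toc = []
    for level, title in headings:
        base = slugify(title)
        anchor, count = base, 1
        while anchor in used:
            count += 1
            anchor = f"{base}-{count}"
        used.add(anchor)
        toc.append({"level": level, "title": title, "anchor": anchor})
    return toc


def link_terms(text, definitions):
    """Wrap each defined term in a link to its definition.

    `definitions` maps a term to the (hypothetical) href of its definition,
    which could live in the same document or a related one.
    """
    if not definitions:
        return text
    # Longest terms first so multi-word terms win over their shorter parts.
    terms = sorted(definitions, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): href for term, href in definitions.items()}
    return pattern.sub(
        lambda match: f'<a href="{lookup[match.group(1).lower()]}">{match.group(1)}</a>',
        text,
    )
```

Footnotes could be linked with the same approach, using footnote markers in place of terms. A real implementation would work on the document's parsed structure rather than on raw text, so that links are never inserted inside existing markup.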

The outcome

We successfully created a digital product that uses AI to convert documents into a user-friendly website experience. Our solution provided a seamless experience for the document publishing team, which suits their needs now and into the future. This project greatly improved the user experience of accessing information.

Get in touch with Annex to see how we can help you make your information easier to access.