Using AI to dynamically generate web pages

Thursday 16 November 2023

Our team has recently been exploring how AI could enhance our users’ online experience. I have been looking into whether AI can build customised web pages on demand. In this blog post, I cover the process of building a proof of concept and the challenges I faced in implementing it.

The idea behind this project was straightforward: to create a website capable of dynamically generating web pages based on the path the user enters in the URL. For instance, if the user entered /computing-research-subjects, the system would interpret the request as ‘construct a web page concerning the research subjects of the school of computing’. Drawing on our knowledge base, the AI would then construct a web page to match the user’s request.

My approach consisted of the following five essential steps:

Gathering data
Assigning weight and labels
Storing the data
Customising and personalising
Publishing and fine tuning

Gathering data

Creating the knowledge base involved collecting a significant amount of content, which could have been sourced from various places, such as PDFs, databases, Word documents, and websites. To simplify the process, I focused on content that was already published on our website. I kept only the raw text of each page, without any formatting, styling, or images.

To accomplish this, I developed a basic web spider specifically designed to crawl our website. The spider targeted the main content of each page, extracting only the relevant text and disregarding other parts like headers, footers, and sidebars. The extracted text was then scrubbed of any embedded HTML and saved to a database.
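The extraction step can be sketched roughly as follows. This is a minimal illustration rather than the spider used for the project: it assumes the main content of each page sits inside a <main> element, uses only Python's standard-library HTML parser, and leaves out the crawling and the database write. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects the text inside <main>, skipping page furniture."""

    # Assumption: these elements never hold main content.
    SKIP = {"header", "footer", "nav", "aside", "script", "style"}
    # Block-level elements whose end should separate words.
    BLOCK = {"p", "div", "li", "section", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.main_depth = 0
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK and self.main_depth and not self.skip_depth:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when inside <main> and outside any skipped element.
        if self.main_depth and not self.skip_depth:
            self.parts.append(data)


def extract_main_text(html: str) -> str:
    """Return the main text of a page as one whitespace-normalised string."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling extract_main_text on a page's HTML returns plain text with the tags removed, ready to be saved alongside the page's URL.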

Assigning weight and labels

During the cleanup process, I used the content headings and the first sentence of the text to generate metadata, and assigned tags to the content to make it easier to search.
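As a rough sketch of what this could look like, the function below builds a metadata record from a heading and the first sentence of the body. The tagging rule (longer words from the heading and first sentence, minus a small stop-word list) is an assumption made for illustration, as are the field names; the actual rules used are not described here.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"this", "that", "with", "from", "have", "about", "their"}


def build_metadata(heading: str, body: str) -> dict:
    """Derive a title, summary, and tags from a heading and body text."""
    # Take everything up to the first sentence-ending punctuation mark.
    first_sentence = re.split(r"(?<=[.!?])\s+", body.strip(), maxsplit=1)[0]
    words = re.findall(r"[a-z]+", f"{heading} {first_sentence}".lower())
    # Assumed rule: tags are the distinct words longer than three letters.
    tags = sorted({w for w in words if len(w) > 3 and w not in STOPWORDS})
    return {"title": heading.strip(), "summary": first_sentence, "tags": tags}
```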

Not all content carries the same importance. For instance, content that is linked from the homepage is generally more significant than content that is buried deep within the website. My idea was therefore to use web traffic statistics to weight the content, giving pages that receive more traffic a higher weighting than pages that receive less. However, due to time constraints, I was unable to implement this weighting system and had to skip it.
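Since the weighting was never built, the following is only one way it might work. It assumes page views are available as a simple mapping from URL to count, scales them logarithmically so that a handful of very popular pages do not swamp everything else, and applies the result as a small boost to a search score. The function names, the log scaling, and the boost value are all assumptions.

```python
import math


def traffic_weights(page_views: dict[str, int]) -> dict[str, float]:
    """Map each URL to a weight between 0 and 1 based on its page views."""
    if not page_views:
        return {}
    # log1p keeps zero-traffic pages at 0 and dampens very large counts.
    max_log = max(math.log1p(v) for v in page_views.values()) or 1.0
    return {url: math.log1p(v) / max_log for url, v in page_views.items()}


def weighted_score(similarity: float, weight: float, boost: float = 0.2) -> float:
    """Nudge a search similarity score upwards for high-traffic pages."""
    return similarity * (1 + boost * weight)
```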

Storing the data

To enable semantic search in my application, I decided to use vector embeddings instead of lexical search. This involved converting my text data into vector representations. Although I won’t dive into the technical details, there are many informative articles available online about embeddings and vectors. In simple terms, an embedding encodes a piece of text as a list of numbers, so that pieces of text with similar meanings end up close together. This is what makes semantic search possible.

To perform the conversion, I used OpenAI’s embeddings Application Programming Interface (API). However, there is a limit on the amount of text that can be sent in a single request, so I needed to split my text into smaller chunks before calling the API. To automate this, I wrote a script that iterated through my database, divided the text into manageable pieces, and converted each chunk into an embedding. These embeddings were then stored in a dedicated vector database.
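A minimal sketch of the chunking step is shown below. It splits on words, with a small overlap so that a sentence cut at a boundary still appears whole in one of the chunks. The API limit is measured in tokens rather than words, so the word count here is only a rough proxy, and the default sizes are assumptions rather than the values used in the project. The call to the embeddings API itself is not shown.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words, overlapping slightly."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk returned would then be sent to the embeddings API in its own request.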

It’s important to note that, although my text resided in one database and the embeddings in another, the two sets of data were linked by a unique identifier. This allowed me to trace each embedding back to its corresponding piece of text.
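The linkage can be illustrated with two in-memory dictionaries standing in for the two databases. The identifier format (page ID plus chunk number) and the function names are hypothetical; any scheme works as long as both stores use the same key.

```python
# Stand-ins for the text database and the vector database.
text_db: dict[str, str] = {}
vector_db: dict[str, list[float]] = {}


def store_chunk(page_id: str, index: int, text: str, embedding: list[float]) -> str:
    """Save a chunk and its embedding under the same identifier."""
    chunk_id = f"{page_id}:{index}"  # hypothetical ID scheme
    text_db[chunk_id] = text
    vector_db[chunk_id] = embedding
    return chunk_id


def text_for(chunk_id: str) -> str:
    """Look up the original text for an ID returned by a vector search."""
    return text_db[chunk_id]
```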

Customising and personalising

What counts as relevant content varies from person to person. For instance, a staff member seeking module information would require a different set of details than a student looking for the same thing. By categorising both the content and the user, we can deliver tailored information.

There are various ways to categorise users, such as by their role, nationality, location (online or in-person), academic discipline, or even their residential hall at the University.

Additionally, personal preferences play a significant role. These are specific to each individual user and reflect their likes and dislikes, such as a strong interest in sports or a distaste for the colour blue. All of this information can be used to customise the presented content.

Regrettably, due to time constraints, I couldn’t explore this concept further. However, I envision a process in which the labels assigned in step two are combined with user information obtained from a login session or through cookies. This combination would then influence the content delivered to the user.
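As this step was not built, the sketch below is purely illustrative. It assumes each piece of content carries an optional list of audience labels and that the user's role is known from their session, then keeps only the content meant for everyone or for that role. The field names "audiences" and "role" are hypothetical.

```python
def personalise(chunks: list[dict], user: dict) -> list[dict]:
    """Keep content labelled for everyone or for this user's role."""
    role = user.get("role")
    return [
        chunk for chunk in chunks
        # No audience label means the content is suitable for all users.
        if not chunk.get("audiences") or role in chunk["audiences"]
    ]
```

The filtered list would then replace the full set of search results before the page is generated.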

Publishing and fine tuning

The final step involves integrating all the components. When a user submits a request, the system searches for relevant content and generates a web page to display to the user.

The complete workflow is as follows:

The user submits a URL.
The URL is converted into an embedding.
The embedding is used to search the vector database.
A subset of the text data (the search results) is sent to OpenAI, prompting the system to generate HTML content based on the provided subset.
The generated HTML is displayed within a University-branded template.
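The steps above can be sketched end to end as follows. To keep the example self-contained, the embedding call and the text-generation call are passed in as plain functions (embed and complete) rather than tied to any particular API, and the vector database is a dictionary searched by cosine similarity. All names, the prompt wording, and the one-line template are assumptions for illustration.

```python
import math
from typing import Callable


def slug_to_query(path: str) -> str:
    """Turn '/computing-research-subjects' into 'computing research subjects'."""
    return path.strip("/").replace("/", " ").replace("-", " ")


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda cid: cosine(query_vec, vectors[cid]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the instruction and the retrieved text into one prompt."""
    context = "\n\n".join(passages)
    return (
        f"Using only the content below, write the HTML body of a web page about: {query}\n\n"
        f"{context}"
    )


def generate_page(
    path: str,
    embed: Callable[[str], list[float]],
    vectors: dict[str, list[float]],
    texts: dict[str, str],
    complete: Callable[[str], str],
    template: str = "<main>{body}</main>",
    k: int = 3,
) -> str:
    """Run the whole workflow for one requested URL path."""
    query = slug_to_query(path)
    ids = top_matches(embed(query), vectors, k)
    body = complete(build_prompt(query, [texts[i] for i in ids]))
    # Drop the generated HTML into the branded template.
    return template.replace("{body}", body)
```

In a real deployment, embed and complete would wrap the embeddings and chat completion requests, and the dictionary lookup would be replaced by a query to the vector database.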

While experimenting with fine-tuning in OpenAI, I aimed to train a model that could generate HTML content aligned with our University’s branding and content standards. I had limited success and ultimately fell back on the standard GPT-3.5 Turbo model.

The navigation aspect worked exceptionally well. I instructed the AI to create navigation links to relevant pages based on the context of the current page, rather than limiting it to pages that already exist. As a result, when the user follows one of these links, the entire process restarts with the new URL, effectively creating a dynamic website on the fly as the user navigates from page to page.

The drawbacks

There are a number of drawbacks to this approach:

Slow user experience: building web pages on the fly requires API calls and database lookups that can be time-consuming, resulting in a slow user experience.
Batch processing limitations: the knowledge base is created using a batch process, meaning that any changes in content will only show up after the batch process has been run. This causes delays in updating the delivered content.
Cost constraints: each request to OpenAI incurs a small cost, making running this process at scale significantly more expensive than a standard website.
Quality of content: the quality of the generated content depends on the input data. Outdated or poor-quality content can find its way into the knowledge base, leading to subpar generated pages. Using AI to assess the quality of content before it is inserted could be beneficial.
Lack of page layout customisation: efforts to customise page layouts based on content were unsuccessful. Every page ended up in the same standard template, removing the ability to adhere to design principles or format content for different delivery channels.

Conclusion

Despite these challenges, I am satisfied with what has been built. There are still obstacles to overcome and improvements to be made, but I believe the foundational building blocks are in place for AI to revolutionise content delivery.