Synthetic Document Generation and Diversity Calculation Using Large Language Models (LLMs)

As AI and LLMs are reshaping data interaction, Getvisibility prioritises ethical AI with a Trusted Framework.

Executive Summary

Artificial Intelligence (AI) and large language models are revolutionising how we interact with and understand data. At its core, AI aims to craft intelligent machines capable of human-like reasoning. Language models, a subset of AI, are centred on generating and interpreting human-like text. Getvisibility’s engagement with AI, particularly with Large Language Models (LLMs), is rooted in our aspiration for trustworthy and ethical AI. Through the integration of our Trusted AI Framework, we ensure that the LLMs we utilise adhere to the highest standards of fairness, accountability, and transparency.

Large Language Models are transforming the landscape of artificial intelligence, setting new benchmarks in text comprehension and generation. We ventured into generating synthetic documents using an LLM, demonstrating our capacity to produce thousands of realistic synthetic documents. This endeavour helped us refine our approach to crafting effective prompts, ensuring the LLM aligns with our objectives.

Our commitment to a trusted AI future goes beyond mere implementation. We have developed and integrated a Trusted AI Framework, which encompasses:

Transparency: Ensuring every AI decision can be explained and understood by humans.

Fairness: Implementing algorithms that are unbiased and do not perpetuate existing prejudices.

Accountability: Taking responsibility for AI actions and decisions.

Privacy: Guaranteeing the protection of user data and respecting individual rights.

Through this framework, we ensure that our work with LLMs and other AI tools remains consistent with our vision for an ethical and responsible AI-driven future.

Key Points of the Report

What is Prompting? – The technique of giving the model specific questions or instructions to enhance the accuracy of its output. Prompts include task directions, result formats, context, and input specifics.

Prompting is more than just giving initial instructions to an LLM; it’s about steering its vast knowledge towards a desired outcome. When we prompt the system, we’re not just asking it to generate content; we’re providing it with a context, a guideline. It’s akin to lighting up a pathway in the vast expanse of the LLM’s capabilities. By mastering prompting, we can fine-tune the LLM’s outputs, ensuring the content aligns with our objectives while maintaining the authenticity we desire.

Synthetic Document Generation – Delving into generating synthetic documents using an LLM, detailing the steps and processes.

Diversity Calculation – A novel method developed using an LLM to measure the diversity of documents, quantifying their similarities or differences.

Our experiments yielded:

  • Successful generation of synthetic documents mirroring original templates in style and content.
  • A metric-based system to quantify document diversity, providing insights into the variability within the synthetic document pool.

Harnessing LLMs for synthetic data creation provides an avenue to generate datasets that resemble real-world data in structure and nuance. Because the documents are synthetic, they address challenges related to privacy and data scarcity, allowing organisations to simulate scenarios, train models, and test algorithms without compromising sensitive data: the generated data looks real, yet it contains no real people’s private information. With the help of LLMs, we are building AI tools that are both powerful and privacy-protective, because we believe AI can be smart and safe at the same time. By intertwining these techniques with our Trusted AI Framework, we ensure context-appropriate, ethical, and fair outcomes. Our dedication to AI is anchored in pioneering technology and in creating trusted, secure models that emphasise privacy and responsibility. With the strengths of LLMs and our commitment to trusted AI practices, Getvisibility guarantees the delivery of AI models that champion privacy and ethics, resonating with our vision for a better AI future.

Introduction

Artificial Intelligence (AI) and large language models are new technologies that are changing how we understand and work with data. Put simply, AI is about making smart machines that can think like humans. Language models, a part of AI, deal with creating and understanding text that reads as though a human wrote it. Getvisibility has a longstanding association with AI, with a specific emphasis on large language models, such as those engineered by OpenAI. Large language models differ significantly from average AI models: they are designed to generate and process human-like text, acquiring knowledge from vast amounts of internet data. Their versatility enables a myriad of functionalities, from responding to inquiries and translating languages to authoring essays and beyond. Getvisibility has utilised the power of these large language models for a significant period, applying them to tasks such as research on classification, synthetic document generation, diversity calculation of documents and named entity recognition, to name a few. This engagement with AI and language models underscores our commitment to innovation and efficiency, guiding us in the creation of intelligent and effective solutions for contemporary challenges.

Our journey with OpenAI

OpenAI, a leader in artificial intelligence, is transforming the field with pioneering technology. Among its notable developments is GPT-4, a model setting new standards in understanding and generating human-like text. Getvisibility’s journey with OpenAI started when we recognised AI’s potential to revolutionise our processes, services, and user experience. OpenAI provides an API, a sophisticated interface that serves as a conduit between their advanced AI models and other software applications. This API allows developers to leverage the AI’s language understanding capabilities within their own applications. We set out to generate fictional or synthetic HR documents by sending requests to the ‘text-davinci-003’ engine, one of the engines provided by OpenAI’s GPT-3 API. This project was significant because it allowed us to generate thousands of synthetic documents that look just like the original documents. During this study, we developed effective ways to write prompts so that the OpenAI model could understand our purpose and generate useful output. A series of steps was followed to generate the synthetic documents. After generating them, we devised a method to determine their diversity; this was important because we needed a metric to assess how diverse the documents were with respect to a given set of criteria.
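As an illustration only, a request of this kind could look like the following minimal sketch, assuming the pre-1.0 openai Python package and the ‘text-davinci-003’ engine; the API key placeholder, the prompt text and the sampling parameters are illustrative, not the exact ones used in our experiments.

# Minimal sketch of a completion request to the legacy GPT-3 API
# (openai Python package < 1.0). Prompt and parameters are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Create a short, entirely fictional HR job application form "
           "with candidate name, skills, experience, location and age.",
    max_tokens=512,
    temperature=0.7,  # some randomness so repeated calls differ
)

print(response.choices[0].text.strip())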

What is Prompting?

A prompt can be a command or a question you ask the model, together with other information such as background details, user inputs, or examples. Supplying these details helps the model understand the task better and give better answers. A prompt has roughly four elements: task direction, result format, surrounding information or context, and input details. Iterative prompt refinement with clear, concise steps leads to superior outcomes. For complex tasks, decompose them into simpler steps to avoid excessive complexity at the start.
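To make those four elements concrete, a prompt might be assembled as in the sketch below; the variable names, the file name and the example task are our own assumptions, not a fixed template from our experiments.

# Illustrative only: composing a prompt from the four elements named above.
task_direction = "Summarise the document below in three short bullet points."
result_format = "Return the summary as a numbered list."
context = "The document is an internal HR policy intended for new employees."
input_details = "Document text:\n" + open("hr_policy.txt").read()  # hypothetical file

prompt = "\n\n".join([task_direction, result_format, context, input_details])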

Chained prompting is a technique used to break complex tasks down into multiple steps to achieve more specific and customised results from ChatGPT. By first asking for the structure of an article and then instructing the AI to formulate corresponding bullet points, you can ensure a more accurate and efficient generation process. Zero-shot prompting means the model can answer your question or perform a task correctly the first time you ask, without needing any specific examples or training for that task. Few-shot prompting involves providing a small number of examples or demonstrations to guide the model’s understanding and performance on a specific task. By giving the model a small amount of example data (shots), few-shot prompting aims to enable it to generalise and generate accurate responses for similar examples it has not seen before. For example:

Input:

“Invoice: Invoice Number: INV-001

Invoice Date: 2023-05-15

Bill To: John Smith

Itemised Details: Product A: Quantity 2, Price per unit $10.00 Product B: Quantity 3, Price per unit $15.00. Using this example, create 2 fictional invoices with the same format”

Output:

Invoice 1:

Invoice Number: INV-002

Invoice Date: 2023-05-16

Bill To: Jane Doe

Itemised Details:

Product C: Quantity 1, Price per unit $8.00

Product D: Quantity 2, Price per unit $12.00

Invoice 2:

Invoice Number: INV-003

Invoice Date: 2023-05-17

Bill To: Michael Johnson

Itemised Details:

Product E: Quantity 3, Price per unit $20.00

Product F: Quantity 5, Price per unit $7.00

By following the format of the first demonstration, we can generate two fictional invoices. The first invoice is for Jane Doe with the corresponding invoice number, date, and itemised details of Product C and Product D. The second invoice is for Michael Johnson with a different invoice number, date, and itemised details of Product E and Product F.
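For reference, here is a minimal sketch of how the invoice example above could be sent to the model as a single few-shot prompt, again assuming the pre-1.0 openai Python package, a placeholder API key and illustrative sampling parameters.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot prompt: one worked example followed by the instruction.
few_shot_prompt = (
    "Invoice: Invoice Number: INV-001\n"
    "Invoice Date: 2023-05-15\n"
    "Bill To: John Smith\n"
    "Itemised Details: Product A: Quantity 2, Price per unit $10.00 "
    "Product B: Quantity 3, Price per unit $15.00\n"
    "Using this example, create 2 fictional invoices with the same format."
)

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=few_shot_prompt,
    max_tokens=400,
    temperature=0.8,  # illustrative; higher values give more varied invoices
)
print(response.choices[0].text.strip())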

When combined with few-shot prompting, chain of thought (CoT) prompting enables improved performance on complex tasks that necessitate reasoning before generating responses. By employing CoT prompting, models can demonstrate a chain of logical steps, facilitating a more coherent and contextually appropriate generation of outputs.

Tips to keep in mind for effective prompting:

1. Keep prompts simple and concise.

2. Be specific about context, outcome, length, format, and style.

3. Provide specific examples or outlines in the prompts.

4. Use placeholders or brackets to guide AI’s actions.

5. Include instructions using words like “Remember” and “Make sure”.

6. Combine CoT prompting and few-shot prompting for better results.

7. Start with few-shot prompting, specifying examples or formats.

8. Use CoT prompting to guide AI step by step.

Synthetic Document Generation

Creating synthetic documents involves feeding example documents into OpenAI, identifying common elements in a small set of documents, generating AI prompts based on these commonalities, and using those prompts to create fictional or synthetic documents. Suppose you have a set of files, which could be HR documents, financial documents, non-financial documents and so on. Reading the contents of every file in order to write prompts yourself is tedious. A better practice is to feed the contents of a few files into the AI first and ask it about the type of files, their main content, their common elements and so on. You can then ask the AI to generate prompts of its own for producing similar kinds of files.

For example: suppose you have some HR documents. You can generate more documents based on these examples by following the steps below.

1. The first step is to feed the contents of the example documents at hand to OpenAI and write a prompt asking for a short description of these files.

Input:

“Identify the type of the provided documents and summarise its main content. Also, highlight the key sections or elements that structure the content of the document. Make sure the summary is brief, concise and to the point (maximum 3 sentences each)”

Here, the prompt has been written so that it is concise yet still includes all the necessary information.
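As a rough illustration, this first step could be scripted as follows. The file names, the helper function and the truncation limit are our own assumptions; the prompt text is the one quoted above, and the pre-1.0 openai package is assumed.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

SUMMARY_PROMPT = (
    "Identify the type of the provided documents and summarise its main content. "
    "Also, highlight the key sections or elements that structure the content of the "
    "document. Make sure the summary is brief, concise and to the point "
    "(maximum 3 sentences each)."
)

def describe_documents(paths):
    # Feed the contents of a few example files to the model and return its summary.
    contents = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            contents.append(f.read()[:3000])  # assumed truncation to stay within limits
    prompt = SUMMARY_PROMPT + "\n\n" + "\n\n---\n\n".join(contents)
    response = openai.Completion.create(
        engine="text-davinci-003", prompt=prompt, max_tokens=300, temperature=0.2
    )
    return response.choices[0].text.strip()

# Hypothetical file names for two example HR documents.
description = describe_documents(["hr_doc_1.txt", "hr_doc_2.txt"])
print(description)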

2. The next step is to use the description generated in Step 1 and ask the AI for the common or shared elements present in the two documents. The shared elements it returns are then slotted into a prompt template of the following form:

Input:

“These are the shared elements of two documents:

{ Output from AI on the shared elements of 2 documents obtained from Step 2}

Using this information, write a prompt so as to generate diverse HR documents of different types and formats, following these shared elements. Make sure that the prompt is a maximum of 3 sentences long and informative in its context. Also, make sure to include some fictional details without leaving any information blank.”

Using this prompt, we can ask the AI to write a prompt itself for generating synthetic documents in the next step. To generate synthetic documents effectively: use clear, concise prompts with examples or outlines; apply placeholders to direct the AI’s actions; highlight key details with phrases like “Remember” and “Make sure”; combine chain-of-thought and few-shot prompting, starting with explicit few-shot prompts; and guide the AI step by step to ensure a coherent flow of content.

3. The next step is to ask the AI to generate its own prompt for producing more documents of the same type, using the common elements identified in the previous step. For example:

Input:

“These are the shared elements of two documents:

1. Both documents are related to job recruitment and human resources.

2. They contain information about the professional background and capabilities of candidates.

3. Both involve identification of candidate skills and experience levels.

4. The location of the candidate/job is a common element.

5. The age of the candidate is present in both documents.

Using this information, please write a prompt so as to generate diverse HR documents of different types and formats, following these shared elements. Make sure that the prompt is a maximum of 3 sentences long and informative in its context. Also, make sure to include some fictional details without leaving any information blank.”

The AI will return an output prompt which can be used to generate any number of fictional or synthetic documents. In this case, the output generated by the AI is as follows:

Output:

Create comprehensive, fictitious HR documents: job application form, performance evaluation report, training needs assessment form, diversity inclusion survey, and a resume. These should encompass candidate specifics, skills, prior work experience, locale, and age. Each document should be filled with imaginative yet realistic details and reflect a variety of document types and formats to demonstrate a diverse pool of candidates, their backgrounds, and potential contributions.

We can give this as a common instruction and generate any number of HR documents of our choice, such as job application forms and training needs assessment forms, each with a separate prompt.
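Once the AI has produced this meta-prompt, it can be reused in a loop to generate as many synthetic documents as needed. A minimal sketch follows, assuming the pre-1.0 openai package; the number of documents, the output file names and the sampling parameters are illustrative.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# The meta-prompt written by the AI in the previous step.
GENERATION_PROMPT = (
    "Create comprehensive, fictitious HR documents: job application form, performance "
    "evaluation report, training needs assessment form, diversity inclusion survey, and a "
    "resume. These should encompass candidate specifics, skills, prior work experience, "
    "locale, and age. Each document should be filled with imaginative yet realistic details."
)

for i in range(1000):  # number of documents is illustrative
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=GENERATION_PROMPT,
        max_tokens=700,
        temperature=0.9,  # higher temperature encourages more varied documents
    )
    with open(f"synthetic_hr_{i:04d}.txt", "w", encoding="utf-8") as f:
        f.write(response.choices[0].text.strip())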

Diversity calculation with OpenAI API

We can use the OpenAI API to evaluate and compare the diversity of a specific category of documents with respect to certain criteria. This is an important step because it enables us to quantify, with a metric, whether two documents chosen at random from a set are similar or dissimilar. The steps are outlined below:

1. Define the criteria. Criteria are the set of commonalities or guidelines that a specific category of documents follows.

2. After defining the specific criteria, randomly select a sample of 100 files and write a prompt so that the AI reads the text contents of two files and responds with whether the files are Similar, Neutral or Dissimilar.

3. Send the prompt to the AI, use the NLTK library to extract the answer from the AI’s response, and calculate the average diversity score of the documents (diversity score: 0 = Similar, 0.5 = Neutral, 1 = Dissimilar), as sketched in the code after this list.
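Below is a minimal sketch of such a pairwise scoring loop. The prompt wording, the example criteria, the truncation limits and the keyword matching are our own assumptions; the 0 / 0.5 / 1 mapping follows the scheme above. It assumes the pre-1.0 openai package and that NLTK’s ‘punkt’ tokenizer data is installed.

import random

import openai
from nltk.tokenize import word_tokenize  # requires NLTK's 'punkt' tokenizer data

openai.api_key = "YOUR_API_KEY"  # placeholder

# Example criteria; in practice these are the shared guidelines defined in step 1.
CRITERIA = "Both documents should describe candidate skills, experience, location and age."
SCORES = {"similar": 0.0, "neutral": 0.5, "dissimilar": 1.0}

def compare_pair(text_a, text_b):
    # Ask the model to judge one pair of documents against the criteria.
    prompt = (
        f"Criteria: {CRITERIA}\n\n"
        f"Document 1:\n{text_a[:1500]}\n\nDocument 2:\n{text_b[:1500]}\n\n"
        "With respect to the criteria, answer with exactly one word: "
        "Similar, Neutral or Dissimilar."
    )
    response = openai.Completion.create(
        engine="text-davinci-003", prompt=prompt, max_tokens=5, temperature=0.0
    )
    # Tokenise the answer and map the first recognised keyword to a score.
    for token in word_tokenize(response.choices[0].text.lower()):
        if token in SCORES:
            return SCORES[token]
    return None  # unrecognised answer; the pair is skipped

def average_diversity(documents, n_pairs=100):
    # Sample random pairs of document texts and average their diversity scores.
    scores = []
    for _ in range(n_pairs):
        text_a, text_b = random.sample(documents, 2)
        score = compare_pair(text_a, text_b)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores)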

Figure 1: Plotting the diversity score of documents

The figure above shows the count of documents at each diversity score. As the diversity score increases, the documents become more diverse or dissimilar. Here we can see that most of the HR documents are similar with regard to the given criteria. In this way we can measure how diverse the documents are and calculate the average diversity score.
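A plot like Figure 1 can be reproduced with a simple histogram of the collected scores. The sketch below uses matplotlib; the example score list is illustrative and would in practice come from the comparison step above.

import matplotlib.pyplot as plt

# Illustrative scores; in practice these come from the pairwise comparison step.
scores = [0.0, 0.0, 0.0, 0.5, 0.0, 1.0, 0.5, 0.0]

plt.hist(scores, bins=[-0.25, 0.25, 0.75, 1.25], rwidth=0.6)
plt.xticks([0, 0.5, 1], ["Similar (0)", "Neutral (0.5)", "Dissimilar (1)"])
plt.xlabel("Diversity score")
plt.ylabel("Count")
plt.title("Diversity score of documents")
plt.show()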

Conclusion

Using OpenAI to create synthetic data is a powerful and flexible way to create fictional datasets for a variety of uses. Thanks to OpenAI’s advanced language models, it is possible to make synthetic data that reflects the statistics and patterns found in real-world data. This method is very useful for dealing with privacy issues and data limitations, because the fictional data closely resembles the original while keeping users anonymous. OpenAI’s models make it possible to produce large synthetic datasets with varied features and complex structures. As a result, organisations can use these datasets to train their models, test their algorithms, and simulate different scenarios without revealing any sensitive information. Using OpenAI’s ability to create synthetic data opens new possibilities for data-driven innovation while allowing researchers to work safely and prioritise privacy.

We can also calculate the diversity of the synthetic documents by writing a good prompt and asking the AI to rate pairs of documents as similar, neutral or dissimilar. We also observed that it is better to start with few-shot prompting, listing specific examples or formats for how you want the content to be generated, and then apply chain-of-thought prompting so that the AI thinks step by step when generating output. Therefore, by combining different methods for a specific use case, we can generate synthetic documents that are context-appropriate written materials.

Data Scientist
Lakshmi Menon

