In an age where data has become an invaluable asset, organisations dedicate significant resources to extracting insights from the vast quantities of information at their disposal. As data is often unstructured in the form of text, such as chats, emails, and documents, effectively organising and understanding it is essential for maximising its value. One approach entails categorising data based on its sensitivity and confidentiality, enabling the implementation of appropriate security measures to protect it.
To achieve this, Getvisibility employs a feed-forward neural network known as a classifier, which rapidly assigns tags or categories to documents. By organising and analysing text automatically and efficiently, classifiers provide a cost-effective solution for managing large volumes of information.
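A classifier of this kind can be sketched in a few lines. The categories, training texts, and model settings below are illustrative stand-ins, not Getvisibility's actual model or data:

```python
# Minimal sketch of a feed-forward text classifier: TF-IDF features
# feeding a small multilayer perceptron. Toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "employee onboarding policy and leave request form",  # HR
    "quarterly revenue forecast and budget summary",      # financial
    "api endpoint returns a json payload over https",     # technical
    "non-disclosure agreement between the two parties",   # legal
]
labels = ["hr", "financial", "technical", "legal"]

clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
clf.fit(texts, labels)
print(clf.predict(["signed agreement and terms between the parties"])[0])
```

The pipeline maps each document to a sparse feature vector and lets the hidden layer learn category boundaries; in production, the feature extraction and label set would of course be far richer.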
Recently, Chinchilla has emerged as a topic of interest within the realm of large language models (LLMs). Chinchilla is a compute-optimal model from DeepMind whose training demonstrated the balance between model size and the number of training tokens. Researchers have been analysing the relationship between the computational resources required for LLMs and the number of tokens they process, aiming to understand the trade-offs and potential improvements in LLM architectures. While this line of inquiry is vital for understanding the scalability and efficiency of LLMs, it is equally important to investigate their performance when handling noisy or compromised input data, which may arise from issues such as faulty decompression.
Our research at Getvisibility addresses this concern by examining the impact of deliberately truncated text inputs on the accuracy of LLMs. Furthermore, by exploring the model’s robustness in processing imperfect data, we contribute valuable insights to the ongoing discourse surrounding the capabilities and limitations of LLMs. These insights are relevant not only to researchers seeking to optimise existing models but also to organisations looking to harness the power of LLMs in real-world applications, where noisy or compromised data may be a common occurrence.
Research on large-scale language-model training reveals that increasing a model's size without increasing the number of training tokens has been the dominant trend. However, it is also crucial to determine the appropriate tokens for the classifier to improve its accuracy and speed. Training tokens are the words or units of text that the model is trained on. But how do we know that the tokens sent to the classifier are appropriate and do not contain noise? Our study focused on identifying less noisy text to help the classifier improve its accuracy and speed.
We experimented with PII detectors and six document categories to determine the optimal token size for improving the classifier's performance. PII detectors aim to detect Personally Identifiable Information (PII) such as names, addresses, and social security numbers. The document categories are HR, business, technical, legal, marketing, and financial records.
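As a rough illustration of what a PII detector looks for, here is a toy rule-based spotter. Real detectors typically combine ML models with rules, and the patterns below are simplified assumptions for illustration only:

```python
import re

# Toy rule-based PII spotter. Production detectors are ML-driven and
# cover names, addresses, and many more identifier formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return every PII type whose pattern matches the text."""
    return {kind: pat.findall(text)
            for kind, pat in PII_PATTERNS.items() if pat.search(text)}

hits = find_pii("Contact jane@example.com, SSN 123-45-6789.")
print(sorted(hits))  # ['email', 'ssn']
```

A document that triggers any of these patterns would be routed to stricter handling, which is exactly the sensitivity-based categorisation described above.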
Fig. 1. Linear relationship between the number of tokens and the accuracy of the classifier for each category.
(a) Document Category (1 of 6)
(b) PII detector (experimental). The x-axis shows the log scale of the number of tokens, and the y-axis represents the accuracy. The scatter point illustrates the accuracy at each token size, while the red line shows the best-fit line.
Before moving forward, we studied the correlation between the number of tokens and accuracy. The experiment covered two classifier types: the document-category classifier and the PII detector. The graphs in Fig. 1 show a linear relationship between the (log-scaled) number of tokens and accuracy: accuracy increases as the number of tokens increases. We then experimented to find the optimal number of tokens to send to the classifier from each file to achieve better accuracy.
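The best-fit line in Fig. 1 can be reproduced in miniature by regressing accuracy on the log of the token count. The numbers below are synthetic placeholders, not the experiment's data:

```python
import numpy as np

# Illustrative (token count, accuracy) pairs; fit accuracy against
# log2(tokens) and report R^2, mirroring the red line in Fig. 1.
tokens = np.array([64, 128, 256, 512, 1024, 2048])
accuracy = np.array([0.71, 0.75, 0.80, 0.84, 0.87, 0.88])

x = np.log2(tokens)
slope, intercept = np.polyfit(x, accuracy, 1)
pred = slope * x + intercept
ss_res = np.sum((accuracy - pred) ** 2)
ss_tot = np.sum((accuracy - accuracy.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"accuracy ~ {slope:.3f}*log2(tokens) + {intercept:.3f}, R2 = {r2:.2f}")
```

A positive slope with a high R² is what the figure conveys: each doubling of the token count buys a roughly constant gain in accuracy, until the curve begins to flatten.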
Fig. 2. The optimal number of tokens to send to the classifier is 1024.
(a) Document Category
(b) PII detector.
Fig. 2 illustrates that the optimal token size varies across the categories. The document category improves steadily up to 1024 tokens and only slightly thereafter, while the PII detector's accuracy peaks around 512 tokens; although it performs well at that size, accuracy still improves noticeably with more tokens. Hence, Table 1 and Table 2 show that at least 1024 tokens should be sent to the classifier, and the R² value confirms the correlation between accuracy and token count.
Table 1. Number of tokens sent to the classifier and their resulting accuracy for the document and PII categories
Table 2. Optimal token sizes and R² value for each category
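In practice, this finding translates into clipping each document to its first 1024 tokens before classification. The whitespace tokenizer below is a stand-in assumption for the model's real subword tokenizer:

```python
def truncate_to_tokens(text: str, max_tokens: int = 1024) -> str:
    """Keep only the first max_tokens tokens of a document.

    Whitespace splitting stands in for the classifier's actual
    tokenizer; the 1024 default reflects the optimum found above.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

doc = "word " * 3000
clipped = truncate_to_tokens(doc)
print(len(clipped.split()))  # 1024
```

Documents shorter than the limit pass through unchanged, so the truncation step only affects the long tail of large files where the extra tokens add little accuracy.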
In summary, determining the optimal token size for the classifier can significantly improve its accuracy and speed. The experiment results show that the optimal size varies by document category, and using PII detectors can further enhance accuracy. The graphs and tables provide a detailed analysis of the correlation between the number of tokens and accuracy. This can help organisations improve their language models' performance while keeping the amount of data constant.
At Getvisibility, we continuously improve our classification model to better serve our clients, refining the model architecture around the optimal token size. By utilising classifiers, we offer a cost-effective solution for efficiently managing large volumes of unstructured text data, such as chats, emails, and documents. Our research shows that determining the optimal token size for the classifier can significantly improve its accuracy and speed, enhancing language-model performance while keeping the amount of data constant. By applying these insights to real-world applications, organisations can harness the power of language models, improve their ability to extract valuable insights from large volumes of data, and make informed decisions that benefit their business operations.