In recent years, I’ve seen a sharp increase in the number of projects related to the use of large language models (LLMs). A typical request looks like this:
A certain company handles large amounts of documents with certain “quirks”, for example, judicial documents, instruction manuals, invoices, etc. Manually handling these documents is expensive and labor intensive, thus the company is looking for a way to optimize this process and get a document analysis system developed.
The typical functionality companies ask of automatic document processing systems is:
All these tasks are solved very well with LLMs.
While we all know what LLMs are, at least on a basic level. However, when it comes to implementing one into your processes, it becomes challenging to pick the LLM for your project.
There are two concerns when it comes to picking an LLM. First, overall performance against your documents. An LLM might perform well as a chatbot or as a text summarisation tool but will fail when trying to extract data from an invoice. You need to understand, better yet test, how well an LLM you are considering might work in your domain. This will not only result in a more effective product but potentially much less money spent: there may be no need to use GPT (and pay OpenAI the fee) when a less powerful but well-performing LLM will do the trick.
Second, data security. Most data from documents is either confidential or sensitive, so feeding it to a cloud-based LLM is not a good idea for obvious reasons. You need to weed out cloud-based models and look for ones that allow local setup.
Given these concerns, choosing an LLM is not as straightforward as it may seem. When you consider the number of LLMs out there, it’s easy to make a wrong choice and invest a lot of money and time only to realize the model is not well-suited for the task of document analysis. There are, however, tests you can perform to find the perfect LLM for your task — both performance and cost-wise — and effectively automate the document processing.
As LLMs are language models, the best way to test them is with language: yes, asking LLM questions is a good way of evaluating its performance. But not just any questions: you need to evaluate LLMs from different angles by asking them specific questions and giving them specific tasks.
Here’s a list of questions and tasks which can help perform a well-rounded evaluation:
Generation of text based on a prompt,
Answers to common questions,
Answers to questions in a conversation format,
Grammatical error correction.
Text summary,
Answers to text-based questions,
Structured data extraction.
With this task we aim to assess the quality of a text generated by an LLM, including assessing grammar, text coherence, narrative style and topic relevance.
Text generation query, example 1:
Please generate text with the following parameters: Imagine a future where technology has advanced so much that people can travel to other planets as tourists. Describe a day in the life of a tourist visiting Mars. Include details about the places they visited, the experiences they had, and the people (or other creatures) they met.
Text generation query, example 2:
Please generate a text with the following parameters: Generate a text that resembles an official order issued in a law firm. The text should include the document header, order number, date, main content, and a place for signature. If necessary, you can add any fields that are typical for an order in a law firm. Use universal values to fill in the fields. The generated text should have a structure typical for an order, including indents and line breaks.
Here, we evaluate answer accuracy and coherency. Example questions: Please answer the following questions briefly:
What is the theory of relativity?
Who wrote "Romeo and Juliet"?
What causes the change of seasons on Earth?
What is the value of the number Pi (π)?
Who is the author of the theory of evolution?
These questions need to be asked one by one in the form of a conversation to test how well an LLM can “remember” context.
Example:
Please answer the following questions briefly. How much does the Moon weigh?
And Mars?
And in pounds?
And what is the distance between them?
What else do we know about them?
What topic do these questions relate to?
What else does it relate to?
Tell me about the last one.
And what was the first question?
Here we evaluate how well an LLM can correct a text:
Example:
Please correct the grammatical errors in the following text. The result should be a coherent, grammatically correct text.
Yesterday I went to a store to purchase groceries to cook lunch. Mine friend told me I have needed to buy vegetables and meat. We planned to cook pasta but we lacked ingridients. I made a mistakke and bought milk instead of tomato sauce. When I came home, I noticed I forgotted to buy bread. I tried to fix my mistake, but it were too late and the shops was closed.
Here we assess how well an LLM grasps the main concepts of a text.
Text summary, example 1:
Please summarize the following text:
Harper Lee, To Kill A Mockingbird
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow. When it healed, and Jem’s fears of never being able to play football were assuaged, he was seldom self-conscious about his injury. His left arm was somewhat shorter than his right; when he stood or walked, the back of his hand was at right angles to his body, his thumb parallel to his thigh. He couldn’t have cared less, so long as he could pass and punt.
When enough years had gone by to enable us to look back on them, we sometimes discussed the events leading to his accident. I maintain that the Ewells started it all, but Jem, who was four years my senior, said it started long before that. He said it began the summer Dill came to us, when Dill first gave us the idea of making Boo Radley come out.
Text summary, example 2:
Please summarize the following text:
SUPERIOR COURT COUNTY OF LOS ANGELES, STATE OF CALIFORNIA(COURT ORDER)
ORDER TO DISCLOSE VIRGIN MOBILE WIRELESS RECORDS
Date: April 26, 2010
VIRGIN MOBILE CUSTODIAN OF RECORDS: 10
Independence Blvd. San Luis Beach, Ca. 90987
Fax: (310)555-4205
Virgin Mobile is ordered to provide the following records regarding the account with the phone number of (310) 555-4032. All personal information used to open the account, such as the subscriber’s name and Social Security Number, the billing address, the address where the service was connected, if different from the billing address, all telephone toll records for the last month of the account, if the service included call forwarding, disclose the forwarded telephone number along with the subscriber’s name and billing address. Include Make, Model, phone numbers, and ESN numbers of all phones associated with the accounts. Include forms of payment on the original account also. These records are to be returned to the Affiant within ten (10) days from the date in which the order is served.Virgin Mobile, its agents and employees are ordered not to disclose the existence of this court order to the subscriber(s), unless and until ordered to do so by the court.
We evaluate answer accuracy and coherency.
Answers to text-based questions, example 1:
Please answer questions based on the text below:
Harper Lee, To Kill A Mockingbird
Miss Maudie had known Uncle Jack Finch, Atticus’s brother, since they were children. Nearly the same age, they had grown up together at Finch’s Landing. Miss Maudie was the daughter of a neighboring landowner, Dr. Frank Buford. Dr. Buford’s profession was medicine and his obsession was anything that grew in the ground, so he stayed poor. Uncle Jack Finch confined his passion for digging to his window boxes in Nashville and stayed rich. We saw Uncle Jack every Christmas, and every Christmas he yelled across the street for Miss Maudie to come marry him. Miss Maudie would yell back, “Call a little louder, Jack Finch, and they’ll hear you at the post office, I haven’t heard you yet!”
Questions:
How old is Miss Maudie compared to Uncle Jack Finch?
What was the occupation of Miss Maudie’s father?
What would Uncle Jack Finch do every Christmas?
Answers to text-based questions, example 2:
(the questions are based on a text from example 2 of text summary):
Questions:
What is this document about?
In which geographical region was this document created?
What type of document is this: a fiction book, a legal document, or a textbook?
With this prompt we evaluate how well the model can extract relevant data from a text.
Example:
(the query is based on a text from example 2 of text summary):
Please extract the following data from this document: document type, date, court and location data, respondent name and address, what records need to be disclosed, compliance deadline, non-disclosure requirements.
Example:
Here is a description of a small database:
Users table:
user_id (INT): unique user identifier
username (VARCHAR): user name
email (VARCHAR): user email
created_at (DATE): account creation date
orders table:
order_id (INT): unique order identifier
user_id (INT): user who placed the order
product_id (INT): product identifier
order_date (DATE): order date
quantity (INT): quantity of units ordered
products table:
product_id (INT): unique product identifier
product_name (VARCHAR): product name
price (DECIMAL): product price
Please write a SQL query to find all orders made by a user named john_doe, including information about the product name and quantity ordered.
Here are a few important parameters to consider on top of the above testing: