Screens Redlining Evaluation
Executive Summary
Evaluating the accuracy of large language models (LLMs) on contract review tasks is critical to understanding their reliability in the field. At Screens, we focus on application-specific ways to evaluate the performance of various aspects of our LLM stack. We’ve previously released an evaluation report that measures an LLM system’s ability to classify a contract as meeting or not meeting sets of substantive, well-defined standards.
Now, we turn our attention to the system’s ability to correct failed standards with suggested redlines. We find that the Screens product, which employs this system, achieves a 97.6% success rate at correcting failed standards with redlines.
Introduction
Screens is an LLM-powered contract review and redlining tool that runs in a web application and directly in Microsoft Word. Our customers use Screens to review and redline individual contracts or to analyze large sets of contracts in bulk. The core abstraction that enables this is what we call a screen. A screen contains any number of standards. Standards are criteria that are either met or not met in any given contract. Our users draft and perfect standards to test contracts for alignment with their team’s or client’s preferences, risk tolerance, negotiating power, and perspectives on contract issues.
When a standard is not met in a given contract, the platform suggests a redline intended to correct the failed standard.
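To make the abstraction concrete, here is a rough sketch of how screens, standards, and results relate. The names and fields below are hypothetical illustrations, not the actual Screens schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data model for illustration only; not the actual Screens schema.

@dataclass
class Standard:
    text: str  # e.g. "The contract should require pre-approval for any use of the purchaser's marks."

@dataclass
class Screen:
    name: str
    standards: list[Standard]

@dataclass
class StandardResult:
    standard: Standard
    met: bool                                # did the contract meet the standard?
    explanation: str                         # the platform's reasoning for the result
    suggested_redline: Optional[str] = None  # populated only when the standard is not met
```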
Here is an example of a common standard: The contract should require the vendor to receive pre-approval from the purchaser for any use of the purchaser's name, marks, or logos for marketing or publicity.
Here is an example of this standard failing (in Fivetran’s publicly hosted Master Subscription Agreement):
Fivetran may use and display Customer’s name and logo on Fivetran’s website and marketing materials in connection with identifying Customer as a customer.
According to the screen result on the Screens platform: The document allows Fivetran to use the Customer's name and logo for marketing without requiring pre-approval, which does not meet the requirement for pre-approval.
The Screens platform then suggests a redline that strikes “in connection with identifying Customer as a customer” and inserts “only with Customer’s prior written approval.” The redline can be accepted and applied directly in Microsoft Word, either on its own or alongside other suggested modifications.
The resulting clause after accepting tracked changes looks like this:
Fivetran may use and display Customer’s name and logo on Fivetran’s website and marketing materials only with Customer’s prior written approval.
This is our primary approach to redlining: identify failed standards, suggest modifications to correct those failed standards, and apply those suggestions directly to the contract. Depending on the contract and the screen, there could be anywhere from a couple to a couple dozen failed standards.
We’ve put this report together to showcase just one of the ways that we think about evaluating the accuracy of this process. Our goal is to pull back the curtain and offer transparency into how we think about building and evaluating LLM-powered contract review tools.
Methodology
The evaluation methodology that we’ve chosen to share in this report can be summarized with the following question: What percentage of the suggested redlines actually correct the failed standard when applied to the contract?
More specifically, consider the following steps (sketched in code after the list):
- Screen a contract
- Accept all of the suggested redlines
- Screen it again
- Measure how many of the initial failed standards now pass
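As a minimal sketch of that loop (simplifying a screen result to a mapping from standard text to pass/fail, and using hypothetical `screen` and `accept_redlines` callables in place of the platform’s actual functionality):

```python
from typing import Callable

# Hypothetical stand-ins for the platform's functionality:
#   screen(contract_text)              -> {standard_text: met?}
#   accept_redlines(contract, failed)  -> revised contract text
ScreenFn = Callable[[str], dict[str, bool]]
RedlineFn = Callable[[str, list[str]], str]

def evaluate_contract(contract: str, screen: ScreenFn, accept_redlines: RedlineFn) -> tuple[int, int]:
    """Return (initial_fails, fails_resolved_by_redlines) for one contract."""
    first_pass = screen(contract)
    failed = [std for std, met in first_pass.items() if not met]

    revised = accept_redlines(contract, failed)  # accept all suggested redlines
    second_pass = screen(revised)                # screen the revised contract again

    resolved = sum(1 for std in failed if second_pass.get(std, False))
    return len(failed), resolved
```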
Pros
- This evaluation is automated and can be run entirely by LLMs via the native functionality of the Screens platform, making it repeatable and scalable
- It is narrowly focused on the primary objective of redline generation: passing the standard
Cons
- This doesn’t take into consideration other factors that we optimize for:
  - Brevity of the change and/or likelihood of being accepted by the counterparty
  - Adherence to industry standards and redlining etiquette
- There is no human review or manual grading of the quality of the suggested redlines
Given the pros, we think this methodology is a valuable exercise that answers this important question about the quality of redlines. Given the cons, we don’t rely solely on this evaluation technique. Focusing only on passing the standards can have catastrophic consequences: aggressive redlines that counterparties will rarely accept, violations of industry etiquette, and unintended side effects of language choices.
We used the Screens platform to screen 50 publicly available terms of service for software companies. We used the publicly available SaaS Savvy: Lower Value Purchases screen from the Screens community. We measured how many standards failed for each screened contract, accepted all of the redline suggestions, re-screened each contract, and finally measured how many of the initial failed standards now pass.
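Continuing the hypothetical sketch above, the headline score is simply the share of initially failed standards that pass after re-screening, aggregated across all contracts:

```python
def aggregate_score(per_contract: list[tuple[int, int]]) -> float:
    """per_contract holds (initial_fails, fails_resolved) pairs, one per contract."""
    total_fails = sum(fails for fails, _ in per_contract)
    total_resolved = sum(resolved for _, resolved in per_contract)
    return total_resolved / total_fails if total_fails else 1.0

# With the totals reported below (521 resolved out of 534 fails), this is ~0.9757, or 97.6%.
```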
Caveats
We used a single screen and a single contract type for this analysis. Results may vary for different types of contracts and different types of standards. A more robust analysis might include multiple screens and multiple contract types, more accurately reflecting the workload of a real-world contract professional. We chose this screen for its variety of strict standards. We chose this contract type due to the availability of public contracts that are good candidates for the screen, allowing us to make the results public and auditable.
Results
What percentage of the suggested redlines actually correct the failed standard when applied to the contract? 97.6%
| Contracts | 50 |
| Fails | 534 |
| Resolved by Redline Suggestion | 521 |
| Score | 97.57% |
The number of fails per contract ranged from 5 at the lowest to 18 at the highest.
Error Types
When prompting LLMs to resolve failed standards with redlines and applying those redlines directly in Microsoft Word, there are a number of distinct failure modes that warrant individual discussion and understanding. It’s tempting to think that an LLM drafting a revision to a clause will always be able to draft something that passes a given standard. In practice, there are exceptions.
Complex Fail
Straightforward redlines fall into one of two buckets:
- A narrow provision needs to be changed by adding or removing words or phrases within a sentence or two of consecutive text.
- Net-new sentences need to be added either at the beginning or end of an existing section or in a new section.
However, sometimes correcting failed standards requires more complex modifications at multiple touch points throughout the contract.
For instance, consider a standard that requires that the contract not reference any supplemental hosted addendums or incorporate them by reference in any way. If a contract has defined terms for multiple addendums in a definitions section and then references those addendums in complex ways at dozens of touch points throughout the contract, this can be challenging to work through. While Screens will attempt these modifications, they are more error-prone than the straightforward buckets described above.
Ineffective Redline
Sometimes the generated redline doesn’t modify the language enough to result in a pass on the second review. When the broader system is generating redline suggestions, it is balancing a number of considerations. It aims to:
- Ensure that the standard will pass if the redline is accepted
- Keep the change as lightweight as possible to maximize the likelihood of being accepted by the counterparty
- Avoid any changes unrelated to the failed standard, even if they would make nearby provisions more favorable to the redlining party
- Keep the defined terms and writing style aligned with the original draft
This balance can occasionally cause the redline to move the provision in the right direction, but not go far enough to result in a pass. This is more common with strict standards that require precise wording in order to pass.
Review Error
For this type of error, the initial failure is correctly flagged and a sufficient redline is generated to correct it, but the LLM makes a mistake during the second review, mistakenly deciding that the standard still fails. While this is the least common error type, it will always be possible as long as LLMs are less than perfect at determining whether standards should pass or fail.
Understanding these different failure modes can help users see where things can go wrong as they assess the right use cases for Screens, and it helps us target the best paths to improving the redlining system. While errors do occur, this analysis shows that they are relatively rare.
Reproducing the Analysis
This analysis used public contracts with a community screen, and it can be reproduced by anyone on the Screens platform by following the methodology described above. The results, which include links to the public contracts used in the analysis, are available for review here.