Coding Suggestions identifies codes relevant to documents based on user-provided code criteria. When paired with the proper process, Coding Suggestions can meaningfully reduce manual human review while maintaining review quality. You can use suggestions in reviews of all sizes and types.
It is important to think of Coding Suggestions as both a technology and a workflow. As a technology, Coding Suggestions leverages the language understanding and reasoning capabilities of LLMs to analyze documents against instructions provided in natural language by users. As a workflow, Coding Suggestions requires phases of testing, iteration, and validation to get effective results.
This article walks through key Coding Suggestions concepts and example workflows that you can implement in your review on Everlaw.
Table of Contents
- Key Coding Suggestions concepts
- The recommended workflow
- Coding Suggestions Performance Metrics Template
Key Coding Suggestions concepts
Core technology
Traditional AI document review systems, like predictive coding, are built on supervised learning mechanisms: as reviewers code documents, the system continuously learns from patterns in the text and the reviewers' coding decisions to generate predictions about relevance for any given document in your case.
Coding Suggestions is built on a different paradigm. Underlying the tool are generative large language models (LLMs) pre-trained on enormous data sets, which gives them sophisticated language understanding and reasoning capabilities. Unlike predictive coding, these models are not continuously trained or updated as suggestions are generated. Instead, these models are adapted to the document review task through user-provided prompts, which we call coding criteria in the context of Coding Suggestions.
A document is evaluated only upon request, at which point the current code criteria is sent, along with the document’s content, to the LLM to generate a suggestion.
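To make this concrete, the sketch below models a single on-request evaluation as one prompt-and-response exchange. It is purely conceptual and does not reflect Everlaw's actual implementation; the function names, prompt layout, and example values are assumptions for illustration.

```python
# Conceptual sketch only: how one on-request evaluation can be thought of.
# Names and prompt layout are illustrative assumptions, not Everlaw's implementation.

def build_evaluation_prompt(case_description: str, code_criteria: dict, document_text: str) -> str:
    """Assemble the context a model would need to evaluate one document against the current criteria."""
    criteria_lines = "\n".join(f"- {code}: {description}" for code, description in code_criteria.items())
    return (
        f"Case background:\n{case_description}\n\n"
        f"Code criteria:\n{criteria_lines}\n\n"
        f"Document text:\n{document_text}\n\n"
        "For each code, answer Yes, Soft Yes, Soft No, or No, with a brief justification."
    )

# Hypothetical example inputs; the resulting prompt (plus Everlaw's own instruction layer,
# described below) is what the LLM evaluates to produce a suggestion.
prompt = build_evaluation_prompt(
    case_description="Breach of contract dispute over a 2021 supply agreement.",
    code_criteria={"Responsive": "Documents discussing the supply agreement or its performance."},
    document_text="Email thread regarding delayed shipments under the supply agreement...",
)
print(prompt)
```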
Coding criteria
The coding criteria includes descriptions of the case, the code categories, and the individual codes. In Everlaw, you create these criteria in natural everyday language, much the same way that you might put together a review protocol for human reviewers.
This information provides the model with the necessary context and guidance for what the codes are meant to capture and how to evaluate whether they apply to a document, given its content. Because all of this context is embodied in the criteria, the quality and precision of your prompts determine the effectiveness of the resulting suggestions. Therefore, to improve the performance of Coding Suggestions against your corpus, you must test and iterate on your code criteria.
Suggestion categories
In addition to the user-provided code criteria, Everlaw adds a layer of instructions to the prompt before sending the data to the LLM for evaluation. This layer includes meta-instructions on how the LLM should approach its evaluation and describes the form the suggestions should take.
In particular, for each configured code, Everlaw provides one of four possible suggestions and a brief justification for the suggestion:
- Yes: The document is directly relevant given the code criteria
- Soft Yes: Although the document is not directly relevant, there is a strong plausible link
- Soft No: The document is, at best, only weakly relevant
- No: The document has no relevance given the code criteria
These categories allow the system to make suggestions more granular than a binary yes or no. They also enable greater flexibility in how you leverage suggestions in workflows. The key to understanding how to use suggestions is realizing that the suggestions will not always match how humans would code a document. For example, codes that you would apply to a document may be suggested as Soft Yes, Soft No, or even No depending on the criteria and how the LLM evaluated the document against it. If coding suggestions are performing well, you can expect to see:
- For Yes and No suggestions: high agreement between the suggestions and human review decisions on whether to apply a code or not
- For Soft Yes and Soft No suggestions: mixed agreement between suggestions and human review, though a greater proportion of documents with Soft Yes suggestions should be found relevant by human reviewers than documents with Soft No suggestions
Knowing this, you can define, expand, or contract sets of documents using suggestions based on your review needs. For example:
- To identify a set of predominantly relevant documents (higher precision), you can filter for documents with only Yes suggestions. But this may mean some relevant documents will be excluded.
- To identify a set that will have more relevant documents (higher recall), but may also have more irrelevant documents, you can filter for documents with both Yes and Soft Yes suggestions.
- To review more ambiguous documents, either to help with prompt iteration or to prioritize human review of more inconclusive documents, you can focus on documents with Soft Yes and Soft No suggestions.
- To locate documents that are likely to be almost entirely irrelevant, you can focus on documents with No suggestions.
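To make these tradeoffs concrete, the short sketch below partitions a hypothetical list of documents by suggestion category. The document IDs and data structure are made up for illustration; they are not a format exported by the platform.

```python
# Hypothetical (doc_id, suggestion) pairs, made up for illustration.
suggestions = [
    ("DOC-001", "Yes"),
    ("DOC-002", "Soft Yes"),
    ("DOC-003", "Soft No"),
    ("DOC-004", "No"),
    ("DOC-005", "Yes"),
]

# Higher precision: only confident positives, at the cost of excluding some relevant documents.
high_precision_set = [doc for doc, s in suggestions if s == "Yes"]

# Higher recall: include plausible positives, accepting more irrelevant documents.
high_recall_set = [doc for doc, s in suggestions if s in ("Yes", "Soft Yes")]

# Ambiguous documents to prioritize for human review or criteria iteration.
ambiguous_set = [doc for doc, s in suggestions if s in ("Soft Yes", "Soft No")]

# Documents likely to be almost entirely irrelevant.
likely_irrelevant = [doc for doc, s in suggestions if s == "No"]

print(high_precision_set)  # ['DOC-001', 'DOC-005']
print(high_recall_set)     # ['DOC-001', 'DOC-002', 'DOC-005']
```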
The recommended workflow
The quality of the code criteria you provide is the primary driver of performance for Coding Suggestions, and in practice it is rare for the initial code criteria to be optimal.
As a result, Everlaw recommends the following workflow stages for Coding Suggestions:
- Create the initial code criteria based on your knowledge of the case and the review goals
- Test and iterate your code criteria on smaller random samples of your corpus until sample performance reaches a satisfactory threshold based on your needs
- Run the suggestions at scale using the finalized criteria
- Validate the final suggestions by reviewing samples of documents with suggestions
Your use case and review goals determine the level of iteration and validation required. For example, if you are a plaintiff firm primarily looking to prioritize documents in received productions, you may not need to optimize your prompts as thoroughly or perform final validation. However, if you are using Coding Suggestions as the producing party, you need to systematically test, iterate, and validate to ensure a defensible review.
Step 0: Identify documents to exclude from suggestions
There are some documents that you may want to exclude from Coding Suggestions. In particular:
- Documents with important non-textual data: The LLMs underlying Coding Suggestions only take document text into account. If you have data in your corpus where visual information is key to the content and meaning, you should filter these out prior to running suggestions against a dataset.
- Documents that require complex numerical analysis to understand: LLMs may struggle if the evaluation requires complex numerical analysis. You may want to reserve such documents for human review only.
- Documents that can be categorized or identified by keyword or metadata values alone: If there are documents that can be coded based on keyword matches or metadata filters alone, you should consider excluding them from Coding Suggestions to reduce cost.
To exclude such documents, run a search and batch add them to a binder. You can then exclude documents from the binder in later steps in the workflow.
Step 1: Create initial code criteria
To create your initial code criteria, you should think about two key things:
| What to consider | How to think about it |
|---|---|
| Which codes to create or configure | The codes you write criteria for should balance specificity with avoiding redundancy. |
| Gathering the right information to create the code criteria | To create effective code criteria, you need to understand the case, your review goals, and what each code is meant to capture. |
Once you’ve created the appropriate codes in Project Settings and gathered the right background information, you can create the code criteria in the Project Settings > Everlaw AI > Assistant tab:
- For a step-by-step guide on how to create the code criteria, see this article on Coding Suggestions.
- For best practices and tips on how to write the code criteria, see this article on best practices for writing Coding Suggestions criteria.
Step 2: Test and iterate code criteria
Once you have your initial code criteria set up, test it against a random sample of documents. We recommend testing your code criteria against a minimum of 25 documents. You might iterate your code criteria and test them against documents multiple times. The goals during this phase are twofold:
- Have several confirmed examples of accurate suggestions for each code (both positive and negative)
- Ensure that code criteria is tested against a representative subset of your wider corpus
Once you have met these goals, you’ll have greater confidence that suggestions generated at scale will perform well.
Here’s an exemplar workflow you can follow during this stage:
- [Optional] Set up a homepage folder to hold the collections of documents you’ll be using for testing and iteration. You can use this folder to hold cards related to future steps in the workflow as well. To learn more about homepage folders, see this article about homepage folders.
- Create an initial random sample of 25 documents (after any exclusions you may want to apply). To learn more about sampling, see this article about search settings.
- Batch run Coding Suggestions on your initial sample.
- Manually review each document. As you review, decide whether or not you agree with the suggestion.
- For manual review, apply the codes that you think should be applied to the document.
- Then, compare the applied codes to the suggestions in the coding suggestions tab of the AI context panel.
- For documents where you disagree with one or more suggestions, make a note of the suggestions you think are incorrect and why, grounded in the document’s content and the current code criteria. To learn more about applying notes to documents, see this article about applying notes and highlights.
- Examine your notes on inaccurate suggestions and identify trends or themes.
Tip
Add suggested codes, applied codes, and notes as separate, side-by-side columns in your results table view so you can compare the two sources of codes and see your notes at a glance.
- Based on this analysis, revise your code criteria from the code criteria configuration panel or in Project Settings. To learn more about how to revise your code criteria and viewing your history of changes, see this article about coding suggestions.
- Rerun the suggestions on the original documents.
- Verify that the updated suggestions better match your manually applied codes and that there is no regression (i.e., the updated criteria did not cause previously correct suggestions to change)
- Rerun steps 3-7 on a new sample of 10-25 documents. This helps you avoid overfitting the code criteria on your test documents. Overfitting occurs when you write or update your criteria in ways that are too specific to certain documents, thereby harming generalizability to other documents.
- Depending on the performance you are seeing and the level of rigor you want to apply, you may do this loop anywhere from 1 to 10 or more times. We recommend testing against at least one additional sample.
- As a rule of thumb, you can feel comfortable ending criteria iteration if, on new samples, you are seeing only a small number of inaccurate suggestions.
- If you want to be more rigorous or defensible in measuring performance, use the formal verification workflow described below to get metrics on sample performance and stop when 1-2 samples reach acceptable precision, recall, and F1 metrics. To learn more about these performance metrics, see this article that defines key terms.
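If you want a quick, informal way to tally agreement during this testing loop outside the platform, a sketch like the one below can help. The sample data, field names, and the choice to treat Yes and Soft Yes as positive suggestions are assumptions for illustration; adapt them to your own review decisions.

```python
# Hypothetical agreement tally for one code across a small test sample.
sample = [
    {"doc": "DOC-101", "human_applied": True,  "suggestion": "Yes"},
    {"doc": "DOC-102", "human_applied": False, "suggestion": "Soft No"},
    {"doc": "DOC-103", "human_applied": True,  "suggestion": "Soft No"},  # disagreement worth a note
    {"doc": "DOC-104", "human_applied": False, "suggestion": "No"},
]

def is_positive(suggestion: str) -> bool:
    # Assumption for this informal check: treat Yes and Soft Yes as a positive suggestion.
    return suggestion in ("Yes", "Soft Yes")

disagreements = [d for d in sample if is_positive(d["suggestion"]) != d["human_applied"]]
agreement_rate = 1 - len(disagreements) / len(sample)

print(f"Agreement rate: {agreement_rate:.0%}")  # 75%
for d in disagreements:
    print(f"Note needed for {d['doc']}: suggested {d['suggestion']}, human applied: {d['human_applied']}")
```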
Step 3: Generate suggestions at scale
Once you’re happy with the code criteria, you’re ready to scale the generation of suggestions. You can batch generate Coding Suggestions for up to 20k documents at a time.
Here are some tips to efficiently generate suggestions:
- Use the Suggested Code search term to identify documents that do not have existing suggestions. Pair it with other search terms if you need to exclude certain documents from Coding Suggestions.
- To generate suggestions for more than 20k documents, divide the dataset into groups of up to 20k documents using searches and add those sets to binders (which you can add to your Coding Suggestions folder for tracking). The Bates/Control search term is a good option for quickly identifying non-overlapping sets of documents. Then, open each binder and kick off batch Coding Suggestions.
- Remember to uncheck the irrelevant categories/codes on the batch generation dialog.
To batch generate the suggestions:
- Pull together the set of documents you want to generate suggestions for into a results table.
- Select Batch > Coding Suggestions.
- Select the code(s) you want to generate suggestions for.
- Select Generate and confirm the action and number of credits.
Once the batch(es) complete, you’ll have suggestions ready for use in your case to filter, search, and prioritize by.
If you are not producing these documents, or if you are planning to use Coding Suggestions solely to support manual review, you may not need or want to do further validation. If that is the case, you can move on to the next step in your review process. However, formally validating performance on samples allows you to calibrate your confidence in the suggestions and quantify concepts like precision, recall, and accuracy to guide your usage of suggestions.
Step 4: Validate performance
Particularly for production, you need to establish defensibility of your review process by showing that negotiated or acceptable performance thresholds are reached in statistically significant samples of your corpus. Validation has many dimensions, depending on what you need to show. A key one is demonstrating that your method is reliably disclosing relevant documents and withholding irrelevant ones. For example:
- Depending on how your metrics turn out, you may need to use both Yes and Soft Yes to define the production set to reach the desired recall numbers
- You may sample the No and Soft No suggestions to validate that there are very few relevant documents, but in the process discover that there’s too high a proportion of relevant documents in the Soft No sample, necessitating more manual review of documents with that suggestion
The workflow described below is an example of how a validation process can be run using tools available in the platform and the attached metrics template:
- Create a statistically meaningful sample of documents with suggestions, excluding documents that were used in the initial test-iteration steps (“validation sample”). As a rule of thumb, this can be around a 10% sample.
- Perform a manual review of these documents, applying the relevant codes. Don't use or reference suggestions during this review (i.e., the human reviewer's determination should not be influenced by the suggestions).
Tip
You can create an Assignment Group to batch these out to trusted reviewers.
- Download this template for calculating performance metrics. Next, you'll use the filters available in the results table to count the number of documents that fall into each pair of suggestion category and coding decision:
- Follow the instructions in this step when just one code is used for responsiveness. Scroll down to step 5 below if you have broken down your responsiveness coding into multiple issue or topic codes.
To capture these numbers when just one code is used for responsiveness:
- Open the validation sample in a results table.
- For each suggestion category:
- Filter by the suggestion in the Coding Suggestions column. For example, filter for Yes.
- Filter by the Coded column for the responsive or not responsive code. Or you can use the (No value) option to identify non-responsive documents if the reviewer only applied the responsive code during review.
- Obtain the final count of documents returned by both filters on the results table, and input it in the appropriate cell in the sheet.
- Remember, you have to do these filtering and counting steps per suggestion category, per code, to get all the constituent counts. You should have counts for each of the following pairs of filters:
- Suggested Yes - Responsive
- Suggested Yes - Not Responsive (or No value)
- Suggested Soft Yes - Responsive
- Suggested Soft Yes - Not Responsive (or No value)
- Suggested Soft No - Responsive
- Suggested Soft No - Not Responsive (or No value)
- Suggested No - Responsive
- Suggested No - Not Responsive (or No value)
- If you’ve broken down your responsiveness coding into multiple issue or topic codes, there are some additional considerations to keep in mind when generating performance metrics. Importantly, unlike the single code case where counts and performance are at the document-level, this workflow will result in counts and performance at the code-level. This gives you actionable information about the performance of suggestions. However, you may also need metrics at the document level, particularly if you’re trying to validate results for production purposes. If so, we have some guidance for how to generate these metrics outside the platform later on in the article.
For the code-level workflow:
- Access a results table for the validation set.
- For each suggestion category:
  - Filter by the suggestion in the Coding Suggestions column and one of the codes.
  - Filter by Coded to identify responsive documents based on the code used in the prior suggestion filter (i.e., pick the same single code in this filter). To identify non-responsive documents based on the selected code, choose the exclude filter option.
  - The final count of documents is at the top of the results table. Input the number into the correct cell on the suggestion_categories tab in the template. You can maintain separate metrics per code by duplicating the sheet and keeping track of code-specific counts in separate tabs. Or, to have aggregate results, keep a running sum of counts in each cell as you get the corresponding counts for each code.
  - Remember, you have to do these filtering and counting steps per suggestion category, per code, to get all the constituent counts.
- Once all numbers are entered (for either the single-code workflow in step 4 or the multi-code workflow in step 5), the sheet calculates standard performance metrics (precision, recall, and F1) at different cutoffs for relevance and populates visualizations to help you understand your Coding Suggestions results. A scripted sketch of the same tallying and metric calculations appears after this workflow. To learn more about the performance metrics, see this article that explains these terms.
This workflow assumes a common case for how teams set up codes for suggestions and validation review. Your set-up may differ, whether in the codes that are used or in additional steps. For example, you may have a second pass review to adjudicate disputes between human coding and AI suggestions instead of simply assuming the human reviewers are correct. Whatever the case, these workflow steps will still apply, with the only major change being your filtering criteria and counting logic.
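For reference, the tallying and metric calculations the template performs from your counts look roughly like the sketch below. The counts are made-up numbers, and the template's actual cell layout may differ; for the code-level workflow, the same calculation applies per code.

```python
# Hypothetical counts from a validation sample: (suggestion category, human decision) -> documents.
counts = {
    ("Yes", "Responsive"): 120,     ("Yes", "Not Responsive"): 10,
    ("Soft Yes", "Responsive"): 30, ("Soft Yes", "Not Responsive"): 25,
    ("Soft No", "Responsive"): 8,   ("Soft No", "Not Responsive"): 60,
    ("No", "Responsive"): 2,        ("No", "Not Responsive"): 145,
}

def metrics_at_cutoff(positive_categories):
    """Precision, recall, and F1 when these suggestion categories count as 'suggested responsive'."""
    tp = sum(n for (sugg, truth), n in counts.items() if sugg in positive_categories and truth == "Responsive")
    fp = sum(n for (sugg, truth), n in counts.items() if sugg in positive_categories and truth == "Not Responsive")
    fn = sum(n for (sugg, truth), n in counts.items() if sugg not in positive_categories and truth == "Responsive")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cutoff in [("Yes",), ("Yes", "Soft Yes")]:
    p, r, f1 = metrics_at_cutoff(cutoff)
    print(f"Cutoff {cutoff}: precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```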
Export data for analysis
To conduct additional off-platform analysis, you can export a CSV of document-level code and suggestion data.
To create the export:
- Access a results table of the documents for which you want to export data
- Select Export > CSV
- Under Select fields, select Coding Suggestions. Under Select Categories & Codes, select the codes you are validating.
- Select Export to CSV.
- When the export is complete, you can download it from the Batches & Exports column of your homepage.
The workflow below assumes a comfort level with intermediate spreadsheet functionality. It uses the export to construct document-level performance metrics when multiple codes can make a document count as responsive and suggestions have been generated for those codes. To handle such multi-code cases, you’ll need to map the suggestions and codes applied to each document to canonical suggestion and responsiveness values used to generate the metrics. To do so:
- Define the criteria for when a document has a Yes or No suggestion (e.g., assign Yes if there is at least one Yes or Soft Yes suggestion on a code, No otherwise).
- Define the criteria for when a document should be considered Responsive or Not Responsive (e.g., assign Responsive if at least one topic or issue code is applied to the document, Not Responsive otherwise).
- Create new columns for Suggestion and Responsive.
- Create a script or formula based on the mapping criteria that assigns a suggestion and responsive value to each document in the export (a minimal script sketch follows this list).
- Filter based on these new values and input the counts into the correct cells on the binary_results tab in the template to calculate metrics and generate visualizations. You’ll need counts for the following pairs:
- Suggested Yes, Responsive
- Suggested Yes, Not Responsive
- Suggested No, Responsive
- Suggested No, Not Responsive
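A minimal sketch of such a script is shown below, assuming a hypothetical export layout with one row per document, a suggestion column per code, and an applied-code column per code. The file name, column headers, and mapping rules are assumptions; adjust them to match your actual export and criteria.

```python
import csv
from collections import Counter

# Hypothetical column headers; match these to your actual CSV export.
SUGGESTION_COLUMNS = ["Suggestion: Fraud", "Suggestion: Contract Dispute"]
CODE_COLUMNS = ["Coded: Fraud", "Coded: Contract Dispute"]

def doc_suggestion(row: dict) -> str:
    """Yes if any code has a Yes or Soft Yes suggestion, otherwise No (example mapping rule)."""
    return "Yes" if any(row.get(col, "") in ("Yes", "Soft Yes") for col in SUGGESTION_COLUMNS) else "No"

def doc_responsiveness(row: dict) -> str:
    """Responsive if at least one topic or issue code was applied, otherwise Not Responsive."""
    return "Responsive" if any(row.get(col, "").strip() for col in CODE_COLUMNS) else "Not Responsive"

pair_counts = Counter()
with open("validation_export.csv", newline="", encoding="utf-8") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        pair_counts[(doc_suggestion(row), doc_responsiveness(row))] += 1

# These four counts go into the binary_results tab of the template.
for pair in [("Yes", "Responsive"), ("Yes", "Not Responsive"), ("No", "Responsive"), ("No", "Not Responsive")]:
    print(pair, pair_counts[pair])
```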
Coding Suggestions Performance Metrics Template
Download the template spreadsheet below and input your data to validate the performance of your Coding Suggestions.