A Beginner's Guide to Predictive Coding

To view all of Everlaw's predictive coding-related content, please see our predictive coding section.  

Table of Contents 



Predictive coding is a great tool to have at your disposal if you want to facilitate an efficient review, especially given the growing data sizes involved in even routine matters. Though the technology is no longer a novelty, and techniques have matured over the years, many people still hesitate to use it in their cases. Some are understandably intimidated by the jargon and technicality involved. The aim of this article is to demystify predictive coding.

 Over the next several sections, we’ll break down key concepts and develop analogies that will help you improve your understanding of predictive coding. In the final section, we’ll summarize how predictive coding can be integrated into a variety of review workflows.

We believe that all cases can benefit from predictive coding: you can use it to identify privileged documents, surface potentially relevant or responsive documents, and create models tailored to any criteria you define. By the end of this article, we hope you’ll have gained the knowledge and confidence required to realize these benefits in your own cases.

Though you can skip around to different sections, we encourage you to start from the beginning, as sections build upon one another.


Return to table of contents

What is Predictive Coding?

Predictive coding systems learn from existing review decisions to predict how your team will evaluate the remaining, unreviewed documents. Let’s explore how this works using a simple analogy.


Imagine you are in a park filled with dogs and cats. You want to find all the dogs in the park, and you’ve enlisted a friendly robot to help you. Unfortunately, the robot doesn’t know how to distinguish between dogs and cats, so you’ll need to teach it through examples. The first animal you come across is a dog, so you label it as such. The robot examines the animal and its features, and determines that anything with four legs, fur, and a tail is a dog. The next animal you come across happens to be a cat, and you label it as such. The robot realizes that its existing model for what counts as a dog is inadequate because this cat also happens to have four legs, fur, and a tail. The robot tries to figure out differences between the cat and the dog it encountered in order to improve the model that it’s using to identify dogs. After some more examples, the robot develops a sophisticated model that has a high success rate in correctly predicting whether an animal is a dog without explicit labeling.


You can think of predictive coding systems as the robot, documents as the animals in the park, and the “dog” and “cat” labels as ratings and codes.


Return to table of contents

How does Everlaw’s Predictive Coding work?

Though predictive coding systems try to accomplish the same thing, their implementations differ. Everlaw’s predictive coding system is built on a regression-based learning algorithm, a technical term describing how the system learns from review work product in the database. Essentially, you define a model using the available ratings, codes, and document attributes in your case. In particular, you specify criteria identifying which documents the system should learn from for a given model, along with the criteria for the subset of those documents that you want to find more of (i.e., those that are relevant). Just like the robot described in the previous section, the prediction system examines the documents that satisfy your criteria, dissects their features, and develops a model that predicts the relevance of any particular document based on its features.

Unlike the robot in the previous section, though, Everlaw’s prediction system doesn’t apply a binary classification of relevant/irrelevant (or, in the case of the robot, dog/non-dog). Instead, it gives documents a score from 0-100. You can think of it as an upgraded version of the dog-classification robot: instead of predicting whether an animal is a dog or not, it assigns each animal a score between 0 and 100, where a score of ‘100’ means the animal is highly likely to be a dog and a score of ‘0’ means it is highly unlikely to be a dog.
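To make the 0-100 scale concrete, here is a minimal, purely illustrative sketch of a regression-style scorer (this is not Everlaw's actual algorithm, and the feature names and weights are invented for the example): a logistic model maps a document's features to a probability, which is then expressed as a score from 0 to 100.

```python
import math

# Hypothetical feature weights, of the kind a system might learn from
# labeled examples. Positive weights push a document toward "relevant".
WEIGHTS = {"contract": 2.0, "invoice": 1.5, "newsletter": -2.5}
BIAS = -0.5

def prediction_score(features):
    """Map a document's features to a 0-100 relevance score
    using a logistic (regression-style) model."""
    z = BIAS + sum(WEIGHTS.get(f, 0.0) for f in features)
    probability = 1 / (1 + math.exp(-z))  # squash to the 0..1 range
    return round(probability * 100)       # express as a 0-100 score

print(prediction_score(["contract", "invoice"]))  # high score
print(prediction_score(["newsletter"]))           # low score
```

A real system learns its weights from thousands of features rather than three hand-picked ones, but the shape of the output is the same: a graded score rather than a yes/no answer.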

All cases on Everlaw come with a pre-created prediction model based on the ‘hot’, ‘warm’, ‘cold’ rating system. However, you can create as many models as you like using different criteria. This allows you to target documents that are relevant to different aspects or issues in your case.

Everlaw’s prediction system will continuously learn and update to reflect ongoing review activity. Updates occur approximately once every 24-48 hours.

For more on how to use Everlaw’s predictive coding system, read our help article on the topic.


Return to table of contents

A Deeper Dive into Everlaw’s Predictive Coding

As mentioned previously, you can have multiple models running concurrently in a single case. A detailed discussion on how to create models can be found here. In this section, we’ll walk through the key concepts at a high level.

For each model, you need to specify criteria to identify documents you want to use in training the model (‘reviewed’), and which of the reviewed documents you want to find more of (‘relevant’). For example, in the default rating model, any document that has a rating is considered ‘reviewed’. Out of those documents, any document that is rated hot is considered ‘relevant’. Warm documents are considered of intermediate relevance, and cold documents are considered irrelevant. The rating model is the only model that uses this three-tiered system; all other models use the relevant/irrelevant distinction, with irrelevant documents being those that are “reviewed”, but not “relevant”. In any case, the prediction system will analyze the features of the relevant and irrelevant documents to generate the model.


Let’s break this down by expanding on the dog-classification robot analogy described earlier. Instead of a park filled only with cats and dogs, we now have a park full of a variety of different animals. We want our friendly robot helper to assist us in finding dogs. To classify the animals, we have a set of labels for all the different types in the park.

  • All the animals in the park comprise the universe of animals
  • The ‘reviewed’ set comprises the animals that are labeled
  • The ‘relevant’ set comprises the reviewed animals that are labeled as dogs
  • By default, the reviewed animals that we didn’t label as dogs will be considered ‘irrelevant’

Here’s a graphic that lays out the general pipeline of the predictive coding system:


Let’s take each component in turn.

(1) The Universe of Documents: All the documents in the case

(2) Reviewed: The set of documents that are used for training a particular model. The criteria for ‘reviewed’ are defined by the user.

(3) Relevant: The subset of reviewed documents that the model should consider relevant. These criteria are also defined by the user.

(4) Irrelevant: The subset of reviewed documents that the model should consider irrelevant. These documents are automatically determined: they are reviewed documents that do not fit the relevant criteria.   

(5) Holdout Set: 5% of the total documents in the case, reserved for evaluating the performance of a model. The holdout set is maintained by taking 5% of the documents in each upload. These documents are not used in training, even though they might satisfy the ‘reviewed’ criteria of one or more models. In order to use the holdout set to generate performance metrics, at least 200 documents from the set must be reviewed, with at least 50 satisfying the relevant criteria. If these thresholds are not reached, a non-holdout set of training documents will be used (see below).

(6) Non-holdout training documents: If the holdout set is not active, the system will set aside a randomly sampled 5% of documents from the ‘reviewed’ set in order to generate performance metrics. Unlike the holdout set, the documents included in this set change every time the model is updated. Depending on how you are conducting review, these documents are unlikely to be representative of all the documents in the case. This results in two main shortcomings relative to using a holdout set: (1) you cannot assess historical performance, and (2) the performance evaluations are likely to be based on biased data.
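The pipeline above can be sketched in code. This is an illustrative simplification (the document IDs, the review criteria, and the exact sampling mechanics are stand-ins, not Everlaw's implementation), but it shows how the six components relate to one another:

```python
import random

random.seed(0)  # reproducible illustration

# (1) The universe: all documents in the case (hypothetical IDs).
universe = [f"DOC-{n}" for n in range(1, 1001)]

# (5) Holdout set: ~5% of the case, reserved for performance metrics.
holdout = set(random.sample(universe, k=len(universe) // 20))

def doc_num(doc):
    return int(doc.split("-")[1])

# User-defined criteria (stand-ins for "has a rating" / "rated hot").
def is_reviewed(doc):
    return doc_num(doc) % 3 != 0

def is_relevant(doc):
    return doc_num(doc) % 5 == 0

reviewed = [d for d in universe if is_reviewed(d)]        # (2)
relevant = [d for d in reviewed if is_relevant(d)]        # (3)
irrelevant = [d for d in reviewed if not is_relevant(d)]  # (4) automatic

# Training uses reviewed documents *outside* the holdout set.
training = [d for d in reviewed if d not in holdout]

# The holdout set generates metrics only once enough of it is reviewed;
# otherwise the system falls back to non-holdout training documents (6).
holdout_reviewed = [d for d in holdout if is_reviewed(d)]
holdout_relevant = [d for d in holdout_reviewed if is_relevant(d)]
holdout_active = len(holdout_reviewed) >= 200 and len(holdout_relevant) >= 50
```

Note that ‘irrelevant’ is never specified directly: it is always the reviewed remainder once the relevant criteria are applied.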


Return to table of contents

Training the Model

In general, the more labeled input you provide the system, the better the predictions. After all, it makes sense that if you label all the dogs in the park, the robot will be able to pick out the dogs with 100% accuracy. But the reason we enlisted a robot assistant in the first place is to avoid having to label everything ourselves. Of course, we also don’t want to provide the robot with too little training, or it’ll merrily go around misidentifying all sorts of animals as likely to be dogs. So what’s the happy medium?

There are no hard and fast rules for how much training to provide - it really depends on the idiosyncrasies of your case and documents, and how you want to use predictions. People usually want models to reach particular performance thresholds before using predictions in any substantive way. As mentioned, one of the best ways to increase performance is to review more documents. Fortunately, prediction models in Everlaw update to reflect new review activity, so you’ll be training the model, and providing more and more labeled input, over the entire course of your review without needing to do anything special.

However, many teams prioritize reviewing potentially relevant documents, or particular types of documents, at the beginning of a case. For example, they might concentrate on looking at documents that contain particular keywords, are from particular custodians, or are of a certain type. This can lead to biased training. To see why, let’s return to our animal park for a minute. Perhaps you really love huskies. Because you love them so much, they are the first dogs that you seek out and label in the park. You label 100 dogs, all of them huskies, along with some non-dog animals, and decide to end your training, opting to rely on the robot’s judgment from there on out.


The characteristics of huskies now over-determine your model. The robot is likely to evaluate dogs that don’t share many characteristics with huskies (like pugs) to not be dogs, and non-dog animals that share many characteristics with huskies (like lions) to be dogs. That’s why variety is also an important factor, along with quantity.

Of course, it’s often the case that reviewing more naturally leads to more variety. However, it’s not guaranteed, especially if you’re doing a priority review and/or want to start using prediction results early in the process. That’s why it might be a good idea to generate a set of randomly-selected documents to review for training purposes. The approximation in our animal park analogy would be to grid the park, then visit each grid and label a randomly chosen subset of the animals in the grid. This ensures that you are labeling a representative sample of the animals. Everlaw allows you to easily seed a training set with randomly sampled documents from the entire case or a specific search.
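In code terms, seeding a training set this way is just uniform random sampling from a candidate pool. A minimal sketch (the pool and sample size are made up for illustration):

```python
import random

random.seed(42)  # seeded only so the example is reproducible

# Hypothetical pool: all documents in the case, or the results of a search.
candidate_pool = [f"DOC-{n}" for n in range(1, 10_001)]

# Draw a uniform random sample to review for training purposes.
# Every document has an equal chance of selection, which helps the
# training set reflect the variety present in the whole pool.
training_sample = random.sample(candidate_pool, k=500)

print(len(training_sample))       # 500 documents selected
print(len(set(training_sample)))  # no duplicates (sampling without replacement)
```

This is the code-level analogue of gridding the park and labeling a random subset of animals in each grid.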


The last thing to be aware of when training the model is the importance of supplying negative examples (i.e., animals that are not dogs) in addition to positive examples (i.e., animals that are dogs). You can imagine how only training the robot on examples of dogs can lead it to erroneously deem a giraffe to be a dog. After all, giraffes have four legs, a tail, a snout, and are mammals. If, in training your robot helper, you label giraffes in addition to dogs, the robot will come to learn that if an animal is unusually tall and has a long neck, it is highly unlikely to be a dog.

The equivalent, in terms of review, is to rate and code documents such that they satisfy the ‘reviewed’ criteria, but not the ‘relevant’ criteria.


Let’s summarize the three training principles we’ve discussed in this section:

  • Reviewing more documents (expanding the ‘reviewed’ set for a given model) will lead to better predictions.
  • Reviewing a greater variety of documents (often achieved by randomly sampling from the database) will lead to better predictions.
  • Providing negative examples (rating and coding documents such that they satisfy the ‘reviewed’ criteria, but not the ‘relevant’ criteria) will lead to better predictions.

Everlaw’s prediction system allows you to take either a passive or active approach to training. Depending on your workflows, you might be comfortable training the models informally over the course of regular review. Or, you can elect to have a more rigorous, formal process. 


Return to table of contents

Exploring the Prediction Results

Models need a sufficient amount of data before they can begin generating predictions. In Everlaw, a model begins generating predictions once 200 documents have been reviewed, with at least 50 of those reviewed as relevant.

Once a model is ready, it is used to rank documents on a 0-100 scale, with 100 representing documents that are highly likely to be relevant according to the model’s criteria, and 0 representing documents that are likely to be irrelevant.

You can fetch documents based on their prediction scores for either exploration or prioritization purposes.


In addition to a prediction score, Everlaw also calculates a coverage score per document per model in the case. As we explored, models are constructed based on an analysis of the various features that are encountered in the documents. However, based on how you train the model, there are some features in the general population of documents that may not be well-represented by the documents that were used in training. The coverage score captures the extent to which a particular document’s features are represented in a model. Just like prediction scores, coverage scores are also given on a 0-100 scale, and are visualized against predictions in Everlaw. Reviewing documents with low coverage scores will improve the model’s performance, and the quality of the predictions.
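One simple way to picture a coverage score (this is an illustrative stand-in, not Everlaw's actual formula) is as the fraction of a document's features that also appear somewhere in the training documents:

```python
def coverage_score(doc_features, training_features):
    """Illustrative coverage: the percentage of this document's
    features that the model saw during training (0-100)."""
    doc_features = set(doc_features)
    if not doc_features:
        return 0
    seen = doc_features & set(training_features)
    return round(100 * len(seen) / len(doc_features))

# Hypothetical features drawn from the training documents.
training_features = {"contract", "invoice", "payment", "vendor"}

print(coverage_score({"contract", "invoice"}, training_features))
print(coverage_score({"blueprint", "schematic", "vendor"}, training_features))
```

The second document scores low because most of its features ("blueprint", "schematic") never appeared in training, which is exactly the kind of document worth reviewing to broaden the model.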


A model’s performance can be evaluated through performance metrics. We’ll discuss how to interpret performance in the next section. For now, let’s summarize this section using our animal park analogy:

  • We enlist a robot to help us find dogs within our animal park.
  • We train our robot with a randomly chosen subset of the animals in the park.
  • The robot goes around the park and scores each animal from 0-100, with 100 representing animals that are highly likely to be dogs.
  • Once scores are assigned, you can ask the robot to bring you animals based on their scores. For example, you can see all the animals that are scored higher than 70, or all the animals scored between 30-50, or all the animals scored under 20. If you continue to train the robot by labeling more animals, the predictions will change to reflect the updated training.
  • In addition to the prediction score, the robot will also assign each animal in the park with a coverage score based on comparing the features of the animals that were used in training to the features of the particular animal. One way you can choose which animals to label for further training is to select those that have low coverage scores.


Return to table of contents

Understanding Performance

Imagine that after training your robot, you ask it to bring you three animals that it thinks are dogs. It brings you two dogs and a fox. You want to figure out more systematically how often your robot makes errors given its current training status. This is where performance metrics come into play.

One way to gauge performance is to label all the animals, then check the robot’s predictions against your labels. However, this is incredibly inefficient. After all, the reason you enlisted the robot in the first place is to minimize the number of animals you need to look at and label.

A better approach is to label a representative subset of animals, check those against the robot’s predictions, and extrapolate from the robot’s performance on the subset to the entire park.

You might be tempted to just check the robot’s predictions against the animals that you’ve labeled over the course of training, but this introduces the possibility of bias: If you trained your robot on a particular set of animals, it’s no surprise that it would be adept at gauging whether an animal in that set is likely to be a dog.

This is where holdout sets come into play. The basic idea is that you set aside part of the labeled input purely for evaluating performance, and not for training. Let’s see how this works in our animal park: First, you randomly identify a subset of animals in the park and give them a special designation. This designation tells the robot that, even if the animal has been labeled, it shouldn’t be incorporated into training. During training, you’ll label animals with and without the designation. The labeled, non-designated animals will be used to train the robot. The robot’s predictions will then be compared to the labeled, designated animals to generate performance metrics.


Similarly, in a case, a percentage of the documents are carved out and reserved for evaluating the performance of prediction models. Though these documents are not used in training, they are used to generate the metrics that will help you evaluate the models. There are three performance metrics that are commonly used to evaluate predictive coding models: precision, recall, and F1.

  • Precision: If you ask the robot to bring you 10 dogs from the holdout set, and it brings you 6 dogs and 4 non-dog animals, you’ll have a precision of 60%. More formally, precision measures how many items predicted to be relevant are actually relevant.
  • Recall: If there are 10 dogs in the holdout set, and 3 of them are predicted to be dogs by the robot, you’ll have a recall score of 30%. More formally, recall measures how many of the relevant items are correctly predicted to be relevant.
  • F1: Because precision and recall measure different aspects of a model’s performance, an evaluation of a model’s overall performance requires taking both into account. The F1 score combines the two into a single number (specifically, their harmonic mean). There is a fundamental tradeoff between precision and recall. Let’s explain this using an example. Imagine there are 1000 animals in the park, only 10 of which are dogs. You ask the robot to bring you all the dogs, and it retrieves all 1000 animals. This results in a perfect recall score, because all 10 dogs are included in that 1000, but a low precision score, because the robot also retrieved 990 non-dogs. Now, let’s imagine that the robot only retrieves 1 animal, and it happens to be a dog. This results in a perfect precision score, because every animal retrieved is a dog, but a low recall score, because it retrieved only one of the 10 dogs.
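The definitions above translate directly into arithmetic. This sketch reuses the numbers from the bullets (6 of 10 retrieved animals are dogs; 3 of the 10 dogs are found) and, purely for illustration, combines a 60% precision with a 30% recall to show how F1 is computed:

```python
def precision(true_positives, predicted_positives):
    """Of the items predicted relevant, the share that actually are."""
    return true_positives / predicted_positives

def recall(true_positives, actual_positives):
    """Of the actually relevant items, the share predicted relevant."""
    return true_positives / actual_positives

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# The robot brings 10 animals it thinks are dogs; 6 really are.
p = precision(true_positives=6, predicted_positives=10)
# There are 10 dogs in the holdout set; the robot found 3 of them.
r = recall(true_positives=3, actual_positives=10)

print(f"precision={p:.0%}, recall={r:.0%}, F1={f1(p, r):.0%}")
```

Because F1 is a harmonic mean, it is dragged toward the lower of the two values, so a model cannot earn a high F1 by excelling at one metric while neglecting the other.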

You may be thinking back to an earlier section in the article where we discussed how the robot doesn’t determine whether an animal is a dog or not, but instead gives the animal a score from 0-100, with 100 being highly likely to be a dog. If the prediction scores are along a continuum, how do we determine what counts as ‘relevant’ (i.e., a ‘dog’) when calculating performance? The answer is that the boundary is left up to you. For example, you can decide that a score of 70 marks the relevance boundary (all items with prediction scores above 70 are considered ‘relevant’, all below are considered ‘irrelevant’). Everlaw will display the precision, recall, and F1 scores at any boundary you choose, allowing you to select one based on your desired combination of performance levels.
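Choosing a relevance boundary can be pictured as sweeping a cutoff across the 0-100 scores and recomputing the metrics at each candidate cutoff. A toy sketch, with made-up scores and labels for a small holdout set:

```python
# Hypothetical holdout documents: (prediction score, actually relevant?)
holdout = [(95, True), (88, True), (81, False), (72, True),
           (64, False), (55, True), (40, False), (22, False)]

def metrics_at(boundary):
    """Precision and recall if everything scored above `boundary`
    is treated as relevant."""
    predicted = [(s, rel) for s, rel in holdout if s > boundary]
    tp = sum(1 for _, rel in predicted if rel)
    actual = sum(1 for _, rel in holdout if rel)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / actual
    return precision, recall

for boundary in (30, 50, 70, 90):
    p, r = metrics_at(boundary)
    print(f"boundary {boundary}: precision={p:.2f}, recall={r:.2f}")
```

Raising the boundary tends to increase precision while lowering recall, and vice versa, which is the tradeoff you navigate when picking a cutoff.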


Return to table of contents

Tying it Together

We’ve covered quite a bit of ground in the preceding sections. Let’s take a step back and recap all that we’ve learned about predictive coding. If you feel confident in your knowledge of how all the moving parts fit together, feel free to skip this section.


The Setup:

  • Predictive coding: We have a universe of unreviewed documents in a case.
    Park analogy: We have a park filled with animals of unknown type.

  • Predictive coding: To facilitate review, we create codes to tag the documents with.
    Park analogy: To identify the animals by type, we create labels.

  • Predictive coding: We want help finding documents that are relevant to a specific aspect of the case, so we create a prediction model.
    Park analogy: We want help identifying all the dogs in the park, so we enlist a friendly robot assistant.

  • Predictive coding: In order to generate performance stats, the system carves out a randomly sampled 5% of documents in the case. This is known as the holdout set. Reviewed documents in this set are not used for training the model, but are instead reserved to gauge a model’s performance.
    Park analogy: To gauge the robot's performance in identifying dogs, we randomly select 5% of the animals in the park, and designate them in such a way that the robot knows not to learn from any of the designated animals. This is known as the holdout set.


Getting the Model Started

  • Predictive coding: In order for the model to start generating predictions, you need to provide it with some initial inputs. In Everlaw, you need to review at least 200 documents matching the reviewed criteria, with at least 50 of those matching the relevant criteria.
    Park analogy: To activate our robot, we label 200 animals, at least 50 of which are dogs. The robot will learn from these labeled animals. Any labeled animal is considered reviewed; any animal labeled a dog is considered relevant.

  • Predictive coding: If you want performance metrics, you should also review documents in the holdout set. The prediction model will not learn from reviewed documents in the holdout set.
    Park analogy: To ensure that there are performance metrics, we’ll also label the animals that we designated to be in the holdout set. The robot will not learn from these labeled animals.


Improving the Prediction Model

  • Predictive coding: To improve the quality of the predictions for a given model, you can code more documents to match the reviewed and relevant criteria.
    Park analogy: To improve our robot’s performance, we provide it additional training by labeling more animals.

  • Predictive coding: You can also focus additional training on documents that have low coverage scores. Documents with low coverage scores are those with features that are not well accounted for in the documents the model was trained on.
    Park analogy: We can identify animals with features that are not well represented in the set of labeled animals that the robot learned from. Labeling these animals will expand the features that the robot takes into account when rendering a prediction, thereby improving the accuracy of the predictions.

  • Predictive coding: To improve the accuracy of the performance metrics, you can review more documents from the holdout set.
    Park analogy: To improve the metrics that are used to gauge the robot’s performance, we can label more of the animals that we designated to be in the holdout set.


Using Predictions

  • Predictive coding: Once a model is running, you can use it to retrieve documents based on their current prediction scores. Documents are scored from 0-100, with 100 being very likely to be relevant given a model’s criteria for relevance. For example, you can select all of the documents that are scored 70-100. You can also pair this with other search terms: you can retrieve unreviewed documents that are very likely to be relevant, or documents that you reviewed as relevant but the model predicts as likely to be irrelevant.
    Park analogy: Once the robot is running, it’ll score all the animals in the park from 0-100, with 100 being animals that are very likely to be dogs. You can ask the robot to do things like the following: bring me animals that are scored above 80; bring me animals that I labeled as foxes but that are very likely to be dogs; bring me animals that are scored between 0-30.

  • Predictive coding: Keep in mind that as long as you are coding additional documents, the model will update itself to take into account the newly available training input, and the predictions will update accordingly. The exception is documents in the holdout set: reviewed documents in the holdout set are not used for training, but will help improve the accuracy of the performance metrics.
    Park analogy: As long as you are labeling animals, the robot will continue to learn from your decisions, except for animals that are part of the holdout set. However, labeling animals in the holdout set will improve the accuracy of the performance metrics.


Return to table of contents

Key Best Practices

It is difficult to give specific targets or guidelines for precision, recall, or F1 scores, since every case is unique. Performance targets will also depend on how predictive coding is used within a larger review workflow. For example, one team might value high precision scores and be willing to sacrifice recall to obtain them. Another review team might prioritize recall over precision. Nevertheless, here are some general tips on how to improve different aspects of a model’s performance in Everlaw:

  • Work to reduce the bias of the training corpus. Make a conscious effort to review a broadly sampled set of documents from across the entire case.
  • Use the coverage scores to target specific documents that are not well represented in the training corpus for additional training.
  • You can restrict training to the review decisions of a select number of trusted reviewers or subject matter experts. This will help keep training input consistent. 


Return to table of contents

Example use cases

Here are four common use cases for predictive coding:

  • Prioritizing review: Use predictions to select which documents to manually review next. Usually these are unreviewed documents that are currently predicted to be relevant.
  • Finding relevant documents: Use the predictions to identify a set of relevant documents. You might choose to only review documents above a certain prediction score after a certain performance threshold is reached.  
  • Identifying irrelevant documents: Use the predictions to exclude documents from manual review. You might choose to ignore all documents under a certain prediction score from the database after a certain performance threshold is reached.
  • QA: You can use predictive coding to find documents that are predicted relevant, but not coded as relevant, or documents predicted as irrelevant but coded as relevant.


You can take either an active or passive approach for each use case.

  • Active training: Take an active approach to training at the outset of a case (creating and reviewing randomly sampled training sets, reviewing documents with low coverage scores, etc.).

  • Passive training: Take a passive approach to training (initial review is not conducted for the express purpose of training the model, and subsequent review is not informed by the need to further train the model).

Return to table of contents
