To view all of Everlaw's predictive coding-related content, please see our predictive coding section.
Table of Contents
- What is predictive coding?
- How does Everlaw’s predictive coding work?
- A deeper dive into Everlaw’s predictive coding
- Training the model
- Exploring the prediction results
- Understanding performance metrics
- Tying it together
- Key best practices
- Example use cases
Predictive coding is a great tool to have at your disposal if you want to facilitate an efficient review, especially given the growing data sizes involved in even routine matters. Though the technology is no longer a novelty, and techniques have matured over the years, many people still hesitate to use it in their projects. Some are understandably intimidated by the jargon and technicality involved. The aim of this article is to demystify predictive coding.
Over the next several sections, we’ll break down key concepts and develop analogies that will help you improve your understanding of predictive coding. In the final section, we’ll summarize how predictive coding can be integrated into a variety of review workflows.
We believe that all projects can benefit from using predictive coding: you can build models to identify privileged documents, potentially relevant or responsive documents, or documents matching any other criteria you define. By the end of this article, we hope that you’ll have gained the knowledge and confidence required to realize these benefits in your own projects.
Though you can skip around to different sections, we encourage you to start from the beginning, as sections build upon one another.
What is Predictive Coding?
Predictive coding systems learn from existing review decisions to predict how your team will evaluate the remaining, unreviewed documents. Let’s explore how this works using a simple analogy.
Imagine you are in a park filled with dogs and cats. You want to find all the dogs in the park, and you’ve enlisted a friendly robot to help you. Unfortunately, the robot doesn’t know how to distinguish between dogs and cats, so you’ll need to teach it through examples. The first animal you come across is a dog, so you label it as such. The robot examines the animal and its features, and determines that anything with four legs, fur, and a tail is a dog. The next animal you come across happens to be a cat, and you label it as such. The robot realizes that its existing model for what counts as a dog is inadequate because this cat also happens to have four legs, fur, and a tail. The robot tries to figure out differences between the cat and the dog it encountered in order to improve the model that it’s using to identify dogs. After some more examples, the robot develops a sophisticated model that has a high success rate in correctly predicting whether an animal is a dog without explicit labeling.
You can think of predictive coding systems as the robot, documents as the animals in the park, and the “dog” and “cat” labels as ratings and codes.
How does Everlaw’s Predictive Coding work?
Though predictive coding systems all try to accomplish the same thing, their implementations differ. Everlaw’s predictive coding system is built on a regression-based learning algorithm, a technical term describing how the system learns from review work in the project. Essentially, you define a model using the available ratings, codes, and document attributes in your project. In particular, you specify criteria identifying which documents the system should learn from for a given model, along with criteria for the subset of those documents that you want to find more of (i.e., those that are relevant). Just like the robot described in the previous section, the prediction system examines the documents that satisfy your criteria, dissects their features, and develops a model that predicts the relevance of any particular document based on its features.
Unlike the robot in the previous section, though, Everlaw’s prediction system doesn’t apply a binary classification of relevant/irrelevant (or, in the robot’s case, dog/not-dog). Instead, it gives each document a score from 0-100. You can think of it as an upgraded version of the dog-classification robot: instead of predicting whether an animal is a dog or not, it assigns each animal a score between 0 and 100, where ‘100’ means the animal is highly likely to be a dog and ‘0’ means it is highly unlikely to be a dog.
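Everlaw doesn’t publish the internals of its algorithm, but the basic idea of a regression-based model producing 0-100 scores can be sketched in a few lines of Python. The features and weights below are invented for the dog-classification analogy; a real model learns its weights from your labeled examples.

```python
import math

# Hypothetical feature weights a trained regression model might arrive at
# (illustrative values only; real weights are learned from labeled examples).
WEIGHTS = {"four_legs": 1.5, "fur": 2.0, "tail": 1.0, "long_neck": -4.0}
BIAS = -2.5

def prediction_score(features):
    """Map an animal's features to a 0-100 'likely a dog' score."""
    z = BIAS + sum(WEIGHTS[f] for f in features if f in WEIGHTS)
    probability = 1 / (1 + math.exp(-z))  # logistic (sigmoid) function
    return round(probability * 100)

husky = {"four_legs", "fur", "tail"}
giraffe = {"four_legs", "tail", "long_neck"}
print(prediction_score(husky))    # 88: very likely a dog
print(prediction_score(giraffe))  # 2: very unlikely to be a dog
```

Training adjusts the weights so that labeled dogs score high and labeled non-dogs score low; the sigmoid simply maps the weighted feature sum onto a probability-like 0-100 scale.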
All projects on Everlaw come with a pre-created prediction model based on the ‘hot’, ‘warm’, ‘cold’ rating system. However, you can create as many models as you like using different criteria. This allows you to target documents that are relevant to different aspects or issues in your project.
Everlaw’s prediction system will continuously learn and update to reflect ongoing review activity. Updates occur approximately once every 24-48 hours.
For more on how to use Everlaw’s predictive coding system, read our help article on the topic.
A Deeper Dive into Everlaw’s Predictive Coding
As mentioned previously, you can have multiple models running concurrently in a single project. A detailed article on how to create models can be found here. In this section, we’ll walk through the key concepts at a high level.
For each model, you need to specify criteria to identify documents you want to use in training the model (‘reviewed’), and which of the reviewed documents you want to find more of (‘relevant’). For example, in the default rating model, any document that has a rating is considered ‘reviewed’. Out of those documents, any document that is rated hot is considered ‘relevant’. Warm documents are considered of intermediate relevance, and cold documents are considered irrelevant. The rating model is the only model that uses this three-tiered system; all other models use the relevant/irrelevant distinction, with irrelevant documents being those that are “reviewed,” but not “relevant.” In any case, the prediction system will analyze the features of the relevant and irrelevant documents to generate the model.
Let’s break this down by expanding on the dog-classification robot analogy described earlier. Instead of a park filled only with cats and dogs, we now have a park full of a variety of different animals. We want our friendly robot helper to assist us in finding dogs. To classify the animals, we have a set of labels for all the different types in the park.
- All the animals in the park comprise the universe of animals
- The ‘reviewed’ set comprises the animals that are labeled
- The ‘relevant’ set comprises the reviewed animals that are labeled as dogs
- By default, the reviewed animals that we didn’t label as dogs will be considered ‘irrelevant’
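In code, the partition described above might look like the following sketch. The codes and criteria here are hypothetical; in Everlaw you define them with your project’s actual ratings, codes, and attributes.

```python
# Hypothetical documents and model criteria for illustration only.
universe = [
    {"id": 1, "codes": {"Responsive"}},
    {"id": 2, "codes": {"Not Responsive"}},
    {"id": 3, "codes": {"Responsive", "Privileged"}},
    {"id": 4, "codes": set()},  # not yet reviewed for this model
]

def is_reviewed(doc):
    """'Reviewed' criteria: the document carries a responsiveness code."""
    return bool(doc["codes"] & {"Responsive", "Not Responsive"})

def is_relevant(doc):
    """'Relevant' criteria: the document is coded Responsive."""
    return "Responsive" in doc["codes"]

reviewed = [d for d in universe if is_reviewed(d)]
relevant = [d for d in reviewed if is_relevant(d)]
irrelevant = [d for d in reviewed if not is_relevant(d)]  # automatic

print([d["id"] for d in relevant])    # [1, 3]
print([d["id"] for d in irrelevant])  # [2]
```

Note that document 4 belongs to the universe but to neither training set: it doesn’t satisfy the ‘reviewed’ criteria, so the model simply scores it rather than learning from it.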
Here’s a graphic that lays out the general pipeline of the predictive coding system:
Let’s take each component in turn.
(1) The Universe of Documents: All the documents in the project
(2) Reviewed: The set of documents that are used for training a particular model. The criteria for ‘reviewed’ are defined by the user.
(3) Relevant: The subset of reviewed documents that the model should consider relevant. The criteria for this is also defined by the user.
(4) Irrelevant: The subset of reviewed documents that the model should consider irrelevant. Reviewed documents that do not fit the relevant criteria are automatically determined to be irrelevant.
(5) Holdout Set: 5% of the total documents in the project reserved to evaluate the performance of a model. The holdout set is maintained by taking 5% of the documents in each upload. These documents are not used in training, even though they might satisfy the ‘reviewed’ criteria of one or more models. In order to use the holdout set to generate performance metrics, at least 200 unique documents from the set must be reviewed, with at least 50 unique documents deemed relevant and 50 unique documents deemed irrelevant. If this threshold is not reached, a non-holdout set of training documents will be used (see below).
(6) Non-holdout training documents: If the holdout set is not active, the system will set aside randomly sampled documents from the ‘reviewed’ set in order to generate performance metrics. Unlike the holdout set, the documents included in this set change every time the model is updated. Depending on how you are conducting review, these documents are unlikely to be representative of all the documents in the project. This results in two main shortcomings relative to using a holdout set: (1) you cannot assess historical performance based on a consistent set of documents, and (2) the performance evaluations are likely to be based on biased data.
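The carve-out and activation rules above can be sketched as follows. This is a simplified illustration, not Everlaw’s actual implementation.

```python
import random

def assign_holdout(doc_ids, fraction=0.05, seed=0):
    """Reserve ~5% of each upload for the holdout set (sketch)."""
    rng = random.Random(seed)
    k = max(1, round(len(doc_ids) * fraction))
    return set(rng.sample(doc_ids, k))

def holdout_metrics_active(n_reviewed, n_relevant, n_irrelevant):
    """The thresholds described above: at least 200 reviewed holdout
    documents, with at least 50 relevant and 50 irrelevant."""
    return n_reviewed >= 200 and n_relevant >= 50 and n_irrelevant >= 50

upload = list(range(1000))
holdout = assign_holdout(upload)
print(len(holdout))                          # 50 documents reserved
print(holdout_metrics_active(250, 60, 190))  # True
print(holdout_metrics_active(250, 40, 210))  # False: too few relevant
```

Because membership is fixed per upload, the holdout set stays stable over time, which is what allows consistent historical performance tracking.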
Training the Model
How much training to provide
In general, the more labeled input you provide the predictive coding system, the better the predictions. After all, it makes sense that if you label all the dogs in the park, the robot will be able to pick out the dogs with accuracy at or near 100%. But the reason we enlisted a robot assistant in the first place was to avoid having to label everything ourselves. Of course, we also don’t want to provide the robot with too little training, or it’ll merrily go around misidentifying all sorts of animals as likely to be dogs. So what’s the happy medium?
There are no hard and fast rules for how much training to provide - it really depends on the idiosyncrasies of your project and documents, and, crucially, how you want to use predictions. Let’s imagine two review scenarios you might find yourself in:
- You only intend to use predictions in supplementary ways. For example, you may be interested in seeing if the predictive coding system finds interesting documents you missed over the course of regular review. Or, you might be using predictions as an extra signal when performing quality assurance on review and productions.
- You have negotiated a review protocol with opposing counsel that heavily leans on predictive coding to get as many relevant documents as possible out the door in a reasonable amount of time.
If your situation is more akin to the second scenario, one of the primary purposes of review is training the model such that particular performance thresholds are met. You will be orienting your entire review process toward the goal of training and you may need to review a good amount of training documents to hit the minimum performance threshold called for in your negotiated protocol. If, on the other hand, your situation is more like the first scenario, you may not be that concerned with hitting particular performance thresholds. Therefore, even a small amount of training can get the model to a point where it is yielding useful results for your particular use cases.
One of the benefits of predictive coding systems that continuously learn, like Everlaw’s, is that as long as you are still reviewing, the model is still learning. Thus, regardless of the scenario you find yourself in, your models will keep incrementally improving as long as you are still tagging documents in your projects.
What kind of training to provide
In the previous section, we discussed the general rule that supplying a greater number of training examples usually leads to higher quality predictions. Besides volume, are there other aspects of training that can influence the quality of predictions? For example, can a particular selection or prioritization of training examples lead to better predictions for the same level of effort? Put another way, if we supply our model with 5,000 training examples, does the type of examples we supply affect the performance level that is achieved, or is performance primarily determined by volume?
There’s active and ongoing debate about these questions, and a number of techniques have been offered to optimize training in various ways. For the purpose of this beginner’s guide, we’ll cover two concepts that we think are key to understanding and evaluating the debates around training: random sampling to reduce bias and targeted training.
Random sampling to reduce bias
One common school of thought posits that the more representative the training set is of the universe of items it is drawn from, the higher the quality of the resulting predictions. In the context of review, many teams prioritize reviewing potentially relevant documents, or particular types of documents, at the beginning of a case. For example, they might concentrate on looking at documents that contain particular keywords, are from particular custodians, or are of a certain type. This often leads to unrepresentative training, and, by the logic laid out above, lower quality predictions. Thus, it’s often recommended to train models using randomly sampled training sets. Or, at the very least, supplement any potentially biased training with randomly sampled training sets.
To develop a better intuition of why biased training can be a problem, let’s return to our animal park for a minute. Perhaps you really love huskies. Because you love them so much, they are the first dogs that you seek out and label in the park. You label 100 dogs, all of them huskies, along with some non-dog animals, and decide to end your training, opting to rely on the robot’s judgment from there on out.
The characteristics of huskies may now over-determine your model. The robot is likely to judge dogs that don’t share many characteristics with huskies (like pugs) not to be dogs, and non-dog animals that share many characteristics with huskies (like lions) to be dogs. That’s why, theoretically, variety is an important factor alongside quantity.
It’s often the case that reviewing more naturally leads to more variety. However, it’s not guaranteed, especially if you’re predisposed to looking at certain types of documents earlier than others. That’s why proponents of this school of thought argue that it is important to either consciously train on randomly sampled documents, or supplement normal review with randomly-sampled documents. The approximation in our animal park analogy would be to grid the park, then visit each grid and label a randomly chosen subset of the animals in the grid. This ensures that you are labeling a representative sample of the animals, which in theory should lead to better quality, more dependable, predictions.
In Everlaw, we make it easy to create training sets that are either entirely randomly sampled or supplemented with randomly sampled documents. Learn more about training sets here.
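The ‘grid the park’ idea corresponds to what statisticians call stratified random sampling. Here is a small sketch, using hypothetical custodians as the strata:

```python
import random

def stratified_sample(docs_by_stratum, per_stratum, seed=0):
    """Sample up to a fixed number of items from each stratum,
    approximating the 'grid the park' approach described above."""
    rng = random.Random(seed)
    sample = []
    for stratum, docs in sorted(docs_by_stratum.items()):
        k = min(per_stratum, len(docs))
        sample.extend(rng.sample(docs, k))
    return sample

# Hypothetical custodians with very different document counts.
docs = {
    "custodian_a": [f"a{i}" for i in range(100)],
    "custodian_b": [f"b{i}" for i in range(30)],
    "custodian_c": [f"c{i}" for i in range(5)],
}
training_set = stratified_sample(docs, per_stratum=10)
print(len(training_set))  # 10 + 10 + 5 = 25
```

Sampling within each stratum keeps small pockets of the collection (like custodian_c here) from being drowned out by the largest ones, which is the point of the grid analogy.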
Targeted training
In contrast to random sampling, targeted training involves the purposeful creation of specific sets of training examples that can be used to train the model in more directed ways.
Some argue that, holding effort constant, a targeted training approach can be more effective than training based mainly, or entirely, on random samples. The theoretical mechanisms through which this improvement is achieved will vary depending on the approach taken. Some common examples include giving the model “higher quality” training examples faster (especially if you have a universe filled with a lot of irrelevant examples) or purposefully exposing the model to examples that you know it is less familiar with.
Of course, training is rarely an either/or situation. In many cases, training is done through a mixture of random sampling and targeted training, whether by design or for practical reasons. Furthermore, neither approach is unequivocally better than the other across all or most circumstances. Whether a given approach is actually better for your case depends on a number of idiosyncratic factors, like how “rich” your universe of documents is to begin with (how prevalent relevant documents are in your set), how good your keywords or search criteria are when generating the targeted training sets, the distribution of document characteristics and features, and whether targeted training is paired with active learning (where the system itself suggests which documents to review next), among other things. The benefits of different types of future training sets are also affected by the current training status of your model.
Given this complexity, our best advice is that you should pursue a training strategy that makes sense given the requirements, needs, and timelines of your team and stakeholders. Training strategy can also evolve over the course of a project. There is no “best strategy” that can be articulated in a vacuum.
Through this process, it helps to have a predictive coding system that is not too prescriptive in how it operates. For example, with Everlaw you have the flexibility to train models using both targeted and randomly sampled approaches, or a mixture of the two.
Supplying negative examples
One last thing to be aware of when training the model is the importance of supplying negative examples (i.e., animals that are not dogs) in addition to positive examples (i.e., animals that are dogs). You can imagine how only training the robot on examples of dogs can lead it to erroneously deem a giraffe to be a dog. After all, giraffes have four legs, a tail, a snout, and are mammals. If, in training your robot helper, you label giraffes in addition to dogs, the robot will learn that if an animal is unusually tall and has a long neck, it is highly unlikely to be a dog.
The equivalent, in terms of review, is to rate and code documents such that they satisfy the ‘reviewed’ criteria, but not the ‘relevant’ criteria.
Wrapping it up
Let’s summarize the training principles we’ve discussed in this section:
- Reviewing more documents will generally lead to better predictions.
- Reviewing a greater variety of documents will generally lead to better predictions.
- Providing negative examples will generally lead to better predictions.
- Random sampling is used to ensure your model is learning from a representative set of documents in your project.
- Targeted training is used to guide model training in particular ways that can – depending on the circumstances – lead to better predictions or faster times to desired results.
Finally, depending on your workflows, you might be comfortable training the models in your case informally and incidentally, over the course of regular review. Or, you can elect to have a more rigorous, formal process where all review serves the end of training the model and achieving particular performance thresholds.
Exploring the Prediction Results
Models need to have a sufficient amount of data before they can begin generating predictions. In Everlaw, a model will kick off once 200 qualified documents have been reviewed, with at least 50 reviewed as relevant and 50 reviewed as irrelevant.
Documents are considered qualified if they:
- have sufficient text,
- are unique (e.g., if there are duplicates of a reviewed document that are coded the same, only one of those documents counts as qualified reviewed), and
- are not in conflict (e.g., emails coded irrelevant in the same thread as emails coded relevant are not counted as qualified reviewed).
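A simplified sketch of these qualification rules might look like this. The minimum-text threshold and the conflict flag are invented for illustration; Everlaw’s actual rules may differ in detail.

```python
def qualified_reviewed(docs, min_chars=100):
    """Apply the three qualification rules sketched above. The character
    threshold and conflict flag are illustrative, not Everlaw's exact rules."""
    seen_texts = set()
    qualified = []
    for doc in docs:
        if len(doc["text"]) < min_chars:  # rule 1: sufficient text
            continue
        if doc["text"] in seen_texts:     # rule 2: duplicates count once
            continue
        if doc.get("conflict"):           # rule 3: no conflicting thread coding
            continue
        seen_texts.add(doc["text"])
        qualified.append(doc)
    return qualified

docs = [
    {"text": "w" * 200, "conflict": False},  # qualified
    {"text": "w" * 200, "conflict": False},  # duplicate, not re-counted
    {"text": "short", "conflict": False},    # insufficient text
    {"text": "x" * 200, "conflict": True},   # coding conflict
    {"text": "y" * 200, "conflict": False},  # qualified
]
print(len(qualified_reviewed(docs)))  # 2
```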
Once a model is ready, it is used to rank documents on a 0-100 scale, with 100 representing documents that are highly likely to be relevant according to the model’s criteria, and 0 representing documents that are likely to be irrelevant.
You can fetch documents based on their prediction scores for either exploration or prioritization purposes.
In addition to a prediction score, Everlaw also calculates a coverage score per document per model in the project. As we explored, models are constructed based on an analysis of the various features that are encountered in the documents. However, based on how you train the model, there are some features in the general population of documents that may not be well-represented by the documents that were used in training. The coverage score captures the extent to which a particular document’s features are represented in a model. Just like prediction scores, coverage scores are also given on a 0-100 scale, and are visualized against predictions in Everlaw. Reviewing documents with low coverage scores will improve the model’s performance, and the quality of the predictions.
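While Everlaw’s actual coverage computation is more sophisticated, the intuition (how much of a document’s content the training set has already ‘seen’) can be illustrated with a toy measure:

```python
def coverage_score(doc_terms, training_vocabulary):
    """Toy coverage measure: the share of a document's features that
    appeared anywhere in the training documents, scaled to 0-100."""
    if not doc_terms:
        return 0
    known = sum(1 for term in doc_terms if term in training_vocabulary)
    return round(100 * known / len(doc_terms))

# Hypothetical vocabulary drawn from the training documents.
vocab = {"merger", "contract", "invoice", "payment"}
print(coverage_score({"merger", "contract", "payment"}, vocab))  # 100
print(coverage_score({"merger", "geology", "drilling"}, vocab))  # 33
```

A document full of terms the model has never encountered gets a low coverage score, which is exactly why reviewing such documents teaches the model the most.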
A model’s performance can be evaluated through performance metrics. We’ll discuss how to interpret performance in the next section. For now, let’s summarize this section using our animal park analogy:
- We enlist a robot to help us find dogs within our animal park.
- We train our robot with a randomly chosen subset of the animals in the park.
- The robot goes around the park and scores each animal from 0-100, with 100 representing animals that are highly likely to be dogs.
- Once scores are assigned, you can ask the robot to bring you animals based on their scores. For example, you can see all the animals that are scored higher than 70, or all the animals scored between 30-50, or all the animals scored under 20. If you continue to train the robot by labeling more animals, the predictions will change to reflect the updated training.
- In addition to the prediction score, the robot will also assign each animal in the park with a coverage score based on comparing the features of the animals that were used in training to the features of the particular animal. One way you can choose which animals to label for further training is to select those that have low coverage scores.
Understanding Performance Metrics
Imagine that after training your robot, you ask it to bring you three animals that it thinks are dogs. It brings you two dogs and a fox. You want to figure out more systematically how often your robot makes errors given its current training status. This is where performance metrics come into play.
One way to gauge performance is to label all the animals, then check the robot’s predictions against your labels. However, this is incredibly inefficient. After all, the reason you enlisted the robot in the first place is to minimize the number of animals you need to look at and label.
A better approach is to label a representative subset of animals, check those against the robot’s predictions, and extrapolate from the robot’s performance on the subset to the entire park.
You might be tempted to just check the robot’s predictions against the animals that you’ve labeled over the course of training, but this introduces the possibility of bias: If you trained your robot on a particular set of animals, it’s no surprise that it would be adept at gauging whether an animal in that set is likely to be a dog.
This is where holdout sets come into play. The basic idea is that you set aside part of the labeled input purely for evaluating performance, and not for training. Let’s see how this works in our animal park: First, you randomly identify a subset of animals in the park and give them a special designation. This designation tells the robot that, even if the animal has been labeled, it shouldn’t be incorporated into training. During training, you’ll label animals with and without the designation. The labeled, non-designated animals will be used to train the robot. The robot’s predictions will then be compared to the labeled, designated animals to generate performance metrics.
Similarly, in a project, a percentage of the documents are carved out and reserved for evaluating the performance of prediction models. Though these documents are not used in training, they are used to generate the metrics that will help you evaluate the models. There are three performance metrics that are commonly used to evaluate predictive coding models: precision, recall, and F1.
- Precision: If you ask the robot to bring you 10 dogs from the holdout set, and it brings you 6 dogs and 4 non-dog animals, you’ll have a precision of 60%. More formally, precision measures how many items predicted to be relevant are actually relevant.
- Recall: If there are 10 dogs in the holdout set, and 3 of them are predicted to be dogs by the robot, you’ll have a recall score of 30%. More formally, recall measures how many of the relevant items are correctly predicted to be relevant.
- F1: Because precision and recall measure different aspects of a model’s performance, an evaluation of a model’s overall performance requires taking both into account. The F1 score combines the two into a single number (technically, their harmonic mean). There is a fundamental tradeoff between precision and recall. Let’s explain this using an example. Imagine there are 1000 animals in the park, only 10 of which are dogs. You ask the robot to bring you all the dogs, and it retrieves all 1000 animals. This results in a perfect recall score, because all 10 dogs are included in that 1000, but a low precision score, because the robot also retrieved 990 non-dogs. Now, let’s imagine that the robot only retrieves 1 animal, and it happens to be a dog. This results in a perfect precision score, because every animal retrieved is a dog, but a low recall score, because it retrieved only one of the 10 dogs.
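These definitions translate directly into code. The sketch below computes all three metrics over a reviewed holdout set, using the harmonic-mean formula for F1:

```python
def precision_recall_f1(predicted_relevant, actually_relevant):
    """Metrics over a reviewed holdout set; both arguments are sets of ids."""
    true_positives = len(predicted_relevant & actually_relevant)
    precision = true_positives / len(predicted_relevant)
    recall = true_positives / len(actually_relevant)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# The robot brings back 10 'dogs': 6 real dogs, out of the 10 dogs that
# actually exist in the holdout set.
predicted = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
actual = {0, 1, 2, 3, 4, 5, 10, 11, 12, 13}
p, r, f1 = precision_recall_f1(predicted, actual)
print(p, r)  # 0.6 0.6
```

A production version would guard against empty sets (no predictions, or no relevant documents), where these ratios are undefined.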
However, you can see which animals are given the special designation, and which are not. Because of this, it’s still possible for you to bias the robot’s predictions. You could choose, for example, to only label animals with rounded ears. Any non-designated animal with rounded ears would be used to train the robot, and any designated animal with rounded ears would be used to evaluate the robot’s performance. The robot’s performance metrics would probably be inaccurately high, because the animals used to measure its performance were artificially similar to the animals used to train the robot’s predictions. For your robot’s performance metrics to be truly unbiased, you not only have to give animals the special designation randomly, but also must label a random sample of those designated animals.
To prevent this bias from generating overly positive performance statistics, you can choose to evaluate your model’s performance according to a randomly sampled set of reviewed holdout set documents. This option will generate statistics that are more rigorously determined and thus more representative of your model’s actual performance.
You may be thinking back to an earlier section in the article where we discussed how the robot doesn’t determine whether an animal is a dog or not, but instead gives the animal a score from 0-100, with 100 being highly likely to be a dog. If the prediction scores are along a continuum, how do we determine what counts as ‘relevant’ (i.e., a ‘dog’) when calculating performance? The answer is that the boundary is left up to you. For example, you can decide that a score of 70 marks the relevance boundary (all items with prediction scores above 70 are considered ‘relevant’, all below are considered ‘irrelevant’). Everlaw will display the precision, recall, and F1 scores at any boundary you choose, allowing you to select one based on your desired combination of performance levels.
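Conceptually, choosing a boundary means recomputing precision and recall at each candidate cutoff. A minimal sketch, with made-up scores:

```python
def metrics_at_boundary(scores, relevant_ids, boundary):
    """Treat every document scoring above `boundary` as predicted relevant,
    then compute precision and recall at that cutoff."""
    predicted = {doc for doc, score in scores.items() if score > boundary}
    tp = len(predicted & relevant_ids)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant_ids)
    return precision, recall

# Hypothetical prediction scores and ground-truth review decisions.
scores = {"d1": 95, "d2": 80, "d3": 60, "d4": 40, "d5": 15}
truly_relevant = {"d1", "d2", "d4"}
for boundary in (70, 50, 30):
    print(boundary, metrics_at_boundary(scores, truly_relevant, boundary))
```

Sweeping the boundary downward tends to raise recall (more relevant documents are captured) at the cost of precision, which is the tradeoff described above.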
Tying it Together
We’ve covered quite a bit of ground in the preceding sections. Let’s take a step back and recap all that we’ve learned about predictive coding. If you feel confident in your knowledge of how all the moving parts fit together, feel free to skip this section.
We have a universe of unreviewed documents in a project.
We have a park filled with animals of unknown type.
To facilitate review, we create codes to tag the documents with.
To identify the animals by type, we create labels.
We want help finding documents that are relevant to a specific aspect of the case, so we create a prediction model.
We want help identifying all the dogs in the park, so we enlist a friendly robot assistant.
In order to generate performance stats, the system carves out a randomly sampled 5% of documents in the project. This is known as the holdout set. Reviewed documents in this set are not used for training the model, but are instead reserved to gauge a model’s performance.
To gauge the robot's performance in identifying dogs, we randomly select 5% of the animals in the park, and designate them in such a way that the robot knows not to learn from any of the designated animals. This is known as the holdout set.
Getting the Model Started
In order for the model to start generating predictions, you need to provide it with some initial inputs. In the case of Everlaw, you need to review at least 200 qualified documents to match the reviewed criteria, with at least 50 of those matching the relevant criteria and 50 considered irrelevant.
To activate our robot, we label 200 unique animals, at least 50 of which are dogs and 50 of which are not dogs. The robot will learn from these labeled animals. Any labeled animal is considered reviewed; any animal labeled as a dog is considered relevant; any animal labeled as anything other than a dog (e.g., a cat) is considered irrelevant.
If you want performance metrics, you should also review documents in the holdout set. The prediction model will not learn from reviewed documents in the holdout set.
To ensure that there are performance metrics, we’ll also label the animals that we designated to be in the holdout set. The robot will not learn from these labeled animals.
Improving the Prediction Model
To improve the quality of the predictions for a given model, you can code more documents to match the reviewed and relevant criteria.
To improve our robot’s performance, we provide it additional training by labeling more animals.
You can also focus additional training on documents that have low coverage scores. Documents with low coverage scores are those with features that are not well accounted for in the documents that model was trained on.
We can identify animals with features that are not well represented in the set of labeled animals that the robot learned from. Labeling these animals will expand the features that the robot takes into account when rendering a prediction, thereby improving the accuracy of the predictions.
To improve the accuracy of the performance metrics, you can review more documents from the holdout set.
To improve the metrics that are used to gauge the robot’s performance, we can label more of the animals that we designated to be in the holdout set. We can also choose to review the documents in the holdout set in a randomly-assigned order, to improve the rigor of the metrics.
Once a model is running, you can use it to retrieve documents based on their current prediction scores. Documents are scored from 0-100, with 100 being very likely to be relevant given a model’s criteria for relevance. For example, you can select all of the documents that are scored 70-100. You can also pair this with other search terms: You can retrieve unreviewed documents that are very likely to be relevant, or documents that you reviewed as relevant but the model predicts as likely to be irrelevant.
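These kinds of score-based retrievals are essentially filters over prediction scores and review status. A sketch with hypothetical documents:

```python
# Hypothetical project snapshot: per-document prediction score and the
# team's review decision (None = unreviewed).
documents = [
    {"id": 1, "score": 92, "reviewed_relevant": None},
    {"id": 2, "score": 88, "reviewed_relevant": True},
    {"id": 3, "score": 75, "reviewed_relevant": None},
    {"id": 4, "score": 12, "reviewed_relevant": True},
    {"id": 5, "score": 8,  "reviewed_relevant": False},
]

# Unreviewed documents that are very likely to be relevant:
to_review_next = [d["id"] for d in documents
                  if d["reviewed_relevant"] is None and d["score"] >= 70]

# Coded relevant, but the model strongly disagrees (a QA target):
qa_candidates = [d["id"] for d in documents
                 if d["reviewed_relevant"] and d["score"] < 30]

print(to_review_next)  # [1, 3]
print(qa_candidates)   # [4]
```

The 70 and 30 cutoffs here are arbitrary; in practice you would choose them based on your model’s performance metrics and your review goals.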
Once the robot is running, it’ll score all the animals in the park from 0-100, with 100 being animals that are very likely to be dogs. You can then ask the robot to bring you animals based on their scores, or to combine scores with your labels, just as you would retrieve documents based on prediction scores and review status.
Keep in mind that as long as you are coding additional documents, the model will update itself to take into account the newly available training input, and the predictions will update accordingly. The exception is documents in the holdout set: reviewed docs in the holdout set are not used for training, but will help improve the accuracy of the performance metrics.
As long as you are labeling animals, the robot will continue to learn from your decisions, except for animals that are part of the holdout set. However, labeling animals in the holdout set will improve the accuracy of the performance metrics, especially if you choose to review documents in a randomly selected order.
Key Best Practices
It is difficult to give specific targets or guidelines for precision, recall, or F1 scores since every project is unique. Performance targets will also depend on how predictive coding is used within a larger review workflow. For example, one team might value high precision scores and be willing to sacrifice recall to obtain them. Another review team might prioritize recall over precision. Nevertheless, here are some general tips on how to improve different aspects of a model’s performance in Everlaw:
- Work to reduce the bias of the training and evaluation corpus. Make a conscious effort to review a broadly sampled set of documents from across the entire project.
- Use the coverage scores to target specific documents that are not well represented in the training corpus for additional training.
- You can restrict training to the review decisions of a select number of trusted reviewers or subject matter experts. This will help keep training input consistent.
Example use cases
Here are four common use cases for predictive coding:
- Prioritizing review: Use predictions to select which documents to manually review next. Usually these are unreviewed documents that are currently predicted to be relevant.
- Finding relevant documents: Use the predictions to identify a set of relevant documents. You might choose to only review documents above a certain prediction score after a certain performance threshold is reached.
- Identifying irrelevant documents: Use the predictions to exclude documents from manual review. You might choose to ignore all documents under a certain prediction score after a certain performance threshold is reached.
- QA: You can use predictive coding to find documents that are predicted relevant, but not coded as relevant, or documents predicted as irrelevant but coded as relevant.