To view all of Everlaw's predictive coding-related content, please see our predictive coding section.
Table of Contents
- Action items
- Setting a threshold for your model
- Holdout set
- Historical performance
Now that you’ve created your predictive coding model, it’s time to interpret your results! To access any of your predictive coding models, use the Document Analytics icon in the navigation bar (indicated by a bar graph) and select Predictive Coding.
Your predictive coding model page is separated into three sections: action items, results, and training. Let’s start with the action items section.
Action items
First, you have your prioritized documents.
Prioritized documents are those that have not yet been reviewed by your team, but that the model predicts to be relevant based on how you have reviewed other documents. In the case of a rating model, these are documents that the team has not yet rated, but that the model predicts to be either warm or hot. These may be important documents that were missed by prior searches, or documents that the team simply has not yet gotten to.
Click Review to see a full results table of these prioritized documents.
After your prioritized documents, you are given any conflicts the model has run into.
On the left is the number of documents that the model predicts to be relevant, but that the team reviewed as irrelevant.
For a rating model, these are documents that the model predicts are either warm or hot (relevant), but that the team rated cold (irrelevant). Note that a document that has not yet been reviewed by the team would not fall into this category. This section is for documents where the team has reviewed (i.e., coded or rated) them, but where the team’s review work conflicts with the model’s predictions for the document. Click Review to see a list of these documents.
On the right, you’ll see the number of documents that the model predicts to be irrelevant, but that the team reviewed as relevant.
In a rating model, these conflicting documents are ones that the model predicts are cold (irrelevant), but the team rated either warm or hot (relevant). Click Review to see a list of these conflicting documents.
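The prioritized and conflict buckets described above boil down to a simple decision rule. The sketch below is purely illustrative: the `Doc` type, the `action_item` function, and the cutoff of 50 are assumptions made for this example, not Everlaw's implementation, and the real cutoff depends on your model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff: scores at or above this count as "predicted relevant".
# The real cutoff is model-specific; 50 is only a placeholder.
ASSUMED_CUTOFF = 50


@dataclass
class Doc:
    score: int             # prediction score, 0-100
    rating: Optional[str]  # "hot", "warm", "cold", or None if not yet rated


def action_item(doc: Doc, cutoff: int = ASSUMED_CUTOFF) -> Optional[str]:
    """Return the action-item bucket a document falls into, if any."""
    predicted_relevant = doc.score >= cutoff
    if doc.rating is None:
        # Unreviewed documents can be prioritized, but are never conflicts.
        return "prioritized" if predicted_relevant else None
    reviewed_relevant = doc.rating in ("hot", "warm")
    if predicted_relevant and not reviewed_relevant:
        return "conflict: predicted relevant, reviewed irrelevant"
    if reviewed_relevant and not predicted_relevant:
        return "conflict: predicted irrelevant, reviewed relevant"
    return None  # the model and the team agree


print(action_item(Doc(score=90, rating=None)))
print(action_item(Doc(score=90, rating="cold")))
print(action_item(Doc(score=10, rating="hot")))
```

Note that an unrated document never lands in a conflict bucket, which matches the point above that conflicts only arise for documents the team has already reviewed.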
The final action item allows you to improve the model’s predictions by reviewing documents that are not well covered.
For a document to be well covered, the model needs to have been trained on documents that share similar features to it. If a document is full of features that the model has never seen before, the document will be considered not well covered. For example, if the word “velociraptor” shows up in a document 100 times, but the model has never been trained on a document with the word “velociraptor” in it, it won’t be able to predict whether that document is relevant as well as a model that has been trained on documents with “velociraptor” in it. The documents in your Improve Predictions section are those that have features the model hasn’t seen before. Click Review to see a list of documents that contain unfamiliar features.
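To make the “velociraptor” example concrete, here is a toy sketch of one possible way to think about coverage: the share of a document's words that the model has already seen in training. This is an assumption for illustration only. The article does not describe how coverage is actually calculated, and `assumed_coverage` and `training_vocab` are hypothetical names.

```python
def assumed_coverage(doc_text: str, training_vocab: set[str]) -> float:
    """Fraction of word occurrences in doc_text that appear in training_vocab.

    A toy stand-in for coverage, not the platform's actual formula.
    """
    words = doc_text.lower().split()
    if not words:
        return 0.0
    seen = sum(1 for word in words if word in training_vocab)
    return seen / len(words)


# Words the (hypothetical) model has encountered in its training documents.
training_vocab = {"the", "contract", "was", "signed", "by", "both", "parties"}

familiar = "the contract was signed"
unfamiliar = "velociraptor " * 100 + "the contract"

print(assumed_coverage(familiar, training_vocab))    # all words seen in training
print(assumed_coverage(unfamiliar, training_vocab))  # almost all words unseen
```

Under this toy measure, the second document scores very low because “velociraptor” dominates its text and never appeared in training, so it would be a natural candidate for the Improve Predictions list.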
The next section in your model’s page is Results. The results section has two different parts: distribution and performance.
Let’s start with distribution.
The distribution graph shows the documents in your project along a scale of predicted relevance. To the far left, you can see the documents that have a prediction score of 0. The model predicts that these documents are very unlikely to be relevant. To the far right, there are documents that have a prediction score of 100. The model predicts that these documents are very likely to be relevant. The purple flag on the distribution graph represents your model’s max F1 score. F1 scores will be discussed in more depth in the performance metrics section, but you can think of the max F1 score as your model’s threshold for relevance. Documents that fall anywhere to the right of this line are considered relevant by your model, while documents that fall to the left of this line are considered irrelevant. Additionally, a document that has a prediction score of 100 has a higher predicted likelihood of being relevant than a document with a prediction score of 85, but the document with a prediction score of 100 is not predicted to be more relevant than the document with the lower prediction score. Relative prediction scores correspond to relative likelihoods of relevance, not necessarily relevance itself.
You will also notice a green, movable flag on your distribution graph. By moving this green flag, you can set your own prediction threshold for reviewing your documents. For example, if you only want to review documents that are predicted very likely to be relevant, you can slide the green flag further to the right. This, of course, may mean that you skip over other documents that may be relevant but received a lower prediction score. Clicking on the blue number to the right will bring up a list of documents that fall above the threshold you’ve set.
In the top right corner of the distribution graph are Reviewed and Unreviewed toggles. If only Reviewed is selected, your graph will only show the prediction scores of documents that have been reviewed. In a rating model, this means the graph is showing the prediction score distribution of documents that the team has already rated. If only Unreviewed is selected, the graph will show the prediction score distribution of documents that have not yet been reviewed. In this rating model, these are documents that have not yet been rated by the team. If both Reviewed and Unreviewed are selected, the graph will show you the predicted relevance of all documents in your project, with Reviewed and Unreviewed documents stacked on top of each other. Reviewed documents will be stacked on top and unreviewed documents will be underneath.
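If it helps to picture what the toggles do, the following sketch counts documents per score range, split by review status. It is only an illustration: the `score_histogram` function, its parameters, and the bin width of 10 are assumptions, not a description of how the graph is built.

```python
from collections import Counter


def score_histogram(docs, show_reviewed=True, show_unreviewed=True, bin_width=10):
    """Count documents per score bin, split by review status.

    docs is an iterable of (score, is_reviewed) pairs with score in 0-100.
    Returns {bin_start: {"reviewed": n, "unreviewed": m}}.
    """
    reviewed, unreviewed = Counter(), Counter()
    for score, is_reviewed in docs:
        # Place a score of 100 in the last bin (90-100 for bin_width=10).
        bin_start = min(score // bin_width * bin_width, 100 - bin_width)
        if is_reviewed and show_reviewed:
            reviewed[bin_start] += 1
        elif not is_reviewed and show_unreviewed:
            unreviewed[bin_start] += 1
    bins = sorted(set(reviewed) | set(unreviewed))
    return {b: {"reviewed": reviewed[b], "unreviewed": unreviewed[b]} for b in bins}


docs = [(5, True), (7, False), (95, False), (100, True)]
print(score_histogram(docs))                         # both toggles on
print(score_histogram(docs, show_unreviewed=False))  # only Reviewed selected
```

With both toggles on, each bin carries two counts that would be drawn stacked. Turning one toggle off simply drops that group from every bin.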
After your model’s distribution graph, there is a section dedicated to the performance of your model. Performance is measured by recall, precision, and F1 scores. Let’s first go through what these each mean.
Recall is a measure of how many relevant documents the model identified, compared to how many relevant documents actually exist in the project. Said another way, recall answers, "how many documents did the model correctly predict to be relevant, as a % of how many documents are actually relevant?" It helps us understand whether the model cast a wide enough net for capturing relevant documents. For example, if our model correctly identified 80 relevant documents, and 100 relevant documents actually exist in the project, our model’s recall score would be 80%. We can see an 80% recall score represented in the graph above. This means the model is returning 80% of the relevant documents in the project.
Precision is a measure of how many of the documents identified as relevant by the model are actually relevant. Precision complements recall: where recall asks how much of the relevant material the model found, precision asks how much of what the model flagged is truly relevant. Generally speaking, it answers the question, "how many documents are actually relevant, as a % of documents predicted by the model to be relevant?" If our model identified 100 documents as relevant, but only 15 of those 100 were actually relevant, our model’s precision score would be 15%. In the above screenshot, our precision score is 78%. That means that 78% of the documents that the model predicts to be relevant are actually relevant.
Finally, we have the model’s F1 score. The F1 score is the harmonic mean of precision and recall. In other words, it summarizes the balance between capturing all relevant documents and not giving you too many false positives. The max F1 score is the point at which precision and recall are best balanced, which is why it’s anchored to the model’s distribution graph as its threshold for relevance. On the performance graph, the max F1 score is the highest point on the F1 line.
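The three metrics can be written out in a few lines using their standard definitions, with F1 taken as the harmonic mean of precision and recall. The function names below are chosen for this example, and the inputs are the illustrative figures used above.

```python
def recall(true_positives: int, actual_relevant: int) -> float:
    """Share of the actually relevant documents that the model identified."""
    return true_positives / actual_relevant


def precision(true_positives: int, predicted_relevant: int) -> float:
    """Share of the documents predicted relevant that are actually relevant."""
    return true_positives / predicted_relevant


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r (0.0 if both are zero)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


print(recall(80, 100))     # 80 of 100 relevant documents found
print(precision(15, 100))  # 15 of 100 predicted-relevant documents are relevant
print(f1(0.78, 0.80))      # the 78% precision and 80% recall quoted above
```

With 78% precision and 80% recall, the F1 score comes out at roughly 0.79. Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot earn a high F1 score by excelling at only one of them.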
Setting a threshold for your model
With an understanding of recall, precision, and F1 scores, let’s return to the green line on the distribution graph. If we line our green flag up with the max F1 score, we have set our threshold at the point at which recall and precision are optimally balanced.
Any range of documents above or below this prediction score will see a tradeoff in either precision or recall. To understand what this means, imagine dragging the green flag to the right.
The green line on the performance graph will follow. First, look at what happens to the model’s precision score as we go further right. The precision score is represented by the blue line. As we increase our threshold, the precision score of our model goes up.
This means that documents to the right of this threshold (accessed by clicking the blue document icon on the distribution graph) are more likely to actually be relevant than documents to the left of it. In other words, there’s a smaller chance of false positives the further right we go. But let’s now add the recall line into the graph. The recall score is represented by the purple line. You’ll notice that as we drag the threshold further to the right, the model’s recall score decreases.
This means that there’s a lower chance that we’re capturing all of the relevant documents in our project. In other words, it’s become more likely that we’re missing some relevant documents in our search. Taken together, this means that moving our threshold to the right of the F1 score will increase our chances that the documents we look at are truly relevant, but it will decrease our chances of finding all the relevant documents in our project.
Alternatively, if we move the threshold to the left, we decrease our chances of only seeing relevant documents, but we increase our chances of finding all the relevant documents in our project.
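The tradeoff can be reproduced on a small, made-up holdout set. In this sketch, `metrics_at` and the sample data are hypothetical, and documents scoring at or above the threshold are assumed to count as predicted relevant.

```python
def metrics_at(threshold, holdout):
    """Precision, recall, and F1 when scores >= threshold count as relevant.

    holdout is a list of (score, reviewed_relevant) pairs.
    """
    tp = sum(1 for s, rel in holdout if s >= threshold and rel)
    fp = sum(1 for s, rel in holdout if s >= threshold and not rel)
    fn = sum(1 for s, rel in holdout if s < threshold and rel)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up holdout data: (prediction score, did the team review it as relevant?)
holdout = [(95, True), (90, True), (85, True), (75, True), (70, False),
           (65, True), (55, False), (50, True), (40, False), (35, False),
           (25, True), (15, False), (10, False), (5, False)]

# Lowest threshold that achieves the maximum F1 score on this data.
best = max(range(0, 101), key=lambda t: metrics_at(t, holdout)[2])
print("max F1 threshold:", best)

for t in (20, best, 60, 80):
    p, r, f = metrics_at(t, holdout)
    print(f"threshold={t:3d}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

On this sample, precision rises and recall falls as the threshold moves right, and F1 peaks in between. Real data is noisier: precision in particular does not have to rise at every single step, even though the overall trend holds.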
The performance graph is accompanied by a table displaying the number of documents that would be predicted relevant/irrelevant for any given threshold, as well as how many reviewed documents fall on either side of that threshold. For example, moving the threshold to the left increases our recall, meaning that more documents reviewed as relevant are likely to be predicted relevant by the model. This can be seen by the increasing value in the Predicted Relevant-Reviewed Relevant cell. However, moving the threshold to the left also decreases our precision, meaning that fewer of the documents the model predicts to be relevant will have been reviewed as relevant. This can be seen by the increasing value in the Predicted Relevant-Reviewed Irrelevant cell.
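The cells of such a table are simple counts. The sketch below shows how they might be tallied for reviewed documents at a given threshold. `threshold_table` and the sample documents are made up for illustration, and a score at or above the threshold is assumed to mean predicted relevant.

```python
def threshold_table(threshold, reviewed_docs):
    """Count reviewed documents in each predicted-vs-reviewed cell.

    reviewed_docs is a list of (score, reviewed_relevant) pairs.
    """
    table = {
        ("predicted relevant", "reviewed relevant"): 0,
        ("predicted relevant", "reviewed irrelevant"): 0,
        ("predicted irrelevant", "reviewed relevant"): 0,
        ("predicted irrelevant", "reviewed irrelevant"): 0,
    }
    for score, reviewed_relevant in reviewed_docs:
        predicted = "predicted relevant" if score >= threshold else "predicted irrelevant"
        reviewed = "reviewed relevant" if reviewed_relevant else "reviewed irrelevant"
        table[(predicted, reviewed)] += 1
    return table


reviewed_docs = [(90, True), (70, False), (55, True), (30, False)]

for threshold in (80, 50):  # moving the threshold to the left
    print(f"threshold {threshold}:")
    for cell, count in threshold_table(threshold, reviewed_docs).items():
        print("  ", " - ".join(cell), "=", count)
```

Lowering the threshold from 80 to 50 in this sample raises both Predicted Relevant cells: one more document that was reviewed as relevant is captured, and one document that was reviewed as irrelevant is pulled in with it.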
You have the option of generating either basic or rigorous performance statistics for your model. Which statistics are shown for your model depends on which documents from your model’s holdout set are being used to generate them. The next section will discuss the holdout set.
Holdout set
Below your distribution and performance graphs is your holdout set. Your holdout set is the set of documents your model uses to generate the performance statistics we just looked at. There are two types of performance statistics that can be generated for your predictive coding model: basic and rigorous performance statistics. For the most rigorously calculated performance statistics, choose Rigorous.
Please read our documentation for more information on generating rigorous performance statistics. This article will discuss holdout sets for generating basic performance statistics.
In the case of the model above, we have reviewed 18,340 documents for our holdout set. The team rated 439 of those documents as relevant, and the rest as irrelevant. Our model made its own predictions about the relevance of those 18,340 documents, and then compared its predictions to how the team rated them.
As noted earlier, our model’s recall score was 80%. That means that, of the 439 holdout documents that the team rated either warm or hot, the model captured 80% of them, correctly predicting that they were relevant. Conversely, 20% of the documents that the team rated either warm or hot were incorrectly predicted to be cold by the model.
Next, we can think about the model’s precision score. The precision score for this model is 78%. That means that, of all the holdout documents that the model predicted were either warm or hot, 78% of them actually were rated warm or hot by the team. Conversely, 22% of the holdout documents that the model predicted to be either warm or hot were actually rated cold by the team.
Documents in your holdout set always remain in your holdout set, though the holdout set can grow as you review more documents.
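To put rough document counts on those percentages, the arithmetic can be worked through as follows. The results are estimates only, since the 80% and 78% figures quoted above are themselves rounded. The variable names are chosen for this example.

```python
holdout_relevant = 439  # holdout documents the team rated warm or hot
recall = 0.80           # recall score quoted above (rounded)
precision = 0.78        # precision score quoted above (rounded)

# Relevant holdout documents the model correctly predicted as relevant.
captured = round(holdout_relevant * recall)
# Relevant holdout documents the model predicted to be cold.
missed = holdout_relevant - captured
# All holdout documents the model predicted to be warm or hot.
predicted_relevant = round(captured / precision)
# Predicted warm or hot, but rated cold by the team.
false_positives = predicted_relevant - captured

print(captured, missed, predicted_relevant, false_positives)
# prints: 351 88 450 99
```

So in this example the model caught about 351 of the 439 relevant holdout documents and missed about 88, while roughly 99 of the approximately 450 documents it flagged as relevant had been rated cold by the team.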
If your holdout set is insufficiently reviewed, this means that the model doesn’t have enough holdout documents deemed relevant or irrelevant by your team to give a good sense of its historical performance.
In the case of an insufficiently reviewed holdout set, the model will use a random sample of reviewed documents to generate performance statistics. However, these performance statistics will not be saved and therefore won’t contribute to your model’s performance history. In order to generate performance statistics that will be saved, your team needs to have reviewed at least 400 unique holdout documents, with at least 100 unique documents deemed relevant and 100 unique documents deemed irrelevant. To review documents for your holdout set, click the blue number under Unreviewed. Additionally, your holdout set may be sufficiently reviewed to generate performance statistics, but by reviewing all unreviewed documents in your holdout set, you can improve the accuracy of the statistics.
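The review minimums above translate into a simple check. In this sketch, the function and constant names are hypothetical. The counts are assumed to be unique holdout documents, and every reviewed document is assumed to be deemed either relevant or irrelevant.

```python
MIN_REVIEWED = 400    # minimum reviewed holdout documents
MIN_RELEVANT = 100    # minimum deemed relevant
MIN_IRRELEVANT = 100  # minimum deemed irrelevant


def holdout_sufficiently_reviewed(relevant: int, irrelevant: int) -> bool:
    """True if the reviewed holdout counts meet all three minimums."""
    return (relevant + irrelevant >= MIN_REVIEWED
            and relevant >= MIN_RELEVANT
            and irrelevant >= MIN_IRRELEVANT)


print(holdout_sufficiently_reviewed(439, 17_901))  # the example model above
print(holdout_sufficiently_reviewed(350, 60))      # too few irrelevant
print(holdout_sufficiently_reviewed(150, 150))     # too few reviewed overall
```

Note that meeting the total alone is not enough: a holdout set with 410 reviewed documents still falls short if only 60 of them were deemed irrelevant.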
Historical performance
To see your model’s performance over time, check your model’s historical performance graph. Similar to the performance graph discussed above, this graph contains lines representing the recall, precision, and F1 scores of your model. In this graph, however, the x-axis is a timeline.
As mentioned above, your model’s basic performance statistics are always generated from the same core set of holdout documents. This means that the historical performance statistics come from a uniform comparison at each time point. If your model is generating rigorous performance statistics, historical performance will be calculated on the lowest contiguous set of holdout documents at that point in time. You can view the number of documents contributing to your model’s performance statistics over time by choosing the Holdout Size option under Performance History.
This section of the predictive coding page displays your model’s training data. To learn more about specific steps you can take to improve your model’s performance, please see this support article.
Reviewing the documents in your project teaches the model which documents should be considered relevant or irrelevant. As a reminder, at least 400 unique documents need to be reviewed in order for your model to begin generating predictions. At least 100 of these documents need to be reviewed as relevant, and at least 100 need to be reviewed as irrelevant. These 400 documents ensure that your model has enough training to begin making predictions.
In the center of the circle, you can see how many documents have been reviewed in the project thus far, according to the model’s “reviewed” criteria. The red section of the circle represents the number of documents that the team has reviewed as relevant. The blue section represents the number of documents the team has reviewed as irrelevant. Finally, you’ll find the number of documents ineligible for your model on the far right. Ineligible documents are those without text, or with too little text to generate reliable predictions. They are neither used to train the model, nor do they receive prediction scores.
Underneath your model’s training data, you can find training sets and weighted terms. Training sets and weighted terms help improve your model’s performance. To learn more about these and other methods for improving your model’s performance, please see this support article.
Underneath your model’s training sets is the model’s coverage graph.
Training coverage is a measure of how well your project's documents are represented in the training set. Poorly covered documents are ones that don’t share many features with the documents the model was trained on. For example, if a document has many words that the model has never seen before in training, it will be considered poorly covered, and will receive a low coverage score. The y-axis of the coverage graph is how well covered a document is given the model’s training. The x-axis of the graph is the model’s prediction of how likely the document is to be relevant.
The opacity of each pixel in the coverage grid indicates the number of documents occupying that spot on the graph. The darker a pixel is, the more documents can be found at that spot in the graph.
The color of the pixel is determined simply by the prediction score. Pixels with lower prediction scores are more blue and pixels with higher prediction scores are more red. No information is lost by viewing the graph in grayscale.
The documents in your project are plotted based on their coverage and prediction scores. Documents in the top right quadrant are those that the model predicts are likely to be relevant, and that are well covered given the model’s training. Documents in the bottom right quadrant are those that the model similarly predicts are likely to be relevant, but are not well covered given the model’s training. To see any of these documents, click and drag to select the region, and then click Review.
At the very bottom of your model’s page, you’ll see the update status of your model.
On the left, you’ll see when your model was last updated. Models are automatically updated approximately once every 24-48 hours. Your model should update at roughly the same time each day, since models are queued by time of model creation, but it is not guaranteed.
If your model has a scheduled update but is not currently updating, you can manually update all models in your project by clicking Update Now. This will prioritize your model to be updated, pending the completion of currently updating models across the platform. Note that this will not result in an instant update.