To view all of Everlaw's predictive coding-related content, please see our predictive coding section.
Now that you’ve created your predictive coding model and generated initial prediction scores as well as performance metrics, it’s time to interpret your results! To access any of your predictive coding models, use the Document Analytics icon in the navigation bar (indicated by a bar graph) and select Predictive Coding.
Your predictive coding model page is separated into five sections: action items, results, performance, training, and updates. Let’s start with the action items section.
Action items
First, you have your prioritized documents.
Prioritized documents are those that your team has not yet reviewed but that the model predicts to be relevant based on how you have reviewed other documents. The “Prioritize” action item identifies unreviewed documents whose prediction score is equal to or above the cutoff established by your model’s current max F1 score (i.e., the location of the purple flag on the distribution graph). In the case of a rating model, these are documents that the team has not yet rated, but that the model predicts to be either Warm or Hot based on the model's current max F1 score. These may be important documents that were missed by prior searches, or documents that the team simply has not yet gotten to.
Click Review to see a full results table of these prioritized documents.
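As a rough illustration of the selection logic described above (a sketch only, not Everlaw's actual implementation), the snippet below filters a hypothetical list of unreviewed documents down to those whose prediction score meets or exceeds the max F1 cutoff:

```python
# Illustrative sketch only -- the document data and cutoff value are hypothetical,
# not pulled from Everlaw.
unreviewed_docs = [
    {"id": "DOC-001", "prediction_score": 92},
    {"id": "DOC-002", "prediction_score": 41},
    {"id": "DOC-003", "prediction_score": 78},
]

max_f1_cutoff = 70  # position of the purple flag on the distribution graph

prioritized = [d for d in unreviewed_docs if d["prediction_score"] >= max_f1_cutoff]
print([d["id"] for d in prioritized])  # ['DOC-001', 'DOC-003']
```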
Next are any conflicts the model has identified. These are documents for which the model's prediction for relevance/irrelevance is inconsistent with how the team coded the document.
On the left is the number of reviewed documents that the model currently predicts to be relevant, but that have been reviewed irrelevant by the team. The model considers reviewed documents to be predicted relevant here if the document’s associated prediction score is equal to or above the cutoff established by your model’s current max F1 score.
For a rating model, these are reviewed documents that the model currently predicts to be either Warm or Hot (i.e., relevant) based on the model's current max F1 score, but that the team rated Cold (i.e., irrelevant). This section is for documents that the team has reviewed (i.e., coded or rated according to the model's criteria) but whose review work conflicts with the model's predictions. Note that a document that has not yet been reviewed by the team would not fall into this category.
Click Review to see a list of these documents.
On the right, you’ll see the number of reviewed documents that the model currently predicts to be irrelevant, but that the team reviewed as relevant. The model considers reviewed documents to be predicted irrelevant here if the document’s associated prediction score is below the cutoff established by your model’s current max F1 score (i.e., location of the purple flag on the distribution graph).
In a rating model, these conflicting documents are ones that the model predicts are Cold (irrelevant) based on the model's current max F1 score, but the team rated either Warm or Hot (i.e., relevant). Click Review to see a list of these conflicting documents.
The final action item allows you to improve the model’s predictions by reviewing documents that are not well covered. The "Improve Predictions" action item will identify documents that have a coverage score of 20% or below as documents that are not well covered.
For a document to be well covered, the model needs to have been trained on documents that share similar features with it. If a document is full of features that the model has never seen before, the document will be considered not well covered. For example, if the word “velociraptor” shows up in a document 100 times, but the model has never been trained on a document containing the word “velociraptor,” the model won’t be able to predict that document’s relevance as well as a model that has been trained on documents containing “velociraptor.” The documents in your Improve Predictions section are those that have features the model hasn’t seen before. Click Review to see a list of documents that contain unfamiliar features.
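To make the idea concrete, here is a simplified sketch that treats coverage as the fraction of a document's distinct terms that appeared anywhere in the training documents, and flags the document when that fraction is 20% or below. This is not Everlaw's actual coverage calculation; only the 20% threshold comes from the rule described above.

```python
# Simplified illustration of "coverage" as vocabulary overlap with the training set.
# The training vocabulary and document terms below are hypothetical.
training_vocabulary = {"contract", "invoice", "merger", "payment"}

def coverage_score(document_terms: set[str]) -> float:
    """Fraction of the document's distinct terms that the model has seen in training."""
    if not document_terms:
        return 0.0
    seen = document_terms & training_vocabulary
    return len(seen) / len(document_terms)

doc_terms = {"velociraptor", "fossil", "payment", "dig", "site"}
score = coverage_score(doc_terms)
print(f"coverage: {score:.0%}")                                  # coverage: 20%
print("flagged by Improve Predictions:", score <= 0.20)          # True
```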
Results
The next section of your model’s page is Results, which displays a distribution graph showing the documents in your project along a scale of predicted relevance.
To the far left on the distribution graph, you can see the documents that have a prediction score of 0. The model predicts that these documents are very unlikely to be relevant. To the far right on the distribution graph are documents that have a prediction score of 100. The model predicts that these documents are very likely to be relevant. The purple flag on the distribution graph represents your model’s max F1 score. F1 scores will be discussed in more depth in the performance metrics section, but you can think of the max F1 score as your model’s threshold for relevance. Documents that fall anywhere to the right of this line are considered relevant by your model, while documents that fall to the left of this line are considered irrelevant. Additionally, a document with a prediction score of 100 has a higher predicted likelihood of being relevant than a document with a prediction score of 85, but it is not predicted to be more relevant than the lower-scoring document. Relative prediction scores correspond to relative likelihoods of relevance, not to degrees of relevance.
You will also notice a green, movable flag on your distribution graph. By moving this green flag, you can set your own prediction threshold for reviewing your documents. For example, if you only want to review documents that are predicted very likely to be relevant, you can slide the green flag further to the right. This, of course, may mean that you skip over other documents that may be relevant but received a lower prediction score. Clicking on the blue number to the right will bring up a list of documents that fall above the threshold you’ve set.
In the top right corner of the distribution graph are Reviewed and Unreviewed toggles. If only Reviewed is selected, your graph will only show the prediction scores of documents that have been reviewed. In a rating model, this means the graph shows the prediction score distribution of documents that the team has already rated. If only Unreviewed is selected, the graph will show the prediction score distribution of documents that have not yet been reviewed; in a rating model, these are documents that have not yet been rated by the team. If both Reviewed and Unreviewed are selected, the graph will show the predicted relevance of all documents in your project, with reviewed documents stacked on top and unreviewed documents underneath.
Performance
After the Results section containing your model’s distribution graph, there is a Performance section dedicated to the performance of your model. Performance is measured by recall, precision, and F1 scores. Let’s first go through what each of these means.
Recall is a measure of how many relevant documents the model identified, compared to how many relevant documents actually exist in the project. Said another way, recall answers, "of the documents that are actually relevant, what percentage did the model predict to be relevant?" Recall helps us understand whether the model cast a wide enough net for capturing relevant documents. For example, if our model correctly identified 80 relevant documents, and 100 relevant documents actually exist in the project, our model’s recall score would be 80%. We can see an 80% recall score represented in the graph above. This means the model is returning 80% of the relevant documents in the project.
Precision is a measure of how many of the documents identified as relevant by the model are actually relevant. Precision is the counterpart to recall: it answers the question, "of the documents the model predicts to be relevant, what percentage are actually relevant?" If our model identified 100 documents as relevant, but only 15 of those 100 were actually relevant, our model’s precision score would be 15%. In the above screenshot, our precision score is 78%. That means that 78% of the documents that the model predicts to be relevant are actually relevant.
Finally, we have the model’s F1 score. The F1 score is the harmonic mean of precision and recall. In other words, it finds the ideal balance between capturing all relevant documents and not giving you too many false positives. The max F1 score marks the threshold at which precision and recall are best balanced, which is why it’s anchored to the model’s distribution graph as its threshold for relevance. On the performance graph, the max F1 score is the highest point on the F1 line.
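Using the example figures from above (80% recall from the 80-of-100 example, and the 78% precision shown in the screenshot), here is a quick sketch of how the three metrics relate. The resulting F1 value is derived arithmetic, not a number taken from the screenshots.

```python
# Worked example using the figures discussed above.
true_positives = 80    # relevant documents the model correctly predicted relevant
false_negatives = 20   # relevant documents the model missed (100 relevant total)

recall = true_positives / (true_positives + false_negatives)   # 0.80

precision = 0.78       # from the screenshot: 78% of predicted-relevant docs are relevant

f1 = 2 * precision * recall / (precision + recall)
print(f"recall={recall:.0%}, precision={precision:.0%}, F1={f1:.0%}")
# recall=80%, precision=78%, F1=79%
```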
Setting a threshold for your model
With an understanding of recall, precision, and F1 scores, let’s return to the green line on the distribution graph. If we line our green flag up with the max F1 score, we have set our threshold at the point at which recall and precision are optimally balanced.
Setting the threshold above or below this prediction score involves a tradeoff between precision and recall. To understand what this means, imagine dragging the green flag to the right.
The green line on the performance graph will follow. First, look at what happens to the model’s precision score as we go further right. The precision score is represented by the blue line. As we increase our threshold, the precision score of our model goes up.
This means that documents to the right of this threshold (accessed by clicking the blue document icon on the distribution graph) are more likely to actually be relevant than documents to the left of it. In other words, there’s a smaller chance of false positives the further right we go. But let’s now add the recall line into the graph. The recall score is represented by the purple line. You’ll notice that as we drag the threshold further to the right, the model’s recall score decreases.
This means that there’s a lower chance that we’re capturing all of the relevant documents in our project. In other words, it’s become more likely that we’re missing some relevant documents in our search. Taken together, this means that moving our threshold to the right of the F1 score will increase our chances that the documents we look at are truly relevant, but it will decrease our chances of finding all the relevant documents in our project.
Alternatively, if we move the threshold to the left, we decrease our chances of only seeing relevant documents, but we increase our chances of finding all the relevant documents in our project.
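The sketch below illustrates this tradeoff on a small set of hypothetical scored documents: raising the threshold increases precision but lowers recall, and lowering it does the opposite.

```python
# Hypothetical documents: (prediction_score, actually_relevant)
docs = [(95, True), (90, True), (85, True), (72, False), (68, True),
        (55, False), (48, True), (40, False), (25, False), (12, False)]

def precision_recall(threshold: int) -> tuple[float, float]:
    """Precision and recall if every doc at or above the threshold is treated as relevant."""
    predicted_relevant = [(s, rel) for s, rel in docs if s >= threshold]
    true_positives = sum(1 for _, rel in predicted_relevant if rel)
    total_relevant = sum(1 for _, rel in docs if rel)
    precision = true_positives / len(predicted_relevant) if predicted_relevant else 0.0
    recall = true_positives / total_relevant
    return precision, recall

for threshold in (50, 80):
    p, r = precision_recall(threshold)
    print(f"threshold {threshold}: precision={p:.0%}, recall={r:.0%}")
# threshold 50: precision=67%, recall=80%
# threshold 80: precision=100%, recall=60%
```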
The performance graph is accompanied by a table displaying the number of documents that would be predicted relevant/irrelevant for any given F1 score, as well as how many reviewed documents fall on either side of the F1 score. For example, moving the threshold to the left increases our recall, meaning that more documents reviewed as relevant are likely to be predicted relevant by the model. This can be seen by the increasing value in the Predicted Relevant-Reviewed Relevant cell. However, moving the threshold to the left also decreases our precision, meaning that fewer of the documents the model predicts to be relevant will have been reviewed as relevant. This can be seen by the increasing value in the Predicted Relevant-Reviewed Irrelevant cell.
You have the option of generating either basic or rigorous performance metrics for your model. Which metrics are shown for your model depends on which documents from your model’s holdout set are being used to generate them. The next section will discuss the holdout set.
Holdout set
Below your performance graphs is your holdout set. Your holdout set is the set of documents your model uses to generate the performance metrics we just looked at. There are two types of performance metrics that can be generated for your predictive coding model: basic and rigorous performance metrics. For the most rigorously calculated performance metrics, choose Rigorous.
Please read our documentation for more information on generating rigorous performance metrics. This article will discuss holdout sets for generating basic performance metrics.
In the case of the model above, we have reviewed 18,340 documents for our holdout set. The team rated 439 of those documents as relevant, and the rest as irrelevant. Our model made its own predictions about the relevance of those 18,340 documents, and then compared its predictions to how the team rated them. If you remember, our model’s recall score was 80%. That means that, of the 439 holdout documents that the team rated either Warm or Hot, the model captured 80% of them, correctly predicting that they were relevant. Conversely, 20% of the documents that the team rated either Warm or Hot were incorrectly predicted to be Cold by the model. Next, we can think about the model’s precision score. The precision score for this model is 78%. That means that, of all the holdout documents that the model predicted were either Warm or Hot, 78% of them actually were rated Warm or Hot by the team. Conversely, 22% of the holdout documents that the model predicted to be either Warm or Hot were actually rated Cold by the team. Documents in your holdout set always remain in your holdout set, though the holdout set can grow as you review more documents.
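As a rough back-of-the-envelope check of those figures (the exact counts are not shown in this article, so the numbers below are approximate derivations, not values reported by Everlaw):

```python
# Approximate arithmetic derived from the example figures above.
relevant_in_holdout = 439
recall = 0.80
precision = 0.78

correctly_predicted_relevant = round(relevant_in_holdout * recall)          # ~351
predicted_relevant_total = round(correctly_predicted_relevant / precision)  # ~450
false_positives = predicted_relevant_total - correctly_predicted_relevant   # ~99

print(correctly_predicted_relevant, predicted_relevant_total, false_positives)
# 351 450 99
```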
If your holdout set is insufficiently reviewed, this means that the model doesn’t have enough holdout documents deemed relevant or irrelevant by your team to give a good sense of its historical performance based on a consistent set of documents.
In the case of an insufficiently reviewed holdout set, the model will use a random sample of reviewed documents to generate performance metrics. Although these initial basic performance metrics give you a general sense of your model’s performance, they are not based on a consistent set of documents, because the sampled documents used to generate them change at every model update. Reviewing holdout set documents, however, allows you to generate performance metrics from a consistent set of documents over time and accordingly improve the accuracy of these metrics. In order to generate performance metrics based on reviewed holdout set documents, your team needs to have reviewed at least 200 qualified holdout documents, with at least 50 documents deemed relevant and 50 documents deemed irrelevant. To meet the holdout threshold, reviewed holdout set documents must have sufficient text, be unique (e.g., duplicates of a reviewed document that are coded the same are only counted once), and not be in conflict (e.g., emails coded irrelevant in the same thread as emails coded relevant are not counted as qualified reviewed documents).
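A minimal sketch of the 200/50/50 threshold described above (the checks for sufficient text, uniqueness, and conflicts are assumed to have already been applied to the counts passed in):

```python
def holdout_meets_threshold(qualified_relevant: int, qualified_irrelevant: int) -> bool:
    """True if reviewed, qualified holdout documents are enough to generate basic
    performance metrics: at least 200 total, with at least 50 on each side."""
    total = qualified_relevant + qualified_irrelevant
    return total >= 200 and qualified_relevant >= 50 and qualified_irrelevant >= 50

print(holdout_meets_threshold(60, 150))   # True  (210 total, both sides >= 50)
print(holdout_meets_threshold(45, 300))   # False (fewer than 50 reviewed relevant)
```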
To review documents from your holdout set, click the blue number under Unreviewed. Additionally, your holdout set may be sufficiently reviewed to generate performance metrics, but by reviewing all unreviewed documents in your holdout set, you can improve the accuracy of the metrics. Learn more about basic and rigorous performance metrics here.
Historical performance
To see your model’s performance over time, check your model’s historical performance graph by selecting the “Performance” graph view under the History section. Similar to the performance graph discussed above, this graph contains lines representing the recall, precision, and F1 scores of your model. In this graph, however, the x-axis is a timeline.
After you have sufficiently reviewed your holdout set to generate basic performance metrics for your model, these metrics will always be generated from the same core set of holdout documents. This means that the historical performance metrics come from a uniform comparison at each time point. If your model is generating rigorous performance metrics, historical performance will be calculated on the lowest contiguous set of holdout documents at that point in time. You can view the number of documents contributing to your model’s performance statistics over time by choosing the Holdout Size graph view option under Performance History.
Training
This section of the predictive coding page displays your model’s training data. Learn more about specific steps you can take to improve your model’s performance.
Reviewing the documents in your project according to your model's criteria teaches the model which documents should be considered relevant or irrelevant. As a reminder, at least 200 qualified documents need to be reviewed in order for your model to begin generating predictions: at least 50 of these documents need to be reviewed as relevant, and at least 50 as irrelevant. These 200 qualified documents ensure that your model has enough training to begin making predictions. Learn more about kicking off your predictive coding model here.
In the center of the circle, you can see how many documents have been reviewed in the project thus far, according to the model’s “reviewed” criteria. The red section of the circle represents the number of documents that the team has reviewed as relevant. The blue section represents the number of documents the team has reviewed as irrelevant. The gray section represents the number of documents that remain unreviewed by the team.
Underneath your model’s data, you can find sections for excluded & ineligible, training sets, and weighted terms. The excluded & ineligible section will display the number of excluded documents on the left and the number of documents ineligible for your model on the right. Excluded and ineligible documents are not used to train the model or evaluate its performance and do not receive model prediction scores. Excluded documents are documents that were specified to be excluded from the model. Ineligible documents are documents without identifiable text, or with too little text to generate reliable predictions. Note that the total number of excluded documents contributes to the total number of ineligible documents because excluded documents are not eligible for training your model.
Training sets and weighted terms help improve your model’s performance. To learn more about these and other methods for improving your model’s performance, please see this support article.
Coverage
Underneath your model’s training sets is the model’s coverage graph.
Training coverage is a measure of how well your project's documents are represented in the training set. Poorly covered documents are ones that don’t share many features with the documents the model was trained on. For example, if a document has many words that the model has never seen before in training, it will be considered poorly covered, and will receive a low coverage score. The y-axis of the coverage graph is how well covered a document is given the model’s training. The x-axis of the graph is the model’s prediction of how likely the document is to be relevant.
The opacity of each pixel in the coverage grid indicates the number of documents occupying that spot on the graph. The darker a pixel is, the more documents can be found at that spot in the graph.
The color of each pixel is determined simply by the prediction score: pixels with lower prediction scores are more blue, and pixels with higher prediction scores are more red. Because color mirrors the x-axis position, no information is lost by viewing the graph in grayscale.
The documents in your project are plotted based on their coverage and prediction scores. Documents in the top right quadrant are those that the model predicts are likely to be relevant, and that are well covered given the model’s training. Documents in the bottom right quadrant are those that the model similarly predicts are likely to be relevant, but are not well covered given the model’s training. To see any of these documents, click and drag to select the region, and then click Review.
Updates
At the very bottom of your model’s page, you’ll see the update status of your model.
On the left, you’ll see when your model was last updated. Models are automatically updated approximately once every 24-48 hours. Your model should be queued for update at roughly the same time each update cycle, since models are queued by time of model creation.
If your model has a scheduled update and is not currently running, users with Admin permissions on prediction models in the project can manually update all models in your project by clicking the "Update model" button. This will prioritize your model to be updated, pending the completion of currently updating models across the platform. Note that this will not result in an instant update.