Interpreting Your Predictive Coding Model – Knowledge Base

Once you have created your predictive coding model and reviewed enough documents to generate initial prediction scores and performance metrics, it’s time to interpret your results.

This article walks you through each section of an active predictive coding model page and explains how to interpret it. It is geared toward users who are ready to use the results of their model to start leveraging predictive coding in their review. Use this article to learn how to:

Prioritize action items
Understand the distribution of your prediction scores
Interpret your model's performance metrics
Interpret and improve your model's coverage
Queue your model for an update or freeze it

To access any of your predictive coding models, go to Document Analytics > Predictive coding.

Your predictive coding model page is separated into five sections: action items, results, performance, training, and updates. This article describes each of them.

Requirements

If you have Admin permissions on Predictive Coding, created the model, or had the model shared with you with any level of permission, you can access the model page.

Action Items

The action items section identifies key document sets to help you jump into using your model. This section is only visible if the model is active.

Prioritize

First are prioritized documents. Prioritized documents are those that the model predicts to be relevant based on your team's past reviews but have not yet been reviewed.

The documents in the Prioritize action item have a prediction score that is equal to or above the cutoff established by your model’s current max F1 score (i.e., the location of the purple flag on the distribution graph).

To get started with a review of these documents, select Review. This takes you to a results table of these prioritized documents.

Conflicts

Next are any conflicts the model has identified. Conflicts are documents for which the model's prediction for the document's relevance is inconsistent with how the team reviewed the document.

Only reviewed documents can be conflicts. A document counts as reviewed when the team has coded or rated it according to the model’s criteria, so a document that has not yet been reviewed is never captured in this section.

On the left is the number of reviewed documents that the model predicts to be relevant, but that have been reviewed as irrelevant by the team.

The model considers reviewed documents to be predicted relevant if the document’s prediction score is equal to or above the cutoff established by your model’s max F1 score.

On the right is the number of reviewed documents that the model currently predicts to be irrelevant, but that the team reviewed as relevant. The model considers reviewed documents to be predicted irrelevant here if the document’s associated prediction score is below the cutoff established by your model’s current max F1 score (i.e., the location of the purple flag on the distribution graph).
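
As a rough illustration of the two conflict counts, the sketch below applies the rule described above to a small, made-up list of reviewed documents. The field names, the find_conflicts helper, and the sample cutoff of 70 are assumptions for illustration; they are not part of the platform's interface.

```python
from typing import NamedTuple


class ReviewedDoc(NamedTuple):
    doc_id: str
    prediction_score: int    # 0-100, as shown on the distribution graph
    reviewed_relevant: bool  # how the team coded the document


def find_conflicts(docs, cutoff):
    """Split reviewed documents into the two conflict groups.

    A document is treated as predicted relevant when its score is at or
    above the cutoff (the max F1 threshold), and predicted irrelevant
    when its score is below it.
    """
    predicted_relevant_reviewed_irrelevant = []
    predicted_irrelevant_reviewed_relevant = []
    for doc in docs:
        predicted_relevant = doc.prediction_score >= cutoff
        if predicted_relevant and not doc.reviewed_relevant:
            predicted_relevant_reviewed_irrelevant.append(doc.doc_id)
        elif not predicted_relevant and doc.reviewed_relevant:
            predicted_irrelevant_reviewed_relevant.append(doc.doc_id)
    return predicted_relevant_reviewed_irrelevant, predicted_irrelevant_reviewed_relevant


# Hypothetical reviewed documents; the cutoff of 70 is an assumed value.
sample = [
    ReviewedDoc("A", 92, True),
    ReviewedDoc("B", 85, False),
    ReviewedDoc("C", 70, False),
    ReviewedDoc("D", 40, True),
    ReviewedDoc("E", 10, False),
]
left, right = find_conflicts(sample, cutoff=70)
print(left)   # conflicts shown on the left of the Conflicts action item
print(right)  # conflicts shown on the right
```

Documents that agree with the model (A and E in this sample) appear in neither list.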

Assessing conflicts can help:

Improve your model's performance metrics. Read more in our article on improving your prediction model's performance.
Run quality control on reviewers' decisions. Learn more in our article on leveraging predictive coding for prioritization and quality control.

To access the results table of documents with conflicts, select Review under either set of conflicting documents.

Improve Predictions

The Improve Predictions action item identifies documents that have a coverage score of 20% or below as documents that are not well covered.

Coverage is a representation of how well the model understands the features of a document. For a document to be well covered, the model needs to have been trained on documents that share similar features to it. If a document is full of features that the model has never seen before, the document is not considered well covered. For example, if the word “liability” shows up in a document 100 times, but the model has never been trained on a document with the word “liability” in it, it won’t be able to predict whether that document is relevant as well as a model that has been trained on documents with “liability” in it. The documents in your Improve Predictions section are those that have features the model hasn’t seen before.

Select Review to access a results table of documents that contain unfamiliar features. This can help you improve the model's understanding of these unfamiliar features.
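
This article does not describe how coverage scores are calculated, so the following is only a toy illustration of the idea, not the platform's method. It assumes coverage can be approximated as the share of a document's word occurrences that also appear in the training vocabulary; the function name and sample texts are hypothetical.

```python
def toy_coverage(document_text, training_vocabulary):
    """Return the fraction of word occurrences seen during training (0.0-1.0)."""
    words = document_text.lower().split()
    if not words:
        return 0.0
    seen = sum(1 for word in words if word in training_vocabulary)
    return seen / len(words)


# Vocabulary built from (hypothetical) documents the model was trained on.
training_vocabulary = {"contract", "payment", "invoice", "the", "for"}

well_covered = "the invoice for the contract payment"
poorly_covered = "liability liability liability indemnity for liability"

print(toy_coverage(well_covered, training_vocabulary))
print(toy_coverage(poorly_covered, training_vocabulary))
```

Under this toy measure, only one of the six word occurrences in the second document is familiar, so it would fall below the 20% line used by the Improve Predictions action item.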

Note

All the documents in the Action items section are captured from the most recent update of your model. When your model updates, the numbers and the documents within these action items can update.

Results

The results section displays a distribution graph that shows the documents in your project along a scale of predicted relevance. Each bar represents the number of documents that have a specific prediction score. The higher the prediction score of a document, the more likely the model predicts that document to be relevant.

To the far left of the distribution graph are the documents that have a prediction score of 0. The model predicts that these documents are very unlikely to be relevant.
To the far right on the distribution graph are the documents that have a prediction score of 100. The model predicts that these documents are very likely to be relevant.
The purple flag on the distribution graph represents your model’s max F1 score. F1 scores are discussed in more depth in the performance metrics section, but you can think of the max F1 score as your model’s threshold for relevance. Documents that fall anywhere to the right of this line (higher prediction scores) are considered relevant by your model, while documents that fall to the left of this line are considered irrelevant.
Note: A document that has a prediction score of 100 has a higher predicted likelihood of being relevant than a document with a prediction score of 85, but the document with a prediction score of 100 is not predicted to be more relevant than the document with the lower prediction score. Relative prediction scores correspond to relative likelihoods of relevance, not necessarily relevance itself.

In the top right corner of the distribution graph are Reviewed and Unreviewed toggles.

If only Reviewed is selected, your graph will only show the prediction scores of documents that have already been reviewed.
If only Unreviewed is selected, the graph only shows the distribution of documents that have not yet been reviewed.
If both Reviewed and Unreviewed are selected, the graph shows you the predicted relevance of all documents in your project, with Reviewed and Unreviewed documents stacked on top of each other. Reviewed documents are stacked on top and unreviewed documents are underneath.

Interpret your distribution

You can use the distribution graph to quickly access a results table with documents above or below any specific prediction score. To do so, use the moveable green flag.

To access a results table of all the documents that your model predicts to be relevant, drag the flag so that it overlaps the F1 score line. Then select the number to the right.

Tip

Deselect Reviewed to only access the unreviewed documents. These will be the same as those in the Prioritize action item.
To review documents that are predicted to be very likely to be relevant, slide the green flag further to the right than the F1 flag and select the number on the right side of the flag.

This may mean that you skip over other documents that are relevant but have a lower prediction score. You can read more about this in the section below on performance metrics.

You can also use the distribution of scores to assess the quality of your model and how well it is being trained on reviewed documents.

A bimodal distribution (with high peaks on the left and right, and none in the middle) is an indicator that your model can clearly distinguish between relevant and irrelevant documents. This suggests that your reviewers are making consistent review decisions and sending a clear signal to the model about the features that characterize and distinguish relevant and irrelevant documents.

A distribution with a large peak toward the middle of the graph indicates that the review decisions that the model is learning from are less consistent. Instead of only predicting documents as highly likely to be relevant (high prediction scores) or highly unlikely to be relevant (low prediction scores), the model is predicting a large set of documents that are in the middle, suggesting that review decisions might go either way.

This is usually an indication that documents with similar characteristics are sometimes getting coded by reviewers as relevant, and sometimes as irrelevant. When the model comes across a document with those characteristics, it doesn't have clear training to predict the relevance of the document, so it assigns a score toward the middle, in between relevant and irrelevant.
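
One rough way to put a number on the shapes described above is to measure how much of the distribution sits in the middle of the score range. The sketch below is an illustration only: the 40 to 60 band and the sample scores are assumptions, not values used by the platform.

```python
def middle_share(scores, low=40, high=60):
    """Fraction of prediction scores that fall in the middle band [low, high]."""
    if not scores:
        return 0.0
    in_middle = sum(1 for score in scores if low <= score <= high)
    return in_middle / len(scores)


# Hypothetical score lists: one clearly separated, one clustered in the middle.
bimodal_scores = [2, 5, 8, 10, 12, 88, 90, 93, 95, 99]
middle_heavy_scores = [35, 42, 45, 48, 50, 52, 55, 58, 61, 70]

print(middle_share(bimodal_scores))
print(middle_share(middle_heavy_scores))
```

A share near zero corresponds to the bimodal shape, while a large share corresponds to the middle-heavy shape that suggests inconsistent review decisions.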

Read our article on improving your predictive coding model to learn how to improve your model's distribution.

Performance

The Performance section includes measures of recall, precision, and F1 scores. You can use these metrics to assess the quality of your model and help you identify action items to improve the metrics.

Let’s first go through what each metric means.

Recall is a measure of how many relevant documents the model identified, compared to how many relevant documents actually exist in the project. Said another way, recall answers, "Of the documents reviewed as relevant, what % are also predicted by the model to be relevant?" Recall helps us understand whether the model cast a wide enough net to capture relevant documents.

For example, if a model accurately identified 80 relevant documents, and 100 relevant documents actually exist in the project (as coded by reviewers), the model’s recall score would be 80%. We can see an 80% recall score represented in the graph above. This means the model is returning 80% of the relevant documents in the project.

Precision is a measure of how many of the documents identified as relevant by the model are actually relevant, as coded by reviewers. Where recall asks how much of the relevant material the model found, precision asks how much of what the model found is relevant. Generally speaking, it answers the question, "How many documents are actually relevant, as a % of documents predicted by the model to be relevant?" If our model identified 100 documents as relevant, but only 15 of those 100 were actually relevant, our model’s precision score would be 15%. In the above screenshot, our precision score is 78%. That means that 78% of the documents that the model predicts to be relevant are actually relevant.
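
The two worked examples above can be reproduced from the underlying counts. This is a minimal sketch; the function and parameter names are illustrative.

```python
def recall(correctly_predicted_relevant, reviewed_relevant_total):
    """Share of documents reviewed as relevant that the model also predicted relevant."""
    return correctly_predicted_relevant / reviewed_relevant_total


def precision(correctly_predicted_relevant, predicted_relevant_total):
    """Share of documents predicted relevant that were reviewed as relevant."""
    return correctly_predicted_relevant / predicted_relevant_total


# Recall example: the model found 80 of the 100 documents reviewed as relevant.
print(f"recall = {recall(80, 100):.0%}")
# Precision example: 15 of the 100 documents predicted relevant were actually relevant.
print(f"precision = {precision(15, 100):.0%}")
```

Both metrics share the same numerator (documents that are both predicted relevant and reviewed as relevant); only the denominator differs.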

The F1 score is the harmonic mean of precision and recall, so it is high only when both metrics are high. In other words, it balances capturing all relevant documents against giving you too many false positives.

The max F1 score is the point at which precision and recall are best balanced, which is why it’s anchored to the model’s distribution graph as its threshold for relevance. On the performance graph, the max F1 score is the highest point on the F1 line.
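
To make this concrete, the sketch below computes F1 using its standard definition (the harmonic mean of precision and recall), then sweeps candidate cutoffs over a small, made-up set of reviewed documents to find the one with the highest F1. The data and helper names are assumptions for illustration, and the platform's own calculation may differ in its details.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_at(cutoff, scored_docs):
    """Precision, recall, and F1 when scores >= cutoff count as predicted relevant."""
    tp = sum(1 for score, relevant in scored_docs if score >= cutoff and relevant)
    fp = sum(1 for score, relevant in scored_docs if score >= cutoff and not relevant)
    fn = sum(1 for score, relevant in scored_docs if score < cutoff and relevant)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# (prediction score, reviewed as relevant?) for hypothetical reviewed documents.
scored_docs = [(95, True), (90, True), (80, False), (75, True),
               (60, False), (40, True), (30, False), (10, False)]

# Try every cutoff from 0 to 100 and keep the first one with the highest F1.
best_cutoff = max(range(0, 101), key=lambda c: metrics_at(c, scored_docs)[2])
best_precision, best_recall, best_f1 = metrics_at(best_cutoff, scored_docs)
print(best_cutoff, best_precision, best_recall, best_f1)
```

In this sample, the cutoff with the highest F1 has full recall but less than full precision, which shows that the max F1 point is a balance between the two metrics rather than the maximum of each.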

You have the option to generate either basic or rigorous performance metrics for your model. Which metrics are shown for your model depends on which documents from your model’s holdout set are being used to generate them. The next section discusses the holdout set.

Set a threshold for your model

With an understanding of recall, precision, and F1 scores, let’s return to the green flag on the distribution graph. If we line our green flag up with the max F1 score, we have set our threshold at the point at which recall and precision are optimally balanced.

Any range of documents above or below this prediction score will see a tradeoff in either precision or recall.

To understand what this means, imagine dragging the green flag to the right. The green line on the performance graph follows. First, look at what happens to the model’s precision score as we go further right. The precision score is represented by the blue line. As we increase our threshold, the precision score of our model goes up. In the above image where the green line is positioned in alignment with the max F1 score, the Precision is .60. When the green flag is moved to the right so that the relevance cutoff is at 96, the Precision score goes up to .70.

This means that documents to the right of this threshold (accessed by selecting the blue document icon on the distribution graph) are more likely to actually be relevant than documents to the left of it. In other words, there’s a smaller chance of false positives the further right we go.

Let’s take a look at the recall. The recall score is represented by the purple line. As we drag the threshold further to the right, the model’s recall score decreases. In the topmost image, the recall was .64. At this new cutoff, it has dropped to .40.

This means that there’s a lower chance that we’re capturing all of the relevant documents in our project. In other words, it’s become more likely that we’re missing some relevant documents in our search.

Taken together, this means that moving the threshold to the right of the F1 score increases the chances that the documents we look at are truly relevant, but it decreases our chances of finding all the relevant documents in our project.

Alternatively, if we move the threshold to the left, we decrease our chances of only seeing relevant documents, but we increase our chances of finding all the relevant documents in our project.

The performance graph is accompanied by a table displaying the number of documents predicted relevant/irrelevant for any given F1 score, as well as how many reviewed documents fall on either side of the F1 score.

Moving the threshold to the left:

Increases recall, meaning that more documents reviewed as relevant are likely to be predicted relevant by the model. This can be seen by the increasing value in the Predicted Relevant-Reviewed Relevant cell.
Decreases precision, meaning that a smaller share of the documents the model predicts to be relevant will have been reviewed as relevant. This can be seen by the increasing value in the Predicted Relevant-Reviewed Irrelevant cell.
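
The effect on the table cells can be sketched with a few hypothetical reviewed documents. The cell names below mirror the table labels, while the data, the two thresholds, and the confusion_cells helper are assumptions for illustration.

```python
def confusion_cells(scored_docs, threshold):
    """Count reviewed documents in each predicted/reviewed cell at a threshold."""
    cells = {
        "predicted_relevant_reviewed_relevant": 0,
        "predicted_relevant_reviewed_irrelevant": 0,
        "predicted_irrelevant_reviewed_relevant": 0,
        "predicted_irrelevant_reviewed_irrelevant": 0,
    }
    for score, reviewed_relevant in scored_docs:
        predicted = "relevant" if score >= threshold else "irrelevant"
        reviewed = "relevant" if reviewed_relevant else "irrelevant"
        cells[f"predicted_{predicted}_reviewed_{reviewed}"] += 1
    return cells


# (prediction score, reviewed as relevant?) for hypothetical reviewed documents.
scored_docs = [(95, True), (90, True), (80, False), (75, True),
               (60, False), (40, True), (30, False), (10, False)]

# Compare a higher threshold (85) with one moved to the left (50).
for threshold in (85, 50):
    cells = confusion_cells(scored_docs, threshold)
    hits = cells["predicted_relevant_reviewed_relevant"]
    false_positives = cells["predicted_relevant_reviewed_irrelevant"]
    misses = cells["predicted_irrelevant_reviewed_relevant"]
    recall = hits / (hits + misses)
    precision = hits / (hits + false_positives)
    print(threshold, hits, false_positives, round(recall, 2), round(precision, 2))
```

Moving the threshold from 85 to 50 raises both Predicted Relevant cells in this sample: recall rises because more reviewed-relevant documents are captured, and precision falls because reviewed-irrelevant documents are captured as well.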

Holdout set

Your holdout set is the set of documents your model uses to generate the performance metrics. There are two types of performance metrics that can be generated for your predictive coding model: basic and rigorous performance metrics.

Please read our documentation for more information on generating rigorous performance metrics. This article discusses holdout sets for generating basic performance metrics.

In the case of the model above, we have reviewed 3,964 documents for our holdout set. The team reviewed 43 of those documents as relevant, and the rest as irrelevant.

Our model made its own predictions about the relevance of those reviewed documents, and then compared its predictions of the documents to how the team reviewed them. This model has a recall score of .64. That means that, of the 43 holdout documents that the team reviewed as relevant, the model captured 64% of them, correctly predicting that they were relevant. Conversely, 36% of the documents that the team reviewed as relevant were incorrectly predicted to be irrelevant by the model.

Next, we can think about the model’s precision score. The precision score for this model is .60. That means that, of all the holdout documents that the model predicted were relevant, 60% of them actually were reviewed as relevant by the team. Conversely, 40% of the holdout documents that the model predicted to be relevant were actually reviewed as irrelevant by the team.

Documents in your holdout set always remain in your holdout set, though the holdout set can grow as you review more documents.

If your holdout set is insufficiently reviewed, this means that the model doesn’t have enough holdout documents reviewed by your team to give a good sense of its historical performance based on a consistent set of documents.

In the case of an insufficiently reviewed holdout set, the model uses a random sample of reviewed documents to generate performance metrics. A note in the Holdout set section reads "Basic performance metrics have been generated based on a combination of any reviewed holdout set documents and a random sample of documents from the 'Reviewed' set. Basic performance metrics exclusively based on your holdout set, review more holdout set documents."
Although initial basic performance metrics give you a general sense of your model’s performance, a consistent set of documents is not being used to track model performance because the sampled documents used to generate these metrics change at every model update.

Reviewing holdout set documents allows you to generate performance metrics based on a consistent set of documents over time and accordingly improve the accuracy of these metrics.

To generate performance metrics based on reviewed holdout set documents, your team needs to have reviewed:

At least 200 qualified holdout documents
At least 50 of them reviewed as relevant
At least 50 of them reviewed as irrelevant.

To meet the holdout threshold, reviewed holdout set documents must:

Have sufficient text
Be unique: duplicates of a reviewed document that are coded the same are only counted once
Not be in conflict: emails that have been coded irrelevant in the same thread as emails coded relevant do not count as qualified reviewed documents
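
The three minimums listed above can be expressed as a simple check. This is a sketch under the assumption that every qualified reviewed holdout document is coded either relevant or irrelevant; the function name and sample counts are hypothetical.

```python
MIN_QUALIFIED_TOTAL = 200
MIN_RELEVANT = 50
MIN_IRRELEVANT = 50


def holdout_is_sufficient(qualified_relevant, qualified_irrelevant):
    """Check the documented minimums for a sufficiently reviewed holdout set.

    Both arguments count only qualified documents: those with sufficient
    text, de-duplicated, and not in conflict.
    """
    total = qualified_relevant + qualified_irrelevant
    return (
        total >= MIN_QUALIFIED_TOTAL
        and qualified_relevant >= MIN_RELEVANT
        and qualified_irrelevant >= MIN_IRRELEVANT
    )


print(holdout_is_sufficient(60, 150))  # 210 qualified, both minimums met
print(holdout_is_sufficient(30, 400))  # enough documents overall, too few relevant
print(holdout_is_sufficient(70, 90))   # both minimums met, but only 160 qualified
```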

To review documents from your holdout set, select the blue number or the Review button under Unreviewed.

Tip: Even if your holdout set is sufficiently reviewed to generate performance metrics, reviewing all unreviewed documents in your holdout set can improve the accuracy of the metrics. Learn more about basic and rigorous performance metrics here.

History

Performance: Like the performance graph discussed above, this graph contains lines representing the recall, precision, and F1 scores of your model. In this graph, the x-axis is a timeline. This lets you track these metrics over time.

As soon as your model starts generating performance statistics, this graph starts tracking the F1, precision, and recall scores over time.

After you have sufficiently reviewed your holdout set to generate basic performance metrics for your model, these metrics are always generated from the same core set of holdout documents. This means that the historical performance metrics come from a uniform comparison at each time point.

If your model is generating rigorous performance metrics, historical performance is calculated on the lowest contiguous set of holdout documents at that point in time.

Holdout size: You can view the number of documents contributing to your model’s performance statistics over time by choosing the Holdout Size graph view option under Performance History.

Training

The Training section of the predictive coding page displays your model’s training data. Learn more about specific steps you can take to improve your model’s performance.

Model data

When you review the documents in your project according to your model's Reviewed criteria, it teaches the model the characteristics of documents that should be considered relevant or irrelevant. The Model data section shows the numbers of documents that have been reviewed, and how they have been reviewed, so far.

You can learn more about what counts as reviewed here.

In the center of the Reviewed circle, you can see how many documents have been reviewed in the project thus far, according to the model’s “reviewed” criteria.

The red section of the circle represents the number of documents that the team has reviewed as relevant.
The blue section represents the number of documents the team has reviewed as irrelevant.
The gray section represents the number of documents that remain unreviewed by the team.

To access a results table of any of these sets of documents, select the number next to the document button.

Training sets

Learn more about Training sets in our article on improving your predictive coding model.

Excluded and ineligible

Excluded and ineligible documents are not used to train the model or evaluate its performance and do not receive model prediction scores. The Excluded and ineligible section displays the number of excluded documents on the left and the number of documents ineligible for your model on the right.

Excluded documents are documents that were specified to be excluded from the model.
Ineligible documents are documents without identifiable text, or with too little text to generate reliable predictions.

Note: The total number of excluded documents contributes to the total number of ineligible documents because excluded documents are not eligible for training your model.

Coverage

Underneath your model’s training sets is the model’s coverage graph.

Training coverage is a measure of how well your project's documents are represented in the reviewed documents.

Poorly covered documents are ones that don’t share many features with the documents the model was trained on. For example, if a document has many words that the model has never seen before in training, it is considered poorly covered, and has a low coverage score.

The graph has three dimensions:

The y-axis is how well covered a document is given the model’s training.
The x-axis of the graph is the model’s prediction score, indicating how likely the document is to be relevant.
The opacity of each pixel in the coverage grid indicates the number of documents occupying that spot on the graph. The darker a pixel is, the more documents can be found at that spot in the graph.
Color is not a separate dimension: the color of each pixel is determined by its prediction score. Pixels with lower prediction scores are more blue and pixels with higher prediction scores are more red, so no information is lost by viewing the graph in grayscale.

The documents in your project are plotted based on their coverage and prediction scores:

Documents in the top right quadrant are those that the model predicts are likely to be relevant, and that are well covered given the model’s training.
Documents in the bottom right quadrant are those that the model similarly predicts are likely to be relevant, but are not well covered given the model’s training.

You can click and drag any region of the coverage graph. When you do, the number of documents in the selected region populates on the right.
To access a results table of these documents, select Review. Learn more about how you can use the coverage graph to improve your model in our article on improving your predictive coding model.

Updates

At the very bottom of your model’s page is the update status of your model.

On the left, you’ll see when your model was last updated. Models are automatically updated approximately once every 24 hours. Your model should be queued for update at roughly the same time each update cycle, since models are queued by time of model creation.

If your model has a scheduled update and is not currently running, users with Admin permissions on prediction models in the project can manually update all models in your project by selecting Update models. This prioritizes your model to be updated, pending the completion of currently updating models across the platform. Note that this will not result in an instant update.

Model Statuses

Here is a summary of the possible model statuses:

Idle: the model is currently idle. If the model is eligible for an update (e.g., additional documents were reviewed according to its criteria), it is automatically queued for update during the regularly scheduled daily update.
Queued for update: currently in the queue to update
Updating: in the process of updating
Frozen: model will not update

Freeze and unfreeze a model

Required permissions: Users with Admin permissions on predictive coding models

Freezing a model stops all updates on the model, even when activity on the project, such as reviewing additional documents, would typically trigger a model update. When a model is frozen:

Prediction scores for documents do not change
Performance metrics for the model do not change
You cannot edit a model, other than changing its name

Here are a couple of situations where freezing a model can be useful:

You are doing quality control (QC) work or reviewing a sample of unreviewed documents for validation and want the scores to remain consistent while review work is being applied to documents.
You want your generated prediction scores and performance metrics to remain unchanged after reaching a certain threshold dictated by a discovery agreement with the opposing party.

To freeze a predictive coding model:

Go to Document Analytics > Predictive Coding.
On the left of the page, select the name of the model that you plan to freeze. This takes you to a page with information about the model.
Scroll to the Updates section.
Under Update status, select Freeze model.

This opens a confirmation dialog with additional information about freezing a model.
To freeze the model, select Freeze. To go back, select Cancel.

When a model is frozen:

Prediction scores and performance metrics will not update, regardless of any new review work or other project activity
A banner at the top of the model page tells you the date and time that the model was frozen
The Model History section records the date and time the model was frozen (and unfrozen, when applicable) and the user who performed the action
The Update models button is disabled on that model's Updates section. Users with Admin permissions on predictive coding can still update models from the Updates section of any predictive coding model within the project that isn't frozen.
You cannot edit the predictive coding model’s criteria
The status of the model changes to Frozen

Note

Frozen predictive coding models can be selected when you are creating a new multi-matter model. Multi-matter models that are deployed using frozen models make use of the model's training as of the last model update before the model was frozen, and function like any other multi-matter model. When the model is unfrozen, and the model updates during the next scheduled update (or after triggering a manual update), the multi-matter model will also update based on updates to the underlying model.

To unfreeze a model, select Freeze model at the bottom of the model page, then select Unfreeze.

Once the model is unfrozen:

If the model is eligible for an update (e.g., additional documents were reviewed, uploaded or deleted since the model was frozen), it will automatically update during its regularly scheduled daily update. To update the model sooner, select Update models to manually update the model. This will update all models in the project.
The banner with information about the freeze disappears
Edit model is enabled
The Update models button is enabled
The status of the model changes from Frozen to Idle until the model is queued for updates

Tip

If you have reviewed most of the high-scoring documents and would like to validate your review progress, consider performing an elusion test, a method used to estimate how many relevant documents may still remain to be found. The test is performed by reviewing a sample of documents with low prediction scores, which the model deems less likely to be relevant. If many low-scoring documents turn out to be relevant, the model may need further training, the relevance criteria may need refinement, or you may need to continue your review with the next set of highly ranked documents.

Elusion testing is not a mandatory part of predictive coding, but it is a recommended tool for validating your model.

Interpreting the results of an elusion test can be performed in a variety of ways and requires a working knowledge of statistics. To discuss possible approaches with an Everlaw Predictive Coding expert, contact Everlaw Support.
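
As a very rough illustration of the arithmetic behind an elusion test, the sketch below computes a point estimate of the elusion rate from a reviewed sample and scales it to the size of the low-scoring set. The numbers and function names are hypothetical, and a point estimate alone ignores sampling error, which is one reason statistical guidance is recommended.

```python
def elusion_rate(relevant_in_sample, sample_size):
    """Share of sampled low-scoring documents that reviewers coded as relevant."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return relevant_in_sample / sample_size


def estimated_relevant_remaining(relevant_in_sample, sample_size, low_scoring_total):
    """Scale the sample's elusion rate to the whole low-scoring set (point estimate)."""
    return elusion_rate(relevant_in_sample, sample_size) * low_scoring_total


# Hypothetical test: 500 documents sampled from 20,000 low-scoring documents,
# with 5 of the sampled documents coded as relevant by reviewers.
rate = elusion_rate(5, 500)
remaining = estimated_relevant_remaining(5, 500, 20_000)
print(f"elusion rate: {rate:.1%}")
print(f"estimated relevant documents remaining: {remaining:.0f}")
```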
