Inactive Holdout Set

Your holdout set is inactive.



Everlaw’s predictive coding system learns from user review decisions to predict how the remaining, unreviewed documents in a case might be evaluated. Documents that are likely to be reviewed in a way that satisfies the user-defined criteria for relevance are considered relevant. Documents are scored on a 0-100 scale, with scores closer to 100 indicating a higher likelihood of relevance. Along with the predictions, Everlaw also provides various performance metrics to help you gauge how accurate the predictions are. These metrics are displayed here.


If you want to brush up on predictive coding fundamentals, read our beginner’s guide.


What are the metrics based on?

To evaluate the performance of a model, the system sets aside a subset of documents. It then compares its prediction scores for documents in the set against any actual user evaluations -- which it assumes to be the ground truth -- to estimate its performance. For example, if a document is predicted to be relevant, but was human-reviewed in a way that does not satisfy the model’s criteria for relevance, that counts as a strike against the model.


Though prediction scores are given on a 0-100 scale, documents must be grouped into binary relevant/irrelevant categories for the purpose of calculating performance metrics. The boundary between relevant and irrelevant documents is left up to the user, and can be set by dragging the green line in the distribution graph to the desired score. The metrics will change based on the relevance boundary that is being used.
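The grouping described above can be sketched in a few lines of Python. This is only an illustration, not Everlaw's implementation: the function name `confusion_counts`, the `(score, reviewed_relevant)` input shape, and the choice to treat a score exactly on the boundary as relevant are all assumptions.

```python
def confusion_counts(scored_docs, boundary):
    """Group documents into binary outcomes for a given relevance boundary.

    scored_docs: iterable of (score, reviewed_relevant) pairs, where score is
    the 0-100 prediction and reviewed_relevant is the human review decision.
    Assumption: a score equal to the boundary counts as predicted relevant.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, reviewed_relevant in scored_docs:
        predicted_relevant = score >= boundary
        if predicted_relevant and reviewed_relevant:
            counts["tp"] += 1  # predicted relevant, reviewed relevant
        elif predicted_relevant and not reviewed_relevant:
            counts["fp"] += 1  # a "strike against the model"
        elif reviewed_relevant:
            counts["fn"] += 1  # relevant document scored below the boundary
        else:
            counts["tn"] += 1  # irrelevant document scored below the boundary
    return counts


docs = [(90, True), (80, False), (40, True), (10, False)]
print(confusion_counts(docs, 70))
print(confusion_counts(docs, 30))
```

Moving the boundary from 70 to 30 shifts the document scored 40 from a missed relevant document to a correctly predicted one, which is why the metrics change as the boundary moves.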

Which set of documents is being used to generate the metrics?

A randomly sampled 5% of the training documents is being used to generate the performance metrics. This randomly sampled set changes each time the model is updated.


It is generally preferred to use the holdout set to generate performance metrics. The holdout set comprises 5% of the total case documents. Because these documents are not used to train models, and do not change from update to update, using the holdout set reduces potential bias and enables reporting on historical performance.


The holdout set is currently not being used because your team has reviewed an insufficient number of documents from the set. To activate the holdout set, review more documents from the set.  
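To make the idea of a fixed holdout set concrete, here is a minimal sketch of setting aside 5% of documents so that they are excluded from training and stay the same across model updates. The function name `split_holdout`, the document IDs, and the use of a fixed random seed are hypothetical; the article does not describe how Everlaw actually draws or stores the set.

```python
import random


def split_holdout(doc_ids, fraction=0.05, seed=0):
    """Reserve a fixed random fraction of documents as a holdout set.

    Assumption: a fixed seed is used so the same documents are held out on
    every call, mirroring a holdout set that does not change between updates.
    """
    rng = random.Random(seed)
    ids = sorted(doc_ids)  # sort so the sample does not depend on input order
    k = round(len(ids) * fraction)
    holdout = set(rng.sample(ids, k))
    training_pool = [d for d in ids if d not in holdout]
    return holdout, training_pool


holdout, training_pool = split_holdout(range(1000))
print(len(holdout), len(training_pool))
```

Because the held-out documents never appear in the training pool, metrics computed on them are not inflated by the model having already seen those review decisions.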


What are the metrics measuring?
The performance metrics are:

• Precision: This is an estimate of how often documents above the relevance boundary are correctly classified as relevant. For example, let’s assume that the relevance boundary is 70. There are 100 documents that have prediction scores above this boundary, and are therefore predicted to be relevant. However, only 50 of these documents are reviewed as relevant, yielding a precision metric of 50%.

• Recall: This is an estimate of how often truly relevant documents are scored above the relevance boundary. For example, let’s assume that the relevance boundary is 70. There are a total of 100 relevant documents, but only 20 of them have prediction scores above 70. This will yield a recall metric of 20%.

• F1: This is the harmonic mean of the precision and recall metrics. It is used to gauge the overall performance of a model. In general, there is a tradeoff between precision and recall, and the F1 takes both into account.
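The arithmetic behind these metrics can be written out as a short sketch. The function names are illustrative, and F1 is computed with its standard definition as the harmonic mean of precision and recall. The precision and recall figures reuse the two worked examples above; combining them into a single F1 value is purely for illustration, since the two examples describe separate scenarios.

```python
def precision(true_pos, false_pos):
    """Share of documents above the boundary that were reviewed as relevant."""
    predicted = true_pos + false_pos
    return true_pos / predicted if predicted else 0.0


def recall(true_pos, false_neg):
    """Share of truly relevant documents that scored above the boundary."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Precision example: 100 documents above the boundary, 50 reviewed as relevant.
p = precision(true_pos=50, false_pos=50)
# Recall example: 100 relevant documents, 20 scored above the boundary.
r = recall(true_pos=20, false_neg=80)

print(f"precision={p:.0%} recall={r:.0%} f1={f1(p, r):.1%}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot achieve a high F1 by excelling at only one of precision or recall.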


By default, the system sets the relevance boundary to the score that results in the highest F1, that is, the score with the best balance of precision and recall. This is shown as the green line in the graph to the left. Depending on your needs, you may prefer this best balance, or you may prefer to favor precision or recall over the other. (For instance, if you are trying to find the needle in the haystack, you may prefer higher recall at the expense of lower precision.)
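One way to picture the default boundary is as a simple search over every integer score from 0 to 100, keeping the one with the highest F1. The sketch below makes several assumptions that the article does not confirm: the function name `best_f1_boundary`, integer-only boundaries, treating a score equal to the boundary as relevant, and breaking ties in favor of the lowest boundary.

```python
def best_f1_boundary(scored_docs):
    """Return (boundary, f1) for the integer boundary with the highest F1.

    scored_docs: list of (score, reviewed_relevant) pairs.
    Assumption: ties are broken in favor of the lowest boundary.
    """
    best_boundary, best_f1 = 0, -1.0
    for boundary in range(0, 101):
        tp = sum(1 for s, rel in scored_docs if s >= boundary and rel)
        fp = sum(1 for s, rel in scored_docs if s >= boundary and not rel)
        fn = sum(1 for s, rel in scored_docs if s < boundary and rel)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        score = 2 * p * r / (p + r) if (p + r) else 0.0
        if score > best_f1:  # strict comparison keeps the lowest tied boundary
            best_boundary, best_f1 = boundary, score
    return best_boundary, best_f1


docs = [(95, True), (85, True), (75, False), (60, True), (30, False), (10, False)]
boundary, score = best_f1_boundary(docs)
print(boundary, round(score, 3))
```

A reviewer who cares more about recall could instead pick the lowest boundary that still meets a minimum precision, which is the kind of manual adjustment that dragging the line allows.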
