Your holdout set is active.
Everlaw’s predictive coding system learns from user review decisions to predict how the remaining, unreviewed documents in a case would be evaluated. Documents that are likely to be reviewed as satisfying the user-defined criteria for relevance are considered relevant. Documents are scored on a 0-100 scale, with scores closer to 100 indicating a higher likelihood of relevance. Along with the predictions, Everlaw also provides various performance metrics to help you gauge how accurate the predictions are. These metrics are displayed here.
If you want to brush up on predictive coding fundamentals, read our beginner’s guide.
What are the metrics based on?
To evaluate the performance of a model, the system sets aside a subset of documents. It then compares its prediction scores for documents in this set against any actual user evaluations, which it treats as the ground truth, to estimate its performance. For example, if a document is predicted to be relevant but was human-reviewed in a way that does not satisfy the model’s criteria for relevance, that counts as a strike against the model.
Though prediction scores are given on a 0-100 scale, documents must be grouped into binary relevant/irrelevant categories for the purpose of calculating performance metrics. The boundary between relevant and irrelevant documents is left up to the user, and can be set by dragging the green line in the distribution graph to the desired score. The metrics will change based on the relevance boundary that is being used.
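As a sketch, this binary grouping amounts to a simple threshold test. The scores, document IDs, and the choice of treating the boundary score itself as relevant are all illustrative assumptions, not Everlaw's actual implementation:

```python
# Illustrative only: the scores and doc IDs are made up, and whether a
# score exactly at the boundary counts as relevant is an assumption.
scores = {"doc-1": 92, "doc-2": 41, "doc-3": 70, "doc-4": 15}
boundary = 70  # user-chosen relevance boundary on the 0-100 scale

# Documents at or above the boundary are grouped as predicted-relevant.
predicted_relevant = {doc for doc, s in scores.items() if s >= boundary}
```

Moving the boundary regroups the documents, which is why the metrics change as you drag the green line.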
Which set of documents is being used to generate the metrics?
The holdout set is used to generate the performance metrics. It comprises 5% of the total case documents, and is maintained by randomly sampling 5% of the documents from each upload into the database. These documents are never used to train prediction models, which helps eliminate possible bias in evaluating a model’s performance. Because the holdout set is active, we can also report historical performance data, displayed in the section below.
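A minimal sketch of how such a per-upload split might work. The function name and sampling mechanics are assumptions for illustration, not Everlaw's actual code:

```python
import random

def split_upload(doc_ids, holdout_fraction=0.05, rng=None):
    """Hypothetical per-upload split: randomly sample a fraction of the
    upload into the holdout set; the rest remain available for training."""
    rng = rng or random.Random()
    docs = list(doc_ids)
    k = round(len(docs) * holdout_fraction)
    holdout = set(rng.sample(docs, k))          # held out, never trained on
    training = [d for d in docs if d not in holdout]
    return holdout, training
```

Sampling independently from every upload keeps the holdout set representative of the whole database as it grows.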
To improve the system’s assessment of its own performance, review more documents in the holdout set. These reviews give the system more ground-truth datapoints with which to evaluate its accuracy. Keep in mind that improving the quality of the assessment is different from improving the predictions themselves. To learn more about how to improve the actual performance of the model, click here.
What are the metrics measuring?
The performance metrics are:
- Precision: This is an estimate of how often documents above the relevance boundary are correctly classified as relevant. For example, let’s assume that the relevance boundary is 70. There are 100 documents that have prediction scores that lie above this, and are therefore predicted to be relevant. However, only 50 of these documents are reviewed as relevant, yielding a precision metric of 50%.
- Recall: This is an estimate of how often truly relevant documents are scored above the relevance boundary. For example, let’s assume that the relevance boundary is 70. There are a total of 100 relevant documents, but only 20 of them have prediction scores above 70. This yields a recall metric of 20%.
- F1: This is the harmonic mean of the precision and recall metrics. It is used to gauge the overall performance of a model. In general, there is a tradeoff between precision and recall, and the F1 takes both into account.
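The three metrics above can be computed from two sets of document IDs. A sketch (the function and the example sets are illustrative, reusing the numbers from the precision example):

```python
def precision_recall_f1(predicted_relevant, actually_relevant):
    """Compute precision, recall, and F1 from sets of document IDs.

    predicted_relevant: docs scored above the relevance boundary.
    actually_relevant:  docs whose human review satisfies the relevance criteria.
    """
    true_positives = len(predicted_relevant & actually_relevant)
    precision = true_positives / len(predicted_relevant) if predicted_relevant else 0.0
    recall = true_positives / len(actually_relevant) if actually_relevant else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 100 docs predicted relevant, but only 50 of them reviewed as relevant.
predicted = {f"doc-{i}" for i in range(100)}
reviewed = {f"doc-{i}" for i in range(50)}
p, r, f = precision_recall_f1(predicted, reviewed)  # p == 0.5
```

In this example every reviewed-relevant document was also predicted relevant, so recall is 100% even though precision is only 50%.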
By default, the system sets the relevance boundary to the score that results in the highest F1, i.e., the score that has the best balance of precision and recall. This is shown as the green line in the graph to the left. Depending on your needs, you may prefer this best balance, or you may prefer to favor either precision or recall over the other. (For instance, if you are trying to find a needle in a haystack, you may prefer higher recall at the expense of lower precision.)
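Choosing the default boundary can be pictured as sweeping every candidate score and keeping the one with the highest F1 on the reviewed holdout documents. The scores and labels below are made up, and this brute-force sweep is a sketch of the idea, not Everlaw's implementation:

```python
# Hypothetical holdout documents: prediction scores and human-review labels.
scores = {"d1": 95, "d2": 80, "d3": 60, "d4": 40, "d5": 10}
reviewed_relevant = {"d1", "d2", "d3"}  # ground truth from human review

def f1_at(boundary):
    """F1 if all docs scoring at or above `boundary` are called relevant."""
    predicted = {d for d, s in scores.items() if s >= boundary}
    tp = len(predicted & reviewed_relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reviewed_relevant)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Sweep every integer score on the 0-100 scale; keep the best boundary.
best_boundary = max(range(101), key=f1_at)
```

To favor recall over precision instead, you would simply drag the boundary below this F1-optimal score.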