To view all of Everlaw's predictive coding-related content, please see our predictive coding section.
It is hard to give specific targets or guidelines for precision, recall, or F1 scores because every case is idiosyncratic. Targets will also depend on how predictive coding fits into a larger review workflow. For example, one review team might prioritize high precision and be willing to sacrifice recall, while another might want its model to capture as many potentially relevant documents as possible and be willing to sacrifice precision. Nevertheless, this article provides some general tips on how to improve different aspects of a model’s performance in Everlaw. Though these tips apply to both Basic and Rigorous predictive coding performance statistics, please see this help article for more information on improving Rigorous performance statistics.
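As a refresher on what these metrics measure, precision, recall, and F1 can be sketched in a few lines of Python. This is a minimal illustration with made-up document IDs, not Everlaw's implementation:

```python
def precision_recall_f1(predicted_relevant, actually_relevant):
    """Compute precision, recall, and F1 from two sets of document IDs."""
    true_positives = len(predicted_relevant & actually_relevant)
    precision = true_positives / len(predicted_relevant)   # of what we predicted, how much was right
    recall = true_positives / len(actually_relevant)       # of what was relevant, how much we found
    f1 = 2 * precision * recall / (precision + recall)     # harmonic mean of the two
    return precision, recall, f1

# Hypothetical example: the model predicts 10 documents as relevant,
# 8 of which are truly relevant, out of 16 relevant documents in total.
predicted = set(range(10))
relevant = set(range(2, 18))
p, r, f1 = precision_recall_f1(predicted, relevant)
# precision = 8/10 = 0.8, recall = 8/16 = 0.5
```

Because F1 is the harmonic mean, it rewards models that balance the two metrics rather than maximizing one at the other's expense.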
Here are our tips for building a good predictive coding model:
- Avoid rating/coding inconsistencies (for example, rating documents in the same context, such as dupes/near-dupes, attachment families, or email threads, differently). This makes the input to the model cleaner. Use the context panel to ensure consistent rating and coding across documents in the same document family.
- Work to reduce the bias of the training corpus. Make a conscious effort to review a broadly sampled set of documents from across the entire case.
- On the prediction page, create multiple training sets sampled at random from the case as a whole.
- Use the coverage graph to target specific documents to add to the training corpus.
- Highlight an area of the graph that captures low coverage scores, and click the “review” button on the right. Refine the search by adding unrated to the search criteria, then review all, or a subset, of the results (if there are many unrated documents, consider using the sample search option to draw a random sample of the search results for manual review). This improves the model’s understanding of the documents in the case, resulting in more accurate predictions.
- Use the cutoff prediction score as an anchor. The cutoff is the prediction score that maximizes the F1 value, and is indicated by the purple F1 flag on your distribution graph. Generally, lowering the cutoff improves recall at the expense of precision, while raising it improves precision at the expense of recall.
- For example, if you want a set of documents with a potentially better recall score than the set defined by the cutoff used for the performance metrics, drag the green flag on your distribution graph to select all documents above a lower cutoff. Dragging the green flag in the other direction would instead improve precision.
- Check the cutoff prediction score the model is using for relevance/irrelevance, as indicated by the purple F1 flag. An abnormally high or low cutoff usually indicates that something went wrong during training, and the resulting performance metrics will generally be misleading.
- By using the person parameter with the rating, coding, and category search terms when determining your reviewed and relevance criteria, you can build a model that only takes into account the review decisions of trusted reviewers or subject matter experts.
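The cutoff trade-off described in the tips above can be sketched in a few lines of Python. The scores and labels here are illustrative only, not actual Everlaw model output:

```python
def metrics_at_cutoff(docs, cutoff):
    """docs: list of (prediction_score, is_relevant) pairs.
    Documents scoring at or above the cutoff are predicted relevant."""
    predicted = [d for d in docs if d[0] >= cutoff]
    tp = sum(1 for score, rel in predicted if rel)
    total_relevant = sum(1 for _, rel in docs if rel)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / total_relevant if total_relevant else 0.0
    return precision, recall

# Hypothetical prediction scores: high scorers skew relevant, low scorers are mixed.
docs = [(95, True), (90, True), (80, True), (70, False),
        (60, True), (50, False), (40, True), (30, False)]

# Lower cutoff: captures every relevant document (higher recall), but admits noise.
low_p, low_r = metrics_at_cutoff(docs, 40)
# Higher cutoff: fewer false positives (higher precision), but misses relevant documents.
high_p, high_r = metrics_at_cutoff(docs, 80)
```

Running this, the low cutoff yields perfect recall with reduced precision, and the high cutoff yields perfect precision with reduced recall, mirroring the behavior of dragging the green flag in either direction.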