Background: Everlaw’s predictive coding system is built on a regression-based learning algorithm. First, a model is built from the featurized documents in a training set. The algorithm then applies the model to all documents in a case to generate predictions through a linear classification system (the predictions exist on a 0-100 point continuum).
In general, models in the case are updated once a day to reflect ongoing review work. By default, models automatically start generating predictions once a couple hundred documents have been reviewed (~200 or 5% of the case, whichever is smaller), with at least 50 satisfying the “relevant” criteria and 50 satisfying the “irrelevant” criteria for the given model. Because the system is a linear classifier, judgment about what prediction cutoff to use to define relevance/irrelevance is deferred to the user. This is in contrast to binary systems, which essentially give a thumbs up/thumbs down prediction rating to documents.
Everlaw's predictions are generated on a 0-100 continuum with “0” indicating documents least likely to be in the relevant set and “100” indicating documents most likely to be in the relevant set. A random subset of the coded documents in the training corpus is reserved for testing. To generate the estimated precision, recall, and F1 score, the rating and coding status of each document in this reserved subset is compared to the predicted value.
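To make the evaluation on the reserved subset concrete, here is a minimal sketch of how precision and recall can be computed from 0-100 prediction scores and reviewer labels at a given cutoff. This is an illustration of the general technique, not Everlaw's actual implementation; the document IDs and scores are made up.

```python
# Illustrative sketch (not Everlaw's actual code): scoring a held-out
# test set against a prediction cutoff. `predictions` maps doc IDs to
# 0-100 scores; `labels` records which docs a reviewer judged relevant.

def precision_recall(predictions, labels, cutoff):
    """Treat every doc scoring >= cutoff as predicted-relevant,
    then compare against the reviewer's labels."""
    predicted_relevant = {d for d, score in predictions.items() if score >= cutoff}
    actually_relevant = {d for d, rel in labels.items() if rel}

    if not predicted_relevant or not actually_relevant:
        return 0.0, 0.0
    true_positives = len(predicted_relevant & actually_relevant)
    precision = true_positives / len(predicted_relevant)  # how accurate
    recall = true_positives / len(actually_relevant)      # how thorough
    return precision, recall

# Hypothetical scores and labels for six held-out documents:
preds = {"d1": 95, "d2": 80, "d3": 60, "d4": 30, "d5": 10, "d6": 85}
labs = {"d1": True, "d2": True, "d3": False, "d4": True, "d5": False, "d6": True}
p, r = precision_recall(preds, labs, cutoff=50)
```

Raising the cutoff shrinks the predicted-relevant set, which tends to raise precision and lower recall; lowering it does the reverse.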
For each predictive coding model in the case, the general pipeline is the same: featurize the reviewed documents, build a model from the training set, and apply that model to every document in the case to generate predictions.
User-Facing Features:
Creating a new model: To create a new model, select "create new model" at the bottom of the list of accessible models. A wizard will open and walk you through the creation of a new model, allowing you to specify the “relevant” and “irrelevant” criteria, as well as optional exclusion criteria. For a more detailed walkthrough and introduction to predictive coding, see the introductory predictive coding article.
Next Step Suggestions: You will be provided with suggested next steps to help improve the accuracy and performance of the given predictive coding model.
Distribution Graph of Prediction Values: You are provided the distribution of predictions for both currently rated and unrated documents. Dragging the cutoff line will display the number of documents lying above and below the cutoff score. Clicking on one of the numbers will open a results table with the documents either above or below the selected cutoff score. Using this, you can isolate documents based on where they fall on the continuum of predicted values.
Training Coverage Graph: Training coverage is a measure of how well the model understands the documents in the case. The training coverage score is essentially a value comparing the features of documents in the training corpus to any given document. Documents that have many features already seen and included in the training corpus will have higher coverage scores, while documents with few features included in the training corpus will have low coverage scores. The graph's x-axis shows the predicted relevance, and the y-axis shows the coverage score. On the chart, you can click and drag to highlight an area of the graph. By clicking “search”, you can access a results table with the documents whose coverage scores and predicted values are in the range contained in the selected area. Using the graph, you can strategically identify documents in your corpus that you can review in order to train the system better (generally the un-reviewed documents with low coverage scores).
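The coverage idea can be sketched in a few lines. The actual formula Everlaw uses is not published, so this is only an assumption-laden illustration of the concept described above: the fraction of a document's features that already appear in the training corpus.

```python
# Illustrative sketch of a coverage-style score. Assumption: the real
# scoring formula is not documented here; this just captures the idea of
# "how many of this document's features has the training corpus seen?"

def coverage_score(doc_features, training_features):
    """Share of this document's features present in the training corpus."""
    if not doc_features:
        return 0.0
    seen = doc_features & training_features
    return len(seen) / len(doc_features)

# Hypothetical feature sets (real features would be derived from text):
training = {"merger", "invoice", "quarterly", "audit", "contract"}
doc_a = {"merger", "invoice", "audit"}     # features well covered
doc_b = {"picnic", "birthday", "invoice"}  # mostly unseen features

high = coverage_score(doc_a, training)
low = coverage_score(doc_b, training)
```

Under this sketch, doc_b would surface as a good candidate for manual review: training on it teaches the model features it has not seen.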
Estimated precision: This is a measure of how accurate the predictions are. Of the documents classified as being in the relevant set by the algorithm, how many are actually relevant? Hovering over the card will show you the prediction score used as the cutoff for being considered in the relevant set. The cutoff is chosen by the system to maximize the F1 score (which you can read about below). Generally, a low precision estimate suggests a large number of false positives (documents predicted to be relevant when they are actually irrelevant).
In this example, out of all the documents predicted to be relevant (at the 100 point cutoff), 94.7% of them are estimated to be correctly identified as relevant. Another way to think about this is that for the set of documents that have a prediction score of 100, an estimated 94.7% of them are actually relevant.
Estimated recall: This is a measure of how thorough the model is in identifying all the relevant documents in the case. It is the proportion of relevant documents in the predicted relevant set relative to the total number of relevant documents in the case as a whole. For example, your model might return 10 documents that it thinks are relevant, 9 of which actually are, resulting in a high precision score. However, those 9 documents represent only 1% of the total relevant documents in the case. Recall captures this latter measure. Again, hovering over the card will provide you with the prediction score used as the cutoff for being considered in the relevant set. In general, lower estimated recall scores suggest a larger number of false negatives. In other words, there are a greater number of documents that are predicted to be irrelevant at the given cutoff, but are, in fact, actually relevant.
In this example, it is estimated that out of all the relevant documents in the case, 79.3% of them can be found in the set of documents with a prediction score of 100.
F1: This is a measure of the model’s accuracy that takes into account both the estimated precision and recall scores. Values for F1 range from 0 to 1, with higher scores indicating more accurate models.
An estimate of the maximum possible F1 score for any given model in the case is provided. When evaluating the trade-off between precision and recall, F1 scores often provide the best way to judge the overall performance of a model.
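The cutoff selection described above (choosing the score that maximizes F1) can be sketched as a simple sweep over all 101 possible cutoffs. This is an illustration of the technique, not Everlaw's actual code; the scores and labels are invented.

```python
# Illustrative sketch: sweep every cutoff from 0 to 100 and keep the one
# that maximizes F1, mirroring how the cutoff shown on the precision and
# recall cards is described as being chosen. Not Everlaw's actual code.

def f1_at_cutoff(predictions, labels, cutoff):
    predicted = {d for d, s in predictions.items() if s >= cutoff}
    relevant = {d for d, rel in labels.items() if rel}
    tp = len(predicted & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(relevant)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def best_cutoff(predictions, labels):
    return max(range(101), key=lambda c: f1_at_cutoff(predictions, labels, c))

# Hypothetical held-out scores and labels:
preds = {"d1": 95, "d2": 80, "d3": 60, "d4": 30, "d5": 10}
labs = {"d1": True, "d2": True, "d3": False, "d4": False, "d5": False}
cutoff = best_cutoff(preds, labs)
```

Sweeping the same function at cutoffs other than the maximizer shows the trade-off directly: lower cutoffs trade precision for recall, higher cutoffs trade recall for precision.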
Special Note: The language in the current text tooltips is not dynamic. “Hot” and “cold” are used to designate relevance and irrelevance. This is not an issue for the default rating model, but can prove confusing for other models in the case that have different criteria for relevance and irrelevance. In those cases, just use “hot” and “cold” as stand-ins for the criteria governing the model. For any given model in the case, the top of the predictive coding page will display the criteria being used.
In this example, “hot” refers to documents coded Andrew Fastow, while “cold” refers to documents most likely not coded Andrew Fastow.
Document Input Information, and Estimate of Cullable Documents: You are provided a count of the total number of documents in the current training corpus, the total number of documents in the case that are unprocessable due to insufficient or missing OCR text, and an estimate of the number of documents that could be culled from the database based on predicted scores. Hovering over the culled document card will show you the prediction score cutoff used to determine the number of documents that could potentially be culled.
In this case, there are about 835,351 documents that have prediction scores below 40. These documents, representing 54.9% of the total documents in the database, are likely to be irrelevant (or “cold”), and could potentially be culled.
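The arithmetic behind that estimate is straightforward: count documents scoring below the cutoff and express them as a share of the database. The sketch below uses the 40-point cutoff from the example with a tiny made-up score list.

```python
# Illustrative arithmetic behind the culling estimate: documents scoring
# below the cutoff are candidates for culling. Scores here are made up;
# the 40-point cutoff mirrors the example in the text.

def cullable(scores, cutoff):
    """Return (count, share) of documents scoring below the cutoff."""
    below = sum(1 for s in scores if s < cutoff)
    return below, below / len(scores)

scores = [5, 12, 38, 40, 55, 71, 90, 25, 33, 60]
count, share = cullable(scores, cutoff=40)
```

In the example from the text, the same calculation over the full database yields about 835,351 documents, or 54.9% of the total.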
Status and Training Sets
After performance metrics, users are provided a status report of when the model was last updated, and how many documents have been manually reviewed since the last update was completed.
Finally, users have the option to create specific training sets for any given model. The training sets can be sampled from all the documents in the case, or from a specific search. Either way, randomly sampled representative documents will be included in a training set. Reviewing these documents will help the model better understand the documents in your case, resulting in improved predictions.
It is hard to give specific targets or guidelines for precision, recall, or F1 scores since every case is idiosyncratic. Targets will also depend on how predictive coding is used within a larger review workflow. For example, one review team might want high precision and be willing to sacrifice recall. Another review team might want their model to capture as many potentially relevant documents as possible across a case, and be willing to sacrifice precision. Nevertheless, here are some general tips on how to improve different aspects of a model’s performance in Everlaw:
- Avoid rating/coding inconsistencies (for example, rating documents that share a context, such as dupes/near-dupes, attachment families, or email threads, differently). This will make the input to the model cleaner. Use the context panel to ensure consistent rating and coding across documents in the same document family.
- Work to reduce the bias of the training corpus. Make a conscious effort to review a broadly sampled set of documents from across the entire case. You can do this by:
- Creating multiple training sets sampled at random from the case as a whole on the prediction page.
- Using the coverage graph to target specific documents to add to the training corpus.
- Highlight an area of the graph that captures low coverage scores, and click the “search” button on the right. Refine the search, and add “unrated” to the search criteria. Review all, or a subset, of the search results (if there are a lot of unrated documents, consider using the sample search option to draw a random sample of the search results for manual review). This will improve the model's understanding of the documents in the case, resulting in more accurate predictions.
- Use the cutoff prediction score as an anchor. The cutoff is determined by the level that maximizes the F1 value. Generally, if you use a lower cutoff, the recall will improve while precision gets worse. If you use a higher cutoff, the recall will get worse while the precision improves.
- For example, if you look at the cutoff used for the performance metrics (by hovering over the performance cards), and you want a set of documents with a potentially better recall score, use the chart to select all the documents above a lower cutoff. You can intuit that the recall score for this set would be better. Vice versa for precision.
- Check what the model is using for the cutoff prediction score for relevance/irrelevance. An abnormally high or low cutoff indicates that some problem occurred during training, and the performance metrics will generally be misleading.
- Using the person parameter for the rating, coding, and category search terms, you can build a model that only takes into account the review decisions of trusted reviewers or subject matter experts.