Basic and Rigorous Performance Statistics

Table of Contents

What is the holdout set?
How is the holdout set created?
What is the difference between basic and rigorous performance statistics?
Which documents generate rigorous performance statistics?
I have reviewed documents in my holdout set, so why don't I see any performance statistics?
How can I increase the number of documents contributing to my model's rigorous performance statistics?
Why did the number of documents contributing to my model's rigorous statistics decrease?
What does performance history look like for rigorous performance statistics?

This article discusses how basic and rigorous performance statistics are generated for predictive coding models on Everlaw.

What is the holdout set?

The holdout set is a randomly sampled 5% of documents from your project that are not used to train your model, but instead are used to generate statistics about your model’s performance. Your model makes predictions on the documents in its holdout set, and then compares its predictions to how your team reviewed those documents. The more the model's predictions agree with your team's review decisions, the higher its reported performance will be.
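As a hypothetical illustration (not Everlaw's actual implementation), the idea of agreement between predictions and review decisions can be sketched in Python. The function name `agreement_rate` and the boolean relevance labels are assumptions made for this example; Everlaw reports richer statistics than a single agreement fraction.

```python
def agreement_rate(predictions, review_decisions):
    """Fraction of holdout documents on which the model's relevance
    prediction matches the team's review decision. A simplified
    stand-in for the statistics a real platform would report."""
    if len(predictions) != len(review_decisions):
        raise ValueError("each holdout document needs one prediction and one review")
    matches = sum(p == r for p, r in zip(predictions, review_decisions))
    return matches / len(predictions)

# Four holdout documents; the model and the team disagree on one of them.
print(agreement_rate([True, True, False, False], [True, False, False, False]))  # 0.75
```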


Return to table of contents

How is the holdout set created?

A model’s holdout set is created by randomly sampling 5% of the documents from each upload to your project. Holdout documents cannot be used to generate performance statistics, however, until they have been manually reviewed. Once enough holdout documents have been reviewed to meet the threshold (50 relevant, 50 irrelevant, and 200 total), the model begins generating basic performance statistics. To generate rigorous performance statistics, the model uses only a truly random subsample of the reviewed holdout documents.
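The sampling and threshold logic described above can be sketched as follows. This is an illustrative sketch, not Everlaw's code; the function names `sample_holdout` and `meets_threshold` are invented for the example.

```python
import random

HOLDOUT_FRACTION = 0.05  # 5% of each upload, per the article

def sample_holdout(upload_doc_ids, rng=None):
    """Randomly sample 5% of an upload's documents into the holdout set."""
    rng = rng or random.Random()
    k = max(1, round(len(upload_doc_ids) * HOLDOUT_FRACTION))
    return rng.sample(list(upload_doc_ids), k)

def meets_threshold(reviewed_labels):
    """Check the review threshold for basic statistics: at least 200
    reviewed documents, with at least 50 relevant and 50 irrelevant."""
    relevant = sum(1 for label in reviewed_labels if label)
    irrelevant = len(reviewed_labels) - relevant
    return len(reviewed_labels) >= 200 and relevant >= 50 and irrelevant >= 50
```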


Return to table of contents

What is the difference between basic and rigorous performance statistics?

The difference between basic and rigorous performance statistics is the sample of documents used to generate them. Basic performance statistics may be generated from a biased sample of documents. Rigorous performance statistics are generated from a truly random sampling of a model’s reviewed holdout documents.

To illustrate the primary concern with generating basic performance statistics from the full reviewed holdout set, imagine this scenario: a team sets up a predictive coding model to find hot documents. The model needs to be trained on reviewed documents (and the team needs to get started on their case anyway), so they begin reviewing. To find documents to review for training, they run searches for documents that contain keywords relevant to their case, as well as documents that come from certain custodians.

The team’s predictive coding model is now running, but they notice that no performance statistics are being generated because their holdout set has not yet been reviewed, so they begin reviewing documents from their holdout set. Because they are still trying to find hot documents for their case, they review the holdout documents that contain the same relevant keywords, or that come from those same custodians. After they review a few hundred holdout documents, their model begins generating performance statistics. The model seems to be performing well!

The issue with these statistics, however, is that they were not generated from a random sample of the project’s documents. The holdout set itself was randomly sampled, but the holdout documents the team reviewed, and that therefore generated the model’s performance statistics, were chosen via non-random selection (e.g., running keyword searches, selecting particular custodians). In fact, that non-random selection process closely resembled the non-random process used to choose the training documents, so the documents the model was tested on were likely similar to the documents it was trained on. It is no surprise, then, that the model’s performance statistics show it performing well. However, if the model were given a completely random document to make a prediction on, it might perform worse than those statistics suggest.

As the above scenario illustrates, the documents that generate performance statistics must be randomly selected to avoid artificially inflating your model’s performance. The next section describes how these documents are selected.


Return to table of contents

Which documents generate rigorous performance statistics?

Let’s say your project has one million documents in it. Your holdout set, which represents a randomly sampled 5% of your total project, has 50,000 documents in it. Each of these 50,000 holdout documents is assigned a random identifier, known as its holdout ID (HID). Some of these holdout documents have been reviewed, while others have not yet been reviewed. Your model’s rigorous performance statistics are generated from all reviewed holdout documents with lower HIDs than the lowest-numbered unreviewed holdout document.

To illustrate how performance statistics are generated, imagine that the cells in the table below represent the first fifteen documents in your holdout set, in order of ascending HID. HIDs in bold represent documents that have been reviewed.

[Table: the first fifteen holdout documents, ordered by ascending HID, with reviewed documents’ HIDs in bold]

Basic performance statistics are generated from all reviewed holdout documents. In the table below, the documents used to generate basic performance statistics are shaded beige:

[Table: the same fifteen holdout documents, with every reviewed document shaded beige]

Rigorous performance statistics, however, are generated from the contiguous set of reviewed holdout documents with HIDs below the lowest-numbered unreviewed document. In the table below, the documents used to generate rigorous performance statistics are shaded green:

[Table: the same fifteen holdout documents, with only documents 96 and 114 shaded green]

Only the documents numbered 96 and 114 are used to generate the model’s rigorous performance statistics. Although document 256 has been reviewed, it will not contribute to the model’s rigorous performance statistics until documents 117 and 204 have been reviewed, as well. This ensures that a random set of documents is used to evaluate the model, providing unbiased statistics.
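The selection rule above amounts to taking the longest reviewed prefix of the holdout set. A minimal Python sketch (the helper name `rigorous_sample` is an assumption for illustration, not Everlaw's API):

```python
def rigorous_sample(holdout_hids, reviewed_hids):
    """Return the contiguous run of reviewed holdout documents, walking
    the holdout set in ascending HID order and stopping at the first
    unreviewed document."""
    sample = []
    for hid in holdout_hids:          # holdout_hids must be sorted ascending
        if hid not in reviewed_hids:
            break
        sample.append(hid)
    return sample

# The article's example: 96, 114, and 256 are reviewed; 117 and 204 are not.
print(rigorous_sample([96, 114, 117, 204, 256], {96, 114, 256}))  # [96, 114]
```

Document 256 only enters the sample once 117 and 204 are reviewed, because the walk stops at the first unreviewed HID.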



Return to table of contents

I have reviewed documents in my holdout set, so why don’t I see any performance statistics?

In order for your model’s holdout set to begin generating basic performance statistics, it must contain at least 200 reviewed documents, at least 50 of which were reviewed as relevant and at least 50 as irrelevant. In order for rigorous performance statistics to be generated, the contiguous set of reviewed holdout documents must meet these same criteria. In other words, when ordered by ascending HID, at least the first 200 documents in your holdout set must all be reviewed, with at least 50 deemed relevant and 50 deemed irrelevant.
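Combining the contiguous-prefix rule with the review threshold, a check for whether rigorous statistics can be generated might look like the sketch below. The function name and data shapes are assumptions made for illustration.

```python
def rigorous_stats_available(holdout_hids, review_labels):
    """True if rigorous statistics can be generated: the reviewed documents
    up to the first unreviewed HID must number at least 200, with at least
    50 relevant and 50 irrelevant among them.

    holdout_hids: HIDs in ascending order
    review_labels: dict mapping reviewed HIDs to True (relevant) or False
    """
    prefix = []
    for hid in holdout_hids:
        if hid not in review_labels:
            break                      # stop at the first unreviewed document
        prefix.append(review_labels[hid])
    relevant = sum(prefix)
    irrelevant = len(prefix) - relevant
    return len(prefix) >= 200 and relevant >= 50 and irrelevant >= 50
```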


Return to table of contents

How can I increase the number of documents contributing to my model’s rigorous performance statistics?

If you want to increase the number of documents used to generate rigorous performance statistics, you will need to review the lowest-numbered unreviewed holdout documents. This is because your rigorous subsample excludes any holdout documents with HIDs above the lowest-numbered unreviewed document. To aid your review, any time you open the list of unreviewed holdout documents from the model page, they will appear in order from lowest to highest HID.

[Image: the model page’s list of unreviewed holdout documents, sorted from lowest to highest HID]

You can simply review the list of documents from top to bottom.


Return to table of contents

Why did the number of documents contributing to my model’s rigorous statistics decrease?

Uploading new documents to your project may cause the number of documents contributing to your model’s rigorous performance statistics to fluctuate. Five percent of the newly uploaded documents will be randomly selected for the holdout set and assigned HIDs. If any of these new, unreviewed documents is assigned a lower HID than a document already in the contiguous set of reviewed documents generating rigorous statistics, the higher-numbered reviewed documents will stop contributing until the lower-numbered documents have been reviewed. This ensures that your model’s performance statistics continue to be generated from a random and representative sample of documents from your project.
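This fluctuation follows from the contiguous-prefix rule the article describes. The sketch below (hypothetical, not Everlaw's code) shows the rigorous sample shrinking when an upload inserts unreviewed documents with lower HIDs:

```python
def rigorous_sample(holdout_hids, reviewed_hids):
    """Contiguous reviewed prefix of the holdout set (ascending HIDs)."""
    sample = []
    for hid in holdout_hids:
        if hid not in reviewed_hids:
            break
        sample.append(hid)
    return sample

holdout = [96, 114, 256]
reviewed = {96, 114, 256}
print(rigorous_sample(holdout, reviewed))  # [96, 114, 256] -- all three contribute

# A new upload adds unreviewed holdout documents with HIDs 103 and 130.
holdout = sorted(holdout + [103, 130])
print(rigorous_sample(holdout, reviewed))  # [96] -- 114 and 256 drop out until
                                           # 103 and 130 are reviewed
```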


Return to table of contents

What does performance history look like for rigorous performance statistics?

Your model’s performance history graph reflects the performance of your model at each point in time at which the holdout set threshold was met. This generally means that the same core set of documents (though the set may grow) is used to generate each data point on your performance history graph; new uploads to your project may cause a different set of holdout documents to be used. If the threshold was not met at a given update, there will be no data point for that date. You can view the number of documents contributing to your performance statistics over time in the Holdout Size graph under Performance History.

[Image: the Holdout Size graph under Performance History]

Return to table of contents
