Deduplicate, Sample, Group, and Remove Search Hits Via "More Options"


How do I deduplicate, sample, group, or remove my search hits?

On Everlaw, you can deduplicate, sample, group, and remove documents within a search. All of these settings can be applied via the More Options tab.

The settings in More Options, and the combination of them, allow you to build specific searches with only a few simple steps, while also supporting very sophisticated search workflows (visit the "example search workflows using More Options" section). 

The More Options tab is in the bottom right corner of each logical container. Any setting can be applied to any logical container (inner or outer), with the exception of deduplication, which can only be applied to the outermost container. You can learn more about applying search settings on inner and outer search containers in this section of the article.

Screen_Shot_2020-01-24_at_5.17.42_PM.png

 

Once you’re in the More Options dialog, you can select any combination of settings. The effect of each setting on your results is shown below each section as a positive or negative number. If a setting has no effect on your search, the section will say “No Change.” You can also click “Show walkthrough of your search settings” and click through each step to understand how your settings impact your search, which is particularly useful if you’re deduplicating your search.

screenshot2.png

You can learn more about applying multiple search settings at once via some complex use cases at the end of this article.

Once you’re happy with your settings, click Save. All settings applied will be represented on the search container. 


Deduplicate among search hits

Because Everlaw can identify documents as duplicates, it also allows you to manage duplicates as you go through review. This is called deduplication. Deduplication is the process of removing exact duplicates from the action you’re taking. For a conceptual overview of deduplication on Everlaw, visit this help article.

By deduplicating your search, only one copy of each document will be returned within your search results. If your search returns two or more documents that belong to the same duplicate group, search deduplication will return only one copy in your search, which will be the copy of the document in the search that was first uploaded to Everlaw. 
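The rule above (keep one copy per duplicate group, namely the earliest-uploaded copy among the hits) can be pictured with a short Python sketch. This is only a conceptual model, not Everlaw's implementation: the `Doc` class and its `dup_group` and `upload_order` fields are hypothetical names invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: int
    dup_group: str      # hypothetical ID shared by exact duplicates
    upload_order: int   # hypothetical: lower means uploaded earlier

def dedupe_within_hits(hits):
    """Keep one copy per duplicate group: the earliest-uploaded copy among the hits."""
    earliest = {}
    for doc in hits:
        kept = earliest.get(doc.dup_group)
        if kept is None or doc.upload_order < kept.upload_order:
            earliest[doc.dup_group] = doc
    return sorted(earliest.values(), key=lambda d: d.doc_id)

# Documents 1 and 2 are duplicates; document 2 was uploaded first.
hits = [Doc(1, "A", 3), Doc(2, "A", 1), Doc(3, "B", 2)]
deduped = dedupe_within_hits(hits)
print([d.doc_id for d in deduped])
```

Note that only copies that are themselves search hits compete to be the kept copy; copies outside the search play no role in this model.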

For example, let’s say we want all documents in a binder, with duplicates removed. Add the binder search term to your query builder, click More Options, then select “Deduplicate within search hits.” A count of how many documents are removed from your search as a result of this setting will be shown. Click “Save,” and your binder is now deduplicated, indicated by the deduplicate tab on your search container. 

1_-_dedupe_settings.gif  

Your project may have a search setting enabled that hides all project duplicates by default. In this case, the More Options tab will say “Hide All.”

 screenshot3.png 

You will also see a third option, selected by default, in the dialog that says “Hide all project duplicates.” This setting is intended for clients receiving productions with many duplicates who would like to mimic upload deduplication, while retaining all Bates numbered documents. It is rarely used in any other circumstances.

 screenshot4.png 

It’s understandably complex to think about how this option differs from deduplicating within search hits.

Primarily, search deduplication is a search-wide setting. It occurs after identifying all documents that match your search criteria, then removes duplicative copies, leaving you with a single copy of each document that matches your search criteria.

Hiding project duplicates is a project-wide setting. It occurs BEFORE evaluating your search criteria. With this option enabled, any document flagged as a “project duplicate” will be excluded from results, regardless of search criteria. In particular, that means you may not have even a single copy of a document that matches your search, if the only copies matching your search are considered project duplicates. This workflow is only recommended when you will be running a series of searches, and want to ensure that two different copies of the same document are never returned across the union of your searches.
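To make the difference in ordering concrete, here is a small Python sketch under simplifying assumptions. The tuple layout, the `is_project_duplicate` flag, and both function names are hypothetical; the sketch only mirrors the order of operations described above and does not represent Everlaw's internals.

```python
# Each document: (doc_id, dup_group, is_project_duplicate, in_binder)
docs = [
    (1, "A", False, False),  # primary copy of group A, not in the binder
    (2, "A", True,  True),   # flagged project duplicate, in the binder
    (3, "B", False, True),   # unique document, in the binder
]

def matches(doc):
    """Stand-in for the search criteria: 'is in the binder'."""
    return doc[3]

def search_dedup(docs):
    """Evaluate the criteria first, then keep one copy per duplicate group."""
    seen, out = set(), []
    for doc in docs:
        if matches(doc) and doc[1] not in seen:
            seen.add(doc[1])
            out.append(doc[0])
    return out

def hide_project_duplicates(docs):
    """Drop flagged project duplicates first, then evaluate the criteria."""
    return [doc[0] for doc in docs if not doc[2] and matches(doc)]

print(search_dedup(docs))             # one copy of group A survives
print(hide_project_duplicates(docs))  # no copy of group A survives
```

In this toy data set, the only copy of group A that sits in the binder is a flagged project duplicate, so hiding project duplicates returns no copy of that document at all, while search deduplication still returns one.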


Sampling

In the sampling section of More Options, you can choose a randomly sampled subset of your search results for any given search. Sampling is helpful for triaging review, where you may receive thousands of documents of a particular custodian and want to review a sample to decide how to triage the entire set. It’s also a useful setting for training predictive coding models. Training a prediction model with randomly sampled subsets of documents may help improve the precision and recall of the generated predictions.

Document sampling will always be applied after deduplication, and before grouping or filtering decisions. In other words, if you choose to sample your documents and also group them by email thread, your documents will be sampled before they are grouped into threads. This prevents partial email threads from appearing in your results table. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results. 

Sampling probabilities apply to each document, rather than to the set as a whole. For example, if you apply 10% sampling, each document in your results will have a 10% chance of being returned, rather than exactly 10% of documents being selected from the entire results set. This may affect the total number of documents returned when you sample from relatively small results sets. Your total number of documents may also differ from the expected percentage if you have restricted document access.
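The effect of per-document sampling is easy to see in a short simulation. The Python sketch below is illustrative only: `sample_per_document` is a hypothetical helper, and the random number generator is not meant to reflect how Everlaw draws its samples.

```python
import random

def sample_per_document(hits, rate, seed=None):
    """Keep each document independently with probability `rate`."""
    rng = random.Random(seed)
    return [doc for doc in hits if rng.random() < rate]

# With only 20 hits, a 10% sample does not always return exactly 2 documents.
small = list(range(20))
sizes = [len(sample_per_document(small, 0.10, seed=s)) for s in range(200)]
print(min(sizes), max(sizes))

# With 100,000 hits, the sample size lands close to 10% of the set.
large = range(100_000)
print(len(sample_per_document(large, 0.10, seed=1)))
```

The smaller the results set, the more the returned count can drift from the nominal percentage, which is the behavior described above.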


Grouping

Grouping allows you to organize your search hits by context: exact duplicates, attachments, email threads, or document versions. Each one is defined as the following: 

  • Attachments: Documents in an attachment family. Includes the parent document, often an email, and its attachments.
  • Email Threads: Emails that comprise an email thread, including replies, reply all, and forwarded emails.  Grouping by email thread will also include attachments and duplicate emails.
  • Exact Duplicates: Duplicate copies of the document. A complete definition of duplicates is in this article.
  • Versions: Versions of the same document (produced and pre-produced, translated and untranslated, etc.)

When you include grouping in your search, you are pulling associated documents into the search, even if those documents do not meet your search criteria. For example, if you're searching for documents with the word “fraud” and you group by email thread, the search will include documents in the same email thread, even if they don’t contain the word “fraud.”
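The “fraud” example can be modeled in a few lines of Python. This is a conceptual sketch with made-up data; the `emails` dictionary and the `search` and `group_by_thread` helpers are hypothetical and stand in for the query builder and the grouping setting.

```python
emails = {
    "e1": {"thread": "T1", "text": "possible fraud in Q3 numbers"},
    "e2": {"thread": "T1", "text": "re: let's discuss tomorrow"},
    "e3": {"thread": "T2", "text": "lunch on Friday?"},
}

def search(term):
    """Return the IDs of emails whose text contains the term."""
    return {eid for eid, e in emails.items() if term in e["text"]}

def group_by_thread(hits):
    """Pull in every email that shares a thread with at least one hit."""
    threads = {emails[eid]["thread"] for eid in hits}
    return {eid for eid, e in emails.items() if e["thread"] in threads}

hits = search("fraud")
grouped = group_by_thread(hits)
print(sorted(hits))     # only e1 mentions the term
print(sorted(grouped))  # e2 is pulled in because it shares thread T1
```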

Grouping will always be applied after deduplication and sampling, but before removal. This implies that deduplicated documents may get reintroduced if they are part of the context that you group by. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results. 

Note: You may see a “Rethreading in progress” warning when grouping by email thread. This indicates that email threads in your search may be incomplete or misrepresented until the rethreading task is complete. To learn more about email rethreading, visit this article.


Removal

Once you group your search hits, you have the option to remove certain contexts from that grouping: parents, children, search hits, grouped non-hits, and non-inclusive emails. Each one is defined as the following: 

  • Parents: The topmost member in a document grouping, such as the primary email to which other documents are attached.
  • Child documents: Any document that is not the parent in a group, such as email attachments or project duplicates. 
  • Search hits: Any document that would be returned by your search, after search deduplication is applied.
  • Grouped non-hits: All documents that are not designated as search hits, but were introduced via grouping. 
  • Non-inclusive emails: You can only select this removal option when grouping by email threads. Inclusive emails are the minimum set of emails that creates the most “complete” email content in the thread. It might be one email that is inclusive of all the thread's content, but it might be multiple emails that create the set. It is often the last email in the branch, and all previous emails should appear in the body of the document. Everlaw considers text, recipients, and attachments to determine inclusiveness. 

Removal is always applied after the other settings in More Options. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results. 
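Putting the four settings together, the order of operations is deduplication, then sampling, then grouping, then removal. The Python sketch below models that order under simplifying assumptions: `run_search` and its parameters are hypothetical, list order stands in for upload order, only two removal options are modeled, and "search hits" are taken to be the documents that survive deduplication and sampling. It is not Everlaw's implementation.

```python
import random

def run_search(matching, all_docs, dup_group, thread, *,
               dedupe=False, sample_rate=None, group_by_thread=False,
               remove=None, seed=0):
    hits = list(matching)
    # 1. Deduplication: keep the first copy seen in each duplicate group.
    if dedupe:
        seen, kept = set(), []
        for d in hits:
            if dup_group[d] not in seen:
                seen.add(dup_group[d])
                kept.append(d)
        hits = kept
    # 2. Sampling: each remaining hit is kept independently.
    if sample_rate is not None:
        rng = random.Random(seed)
        hits = [d for d in hits if rng.random() < sample_rate]
    # 3. Grouping: pull in every document sharing a thread with a hit.
    results = list(hits)
    if group_by_thread:
        threads = {thread[d] for d in hits}
        results = [d for d in all_docs if thread[d] in threads]
    # 4. Removal: always applied last.
    hit_set = set(hits)
    if remove == "search_hits":
        results = [d for d in results if d not in hit_set]
    elif remove == "grouped_non_hits":
        results = [d for d in results if d in hit_set]
    return results

all_docs = ["e1", "e1_copy", "e2", "e3"]
dup_group = {"e1": "G1", "e1_copy": "G1", "e2": "G2", "e3": "G3"}
thread = {"e1": "T1", "e1_copy": "T1", "e2": "T1", "e3": "T2"}
matching = ["e1", "e1_copy"]

print(run_search(matching, all_docs, dup_group, thread, dedupe=True))
print(run_search(matching, all_docs, dup_group, thread,
                 dedupe=True, group_by_thread=True))
print(run_search(matching, all_docs, dup_group, thread,
                 dedupe=True, group_by_thread=True, remove="search_hits"))
```

The second call shows how a deduplicated copy (`e1_copy`) can be reintroduced by grouping, and the third shows removal acting on the grouped set rather than on the original hits.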

2_-_removal.gif

Note: You cannot remove parents when grouping by email thread. This is so that attachments are not displayed without their associated email parents in the results table, which you can learn more about in the next section of this article. 


Viewing your search and adjusting search settings in the results table

Once you’ve run your search, you have the ability to view grouped documents and adjust search settings directly from the results table. 

Grouped documents are collapsed by default and indicated by a caret next to the parent document.  The total number of children in a document grouping is also displayed in parentheses (in the case of email threads, only other children and duplicate emails are included in the count, even though related attachments are grouped in this particular family).

rt1.png

On the left of the results table, you can click the expansion icon and expand all or collapse all document groups. 

rt2.png 

Child documents in a group are numbered with decimal extensions of their parent's row number. In this example, the parent is row #47 and its children are represented in the screenshot:

 rt3.png

If any of your grouping settings result in removing parents from your results, either on the search page or results table, the parents will appear as greyed-out documents, with children visible under them. The parents will not be affected by any export, batch modify, or production actions.

rt4.png

To adjust your previously applied search settings, click Options in the results table toolbar. Clicking the icon will open the same dialog as the More Options tab in search, with previously applied settings selected. If you change your settings in any way, the results table will update to reflect the selected settings. A new search will be saved as a separate card on the homepage. 

3_-_search_settings.gif 


Example search workflows using More Options

Below are some use cases for using More Options search settings. In these examples, you will need to use settings for multiple containers and/or multiple settings within each container. We'll start with a few that are a bit simpler. 

Ensure reviewers are assigned full email threads

You can use search settings to ensure that your search includes entire email threads. This can give you peace of mind that, when creating assignments, you have included all emails in a thread for review. 

Let's say we want to assign all emails from custodian Dasovich, grouped by email thread. Our search should look like this: 

Screen_Shot_2020-04-03_at_4.11.40_PM.png

Importantly, you'll want to make sure to group by Email Thread in the More Options tab: 

Screen_Shot_2020-04-03_at_4.09.16_PM.png

From here, click Begin Review, and then assign your documents out! Email thread grouping will be respected in your assignment, and when possible, individual threads will be kept together and assigned to one user. 

Identify all non-primary duplicate documents to delete 

Perhaps you want to get rid of all duplicate documents on your project so that you can save on cost. To identify all duplicate documents, click the "All Documents" search card on your homepage. Then, click Refine to get to the query builder: 

Screen_Shot_2020-04-03_at_5.27.47_PM.png

Then, click More Options and group by exact duplicates. Then, remove the parents so that you can preserve one primary copy of each duplicate group.

Screen_Shot_2020-04-03_at_5.29.13_PM.png

Run this search, and then you can delete the documents from the results table, if you wish. 

Pre-production QA

You can use search settings to perform QA on your responsive documents before producing them. In this example, you want to ensure that you aren’t producing any privileged documents. First, add an extra AND operator into your query builder. Within it, select the “Coded” term and choose your responsiveness code. Click More Options and group it by Attachments. Then click Save. 

Next, click the outer AND container, and choose the “Coded” term again. Select your Privilege code, then negate it by clicking it once.

Your search identifies documents marked for production, including attachments, that have not been coded for privilege. This allows you to easily check for coding inconsistencies before running a production and assign them for review.
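One way to read this nested search is as set logic: the inner container expands the responsive documents to their attachment families, and the outer container then excludes anything coded privileged. The Python sketch below is a rough model with invented data; the document names, the `family` and `codes` fields, and the helper functions are all hypothetical.

```python
docs = {
    "email1": {"family": "F1", "codes": {"Responsive"}},
    "att1":   {"family": "F1", "codes": set()},
    "att2":   {"family": "F1", "codes": {"Privileged"}},
    "memo1":  {"family": "F2", "codes": {"Privileged"}},
}

def coded(code):
    """Documents that carry the given code."""
    return {d for d, v in docs.items() if code in v["codes"]}

def group_by_attachments(hits):
    """Pull in every document that shares an attachment family with a hit."""
    families = {docs[d]["family"] for d in hits}
    return {d for d, v in docs.items() if v["family"] in families}

# Inner container: Coded "Responsive", grouped by Attachments.
inner = group_by_attachments(coded("Responsive"))
# Outer container: inner AND NOT Coded "Privileged".
result = inner - coded("Privileged")

print(sorted(inner))
print(sorted(result))
```

Because grouping is applied on the inner container, the negated privilege term in the outer container is evaluated against the whole family, not just the documents that were coded responsive.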


Identifying the top (parent) email and all of its attachments

Let’s say you’ve got an upload and you’d like to find the first email in a thread as well as its attachments. First, add an extra AND operator into your query builder. Within it, select the “Uploaded” search term and choose the upload you’d like to search for. Click More Options, then group by email thread. Next, remove the children. Then, click Save. Your result includes the parent documents, or the top email, for every email thread in our upload. 

Next, we want to bring in the parent emails’ attachments. Click the outer container, then select More Options, then group by attachments. Then, click Save. We can interpret this search as the parent emails in threads in our upload, as well as the parent emails’ attachments.  

A good way to double-check the logic of your search is via the instant search preview. The grey bar will display the order of operations conducted in your search. 


Identifying all non-inclusive emails

Perhaps you want to get rid of non-inclusive emails in your database, so you’re interested in searching for all non-inclusive emails. This one is a bit tricky. We first need to search for all emails grouped by thread. Then, we’ll search for inclusive emails. Finally, we’ll combine those searches together and search for all emails grouped by thread, and not the inclusive ones!

First, start by searching for emails grouped by thread.  Add the Type term, and select Email. Click More Options and group your search by email thread. Ensure that you’ve chosen to show all duplicates, then click Save. Click Begin Review on the search page. This will return all emails on your project, grouped by thread (including their attachments). It will also save your search to be used in the next step. 

4_-_inclusive_1.gif 

 

Create the second component of the search, which is "all emails that are inclusive." Since most settings in the previous search can be reused, you don’t need to create a brand new search. Click Options in the results table. Keep all settings the same, but in the Removal step, select Non-inclusive emails. Then click Save. This will return all inclusive emails because you removed the non-inclusive ones. 

5_-_inclusive_2.gif

Our final step is to find all emails, grouped by thread, that are non-inclusive. Create a new search by clicking the magnifying glass in the navigation bar. 

Add the Prior Search term and select the initial search we created: “Type Email, including duplicates, grouped by email threads.” Next, add Prior Search again, but select the second search (all inclusive emails). Finally, you should negate the second search by clicking the term to turn it red. 

6_-_inclusive_3.gif

The way we can interpret this search, from top-down, is all emails grouped by thread, but NOT the inclusive ones. 
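In set terms, the final search is simply the first search minus the second. The Python sketch below illustrates that with made-up data; the `inclusive` flag is a hypothetical stand-in for Everlaw's own inclusiveness determination.

```python
emails = {
    "e1": {"thread": "T1", "inclusive": False},
    "e2": {"thread": "T1", "inclusive": False},
    "e3": {"thread": "T1", "inclusive": True},   # contains e1 and e2 in its body
    "e4": {"thread": "T2", "inclusive": True},   # single-email thread
}

# Search 1: all emails, grouped by thread.
all_threaded = set(emails)
# Search 2: the same search with non-inclusive emails removed.
inclusive_only = {eid for eid, e in emails.items() if e["inclusive"]}
# Final search: Prior Search 1 AND NOT Prior Search 2.
non_inclusive = all_threaded - inclusive_only

print(sorted(non_inclusive))
```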

Identify all standalone emails

This search will help you identify "standalone" emails, meaning emails that are not part of an email thread and have no attachments. This is a complicated query because it requires three separate searches! 

Search 1:
First, we need to search for only emails that are children.

The search below isolates all emails that are attachments (children). Add the term "Type: Email." Then, in More Options, group by attachments and remove the parents. 

Screen_Shot_2020-04-03_at_4.35.51_PM.png

Click Begin Review to run the search. For future reference, let's rename the search to "Child Emails." 

Screen_Shot_2020-04-03_at_4.37.31_PM.png

Search 2:

Next, we need to bring in the email threads that those child emails belong to. To do this, use the Prior Search term and select Child Emails as your prior search. Then, click More Options and group the search by email thread. This will bring in the email thread of those child attachments. 

Screen_Shot_2020-04-03_at_4.41.33_PM.png

Click Begin Review, and then rename this search to "Parents with Children." 

Screen_Shot_2020-04-03_at_4.42.18_PM.png

Search 3:

This search will now (finally!) isolate the standalone emails.

Create a search with "Type: Email" AND "Prior Search: Parents with Children." Then, negate the Prior Search because we want every email that does NOT have any children.
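The three searches can be traced with a small Python model. This is a sketch under stated assumptions, not a description of Everlaw's internals: the `docs` data, the `is_parent` flag, and the helper functions are hypothetical, a document with no attachments is treated as the parent of its own one-member family, and an unthreaded email is given its own thread ID.

```python
docs = {
    "e1": {"type": "Email", "family": "F1", "is_parent": True,  "thread": "T1"},
    "a1": {"type": "Email", "family": "F1", "is_parent": False, "thread": "T1"},  # email attached to e1
    "e2": {"type": "Email", "family": "F2", "is_parent": True,  "thread": "T1"},  # reply to e1
    "e3": {"type": "Email", "family": "F3", "is_parent": True,  "thread": "T3"},  # unthreaded, no attachments
}

def of_type(doc_type):
    return {d for d, v in docs.items() if v["type"] == doc_type}

def group_by(hits, key):
    """Pull in every document sharing the given context ('family' or 'thread') with a hit."""
    values = {docs[d][key] for d in hits}
    return {d for d, v in docs.items() if v[key] in values}

def remove_parents(results):
    return {d for d in results if not docs[d]["is_parent"]}

# Search 1 ("Child Emails"): Type Email, grouped by attachments, parents removed.
child_emails = remove_parents(group_by(of_type("Email"), "family"))
# Search 2 ("Parents with Children"): Prior Search 1, grouped by email thread.
parents_with_children = group_by(child_emails, "thread")
# Search 3: Type Email AND NOT Prior Search 2.
standalone = of_type("Email") - parents_with_children

print(sorted(child_emails))
print(sorted(parents_with_children))
print(sorted(standalone))
```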


