Deduplicate, Sample, Group, and Remove Search Hits Via "More Options"


How do I deduplicate, sample, group, or remove my search hits?

Once you've constructed your search query, you can deduplicate, sample, group, and remove document hits that are returned by your query. These additional search settings are available through the "More options" widget at the bottom of each search container. By combining search settings, you can build highly specific searches in a few simple steps, or run very sophisticated search workflows (see the "Example search workflows using More Options" section below for more).

Search settings can be applied independently to each logical search container in your query, as well as to the query as a whole. With the exception of deduplication, which is only available for the entire query, you can apply any search setting, or combination of settings, to any logical container. To learn more about applying search settings on inner and outer search containers, see this section of the article.

more_options_containers.png

Once you’re in the More Options dialog, you can select any combination of settings. The effect of each setting on your results is reflected below each section as a positive or negative number, indicating the number of document additions or removals resulting from the setting. If there is no effect on your search, the section will say “No Change.”

setting_results.png

You can also click “Show walkthrough of your search settings” to see a step-by-step visual walkthrough of how your settings are modifying your search results.

visual_walkthrough.png

Once you’re happy with your settings, click Save. A summary of applied settings will be shown below the search container.

settings_summary_container.png 

Deduplication

Deduplication is the process of removing exact duplicates from your search results and document sets. By deduplicating your search, you can ensure that only one copy of each document will be returned within your search results. For a conceptual overview of deduplication on Everlaw, visit this help article.

If your search returns two or more documents that belong to the same duplicate group, search deduplication will return only one copy in your search. 

For example, let’s say we want all documents in a binder, but with any duplicates removed. Add the binder search term to your query builder, click More Options, then select “Deduplicate within search hits.” A count of how many documents are removed from your search as a result of this setting will be shown. Click “Save,” and your binder is now deduplicated, as indicated by the deduplicate tab on your search container. 

1_-_dedupe_settings.gif  

Your project may have a project-wide search setting enabled that hides all project duplicates by default. If so, the More Options tab will say “Hide All.”

 screenshot3.png 

This default setting is intended to help users who receive productions with many duplicates. These users often want to declutter their search results and document sets, but retain all Bates numbered documents in their database. It is rarely used in any other circumstances. You can override this default setting by clicking More Options and selecting a different deduplication setting.

 screenshot4.png 

When deduplicating in Everlaw, it is important to keep in mind the differences between project duplicates and search duplicates.

  • Search deduplication is a search-wide setting. After Everlaw identifies all documents that match your search criteria, it then removes duplicate copies of documents from the search results, leaving only a single copy of each document in your result set. 
  • Project duplicate status is project-wide. Project duplicates are identified during upload, so this identification step occurs before your search criteria are evaluated. If you hide all project duplicates, any document flagged as a "project duplicate" will be excluded from your results, regardless of the search criteria. We only recommend using the hide project duplicates setting if you will be running a series of searches and want to ensure that two different copies of the same document are never returned across the union of your searches.

To further illustrate this distinction, let's imagine that you have a project with only three documents, all of which are duplicates of each other.

dupes_example.png

Because documents 2 and 3 are ingested after document 1, they are considered project duplicates. Now let's imagine you run the following searches:

  • Search for all documents with control numbers that are 2 or greater, then deduplicate the search results
    • This search will return only one of either document 2 or document 3
  • Search for all documents with control numbers that are 2 or greater, then hide project duplicates
    • This search will return no results
  • Search for all documents, then hide project duplicates
    • This search will return only document 1
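The three-document scenario above can be modeled in a few lines of Python. This is only an illustrative sketch, not Everlaw's implementation: the `Doc` record, the `duplicate_group` and `is_project_duplicate` fields, and both helper functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    control_number: int
    duplicate_group: str        # hypothetical: identical for exact duplicates
    is_project_duplicate: bool  # hypothetical: flag assigned at upload time


# Three copies of the same document; 2 and 3 were ingested after 1.
PROJECT = [
    Doc(1, "g1", False),
    Doc(2, "g1", True),
    Doc(3, "g1", True),
]


def deduplicate_within_search(hits):
    """Keep one copy per duplicate group, chosen among the hits themselves."""
    kept = {}
    for doc in hits:
        kept.setdefault(doc.duplicate_group, doc)
    return list(kept.values())


def hide_project_duplicates(hits):
    """Drop every document flagged as a project duplicate, whatever the query."""
    return [doc for doc in hits if not doc.is_project_duplicate]


two_or_greater = [d for d in PROJECT if d.control_number >= 2]

search_dedup = deduplicate_within_search(two_or_greater)  # one of documents 2 or 3
hidden_subset = hide_project_duplicates(two_or_greater)   # no results
hidden_all = hide_project_duplicates(PROJECT)             # only document 1

print([d.control_number for d in search_dedup])
print([d.control_number for d in hidden_subset])
print([d.control_number for d in hidden_all])
```

The key difference in the sketch is where the decision is made: search deduplication looks only at the hits, while hiding project duplicates relies on a flag that was set before any search ran.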

Sampling

In the sampling section of More Options, you can choose to keep only a randomly sampled subset of your original set of document hits. Sampling is helpful for quickly triaging a set of documents. It’s also a useful setting for training predictive coding models. Training a prediction model with randomly sampled subsets of documents may help improve the precision and recall of the generated predictions.

Document sampling will always be applied after deduplication, and before grouping or removal. In other words, if you choose to sample your documents and also group them by email thread, your documents will be sampled before they are grouped into threads. This prevents partial email threads from appearing in your results table. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results.
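To see why sampling before grouping keeps threads whole, consider the rough sketch below. The thread data and the `group_by_thread` helper are hypothetical and exist only for illustration; the sampling step is replaced by a hard-coded result so the example is deterministic.

```python
# Hypothetical project: thread id -> emails in that thread.
THREADS = {"t1": ["m1", "m2", "m3"], "t2": ["m4", "m5"]}
THREAD_OF = {msg: t for t, msgs in THREADS.items() for msg in msgs}


def group_by_thread(hits):
    """Pull in every email that shares a thread with at least one hit."""
    threads = {THREAD_OF[m] for m in hits}
    return [m for t in sorted(threads) for m in THREADS[t]]


# Pretend the sampling step kept only this one email out of all hits.
sampled_hits = ["m2"]

# Grouping runs afterwards, so the whole thread comes back, not a fragment.
results = group_by_thread(sampled_hits)
print(results)
```

Had the order been reversed, sampling could have dropped individual emails from a thread that grouping had already assembled.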

Sampling probabilities apply to each document, rather than the set as a whole. For example, if you apply 10% sampling, each document in your results will have a 10% chance of being returned, rather than exactly 10% of documents being selected from the entire results set. This may affect the total number of documents returned when you sample from relatively small results sets. Another reason your total number of documents may differ from the expected percentage is if you have restricted document access.
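A small simulation makes the effect of per-document sampling concrete. This is a sketch under assumptions: `sample_per_document` is a hypothetical helper that models the behavior described above using Python's standard `random` module, and the set of 20 hits is made up.

```python
import random


def sample_per_document(hits, rate, rng):
    """Keep each hit independently with probability `rate`."""
    return [doc for doc in hits if rng.random() < rate]


rng = random.Random(0)       # fixed seed only so the sketch is repeatable
small_set = list(range(20))  # a small, made-up set of 20 hits

# Repeat the 10% sample many times and record how many documents come back.
sizes = [len(sample_per_document(small_set, 0.10, rng)) for _ in range(1000)]

print(min(sizes), max(sizes))   # the sample size varies between runs
print(sum(sizes) / len(sizes))  # the average is close to 10% of 20
```

On a set this small, individual runs can return noticeably more or fewer than two documents, even though the long-run average matches the chosen rate.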

Grouping

Grouping allows you to organize your search hits by context: exact duplicates, attachments, email threads, or document versions.  

  • Attachments: Documents in an attachment family. Includes the parent document, often an email, and its attachments.
  • Email Threads: Emails that comprise an email thread, including replies, reply all, and forwarded emails.  Grouping by email thread will also include attachments and duplicate emails.
  • Exact Duplicates: Duplicate copies of the document. A complete definition of duplicates is in this article.
  • Versions: Versions of the same document (produced and pre-produced, translated and untranslated, etc.)

When you include grouping in your search, you are pulling associated documents into the search results, even if those documents do not meet your search criteria. For example, if you're searching for documents with the word “fraud” and you group by email thread, the search results will include documents in the same email thread, even if they don’t contain the word “fraud.”

Grouping will always be applied after deduplication and sampling, but before removal. This means that documents removed during the deduplication step may get reintroduced if they are part of the context that you are grouping by. For example, if you group by "exact duplicates", your search results will contain duplicates even if you applied the "deduplicate within search" setting.  
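The reintroduction of duplicates can be sketched as follows. The document ids, the `DUPLICATE_GROUP` mapping, and both functions are hypothetical; they only model the ordering described above.

```python
# Hypothetical project: document id -> duplicate group id.
DUPLICATE_GROUP = {"A1": "A", "A2": "A", "B1": "B"}


def deduplicate(hits):
    """Keep the first hit seen for each duplicate group."""
    seen, kept = set(), []
    for doc in hits:
        group = DUPLICATE_GROUP[doc]
        if group not in seen:
            seen.add(group)
            kept.append(doc)
    return kept


def group_by_exact_duplicates(hits):
    """Pull in every project document that shares a duplicate group with a hit."""
    groups = {DUPLICATE_GROUP[doc] for doc in hits}
    return sorted(doc for doc, group in DUPLICATE_GROUP.items() if group in groups)


hits = ["A1", "A2", "B1"]
after_dedup = deduplicate(hits)                          # A2 is dropped here
after_grouping = group_by_exact_duplicates(after_dedup)  # and pulled back in here

print(after_dedup)
print(after_grouping)
```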

Note: You may see a “Rethreading in progress” warning when grouping by email thread. This indicates that email threads in your search may be incomplete or misrepresented until the rethreading task is complete. To learn more about email rethreading, visit this article.

Removal

Removal allows you to remove certain classes of documents from your search results. These classes are: parents, children, search hits, grouped non-hits, and non-inclusive emails.

  • Parents: The topmost member in a document grouping, such as the primary email to which other documents are attached.
    • Note: You cannot remove parents when grouping by email thread. This is to ensure that attachments are not displayed without their associated email parents in the results table.
  • Children: Any document that is not the parent in a group, such as email attachments or project duplicates. 
  • Search hits: Any document that would be returned by your search, after search deduplication is applied.
  • Grouped non-hits: All documents that are not designated as search hits, but were introduced via grouping.
  • Non-inclusive emails: You can only select this removal option when grouping by email threads. Inclusive emails are the minimum set of emails that creates the most “complete” email content in the thread. This set might comprise one email that is inclusive of all the thread's content, or it might comprise multiple emails that together create the set. Often it comprises only the last email in the branch, and all previous emails appear in the body of that document. Removing non-inclusive emails means the search results will only include emails that are not duplicates within the email thread, and are inclusive; attachments will be included if their parent email is included. Everlaw considers text, recipients, and attachments to determine inclusiveness. 

Removal is always applied after the other settings in More Options. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results.
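As a rough mental model, each removal class can be thought of as a filter over the grouped results. The sketch below is hypothetical: the `Row` record, its flags, and the `remove` helper are invented for illustration, and the non-inclusive emails class is left out because the inclusiveness calculation is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    doc_id: str
    is_parent: bool  # topmost member of its group
    is_hit: bool     # matched the search criteria itself


# Hypothetical grouped results: an email hit plus two attachments,
# one of which was only pulled in by grouping.
rows = [
    Row("email-1", is_parent=True, is_hit=True),
    Row("attach-1a", is_parent=False, is_hit=False),
    Row("attach-1b", is_parent=False, is_hit=True),
]

REMOVAL_RULES = {
    "parents": lambda r: r.is_parent,
    "children": lambda r: not r.is_parent,
    "search hits": lambda r: r.is_hit,
    "grouped non-hits": lambda r: not r.is_hit,
}


def remove(rows, *classes):
    """Drop every row that falls into any of the selected removal classes."""
    return [r for r in rows if not any(REMOVAL_RULES[c](r) for c in classes)]


print([r.doc_id for r in remove(rows, "parents")])
print([r.doc_id for r in remove(rows, "grouped non-hits")])
```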

Regardless of your removal settings, once you begin review you can always use the context panel to see other documents in each email thread, including duplicates and non-inclusive emails.

2_-_removal.gif

Viewing your search and adjusting search settings in the results table

Once you’ve run your search, you have the ability to view grouped documents and adjust search settings directly from the results table. 

Grouped documents are collapsed by default and indicated by a caret next to the parent document.  The total number of children in a document grouping is also displayed in parentheses (in the case of email threads, only other children and duplicate emails are included in the count, even though related attachments are grouped in this particular family).

rt1.png

On the left of the results table, you can click the expansion icon and expand all or collapse all document groups. 

rt2.png 

Child documents in a group will include decimaled row numbers that follow their parents. In this example, the parent is row #47 and its children are represented in the screenshot:

 rt3.png

If any of your grouping settings result in removing parents from your results, either on the search page or results table, the parents will appear as greyed-out documents, with children visible under them. The parents will not be affected by any export, batch modify, or production actions.

rt4.png

To adjust your previously applied search settings, click Options in the results table toolbar. This opens the same dialog as the More Options tab in search, with previously applied settings selected. If you change your settings in any way, the results table will update to reflect the new settings. This modification results in a new search, rather than an update to the original search, so the new search will be saved as a separate card on the homepage.

3_-_search_settings.gif 

Example search workflows using More Options

Below are some use cases for using More Options search settings. In these examples, you will need to use settings for multiple containers and/or multiple settings within each container. We'll start with a few that are a bit simpler. 

Ensure reviewers are assigned full email threads

You can use search settings to ensure that your search includes entire email threads. This can give you peace of mind that, when creating assignments, you have included all emails in a thread for review. 

Let's say we want to assign all emails from custodian Dasovich, grouped by email thread. Our search should look like this: 

Screen_Shot_2022-03-18_at_12.56.23_PM.png

Importantly, you'll want to make sure to group by Email Thread in the More Options tab: 

Screen_Shot_2020-04-03_at_4.09.16_PM.png

From here, click Search, and then assign your documents out! Email thread grouping will be respected in your assignment, and when possible, individual threads will be kept together and assigned to one user. 

Identify all non-primary duplicate documents to delete

Note that this workflow is only recommended if you've carefully considered the consequences and implications of performing a mass deletion of project documents identified to be duplicates. If you have questions or uncertainties, we encourage you to reach out to support@everlaw.com for help and guidance. 

Perhaps you want to delete all duplicate documents in your project to save on storage costs. To identify all duplicate documents, click the "All Documents" search card on your homepage. Then, click Refine to get to the query builder: 

Screen_Shot_2020-04-03_at_5.27.47_PM.png

Next, click More Options to bring up the search settings options. Ensure that all project duplicates are shown, then group by exact duplicates. Finally, remove the parents so that you can preserve one primary copy of each duplicate group.

Screen_Shot_2020-04-03_at_5.29.13_PM.png

The document set that results from this search will be all duplicate copies of documents in your project. If you wish, you can then delete those documents from your database via a batch action. 

Keep in mind that duplicates are identified by data intrinsic to documents. There may be important extrinsic data that Everlaw generates upon processing that you may lose when deleting duplicates in this way. For example:

  • A duplicate document can belong to different custodians and be found in different file paths. If you delete all duplicates, you may also lose non-duplicative information about custodians and file paths.
  • Or, since our global deduplication setting takes into account the attachment group (the document must be duplicative within the context of its attachment group to be deduplicated), deleting duplicates in your database after the fact may result in loss of document context. Generally, if you care about knowing which emails documents were attached to, and vice versa, you should not indiscriminately delete duplicate documents using this workflow. 

Finally, this workflow is not recommended for deleting emails from your database, particularly if you have email threading deduplication turned on in your project. To learn more about email threading deduplication, see this article. While our email threading deduplication is robust, there are some situations where non-duplicative emails are mistakenly identified to be duplicates by Everlaw. Bulk deleting emails could cause you to inadvertently remove unique emails from your database. 

Pre-production QA

You can use search settings to perform QA on your responsive documents before producing them. In this example, you want to ensure that you aren’t producing any privileged documents. First, add an extra AND operator into your query builder. Within it, select the “Coded” term and choose your responsiveness code. Click More Options and group it by Attachments. Then click Save.

Next, click the outer AND container, and choose the “Coded” term again. Select your Privilege code, then negate it by clicking it once.

Your search identifies documents marked for production, including attachments, that have not been coded for privilege. This allows you to easily check for coding inconsistencies before running a production and assign them for review.
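The logic of this query can be sketched with plain Python sets. Everything here is a hypothetical stand-in: the document ids, the `FAMILIES`, `RESPONSIVE`, and `PRIVILEGED` sets, and the `group_by_attachments` helper are made up to mirror the inner and outer containers described above.

```python
# Hypothetical attachment families and coding decisions.
FAMILIES = [
    {"email-1", "attach-1a", "attach-1b"},
    {"email-2"},
    {"email-3", "attach-3a"},
]
RESPONSIVE = {"email-1", "email-2"}
PRIVILEGED = {"attach-1a", "email-3"}


def group_by_attachments(hits):
    """Pull in every member of any attachment family that contains a hit."""
    grouped = set(hits)
    for family in FAMILIES:
        if family & hits:
            grouped |= family
    return grouped


# Inner container: Coded responsive, grouped by Attachments.
inner = group_by_attachments(RESPONSIVE)

# Outer container: AND NOT Coded privileged.
outer = inner - PRIVILEGED

print(sorted(inner))
print(sorted(outer))
```

In this toy data, `attach-1a` is pulled into the production set by grouping but then excluded by the negated privilege term, which is the kind of document the QA step is meant to surface.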

For additional information on running pre-production QA, see the Pre-Production Workflow Guide and Best Practices article.

Identifying the top (parent) email and all of its attachments

Let’s say you’ve got an upload and you’d like to find the first email in a thread as well as its attachments. First, add an extra AND operator into your query builder. Within it, select the “Uploaded” search term and choose the upload you’d like to search for. Click More Options, then group by email thread. Next, remove the children. Then, click Save. Your result includes the parent documents, or the top email, for every email thread in our upload. 

Next, we want to bring in the parent emails’ attachments. Click the outer container, then select More Options, then group by attachments. Then, click Save. We can interpret this search as the parent emails in threads in our upload, as well as the parent emails’ attachments.  
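The way settings on the inner and outer containers stack can be sketched like this. The data structures and function names are hypothetical, and the sketch assumes the first email listed in each thread is its parent.

```python
# Hypothetical data: thread id -> emails, top (parent) email first.
THREADS = {"t1": ["e1", "e2", "e3"], "t2": ["e4"]}
ATTACHMENTS = {"e1": ["a1", "a2"], "e3": ["a3"]}  # email -> its attachments
UPLOAD = {"e2", "e4"}                             # documents in the chosen upload


def thread_parents(hits):
    """Inner container: group by email thread, then remove the children."""
    return [emails[0] for emails in THREADS.values() if set(emails) & set(hits)]


def group_by_attachments(hits):
    """Outer container: pull in each remaining document's attachments."""
    results = []
    for doc in hits:
        results.append(doc)
        results.extend(ATTACHMENTS.get(doc, []))
    return results


inner = thread_parents(UPLOAD)       # top email of every thread touching the upload
outer = group_by_attachments(inner)  # those parents plus their attachments

print(inner)
print(outer)
```

Note that `a3` is not returned: its parent `e3` was removed as a child in the inner container, so the outer grouping never sees it.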

A good way to double-check the logic of your search is via the instant search preview. The grey bar will display the order of operations conducted in your search. 

Identifying all non-inclusive emails

Perhaps you want to get rid of non-inclusive emails in your database, so you’re interested in searching for all non-inclusive emails. This one is a bit tricky. We first need to search for all emails grouped by thread. Then, we’ll search for inclusive emails. Finally, we’ll combine those searches together and search for all emails grouped by thread, and not the inclusive ones!

First, start by searching for emails grouped by thread.  Add the Type term, and select Email. Click More Options and group your search by email thread. Ensure that you’ve chosen to show all duplicates, then click Save. Click Search on the search page. This will return all emails on your project, grouped by thread (including their attachments). It will also save your search to be used in the next step. 

ezgif.com-gif-maker.gif 

Create the second component of the search, which is "all emails that are inclusive." Since most settings in the previous search can be reused, you don’t need to create a brand new search. Click Options in the results table. Keep all settings the same, but in the Removal step, select Non-inclusive emails. Then click Save. This will return all inclusive emails because you removed the non-inclusive ones. 

5_-_inclusive_2.gif

Our final step is to find all emails, grouped by thread, that are non-inclusive. Create a new search by clicking the magnifying glass in the navigation bar. 

Add the Prior Search term and select the initial search we created: “Type Email, including duplicates, grouped by email threads.” Next, add Prior Search again, but select the second search (all inclusive emails). Finally, you should negate the second search by clicking the term to turn it red. 

ezgif.com-gif-maker__1_.gif

The way we can interpret this search, from top-down, is all emails grouped by thread, but NOT the inclusive ones. 
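In set terms, the final query is a difference between the two prior searches. The sketch below uses made-up message ids and assumes a single thread in which the last email quotes all earlier ones; it models only the combination step, not how inclusiveness is determined.

```python
# Prior search 1 (hypothetical results): all emails grouped by thread.
all_threaded = {"msg-1", "msg-2", "msg-3", "msg-3-attachment"}

# Prior search 2 (hypothetical results): same settings, non-inclusive emails removed.
inclusive = {"msg-3", "msg-3-attachment"}

# Final search: prior search 1 AND NOT prior search 2.
non_inclusive = all_threaded - inclusive

print(sorted(non_inclusive))
```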

 

Identify all standalone emails

This search will help you identify "standalone" emails, meaning that they have no email thread or any attachments. This is a complicated query because it requires three separate searches! 

Search 1:
First, we need to search for only emails that are children.

The search below isolates all emails that are attachments (children). Add the term "Type: Email." Then, in More Options, group by attachments and remove the parents.
Screen_Shot_2022-03-18_at_2.14.45_PM.png

Screen_Shot_2022-03-18_at_2.14.20_PM.png

Click Search to run the search. For future reference, let's rename the search to "Child Emails." 

Screen_Shot_2020-04-03_at_4.37.31_PM.png

Search 2:

Next, we need to find the email threads that contain those children. To do this, use the Prior Search term and select Child Emails as your prior search. Then, click More Options and group the search by email thread. This will bring in the email thread of those child attachments.

Screen_Shot_2022-03-18_at_2.17.02_PM.png

Screen_Shot_2020-04-03_at_4.41.33_PM.png

Click Search, and then rename this search to "Parents with Children." 

Screen_Shot_2020-04-03_at_4.42.18_PM.png

Search 3:

This search will now (finally!) isolate the standalone emails.

Create a search with "Type: Email" AND "Prior Search: Parents with Children." Then, negate the Prior Search because we want every email that does NOT have any children.

Screen_Shot_2022-03-18_at_2.18.03_PM.png

If you are specifically interested in documents without attachments that may still be part of a thread, you can additionally use the "Attachment Group Size" search term to search for documents that have zero attachment groups. This search term can only be used on complete projects.

Screen_Shot_2022-03-18_at_2.19.59_PM.png
