Deduplicate, Sample, Group, and Remove Search Hits Via "Search settings"

How do I deduplicate, sample, group, or remove my search hits?

Once you've constructed your search query, you can deduplicate, sample, group, and remove document hits that are returned by your query. These additional search settings are available through the "Search settings" widget at the bottom of each search container. Through the use of different search settings, you can build highly specific searches with only a few simple steps, or run very sophisticated search workflows (visit the "example search workflows using search settings" section below for more). 

Search settings can be applied independently to each logical search container in your query, as well as the query as a whole. With the exception of deduplication, which is only available for the entire query, you can apply any search setting, or combination of settings, to any logical container. To learn more about applying search settings on inner and outer search containers, see this section of the article.

Screen_Shot_2023-01-06_at_4.34.51_PM.png

Once you’re in the search settings dialog, you can select any combination of settings. The effect of each setting on your results is reflected below each section as a positive or negative number, indicating the number of document additions or removals resulting from the setting. If there is no effect on your search, the section will say “No Change.”

near dupe grouping option.PNG

You can also click “Show walkthrough of your search settings” to see a step-by-step visual walkthrough of how your settings are modifying your search results.

Screenshot_2023-04-28_at_8.46.08_AM.png

Once you’re happy with your settings, click Save. A summary of applied settings will be shown below the search container.

settings_summary_container.png 

Deduplication

Deduplication is the process of removing duplicates from your search results and document sets. By deduplicating your search, you can ensure that only one document from a duplicate group is returned within your search results. For a conceptual overview of deduplication on Everlaw, visit this help article.

If email threading deduplication is turned on in your case, deduplication will cover both exact and email duplicates. If email threading deduplication is turned off, only exact duplicate families are used. To learn more about this setting, please visit this help article on administrator deduplication settings.

Hide all project duplicates

Your project may also have a project-wide setting enabled that hides all non-primary documents in a duplicate family group by default. If so, the search settings tab will say “Hide All.” To learn more about the effects of this setting, please see this help article

 screenshot3.png 

You can override this default setting for individual searches by clicking search settings and selecting a different deduplication setting. 

Screenshot_2023-04-28_at_8.47.37_AM.png

Search deduplication versus project deduplication

When deduplicating searches in Everlaw, it is important to keep in mind the differences between project duplicates and search duplicates. 

  • Search deduplication is a search-wide setting. After Everlaw identifies all documents that match your search criteria, it then removes duplicate copies of documents from the search results, leaving only a single copy of each document in your result set. 
  • Project exact duplicates is a project-wide setting / status. Duplicates that are project exact duplicates are identified during upload. Therefore, this identification step occurs before evaluating your search criteria. If you hide all project exact duplicates, any document flagged as a "project duplicate" will be excluded from your results, regardless of the search criteria. We only recommend using the hide project duplicates setting if you will be running a series of searches and want to ensure that two different copies of the same document are never returned across the union of your searches.
    • If the setting to combine email dupes with exact dupes is turned on, this option will expand to cover “project exact and email duplicates.” This means that when deduplicating, both exact duplicates identified during upload and email duplicates identified on the platform will be taken into account. For documents that are not emails, only the “primary” document in the duplicate family (i.e., the earliest of the exact dupes uploaded to the platform) will be returned in the results, as described above. For documents that are emails, the most “complete” version of the email duplicates, as determined by text, metadata, and attachments, will be returned. For more information on hiding project duplicates and email threading deduplication settings, please see this article on admin deduplication settings.

To further illustrate this distinction, let's imagine that you have a project with only three documents, all of which are exact duplicates of each other. 

Because documents 2 and 3 are ingested after document 1, they are considered project duplicates. Now let's imagine you run the following searches:

  • Search for all documents with control numbers that are 2 or greater, then deduplicate the search results
    • This search will return only one of either document 2 or document 3
  • Search for all documents with control numbers that are 2 or greater, then hide project duplicates
    • This search will return no results
  • Search for all documents, then hide project duplicates
    • This search will return only document 1
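
To make the three searches above concrete, here is a minimal Python sketch of the same three-document project. It is an illustration only and does not reflect how Everlaw is implemented: the Doc class, the content_hash field, and the helper functions are hypothetical names, and the choice to keep the first copy encountered during search deduplication is an assumption (the article only says that one of document 2 or document 3 is returned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    control_number: int
    content_hash: str            # hypothetical stand-in for "exact duplicate of"
    is_project_duplicate: bool   # flagged at upload, before any search runs


# Three exact duplicates; documents 2 and 3 were ingested after document 1.
PROJECT = [
    Doc(1, "abc", False),
    Doc(2, "abc", True),
    Doc(3, "abc", True),
]


def search(docs, min_control_number=1):
    """Return documents whose control number is at least min_control_number."""
    return [d for d in docs if d.control_number >= min_control_number]


def deduplicate_within_search(hits):
    """Keep one copy per duplicate group, chosen from the hits themselves."""
    kept = {}
    for d in hits:
        kept.setdefault(d.content_hash, d)  # assumption: first copy seen wins
    return list(kept.values())


def hide_project_duplicates(hits):
    """Drop anything flagged as a project duplicate, whatever the criteria were."""
    return [d for d in hits if not d.is_project_duplicate]


search_a = deduplicate_within_search(search(PROJECT, min_control_number=2))
search_b = hide_project_duplicates(search(PROJECT, min_control_number=2))
search_c = hide_project_duplicates(search(PROJECT))

print([d.control_number for d in search_a])  # one of documents 2 and 3
print([d.control_number for d in search_b])  # []
print([d.control_number for d in search_c])  # [1]
```

The key difference is where the duplicate decision is made: search deduplication looks only at the hits, while hiding project duplicates relies on a flag that was set before the search criteria were evaluated.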

Sampling

In the sampling section of search settings, you can choose to keep only a randomly sampled subset of your original set of document hits. Sampling is helpful for quickly triaging a set of documents. It’s also a useful setting for training predictive coding models. Training a prediction model with randomly sampled subsets of documents may help improve the precision and recall of the generated predictions.

Document sampling will always be applied after deduplication, and before grouping or filtering decisions. In other words, if you choose to sample your documents and also group them by email thread, your documents will be sampled before they are grouped into threads. This prevents partial email threads from appearing in your results table. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results. 

Sampling probabilities apply to each document, rather than the set as a whole. For example, if you apply 10% sampling, each document in your results will have a 10% chance of being returned, rather than 10% of documents being selected from the entire results set. This may affect the total number of documents returned when you sample from relatively smaller results sets. Another reason your total number of documents may differ from the expected % is if you have restricted document access.
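
The difference between per-document sampling and taking a fixed fraction of the set can be seen in a short simulation. This is a sketch under stated assumptions, not Everlaw's sampling code: sample_per_document is a hypothetical helper, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def sample_per_document(hits, rate, rng):
    """Keep each hit independently with probability `rate`."""
    return [doc for doc in hits if rng.random() < rate]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable

small = list(range(20))        # a small result set
large = list(range(100_000))   # a large result set

# Five independent 10% samples of the small set, then one of the large set.
small_counts = [len(sample_per_document(small, 0.10, rng)) for _ in range(5)]
large_count = len(sample_per_document(large, 0.10, rng))

print(small_counts)  # counts scatter around 2 and can differ from sample to sample
print(large_count)   # close to 10,000, but not necessarily exactly 10,000
```

With 20 documents, a 10% sample can easily return 0, 1, 3, or 4 documents instead of 2. With 100,000 documents, the relative deviation from 10% becomes small.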

Grouping

Grouping allows you to organize your search hits by context: attachments, email threads, duplicates, near duplicates, or document versions.

  • Attachments: Documents in an attachment family. Includes the parent document, often an email, and its attachments.
  • Email Threads: Emails that comprise an email thread, including replies, reply all, and forwarded emails.  Grouping by email thread will also include attachments and both exact and email duplicates.
  • Exact or Exact and Email Duplicates: Duplicate copies of the document. If email threading deduplication is turned on in your project, the grouping will be by exact and email duplicates; if turned off, the grouping will be by exact duplicates only. A complete definition of duplicates is in this article.
  • Near Duplicates: Documents that are in the same near duplicate group based on textual similarity. By default, exact and email duplicates are included in near duplicate grouping. Please see this article on changing the duplicate inclusion criteria for near duplicate groups. To compare differences across your near duplicate group, use Difference Viewer.
  • Versions: Versions of the same document (produced and pre-produced, translated and untranslated, etc.)

When you include grouping in your search, you are pulling associated documents into the search results, even if those documents do not meet your search criteria. For example, if you're searching for documents with the word “fraud” and you group by email thread, the search results will include documents in the same email thread, even if they don’t have the word “fraud” in them.

Grouping will always be applied after deduplication and sampling, but before removal. This means that documents removed during the deduplication step may get reintroduced if they are part of the context that you are grouping by. For example, if you group by duplicates, your search results will contain duplicates even if you applied the "deduplicate within search" setting.  
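
The order of operations described above can be sketched as a small pipeline. This is a simplified model, not Everlaw's implementation: the document ids, the DUPLICATE_GROUP mapping, and the helper functions are hypothetical, and the sampling step (which would sit between deduplication and grouping) is left out to keep the output deterministic.

```python
# Hypothetical project: document id -> duplicate-group id.
DUPLICATE_GROUP = {"A1": "A", "A2": "A", "B1": "B", "C1": "C", "C2": "C"}


def deduplicate(hits):
    """Step 1: keep the first hit seen from each duplicate group."""
    seen, kept = set(), []
    for doc in hits:
        group = DUPLICATE_GROUP[doc]
        if group not in seen:
            seen.add(group)
            kept.append(doc)
    return kept


def group_by_duplicates(hits):
    """Step 3: pull in every document that shares a duplicate group with a hit."""
    groups = {DUPLICATE_GROUP[doc] for doc in hits}
    return [doc for doc in DUPLICATE_GROUP if DUPLICATE_GROUP[doc] in groups]


def remove_grouped_non_hits(results, hits):
    """Step 4 (removal runs last): drop documents that grouping introduced."""
    hit_set = set(hits)
    return [doc for doc in results if doc in hit_set]


raw_hits = ["A1", "A2", "C2"]                       # what the query itself matched
deduped = deduplicate(raw_hits)                     # A2 is dropped
grouped = group_by_duplicates(deduped)              # A2 comes back, C1 is pulled in
final = remove_grouped_non_hits(grouped, deduped)   # back to the deduplicated hits

print(deduped)  # ['A1', 'C2']
print(grouped)  # ['A1', 'A2', 'C1', 'C2']
print(final)    # ['A1', 'C2']
```

Because grouping runs after deduplication, A2 is removed in the first step and then reintroduced as part of A1's duplicate group, which is the behavior described in the paragraph above.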

Note: You may see a “Rethreading in progress” warning when grouping by email thread, or a “Regrouping in progress” warning when grouping by near duplicate groups. This indicates that email threads or near duplicate groups in your search may be incomplete or misrepresented until the associated rethreading and regrouping tasks are complete. To learn more about email rethreading and near duplicate regrouping statuses, visit this article.

regrouping rethreading screenshot.PNG

Removal

Removal allows you to remove certain classes of documents from your search results. These classes are: parents, children, search hits, grouped non-hits, email duplicates, and non-inclusive emails.

  • Parent: The topmost member in a document grouping, such as the primary email to which other documents are attached.
    • Note: You cannot remove parents when grouping by email thread. This is to ensure that attachments are not displayed without their associated email parents in the results table.
    • Note: For documents grouped by near duplicates, the parent of the near duplicate group is the document with the lowest Bates or control number. For more information on near duplicates, please see this article.
  • Children: Any document that is not the parent in a group, such as email attachments or non-primary duplicates. 
  • Search hits: Any document that would be returned by your search, after search deduplication is applied.
  • Grouped non-hits: All documents that are not search hits themselves, but were introduced via grouping.
  • Email duplicates: This option is only available if email threading deduplication is turned on in your project and you are grouping by exact and email dupes. This will remove all email duplicates of the primary email in the email dupe family, leaving only exact duplicates of the primary email. The primary email is the version of the email that Everlaw determines is the most complete, based on text, metadata, and attachments.
  • Non-inclusive emails: You can only select this removal option when grouping by email threads. Inclusive emails are the minimum set of emails that creates the most “complete” email content in the thread. This set might comprise one email that is inclusive of all the thread's content, or it might comprise multiple emails that together create the set. Often it comprises only the last email in the branch, and all previous emails appear in the body of that document. Removing non-inclusive emails means the search results will only include emails that are not duplicates within the email thread, and are inclusive; attachments will be included if their parent email is included. Everlaw considers text, recipients, and attachments to determine inclusiveness. 
    • Note: For both searches and STRs, ensure that "Email threading deduplication" is enabled in Project Settings when grouping by email threads and removing non-inclusive emails. If this is not enabled, you may receive an error. 

Removal is always applied after the other settings in search settings. You can use the document counts below each setting, as well as the search walkthrough at the bottom of the dialog, to better understand how these settings are impacting your final results.

Regardless of your removal settings, once you begin review you can always use the context panel to see other documents in each email thread, including duplicates and non-inclusive emails.

Screenshot_2023-04-28_at_8.50.55_AM.png

Viewing your search and adjusting search settings in the results table

Once you’ve run your search, you have the ability to view grouped documents and adjust search settings directly from the results table. 

Grouped documents are collapsed by default and indicated by a caret next to the parent document.  The total number of children in a document grouping is also displayed in parentheses (in the case of email threads, only other children and duplicate emails are included in the count, even though related attachments are grouped in this particular family).

rt1.png

On the left of the results table, you can click the expansion icon and expand all or collapse all document groups. 

rt2.png 

Child documents in a group will include decimaled row numbers that follow their parents. In this example, the parent is row #47 and its children are represented in the screenshot:

 rt3.png

If any of your grouping settings result in removing parents from your results, either on the search page or results table, the parents will appear as greyed-out documents, with children visible under them. The parents will not be affected by any export, batch modify, or production actions.

rt4.png

To adjust your previously applied search settings, click settings in the results table toolbar. Clicking the icon will prompt the same dialog as the search settings tab in search, with previously applied settings selected. If you change your settings in any way, then the results table will update to include the selected settings. This modification will result in a new search, as opposed to an update to the original search. As a result, the new search will be saved as a separate card on the homepage.

3_-_search_settings.gif 

Example search workflows using search settings

Below are some use cases for using search settings. In these examples, you will need to use settings for multiple containers and/or multiple settings within each container. We'll start with a few that are a bit simpler. 

Ensure reviewers are assigned full email threads

You can use search settings to ensure that your search includes entire email threads. This can give you peace of mind that, when creating assignments, you have included all emails in a thread for review. 

Let's say we want to assign all emails from custodian Dasovich, grouped by email thread. Our search should look like this: 

Screen_Shot_2022-03-18_at_12.56.23_PM.png

Importantly, you'll want to make sure to group by Email Thread in the search settings tab: 

Screenshot_2023-04-28_at_8.53.43_AM.png

From here, click Search, and then assign your documents out! Email thread grouping will be respected in your assignment, and when possible, individual threads will be kept together and assigned to one user. 

Identify all non-primary duplicate documents to delete

Note that this workflow is only recommended if you've carefully considered the consequences and implications of performing a mass deletion of project documents identified to be duplicates. If you have questions or uncertainties, we encourage you to reach out to support@everlaw.com for help and guidance. 

Perhaps you want to delete all duplicate documents in your project to save on storage costs. To identify all duplicate documents, click the "All Documents" search card on your homepage. Then, click Refine to get to the query builder: 

Screen_Shot_2020-04-03_at_5.27.47_PM.png

Next, click search settings to bring up the search settings options. Ensure that all project duplicates are shown, then group by duplicates. Finally, remove the parents so that you can preserve one primary copy of each duplicate group.

Screenshot_2023-04-28_at_8.55.45_AM.png

The document set that results from this search will be all duplicate copies of documents in your project. If you wish, you can then delete those documents from your database via a batch action. 

Keep in mind that duplicates are identified by data intrinsic to documents. There may be important extrinsic data that Everlaw generates upon processing that you may lose when deleting duplicates in this way. For example:

  • A duplicate document can belong to different custodians and be found in different file paths. If you delete all duplicates, you may also lose non-duplicative information about custodians and file paths.
  • Or, since our global deduplication setting takes into account the attachment group (the document must be duplicative within the context of its attachment group to be deduplicated), deleting duplicates in your database after the fact may result in loss of document context. Generally, if you care about knowing which emails documents were attached to, and vice versa, you should not indiscriminately delete duplicate documents using this workflow. 

Finally, this workflow is not recommended for deleting emails from your database, particularly if you have email threading deduplication turned on in your project. To learn more about email threading deduplication, see this article. While our email threading deduplication is robust, there are some situations where non-duplicative emails are mistakenly identified to be duplicates by Everlaw. Bulk deleting emails could cause you to inadvertently remove unique emails from your database. 

Pre-production QA

You can use search settings to perform QA on your responsive documents before producing them. In this example, you want to ensure that you aren’t producing any privileged documents. First, add an extra AND operator into your query builder. Within it, select the “Coded” term and choose your responsiveness code. Click search settings and group it by Attachments. Then click Save.

Next, click the outer AND container, and choose the “Coded” term again. Select your Privilege code, then negate it by clicking it once.

Your search identifies documents marked for production, including attachments, that have not been coded for privilege. This allows you to easily check for coding inconsistencies before running a production and assign them for review.
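
The logic of this nested search can be sketched with plain sets. This is an illustration under assumptions, not how Everlaw evaluates queries: the FAMILY mapping, the document ids, and the responsive and privilege_coded sets are all hypothetical sample data.

```python
# Hypothetical data: document id -> attachment family id.
FAMILY = {
    "email1": "f1", "att1a": "f1", "att1b": "f1",
    "email2": "f2",
    "memo3": "f3",
}
responsive = {"email1", "email2"}       # documents carrying the responsiveness code
privilege_coded = {"att1a", "email2"}   # documents carrying the Privilege code


def group_by_attachments(hits):
    """Pull in every member of each hit's attachment family."""
    families = {FAMILY[doc] for doc in hits}
    return {doc for doc in FAMILY if FAMILY[doc] in families}


# Inner container: Coded (responsive), grouped by Attachments.
inner = group_by_attachments(responsive)

# Outer AND container: inner result AND NOT Coded (Privilege).
result = inner - privilege_coded

print(sorted(inner))   # ['att1a', 'att1b', 'email1', 'email2']
print(sorted(result))  # ['att1b', 'email1']
```

Note that the grouping is applied on the inner container, so the attachments are pulled in before the negated Privilege term is evaluated against them.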

For additional information on running pre-production QA, see the Pre-Production Workflow Guide and Best Practices article.

Identifying the top (parent) email and all of its attachments

Let’s say you’ve got an upload and you’d like to find the first email in a thread as well as its attachments. First, add an extra AND operator into your query builder. Within it, select the “Uploaded” search term and choose the upload you’d like to search for. Click search settings, then group by email thread. Next, remove the children. Then, click Save. Your result includes the parent documents, or the top email, for every email thread in our upload. 

Next, we want to bring in the parent emails’ attachments. Click the outer container, then select search settings, then group by attachments. Then, click Save. We can interpret this search as the parent emails in threads in our upload, as well as the parent emails’ attachments.  
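
As a rough model of how the inner and outer settings combine here, consider the sketch below. It is a simplification with hypothetical data: each record is (doc_id, thread_id, attachment_parent, is_thread_parent), and the two helper functions only approximate the settings they are named after.

```python
# Hypothetical upload: (doc_id, thread_id, attachment_parent, is_thread_parent).
DOCS = [
    ("e1", "t1", None, True),    # top email of thread t1
    ("a1", "t1", "e1", False),   # attachment to e1
    ("e2", "t1", None, False),   # reply in thread t1
    ("a2", "t1", "e2", False),   # attachment to the reply
    ("e3", "t2", None, True),    # top email of thread t2, no attachments
]


def inner_container(docs):
    """Group by email thread, then remove children: only thread parents remain."""
    return [d for d in docs if d[3]]


def outer_container(parents, docs):
    """Group by attachments: add the attachments of each remaining parent."""
    parent_ids = {d[0] for d in parents}
    return [d for d in docs if d[0] in parent_ids or d[2] in parent_ids]


top_emails = inner_container(DOCS)
result = outer_container(top_emails, DOCS)

print([d[0] for d in top_emails])  # ['e1', 'e3']
print([d[0] for d in result])      # ['e1', 'a1', 'e3']
```

The reply e2 and its attachment a2 are dropped by the inner container, so the outer attachment grouping never sees them.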

A good way to double-check the logic of your search is via the instant search preview. The grey bar will display the order of operations conducted in your search. 

Identifying all non-inclusive emails

Perhaps you want to get rid of non-inclusive emails in your database, so you’re interested in searching for all non-inclusive emails. This one is a bit tricky. We first need to search for all emails grouped by thread. Then, we’ll search for inclusive emails. Finally, we’ll combine those searches together and search for all emails grouped by thread, and not the inclusive ones!

First, start by searching for emails grouped by thread.  Add the Type term, and select Email. Click search settings and group your search by email thread. Ensure that you’ve chosen to show all duplicates, then click Save. Click Search on the search page. This will return all emails on your project, grouped by thread (including their attachments). It will also save your search to be used in the next step. 

ezgif.com-gif-maker.gif 

Create the second component of the search, which is "all emails that are inclusive." Since most settings in the previous search can be reused, you don’t need to create a brand new search. Click Options in the results table. Keep all settings the same, but in the Removal step, select Non-inclusive emails. Then click Save. This will return all inclusive emails because you removed the non-inclusive ones. 

5_-_inclusive_2.gif

Our final step is to find all emails, grouped by thread, that are non-inclusive. Create a new search by clicking the magnifying glass in the navigation bar. 

Add the Prior Search term and select the initial search we created: “Type Email, including duplicates, grouped by email threads.” Next, add Prior Search again, but select the second search (all inclusive emails). Finally, you should negate the second search by clicking the term to turn it red. 

ezgif.com-gif-maker__1_.gif

The way we can interpret this search, from top-down, is all emails grouped by thread, but NOT the inclusive ones. 
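
The final query is effectively a set difference between the two prior searches. The sketch below models that with hypothetical data. The Doc class, the inclusive flag, and the helper functions are assumptions made for illustration; in practice, Everlaw determines inclusiveness from text, recipients, and attachments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    thread: str
    is_email: bool
    inclusive: bool = False        # hypothetical flag; only meaningful for emails
    parent: Optional[str] = None   # set on attachments: id of the parent email


PROJECT = [
    Doc("e1", "t1", True, inclusive=False),
    Doc("e2", "t1", True, inclusive=True),
    Doc("a2", "t1", False, parent="e2"),
    Doc("e3", "t2", True, inclusive=True),
    Doc("e4", "t3", True, inclusive=False),
    Doc("a4", "t3", False, parent="e4"),
    Doc("e5", "t3", True, inclusive=True),
]


def emails_grouped_by_thread(docs):
    """Search 1: Type Email, grouped by email thread (attachments come along)."""
    threads = {d.thread for d in docs if d.is_email}
    return {d.doc_id for d in docs if d.thread in threads}


def remove_non_inclusive(docs, result_ids):
    """Search 2: same search, with non-inclusive emails removed."""
    kept_emails = {d.doc_id for d in docs
                   if d.doc_id in result_ids and d.is_email and d.inclusive}
    # Attachments stay only if their parent email stays.
    kept_attachments = {d.doc_id for d in docs
                        if d.doc_id in result_ids and d.parent in kept_emails}
    return kept_emails | kept_attachments


search_1 = emails_grouped_by_thread(PROJECT)
search_2 = remove_non_inclusive(PROJECT, search_1)
non_inclusive = search_1 - search_2   # Prior Search 1 AND NOT Prior Search 2

print(sorted(non_inclusive))  # ['a4', 'e1', 'e4']
```

In this model the result contains the non-inclusive emails e1 and e4, plus a4, the attachment whose parent email was non-inclusive.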

Identify all standalone emails

This search will help you identify "standalone" emails, meaning emails that are not part of an email thread and have no attachments. This is a complicated query because it requires three separate searches.

Search 1: 

First, we need to search for only emails that are children. 

The below search isolates all emails that are attachments (children). Add the term "Type: Email." Then, in search settings, group by attachments and remove the parent.

Screen_Shot_2022-03-18_at_2.14.45_PM.png

Screen_Shot_2022-03-18_at_2.14.20_PM.png

Click Search to run the search. For future reference, let's rename the search to "Child Emails." 

Screen_Shot_2020-04-03_at_4.37.31_PM.png

Search 2:

Next, we need to bring in the email threads that contain those child emails. To do this, use the Prior Search term and select Child Emails as your prior search. Then, click search settings and group the search by email thread. This will bring in the email thread of those child attachments.

Screen_Shot_2022-03-18_at_2.17.02_PM.png

Screen_Shot_2020-04-03_at_4.41.33_PM.png

Click Search, and then rename this search to "Parents with Children."

Screen_Shot_2020-04-03_at_4.42.18_PM.png

Search 3:

This search will now (finally!) isolate the standalone emails.

Create a search with "Type: Email" AND "Prior Search: Parents with Children." Then, negate the Prior Search because we want every email that does NOT have any children.

Screen_Shot_2022-03-18_at_2.18.03_PM.png

If you are interested in specifically searching for documents without attachments that may still be a part of a thread, you can additionally use the 'Attachment Group Size' search term to search for documents that have zero attachment groups. This search term can only be used on complete projects.

Screen_Shot_2022-03-18_at_2.19.59_PM.png

Configuring the default search grouping setting in Project Settings

If the majority of searches in your project should be grouped in a certain way, as a Project admin, you can configure the default search grouping in Project Settings. 

For example, let’s say your team needs to review documents in the context of their entire attachment families when running a search. In the General tab of Project Settings, you can enable default search grouping, then select “Attachments” from the Grouping dropdown and “None (keep all)” from the Remove from group dropdown.

Default_search_grouping.gif

This grouping (by attachments) will be the initial grouping for all new searches created on your project. Note that it will pull in associated documents, even if those documents do not meet the search criteria.

All users can override the default on their current search by clicking “Search settings” and selecting the desired grouping for that search. For example, a reviewer on this project may want documents ordered in the results table according to attachment family, but does not wish to pull in documents that do not meet the search criteria for their current search. They can update the setting on their current search to remove grouped non-hits.

Override_default_search_grouping.gif

 

Default search grouping in new projects

By default, new projects created on Everlaw (after Release 89) will have the default search grouping setting enabled and set to group by attachments without grouped non-hits. This grouping—by attachments without grouped non-hits—orders documents in the results table according to their attachment family and does not pull in associated documents that do not meet the search criteria. It visually groups together members of the same attachment families that meet the search criteria in the results table and ensures that they are directly sequential, even when their Bates/Control numbering is not sequential. To learn more about viewing document groups in the results table, please see this article.

Default search grouping may be copied to a new project by using a template to create a new project. You can learn more about using a template to create a new project in this article.
