Administrator Deduplication Settings

Table of Contents:

Permissions

Your permission levels will determine the deduplication options that are available on the General tab of the Project Settings page. 

 

Organization admins 

Org admins who are also (1) project admins on at least one project in the database and (2) database admins (if OA access if turned off) can see and modify the following deduplication settings:

  • Hide project duplicates
  • Separate or combine exact and email duplicate
  • Change near duplicate group inclusion criteria

For more information on database OA access, please see this article. 

 

Project and database admins

If you are both a project and database admin, you only have access to the near duplicate grouping setting. 

 

Hide project duplicates

By default, all documents uploaded to the platform are shown in all searches. However, there is a global project setting that hides duplicates from all searches, meaning only “primary” documents of duplicate families will be returned as results. 

If the setting to combine email dupes with exact dupes is also turned on,:

  • For documents that are not emails, the deduplication will occur over all exact duplicates of the document
  • For documents that are emails, the deduplication will occur over all exact and email duplicates of the document

Definition of a "primary document" in an exact duplicate group

The “primary document” in an exact duplicate group is determined arbitrarily: the copy first uploaded to Everlaw is considered the primary document, and all others are considered “project duplicates." Additionally, if two duplicate documents are attached to two different emails, then the first one uploaded will be considered the “primary” document in that duplicate group. 

The “primary” document in an email duplicate group is the most complete version, as determined by text, metadata, and attachments. 

Circumstances in which you might enable "Hide project duplicates" 

This setting is primarily intended for users receiving productions with many duplicates who would like to mimic upload deduplication, while retaining all Bates numbered documents. Everlaw does not recommend turning on this setting outside of this use case. Note that this is a project-level, not database or organization-level, setting.

When enabled, the hide duplicates setting will be applied, by default, to all searches across the project. Users, however, have the option of adjusting it for individual searches via the search settings tab of the query builder. You can learn more about the search settings tab, and options for search deduplication, by reading this help article

 

Only primary documents that are direct matches to your search criteria will be returned. If only child duplicates match your search criteria, but are not the primary document, then no documents from that duplicate group will be returned. 

 

Search deduplication versus “hiding project duplicates”

To further illustrate the distinction between search and project duplicates, let's imagine that you have a project with only three documents, all of which are exact duplicates (not emails)  of each other.

Because documents #2.1 and #3.1 are ingested after document #1.1, they are identified as project duplicates. For the following searches:

  • Search for all documents with control numbers that are 2 or greater, then deduplicate the search results
    • This search returns document #2.1, because it has a lower control number than document #3.1
  • Search for all documents with control numbers that are 2 or greater and hide project duplicates
    • This search returns no results, because both #2.1 and #3.1 have been "hidden" from the search
  • Search for all documents and hide project duplicates
    • This search returns document #1.1, because it is the only document that is not "hidden" from the search

Combining exact and email duplicates 

Email duplicates are documents that Everlaw has determined represent the same email, despite textual differences and different hash values. For email-typed documents, Everlaw uses content similarity, close timestamps and other metadata, and email header fields to identify these documents. For documents not typed as emails, Everlaw analyzes the hash value and text to make the determination. Because email duplicates represent the same underlying email, you may wish to treat them the same as exact duplicates. This setting allows you to enforce that throughout your project

If this setting is turned on, both exact (hash) and email duplicates will be used when users are (1) performing search deduplication, (2) grouping by duplicates, or (3) applying autocode rules based on duplicates. If this setting is off, only exact duplicates are used across the three actions listed above.

This setting is turned on by default (recommended). Organization administrators who have administrative access to a given project can toggle this setting on or off themselves. If you are not an organization admin and wish to turn this setting off, please reach out to Everlaw Support (support@everlaw.com). 

Outcomes from "Hide project duplicates" and separating exact and email duplicates

Here are the implications of the four variations these two settings could be in:

1. Both settings are off

  • Project exact duplicates will not be hidden by default in search results in your project.
  • When performing search deduplication, duplicates-based grouping, and applying duplicates-based autocode rules, only exact duplicates will be used/affected.

2. Hide Project Duplicates is on, Email Duplicates is off

  • Project exact duplicates will be hidden by default in search results in your project. 
  • When performing search deduplication, duplicates-based grouping, and applying duplicates-based autocode rules, only exact duplicates will be used/affected.

3. Hide Project Duplicates is off, Email Duplicates is on

  • Project exact duplicates will not be hidden by default in search results in your project. 
  • When performing search deduplication, duplicates-based grouping, and applying duplicates-based autocode rules, both exact and email duplicates will be used/affected.

4. Both settings are on

  • Project exact and email duplicates will be hidden by default in search results. For non-email duplicate families (where all members will be exact duplicates, the “primary” version that is returned is the copy that was uploaded the earliest. For email duplicate families (where documents can be both exact and email duplicates of each other), the “primary” version that is returned is the most complete version. Completeness is determined by text, metadata, and attachments. 
  • When performing search deduplication, duplicates-based grouping, and applying duplicates-based autocode rules, both exact and email duplicates will be used/affected.

To understand this better, imagine you have a database with eight documents:

  • Documents 1 and 2 are not emails and are exact duplicates of each other, with 1 uploaded before 2
  • Documents 3 and 4 are not emails and are exact duplicates of each other, with 3 uploaded before 4
  • Documents 5, 6, 7, 8 are all emails. In fact, they represent the same email. In addition:
    • Documents 5 and 6 are exact duplicates of each other, with 5 uploaded before 6
    • Documents 7 and 8 are exact duplicates of each other, with 7 uploaded before 8

If email threading deduplication is turned off, the respective duplicate document families are:

  • Documents 1 and 2, with 1 being the “primary”
  • Documents 3 and 4, with 3 being the “primary”
  • Documents 5 and 6, with 5 being the “primary”
  • Documents 7 and 8, with 7 being the “primary”

If “hide all project duplicates” is turned on, then, for a search that hits on all documents, only documents 1, 3, 5, 7 will be returned.

 

If email threading deduplication is turned on, the respective duplicate document families are:

  • Documents 1 and 2, with 1 being the “primary”
  • Documents 3 and 4, with 3 being the “primary”
  • Documents 5, 6, 7, 8, with 7 being the “primary”

If “hide all project duplicates” is turned on, then, for a search that hits on all documents, only documents 1, 3, and 7 will be returned.

If the results are grouped by duplicates and email duplicates are removed, then documents 1, 2, 3, 4, 7, 8, and will be in the result set. The initial grouping action will pull in all exact and email duplicates; the second removal action will remove documents 5 and 6 because they are only email, and not exact, duplicates of the “primary” document in the duplicate family.

 

Change near duplicate grouping inclusion 

Near duplicate groups are formed through linking documents together that have 95% or more textual similarity and through the inclusion of exact and email duplicates. In essence, near duplicate documents are grouped together if they have 95% or more textual similarity to another document in the near duplicate group. By default, near duplicate groups by explicitly also include exact and email duplicates. For information on how near duplicates are defined, as well as how near duplicate groups are created, please see this article on duplicates

If you have the necessary permissions, you can choose to change the inclusion criteria for near duplicate groups. Note that because near duplicate groups are created on a database level, changing the near duplicate inclusion criteria will impact and change the near duplicate grouping inclusion criteria for all projects in the database. Also note that Difference Viewer displays the documents from the near duplicate groups chosen by this setting for comparison. 

There are three different grouping inclusion options for near duplicate groups: groups created with exact, email, and near duplicates, groups created with exact and near duplicates, or groups created with just near duplicates. Changing the inclusion of near duplicate group inclusion will begin the process of regrouping near duplicate groups; this process may take hours depending on the size of your database. For more information on checking the status of regrouping, please see this article about the status page and notifications

 

 

Near, exact, and email duplicate inclusion 

With this setting, all the exact and email duplicates of the near duplicates in the group will be explicitly included. This is the default and recommended setting for near duplicate groups because it brings into one group all related documents that are highly similar to each other.

Explicitly including exact and email duplicates can be important because there are instances where relying on a document’s text file alone can lead to inaccurate grouping. For example: receiving a production where there are no text files provided or, in the case of email dupes, having documents representing the same email, but with different header or footers because of collection methodology. Explicitly including exact and email duplicates ensures that these documents are always together in the same near duplicate group, even if the text files do not prove to be 95% or more textually similar due to a variety of circumstances. 

 

Near and exact duplicate inclusion 

With this setting, all exact duplicates of the near duplicates in the group will be explicitly included. Similar to the setting above, this means that exact duplicates that are not 95% or more textually similar will be included in the near duplicate group.

While most exact duplicates should have more than 95% textual similarity, it is possible that text files might not have been included or were inaccurate in a production you received, similar to the scenario described in the option above. This setting ensures that in these scenarios, all exact duplicates that might not be 95% or more textually similar are still grouped together. This setting does not inherently exclude all email duplicates. Rather, it only excludes email duplicates that are not 95% or more textually similar. 

 

Only near duplicate inclusion 

With this setting, only documents that are connected by  95% or more textual similarity are grouped together. This setting does not inherently exclude all exact and email duplicates. Rather, it only excludes exact and email duplicates that are not 95% or more textually similar. 

Have more questions? Submit a request

0 Comments

Article is closed for comments.