Types of Duplicates and Deduplication Options


Duplicate Types

There are three types of duplicates you can encounter while using Everlaw.

Type: Exact
Definition: Everlaw uses hash values to determine exact duplicates. A hash value is essentially a unique fingerprint based on the native file. There are two types of hash values: SHA1 and MD5. The hash value takes into account the document’s text and intrinsic metadata (e.g., author, date created). Extrinsic metadata values (e.g., custodian, file path) are not used when generating the hash value. If no hash is available, documents with the same text are treated as exact duplicates.
Application:
  • Upload deduplication: During native upload, you have various options for deduplicating by exact duplicates.
  • Review: During review, you have various options for deduplicating result sets, viewing, grouping, and setting up autocode rules based on exact duplicates.

Type: Email
Definition: Email duplicates are documents that Everlaw has determined represent the same email, despite textual differences and different hash values. For email-typed documents, Everlaw uses content similarity, close timestamps and other metadata, and email header fields to identify these documents. For documents not typed as emails, Everlaw analyzes the hash value and text to make the determination.
Application:
  • Review: While in the review window, you have the option to view and take action on email duplicates.
  • In addition, if email threading deduplication is turned on for your project, email duplicates will also be used for search deduplication, grouping, and autocode.

Type: Near
Definition: Near duplicates are documents that are highly textually similar.
Application:
  • Review: While in the review window, you have the option to view and take action on near duplicates.
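The hash-based definition of exact duplicates can be illustrated with a minimal sketch. It assumes the hash is computed over the raw bytes of the native file, which is an illustration only; how Everlaw actually computes its hash values is not described here, and the function names are hypothetical. Note that the file name (an extrinsic property) plays no part in the digest.

```python
import hashlib
from collections import defaultdict


def file_hashes(data: bytes) -> tuple[str, str]:
    """Return the (SHA1, MD5) hex digests of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest(), hashlib.md5(data).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents share the same SHA1 digest."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        sha1, _md5 = file_hashes(data)
        by_hash[sha1].append(name)
    # Keep only groups with more than one member, i.e. actual duplicates.
    return [names for names in by_hash.values() if len(names) > 1]


files = {
    "a.docx": b"quarterly report",
    "b.docx": b"quarterly report",     # same bytes as a.docx
    "c.docx": b"quarterly report v2",  # different bytes
}
print(group_exact_duplicates(files))  # [['a.docx', 'b.docx']]
```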

 

Additional information about email duplicates

Email duplicates are a special type of near duplicate that Everlaw has determined represent the same underlying email. In this sense, email duplicates are conceptually closer to exact duplicates: textual differences between near duplicates, even small ones, are more likely to be materially meaningful than the textual differences between email duplicates. For example, two copies of the same bureaucratic form filled in with different values are categorically different documents, while two copies of the same email, one of which is missing a header field or part of the footer, are not.

Common differences underlying email duplicates include versions of the collected email that: (1) are missing timestamps or some other header field, (2) are missing attachments, (3) have clipped footer text, or (4) have different amounts of text from any preceding email(s). As a result, while a family of email duplicates all represent the same underlying email, there may be versions that are more complete than others. The most complete version – determined by presence of timestamps, presence of attachments, and text – is treated as the “parent” of a given email duplicate family. “Parents” are shown first when grouping or listing email duplicates, which also means they will be the version retained when search deduplication is applied, or when “children” documents are removed from sets grouped by duplicates.
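As a rough illustration of how a "parent" could be chosen within an email duplicate family, the sketch below ranks copies by the three factors named above. The exact weighting Everlaw uses is not documented here, so the ordering (timestamp, then attachments, then text length) and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class EmailCopy:
    doc_id: int
    has_timestamp: bool
    attachment_count: int
    text_length: int


def completeness_key(copy: EmailCopy) -> tuple[bool, int, int]:
    # Hypothetical ordering: timestamp first, then attachments, then text.
    return (copy.has_timestamp, copy.attachment_count, copy.text_length)


def order_family(family: list[EmailCopy]) -> list[EmailCopy]:
    """Return the family with the most complete copy (the "parent") first."""
    return sorted(family, key=completeness_key, reverse=True)


family = [
    EmailCopy(10, has_timestamp=False, attachment_count=1, text_length=900),
    EmailCopy(11, has_timestamp=True, attachment_count=1, text_length=850),
    EmailCopy(12, has_timestamp=True, attachment_count=0, text_length=900),
]
print([c.doc_id for c in order_family(family)])  # [11, 12, 10]
```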


Availability of exact and email duplicates

By default, email threading deduplication is turned on for all projects. This means that both exact and email duplicates are used when (1) deduplicating searches, (2)  grouping documents by duplicates, and (3) applying autocode rules based on duplicates. 

If email threading deduplication is turned off, then only exact duplicates are used across the three actions listed above. Everlaw will explicitly tell you whether duplicates include email duplicates or are based only on exact duplicates.


You can request that email threading deduplication be turned off for your project by contacting Everlaw Support. Org admins can toggle this setting on or off for projects in their organization without going through Support. 


Deduplication

Deduplication is the process of removing duplicates from the action you’re taking. For example, you can deduplicate upon uploading and/or during search. Deduplication can save you and your reviewers time and energy by reducing the need to sift through identical documents. 


Upload Deduplication

Upload deduplication occurs when native files are first uploaded to Everlaw. With upload deduplication, documents will not be uploaded if they are exact duplicates of documents that already exist in the database or are in the same upload. 

Upload deduplication affects native uploads only. Processed data uploads will not undergo upload deduplication. 

Note that upload deduplication respects document families. This means that documents attached to families are not considered duplicates unless the entire family is a duplicate. For example, let’s say we have a spreadsheet that is an attachment to an email thread. We would like to upload a different email thread that has the same spreadsheet attached. While the spreadsheets are duplicate files in a vacuum, the families they are a part of are not, so both copies of the spreadsheet will be uploaded, along with their families.
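The family rule can be sketched as follows, assuming each family is represented by the list of its members' hash values. The representation and function names are hypothetical and only illustrate the rule that a family is skipped only when the whole family is a duplicate.

```python
def family_signature(member_hashes: list[str]) -> tuple[str, ...]:
    """A family is identified by the sorted hashes of all of its members."""
    return tuple(sorted(member_hashes))


def families_to_upload(
    existing: list[list[str]], incoming: list[list[str]]
) -> list[list[str]]:
    """Keep an incoming family unless an identical family already exists."""
    seen = {family_signature(f) for f in existing}
    kept = []
    for family in incoming:
        sig = family_signature(family)
        if sig not in seen:
            kept.append(family)
            seen.add(sig)  # also deduplicates within the same upload
    return kept


existing = [["hash_email_A", "hash_sheet"]]
incoming = [
    ["hash_email_B", "hash_sheet"],  # same spreadsheet, different email: kept
    ["hash_email_A", "hash_sheet"],  # entire family already exists: skipped
]
print(families_to_upload(existing, incoming))
# [['hash_email_B', 'hash_sheet']]
```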

You can adjust upload deduplication settings when uploading native data. By default, upload deduplication is global, meaning that Everlaw will deduplicate against all documents in the database. You also have the option of deduplicating only within the same custodian, or of not deduplicating at all.

Even if you choose to deduplicate globally, Everlaw will preserve a record of the deduplicated document in the All Custodians and All Paths fields of the existing document in the database. This means that if a document with custodian Sam is deduplicated against a document with custodian Jenny, the existing document in the database will now list both Sam and Jenny in its All Custodians metadata.
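The three scope options and the All Custodians record can be modeled with a small sketch. The `scope` values, the `StoredDoc` class, and the `upload` function are hypothetical names used for illustration, and the sketch ignores document families for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDoc:
    file_hash: str
    custodian: str
    all_custodians: list[str] = field(default_factory=list)


def upload(database: list[StoredDoc], file_hash: str, custodian: str,
           scope: str = "global") -> bool:
    """Upload one document; return True if a new document was stored.

    scope is 'global', 'custodian', or 'none' (assumed labels).
    """
    if scope != "none":
        for doc in database:
            in_scope = scope == "global" or doc.custodian == custodian
            if doc.file_hash == file_hash and in_scope:
                # Record the deduplicated copy's custodian on the existing doc.
                if custodian not in doc.all_custodians:
                    doc.all_custodians.append(custodian)
                return False
    database.append(StoredDoc(file_hash, custodian, [custodian]))
    return True


db: list[StoredDoc] = []
upload(db, "h1", "Jenny")          # stored
upload(db, "h1", "Sam")            # deduplicated against Jenny's copy
print(len(db), db[0].all_custodians)  # 1 ['Jenny', 'Sam']
```

With `scope="custodian"`, the same two uploads would both be stored, because Sam's copy is only compared against Sam's existing documents.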

You can learn more about uploading native data in this help article.



Search deduplication

When search deduplication is applied, only one copy in each duplicate document family is returned within your search results. For exact duplicate families, the version with the lowest Bates/control number will be returned, which corresponds to the earliest copy uploaded to the platform. For email duplicate families, the most complete version will be returned. Completeness is determined by factors like text, presence of metadata fields, and attachments.

Unlike upload deduplication, search deduplication will not respect families. If two duplicate documents are attached to two different emails, then search deduplication will result in only one of these duplicates being returned in the search. 

Search deduplication is an option you can apply to any search. You can select “Deduplicate within search hits” in the search settings tab, which you can learn more about by reading this search article.


Note that email duplicate identification occurs after documents are uploaded to Everlaw. If new documents are uploaded to your project and email rethreading has not completed, your email duplicate information may be stale or inaccurate. You can see the status of any ongoing rethreading via the Project Settings > Statuses tab. In addition, if rethreading is ongoing, a warning will appear under the duplicates context in the grouping section of the search settings widget.



[Org Admin or by request] Hide duplicates across your entire project

By default, all documents uploaded to the platform are shown in all searches. However, there is a global project setting that hides duplicates from all searches, meaning only “primary” documents of duplicate families will be returned as results. If email threading deduplication is turned on, this operation will occur over exact and email duplicate families; if email threading deduplication is turned off, this operation will only occur over exact duplicate families. To learn more about email duplicates, see this article.

You can ask Everlaw Support to turn on this setting for your project. Org Admins can also enable this setting if they are a Database Admin and the project is in their organization. This setting is primarily intended for clients receiving productions with many duplicates who would like to mimic upload deduplication while retaining all Bates-numbered documents. Everlaw does not recommend turning on this setting outside of this use case. Note that this is a project-level setting, not a database-level or organization-level one.

The “primary document” in an exact duplicate group is determined by upload order rather than by content: the copy first uploaded to Everlaw is considered the primary document, and all others are considered “project duplicates.” Additionally, if two duplicate documents are attached to two different emails, then the first one uploaded will be considered the “primary” document in that duplicate group.


The “primary” document in an email duplicate group is the most complete version, as determined by text, metadata, and attachments. 

When enabled, the hide duplicates setting will be applied, by default, to all searches across the project. Users, however, have the option of adjusting it for individual searches via the search settings tab of the query builder. You can learn more about the search settings tab, and options for search deduplication, by reading this help article.


Only primary documents that are direct matches to your search criteria will be returned. If the only documents in a duplicate group that match your search criteria are non-primary duplicates, then no documents from that duplicate group will be returned.


Search deduplication versus “hiding project duplicates”

Let’s use a basic example to illustrate the difference between search deduplication and hiding the duplicates entirely. We have a duplicate group where document #123 belongs to Custodian Greg and its duplicate, document #456, belongs to Custodian Dean.

Now we run a search for “documents with Custodian: Dean.”


In search settings, we would select “Deduplicate within search hits.” 


In this example, document #456 (Custodian: Dean) will be returned in our search results because search deduplication first applies your search criteria and then returns one copy from each duplicate group among the hits.

Now, let’s change our option to “Hide all project duplicates.”

In this example, document #123 is deemed the “primary” document in its duplicate group and #456 the “duplicate” copy. As a result, we will get no results for our search: document #456 is hidden due to its status as a “duplicate,” even though it matches the search criteria. Said generally, hiding project duplicates first narrows the project to primary documents, then searches across those primary documents.
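The difference comes down to the order of two steps, filtering and deduplicating. The sketch below models the Greg/Dean example; the `Doc` class, the `group` identifier, and the function names are hypothetical and are not part of Everlaw.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Doc:
    number: int
    custodian: str
    group: str      # hypothetical duplicate-group identifier
    primary: bool


DOCS = [
    Doc(123, "Greg", "g1", primary=True),
    Doc(456, "Dean", "g1", primary=False),
]


def search_dedup(docs: list[Doc], matches: Callable[[Doc], bool]) -> list[int]:
    """Filter first, then keep one hit (lowest number) per duplicate group."""
    kept: dict[str, Doc] = {}
    for doc in sorted(docs, key=lambda d: d.number):
        if matches(doc) and doc.group not in kept:
            kept[doc.group] = doc
    return [d.number for d in kept.values()]


def hide_project_duplicates(docs: list[Doc],
                            matches: Callable[[Doc], bool]) -> list[int]:
    """Drop non-primary documents first, then filter what remains."""
    return [d.number for d in docs if d.primary and matches(d)]


def is_dean(doc: Doc) -> bool:
    return doc.custodian == "Dean"


print(search_dedup(DOCS, is_dean))             # [456]
print(hide_project_duplicates(DOCS, is_dean))  # []
```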


Implications of Exact and Email duplicate group settings

To understand exact and email duplicates better in the context of search and deduplication settings, imagine you have a database with eight documents:

  • Documents 1 and 2 are not emails and are exact duplicates of each other, with 1 uploaded before 2
  • Documents 3 and 4 are not emails and are exact duplicates of each other, with 3 uploaded before 4
  • Documents 5, 6, 7, 8 are all emails. In fact, they represent the same email. In addition:
    • Documents 5 and 6 are exact duplicates of each other, with 5 uploaded before 6
    • Documents 7 and 8 are exact duplicates of each other, with 7 uploaded before 8

If email threading deduplication is turned off, the respective duplicate document families are:

  • Documents 1 and 2, with 1 being the “primary”
  • Documents 3 and 4, with 3 being the “primary”
  • Documents 5 and 6, with 5 being the “primary”
  • Documents 7 and 8, with 7 being the “primary”

If “hide all project duplicates” is turned on, then, for a search that hits on all documents, only documents 1, 3, 5, 7 will be returned.


If email threading deduplication is turned on, the respective duplicate document families are:

  • Documents 1 and 2, with 1 being the “primary”
  • Documents 3 and 4, with 3 being the “primary”
  • Documents 5, 6, 7, 8, with 7 being the “primary”

If “hide all project duplicates” is turned on, then, for a search that hits on all documents, only documents 1, 3, and 7 will be returned.


If the results are grouped by duplicates and email duplicates are removed, then documents 1, 2, 3, 4, 7, and 8 will be in the result set. The initial grouping action will pull in all exact and email duplicates; the second removal action will remove documents 5 and 6 because they are only email duplicates, and not exact duplicates, of the “primary” document in the duplicate family.
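The eight-document example can be reproduced with a short sketch. The family data below is copied from the example (including document 7 as the most complete email copy); the function names and data layout are assumptions made for illustration.

```python
# Exact-duplicate pairs, each listed in upload order (first uploaded first).
EXACT_FAMILIES = [[1, 2], [3, 4], [5, 6], [7, 8]]
# Documents that represent the same email, and the most complete copy,
# as stated in the example.
EMAIL_FAMILY = {5, 6, 7, 8}
EMAIL_PRIMARY = 7


def duplicate_families(email_threading_dedup: bool) -> dict[int, list[int]]:
    """Map each primary document to the members of its duplicate family."""
    families: dict[int, list[int]] = {}
    for family in EXACT_FAMILIES:
        if email_threading_dedup and set(family) <= EMAIL_FAMILY:
            continue  # absorbed into the larger email duplicate family
        families[family[0]] = family  # first uploaded copy is the primary
    if email_threading_dedup:
        families[EMAIL_PRIMARY] = sorted(EMAIL_FAMILY)
    return families


def hide_all_project_duplicates(email_threading_dedup: bool) -> list[int]:
    """A search hitting every document returns only the primaries."""
    return sorted(duplicate_families(email_threading_dedup))


def group_then_remove_email_duplicates() -> list[int]:
    """Group by duplicates, then drop members that are only email
    duplicates (not exact duplicates) of their family's primary."""
    result: list[int] = []
    for primary, members in duplicate_families(True).items():
        exact_of_primary = next(f for f in EXACT_FAMILIES if primary in f)
        result.extend(m for m in members if m in exact_of_primary)
    return sorted(result)


print(hide_all_project_duplicates(False))    # [1, 3, 5, 7]
print(hide_all_project_duplicates(True))     # [1, 3, 7]
print(group_then_remove_email_duplicates())  # [1, 2, 3, 4, 7, 8]
```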

