Types of Duplicates and Deduplication Options – Knowledge Base

Some document sets include duplicates, which are multiple representations of the same document. Review teams often want to avoid spending time on duplicative information, so they deduplicate their sets and review only one copy of each document. On Everlaw, you have flexibility in how you handle duplicate documents. This article describes the types of duplicate documents and how they are identified, and explains how to deduplicate your documents at different phases of your project.

Use this article to understand the different duplicate types and to determine when and how to deduplicate your documents.

Duplicate Types

There are three types of duplicates you can encounter when using Everlaw: Exact, Email, and Near. These are summarized below and described in more detail in the sections that follow.

Type: Exact

Definition: Everlaw uses hash values to determine exact duplicates. A hash value is essentially a unique fingerprint based on the native file. There are two types of hash values: SHA1 and MD5. For documents that do not have MD5 or SHA1 hashes, such as processed data with missing hash data in the load file, exact duplicates are identified by comparing hashes created from the documents’ text files.

Application:
Upload deduplication: During native upload, you have various options for deduplicating by exact duplicates.
Review: During review, you have various options for deduplicating result sets, viewing, grouping, and setting up autocode rules based on exact duplicates.

Type: Email

Definition: Email duplicates are documents that Everlaw has determined represent the same email, despite textual differences and different hash values. For email-typed documents, Everlaw uses content similarity, close timestamps and other metadata, and email header fields to identify these documents. For documents not typed as emails, Everlaw analyzes the hash value and text to make the determination.

Application:
Review: While in the review window, you have the option to view and take action on email duplicates. In addition, if email threading deduplication is turned on for your project, email duplicates are also used for search deduplication, grouping, and autocode.

Type: Near

Definition: Near duplicates are documents that are in the same near duplicate group. Near duplicate groups by default include exact and email duplicates and are formed through a 95% text similarity web. For example, if Document A is 95% similar to Document B, which is 95% similar to Document C, Documents A and C would be near duplicates of each other even though they may be less than 95% similar to each other textually.

Application:
Review: While in the review window, you have the option to view and take action on near duplicates. You can also deduplicate and group search results by near duplicates. To compare differences across near duplicate groups, use Difference Viewer. This tool allows you to review the differences between sets of near duplicate documents in a single view.

Exact duplicates

Exact duplicates are determined by native document hash values, which are effectively unique to each native document. The two types of hash values are MD5 and SHA1. Each algorithm computes a fixed-length value from the document’s raw bytes (the underlying data of the file), so in practice any change to the file produces a different hash. Because the algorithms are deterministic, the same native file uploaded multiple times always has the same hash, which allows the copies to be flagged as exact duplicates.
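
To make the hashing idea concrete, here is a minimal sketch using Python's standard hashlib module. It illustrates how MD5 and SHA1 behave in general and is not Everlaw's implementation; the function name and the sample byte strings are hypothetical.

```python
import hashlib


def native_hashes(raw_bytes: bytes) -> dict:
    """Return the MD5 and SHA1 hex digests of a file's raw bytes."""
    return {
        "md5": hashlib.md5(raw_bytes).hexdigest(),
        "sha1": hashlib.sha1(raw_bytes).hexdigest(),
    }


# Hypothetical file contents, used only for illustration.
first_upload = b"Quarterly report, final version."
second_upload = b"Quarterly report, final version."  # same bytes, uploaded again
edited_copy = b"Quarterly report, final version!"    # a single byte differs

# Identical bytes always hash to the same values.
same = native_hashes(first_upload) == native_hashes(second_upload)
# A one-byte change yields different digests.
different = native_hashes(first_upload) != native_hashes(edited_copy)
print(same, different)
```

Two uploads with matching digests would be candidates for exact duplicate flagging, while the edited copy would not.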

For processed data that does not have an MD5 or SHA1 hash in its load file, Everlaw identifies exact duplicates using a hash that it generates from the document’s text file. To generate this hash, a document must have a minimum of eight words. This prevents documents from being inaccurately identified as duplicates based on a few words of placeholder text.
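
The eight-word minimum can be sketched as a guard in front of a text hash. The article does not say which algorithm or text normalization Everlaw uses, so the SHA1 call, the whitespace collapsing, and the function name below are assumptions made purely for illustration.

```python
import hashlib
from typing import Optional

MIN_WORDS = 8  # minimum word count stated in this article


def text_hash(text: str) -> Optional[str]:
    """Hash a document's extracted text, or return None if it is too short."""
    words = text.split()
    if len(words) < MIN_WORDS:
        # Too little text: skip hashing so short placeholder text
        # is not treated as an exact duplicate.
        return None
    # Assumption: whitespace is collapsed before hashing.
    normalized = " ".join(words)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


placeholder = "Please refer to the native file"          # 6 words
long_enough = "one two three four five six seven eight"  # 8 words
print(text_hash(placeholder) is None, text_hash(long_enough) is not None)
```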

Project administrators can change the definition of exact duplicates – for example, to only use MD5 hashes – in the project metadata settings. For more information, please see this article about metadata and metadata settings.

Email duplicates

Email duplicates are a special type of duplicate that Everlaw determines represent the same underlying email despite slight textual differences. Everlaw identifies email duplicates using text comparison and email threading. To identify emails that appear to fit into the same spot in a given thread, it uses:

The email text content
Timestamp metadata: Two emails with timestamp metadata up to 2 minutes apart can be identified as email duplicates
Attachments and embedded emails
From
To
Cc
Bcc
Date
Subject
Message IDs
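
As a small illustration of the timestamp criterion only, the check below treats two sent times as compatible when they are at most two minutes apart. It is a sketch, not Everlaw's matching logic: the function name is hypothetical, and the other signals in the list above (text, attachments, header fields) are not modeled.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=2)  # window stated in this article


def timestamps_compatible(sent_a: datetime, sent_b: datetime) -> bool:
    """Return True when two timestamps are no more than two minutes apart."""
    return abs(sent_a - sent_b) <= MAX_GAP


base = datetime(2024, 1, 1, 9, 0, 0)
print(timestamps_compatible(base, datetime(2024, 1, 1, 9, 1, 30)))
print(timestamps_compatible(base, datetime(2024, 1, 1, 9, 5, 0)))
```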

Email duplicates are conceptually much closer to exact duplicates than to near duplicates. Textual differences between near duplicates, even when small, are more likely to be materially meaningful than the textual differences between email duplicates. For example, two copies of the same bureaucratic form filled in with different values are categorically different documents, whereas two copies of the same email, one of which is missing a header field or part of the footer, are not.

Common differences underlying email duplicates include versions of the collected email that:

Are missing timestamps or some other header field
Are missing attachments
Have clipped footer text
Have different amounts of text from any preceding email(s)

As a result, while a family of email duplicates all represent the same underlying email, there may be versions that are more complete than others. The most complete version – determined by presence of timestamps, presence of attachments, and text – is treated as the “parent” of a given email duplicate family. “Parents” are shown first when grouping or listing email duplicates, which also means they are the version retained when search deduplication is applied, or when “children” documents are removed from sets grouped by duplicates.
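
One way to picture "most complete version" is as a ranking over the copies in a family. The sketch below is an assumption-laden illustration: the article names the three factors (timestamps, attachments, text) but not how they are weighted, so the ordering used here, the EmailCopy class, and the field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EmailCopy:
    doc_id: str
    has_timestamp: bool
    attachment_count: int
    text: str


def completeness(copy: EmailCopy) -> tuple:
    """Rank a copy by timestamp presence, attachment presence, then text length.

    Assumption: this priority order is illustrative, not documented behavior.
    """
    return (copy.has_timestamp, copy.attachment_count > 0, len(copy.text))


def pick_parent(family: list) -> EmailCopy:
    """Return the most complete copy in an email duplicate family."""
    return max(family, key=completeness)


family = [
    EmailCopy("DOC-2", has_timestamp=False, attachment_count=1, text="Hi all, see attached."),
    EmailCopy("DOC-1", has_timestamp=True, attachment_count=1, text="Hi all, see attached. Regards, J."),
    EmailCopy("DOC-3", has_timestamp=True, attachment_count=0, text="Hi all, see attached. Regards, J."),
]
print(pick_parent(family).doc_id)
```

Under these assumed rules, the copy that has a timestamp, an attachment, and the longest text is chosen as the "parent".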

Exact and email duplicates

By default, email threading deduplication is turned on for all projects. This means that both exact and email duplicates are used when:

Deduplicating searches
Grouping documents by duplicates
Applying autocode rules based on duplicates

If email threading deduplication is turned off, then only the exact duplicates are used across the three actions listed above.

This setting is managed at the project level. To learn more about disabling or reenabling this setting, visit Administrator Deduplication Settings.

Near duplicates

Near duplicates are documents that are in the same near duplicate group based on textual percent similarity. By default, near duplicate groups also include exact and email duplicates.

For more information on changing settings to exclude email or exact duplicates, and on how different settings affect grouping, please see this article on Administrator Deduplication Settings.

Documents must have at least eight words in their text file in order to be placed into a near duplicate group.

Near duplicate groups are formed by linking documents that have 95% or more textual similarity and by including exact and email duplicates. These documents are connected through a web structure to form the group. For example, if Document A is 95% similar to Document B, which is 95% similar to Document C, Documents A, B, and C are all in the same near duplicate group, even though Document A and Document C may be less than 95% textually similar. Because near duplicate groups are formed through these 95% text similarity chain relationships, there is no center or ‘head’ of a near duplicate group. In other words, no single document is necessarily 95% similar to every other document in the group.
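
The chain behavior described above is what you get from treating each pair at or above the threshold as a link and taking the connected components. The sketch below shows that general technique with a simple union-find structure. It assumes pairwise similarity scores are already available, and it is not a description of how Everlaw computes similarity or builds its groups; all names are hypothetical.

```python
def near_duplicate_groups(doc_ids, similarities, threshold=0.95):
    """Group documents by chaining pairs whose similarity meets the threshold.

    similarities maps (doc_a, doc_b) pairs to a score between 0 and 1.
    """
    parent = {doc: doc for doc in doc_ids}

    def find(doc):
        # Follow links to the group representative, shortening the path as we go.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for doc in doc_ids:
        groups.setdefault(find(doc), set()).add(doc)
    return sorted(sorted(group) for group in groups.values())


scores = {
    ("A", "B"): 0.96,
    ("B", "C"): 0.95,
    ("A", "C"): 0.91,  # below the threshold, yet A and C share a group via B
    ("C", "D"): 0.50,
}
print(near_duplicate_groups(["A", "B", "C", "D"], scores))
# [['A', 'B', 'C'], ['D']]
```

Because every document ends up with exactly one representative, a document can never belong to two groups at once, which matches the non-overlapping property listed below.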

This chain and web approach to near duplicate groups has a few advantages:

It ensures that all textually similar documents are in the same near duplicate group
It guarantees that a document is only ever in one near duplicate group and prevents overlapping near duplicate groups
It creates stable near duplicate groups that do not change unless documents are added to or removed from the database

Near duplicate groups are created on a database level; as such, documents are always in the same near duplicate groups across complete and partial projects. For near duplicate groups, the ‘parent’ document is the document with the lowest Bates or control number. More information on near duplicate deduplication search settings can be found in this article on search settings.

Understanding near duplicate percent similarity

Because near duplicates are determined by a document’s text file, faulty or inaccurate text files can affect near duplicate grouping. As a result, you may see near duplicate documents in the context panel with low textual similarity scores or, conversely, documents that you believe are near duplicates of each other but are not grouped as such.

Here are some examples where textual features can lead to confusing or inaccurate near duplicate groupings:

You receive a production where all the document text files state “Please refer to the native file of the document.” Since near duplicates are based on document text files, all the documents in this production would likely be grouped in the same near duplicate group even though they are different documents.
You have a number of emails where the email body is short but the signature and footer are very long and always consistent. These emails may get grouped as near duplicates due to the degree of textual similarity in the footer.
If your near duplicate groups explicitly include email duplicates, you may see grouped documents that have low textual similarity to one another. For example, Document A and Document B are email duplicates that have the same email body but are only 80% textually similar due to large differences in the recipients listed. Document B has a near duplicate, Document C, that is 98% similar. Documents A and C may not be that textually similar to each other (say, only 78% similar), but all three documents are in the same group due to their connection to Document B.

For more information about near duplicates in the context panel, please see this article on using the context panel.

Deduplication

Deduplication is the process of excluding duplicates from the results of an action, such as an upload or a search. For example, you can deduplicate upon uploading and/or during search. Deduplication can save you and your reviewers time and energy by reducing the need to sift through identical documents.

Upload Deduplication

Upload deduplication occurs when native files are first uploaded to Everlaw. With upload deduplication, documents are not uploaded if they are exact duplicates of documents that already exist in the database or are in the same upload.

Upload deduplication against native files respects document families. This means that documents attached to families are not considered duplicates unless the entire family is a duplicate. For example, let’s say we have a spreadsheet that is an attachment to an email thread. We are uploading a different email thread that has the same spreadsheet attached. While the spreadsheets are duplicate files in a vacuum, the families they belong to are not, so both copies of the spreadsheet are uploaded, along with their families.
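
A rough way to model family-aware deduplication is to compare whole families instead of individual files. In the sketch below, each family is a list of hashes with the parent first, and a family is dropped only when an identical family has already been seen. The data shapes and function names are hypothetical, and this is not Everlaw's upload pipeline.

```python
def family_key(family):
    """Build a comparison key: the parent hash plus the sorted attachment hashes."""
    parent_hash, attachment_hashes = family[0], family[1:]
    return (parent_hash, tuple(sorted(attachment_hashes)))


def dedupe_upload(existing_families, incoming_families):
    """Keep only incoming families that are not duplicates as a whole."""
    seen = {family_key(family) for family in existing_families}
    kept = []
    for family in incoming_families:
        key = family_key(family)
        if key in seen:
            continue  # the entire family already exists, so skip it
        seen.add(key)
        kept.append(family)
    return kept


existing = [["email-1-hash", "sheet-hash"]]
incoming = [
    ["email-2-hash", "sheet-hash"],  # same spreadsheet, different email: kept
    ["email-1-hash", "sheet-hash"],  # whole family already in the database: dropped
    ["email-2-hash", "sheet-hash"],  # repeat within the same upload: dropped
]
print(dedupe_upload(existing, incoming))
# [['email-2-hash', 'sheet-hash']]
```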

Deduplication against processed data can result in new documents within your current native upload being deduplicated out. If a parent/container file is a duplicate of a processed document already in the database, the native container/parent along with all children/attachments will be deduplicated out of the upload, even if the attachments are not duplicates of anything already in the database.

You can adjust upload deduplication settings when uploading native data. By default, upload deduplication against native documents is global, meaning that Everlaw deduplicates against all native documents in the database. You also have the option of deduplicating only within the same custodian or not deduplicating at all. You can also choose to deduplicate against processed documents already in the database, which follows the setting chosen for native deduplication (global or by custodian).

Even if you choose to deduplicate globally, Everlaw preserves a record of the deduplicated document in the All Custodians and All Paths fields of the existing document in the database. This means that if a document with custodian Sam is deduplicated against a document with custodian Jenny, the existing document in the database will now list both Sam and Jenny in its All Custodians metadata.
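
The difference between the scope options, and the way custodians accumulate on the surviving record, can be sketched as follows. The scope names ("global", "custodian", "none"), the record layout, and the function itself are hypothetical stand-ins for the settings described above, not Everlaw's API.

```python
def upload(documents, scope="global"):
    """Simulate upload deduplication over (file_hash, custodian) pairs.

    scope is "global", "custodian", or "none" (illustrative labels).
    """
    records, index = [], {}
    for file_hash, custodian in documents:
        # Global scope compares hashes only; custodian scope also requires
        # the custodian to match before a file counts as a duplicate.
        key = file_hash if scope == "global" else (file_hash, custodian)
        match = index.get(key) if scope != "none" else None
        if match is not None:
            # Duplicate: do not add it, but record where it came from.
            if custodian not in match["all_custodians"]:
                match["all_custodians"].append(custodian)
            continue
        record = {"hash": file_hash, "custodian": custodian, "all_custodians": [custodian]}
        records.append(record)
        index[key] = record
    return records


docs = [("h1", "Jenny"), ("h1", "Sam"), ("h1", "Sam")]
print(len(upload(docs, "global")), upload(docs, "global")[0]["all_custodians"])
print(len(upload(docs, "custodian")))
print(len(upload(docs, "none")))
```

With global scope, a single record survives and lists both custodians. With custodian scope, one copy per custodian is kept. With deduplication off, all three copies are kept.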

You can learn more about uploading native data in this help article about uploading native data.

Search deduplication

When you apply search deduplication, only one copy in each duplicate document family is returned within your search results. For exact duplicate families, the version with the lowest Bates or control number is returned, which is equivalent to the earliest copy uploaded to the platform. For email duplicate families, the most complete version is returned. Completeness is determined by text, presence of metadata fields, and attachments.
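
For exact duplicate families, the selection rule above amounts to keeping the lowest Bates or control number per family. The sketch below assumes each search hit carries a Bates number and a family identifier, and that Bates numbers share a prefix and are zero-padded so that plain string comparison orders them correctly. Both the data shape and the function name are hypothetical.

```python
def deduplicate_hits(hits):
    """Keep the lowest Bates number from each exact duplicate family.

    hits is a list of (bates_number, family_id) pairs.
    """
    lowest = {}
    for bates, family_id in hits:
        if family_id not in lowest or bates < lowest[family_id]:
            lowest[family_id] = bates
    return sorted(lowest.values())


hits = [
    ("ABC0003", "family-1"),
    ("ABC0001", "family-1"),
    ("ABC0002", "family-2"),
]
print(deduplicate_hits(hits))
# ['ABC0001', 'ABC0002']
```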

Important

Unlike upload deduplication, search deduplication does not respect families. If two duplicate documents are attached to two different emails, then search deduplication results in only one of these duplicates being returned in the search.

Search deduplication is an option you can apply to any search. To deduplicate your results, select Deduplicate within search hits in the search settings tab. Learn how to enable this setting in this article on search settings.

Note that email and near duplicate identification occurs after documents are uploaded to Everlaw. If new documents are uploaded to your project and email rethreading or near duplicate grouping has not completed, you may have stale or inaccurate email duplicates or near duplicate groups. Please see this article on project statuses to learn more about checking the status of email threading and near duplicate grouping.

Hide duplicates across your entire project

Required permissions: Organization Admin. If your organization doesn't have an Organization Admin, you can email support@everlaw.com to enable this setting.

By default, searches on Everlaw look across every document uploaded to your database. However, you can choose to deduplicate documents across all searches in your project. To do so:

Select Project Management > Project Settings.
Select the General tab.
Select Hide all project duplicates from search. This option is turned off by default (recommended).

This setting is primarily intended for users receiving productions with many duplicates who would like to mimic upload deduplication, while retaining all Bates numbered documents. Everlaw does not recommend turning on this setting outside of this use case.

This is a project-level setting, not a database-level or organization-level setting. For more information about this setting, please refer to this article on Administrator Deduplication Settings.