Types of Duplicates and Deduplication Options


Duplicate Types

There are three types of duplicates you can encounter while using Everlaw: exact, email, and near duplicates. To compare differences across near duplicate groups, use the Difference Viewer.

Exact

Definition: Everlaw uses hash values to determine exact duplicates. A hash value is essentially a unique fingerprint based on the native file. There are two types of hash values: SHA1 and MD5. For documents that do not have MD5 or SHA1 hashes, such as processed data with missing hash data in the load file, exact duplicates are identified by comparing hashes generated from the documents’ text files.

Application:
Upload Deduplication: During native upload, you have various options for deduplicating by exact duplicates.
Review: During review, you have various options for deduplicating result sets, viewing, grouping, and setting up autocode rules based on exact duplicates.

Email

Definition: Email duplicates are documents that Everlaw has determined represent the same email, despite textual differences and different hash values. For email-typed documents, Everlaw uses content similarity, close timestamps and other metadata, and email header fields to identify these documents. For documents not typed as emails, Everlaw analyzes the hash value and text to make the determination.

Application:
Review: While in the review window, you have the option to view and take action against email duplicates.
In addition, if email threading deduplication is turned on for your project, email duplicates will also be used for search deduplication, grouping, and autocode.

Near

Definition: Near duplicates are documents that are in the same near duplicate group. Near duplicate groups by default include exact and email duplicates and are formed through a 95% text similarity web. For example, if Document A is 95% similar to Document B, which is 95% similar to Document C, Documents A and C would be near duplicates of each other even though they may be less than 95% similar to each other textually.

Application:
Review: While in the review window, you have the option to view and take action against near duplicates.
You can also deduplicate and group search results by near duplicates.

 

Exact duplicates 

Definition
Exact duplicates are determined by native document hash values, which act as an effectively unique fingerprint of each native document. The two types of hash values are MD5 and SHA1. Both are produced by algorithms that take the document’s raw bytes as input and return a fixed-length value. Because the same bytes always produce the same hash, a native file uploaded multiple times receives the same hash each time, which allows the copies to be flagged as duplicates.
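To make the idea concrete, here is a minimal sketch that computes MD5 and SHA1 hashes over raw bytes using Python's standard `hashlib` module. It illustrates the general technique only; the function name `file_hashes` and the sample byte strings are made up for this example and do not reflect how Everlaw computes hashes internally.

```python
import hashlib

def file_hashes(data: bytes) -> dict:
    """Return the MD5 and SHA1 hex digests of a file's raw bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

# Hypothetical file contents, used only for illustration.
original = b"Quarterly report, final version."
second_upload = bytes(original)          # byte-for-byte identical copy
edited = b"Quarterly report, final version!"

# Identical bytes give identical hashes, so the copies can be flagged as duplicates.
is_exact_duplicate = file_hashes(original) == file_hashes(second_upload)

# A single changed byte gives different hashes, so the edited file is not an exact duplicate.
is_different = file_hashes(original) != file_hashes(edited)
```

Note that the hashes depend only on the bytes, not on the file name or upload time, which is why renamed copies of the same native file still match.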

Documents in processed uploads will often include the MD5 and SHA1 hashes in the load file. Therefore, Everlaw does not separately calculate MD5 and SHA1 hashes for processed data. For processed data that does not have an MD5 or SHA1 hash provided, Everlaw will use an Everlaw-generated hash from the text file to identify exact duplicates. To generate a hash, documents must have a minimum of 8 words. This prevents documents from being inaccurately identified as duplicates based on a few words of placeholder text.
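The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not Everlaw's actual algorithm: the article only says that a hash is generated from the text file and that at least 8 words are required. The choice of SHA1, the whitespace normalization, and the name `text_hash` are all assumptions made for this example.

```python
import hashlib
from typing import Optional

MIN_WORDS = 8  # minimum word count stated in the article

def text_hash(text: str) -> Optional[str]:
    """Return a hash of the document text, or None if the text is too short.

    Assumptions (not confirmed by the article): whitespace is collapsed
    before hashing, and SHA1 is the hash function.
    """
    words = text.split()
    if len(words) < MIN_WORDS:
        return None  # too short: never treated as an exact duplicate by text
    normalized = " ".join(words)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# A short placeholder (5 words) gets no hash, so two such documents do not match.
placeholder = text_hash("Document produced in native format")

# Longer texts that differ only in spacing hash to the same value in this sketch.
a = text_hash("The parties agree to the terms set out in this letter.")
b = text_hash("The parties agree  to the terms\nset out in this letter.")
```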

Project administrators can change the definition of exact duplicates – for example, to only use MD5 hashes – in the project metadata settings. For more information, please see this article about metadata and metadata settings.

 

Email duplicates

Definition

Email duplicates are a special type of duplicate that Everlaw has determined represent the same underlying email despite slight textual differences. Everlaw identifies email duplicates using text comparison and email threading. It uses the email text content, metadata, and message IDs to identify emails that appear to fit into the same spot in a given thread. Specifically, email thread duplicates are determined by content, timestamps, attachments and embedded emails, as well as the metadata fields From, To, Cc, Bcc, Date, and Subject. Native email files do not have a standardized representation or file type (for example, emails can be in the EML or MSG format), so their native hashes cannot be compared directly. Instead, Everlaw creates a custom hash for emails from the information above and determines email duplicates through this hash. This lets Everlaw identify and flag the same emails as duplicates across file types.
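The idea of a format-independent hash can be sketched as below. This is a deliberately simplified illustration: the exact inputs, normalization, and hash function Everlaw uses are not described in this article, so `email_key`, the lowercasing of header values, and the use of SHA1 are assumptions. The sketch also omits timestamps tolerance, attachments, and embedded emails, and it does not tolerate missing header fields, which Everlaw's email duplicate detection does.

```python
import hashlib

# Header fields named in the article.
HEADER_FIELDS = ("From", "To", "Cc", "Bcc", "Date", "Subject")

def email_key(headers: dict, body: str) -> str:
    """Build a hash from parsed email fields rather than from the native file bytes.

    Assumption: header values are lowercased and whitespace is collapsed
    so that formatting differences between file types do not matter.
    """
    parts = []
    for field in HEADER_FIELDS:
        value = headers.get(field, "")
        parts.append(f"{field}:{' '.join(value.lower().split())}")
    parts.append("body:" + " ".join(body.split()))
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()

# The same message as it might be parsed from an EML file and from an MSG file.
from_eml = email_key(
    {"From": "Ann@Example.com", "To": "bob@example.com",
     "Date": "2020-01-02 10:00", "Subject": "Budget  review"},
    "Hi Bob,\nPlease see the attached budget.",
)
from_msg = email_key(
    {"From": "ann@example.com ", "To": "bob@example.com",
     "Date": "2020-01-02 10:00", "Subject": "budget review"},
    "Hi Bob,  Please see the attached budget.",
)
```

Because the key is built from parsed content instead of file bytes, `from_eml` and `from_msg` match even though the two native files would have different MD5 and SHA1 hashes.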

Email duplicates are conceptually much closer to exact duplicates than to near duplicates. Textual differences between near duplicates, even when small, are more likely to be materially meaningful than the textual differences between email duplicates. For example, two copies of the same bureaucratic form filled in with different values are categorically different documents, while two copies of the same email where one is missing a header field or part of the footer are not.

Common differences underlying email duplicates include versions of the collected email that: (1) are missing timestamps or some other header field, (2) are missing attachments, (3) have clipped footer text, or (4) have different amounts of text from any preceding email(s). As a result, while a family of email duplicates all represent the same underlying email, there may be versions that are more complete than others. The most complete version – determined by presence of timestamps, presence of attachments, and text – is treated as the “parent” of a given email duplicate family. “Parents” are shown first when grouping or listing email duplicates, which also means they will be the version retained when search deduplication is applied, or when “children” documents are removed from sets grouped by duplicates.
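Selecting the “parent” can be pictured as ranking the copies in a family by completeness. The sketch below is hypothetical: the article names the factors (timestamps, attachments, text) but not how they are weighed, so the priority order used here, the `EmailCopy` fields, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class EmailCopy:
    doc_id: str
    has_timestamp: bool
    attachment_count: int
    text_length: int

def completeness(copy: EmailCopy) -> tuple:
    """Rank key: timestamp first, then attachments, then amount of text.

    Assumption: this priority order is illustrative only.
    """
    return (copy.has_timestamp, copy.attachment_count, copy.text_length)

def pick_parent(family: list) -> EmailCopy:
    """Return the most complete copy in an email duplicate family."""
    return max(family, key=completeness)

family = [
    EmailCopy("DOC-0007", has_timestamp=False, attachment_count=1, text_length=900),
    EmailCopy("DOC-0012", has_timestamp=True, attachment_count=1, text_length=850),
    EmailCopy("DOC-0020", has_timestamp=True, attachment_count=0, text_length=900),
]
parent = pick_parent(family)  # DOC-0012: has a timestamp and its attachment
```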


Exact and email duplicates

By default, email threading deduplication is turned on for all projects. This means that both exact and email duplicates are used when (1) deduplicating searches, (2) grouping documents by duplicates, and (3) applying autocode rules based on duplicates. 

If email threading deduplication is turned off, then only exact duplicates are used across the three actions listed above. Everlaw will explicitly tell you whether duplicates include email duplicates or are based only on exact duplicates.


You can request that email threading deduplication be turned off for your project by contacting Everlaw Support. Org admins can toggle this setting on or off for projects in their organization without going through Support. This setting is present under Deduplication in the General tab of the Project Settings page. For more information, please see this article on organization admin deduplication settings. 

 

Near duplicates 

Definition

Near duplicates are documents that are in the same near duplicate group based on textual percent similarity. Additionally, near duplicate groups by default also include exact and email duplicates. For more information on changing settings to exclude email or exact duplicates, as well as how different settings will impact grouping, please see this article. Documents must have at least eight words in their text file in order to be placed into a near duplicate group. 

Near duplicate groups are formed by linking together documents that have 95% or more textual similarity, and by including exact and email duplicates. These documents are connected through a web structure to form the group. For example, if Document A is 95% similar to Document B, which is 95% similar to Document C, Documents A, B, and C will all be in the same near duplicate group even though it is possible that Document A and Document C are less than 95% textually similar. Because near duplicate groups are formed through these 95% text similarity chain relationships, there is no center or ‘head’ of the near duplicate group. In other words, no single document necessarily connects to every other document in the group by 95% text similarity.

This chain and web approach to near duplicate groups has three benefits. It ensures that all textually similar documents are in the same near duplicate group. It guarantees that a document is only ever in one near duplicate group, which prevents overlapping groups. And it creates stable near duplicate groups that do not change unless documents are added to or removed from the database. Near duplicate groups are created at the database level; as such, documents will always be in the same near duplicate groups across complete and partial projects. For near duplicate groups, the ‘parent’ document is the document with the lowest Bates or control number. More information on near duplicate deduplication search settings can be found in this article.
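The chain behavior described above matches what computer scientists call connected components: any two documents linked by a path of 95%-similar pairs end up in one group. The sketch below shows one common way to compute such groups (a union-find structure). It is an illustration of the concept, not Everlaw's implementation; the function name, the input format, and the similarity scores are assumptions, and the sketch does not compute text similarity itself.

```python
from collections import defaultdict

THRESHOLD = 0.95  # similarity threshold stated in the article

def near_duplicate_groups(doc_ids, similarities):
    """Group documents connected by chains of pairs at or above THRESHOLD.

    `similarities` maps (doc_a, doc_b) pairs to a similarity score in [0, 1].
    The scores are taken as given; how they are computed is out of scope.
    """
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow links to the group's representative, shortening the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), score in similarities.items():
        if score >= THRESHOLD:
            parent[find(a)] = find(b)  # merge the two groups

    groups = defaultdict(list)
    for d in doc_ids:
        groups[find(d)].append(d)
    return sorted(sorted(g) for g in groups.values())

groups = near_duplicate_groups(
    ["A", "B", "C", "D"],
    {("A", "B"): 0.96, ("B", "C"): 0.95, ("A", "C"): 0.90, ("C", "D"): 0.50},
)
# A and C are only 90% similar, but both link to B, so A, B, and C share a group.
# D is not linked to anything at 95%, so it stays on its own.
```

Each document lands in exactly one group, which is the non-overlapping property the article describes.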

Understanding near duplicate percent similarity

Because near duplicates are determined by a document’s text file, faulty or inaccurate text files can affect near duplicate grouping. As a result, you may see near duplicate documents in the context panel with low textual similarity scores or, conversely, documents that you believe are near duplicates of each other but are not grouped as such.

Here are some examples where textual features can lead to confusing or inaccurate near duplicate groupings:

  • You receive a production where all the document text files state “Please refer to the native file of the document.” Since near duplicates are based on document text files, all the documents in this production would likely be grouped in the same near duplicate group even though they are different documents.

  • You have a number of emails where the email body is incredibly short but the signature and footer are very long and always consistent. These emails may get grouped as near duplicates due to the degree of textual similarity in the footer.

  • If your near duplicate groups explicitly include email duplicates, you may see grouped documents that have low textual similarity to one another. For example, Document A and Document B are email duplicates that have the same email body but are only 80% textually similar due to large differences in the recipients listed. Document B has a near duplicate, Document C, that is 98% similar. Documents A and C may not be that textually similar to each other (say, only 78% similar), but all three documents would be in the same group due to their connection to Document B. For more information about near duplicates in the context panel, please see this article.

Deduplication

Deduplication is the process of removing duplicates from the action you’re taking. For example, you can deduplicate upon uploading and/or during search. Deduplication can save you and your reviewers time and energy by reducing the need to sift through identical documents. 


Upload Deduplication

Upload deduplication occurs when native files are first uploaded to Everlaw. With upload deduplication, documents will not be uploaded if they are exact duplicates of documents that already exist in the database or are in the same upload. 

Upload deduplication affects native uploads only. Processed data uploads will not undergo upload deduplication. 

Note that upload deduplication respects document families. This means that documents attached to families are not considered duplicates unless the entire family is a duplicate. For example, let’s say we have a spreadsheet that is an attachment to an email thread. We would like to upload a different email thread that has the same spreadsheet attached. While the spreadsheets are duplicate files in a vacuum, the families they are a part of are not, so both copies of the spreadsheet will be uploaded, along with their families.

You can adjust upload deduplication settings when uploading native data. By default, upload deduplication is global, meaning that Everlaw will deduplicate against all native documents in the database. You also have the option of deduplicating only within the same custodian, or of not deduplicating at all.

Even if you choose to deduplicate globally, Everlaw will preserve a record of the deduplicated document in the All Custodians and All Paths fields that are populated for the existing document on the database.  This means that if a document with custodian Sam is deduplicated against a document with custodian Jenny, the existing document on the database will now list both Sam and Jenny in its All Custodians metadata.
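The family and custodian rules above can be sketched together. This is a hypothetical model for illustration only: the record layout, the `scope` values, and the function `upload_family` are invented for this example, and a family is represented simply as the set of its members' hashes.

```python
def upload_family(database, member_hashes, custodian, scope="global"):
    """Try to add a document family; return True if uploaded, False if deduplicated.

    `database` is a list of records for families already uploaded.
    `scope` is "global", "custodian", or "none" (hypothetical setting names).
    A family only counts as a duplicate if every member hash matches.
    """
    signature = tuple(sorted(member_hashes))
    if scope != "none":
        for record in database:
            if record["signature"] != signature:
                continue  # some member differs, so the family is not a duplicate
            if scope == "custodian" and record["custodian"] != custodian:
                continue  # only deduplicate within the same custodian
            record["all_custodians"].add(custodian)  # keep a record of the duplicate
            return False
    database.append({
        "signature": signature,
        "custodian": custodian,
        "all_custodians": {custodian},
    })
    return True

db = []
first = upload_family(db, ["email-1", "sheet"], "Jenny")   # uploaded
second = upload_family(db, ["email-2", "sheet"], "Sam")    # uploaded: same sheet, different family
third = upload_family(db, ["email-1", "sheet"], "Sam")     # deduplicated against Jenny's copy
```

After the third call, the record for Jenny's family lists both Jenny and Sam in `all_custodians`, mirroring the All Custodians behavior described above.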

You can learn more about uploading native data in this help article.



Search deduplication

When search deduplication is applied, only one copy from each group of duplicates is returned within your search results. For exact duplicates, the version with the lowest Bates/control number will be returned, which is equivalent to the earliest copy uploaded to the platform. For email duplicates, the most complete version will be returned. Completeness is determined by factors like text, presence of metadata fields, and attachments.

Unlike upload deduplication, search deduplication will not respect families. If two duplicate documents are attached to two different emails, then search deduplication will result in only one of these duplicates being returned in the search. 
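For exact duplicates, the selection rule can be sketched as follows. The data layout and the function name `deduplicate_hits` are assumptions for illustration; the sketch also assumes Bates numbers are zero-padded so that plain string comparison orders them correctly.

```python
def deduplicate_hits(hits):
    """Keep one document per duplicate group: the lowest Bates number.

    `hits` is a list of (bates_number, duplicate_group_id) pairs.
    Document families are intentionally ignored, as in search deduplication.
    """
    lowest = {}
    for bates, group in hits:
        if group not in lowest or bates < lowest[group]:
            lowest[group] = bates
    return sorted(lowest.values())

hits = [
    ("EV0003", "group-1"),  # attachment to one email
    ("EV0001", "group-1"),  # same file attached to a different email
    ("EV0002", "group-2"),
]
results = deduplicate_hits(hits)  # only EV0001 survives from group-1
```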

Search deduplication is an option you can apply to any search. You can select “Deduplicate within search hits” in the search settings tab, which you can learn more about by reading this search article.


Note that email and near duplicate identification occurs after documents are uploaded to Everlaw. If new documents are uploaded to your project and email rethreading or near duplicate grouping has not completed, you may have stale or inaccurate email duplicates or near duplicate groups. Please see this article to learn more about checking the status of email threading and near duplicate grouping.

 


[Org Admin or by request] Hide duplicates across your entire project

By default, searches on Everlaw will look across every document uploaded to your database. However, if you want documents to be deduplicated across all searches in your project, select “Hide all project duplicates from search” in the General section of the Project Settings page. This option is turned off by default, which is the recommended setting. You can ask Everlaw Support to turn on this setting for your project. Org Admins can also enable this setting if they are a Database Admin and the project is in their organization.

This setting is primarily intended for users receiving productions with many duplicates who would like to mimic upload deduplication while retaining all Bates-numbered documents. Everlaw does not recommend turning on this setting outside of this use case. Note that this is a project-level setting, not a database-level or organization-level one. For more information about this setting, please refer to this article.
