Some document sets include duplicates, which are multiple representations of the same document. Review teams often want to avoid spending time reviewing duplicative information, and thus want to deduplicate their sets to review only one copy of each document. On Everlaw, you have flexibility with how you handle duplicate documents. This article describes the types of duplicate documents and how they are identified, and explains how to deduplicate your documents at different phases of your project. You can use this article:
- to understand the duplicate types
- to determine when and how to deduplicate your documents
Duplicate Types
There are three types of duplicates you can encounter while using Everlaw. To compare differences across near duplicate groups, use Difference Viewer, which allows you to review the differences between sets of near duplicate documents in a single view.
Type | Definition | Application |
Exact | Everlaw uses hash values to determine exact duplicates. A hash value is essentially a unique fingerprint based on the native file. There are two types of hash values: SHA1 and MD5. For documents that do not have MD5 or SHA1 hashes, such as processed data with missing hash data in the load file, exact duplicates are identified through comparing hashes created through the documents’ text files. | Upload deduplication: During native upload, you have various options for deduplicating by exact duplicates. Review: During review, you have various options for deduplicating result sets, viewing, grouping, and setting up autocode rules based on exact duplicates. |
Email | Email duplicates are documents that Everlaw has determined represent the same email, despite textual differences and different hash values. For email-typed documents, Everlaw uses content similarity, close timestamps and other metadata, and email header fields to identify these documents. For documents not typed as emails, Everlaw analyzes the hash value and text to make the determination. | Review: While in the review window, you have the option to view and take action against email duplicates. In addition, if email threading deduplication is turned on for your project, email duplicates are also used for search deduplication, grouping, and autocode. |
Near | Near duplicates are documents that are in the same near duplicate group. Near duplicate groups by default include exact and email duplicates and are formed through a 95% text similarity web. For example, if Document A is 95% similar to Document B, which is 95% similar to Document C, Document A and C would be near duplicates of each other even though they may be less than 95% similar to each other textually. | Review: While in the review window, you have the option to view and take action against near duplicates. You can also deduplicate and group search results by near duplicates. |
Exact duplicates
Exact duplicates are determined by native document hash values, which are effectively unique to the contents of each native file. The two types of hash values are MD5 and SHA1. Both are produced by algorithms that read the document’s raw bytes and compute a distinct, fixed-length value from them. Because the same bytes always produce the same hash, a native file that is uploaded multiple times has the same hash each time, which allows the copies to be flagged as exact duplicates.
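As a minimal illustration of why identical native files share a hash, the sketch below computes MD5 and SHA1 digests with Python's standard `hashlib` module. The helper name `file_hashes` and the sample byte strings are hypothetical, and this is not a description of Everlaw's internal implementation.

```python
import hashlib


def file_hashes(raw_bytes: bytes) -> dict[str, str]:
    """Return the hex MD5 and SHA1 digests of a file's raw bytes."""
    return {
        "md5": hashlib.md5(raw_bytes).hexdigest(),
        "sha1": hashlib.sha1(raw_bytes).hexdigest(),
    }


first_upload = b"Quarterly report, final version."
second_upload = b"Quarterly report, final version."  # same bytes, uploaded again
edited_copy = b"Quarterly report, final version!"    # a single byte differs

# Identical bytes give identical hashes; any change gives different hashes.
same = file_hashes(first_upload) == file_hashes(second_upload)
different = file_hashes(first_upload) != file_hashes(edited_copy)
```

Here `same` and `different` are both `True`: the two identical uploads match on both hashes, while the edited copy matches on neither.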
Everlaw does not separately calculate MD5 and SHA1 hashes for processed data, though documents in processed uploads often include the MD5 and SHA1 hashes in the load file. For processed data that does not have an MD5 or SHA1 hash provided, Everlaw uses an Everlaw-generated hash from the text file to identify exact duplicates. To generate a hash, documents must have a minimum of eight words. This is to prevent inaccurately identifying documents as duplicates based on a few words of placeholder text.
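The text-based fallback can be pictured with a short sketch. The eight-word minimum comes from this article; everything else (the function name `text_hash`, the whitespace normalization, and the use of SHA1 as the hash) is an assumption made for illustration, since the article does not specify how the Everlaw-generated hash is computed.

```python
import hashlib
from typing import Optional

MIN_WORDS = 8  # minimum word count stated in the article


def text_hash(text: str) -> Optional[str]:
    """Hash a document's extracted text, or return None if it is too short."""
    words = text.split()
    if len(words) < MIN_WORDS:
        # Too little text: skip hashing so short placeholders never match.
        return None
    # Assumption: collapse whitespace before hashing; SHA1 is a stand-in.
    normalized = " ".join(words)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


placeholder = "Document produced natively"
memo_a = "The board met on Tuesday to approve the annual budget."
memo_b = "The board met on Tuesday   to approve the annual budget."
```

With these inputs, `text_hash(placeholder)` returns `None` because the text has only three words, while `memo_a` and `memo_b` produce the same hash because they differ only in spacing.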
Project administrators can change the definition of exact duplicates – for example, to only use MD5 hashes – in the project metadata settings. For more information, please see this article about metadata and metadata settings.
Email duplicates
Email duplicates are a special type of duplicate that Everlaw determines represent the same underlying email despite slight textual differences. Everlaw identifies email duplicates using text comparison and email threading. To identify emails that appear to fit into the same spot in a given thread, it uses:
- the email text content
- timestamp metadata
- attachments and embedded emails
- From
- To
- Cc
- Bcc
- Date
- Subject
- message IDs
Everlaw creates a custom hash for emails from this information and determines email duplicates through this hash due to the fact that native email files do not have a standardized representation or file type. For example, emails can be in the EML or MSG format. By creating and using a custom hash, Everlaw can identify and flag the same emails as duplicates across file types.
Email duplicates are conceptually much closer to exact duplicates than to near duplicates. Textual differences between near duplicates, even when small, are more likely to be materially meaningful than the textual differences between email duplicates. For example, two copies of the same bureaucratic form filled in with different values are categorically different documents, whereas two copies of the same email, one of which is missing a header field or part of the footer, are not.
Common differences underlying email duplicates include versions of the collected email that:
- are missing timestamps or some other header field
- are missing attachments
- have clipped footer text
- have different amounts of text from any preceding email(s)
As a result, while a family of email duplicates all represent the same underlying email, there may be versions that are more complete than others. The most complete version – determined by presence of timestamps, presence of attachments, and text – is treated as the “parent” of a given email duplicate family. “Parents” are shown first when grouping or listing email duplicates, which also means they are the version retained when search deduplication is applied, or when “children” documents are removed from sets grouped by duplicates.
Exact and email duplicates
By default, email threading deduplication is turned on for all projects. This means that both exact and email duplicates are used when:
- deduplicating searches
- grouping documents by duplicates
- applying auto-code rules based on duplicates
If email threading deduplication is turned off, then only exact duplicates are used across the three actions listed above. Everlaw explicitly tells you whether duplicates include email duplicates or are based only on exact duplicates.
Permissions Required: Org admins can toggle this setting on or off for projects.
To access this setting:
- Select Project Management > Project Settings.
- Select the General tab.
- Toggle Combine email duplicates with exact duplicates in deduplication and auto-code settings on or off.
If your organization does not have an Org admin, you can request that email threading deduplication be turned off for your project by contacting Everlaw Support. For more information, please see this article on Organization admin deduplication settings.
Near duplicates
Near duplicates are documents that are in the same near duplicate group based on textual percent similarity. Additionally, near duplicate groups by default include exact and email duplicates. For more information on changing settings to exclude email or exact duplicates, as well as how different settings will impact grouping, please see this article on Administrator deduplication settings. Documents must have at least eight words in their text file in order to be placed into a near duplicate group.
Near duplicate groups are formed by linking together documents that have 95% or more textual similarity and by including exact and email duplicates. These documents are connected through a web structure to form the group. For example, if Document A is 95% similar to Document B, which is 95% similar to Document C, Documents A, B, and C are all in the same near duplicate group even though it is possible that Document A and Document C are less than 95% textually similar. Because near duplicate groups are formed through these 95% text similarity chain relationships, there is no center or ‘head’ of the near duplicate group. As such, there is no one document that connects every near duplicate together by 95% text similarity.
This chain and web approach to near duplicate groups has a few advantages:
- It ensures that all textually similar documents are in the same near duplicate group
- It guarantees that a document is only ever in one near duplicate group and prevents overlapping near duplicate groups
- It creates stable near duplicate groups that do not change unless documents are added to or removed from the database
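The chain behavior described above is what you get when groups are treated as connected components of a graph whose edges join documents that are at least 95% similar. The sketch below shows one common way to compute such components, a union-find structure. It is an illustration under assumptions, not Everlaw's implementation: the function name, the precomputed pairwise scores, and the choice of union-find are all hypothetical.

```python
from collections import defaultdict

THRESHOLD = 0.95  # similarity threshold stated in the article


def near_duplicate_groups(doc_ids, similarities):
    """Group documents linked by chains of pairs scoring >= THRESHOLD.

    `similarities` maps (doc_a, doc_b) pairs to a score between 0 and 1.
    """
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent links to the group's representative (path halving).
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), score in similarities.items():
        if score >= THRESHOLD:
            parent[find(a)] = find(b)  # merge the two groups

    groups = defaultdict(list)
    for d in doc_ids:
        groups[find(d)].append(d)
    return sorted(sorted(g) for g in groups.values())


similarities = {
    ("A", "B"): 0.96,
    ("B", "C"): 0.95,
    ("A", "C"): 0.91,  # below the threshold, yet A and C still share a group
    ("C", "D"): 0.40,
}
groups = near_duplicate_groups(["A", "B", "C", "D"], similarities)
# groups == [["A", "B", "C"], ["D"]]
```

Because each document resolves to exactly one representative, a document can only ever belong to one group, which mirrors the non-overlapping property listed above.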
Near duplicate groups are created on a database level; as such, documents are always in the same near duplicate groups across complete and partial projects. For near duplicate groups, the ‘parent’ document is the document with the lowest Bates or control number. More information on near duplicate deduplication search settings can be found in this article on search settings.
Understanding near duplicate percent similarity
Because near duplicates are determined by a document’s text file, faulty and inaccurate text files can impact near duplicate grouping. Thus, you may see near duplicate documents in the context panel with low textual similarity scores or, conversely, documents that you believe are near duplicates of each other but are not grouped as such.
Here are some examples where textual features can lead to confusing or inaccurate near duplicate groupings:
- You receive a production where all the document text files state “Please refer to the native file of the document.” Since near duplicates are based on document text files, all the documents in this production would likely be grouped in the same near duplicate group even though they are different documents.
- You have a number of emails where the email body is short but the signature and footer are very long and always consistent. These emails may get grouped as near duplicates due to the degree of textual similarity in the footer.
- If your near duplicate groups explicitly include email duplicates, you may see grouped documents that have low textual similarity to one another. For example, Document A and Document B are email duplicates that have the same email body but are only 80% textually similar due to large differences in the recipients listed. Document B has a near duplicate, Document C, that is 98% similar. Documents A and C may not be that textually similar to each other (say, only 78% similar), but all three documents are in the same group due to their connection to Document B.
For more information about near duplicates in the context panel, please see this article on using the context panel.
Deduplication
Deduplication is the process of removing duplicates from the action you’re taking. For example, you can deduplicate upon uploading and/or during search. Deduplication can save you and your reviewers time and energy by reducing the need to sift through identical documents.
Upload Deduplication
Upload deduplication occurs when native files are first uploaded to Everlaw. With upload deduplication, documents are not uploaded if they are exact duplicates of documents that already exist in the database or are in the same upload.
Upload deduplication affects native uploads only. Processed data uploads do not undergo upload deduplication.
Note that upload deduplication respects document families. This means that documents attached to families are not considered duplicates unless the entire family is a duplicate. For example, let’s say we have a spreadsheet that is an attachment to an email thread. We are uploading a different email thread that has the same spreadsheet attached. While the spreadsheets are duplicate files in a vacuum, the families they belong to are not, so both copies of the spreadsheet are uploaded, along with their families.
You can adjust upload deduplication settings when uploading native data. By default, upload deduplication is global, meaning that Everlaw deduplicates against all native documents in the database. You also have the option of deduplicating only within the same custodian, or of not deduplicating at all.
Even if you choose to deduplicate globally, Everlaw preserves a record of the deduplicated document in the All Custodians and All Paths fields that are populated for the existing document on the database. This means that if a document with custodian Sam is deduplicated against a document with custodian Jenny, the existing document on the database will now list both Sam and Jenny in its All Custodians metadata.
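The Sam and Jenny example can be sketched in a few lines. This is a simplified model under stated assumptions: the `upload` function, the dictionary-based database, and the field names are hypothetical, SHA1 stands in for whichever hash the project uses, and the sketch covers global deduplication only, ignoring document families and the All Paths field.

```python
import hashlib


def upload(database: dict, raw_bytes: bytes, custodian: str) -> bool:
    """Store a document unless an exact duplicate exists.

    Returns True if the document was stored, False if it was deduplicated.
    """
    digest = hashlib.sha1(raw_bytes).hexdigest()
    existing = database.get(digest)
    if existing is not None:
        # Duplicate: keep the existing copy, but record the new custodian.
        if custodian not in existing["all_custodians"]:
            existing["all_custodians"].append(custodian)
        return False
    database[digest] = {"custodian": custodian, "all_custodians": [custodian]}
    return True


database = {}
stored_first = upload(database, b"contract draft v3", "Jenny")
stored_second = upload(database, b"contract draft v3", "Sam")
record = next(iter(database.values()))
```

After both calls the database holds a single document whose `all_custodians` list is `["Jenny", "Sam"]`, so the record of Sam's copy survives even though the file itself was not uploaded a second time.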
You can learn more about uploading native data in this help article about uploading native data.
Search deduplication
When you apply search deduplication, only one copy in each duplicate document family is returned within your search results. For exact duplicate families, the version with the lowest Bates or control number is returned, which is equivalent to the earliest copy uploaded to the platform. For email duplicate families, the most complete version is returned. Completeness is determined by text, presence of metadata fields, and attachments.
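For the exact duplicate case, the selection rule amounts to keeping the lowest-numbered copy per hash. The sketch below illustrates that rule only; the `deduplicate_hits` function and the `hash` and `control_number` field names are hypothetical, control numbers are modeled as plain integers, and the completeness-based rule for email duplicates is not modeled.

```python
def deduplicate_hits(hits):
    """Keep one hit per hash: the one with the lowest control number."""
    best = {}
    for hit in hits:
        key = hit["hash"]
        if key not in best or hit["control_number"] < best[key]["control_number"]:
            best[key] = hit
    # Return the surviving hits in control-number order.
    return sorted(best.values(), key=lambda h: h["control_number"])


hits = [
    {"control_number": 12, "hash": "aaa"},
    {"control_number": 3, "hash": "aaa"},   # duplicate of 12, lower number
    {"control_number": 7, "hash": "bbb"},
]
results = deduplicate_hits(hits)
kept_numbers = [h["control_number"] for h in results]
# kept_numbers == [3, 7]
```

Document 12 is dropped because document 3 has the same hash and a lower control number; document 7 has no duplicates and is returned unchanged.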
Important
Unlike upload deduplication, search deduplication does not respect families. If two duplicate documents are attached to two different emails, then search deduplication results in only one of these duplicates being returned in the search.
Search deduplication is an option you can apply to any search. To deduplicate your results, select Deduplicate within search hits in the search settings tab. Learn how to enable this setting in this article on search settings.
Note that email and near duplicate identification occurs after documents are uploaded to Everlaw. If new documents are uploaded to your project and email rethreading or near duplicate grouping has not completed, you may have stale or inaccurate email duplicates or near duplicate groups. Please see this article on project statuses to learn more about checking the status of email threading and near duplicate grouping.
[Org Admin or by request] Hide duplicates across your entire project
Required permissions: Org Admin. If your organization doesn't have an Org admin, you can email support@everlaw.com to enable this setting.
By default, searches on Everlaw look across every document uploaded to your database. However, you can choose to have documents deduplicated across all searches in your project. To do so:
- Select Project Management > Project Settings.
- Select the General tab.
- Select Hide all project duplicates from search. This option is turned off by default (recommended).
This setting is primarily intended for users receiving productions with many duplicates who would like to mimic upload deduplication, while retaining all Bates numbered documents. Everlaw does not recommend turning on this setting outside of this use case. Note that this is a project-level, not database or organization-level, setting. For more information about this setting, please refer to this article on Administrator deduplication settings.