Table of Contents
- How is a duplicate defined?
How is a duplicate defined?
When uploading a production or dataset, you may have documents that are duplicates, or exact copies, of others. Everlaw uses a combination of three primary methods to identify duplicate documents: hash value, text comparison, and email threading.
When you upload documents, Everlaw looks at their hash values to determine exact duplicates. A hash value is essentially a unique fingerprint computed from the native file. Two commonly used types of hash value are SHA1 and MD5. The hash value takes into account the document's text and intrinsic metadata (e.g., author, date created). Extrinsic metadata values (e.g., custodian, file path) are not used in generating the hash value.
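As a minimal sketch of the fingerprint idea (not Everlaw's actual processing code), the Python below computes MD5 and SHA1 digests over a file's raw bytes using the standard hashlib module. The function name file_hashes and the sample byte strings are hypothetical. Note that a file path or custodian never enters the calculation, which mirrors the point that extrinsic metadata does not affect the hash.

```python
import hashlib


def file_hashes(data: bytes) -> dict:
    """Return MD5 and SHA1 hex digests for a file's raw bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


# Hypothetical file contents standing in for native files.
original = b"Quarterly report, author: Jenny"
copy = bytes(original)                       # byte-for-byte copy
edited = b"Quarterly report, author: Sam"    # intrinsic content changed

print(file_hashes(original) == file_hashes(copy))    # True
print(file_hashes(original) == file_hashes(edited))  # False
```

Identical bytes always produce identical digests, so two copies of the same native file match regardless of where they were stored or who held them.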
After ingestion, Everlaw identifies duplicate emails using text comparison and email threading. It uses the email text content, metadata, and message IDs to identify emails that appear to fit into the same spot in a given thread. Specifically, email thread duplicates are determined by similar content, close timestamps, and various email header fields. For non-email files, duplicates are determined based on comparison of hash values and text files.
Therefore, the definition of a “duplicate” for emails is fundamentally different from the definition for other file types. For non-email file types, duplicate means “the files are identical” (according to matching hash values). For emails, however, duplicate means “the files represent the same email.” For example, an email without attachments with its replies clipped, or with its headers reformatted, may still be considered an exact duplicate of another email, even though the text is not the same.
Because Everlaw can identify documents as duplicates, it also allows you to manage duplicates as you go through review. This is called deduplication. Deduplication is the process of removing exact duplicates from the action you’re taking. For example, you can deduplicate upon uploading and/or during search. Deduplication can save you and your reviewers time and energy by reducing the need to sift through identical documents.
Upload deduplication is an option that occurs when native files are first uploaded to Everlaw. With upload deduplication, documents will not be uploaded if they are exact duplicates of documents that already exist in the database. This includes documents from your dataset that have just been uploaded.
Upload deduplication affects native uploads only. Processed data uploads will not undergo upload deduplication.
Upload deduplication respects document families. This means that documents belonging to a family are not considered duplicates unless the entire family is a duplicate. For example, let’s say we have a spreadsheet that is an attachment to an email thread. We would like to upload a different email thread that has the same spreadsheet attached. While the spreadsheets are duplicate files in a vacuum, the families they are a part of are not, so both copies of the spreadsheet will be uploaded, along with their families.
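The family rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Everlaw's implementation: each family is modeled as a list of stand-in hash strings (parent first, then attachments), and the hypothetical family_key treats a family as a duplicate only when every member hash matches in the same order.

```python
def family_key(member_hashes):
    # Hypothetical family fingerprint: the ordered hashes of the parent
    # document and its attachments.
    return tuple(member_hashes)


def upload_with_family_dedup(existing_families, incoming_families):
    """Return the incoming families that would be uploaded."""
    seen = {family_key(f) for f in existing_families}
    uploaded = []
    for family in incoming_families:
        key = family_key(family)
        if key in seen:
            continue  # the entire family already exists, so skip it
        seen.add(key)
        uploaded.append(family)
    return uploaded


# Stand-in hash strings; real values would be MD5 or SHA1 digests.
existing = [["email_A_hash", "sheet_hash"]]
incoming = [
    ["email_B_hash", "sheet_hash"],  # same spreadsheet, different email
    ["email_A_hash", "sheet_hash"],  # whole family is a duplicate
]

uploaded = upload_with_family_dedup(existing, incoming)
print(uploaded)  # [['email_B_hash', 'sheet_hash']]
```

The second email thread is uploaded together with its copy of the spreadsheet, because the family as a whole is new even though one member is not.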
You can adjust upload deduplication settings when uploading native data. By default, upload deduplication is global, meaning that Everlaw will deduplicate against all documents in the database. You also have the option of deduplicating only within the same custodian.
Even if you choose to deduplicate globally, Everlaw will preserve a record of the deduplicated document in the All Custodians and All Paths fields of the existing document in the database. This means that if a document with custodian Sam is deduplicated against a document with custodian Jenny, the existing document in the database will now list both Sam and Jenny in its All Custodians metadata.
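The difference between global and within-custodian deduplication, and the way All Custodians is preserved, can be sketched as follows. This is a simplified model with hypothetical names (StoredDoc, upload, and the scope values "global" and "custodian"); it tracks only custodians, not paths, and is not Everlaw's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDoc:
    hash_value: str
    custodian: str
    all_custodians: list = field(default_factory=list)


def upload(database, hash_value, custodian, scope="global"):
    """Add a document unless a duplicate exists within the chosen scope."""
    for doc in database:
        same_hash = doc.hash_value == hash_value
        in_scope = scope == "global" or doc.custodian == custodian
        if same_hash and in_scope:
            # Deduplicated: keep a record of the incoming custodian.
            if custodian not in doc.all_custodians:
                doc.all_custodians.append(custodian)
            return doc
    new_doc = StoredDoc(hash_value, custodian, [custodian])
    database.append(new_doc)
    return new_doc


# Global deduplication: Sam's copy is folded into Jenny's document.
global_db = []
upload(global_db, "abc123", "Jenny")
upload(global_db, "abc123", "Sam")
print(len(global_db), global_db[0].all_custodians)  # 1 ['Jenny', 'Sam']

# Within-custodian deduplication: both copies are kept.
custodian_db = []
upload(custodian_db, "abc123", "Jenny", scope="custodian")
upload(custodian_db, "abc123", "Sam", scope="custodian")
print(len(custodian_db))  # 2
```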
You can learn more about uploading native data in this help article.
Search deduplication means that only one copy of each document will be returned in your search results. If your search returns two or more documents that belong to the same duplicate group, only one copy is kept in the results: of the copies that match the search, the one that was first uploaded to Everlaw.
Unlike upload deduplication, search deduplication will not respect families. If two duplicate documents are attached to two different emails, then search deduplication will result in only one of these duplicates being returned in the search.
Search deduplication is an option you can apply to any search. You can select “Deduplicate within search hits” in the More Options tab, which you can learn more about by reading this search article.
[Org Admin or by request] Hide duplicates across your entire project
By default, all documents uploaded to the platform are shown in all searches. However, you may ask Everlaw support to hide duplicates from all searches. Org Admins can also enable this setting themselves if they are a Database Admin and the project is in their organization. This setting is intended for clients receiving productions with many duplicates who would like to mimic upload deduplication while retaining all Bates numbered documents. Org Admins can find this option in Project Settings. Deduplication applies only to the specific project it is enabled on, not across the database.
The “primary document” in a duplicate group is determined by upload order: the copy first uploaded to Everlaw is considered the primary document, and all others are duplicates hidden from search. Additionally, if two duplicate documents are attached to two different emails, then the first one uploaded will be considered the “primary” document in that duplicate group.
When enabled, this setting will be applied, by default, to all searches across the project. Users can adjust it in the More Options tab of the query builder. You can learn more about the More Options tab, and options for search deduplication, by reading this help article.
Only primary documents that directly match your search criteria will be returned. If the only documents in a duplicate group that match your search criteria are duplicates, and not the primary document, then no documents from that duplicate group will be returned.
Search deduplication versus “hiding project duplicates”
Let’s use a basic example to illustrate the difference between search deduplication and hiding the duplicates entirely. We have a duplicate group where document #123 belongs to Custodian Greg and its duplicate, document #456, belongs to Custodian Dean.
Now we run a search for “documents with Custodian: Dean.”
In More Options, we would select “Deduplicate within search hits.”
In this example, document #456 (Custodian: Dean) will be returned in our search results because search deduplication returns one copy of documents that meet your specified search criteria.
Now, let’s change our option to “Hide all project duplicates.”
In this example, document #123 is deemed the “primary” document in its duplicate group and #456 the “duplicate” copy. As a result, we will get no results for our search, because document #456 is hidden due to its status as a “duplicate,” even though it matches the search criteria. More generally, hiding project duplicates first narrows each duplicate group to its primary document, then runs the search across only those primary documents.
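The Greg and Dean example can be reproduced with a short Python sketch. It is only a model of the behavior described here, under the assumption that document #123 was uploaded first. The names Doc, primary_of, dedupe_within_hits, and hide_project_duplicates are hypothetical. The key difference is the order of operations: search deduplication filters first and then picks one copy per group, while hiding project duplicates picks the primary first and then filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: int
    custodian: str
    group: str          # duplicate group identifier
    upload_order: int   # lower means uploaded earlier


def primary_of(docs):
    # The copy uploaded first is treated as the primary document.
    return min(docs, key=lambda d: d.upload_order)


def group_docs(docs):
    groups = {}
    for d in docs:
        groups.setdefault(d.group, []).append(d)
    return groups


def dedupe_within_hits(all_docs, predicate):
    # Search first, then keep one copy per duplicate group among the hits.
    hits = [d for d in all_docs if predicate(d)]
    return [primary_of(g) for g in group_docs(hits).values()]


def hide_project_duplicates(all_docs, predicate):
    # Keep only each group's primary document, then search those.
    primaries = [primary_of(g) for g in group_docs(all_docs).values()]
    return [d for d in primaries if predicate(d)]


docs = [Doc(123, "Greg", "g1", 1), Doc(456, "Dean", "g1", 2)]


def is_dean(d):
    return d.custodian == "Dean"


print([d.doc_id for d in dedupe_within_hits(docs, is_dean)])       # [456]
print([d.doc_id for d in hide_project_duplicates(docs, is_dean)])  # []
```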