Table of Contents
- Upload deduplication
- On-platform deduplication
- Including duplicates in a search
- Including duplicates in the results table
Upload deduplication
You can turn on deduplication for native documents when they are uploaded to the platform. For documents that are not emails, duplicates are identified at the time of upload based on the document’s hash value. The hash value is a unique fingerprint of the native file that takes into account the document's text and intrinsic metadata (e.g., author, date created). Extrinsic metadata values (e.g., custodian, file path) are not evaluated in generating the hash value. Duplicate emails are identified by features such as the email's metadata, content, and attachments.
If deduplication is turned on, duplicate documents will be removed and will not be uploaded. However, upload deduplication respects document families. This means that if the exact same document is attached to two different emails, both copies of the document will be uploaded.
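The family-aware behavior described above can be illustrated with a minimal sketch. This is not Everlaw's implementation: the `NativeFile` type, the `dedupe_upload` function, the use of SHA-256, and the choice to key duplicates on the pair (parent email, hash) are all assumptions made for illustration. The article does not name the hash algorithm or describe how families are tracked internally.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NativeFile:
    """Hypothetical stand-in for an uploaded native document."""
    name: str
    content: bytes                 # native bytes: text plus intrinsic metadata
    parent: Optional[str] = None   # parent email, or None for a standalone file
    custodian: str = ""            # extrinsic metadata, deliberately not hashed


def fingerprint(doc: NativeFile) -> str:
    # SHA-256 is an assumption; the article does not say which hash is used.
    return hashlib.sha256(doc.content).hexdigest()


def dedupe_upload(docs: list[NativeFile]) -> list[NativeFile]:
    """Drop repeated files, but only within the same family context."""
    seen: set[tuple[Optional[str], str]] = set()
    kept = []
    for doc in docs:
        key = (doc.parent, fingerprint(doc))  # assumed family-aware key
        if key in seen:
            continue  # same content in the same context: skip the upload
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    NativeFile("report.xlsx", b"Q3 numbers", custodian="Alice"),
    NativeFile("report_copy.xlsx", b"Q3 numbers", custodian="Bob"),
    NativeFile("report.xlsx", b"Q3 numbers", parent="email_1"),
    NativeFile("report.xlsx", b"Q3 numbers", parent="email_2"),
]
kept = dedupe_upload(docs)
for doc in kept:
    print(doc.name, doc.parent)
```

In this sketch the second standalone copy is dropped even though its custodian differs, because extrinsic metadata is not part of the fingerprint. The two attachments are both kept because they belong to different emails.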
Processed documents do not undergo deduplication upon upload, but will undergo on-platform deduplication. Additionally, documents that are not processed by Everlaw may not have a hash value.
On-platform deduplication
Deduplication when searching for documents on Everlaw differs somewhat from deduplication when uploading documents. To determine which documents are duplicates, the Everlaw search tool compares the documents’ text versions, then their hash values (if available).
Everlaw can also identify duplicates through email threading. Everlaw uses content and metadata to recompose email threads, and identifies emails that appear to fit into the same spot in a given thread. These emails are marked as duplicates. To use this feature, email threading deduplication must be turned on for your case by a member of Everlaw support. Please reach out to firstname.lastname@example.org to see if this setting is turned on for your case. (If you see documents identified as duplicates in the Email Thread section of the Context Panel, but not in the Duplicates section, it is probably because email threading deduplication has not been enabled.)
Please note that the text comparison method has its limitations. For example, different processing tools may generate different text files from two copies of the same document, documents with limited or no text cannot be correctly compared, and unrelated documents produced with the same placeholder text might be misidentified as duplicates. As a result, when we ingest produced documents that do not have hash or any other metadata values, the system may not correctly identify all the duplicate documents.
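The limitations above are easier to see in a small sketch. The code below is one possible reading of "compare text, then hash values if available", not a description of how Everlaw actually compares documents. The `Doc` type, the `looks_like_duplicate` function, the exact-match text comparison, and the Bates-style identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical document record: extracted text plus an optional hash."""
    doc_id: str
    text: str
    hash_value: Optional[str] = None  # may be missing for produced documents


def looks_like_duplicate(a: Doc, b: Doc) -> bool:
    # Step 1 (assumed): the extracted text must match exactly.
    if a.text != b.text:
        return False
    # Step 2 (assumed): if both hashes are available, they must agree too.
    if a.hash_value is not None and b.hash_value is not None:
        return a.hash_value == b.hash_value
    # With no hashes to consult, the text match is the only evidence.
    return True


# Same underlying document, but two tools extracted the text differently.
tool_a = Doc("A-1", "Quarterly report\nRevenue: 10")
tool_b = Doc("B-1", "Quarterly report Revenue: 10")

# Unrelated documents produced with the same placeholder text and no hashes.
slip_1 = Doc("C-1", "DOCUMENT PRODUCED IN NATIVE FORMAT")
slip_2 = Doc("D-1", "DOCUMENT PRODUCED IN NATIVE FORMAT")

# The same placeholder pair, but with differing hashes available.
slip_3 = Doc("C-2", "DOCUMENT PRODUCED IN NATIVE FORMAT", hash_value="aaa")
slip_4 = Doc("D-2", "DOCUMENT PRODUCED IN NATIVE FORMAT", hash_value="bbb")

missed = looks_like_duplicate(tool_a, tool_b)          # a true duplicate is missed
false_match = looks_like_duplicate(slip_1, slip_2)     # unrelated docs are matched
hash_rescue = looks_like_duplicate(slip_3, slip_4)     # hashes break the false match
print(missed, false_match, hash_rescue)
```

Under these assumptions, the first pair shows a missed duplicate, the second shows a false match caused by placeholder text, and the third shows how available hash values can prevent that false match.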
Including duplicates in a search
There is another important difference between upload and search deduplication in Everlaw. When searches are deduplicated, all duplicates are removed from search results, regardless of family groups. If the exact same document is attached to two different emails, both copies of the document will be uploaded to Everlaw. However, because the documents are identical in every way besides their families, they will be identified as duplicates on the platform and be removed from search results. This situation is most common when you run a Bates search for a single document that is a duplicate of another document on Everlaw. For example, suppose EVER 6092 is a spreadsheet that is a duplicate of another spreadsheet on Everlaw. A Bates search for EVER 6092 alone then retrieves zero results.
You can display duplicate documents among search results by checking the “include duplicates” box that appears in the lower left of the query builder. (Depending on the age of your case, search deduplication may be turned off by default. If this is the case, the “include duplicates” box will not appear for you. Contact email@example.com if you would like it turned on.) In the EVER 6092 example, checking the “Include 1 duplicate” box retrieves a result for the Bates search.
As an exception to the duplicate rule noted above, documents matching your search criteria that have been coded with any code, have a note applied to them, and/or have a hot or warm rating will not be removed from the results regardless of duplicate status. In other words, a duplicate document with any of the three characteristics described above will show up in the search results without needing to check the “include duplicates” box.
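The rule and its exception can be summarized in a short sketch. This is an illustration under stated assumptions, not Everlaw's code: the `Hit` type, the `visible_results` function, and the `is_duplicate` flag are hypothetical, and the sketch takes the duplicate flag as given rather than modeling how the platform decides which copy counts as the duplicate. EVER 7001 and EVER 7002 are made-up identifiers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hit:
    """Hypothetical search hit with the review attributes the rule depends on."""
    bates: str
    is_duplicate: bool = False
    codes: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    rating: Optional[str] = None  # e.g. "hot", "warm", or None


def has_review_work(hit: Hit) -> bool:
    # Any code, any note, or a hot/warm rating exempts a duplicate from removal.
    return bool(hit.codes) or bool(hit.notes) or hit.rating in ("hot", "warm")


def visible_results(hits: list[Hit], include_duplicates: bool = False) -> list[Hit]:
    if include_duplicates:
        return list(hits)  # the "include duplicates" box is checked
    return [h for h in hits if not h.is_duplicate or has_review_work(h)]


hits = [
    Hit("EVER 6092", is_duplicate=True),                # plain duplicate
    Hit("EVER 7001", is_duplicate=True, rating="hot"),  # rated duplicate
    Hit("EVER 7002"),                                   # not a duplicate
]
default_view = [h.bates for h in visible_results(hits)]
full_view = [h.bates for h in visible_results(hits, include_duplicates=True)]
print(default_view)
print(full_view)
```

With the box unchecked, the plain duplicate is hidden while the rated duplicate still appears. Checking the box returns all three documents.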
Including duplicates in the results table
You can include or exclude duplicates from the full results table by clicking the duplicates icon immediately to the left of the document count. This will refresh your results table to either include or exclude exact duplicates. You can also use the grouping icon to group your results with their exact duplicates.