Table of Contents
- How can I deduplicate my search results?
- Is there something important to know about duplicate documents?
- Can I include or exclude duplicates after I’ve run the search?
- How does Everlaw identify duplicate documents?
How can I deduplicate my search results?
By default, your search results are deduplicated. To include duplicates in your set of results, check the “include duplicates” box in the lower left of the query builder.
Is there something important to know about duplicate documents?
Duplicate documents are determined at the time of upload based on a set of criteria. When searches are deduplicated, all duplicates are removed, not just those matching your search. The primary consequence is that if, for example, your search criteria return a document that is considered a duplicate, it will not be displayed until you check the “include duplicates” box in the lower left of the query builder. This most commonly happens when you run a Bates search for a document that is a duplicate.
As an exception to the duplicate rule noted above, documents matching your search criteria that have been coded with any code, have a note applied to them, and/or have a hot or warm rating will not be removed from the results regardless of duplicate status. In other words, a duplicate document with any of the three characteristics described above will show up in the search results without the “include duplicates” box being checked.
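The rule and its exception can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Everlaw's implementation; the `Doc` class, its field names, and the `visible_results` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doc:
    # Hypothetical document record; field names are assumptions for this sketch.
    bates: str
    is_duplicate: bool = False          # set at upload time
    codes: List[str] = field(default_factory=list)
    has_note: bool = False
    rating: Optional[str] = None        # e.g. "hot" or "warm"


def is_exempt(doc: Doc) -> bool:
    """A duplicate stays visible if it is coded, has a note, or is rated hot or warm."""
    return bool(doc.codes) or doc.has_note or doc.rating in ("hot", "warm")


def visible_results(matches: List[Doc], include_duplicates: bool = False) -> List[Doc]:
    """Return the documents shown for a search, given the 'include duplicates' setting."""
    if include_duplicates:
        return list(matches)
    # Drop duplicates unless one of the three exceptions applies.
    return [d for d in matches if not d.is_duplicate or is_exempt(d)]
```

In this sketch, a Bates search that matches only an uncoded, unrated duplicate returns nothing until `include_duplicates` is set to `True`, which mirrors the situation described above.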
Can I include or exclude duplicates after I’ve run the search?
Yes. You can toggle this setting from the full results table by clicking the duplicates icon immediately to the left of the document count. This will refresh your results table to either include or exclude exact duplicates.
How does Everlaw identify duplicate documents?
Everlaw uses three methods to identify duplicate documents: text comparison, hash value, and email threading:
- First, Everlaw compares the text files of the documents.
- If the text files are identical, Everlaw then compares the hash values of the documents (if available). The hash value is a unique fingerprint based on the native file, which takes into account the text and intrinsic metadata (e.g. author, date created, etc.). Extrinsic metadata values (e.g. custodian, file path) are not evaluated in generating the hash value.
- Independent of the previous methods, Everlaw can also identify duplicates through email threading. While using the content and metadata to recompose email threads, the system will identify emails that appear to fit into the same spot in a given thread. These emails are marked as duplicates. To use this feature, the “use email threading deduplication” option needs to be toggled on. Please reach out to email@example.com to see if this setting is turned on for your case.
If you see documents identified as duplicates in the Email Thread section of the Context Panel, but not in the Duplicates section, it is probably because the “use email threading deduplication” setting is disabled.
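The first two methods (text comparison, then hash comparison) can be sketched as follows. This is a minimal illustration under stated assumptions, not Everlaw's actual code: the function names are hypothetical, SHA-256 stands in for whatever hash algorithm is really used, and treating documents with identical text but no available hash as duplicates is an assumption.

```python
import hashlib
from typing import Optional


def native_hash(native_bytes: bytes) -> str:
    # Fingerprint of the native file (text plus intrinsic metadata).
    # SHA-256 is an assumption; the real algorithm is not stated here.
    return hashlib.sha256(native_bytes).hexdigest()


def are_duplicates(
    text_a: str,
    text_b: str,
    hash_a: Optional[str] = None,
    hash_b: Optional[str] = None,
) -> bool:
    # Step 1: the extracted text must be identical.
    if text_a != text_b:
        return False
    # Step 2: if both hash values are available, they must match as well.
    if hash_a is not None and hash_b is not None:
        return hash_a == hash_b
    # Assumption: with no hash to compare, identical text alone decides.
    return True
```

Because extrinsic metadata such as custodian or file path does not feed into the hash, two copies of the same native file collected from different custodians would still compare as duplicates in this sketch.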
Please note that the text comparison method has limitations. For example, different processing tools may generate different text files from two copies of the same document; documents with limited or no text cannot be compared reliably; and unrelated documents produced with the same placeholder text might be misidentified as duplicates. As a result, when we ingest produced documents that do not have hash values or other metadata, the system may not correctly identify all duplicate documents.
Deduplication can be activated upon request. Please email us at firstname.lastname@example.org if you wish to activate this functionality.