Search Deduplication Options and Definition of "Duplicate"

 


Upload deduplication

You can turn on deduplication for native documents when they are uploaded to the platform. For documents that are not emails, duplicates are identified at the time of upload based on the document's hash value. The hash value is a fingerprint computed from the native file, so it reflects the document's text and intrinsic metadata (e.g., author, date created). Extrinsic metadata values (e.g., custodian, file path) are not evaluated in generating the hash value. Duplicate emails are identified by features such as the email's metadata, content, and attachments.
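To illustrate why extrinsic metadata does not affect the hash value, here is a minimal sketch. It is not Everlaw's implementation: the `UploadedFile` class, the `fingerprint` function, and the use of SHA-256 are assumptions for illustration, since this article does not name the hash algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class UploadedFile:
    native_bytes: bytes  # file contents, including embedded (intrinsic) metadata
    custodian: str       # extrinsic: assigned outside the file
    file_path: str       # extrinsic: where the file was collected from


def fingerprint(doc: UploadedFile) -> str:
    # Assumption: SHA-256 over the raw bytes of the native file.
    return hashlib.sha256(doc.native_bytes).hexdigest()


contract = b"Author: Kim\nCreated: 2020-01-01\n\nTerms of the agreement..."
a = UploadedFile(contract, custodian="Alice", file_path="/alice/contract.docx")
b = UploadedFile(contract, custodian="Bob", file_path="/bob/copy_of_contract.docx")
c = UploadedFile(contract.replace(b"Kim", b"Lee"), custodian="Alice",
                 file_path="/alice/contract.docx")

same_ab = fingerprint(a) == fingerprint(b)  # only extrinsic values differ
same_ac = fingerprint(a) == fingerprint(c)  # the embedded author differs
print(same_ab, same_ac)
```

In this sketch, `a` and `b` share a fingerprint because only the custodian and file path differ, while `c` gets a different fingerprint because the author stored inside the file has changed.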

upload_dialog.png

If deduplication is turned on, duplicate documents will be removed and will not be uploaded. However, upload deduplication respects document families. This means that if the exact same document is attached to two different emails, both copies of the document will be uploaded.
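One way to picture family-aware deduplication is sketched below. This is a simplified model under a stated assumption: a family is treated as a duplicate only when all of its members match a family that was already uploaded. The article says only that families are respected, so the function name `upload_with_dedup` and the tuple-of-hashes representation are hypothetical.

```python
def upload_with_dedup(families):
    """Keep the first occurrence of each family; skip families already seen.

    Each family is a tuple of hash values: the parent first, then any
    attachments. A standalone file is a one-element family.
    """
    seen = set()
    uploaded = []
    for family in families:
        if family in seen:
            continue  # the whole family is a duplicate, so it is not uploaded
        seen.add(family)
        uploaded.append(family)
    return uploaded


incoming = [
    ("email_1", "report"),  # report attached to one email
    ("email_2", "report"),  # the same report attached to a different email
    ("report",),            # the report as a standalone file
    ("report",),            # a second standalone copy
]
uploaded = upload_with_dedup(incoming)
report_copies = sum(family.count("report") for family in uploaded)
print(len(uploaded), report_copies)
```

Here both emails keep their attached copy of the report, and only the second standalone copy is dropped.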

Processed documents are not deduplicated upon upload, but they are subject to on-platform deduplication. Additionally, documents that are not processed by Everlaw may not have a hash value.


On-platform deduplication

Even if documents are deduplicated upon upload, some duplicate copies may end up on the platform. This is because upload deduplication (see previous section) respects document families. Nevertheless, you can remove these duplicate copies from search results through the More Options menu, accessible from the search page or results table toolbar.

To determine which documents are duplicates, the Everlaw search tool compares the documents’ text versions, then their hash values (if available).

Everlaw can also identify duplicates through email threading. Everlaw identifies emails that appear to fit into the same spot in a given thread and marks them as duplicates, while also using content and metadata to recompose email threads. To use this feature, a member of Everlaw support must turn on email threading deduplication for your project. Please reach out to support@everlaw.com to find out whether this setting is turned on for your project. (If you see documents identified as duplicates in the Email Thread section of the Context Panel, but not in the Duplicates section, email threading deduplication has probably not been enabled.)

Please note that the text comparison method has its limitations. For example, different processing tools may generate different text files from two copies of the same document, documents with limited or no text cannot be correctly compared, and unrelated documents produced with the same placeholder text might be misidentified as duplicates. As a result, when we ingest produced documents that do not have hash or any other metadata values, the system may not correctly identify all the duplicate documents.
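The comparison order and its limitations can be sketched as follows. This is an illustrative model only: the `TextDoc` class and `looks_like_duplicate` function are hypothetical, the text comparison is a plain equality check, and the way a hash mismatch overrides a text match is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextDoc:
    text: str
    hash_value: Optional[str] = None  # may be missing for produced documents


def looks_like_duplicate(a: TextDoc, b: TextDoc) -> bool:
    # Step 1: compare the text versions.
    if a.text != b.text:
        return False
    # Step 2: compare hash values, but only when both are available.
    if a.hash_value is not None and b.hash_value is not None:
        return a.hash_value == b.hash_value
    # The text matches and there is no hash to confirm or refute it.
    return True


# Unrelated produced documents that share placeholder text and have no hash.
slip_1 = TextDoc("DOCUMENT WITHHELD FOR PRIVILEGE")
slip_2 = TextDoc("DOCUMENT WITHHELD FOR PRIVILEGE")
false_positive = looks_like_duplicate(slip_1, slip_2)

# The same document, with text extracted by two tools that differ in whitespace.
tool_a = TextDoc("Quarterly report\nRevenue rose.")
tool_b = TextDoc("Quarterly report Revenue rose.")
missed = not looks_like_duplicate(tool_a, tool_b)
print(false_positive, missed)
```

The first pair shows how placeholder text can cause unrelated documents to be matched, and the second shows how differing text extraction can hide a real duplicate.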

When choosing to exclude duplicates on-platform, you have the option of deduplicating your documents in a project-wide or search-wide context. Each option is described in the sections that follow.


Project-wide deduplication

Project-wide deduplication means searching for and viewing only the reference copy of each group of duplicate documents throughout your entire project. The reference (or "original") copy of a duplicate group is chosen by upload order rather than by any property of the document: the copy first uploaded to Everlaw is considered the reference, and all other copies are considered duplicates. You can build searches on Everlaw that will never retrieve any document that is not a reference copy.
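The "first uploaded wins" rule can be sketched as below. The function `assign_reference_copies`, the fingerprint strings, and document #30112 are hypothetical. The upload order of #26433 before #35801 is assumed so that the result matches the example documents used later in this article.

```python
def assign_reference_copies(uploads):
    """Map each document to the reference copy of its duplicate group.

    uploads: iterable of (doc_id, fingerprint) pairs, in upload order.
    """
    first_seen = {}    # fingerprint -> doc_id of the first copy uploaded
    reference_of = {}  # doc_id -> doc_id of its reference copy
    for doc_id, fp in uploads:
        # setdefault records doc_id only if this fingerprint is new.
        reference_of[doc_id] = first_seen.setdefault(fp, doc_id)
    return reference_of


uploads = [(26433, "fp_a"), (30112, "fp_b"), (35801, "fp_a")]
reference_of = assign_reference_copies(uploads)
duplicates = sorted(d for d, ref in reference_of.items() if d != ref)
print(reference_of, duplicates)
```

A document whose reference is itself is a reference copy. Every other document in the mapping is a duplicate.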

Project-wide deduplication is controlled by an Include/Exclude Duplicates toggle on your More Options tab. As the labels indicate, this toggle lets you include or exclude documents marked as duplicates from your search results. (Depending on your project settings, the toggle may not be visible; if this is the case, and you would like it enabled, please reach out to support@everlaw.com.)

search78.png

When you exclude duplicates from search results through this toggle, all duplicate documents will be removed, regardless of family or document groups. If the exact same document is attached to two different emails, both copies of the document will be uploaded to Everlaw. However, one copy will be marked as a duplicate, and will not appear if duplicates are excluded from search results.

Toggling the Exclude Duplicates option will remove all duplicate documents from search results, so using it may exclude a document that you intended to retrieve. This most commonly occurs when you run a Bates search for a single document that has been marked as a duplicate copy. In the example below, #35801 is an email that is a duplicate of another email on Everlaw. For this reason, searching for this document and excluding duplicates retrieves zero results.

search46.png

Toggling the Include Duplicates option, however, retrieves a result for the Bates search.

search47.png

Although the More Options tab can be activated for multiple logical containers, the option to include or exclude duplicates, if present on your project, will only be available on the outermost container. In other words, you will not be able to set project-wide deduplication for a portion of your search and disable it for another portion.

As an exception to the behavior described above, documents that have been coded, have a note applied to them, and/or have a hot or warm rating will not be removed from search results, even if you choose to exclude duplicates. 
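The toggle and its exception can be modeled with a short filter. This is a sketch, not Everlaw's implementation: the `SearchHit` class, its field names, and document #99999 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchHit:
    doc_id: int
    is_duplicate: bool
    coded: bool = False
    has_note: bool = False
    rating: Optional[str] = None  # for example "hot" or "warm"


def is_exempt(hit: SearchHit) -> bool:
    # Coded, noted, or hot/warm documents stay in the results.
    return hit.coded or hit.has_note or hit.rating in ("hot", "warm")


def apply_duplicates_toggle(hits, exclude_duplicates: bool):
    if not exclude_duplicates:
        return list(hits)
    return [h for h in hits if not h.is_duplicate or is_exempt(h)]


# Bates search for a single document that is marked as a duplicate.
bates_hit = SearchHit(doc_id=35801, is_duplicate=True)
excluded = apply_duplicates_toggle([bates_hit], exclude_duplicates=True)
included = apply_duplicates_toggle([bates_hit], exclude_duplicates=False)

# Hypothetical duplicate that has a hot rating, so it is exempt.
hot_hit = SearchHit(doc_id=99999, is_duplicate=True, rating="hot")
kept = apply_duplicates_toggle([hot_hit], exclude_duplicates=True)
print(len(excluded), len(included), len(kept))
```

With duplicates excluded, the Bates search returns nothing, while the hot-rated duplicate remains in the results.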


Search-wide deduplication

When running searches on Everlaw, you may want to see only one copy of each document that matches your search terms. This largely overlaps with, but is slightly different from, project-wide deduplication, in which you are asking to see only one copy of each document that exists on the entire project. 

To deduplicate within a search on Everlaw, open the More Options tab and group your documents by their exact duplicates. This will group the reference (or "original") copy of the document, as well as any duplicate copies, together in the results table. Then, remove children from the groups. This will keep only the reference copy among your search results, and each duplicate group will be represented by a single document. 

search_dedupe.png

Deduplicating within a search is useful for quickly isolating the reference copy for each duplicate group, without worrying that any copies may be inadvertently left out (such as the Bates example given in the previous section). The reference copies can then be saved to a binder for review. 

Because deduplicating within a search will only include the reference copy among search results, you may see different documents in the results table than those you originally searched for. For example, if you search for document #35801, group by exact duplicates, and remove children, your search will retrieve a single document: document #26433, the reference copy of that duplicate group.

include_dupe.png
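The two steps of search-wide deduplication (group by exact duplicates, then remove children) can be sketched as follows. The function `dedupe_within_search` and the `reference_of` mapping are hypothetical and only model the behavior described above.

```python
def dedupe_within_search(hit_ids, reference_of):
    """Return one reference copy per duplicate group among the search hits.

    reference_of maps a doc_id to the doc_id of its reference copy.
    Documents missing from the mapping are treated as their own reference.
    """
    kept = []
    for doc_id in hit_ids:
        # Grouping pulls in the reference copy of each hit.
        reference = reference_of.get(doc_id, doc_id)
        # Removing children keeps only that reference, once per group.
        if reference not in kept:
            kept.append(reference)
    return kept


reference_of = {26433: 26433, 35801: 26433}
result = dedupe_within_search([35801], reference_of)
print(result)
```

Searching for #35801 alone yields #26433, because the reference copy stands in for the whole duplicate group even when it did not match the original search.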

 
