Clustering FAQ – Knowledge Base

This article contains common questions about Clustering.

For more information about Clustering, please see the help articles in our Clustering section.

Frequently Asked Questions

How Clustering works
Clustering visualization and terms
Troubleshoot Clustering
When and how to use Clustering

How Clustering works

What information is Clustering based on?

The placement of documents in clusters considers the text of the document as well as the following metadata fields: Subject, Author, Title, To, From, Cc, and Bcc.

What are the minimum and maximum Clustering document counts?

Clustering requires at least 1,000 clusterable documents. A document is typically clusterable when it has enough text to be clustered. Clustering supports up to 25 million documents. For projects larger than this, we recommend clustering a search or placing your documents into a smaller partial project.

What is the minimum cluster size?

The minimum cluster size is 32 documents, but cluster sizes will vary.

Why do my identical projects with the same documents clustered have different Clustering visualizations?

The Clustering algorithm is non-deterministic. This means that two projects with the same documents clustered will not necessarily have the same Clustering outputs. As a result, the Clustering visualizations for two identical document sets may differ slightly: cluster membership may vary a little, and clusters may be located in different places.
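
The sketch below gives a rough sense of how the same documents can end up in different clusters from run to run. It uses a tiny one-dimensional k-means purely as a familiar stand-in; which algorithm Everlaw actually uses is not stated here, and the function name, data points, and starting centers are all hypothetical. The two fixed sets of starting centers play the role of two different random initializations.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means (illustrative only): returns the final clusters as sorted tuples."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(tuple(c) for c in clusters)

points = [0, 1, 10, 11, 20, 21]

# Same points, two different starting positions (standing in for random initialization).
run_a = kmeans_1d(points, centers=[0, 21])
run_b = kmeans_1d(points, centers=[0, 1])

print(run_a)  # [(0, 1, 10), (11, 20, 21)]
print(run_b)  # [(0, 1), (10, 11, 20, 21)]
```

Both runs settle on a stable grouping, but not the same one, which is the kind of variance described above.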

How are duplicates handled? What about near duplicates? Produced versions?

Duplicates, including produced versions, are included in Clustering visualizations. All exact, email, and near duplicates should appear in the same cluster, as they are identified as conceptually similar. When a cluster is selected, the side panel displays how many of the documents are unique and how many are duplicates (exact and produced version).

Why do the singular and plural of the same word both show up as Cluster defining terms?

The Clustering system treats different text strings as entirely different terms, regardless of how similar they look. For example, it considers “car” and “cars” to be as dissimilar as “boat” and “vampire.” We do not assume that a word with -s added is the plural of the same word, because that rule would merge distinct words such as “wood” and “woods,” which have different meanings, while still missing irregular plurals such as “half” and “halves.”
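
As a minimal illustration of exact-string term matching, the sketch below counts terms by string equality and then tries a naive "drop the trailing s" rule. Both helper functions are hypothetical and are not Everlaw's implementation; they only show why such a rule would be unreliable.

```python
from collections import Counter

def count_terms(text):
    """Count terms by exact string match after lowercasing and splitting on whitespace."""
    return Counter(text.lower().split())

def naive_singular(term):
    """Hypothetical naive rule: drop a trailing 's'. Shown only to illustrate its flaws."""
    return term[:-1] if term.endswith("s") else term

counts = count_terms("car cars car wood woods half halves")

# "car" and "cars" are different strings, so they are counted separately.
print(counts["car"], counts["cars"])       # 2 1

# The naive rule merges two words with different meanings...
print(naive_singular("woods") == "wood")   # True
# ...and still fails to connect an irregular plural to its singular.
print(naive_singular("halves") == "half")  # False
```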

Will Clustering work with non-English documents?

Yes. Clustering is language-agnostic, like Everlaw’s Predictive Coding feature. For more details, you can consult our FAQ on non-English documents in Predictive Coding.

Clustering visualization and terms

Why does the list of nearest neighbors in the context panel include documents that aren’t in the same cluster as the document I’m looking at?

Both the Clustering page and the Clustering section of the context panel use term frequency-inverse document frequency (TF-IDF). In the context panel, however, we don't use a document's cluster to compute its neighbors. Instead, we compute the distance between that document and every other document in the corpus, then select the nearest ones.

A document on the edge of Cluster A may be closer to a document in Cluster B than to some documents in its own cluster. That document in Cluster B can then be a nearest neighbor without being part of the same cluster. You can think of various clusters and neighbors like a smiley face with two eyes and a mouth. The left eye might be closer to the smile than to the right eye, but the eyes are grouped together due to their shared characteristics. We believe that both elements, the distance between documents and their particular clusters, provide valuable information you can use to better understand your corpus.
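
The smiley-face picture can be sketched in a few lines. The coordinates, document names, and cluster labels below are hypothetical, and the 2D points merely stand in for TF-IDF vectors; this is an illustration of a corpus-wide nearest-neighbor search, not Everlaw's implementation.

```python
import math

# Hypothetical documents: name -> (position, cluster label).
docs = {
    "left_eye": ((-2.0, 2.0), "eyes"),
    "right_eye": ((2.0, 2.0), "eyes"),
    "mouth_left": ((-2.0, 0.0), "mouth"),
    "mouth_mid": ((0.0, -1.0), "mouth"),
    "mouth_right": ((2.0, 0.0), "mouth"),
}

def nearest_neighbor(name):
    """Return the closest other document, ignoring cluster labels entirely."""
    point, _ = docs[name]
    others = (n for n in docs if n != name)
    return min(others, key=lambda n: math.dist(point, docs[n][0]))

neighbor = nearest_neighbor("left_eye")
print(neighbor)                                # mouth_left
print(docs["left_eye"][1], docs[neighbor][1])  # eyes mouth
```

The left eye's nearest neighbor is a corner of the mouth, even though the two eyes belong to the same cluster.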

Why are some of the documents in a cluster really far away from the rest of the cluster?

Everlaw clusters documents in dozens of dimensions, though the Clustering page only represents them in two. When the display of clusters is flattened to appear in 2D, documents in the same cluster may appear to be far apart or even appear amid another cluster when in a higher dimension they are clustered together.

Imagine that you are looking at a clear soda can. If you place one sticker on its top and another on its bottom, the stickers are separated by the entire height of the can. However, if you were to take a picture of the can from the top down, the two stickers would appear close together. This effect, viewing a 3D figure in 2D, is similar to what happens when Everlaw flattens the dimensions from dozens to two. This graphic is a helpful display of this effect; you can see that clusters appear homogeneous from some views, but when the cube turns, the clusters seem to mix together. This can also contribute to a situation where a document is part of one cluster but appears to be located in the middle of another.
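
The soda can example can be worked through numerically. The sketch below assumes a can 12 units tall and a "top-down photo" that simply drops the height coordinate; both are illustrative assumptions, and the method Everlaw uses to reduce dimensions is not described here.

```python
import math

# Hypothetical sticker positions as (x, y, height) on a can 12 units tall.
top_sticker = (0.0, 0.0, 12.0)
bottom_sticker = (0.0, 0.0, 0.0)

def top_down(point):
    """Flatten a 3D point to 2D by discarding its height, like a photo from above."""
    x, y, _height = point
    return (x, y)

distance_3d = math.dist(top_sticker, bottom_sticker)
distance_2d = math.dist(top_down(top_sticker), top_down(bottom_sticker))

print(distance_3d)  # 12.0
print(distance_2d)  # 0.0
```

Two points that are far apart in the full space land on the same spot once one dimension is discarded, which is why a document can appear to sit inside a cluster it does not belong to.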

Why are there cluster defining terms that are barely in my documents?

The Clustering system first clusters documents, and then looks for the documents that are most typical of each cluster (the “exemplars” of a given cluster). The cluster defining terms are the terms that carry the highest weights in those exemplar documents. Because the system heavily weights words that appear in only a few documents, the top terms may not be very common in your documents. This means that the cluster defining terms may be typical of the exemplars only, rather than of the cluster overall.
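
To see why a rare word can outrank a common one, the sketch below applies the textbook TF-IDF formula (term frequency times the log of total documents over documents containing the term) to a tiny made-up corpus. The exact weighting Everlaw uses is not stated here, and the corpus, the function names, and the choice of the first document as the "exemplar" are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical four-document cluster; the first document plays the exemplar.
corpus = [
    "invoice payment invoice zephyr",
    "invoice payment schedule",
    "invoice payment reminder",
    "invoice meeting agenda",
]
tokenized = [doc.split() for doc in corpus]

def idf(term):
    """Inverse document frequency: rarer terms get larger values."""
    document_frequency = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / document_frequency)

def tf_idf(tokens):
    """Weight each term in one document by its frequency times its IDF."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * idf(term) for term, count in counts.items()}

weights = tf_idf(tokenized[0])
top_term = max(weights, key=weights.get)

print(top_term)            # zephyr
print(weights["invoice"])  # 0.0
```

Here "invoice" appears in every document, so its weight is zero, while "zephyr" appears once in a single document and becomes the top term despite being barely present in the set.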

I’m seeing the same term across multiple clusters. Why is this happening, and what does it mean?

Some clusters have the same defining terms, but the algorithm categorizes their document sets as conceptually different, so the terms appear in separate clusters. For example, two clusters with the term “fraud” may both relate to fraud, but one might be a set of financial spreadsheets and the other a set of emails referencing fraud.

In some cases, two clusters with the same term may contain documents that appear very similar. This is likely to improve over time as our Clustering algorithms learn from being implemented in your projects.

Troubleshoot Clustering

My Clustering search criteria showed more document hits than are present in Clustering. Why?

Not all the documents that meet your Clustering search criteria get included in Clustering. This is because the documents may not have enough text to be clustered.

Some documents may be included in Clustering, but not present on the visualization because their text is not sufficiently similar to any of the clusters. Rather than include that “white noise” in the Clustering page, we have chosen to remove these documents.

Why does it say on my project that "clusters are currently being generated"?

This means that someone has initiated clustering or reclustering on your project. The time it takes to generate clusters is impacted by how large the project is. For larger projects, it may take up to 48 hours for Clustering to complete. Please reach out to our support team at support@everlaw.com if your clusters are still generating after 48 hours.

I created a search based on a cluster and now it doesn’t return any documents. What happened?

Clustering is meant to reflect the current view into the documents. This means that when reclustering happens and your clusters change, previous searches that were based on clusters will no longer return documents. If you would like to keep a record of the documents that were in a particular cluster, we recommend adding them to a binder before reclustering.

I’m having issues viewing the Clustering page

The Clustering page requires WebGL, a JavaScript API for rendering interactive 3D graphics, to render the Clustering data visualization. WebGL can fail to work for a number of reasons. Please ensure that your browser version and graphics drivers are up to date and compatible with WebGL. To confirm that your browser supports WebGL and has it enabled, please visit https://get.webgl.org/.

If you are running an enterprise version of your browser, it is possible for WebGL to be disabled by a policy set by your IT department. For example, in Chrome, the Disable3DAPIs policy may be set to "true" and/or the HardwareAccelerationModeEnabled policy may have been set to "false." If this is the case, then Clustering may not be viewable in your browser until these policies are changed. Please note that it may take a while for policy changes to take effect in your browser.

If you are having issues and are using Chrome, Firefox, or Microsoft Edge, please ensure that the Use hardware acceleration when available setting is toggled on. This setting can also help with reducing lag when interacting with the Clustering visualization.

In Chrome, this can be found by typing chrome://settings/system into the address bar.
In Edge, this can be found by typing edge://settings/system into the address bar.
In Firefox, go to about:preferences, and under Performance, check that you have enabled the Use hardware acceleration when available setting. If you are using the recommended performance setting then this should be enabled.

Once it is toggled on, relaunch your browser and try viewing the Clustering page again. If WebGL still fails to work, please contact our support team for further assistance.

When and how to use Clustering

When would I want to edit my Clustering search criteria versus create a filter overlay?

Editing your Clustering search criteria changes which documents are included in the Clustering algorithm and will recluster your visualization. Filtering your visualization does not change which documents are included in Clustering. Instead, it creates a visual filter where only the documents that match your filter will be displayed. Filters can be added or removed quickly and easily.

We recommend using the Filter functionality to quickly home in on the location of relevant data. For example, you can overlay an important binder of documents and then narrow in on clusters of interest based on the clusters those documents are in. On the other hand, we recommend editing your Clustering search criteria when there are specific datasets you know for certain you want to exclude or include, such as clustering all documents except those from known spam email senders. We also recommend editing your Clustering search criteria if your dataset is too large, such as for projects with millions of documents.

When should I recluster?

If, since the last Clustering task, a substantial number of documents have newly matched your Clustering search criteria or no longer match them, you will want to recluster. This will run our Clustering algorithm on your documents and assign them to clusters based on the most recent available information, resulting in more accurate clusters. Because reclustering takes time and removes the results from any existing searches of clusters you have done, we recommend that you recluster sparingly.

What happens when a new document fulfills the Clustering search criteria? Does the visualization need to be reclustered?

When a new document fulfills the Clustering search criteria, no changes are made to the existing Clustering visualization until a user with Cluster administrator permissions kicks off a recluster. After a recluster, any previous searches that reference Cluster IDs will no longer return documents, so we recommend that you recluster sparingly.

When should I use Predictive Coding and when should I use Clustering?

Everlaw’s Predictive Coding feature relies on user-generated review decisions to determine which documents will likely be relevant based on whatever criteria you set. In contrast, Clustering is unsupervised: it creates visualizations without user input. This means that as you review documents, Predictive Coding will likely have more accurate insights into your data. At the outset of a case, however, Clustering will be fully up and running before you begin review.