Clustering FAQ

For more information about Clustering, please see the help articles in our Clustering section.



What are the minimum and maximum project sizes? 

Clustering is available on any project with 1,000 or more clusterable documents. Documents without any text, for example, are not clusterable. The minimum cluster size is 32 documents, but cluster sizes will vary. Clustering is supported on projects with up to 25 million documents. For larger projects, we recommend placing your documents into a smaller partial project to use Clustering.


How are duplicates handled? What about near duplicates? Produced versions?

Exact duplicates, including produced versions, are included in Clustering visualizations. A parent document and its duplicates should appear in the same cluster, as they will be identified as conceptually similar. The same is true for documents and their near duplicates. When a cluster is selected, the side panel will display how many of the documents are unique and how many are duplicates. You can run a search for either kind.   


What happens when a new document is added to a project? Does the entire project need to be reclustered? 

When a new document is added to a project, no changes are made to the existing Clustering visualization until a user with Cluster administrator permissions kicks off a recluster. After a recluster, any previous searches that reference Cluster IDs will no longer return documents, so we recommend that you recluster sparingly.


What information is Clustering based on?

The placement of documents in clusters considers the text of the document as well as the following metadata fields: Subject, Author, Title, To, From, Cc, and Bcc. 


Why are some of the documents in a cluster really far away from the rest of the cluster?

Everlaw clusters documents in dozens of dimensions, though the Clustering page only represents them in two. When the display of clusters is flattened to 2D, documents in the same cluster may appear to be far apart, or even appear amid another cluster, even though they are grouped together in the higher-dimensional space.

Imagine looking at a clear soda can with one sticker on its top and another on its bottom. Viewed from the side, the stickers appear separated by the entire can. However, if you were to take a picture of the can from the top down, the two stickers would appear close together. This effect, viewing a 3D figure in 2D, is similar to what happens when Everlaw flattens the dimensions from dozens to two. This graphic is a helpful display of this effect; you can see that clusters appear homogeneous from some views, but when the cube turns, the clusters seem to mix together. This can also contribute to a situation where a document is part of one cluster but appears to be located in the middle of another.
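The flattening effect can be illustrated with a minimal Python sketch. It assumes a plain top-down projection (dropping the height coordinate) and made-up 3D positions; Everlaw's actual dimensionality reduction is not described here and may work differently.

```python
import math

# Hypothetical 3D positions: cluster B sits near the bottom of the "can",
# while one document from cluster A sits directly above it, far away in height.
cluster_b_center = (0.0, 0.0, 0.0)
cluster_a_doc = (0.0, 0.0, 12.0)


def project_top_down(point):
    """Flatten a 3D point to 2D by discarding its height (z)."""
    x, y, _z = point
    return (x, y)


# Distance in the original 3D space versus the flattened 2D view.
dist_3d = math.dist(cluster_a_doc, cluster_b_center)
dist_2d = math.dist(project_top_down(cluster_a_doc), project_top_down(cluster_b_center))

print(dist_3d)  # 12.0
print(dist_2d)  # 0.0
```

In the flattened view, the cluster A document lands in the middle of cluster B even though the two are far apart in the original space.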


Why do I have words that show up as cluster defining terms that are barely in my documents?

The Clustering system first clusters documents and then looks for the documents that are most typical of each cluster (the “exemplars” of a given cluster). From there, the cluster defining terms are the terms with the highest weights in those exemplar documents. Because the system heavily weights words that appear in only a few documents, the top terms may not be very common in your documents. This can mean that the cluster defining terms are typical of the exemplars but not of the cluster overall.
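To see why a rare word can outrank common ones, here is a minimal sketch of a textbook TF-IDF weighting in Python. The formula, the sample documents, and the `tf_idf` helper are illustrative assumptions, not Everlaw's actual implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using a basic TF-IDF."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical cluster: every document mentions "contract" and "payment",
# but only the third one contains the rare word "zeppelin".
docs = [
    "contract payment invoice",
    "contract payment schedule",
    "contract payment zeppelin",
    "contract payment terms",
]
exemplar_weights = tf_idf(docs)[2]
top_term = max(exemplar_weights, key=exemplar_weights.get)
print(top_term)  # zeppelin
```

Words that appear in every document receive a weight of zero under this formula, so if the third document were chosen as an exemplar, its rare word would become the top term.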


Why do the singular and plural of the same word both show up as Cluster defining terms?

The Clustering system treats different text strings as entirely different terms, regardless of how similar they are. For example, it views the words “car” and “cars” as being just as dissimilar as the words “boat” and “vampire.” A simple rule that treats words ending in -s as plurals would not fix this: entirely different words like “wood” and “woods” would be merged even though they have different meanings, while irregular plurals like “half” and “halves” would still not be matched.
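The trade-off can be sketched in a few lines of Python. The `naive_singular` function is a hypothetical rule written only for this illustration; it is not part of Everlaw.

```python
def naive_singular(word):
    """Hypothetical rule: strip a trailing 's' to guess the singular form."""
    return word[:-1] if word.endswith("s") else word


# With exact string comparison, "car" and "cars" are simply different terms.
print("car" == "cars")  # False

# The naive rule merges words that have different meanings...
print(naive_singular("woods"))  # wood

# ...and still fails to map an irregular plural to its singular.
print(naive_singular("halves"))  # halve (not "half")
```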


Will Clustering work with non-English documents?

Yes. Clustering is language-agnostic, like Everlaw’s Predictive Coding feature. You can consult our FAQ on non-English documents in Predictive Coding for more details.


When should I use Predictive Coding and when should I use Clustering?

Everlaw’s Predictive Coding feature relies on user-generated review decisions to determine which documents will likely be relevant based on whatever criteria you set. In contrast, Clustering is unsupervised - it creates visualizations without user input. This means that as you review documents, Predictive Coding will likely have more accurate insights into your data. At the outset of a case, however, Clustering will be fully up and running before you begin review. 


I’m seeing the same term across multiple clusters. Why is this happening, and what does it mean? 

Some clusters will have the same defining terms, but the algorithm categorizes the document sets as conceptually different, so the terms appear in separate clusters. For example, two clusters with the term “fraud” may both relate to fraud, but one is a set of financial spreadsheets and the other is a set of emails referencing fraud.

In some cases, this may not appear to be true and you’ll look at two clusters with the same term, but the documents will appear to be very similar. This is likely to improve over time as our Clustering algorithms learn from being implemented in your projects. 

Our June 3, 2022 release included improvements to Clustering that reduce cluster term redundancies. We recommend reclustering promptly in order to have improved, refactored cluster terms.


I created a search based on a cluster and now it doesn’t return any documents. What happened? 

Clustering is meant to reflect the current view into the documents. This means that when reclustering happens and your clusters change, previous searches that were based on clusters will no longer return documents. If you would like to keep a record of the documents that were in a particular cluster, we recommend adding them to a binder before reclustering.


Why are there more documents in my project than there are in Clustering? 

In your project there will likely be a set of documents, outliers, that don’t get included in Clustering. This is because the Clustering algorithm could not place them into a cluster, meaning that their text isn’t sufficiently similar to any of the clusters. Rather than include that “white noise” in the Clustering page, we have chosen to remove these outliers. 


Why do my identical projects have different Clustering visualizations? 

The Clustering algorithm is non-deterministic. This means that two projects with the same documents will not necessarily have the same Clustering outputs. Because of this, the Clustering visualizations on two identical projects may look slightly different from each other, with small variations in the clusters themselves or with clusters positioned in different places.


Why does the list of nearest neighbors in the context panel include documents that aren’t in the same cluster as the document I’m looking at?

Both the Clustering page and the Clustering section of the context panel use TF-IDF. In the context panel, however, we don't use a document's cluster to compute its neighbors. Instead, we compute the distance between that document and every other document in the corpus, then select the nearest ones.

A document on the edge of cluster A may be closer to a document in cluster B than some documents in its own cluster. The document in cluster B could be considered the nearest neighbor, but not part of the same cluster. You can think of various clusters and neighbors like a smiley face with two eyes and a mouth. The left eye might be closer to the smile than the right eye, but the eyes are grouped together due to their shared characteristics. We believe that both elements, the distance between documents and their particular clusters, provide valuable information you can use to better understand your corpus. 
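A minimal Python sketch of this situation follows. The 2D positions, cluster labels, and the `nearest_neighbor` helper are hypothetical and chosen only for illustration; the real computation runs on TF-IDF representations rather than hand-picked coordinates.

```python
import math

# Hypothetical positions and cluster labels for five documents.
docs = {
    "a1": ((0.0, 0.0), "A"),
    "a2": ((1.0, 0.0), "A"),
    "a_edge": ((4.0, 0.0), "A"),  # sits on the edge of cluster A
    "b1": ((5.0, 0.0), "B"),
    "b2": ((6.0, 0.0), "B"),
}


def nearest_neighbor(doc_id):
    """Find the closest other document, ignoring cluster labels entirely."""
    position, _cluster = docs[doc_id]
    others = (other for other in docs if other != doc_id)
    return min(others, key=lambda other: math.dist(position, docs[other][0]))


neighbor = nearest_neighbor("a_edge")
print(neighbor, docs[neighbor][1])  # b1 B
```

The edge document belongs to cluster A, yet its nearest neighbor is in cluster B, because the neighbor search looks only at distance.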


When would I want to recluster my documents? 

If you add or remove a substantial number of documents from your project, you will want to recluster. This will run our Clustering algorithm on your documents and assign them to clusters based on the most recent available information, resulting in more accurate clusters. Because reclustering takes time and removes the results from any existing searches of clusters you have done, we recommend that you recluster sparingly.

However, for users who accessed Clustering before our June 3, 2022 release, we recommend reclustering promptly in order to use cluster depth functionality and have improved, refactored cluster terms.


Why does it say on my project that "clusters are currently being generated"?

The time it takes to generate clusters is impacted by how large the project is. For larger projects, it may take up to 48 hours for Clustering to complete. Please reach out to our support team if your clusters are still generating after this time. 


I’m having issues viewing the Clustering Page

The Clustering page requires WebGL, a JavaScript API for rendering interactive 3D graphics, to render the Clustering data visualization. There are a number of reasons why WebGL might fail to work. Please ensure that your browser version and graphics drivers are up to date and compatible with WebGL. To see if your browser supports WebGL, please visit to confirm that you have WebGL enabled.

If you are running an enterprise version of your browser, it is possible for WebGL to be disabled by a policy set by your IT department. For example, in Chrome, the Disable3DAPIs policy may be set to "true" and/or the HardwareAccelerationModeEnabled policy may be set to "false." If this is the case, Clustering may not be viewable in your browser until these policies are changed. Please note that it may take a while for changed policies to take effect in your browser.

If you are having issues and are using Chrome, Firefox, or Microsoft Edge, please ensure that the Use hardware acceleration when available setting is toggled on. This setting can also help with reducing lag when interacting with the Clustering visualization.

  • In Chrome, this can be found by typing chrome://settings/system into the address bar.
  • In Edge, this can be found by typing edge://settings/system into the address bar.
  • In Firefox, go to about:preferences, and under Performance, check that you have enabled the Use hardware acceleration when available setting. If you are using the recommended performance settings, this should already be enabled.

Once it is toggled on, relaunch your browser and try viewing the Clustering page again. If WebGL still fails to work, please contact our support team for further assistance.
