For more information about Clustering, please see the help articles in our Clustering section.
I read the release notes, but I don’t see Clustering in my project. How do I access Clustering?
Clustering has been released in beta on the Everlaw platform with limited customer availability. We want to be sure that we do it right without disrupting your existing workflows, so it will take some time before Clustering’s general release. When Clustering is fully released, it will be available on all projects. If you would like to see if your project is eligible for the Clustering beta and immediate access, please contact email@example.com.
Will Clustering be available for all projects?
Yes, Clustering will be available to all projects by default (provided that they have more documents than our required minimum) in the months following its December 2020 beta release.
What is the minimum case size?
Clustering is possible on any project with 1,000 or more documents. The minimum cluster size is 32 documents, but cluster sizes will vary.
How are duplicates handled? What about near duplicates? Produced versions?
Exact duplicates are included in Clustering visualizations. A parent document and its duplicates should appear in the same cluster, as they will be identified as conceptually similar. The same is true for documents and their near duplicates. When a cluster is selected, the side panel will display how many of the documents are unique and how many are duplicates. You can run a search for either kind.
What happens when a new doc is added to a project? Does the entire project need to be reclustered?
When a new document is added to a project, no changes are made to the existing Clustering visualization until a user with Cluster administrator permissions kicks off a recluster. After a recluster, any previous searches that reference Cluster IDs will no longer return documents, so we recommend that you recluster sparingly.
What information is Clustering based on?
The placement of documents in clusters considers the text of the document as well as the following metadata fields: Subject, Author, Title, To, From, Cc, and Bcc.
Why are some of the documents in a cluster really far away from the rest of the cluster?
Everlaw clusters documents in dozens of dimensions, though the Clustering page only represents them in two. When the display of clusters is flattened to appear in 2D, documents in the same cluster may appear to be far apart, or even appear amid another cluster, even though they are grouped together in the higher-dimensional space.
You can imagine that you are looking at a clear soda can. If you place one sticker on its top and another on its bottom, the stickers are separated by the entire height of the can. However, if you were to take a picture of the can from the top down, the two stickers would appear close together. This effect, viewing a 3D figure in 2D, is similar to what happens when Everlaw flattens the dimensions from dozens to two. This graphic is a helpful display of this effect; you can see that clusters appear homogeneous from some views, but when the cube turns, the clusters seem to mix together. This can also contribute to a situation where a document is part of one cluster but appears to be located in the middle of another.
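The flattening effect can be sketched in a few lines of Python. This is only an illustration: it uses three dimensions instead of dozens, made-up coordinates, and a simple "drop the last coordinate" projection, which is an assumption and not necessarily how Everlaw reduces dimensions. The function name nearest_label is hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_label(target, others, dims):
    """Cluster label of the point closest to target, using only the first `dims` coordinates."""
    best = min(others, key=lambda item: distance(target[:dims], item[1][:dims]))
    return best[0]

# Made-up coordinates: the target document belongs to cluster "A".
target = (0.0, 0.0, 0.0)
others = [
    ("A", (4.0, 0.0, 0.0)),   # same cluster, same "height" as the target
    ("B", (1.0, 0.0, 10.0)),  # different cluster, far away along the third axis
]

nearest_in_3d = nearest_label(target, others, dims=3)  # full space
nearest_in_2d = nearest_label(target, others, dims=2)  # flattened, top-down view
print(nearest_in_3d, nearest_in_2d)
```

In the full space the closest point belongs to the target's own cluster, but once the third coordinate is dropped, the closest point on screen belongs to the other cluster, much like the two stickers on the soda can.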
Why do I have words that show up as cluster defining terms that are barely in my documents?
The Clustering system first clusters documents and then looks for the documents that are most typical of each cluster (the “exemplars” of a given cluster). From there, the cluster defining terms are the terms that are most distinctive to those exemplar documents. Since the system heavily weights words that only appear in a few documents, the top terms may be words that are not very common in your documents. As a result, the cluster defining terms may be typical of the exemplars only, not of the cluster overall.
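To see why a rare word can rise to the top, here is a minimal sketch of TF-IDF style weighting. The exact formula Everlaw uses is not stated here, so the common form tf * log(N / df) is an assumption, and the tiny corpus, the helper tfidf_scores, and the choice of exemplar are all made up for illustration.

```python
import math
from collections import Counter

def tfidf_scores(exemplar_tokens, corpus):
    """Score each exemplar term as term frequency * log(N / document frequency)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term at most once per document
    term_freq = Counter(exemplar_tokens)
    return {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}

# Hypothetical four-document cluster; the first document plays the exemplar.
corpus = [
    ["contract", "payment", "zebra"],
    ["contract", "payment"],
    ["contract", "invoice"],
    ["contract", "payment", "invoice"],
]
exemplar = corpus[0]

scores = tfidf_scores(exemplar, corpus)
top_term = max(scores, key=scores.get)
print(top_term)
```

Here “contract” appears in every document and scores zero, while “zebra” appears only in the exemplar and becomes the top term, even though it is barely present in the cluster as a whole.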
Why do the singular and plural of the same word both show up as Cluster defining terms?
The Clustering system treats different text strings as entirely different terms, regardless of how similar they are. For example, it views the words “car” and “cars” as being just as dissimilar as the words “boat” and “vampire.” This is because a rule that treats any word with an added -s as the same word would fail in both directions: entirely different words like “wood” and “woods” would be merged even though they have different meanings, and irregular plurals like “half” and “halves” would still not be matched.
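The trade-off can be illustrated with a short Python sketch. The naive_singular rule below is hypothetical and is not something Everlaw applies; it exists only to show what would go wrong if trailing -s were stripped.

```python
from collections import Counter

def naive_singular(word):
    """Hypothetical rule: treat a trailing 's' as a plural marker and strip it."""
    return word[:-1] if word.endswith("s") else word

tokens = ["car", "cars", "wood", "woods", "half", "halves"]

# Exact string matching: every distinct string is its own term.
exact_counts = Counter(tokens)

# Naive merging: "wood" and "woods" collapse into one term,
# while "half" and "halves" still end up as two different terms.
merged_counts = Counter(naive_singular(t) for t in tokens)

print(exact_counts)
print(merged_counts)
```

With exact matching, “car” and “cars” are counted separately. With the naive rule, “wood” and “woods” are wrongly merged, and “halves” becomes “halve,” which still does not match “half.”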
Will Clustering work with non-English documents?
When should I use Predictive Coding and when should I use Clustering?
Everlaw’s Predictive Coding feature relies on user-generated review decisions to determine which documents will likely be relevant based on whatever criteria you set. In contrast, Clustering is unsupervised: it creates visualizations without user input. This means that as you review documents, Predictive Coding will likely have more accurate insights into your data. At the outset of a case, however, Clustering will be fully up and running before you begin review.
I’m seeing the same term across multiple clusters. Why is this happening, and what does it mean?
Some clusters will have the same defining terms because the algorithm categorizes their document sets as conceptually different, so the same term appears in separate clusters. For example, two clusters with the term “fraud” may both relate to fraud, but one is a set of financial spreadsheets and the other is a set of emails referencing fraud.
In some cases, the difference may not be apparent: you’ll look at two clusters with the same term, and the documents will appear to be very similar. This is likely to improve over time as our Clustering algorithms learn from being implemented in your projects.
I created a search based on a cluster and now it doesn’t return any documents. What happened?
Clustering is meant to reflect the current view into the documents. This means that when reclustering happens and your clusters change, previous searches that were based on clusters will no longer return documents. If you would like to keep a record of the documents that were in a particular cluster, you should add them to a binder before reclustering.
Why are there more documents in my project than there are in Clustering?
In your project, there will likely be a set of documents, outliers, that don’t get included in Clustering. This is because the Clustering algorithm could not place them into a cluster, meaning that their text isn’t sufficiently similar to any of the clusters. Rather than include that “white noise” in the Clustering page, we have chosen to remove these outliers.
Why does the list of nearest neighbors in the context panel include documents that aren’t in the same cluster as the document I’m looking at?
Both the Clustering page and the Clustering section of the context panel use TF-IDF. In the context panel, however, we don't use a document's cluster to compute its neighbors. Instead, we compute the distance between that document and every other document in the corpus, then select the nearest ones.
A document on the edge of cluster A may be closer to a document in cluster B than some documents in its own cluster. The document in cluster B could be considered the nearest neighbor, but not part of the same cluster. You can think of various clusters and neighbors like a smiley face with two eyes and a mouth. The left eye might be closer to the smile than the right eye, but the eyes are grouped together due to their shared characteristics. We believe that both elements, the distance between documents and their particular clusters, provide valuable information you can use to better understand your corpus.
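The smiley-face picture can be written out as a small Python sketch. The coordinates, document names, and cluster labels are made up, and plain Euclidean distance is an assumption; the distance measure actually used on document vectors is not specified here.

```python
import math

# Hypothetical documents: name -> (cluster label, 2D position).
docs = {
    "left_eye": ("eyes", (-1.0, 1.0)),
    "right_eye": ("eyes", (1.0, 1.0)),
    "mouth_left": ("mouth", (-1.0, -0.5)),
    "mouth_mid": ("mouth", (0.0, -1.0)),
    "mouth_right": ("mouth", (1.0, -0.5)),
}

def nearest_neighbor(name):
    """Closest other document by distance alone, ignoring cluster labels."""
    _, origin = docs[name]
    candidates = (other for other in docs if other != name)
    return min(candidates, key=lambda other: math.dist(origin, docs[other][1]))

neighbor = nearest_neighbor("left_eye")
print(neighbor, docs[neighbor][0])
```

The left eye's nearest neighbor is the left corner of the mouth, even though the left eye is clustered with the right eye. Distance and cluster membership answer different questions, which is why the two can disagree.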
When would I want to recluster my documents?
If you add or remove a substantial number of documents from your project, you will want to recluster. This will run our Clustering algorithm on your documents and assign them to clusters based on the most recent available information, resulting in more accurate clusters. Because reclustering takes time and causes any existing cluster-based searches to stop returning documents, we recommend that you recluster sparingly.