For more information about Clustering, please see the help articles in our Clustering section.
Table of contents
- What are the minimum and maximum Clustering document counts?
- How are duplicates handled? What about near duplicates? Produced versions?
- What happens when a new document fulfills the Clustering search criteria? Does the visualization need to be reclustered?
- What information is Clustering based on?
- Why are some of the documents in a cluster really far away from the rest of the cluster?
- Why do I have words that show up as cluster defining terms that are barely in my documents?
- Why do the singular and plural of the same word both show up as Cluster defining terms?
- Will Clustering work with non-English documents?
- When should I use Predictive Coding and when should I use Clustering?
- I’m seeing the same term across multiple clusters. Why is this happening, and what does it mean?
- I created a search based on a cluster and now it doesn’t return any documents. What happened?
- Why are there more documents that hit my Clustering search criteria than there are present in Clustering?
- Why do my identical projects with the same documents clustered have different Clustering visualizations?
- Why does the list of nearest neighbors in the context panel include documents that aren’t in the same cluster as the document I’m looking at?
- When would I want to recluster my documents?
- When would I want to edit my Clustering search criteria versus create a filter overlay?
- Why does it say on my project that "clusters are currently being generated"?
- I’m having issues viewing the Clustering page
What are the minimum and maximum Clustering document counts?
Clustering requires at least 1,000 clusterable documents. Documents without any text, for example, are not clusterable. The minimum cluster size is 32 documents, but cluster sizes will vary. Clustering supports up to 25 million documents. For larger projects, we recommend clustering a search or placing your documents into a smaller partial project to utilize Clustering.
How are duplicates handled? What about near duplicates? Produced versions?
Exact duplicates, including produced versions, are included in Clustering visualizations. A parent document and its duplicates should appear in the same cluster, as they will be identified as conceptually similar. The same is true for documents and their near duplicates. When a cluster is selected, the side panel will display how many of the documents are unique and how many are duplicates. You can run a search for either kind.
What happens when a new document fulfills the Clustering search criteria? Does the visualization need to be reclustered?
When a new document fulfills the Clustering search criteria, no changes are made to the existing Clustering visualization until a user with Cluster administrator permissions kicks off a recluster. After a recluster, any previous searches that reference Cluster IDs will no longer return documents, so we recommend that you recluster sparingly.
What information is Clustering based on?
The placement of documents in clusters considers the text of the document as well as the following metadata fields: Subject, Author, Title, To, From, Cc, and Bcc.
Why are some of the documents in a cluster really far away from the rest of the cluster?
Everlaw clusters documents in dozens of dimensions, though the Clustering page only represents them in two. When the display of clusters is flattened to appear in 2D, documents in the same cluster may appear to be far apart or even appear amid another cluster when in a higher dimension they are clustered together.
You can imagine that you are looking at a clear soda can. If you place one sticker on its top and one on its bottom, the stickers will appear separated by the entire can. However, if you were to take a picture of the can from the top down, the two stickers would appear close together. This effect, viewing a 3D figure in 2D, is similar to what happens when Everlaw flattens the dimensions from dozens to two. This graphic is a helpful display of this effect; you can see that clusters appear homogeneous from some views, but when the cube turns, the clusters seem to mix together. This can also contribute to a situation where a document is part of one cluster but appears to be located in the middle of another.
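The soda can analogy can be sketched in a few lines of Python. This is only an illustration of how dropping a dimension distorts distances; the coordinates and the `flatten_top_down` helper are made up for the example and do not represent how Everlaw actually reduces dimensions.

```python
import math

# Two stickers on a can of height 10: one on the top, one on the bottom.
# The coordinates are hypothetical and chosen only for illustration.
top_sticker = (0.0, 0.0, 10.0)
bottom_sticker = (0.0, 0.0, 0.0)


def flatten_top_down(point):
    """Drop the height (z) coordinate, like photographing the can from above."""
    x, y, _z = point
    return (x, y)


# Distance between the stickers on the real, three-dimensional can.
distance_3d = math.dist(top_sticker, bottom_sticker)

# Distance between the same stickers in the flattened, top-down picture.
distance_2d = math.dist(
    flatten_top_down(top_sticker), flatten_top_down(bottom_sticker)
)

print(distance_3d, distance_2d)
```

The stickers are a full can apart in three dimensions but land on the same spot in the flattened view. Going from dozens of dimensions to two can distort apparent distances in a similar way.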
Why do I have words that show up as cluster defining terms that are barely in my documents?
The Clustering system first clusters documents and then looks for the documents that are most typical of each cluster (the “exemplars” of a given cluster). From there, the cluster defining terms are the terms that are the highest weighted to those exemplar documents. Since the system heavily weights words that only appear in a few documents, it could lead to top terms that are not very common in your documents. This can mean that the cluster defining terms may not be typical of the cluster overall, just the exemplars.
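To see why rare words can rise to the top, here is a minimal sketch of a textbook TF-IDF score. It assumes the common `tf * log(N / df)` formulation; the exact weighting Everlaw uses is not documented here, and the tiny corpus and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: term frequency times log(N / document frequency).

    Assumes the term appears in at least one document of the corpus.
    """
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df)
    return tf * idf


# Hypothetical four-document corpus; "escrow" appears in only one document.
corpus = [
    ["invoice", "payment", "escrow"],
    ["invoice", "payment", "invoice"],
    ["invoice", "payment", "schedule"],
    ["invoice", "meeting", "payment"],
]

# Treat the first document as the cluster's exemplar.
exemplar = corpus[0]
scores = {term: tf_idf(term, exemplar, corpus) for term in set(exemplar)}
top_term = max(scores, key=scores.get)

print(top_term, scores)
```

In this sketch "invoice" and "payment" appear in every document, so their weight is zero, while "escrow" appears in a single document and becomes the top term even though it is rare in the corpus overall.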
Why do the singular and plural of the same word both show up as Cluster defining terms?
The Clustering system treats different text strings as entirely different, regardless of how similar they are. For example, it views the words “car” and “cars” as being just as dissimilar as the words “boat” and “vampire.” If we instead assumed that a word with -s added is the same word, entirely different words like “wood” and “woods” would be parsed as the same thing even though they have different meanings, while words like “half” and “halves” would still not be captured as plural variations.
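The following Python sketch illustrates both points: exact string matching counts “car” and “cars” separately, and a naive "strip the trailing s" rule mishandles the examples above. The `term_counts` and `naive_singular` helpers are hypothetical and are not part of Everlaw.

```python
from collections import Counter


def term_counts(text):
    """Count terms by exact string match after lowercasing."""
    return Counter(text.lower().split())


def naive_singular(word):
    """Naive rule: treat a trailing 's' as a plural marker and strip it."""
    return word[:-1] if word.endswith("s") else word


counts = term_counts("The car dealer sold two cars and one car")

# Exact matching keeps "car" and "cars" as two unrelated terms.
print(counts["car"], counts["cars"])

# The naive rule merges words with different meanings...
print(naive_singular("woods"))

# ...and still fails to map an irregular plural back to its singular.
print(naive_singular("halves"))
```

With the naive rule, “woods” collapses into “wood,” while “halves” becomes “halve” rather than “half.”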
Will Clustering work with non-English documents?
Yes. Clustering is language-agnostic, like Everlaw’s Predictive Coding feature. You can consult our FAQ around non-English documents in Predictive Coding for more details.
When should I use Predictive Coding and when should I use Clustering?
Everlaw’s Predictive Coding feature relies on user-generated review decisions to determine which documents will likely be relevant based on whatever criteria you set. In contrast, Clustering is unsupervised - it creates visualizations without user input. This means that as you review documents, Predictive Coding will likely have more accurate insights into your data. At the outset of a case, however, Clustering will be fully up and running before you begin review.
I’m seeing the same term across multiple clusters. Why is this happening, and what does it mean?
Some clusters will have the same defining terms, but the algorithm categorizes the document sets as conceptually different and thus, the terms appear in separate clusters. For example, two clusters with the term “fraud” relate to fraud, but one is a set of financial spreadsheets, and one is some emails referencing fraud.
In some cases, this may not appear to be true and you’ll look at two clusters with the same term, but the documents will appear to be very similar. This is likely to improve over time as our Clustering algorithms learn from being implemented in your projects.
I created a search based on a cluster and now it doesn’t return any documents. What happened?
Clustering is meant to reflect the current view into the documents. This means that when reclustering happens and your clusters change, previous searches that were based on clusters will no longer return documents. If you would like to keep a record of the documents that were in a particular cluster, we recommend adding them to a binder before reclustering.
Why are there more documents that hit my Clustering search criteria than there are present in Clustering?
Of the documents that fulfill your Clustering search criteria, there will likely be a set of documents that do not get included in Clustering. This is because the documents may not have any text, or may not have enough text, to be clustered. Additionally, some documents may be included in Clustering but not present on the visualization because their text is not sufficiently similar to any of the clusters. Rather than include that “white noise” in the Clustering page, we have chosen to remove these documents.
Why do my identical projects with the same documents clustered have different Clustering visualizations?
The Clustering algorithm is non-deterministic. This means that two projects with the same documents clustered will not necessarily have the same Clustering outputs. Because of this, the Clustering visualizations on two identical document sets may appear slightly different from each other through slight variance in clusters or clusters located in different places.
Why does the list of nearest neighbors in the context panel include documents that aren’t in the same cluster as the document I’m looking at?
Both the Clustering page and the Clustering section of the context panel use TF-IDF. In the context panel, however, we don't use a document's cluster to compute its neighbors. Instead, we compute the distance between that document and every other document in the corpus, then select the nearest ones.
A document on the edge of cluster A may be closer to a document in cluster B than some documents in its own cluster. The document in cluster B could be considered the nearest neighbor, but not part of the same cluster. You can think of various clusters and neighbors like a smiley face with two eyes and a mouth. The left eye might be closer to the smile than the right eye, but the eyes are grouped together due to their shared characteristics. We believe that both elements, the distance between documents and their particular clusters, provide valuable information you can use to better understand your corpus.
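The edge-of-cluster case can be sketched in Python as follows. The document IDs, 2D coordinates, and cluster labels are invented for illustration, and plain Euclidean distance stands in for whatever distance measure Everlaw actually applies.

```python
import math

# Hypothetical documents: id -> (position, cluster label).
docs = {
    "a1": ((0.0, 0.0), "A"),
    "a2": ((1.0, 0.0), "A"),
    "a3": ((4.0, 0.0), "A"),  # sits on the edge of cluster A
    "b1": ((5.0, 0.0), "B"),
    "b2": ((8.0, 0.0), "B"),
}


def nearest_neighbor(doc_id):
    """Return the closest other document, ignoring cluster labels entirely."""
    point, _cluster = docs[doc_id]
    others = (other for other in docs if other != doc_id)
    return min(others, key=lambda other: math.dist(point, docs[other][0]))


neighbor = nearest_neighbor("a3")
print(neighbor, docs["a3"][1], docs[neighbor][1])
```

Here "a3" belongs to cluster A, yet its nearest neighbor is "b1" in cluster B, because the neighbor search looks only at distance and never at cluster membership.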
When would I want to recluster my documents?
If a substantial number of documents now match your Clustering criteria, or if a significant number of documents no longer match your criteria as of the last Clustering task, you will want to recluster. This will run our Clustering algorithm on your documents and assign them to clusters based on the most recent available information, resulting in more accurate clusters. Because reclustering takes time and removes the results from any existing searches of clusters you have done, we recommend that you recluster sparingly.
When would I want to edit my Clustering search criteria versus create a filter overlay?
Editing your Clustering search criteria changes which documents are included in the Clustering algorithm and will recluster your visualization. Filtering your visualization does not change which documents are included in Clustering. Instead, it creates a visual filter where only the documents that match your filter will be displayed. Filters can easily and quickly be added or removed. We recommend using the Filter functionality to quickly hone in on the location of relevant data, such as choosing to overlay an important binder of documents and then narrowing in on clusters of interest based on the clusters those documents are in. On the other hand, we recommend editing your Clustering search criteria when there are specific datasets you know for certain you want to exclude or include, such as clustering all documents except those from known spam email senders. We also recommend editing your Clustering search criteria if your dataset is too large, such as for projects with millions of documents.
Why does it say on my project that "clusters are currently being generated"?
The time it takes to generate clusters is impacted by how large the project is. For larger projects, it may take up to 48 hours for Clustering to complete. Please reach out to our support team if your clusters are still generating after this time.
I’m having issues viewing the Clustering page
The Clustering page requires WebGL, a JavaScript API for rendering interactive 3D graphics, to render the Clustering data visualization. There are a number of reasons why WebGL might fail to work. Please ensure that your browser version and graphics drivers are up to date and compatible with WebGL. To check whether your browser supports WebGL and has it enabled, please visit https://get.webgl.org/.
If you are running an enterprise version of your browser, it is possible for WebGL to be disabled by a policy set by your IT department. For example, in Chrome, the Disable3DAPIs policy may be set to "true" and/or the HardwareAccelerationModeEnabled policy may have been set to "false." If this is the case, then Clustering may not be viewable for your browser until these policies are changed. Please note that it may take a while for these settings to affect your browser!
If you are having issues and are using Chrome, Firefox, or Microsoft Edge, please ensure that the Use hardware acceleration when available setting is toggled on. This setting can also help with reducing lag when interacting with the Clustering visualization.
- In Chrome, this can be found by typing chrome://settings/system into the address bar.
- In Edge, this can be found by typing edge://settings/system into the address bar.
- In Firefox, go to about:preferences, and under Performance, check that you have enabled the Use hardware acceleration when available setting. If you are using the recommended performance setting then this should be enabled.
Once it is toggled on, relaunch your browser and try viewing the Clustering page again. If WebGL still fails to work, please contact our support team for further assistance.