Clustering visualizes documents in your dataset by conceptual similarity. It generates insights about concepts in your documents without requiring any user input. Traditional search tools require you to have a baseline understanding of what’s in your documents and what to search for, but with Clustering, you can begin to learn about data without any prior background. This makes Clustering a valuable tool during early case assessment and other critical workflows throughout the discovery process.
Use this article to understand how to:
- Generate clusters
- Navigate the Clustering page
- Use Clusters to identify key documents you might not otherwise know about
The Clustering visualization
On the Clustering page, each document is represented as a data point and belongs to a color-corresponding cluster. Each cluster is also represented by a polygon, which approximates where the clustered documents are on the page. Terms associated with each cluster can give you a sense of the concepts within the documents.
Reviewers can also view conceptually similar documents from the review window. They can access these documents by selecting the Clustering context. For more information, see this help article about the context panel.
As review is conducted on your project, you can color-code your clustered documents by ratings, codes, and predictive coding scores to get a better sense of how your review is progressing. You can learn more about these workflows at the end of this article.
How does Clustering work?
Clustering uses an unsupervised machine learning algorithm that analyzes words and metadata (author, subject, title, to, from, cc, and bcc) across all of your documents to determine conceptual similarity. This algorithm utilizes a bag of words model weighted by TF-IDF. Clustering also uses a density-based clustering algorithm, allowing you to visualize document similarity by relative distance more easily than in traditional k-means clustering algorithms.
Here’s a very basic overview of the algorithm’s process. Since an algorithm isn't a human, it can’t interpret the meaning of words to determine document similarity. First, it has to break down the words and metadata from documents into numbers. Then, the algorithm removes “stop words” and punctuation from its consideration. We filter these stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These are the same stop word considerations used in Everlaw’s predictive coding algorithm.
Next, the algorithm compares the frequency of each word in each document to all the other words in the documents throughout the entire database. The model starts to acquire clues about important concepts in documents by weighting terms by their relative frequency.
To better explain the notion of comparing relative weighted frequency, here’s a simple example. Let’s say you have two documents that both contain the word “apple.” A quick “ctrl+f” search for the keyword “apple” tells you that each document contains the word “apple” 100 times. Compared to the other documents in the database, this word shows up far more often in these two documents. After reading both documents, you gather that one is a warranty for an Apple computer, and the other is a contract with apple farmers. Like you in this example, Clustering can identify this conceptual difference because, in addition to counting the frequency of the word “apple,” it compares the relative frequency of “apple” to other words in the document and across all words in all documents throughout the database. Using the same example, it may notice that the Apple computer document has other highly weighted terms like “Cupertino” and “device,” while the apple farmer contract has terms like “agriculture” and “harvest.” These documents would be considered conceptually different, but might each belong to a cluster defined by the term “apple.”
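The weighting idea in the example above can be sketched in a few lines of Python. This is only an illustration, not Everlaw's implementation: it assumes the textbook TF-IDF formula (term frequency multiplied by the log of inverse document frequency), and the three sample documents and the function names `tokenize` and `tfidf` are hypothetical. The stop word list is the one given earlier in this article.

```python
import math
import re
from collections import Counter

# Stop words listed earlier in this article.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF, an assumption)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical sample documents.
docs = [
    "Apple warranty for the Apple device purchased in Cupertino",
    "Apple contract with apple farmers for the harvest and agriculture",
    "Invoice for office supplies",
]
vectors = tfidf(docs)
```

In this sketch, “apple” appears in two of the three documents, so its weight is discounted, while “cupertino” and “harvest” each appear in only one document and end up weighted more heavily. That is what lets the two “apple” documents be told apart.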
Once the algorithm has a better understanding of the relatively frequent terms in each document, it can cluster documents together by similarity and define topics for each cluster. The ten highest-weighted terms are associated with each cluster, and you can select and explore by terms in the clustering visualization.
The algorithm does not visualize documents that are considered “outliers,” which are documents considered not meaningfully similar to the other documents representing each cluster. Clustering will also exclude documents that don’t have clusterable text from visualization. As just one example, a document may have text, but the text might be only symbols or single character letters.
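To illustrate how a density-based approach can leave some documents unclustered, here is a minimal sketch. This article does not name the specific algorithm Clustering uses, so DBSCAN is used here purely as a well-known stand-in. The 2D points, the `eps` radius, and the `min_pts` threshold are all hypothetical values chosen for the example.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for an outlier."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not dense enough; may be claimed as a border point later
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
        cluster += 1
    return labels

# Hypothetical document positions: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the isolated point at (50, 50) has no close neighbors, so it is labeled -1, which is analogous to a document that is treated as an outlier rather than assigned to a cluster.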
Access Clustering
Required permissions: Any user with at least Clustering View permissions can view the Clustering page. Users with Clustering Admin permissions can recluster. Both permissions allow users to view, select, filter, and run searches on clustered documents.
To access Clustering, go to Document Analytics > Clustering. This takes you to the Clustering page. If this is the first time someone from the project has accessed Clustering, the clusters will not yet be generated.
If you have Clustering Admin permissions, you can select which documents you would like to cluster by using the search query builder. To get started, select Select documents to cluster.
To cluster all the documents in the project, select Cluster.
Note
The amount of time it takes to generate Clustering is dependent on the number of documents clustered and may take several hours.
Each cluster is a group of documents. Each document is represented as a dot, and dots of the same color belong to the same cluster. The cluster is also rendered as a polygon, which denotes generally where the cluster’s documents are. Each cluster has ten terms that best represent it; the top three are shown in the Clustering visualization. By clicking a cluster, you can view all ten terms.
Use Clustering
Basic navigation
With Clustering, you can explore concepts in your data at a glance and at a high level. To do this, you can leverage basic navigation tools. You can access shortcuts by pressing “?” (shift + /) on your keyboard while on the Clustering page.
- Click and drag to pan around the page.
- To explore clusters more closely, you can:
  - pinch to zoom on a laptop
  - click the zoom in/out buttons in the toolbar
  - press “i” on your keyboard to zoom in.
  As you zoom in, you can see more detail of the clusters. You can press “o” on your keyboard to zoom out.
Clustering depth
You can break clusters down further and drill deeper into your visualization by changing Clustering depth settings, found on the toolbar. Clustering depth describes the level of granularity of the clusters. At a lower depth, conceptual clusters are grouped together into bigger themes. As you increase the depth, the larger clusters break into more specific clusters, representing sub-topics in the concept.
For example, at a low depth, you might see one big cluster with the top three words "coverage, claim, policy," representing documents that generally relate to insurance. When you increase the depth, this cluster would break into smaller clusters, representing more specific aspects of insurance. For example you might see clusters with terms such as "collision, appraisal, casualty," which would represent car insurance and "beneficiary, annuity, term," which might relate more specifically to life insurance.
Clustering has two different depth settings:
- Auto-depth: When auto-depth is toggled on, clusters dynamically break up into smaller sub-clusters as you zoom in and merge together into larger clusters as you zoom out.
  - You can toggle auto-depth on and off with the keyboard shortcut “a.”
- Manual depth: When auto-depth is toggled off, you can manually change the depth level with the numbered slider in the toolbar. Manual depth levels range from 1 to 5 and are specific to your dataset. As such, some visualizations may have 3 levels, while others may have 5. In general, projects with more documents tend to have more depth levels than smaller projects.
  - You can increase manual depth to show more detail with the shortcut “m,” and decrease it to show less detail with the shortcut “l” (lowercase L).
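One way to picture depth is as a tree of clusters, where increasing the depth replaces each cluster with its sub-clusters. The sketch below is only a mental model: the `Cluster` class, the `clusters_at_depth` function, and the tree itself are hypothetical and are not taken from Everlaw. The terms come from the insurance example above.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    terms: list
    children: list = field(default_factory=list)

def clusters_at_depth(top_level, depth):
    """Return the clusters shown at a depth level (1 is the lowest depth).

    Each step deeper replaces a cluster with its sub-clusters; a cluster
    with no sub-clusters stays as it is.
    """
    current = list(top_level)
    for _ in range(depth - 1):
        deeper = []
        for cluster in current:
            deeper.extend(cluster.children if cluster.children else [cluster])
        current = deeper
    return current

# Hypothetical hierarchy based on the insurance example.
insurance = Cluster(
    terms=["coverage", "claim", "policy"],
    children=[
        Cluster(terms=["collision", "appraisal", "casualty"]),
        Cluster(terms=["beneficiary", "annuity", "term"]),
    ],
)

low_depth = clusters_at_depth([insurance], depth=1)
higher_depth = clusters_at_depth([insurance], depth=2)
```

At depth 1 this returns the single broad insurance cluster. At depth 2 it returns the two more specific clusters for car insurance and life insurance.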
Outliers
As larger clusters are broken into smaller sub-clusters, some documents included in the original, large cluster will not be conceptually similar enough to the more specific, smaller sub-clusters. These documents are considered outliers and change color to be displayed in gray. The documents that are considered outliers change based on the depth level you set in Clustering.
In general, as you increase depth and therefore cluster specificity, the number of documents considered outliers will increase. There are no outlier documents at the lowest, default depth level; as such, all outlier documents are included in the total number of documents visible in Clustering, which can be found in the gear on the toolbar.
You can choose to hide and show clusters or documents, including outliers, by using the Show dropdown menu checkboxes in the toolbar. If you deselect outliers, outlier documents specific to that depth are removed from your visualization. They are reincorporated as you decrease depth. Some functionalities in the toolbar (color overlays, filters, or document select mode) require documents to be visible.
Note
If your visualization was created before our June 3, 2022 release, you need to recluster to utilize depth functionality.
Cluster and document selection
When you select clusters or documents, you can see additional information about your selection.
To select clusters and documents in Clustering:
- Select any cluster by selecting the cluster itself: Clicking a document in an unselected cluster selects the cluster.
  Clicking any document in an already selected cluster opens a document preview. In the preview, you can open the document for review, or move to the next document in the cluster. Previewing documents helps you understand how they might be similar to each other and whether you want to consider them for further review or exploration.
- Drag-select arbitrary sets of documents: Select Document selection mode in the toolbar, or press "d" on your keyboard. This lets you draw a rectangle to select any documents within the rectangle, regardless of the cluster they belong to.
- Click to select multiple clusters: Select Multiselection mode in the toolbar, press “shift+click” and then drag to select multiple clusters, or press "x" on your keyboard to turn on multiselection mode. This lets you select multiple clusters at once.
- Select clusters through term selection: Each cluster displays three representative words based on the clustering algorithm. To view a full list of the top three cluster terms across all clusters on the page, select the Explore cluster terms dropdown in the toolbar.
  This menu lists each term, the number of clusters for which the term is a top-three term, and the number of unique documents that are in those clusters.
  - Select Export to export all terms.
  - Select Search to search for a specific term.
  - Navigate through the terms by selecting the arrows at the bottom of the menu.
  - Move the menu by dragging the Select cluster terms header.
  - Resize the table by clicking and dragging the bottom right corner.
  - Sort the table by clicking the arrows next to Term to sort alphabetically or reverse alphabetically by cluster term, or by clicking the arrows next to Clusters to sort from the highest or lowest number of clusters.
  Select the checkbox to add clusters with that term as a top term to your selection. You can select multiple terms in the dropdown to include clusters with those terms in your selection.
Once you have a selection, you can also click Fit View (next to the magnifying glass buttons), or press "f" on your keyboard to zoom out to all clusters you’ve selected or to the entire visualization if no clusters are selected.
Clustering panel
Upon any selection, a resizable side panel appears. The side panel includes:
- The count of unique documents in your selected cluster(s)
- The count of total documents in your selected cluster(s)
- The number of clusters selected
- A list of the most representative terms in your selection
- A list of any selected terms.
At the top is the number of unique documents in your selection and the total number of documents including duplicates. Select either number to go to a results table with that search. Each search is assigned a Cluster ID. Cluster IDs are represented in your search, and you can refine your search to build a more narrow set of documents for your review.
Note
Cluster IDs become obsolete when you recluster, so reclustering may affect previous searches.
For example, if you created a dynamic assignment using a Cluster ID as the inclusion criteria, no new documents will meet that inclusion criteria after reclustering. To learn more about reclustering, visit this article's section on settings and reclustering.
The panel includes many ways to learn about and access the documents in the selected cluster(s):
- You can view the ten most representative cluster terms in your selection under Terms. If you have multiple clusters selected, the cluster terms list will weight the terms by the document count of all the selected clusters. If you have selected any cluster terms through the Explore cluster terms table, they will appear under Selected terms.
- Search and Search Term Reports for cluster terms are incorporated into Clustering in this panel. Select Create search from terms at the bottom of the panel. From here you can choose to open a Search or Search Term Report of your cluster-defining terms and/or your selected terms. Note that the Search or Search Term Report will include cluster terms as "OR" content searches.
- Data Visualizer is incorporated in the panel, so you can compare document properties like metadata or document characteristics to your selection.
  To access Data Visualizer, select Data visualization. Use the dropdown to select which visualization you would like to see, such as document type, file path, and binders. Click Open Data Visualizer to see your visualization in Data Visualizer.
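As a rough illustration of how terms from several selected clusters might be combined, the sketch below weights each cluster's term scores by its document count and then joins the top terms with "OR". The exact weighting Everlaw applies is not documented here, so the formula, the data layout, the function names `combined_terms` and `or_query`, and the quoted query string format are all assumptions made for the example.

```python
from collections import defaultdict

def combined_terms(selected_clusters, n=10):
    """Rank terms across clusters, weighting each term score by cluster size."""
    scores = defaultdict(float)
    for cluster in selected_clusters:
        for term, weight in cluster["terms"].items():
            scores[term] += weight * cluster["doc_count"]
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]

def or_query(terms):
    """Join terms into an illustrative OR content search string."""
    return " OR ".join(f'"{term}"' for term in terms)

# Hypothetical selection: one large cluster and one small cluster.
selected = [
    {"doc_count": 200, "terms": {"coverage": 0.9, "claim": 0.5, "policy": 0.6}},
    {"doc_count": 50, "terms": {"collision": 0.9, "claim": 0.8, "appraisal": 0.5}},
]
terms = combined_terms(selected, n=3)
query = or_query(terms)
```

Because the first cluster holds four times as many documents, its terms dominate the combined list, even though “collision” is the strongest term within the smaller cluster.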
Filter documents by search
To filter out documents from view, you can create a search filter. To do so:
- Select Filter in the Clustering toolbar.
- Here, you can use the standard Everlaw query builder to narrow down your clustering view.
- Select Apply, and documents that do not meet your search criteria will be filtered out.
Your search is represented in the toolbar. Click the “X” icon to remove the search. Click Edit Filter or the filter itself in the toolbar to edit your previous search.
Color code documents by coding/predicted relevance
You can color code individual documents by their rating, code(s), or predicted relevance from the toolbar. This is particularly useful for quality controlling (QCing) review, or prioritizing certain sets of documents by their coding decisions or prediction scores. To color code your documents, select Color documents by in the toolbar and select between rating, coding category, or prediction model to color code your documents.
Once you make your selection, a legend appears with each code or rating in the selected category.
Select any code or rating in the legend to filter documents by just that code or rating. All uncoded documents are displayed in light grey.
If you choose a non-mutually exclusive category, documents with multiple codes applied are displayed in taupe.
Settings
You can see information about clustered documents, as well as edit which documents you would like clustered, in the Settings dialog. Select any of the highlighted numbers to open a results table of those documents.
- Under Current clustered search is:
  - The search criteria for clustered documents
  - The number of documents that fulfill that criteria and are therefore clustered
  - The total number of documents in the project
  If you have Clustering Admin permissions, you can change your current clustered search by selecting the Edit button. A search query builder will open, allowing you to build a new search and redefine which documents you would like clustered.
  Changing your clustered search reclusters based on your new criteria; reclustering may take several hours to complete depending on the number of documents. Please refer to the end of this section for best practices and recommendations around reclustering.
- Under Included in Clustering, you can see both the number of documents visible and not visible in Clustering.
  - Documents visible in Clustering are documents that are in the visualization and are represented by a dot on the page.
  - Documents not visible in Clustering are documents that are not present in the visualization because they are not similar enough to other clustered documents. These documents, however, are still evaluated by the Clustering algorithm, and you can still access the Clustering context in the context panel for these documents. For more information on the Clustering context in the context panel, please see this article.
- Under Not included in Clustering, you can see both the number of documents that were not included in Clustering and the number that were not clusterable.
  - The number of documents not included represents documents that did not meet the Clustering search criteria as of the last recluster. For example, if your current clustered search was all documents rated 'Hot,' this number represents all documents not rated hot at the time you clustered the search. As such, this number also counts new documents that have since come to match the search, described below.
  - Some documents may not be clusterable, which is represented by the number of documents that are not clusterable. A document may not be clusterable because it has no text or not enough text to be clustered.
- Under New changes since last recluster, you can see how many new documents match your clustered search criteria but are not included in the visualization.
  - As documents get uploaded to your project or change to match your Clustering document criteria, they are not automatically added to your Clustering visualization.
  - Documents that no longer match your Clustering criteria are not automatically removed from the visualization; however, documents that are deleted from the project will be removed. The cluster polygon remains even if all documents in a cluster are deleted.
To incorporate new documents that match your criteria and remove documents that no longer hit your criteria, you need to recluster. You can recluster if you are a Clustering Admin.
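The rules above amount to simple set arithmetic between the documents captured at the last recluster and the documents that match the criteria today. The sketch below illustrates this under stated assumptions: the function name, the dictionary keys, and the document IDs are hypothetical, and this is not how Everlaw stores or computes these counts.

```python
def changes_since_recluster(clustered_ids, matching_ids, deleted_ids=()):
    """Compare the last recluster snapshot with the documents matching today."""
    deleted = set(deleted_ids)
    # Deleted documents are removed from the visualization without a recluster.
    still_clustered = set(clustered_ids) - deleted
    matching_now = set(matching_ids) - deleted
    return {
        "still_in_visualization": still_clustered,
        # Match the criteria now, but only appear after a recluster.
        "new_matches_not_yet_clustered": matching_now - still_clustered,
        # No longer match, but stay in the visualization until a recluster.
        "clustered_but_no_longer_matching": still_clustered - matching_now,
    }

# Hypothetical document IDs.
result = changes_since_recluster(
    clustered_ids={1, 2, 3, 4},
    matching_ids={2, 3, 5, 6},
    deleted_ids={4},
)
```

In this example, documents 5 and 6 are new matches waiting for a recluster, document 1 no longer matches but remains visible, and document 4 has been removed because it was deleted from the project.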
Recluster
We recommend that you recluster sparingly. After you recluster, any previous searches that reference Cluster IDs will no longer return documents. If you would like to preserve your searches by cluster before reclustering, we recommend adding them to a binder.
To recluster your search:
- Select the gear button.
- Select Recluster.
- Select Begin reclustering.
Once reclustering begins, you cannot access any information on the Clustering page or in the Clustering context of the review window. It may take anywhere from ten minutes to many hours depending on the size of your project. When reclustering is complete, you will see clusters visualized again.
Clustering for common workflows
Clustering opens the door to a variety of workflows that span the discovery lifecycle, including early case assessment, organizing review priorities, assigning work, and performing quality control on reviewed documents. This section provides recommended workflows based on likely scenarios where clustering can be leveraged.
Data exploration in early case assessment
In this scenario, you have just received access to Everlaw. A large batch of documents has been uploaded and you want a high-level overview of the concepts in your corpus. You could try running a search, but you’re not quite sure what to search for yet.
- Open Clustering to view the top concepts by panning around the page and zooming in and out (use basic navigation). See which concepts are applicable to large and small sets of documents.
- To skim all of the top terms displayed across clusters, click the terms dropdown menu in the toolbar for a list of top terms and their frequency.
- Click the meaningful terms, which will select those clusters on the page and make it easy to see what documents apply to those concepts.
- Select various data visualizer properties to understand the distribution of documents across properties like Custodian or Doc Type.
Use prediction scores to identify meaningful concepts
In this scenario, you have set up predictive coding on your project. Your team has started review, but you want to act on the information more meaningfully by getting a sense of what concepts are most relevant by prediction score.
You can leverage clustering and predictive coding together to strengthen predictions and save time in review.
- Use the color coding overlay and select your prediction model from the dropdown list.
- Zoom and pan across the page to see which clusters of documents might be more likely to be rated hot.
- Use the document selector to select these documents. Open the selection in a results table and share the documents with your team to prioritize those documents for review.
Assign documents by conceptual similarity
In this scenario, you’d like to assign documents related to certain concepts. You’ve been given an initial set of terms to base searches off of, but you’d like a bit more information to organize your assignments.
You can leverage search, data visualizer, and concept clustering all at once to help you prioritize and organize your assignments.
Important
This is a good workflow for early assignments, but there are significant implications to assignments if you recluster.
- If you have an initial search to narrow down your set, create a search filter in the Clustering toolbar.
- With your filtered visualization, you can see which clusters have a greater concentration of documents.
- Since those documents are likely to be conceptually similar, you could consider assigning those clusters. Select the cluster and open a search in the results table.
- Assign the documents from the results table toolbar, by clicking “Batch” then “Assign.” We recommend creating a static assignment. Any dynamic assignments that have inclusion criteria with “Cluster ID” will no longer pull in documents after reclustering.
Perform quality control on reviewed documents
In this scenario, you’re an attorney who is responsible for assessing the quality of review decisions made by your team. You want to make sure that nothing slipped through the cracks, and that documents were coded correctly. You can utilize the coding overlay to identify potentially uncoded or incorrectly coded documents.
- Select Color documents by and select the coding category of interest from the dropdown list.
- Identify outlier documents by color. You may want to see whether documents that are conceptually similar are coded differently.
- You can also view uncoded documents, which are light grey. This might be helpful to see if these documents slipped through the cracks during review.
- You can select a subsection of a cluster, or documents across multiple clusters, by using “document selection mode” in the toolbar (or d shortcut on your keyboard).
- Click and drag to select the section of interest, and open a results table of those documents to see why the outliers are coded differently.