Table of contents
- What is Clustering?
- How does Clustering work?
- Accessing Clustering
- Using Clustering
- Clustering for common workflows
What is Clustering?
Clustering visualizes documents in your dataset by conceptual similarity. It generates insights about concepts in your documents without requiring any user input. Traditional search tools require you to have a baseline understanding of what’s in your documents and what to search for, but with Clustering, you can begin to learn about data without any prior background. This makes Clustering a valuable tool during early case assessment and other critical workflows throughout the discovery process.
On the Clustering page, each document is represented as a data point, and its color corresponds to the cluster it belongs to. Each cluster is also represented by a polygon, which approximates where the clustered documents sit on the page. The terms associated with each cluster can give you a sense of the concepts within its documents.
Reviewers can also view conceptually similar documents from the review window. They can access these documents by selecting the Clustering context. For more information, see this help article.
As review is conducted on your project, you can color-code your clustered documents by ratings, codes, and predictive coding scores to get a better sense of how your review is progressing. You can learn more about these workflows at the end of this article.
How does Clustering work?
Clustering uses an unsupervised machine learning algorithm that analyzes words and metadata (author, subject, title, to, from, cc, and bcc) across all of your documents to determine conceptual similarity. The algorithm uses a bag-of-words model weighted by TF-IDF (term frequency–inverse document frequency). Clustering also uses a density-based clustering algorithm, which lets you visualize document similarity by relative distance more easily than traditional k-means algorithms do.
Here’s a basic overview of the algorithm’s process. Because an algorithm can’t interpret the meaning of words the way a human can, it first has to break down the words and metadata from documents into numbers. The algorithm then removes “stop words” and punctuation from consideration. We filter these stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. This is the same stop word list used in Everlaw’s predictive coding model.
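As an illustrative sketch (not Everlaw’s actual implementation), tokenizing a document and filtering the stop words listed above might look like this:

```python
import re

# The stop word list from this article.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def tokenize(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

tokens = tokenize("The contract was signed by the apple farmers.")
# -> ["contract", "signed", "apple", "farmers"]
```

Only the content-bearing words survive to be counted in the next step.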
Next, the algorithm compares the frequency of each word in each document to the frequency of words across all documents in the database. By weighting terms by their relative frequency, the model starts to acquire clues about the important concepts in each document.
To better explain the idea of comparing relative weighted frequency, here’s a simple example. Say you have two documents that both contain the word “apple,” and a quick “ctrl+f” search tells you that each document contains the word 100 times. Compared to other words across the database, “apple” shows up quite a bit more often in these two documents. After reading both documents, you gather that one is a warranty for an Apple computer and the other is a contract with apple farmers. Like you in this example, Clustering can identify this conceptual difference: in addition to counting the frequency of the word “apple,” it compares the relative frequency of “apple” to the other words in each document and to all words across all documents in the database. It may notice that the Apple computer document has other highly weighted terms like “Cupertino” and “device,” while the apple farmer contract has terms like “agriculture” and “harvest.” These documents would be considered conceptually different, but might each belong to a cluster defined by the term “apple.”
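To make the weighting idea concrete, here is a minimal TF-IDF sketch. The two toy documents and the scoring formula are simplified illustrations, not Everlaw’s actual model:

```python
import math
from collections import Counter

# Toy stand-ins for the warranty and the farming contract.
docs = {
    "warranty": "apple apple device cupertino warranty computer apple",
    "contract": "apple apple harvest agriculture farmers contract apple",
}

def tf_idf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Term frequency in each document, weighted by inverse document frequency."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()  # number of documents each word appears in
    for words in tokenized.values():
        df.update(set(words))
    scores = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        scores[name] = {
            w: (count / len(words)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        }
    return scores

weights = tf_idf(docs)
# "apple" appears in every document, so its IDF (log 2/2 = 0) zeroes it out;
# terms unique to one document, like "cupertino" or "harvest", keep their weight.
```

Even though “apple” is the most frequent word in both documents, the distinguishing terms carry the weight, which is what lets the two documents land in different sub-clusters.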
Once the algorithm has weighted the relative term frequencies in each document, it can group similar documents into clusters and define topics for each cluster. The ten highest-weighted terms are associated with each cluster, and you can select and explore clusters by these terms in the Clustering visualization.
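As a sketch of this last step, picking a cluster’s top terms could amount to aggregating per-document weights across the cluster’s members. The weights below are hypothetical numbers for illustration:

```python
from collections import Counter

# Hypothetical TF-IDF weights for three documents in one cluster.
cluster_docs = [
    {"apple": 0.9, "device": 0.6, "warranty": 0.4},
    {"apple": 0.8, "cupertino": 0.7, "device": 0.5},
    {"apple": 0.7, "warranty": 0.6, "repair": 0.3},
]

def top_terms(docs: list[dict[str, float]], n: int = 3) -> list[str]:
    """Sum each term's weight across the cluster and return the n highest."""
    totals = Counter()
    for weights in docs:
        totals.update(weights)
    return [term for term, _ in totals.most_common(n)]

print(top_terms(cluster_docs))  # -> ['apple', 'device', 'warranty']
```

In the product, the ten highest-weighted terms label the cluster and the top three appear in the visualization.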
The algorithm will not visualize documents that are considered “outliers”: documents that are not meaningfully similar to the documents representing any cluster. Clustering also excludes documents that don’t have clusterable text from the visualization. For example, a document may have text, but that text might consist only of symbols or single-character letters.
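Here is a toy one-dimensional sketch of how a density-based approach separates clusters from outliers. Everlaw’s actual algorithm operates on high-dimensional term vectors; the `eps` gap threshold and the points below are assumptions for illustration only:

```python
def density_cluster(points: list[float], eps: float = 1.0):
    """Group points whose gap to the previous point is <= eps.
    Points that end up alone are treated as outliers."""
    clusters: list[list[float]] = []
    outliers: list[float] = []
    current: list[float] = []

    def flush():
        # A run of 2+ nearby points is a cluster; a lone point is an outlier.
        if len(current) > 1:
            clusters.append(list(current))
        elif current:
            outliers.extend(current)

    for p in sorted(points):
        if current and p - current[-1] > eps:
            flush()
            current.clear()
        current.append(p)
    flush()
    return clusters, outliers

clusters, outliers = density_cluster([1.0, 1.5, 2.0, 10.0, 20.0, 20.4])
# The two dense runs form clusters; the isolated point 10.0 is an outlier.
```

Unlike k-means, which forces every point into some cluster, a density-based method can leave sparse points unassigned, which is why some documents are shown in gray rather than in a cluster.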
Accessing Clustering
Any user with at least Clustering View permissions can view the Clustering page, but only users with Clustering Admin permissions can recluster. Both permissions allow users to view, select, filter, and run searches on clustered documents.
To access Clustering, click on Document Analytics in the navigation bar and select “Clustering.”
Using Clustering
Each cluster is a group of documents. Each document is represented as a dot, and dots of the same color belong to the same cluster. Each cluster is also rendered as a polygon, which denotes generally where the cluster’s documents are. Each cluster has ten terms that best represent it; the top three are shown in the Clustering visualization. By clicking a cluster, you can view all ten terms.
With Clustering, you can explore concepts in your data at a glance and at a high level. To do this, you can leverage basic navigation tools. You can access shortcuts by pressing “?” (shift + /) on your keyboard while on the Clustering page.
Click and drag to pan around the page. If you would like to explore clusters more closely, pinch-to-zoom on a laptop, click the zoom in/out buttons in the toolbar, or press “i” on your keyboard to zoom in. As you zoom in, you will find that more clusters appear. You can press “o” on your keyboard to zoom out.
You can break clusters down further and drill deeper into your visualization by changing the Clustering depth settings, found on the toolbar. Clustering has two depth settings: dynamic zoom (also known as auto depth) and manual depth. When auto depth is toggled on, clusters dynamically break up into smaller sub-clusters as you zoom in and merge into larger clusters as you zoom out. When auto depth is toggled off, you can change the depth level manually through the numbered slider in the toolbar. Possible manual depth levels range from 1 to 5 and are specific to your dataset; some visualizations may have 3 levels, while others have 5. In general, projects with more documents tend to have more depth levels than smaller projects.
Auto depth can be toggled on and off through the keyboard shortcut “a.” Manual depth can be increased to show more detail through the shortcut “m,” and decreased to show less detail with the shortcut “l” (lowercase L).
As larger clusters break into smaller sub-clusters, some documents included in the original, larger cluster will not be conceptually similar enough to belong to the more specific sub-clusters. These documents are considered outliers and are displayed in gray. Which documents are considered outliers depends on the depth level you set: in general, as you increase depth (and therefore cluster specificity), the number of outliers increases. There are no outlier documents at the lowest, default depth level; as such, all outlier documents are included in the total number of documents visible in Clustering, which can be found in the settings dialog (gear icon) on the toolbar.
You can choose to hide and show clusters or documents, including outliers, by using the ‘Show’ dropdown menu checkboxes in the toolbar. If you deselect outliers, this will remove outlier documents specific to that depth from your visualization and reincorporate them as you decrease depth. Some functionalities in the toolbar (color overlays, filters, or document select mode) require documents to be visible. Note that if your visualization was created before our June 3, 2022 release, you will need to recluster in order to utilize depth functionality.
Click “Fit View” (or press f on your keyboard) to zoom out to all clusters you’ve selected or to the entire visualization if no clusters are selected.
You can select any cluster by clicking the cluster itself. Clicking on a document within a cluster will select the cluster. Clicking any document in an already selected cluster will open a document preview. In the preview, you can open the document for review, or move to the next document in the cluster. Previewing documents will help you understand how they might be similar to each other and whether you want to consider them for further review or exploration.
Each cluster displays its three most representative terms, as determined by the clustering algorithm. To view the full list of top cluster terms across all clusters on the page, click the “Select clusters by term” dropdown in the toolbar. There, you will see each term and the number of clusters for which it is a top-three term. Click a term to add the clusters it represents to your selection; you can click multiple terms to include clusters with any of those terms.
In addition to selecting clusters via the dropdown selector, you can:
- drag-select arbitrary sets of documents
- click to select multiple clusters
These selection tools can be found in the toolbar and can also be accessed via keyboard shortcuts. You can also shift+click and drag to select multiple clusters.
Panel and data visualizer view
Upon any selection, a resizable side panel will appear. The side panel includes the count of documents in your selected cluster(s), a list of the most representative terms in your selection, and an embedded data visualizer view.
At the top, you can see the number of clusters selected and the total number of documents, both as unique documents and including duplicates. Click either number to go to a results table with that search. Each search is assigned a Cluster ID. Cluster IDs are represented in your search, and you can refine the search to build a narrower set of documents for review. Note that Cluster IDs become obsolete when you recluster, which may affect previous searches. For example, if you created a dynamic assignment using a Cluster ID as the inclusion criteria, no new documents will meet that criteria after reclustering. You can learn more about reclustering in this section.
You can also view the ten most representative cluster terms in your selection. If you have multiple clusters selected, the cluster terms list will weight the terms by the document count of all the selected clusters.
Data Visualizer is incorporated into Clustering, so you can compare properties like metadata or document characteristics to your selection. Click the dropdown to select which visualization you’d like to preview. To filter using a visualization, click “Open data visualizer,” and a new tab will take you to the Data Visualizer page.
Filter documents by search
To filter out documents from view, you can create a search filter. Click “Create Search” in the toolbar. Here, you can use the standard Everlaw query builder to narrow down your clustering view. When you click Apply, documents that do not meet your search criteria will be filtered out. Your search will be represented in the toolbar. Click the “x” icon to remove the search. Click “Edit Search” or the search itself in the toolbar to edit your previous search.
Color code documents by coding/predicted relevance
You can color code individual documents by their rating, code(s), or predicted relevance from the toolbar. This is particularly useful for QC’ing review, or for prioritizing certain sets of documents by their coding decisions or prediction scores. Click the “Color documents by” dropdown in the toolbar and select a rating, coding category, or prediction model to color code your documents.
Once you make your selection, a legend will appear with each code or rating in the selected category. You can click any code or rating in the legend to filter documents by just that code or rating. All uncoded documents are displayed in light gray. If you overlay a non-mutually exclusive category, documents with multiple codes applied are displayed in taupe. In the example below, Responsiveness is a mutually exclusive category and Production Designations is non-mutually exclusive.
You can also select a prediction model to color documents by predicted relevance.
Reclustering
As documents get uploaded to your project, they will not be automatically included in your Clustering visualization. Documents that are deleted from your project will automatically be removed from the Clustering visualization, but the cluster polygon will remain even if all documents in a cluster are deleted.
Incorporating these changes to your dataset into your Clustering visualization will require you to recluster your project’s dataset. You can recluster if you’re a Clustering Admin.
We recommend that you recluster sparingly. After you recluster, any previous searches that reference Cluster IDs will no longer return documents. If you would like to preserve your searches by cluster before reclustering, we recommend adding them to a binder.
To recluster, click the settings dialog in the toolbar. If you have uploaded or deleted a document since any Clustering Admin has last generated clusters, you will be able to click Recluster. Confirm that you want to recluster and the task will begin.
Once reclustering begins, you cannot access any information on the Clustering page or in the Clustering context of the review window. It may take anywhere from ten minutes to many hours depending on the size of your project.
When reclustering is complete, you will see clusters visualized again. You may notice that some documents were not clustered because they didn’t have enough clusterable text (e.g., the text consisted of many single characters or symbols). You can access searches for these sets of documents in the Settings dialog, indicated by a gear icon.
Clustering for common workflows
Clustering opens the door to a variety of workflows that span the discovery lifecycle, including ECA, organizing review priorities, assigning work, and performing quality control on reviewed documents. This section provides recommended workflows based on likely scenarios where clustering can be leveraged.
Data exploration in early case assessment (ECA)
In this scenario, you have just received access to Everlaw. A large batch of documents has been uploaded and you want a high-level overview of the concepts in your corpus. You could try running a search, but you’re not quite sure what to search for yet.
- Open Clustering to view the top concepts by panning around the page and zooming in and out (use basic navigation). See which concepts are applicable to large and small sets of documents.
- To skim all of the top terms displayed across clusters, click the terms dropdown menu in the toolbar for a list of top terms and their frequency.
- Click the meaningful terms, which will select those clusters on the page and make it easy to see what documents apply to those concepts.
- Select various data visualizer properties to understand the distribution of documents across properties like Custodian or Doc Type.
Using prediction scores to identify meaningful concepts
In this scenario, you have set up predictive coding on your project. Your team has started review, but you want to act on the information more meaningfully by getting a sense of what concepts are most relevant by prediction score.
You can leverage clustering and predictive coding together to strengthen predictions and save time in review.
- Use the color coding overlay and select your prediction model from the dropdown list.
- Zoom and pan across the page to see which clusters of documents might be more likely to be rated hot.
- Use the document selector to select these documents. Open the selection in a results table and share the documents with your team to prioritize those documents for review.
Assign documents by conceptual similarity
In this scenario, you’d like to assign documents related to certain concepts. You’ve been given an initial set of terms to base searches off of, but you’d like a bit more information to organize your assignments.
You can leverage search, data visualizer, and concept clustering all at once to help you prioritize and organize your assignments. Please keep in mind that this is a good workflow for early assignments, but there are significant implications to assignments if you recluster.
- If you have an initial search to narrow down your set, create a search filter in the Clustering toolbar.
- With your filtered visualization, you can see which clusters have a greater concentration of documents.
- Since those documents are likely to be conceptually similar, you could consider assigning those clusters. Simply click the cluster and open a search in the results table.
- Assign the documents from the results table toolbar, by clicking “Batch” then “Assign.” We recommend creating a static assignment. Any dynamic assignments that have inclusion criteria with “Cluster ID” will no longer pull in documents after reclustering.
Perform quality control on reviewed documents
In this scenario, you’re an attorney who is responsible for assessing the quality of review decisions made by your team. You want to make sure that nothing slipped through the cracks, and that documents were coded correctly. You can utilize the coding overlay to identify potentially uncoded or incorrectly coded documents.
- Select the coding category of interest from the dropdown list.
- Identify documents whose color differs from that of nearby, conceptually similar documents. You may want to see whether documents that are conceptually similar are coded differently.
- You can also view uncoded documents, which are displayed in light gray. This can help you see whether any documents slipped through the cracks during review.
- You can select a subsection of a cluster, or documents across multiple clusters, by using “document selection mode” in the toolbar (or d shortcut on your keyboard).
- Click and drag to select the section of interest, and open a results table of those documents to see why the outliers are coded differently.