Advanced Content Searches (Wildcard, Proximity, Fuzzy, Regular Expression)

 

 

The Contents search term supports wildcard, fuzzy, proximity, and regular expression searches. This article covers the syntax for these advanced content searches on Everlaw.

The Contents search term also auto-detects dtSearch® syntax. Learn more about translating from dtSearch and autodetection of dtSearch.

Wildcard Searches

Everlaw supports single- and multiple-character wildcard searches for content in documents or metadata (e.g., file path or custodian). Use a question mark (“?”) to match a single character and an asterisk (“*”) to match one or more characters. Below are some examples of how to construct wildcard searches.

To search for words starting with certain characters, append “?” or “*” at the end of the word, e.g., 

  • Rela?  will find words such as relax and relay
  • Rela*  will find words such as relax, relay, relaxing, relate, and related 

To search for words starting and ending with specified characters, use “?” or “*” in the middle of the word, e.g., 

  • re?t  will find words such as rent and rest
  • re*t  will find words such as rent, rest, receipt and relevant 
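
The wildcard examples above can be tried out on a small word list. The following is a minimal sketch using Python's standard fnmatch module, whose “?” and “*” follow similar rules. It is not Everlaw's search engine, and the word list is made up for illustration. Note that fnmatch's “*” also matches zero characters, which this article does not confirm for Everlaw.

```python
from fnmatch import fnmatchcase

# Hypothetical word list standing in for indexed document terms.
WORDS = ["relax", "relay", "relaxing", "relate", "related",
         "rent", "rest", "receipt", "relevant"]


def wildcard_hits(pattern, words=WORDS):
    """Return the words matched by a ?/* wildcard pattern, ignoring case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


print(wildcard_hits("Rela?"))  # one trailing character
print(wildcard_hits("Rela*"))  # any run of trailing characters
print(wildcard_hits("re?t"))   # exactly one character in the middle
print(wildcard_hits("re*t"))   # any run of characters in the middle
```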

Fuzzy Searches

Everlaw supports fuzzy searches, which find similar words. To do a fuzzy search, use the tilde symbol "~" at the end of a single word term. Fuzzy searches are a good way to find documents with possible misspellings of words or names.

For example, to search for a term similar in spelling to "rise" use the fuzzy search: rise~. This search will find terms like "risk" and "rises".

An additional (optional) parameter can be used to specify the required similarity threshold. The value of the parameter is between 0 and 1, exclusive. A value closer to 1 requires higher similarity: rise~0.8

If no value is specified, the default is 0.5.
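
To build intuition for how a similarity threshold narrows fuzzy matches, here is a minimal Python sketch. The scoring formula (one minus the edit distance divided by the longer word's length) is an assumption chosen for illustration; this article does not state how Everlaw computes similarity, so actual results may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed score in [0, 1]: 1 minus the distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def fuzzy_match(term, candidate, threshold=0.5):
    """True when the candidate is at least `threshold` similar to the term."""
    return similarity(term.lower(), candidate.lower()) >= threshold


for word in ["risk", "rises", "banana"]:
    print(word, similarity("rise", word), fuzzy_match("rise", word))
```

Under this assumed formula, "risk" and "rises" both pass the default threshold of 0.5, while raising the threshold makes the search stricter.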

Proximity Searches

Everlaw supports finding words that are within a specific distance of each other. To perform a proximity search, first enclose the words you want to search for in quotation marks. Then add the tilde symbol "~" after the closing quotation mark. Finally, specify a word distance after the tilde.

For example, let's say we want to search for the word cumulative and the word assessment within 3 words of each other (i.e., with no more than 3 words between them) in a document. The search would be represented like this:

"cumulative assessment"~3

We have put the two words (cumulative and assessment) within quotation marks, added a tilde (~), and then specified the word distance (3). This will search for all instances where cumulative and assessment have, at most, 3 words between them. The words can appear in either order in the document, meaning that assessment can come before cumulative even though it is listed after it in the search.
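
The distance rule described above (at most N words between the two terms, in either order) can be sketched in Python as follows. This is an illustration only: the tokenizer is a simplified assumption and does not reproduce Everlaw's indexing, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_distance(text, word_a, word_b, max_between):
    """True if word_a and word_b occur, in either order, with at most
    `max_between` other words between them."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) - 1 <= max_between
               for i in pos_a for j in pos_b if i != j)


# Three words ("risk and impact") sit between the two terms.
print(within_distance("A cumulative risk and impact assessment was filed",
                      "cumulative", "assessment", 3))
# Order does not matter.
print(within_distance("The assessment of the cumulative effect",
                      "cumulative", "assessment", 3))
```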

You can also do proximity searches with phrases. Please note that, in addition to being contained in quotation marks, phrases in proximity searches must be surrounded by parentheses. “cookie (“chocolate chip”)”~20 is a correctly formatted search, while “cookie “chocolate chip””~20 is not. Some additional examples are below:

  • “jelly (“peanut butter”)”~30
    • This search retrieves results for jelly within 30 words of “peanut butter.”
  • “(sandwich* cook*) (jelly “peanut butter”)”~30
    • This search can be read as "sandwich* OR cook* within 30 words of jelly OR "peanut butter."" It will retrieve results for any or all of the following:
      • sandwich* within 30 words of jelly (“sandwich* jelly”~30)
      • cook* within 30 words of jelly (“cook* jelly”~30)
      • sandwich* within 30 words of “peanut butter” (“sandwich* (“peanut butter”)”~30)
      • cook* within 30 words of “peanut butter” (“cook* (“peanut butter”)”~30)

In Everlaw's query builder, this search would look like this:

proximity_search_2.png

This will yield the same results as a search that looks like this:

proximity_search_1.png

  • “sandwich* cook* (jelly “peanut butter”)”~30
    • This search requires all three clauses (sandwich*, cook*, and (jelly OR “peanut butter”)) to appear together within 30 extra words at most. The search retrieves documents for which all of the following are true:
      • sandwich* is within 30 words of jelly OR “peanut butter”
      • cook* is within 30 words of jelly OR “peanut butter”
      • sandwich* is within 30 words of cook*
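
The way the grouped search “(sandwich* cook*) (jelly “peanut butter”)”~30 covers four pairwise searches can be sketched as a Cartesian product of the two groups. The Python below only builds the search strings (with straight quotation marks); the helper names are hypothetical and nothing here calls Everlaw.

```python
from itertools import product

# Each list is one parenthesized group; its members are alternatives (OR).
left_group = ["sandwich*", "cook*"]
right_group = ["jelly", '"peanut butter"']


def as_clause(term):
    """Wrap a quoted phrase in parentheses, as the proximity syntax requires."""
    return f"({term})" if term.startswith('"') else term


def expand(left, right, distance):
    """List the pairwise proximity searches covered by two OR groups."""
    return [f'"{as_clause(a)} {as_clause(b)}"~{distance}'
            for a, b in product(left, right)]


for search in expand(left_group, right_group, 30):
    print(search)
```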

You can also perform nested proximity searches. For example, the search "sandwich ("ham cheese"~10)"~20 will look for the word "sandwich" within 20 words of every instance where "ham" and "cheese" occur within 10 words of each other. 

Finally, you can perform a negative proximity search by adding a NOT or exclamation mark ("!") before a proximity search. Such a search will exclude documents with proximity hits. This is useful if you are searching for a common word that has a highly specific contextual meaning, but may also appear in other documents under its more common meaning. For example, you may only be interested in documents where "Sun" is used to denote a corporate entity. However "sun" may also be a common shorthand used to denote Sunday. To help exclude documents that may be using "sun" as a shorthand for Sunday, we can construct a negative proximity search that will exclude any and all documents where the word "sun" appears within a certain distance of date words, like the days of the week or the months of the year. A search like this in Everlaw could look like the following:

negative_prox_example.png
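
The logic of such a negative proximity search can be sketched in Python: keep a document only if it mentions "sun" and never within a few words of a date word. The date-word list, the distance of 3, and the function names are all assumptions made for illustration; they are not taken from the search shown above.

```python
import re

# Hypothetical date words; extend the set to fit your own data.
DATE_WORDS = {"mon", "tue", "wed", "thu", "fri", "sat",
              "january", "february", "march"}


def tokens_of(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near_date_word(tokens, word, max_between):
    """True if `word` has at most `max_between` words between it and a date word."""
    word_pos = [i for i, t in enumerate(tokens) if t == word]
    date_pos = [i for i, t in enumerate(tokens) if t in DATE_WORDS]
    return any(abs(i - j) - 1 <= max_between
               for i in word_pos for j in date_pos)


def keep_document(text, word="sun", max_between=3):
    """Keep documents that contain the word but have no proximity hit."""
    tokens = tokens_of(text)
    return word in tokens and not near_date_word(tokens, word, max_between)


print(keep_document("Sun Microsystems released a new server"))
print(keep_document("The meeting moved from Sat to Sun"))
```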

Regular Expressions

Regular expressions (or regex) allow you to search for text strings that match certain patterns of characters. Common use cases for regular expression searches include: finding common patterns of personally identifying information, such as Social Security or credit card numbers, without having to know and list out all known permutations that could appear in your dataset; finding words regardless of spelling variations or common misspellings; and finding all emails from a certain domain. 

Before diving into some examples, we will first break down the key components of a basic regular expression search:

Search boundary: Because regular expression is a special type of content search, we must use the forward slash character, “/”, to indicate the boundaries of our search.

  • For example, “/Everlaw/” is a regular expression search for the word “Everlaw”

Search operators: Search operators allow you to add additional precision, constraints, or logic to your search. Some notable operators are described in the table below. 

Operator            Description                                                 Example           Matches
|                   OR                                                          /Prices|Today/    Prices OR Today
.                   Single character wildcard search                            /c.d/             cad, cbd, ccd, cdd, …
*                   The character preceding the operator will be matched        /buz*/            bu, buz, buzz, buzzz, etc.
                    zero or more times
?                   The character preceding the operator is optional and will   /colou?r/         colour, color
                    be matched, at most, once
+                   The character preceding the operator must be matched at     /ma+d/            mad, maad, maaad, etc.
                    least once
{n}, {n,}, {n,m}    The character preceding the operator must be matched        /3{2,5}/          33, 333, 3333, 33333
                    either (1) exactly n-number of times, (2) at least
                    n-number of times, or (3) at least n-number of times,
                    but no more than m-number of times
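
The operators can be tried out with any regex engine that shares this basic syntax. Below is a minimal sketch using Python's standard re module, which is a stand-in and not Everlaw's engine; re.fullmatch is used on the assumption that a pattern must match a whole word. The test words are made up for illustration.

```python
import re

# (pattern, words expected to match, words expected not to match)
EXAMPLES = [
    (r"Prices|Today", ["Prices", "Today"], ["Price"]),
    (r"c.d", ["cad", "cbd", "ccd"], ["cd", "caad"]),
    (r"buz*", ["bu", "buz", "buzzz"], ["b", "bus"]),
    (r"colou?r", ["colour", "color"], ["colouur"]),
    (r"ma+d", ["mad", "maaad"], ["md"]),
    (r"3{2,5}", ["33", "33333"], ["3", "333333"]),
]


def whole_word_match(pattern, word):
    """True if the pattern matches the entire word (assumed matching scope)."""
    return re.fullmatch(pattern, word) is not None


for pattern, should_match, should_not in EXAMPLES:
    hits = [w for w in should_match + should_not if whole_word_match(pattern, w)]
    print(f"/{pattern}/ -> {hits}")
```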

 

Character sets: Character sets allow you to specify searches where a single character can match any of the characters in the set. To create a character set, enclose the characters you want to match in brackets (“[]”).

  • For example, depending on whether you are using British or American spelling, verbs can end either in “ize” or “ise”. An example of this is “organize” and “organise”. With regular expression, you can easily search for both variations with the following search: “/organi[zs]e/”.
  • Character sets can also be negated by adding the caret character, “^”, after the opening bracket. This indicates that the character can match any character not specified in the set. For example, /a[^abcde]/ will search for any two-letter word where the first character is “a” and the second character is anything that is not “a”, “b”, “c”, “d”, or “e”. 
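
Both kinds of character set can be checked with a short Python sketch. Python's re module stands in for Everlaw's engine here, re.fullmatch reflects the assumption that the pattern must cover the whole word, and the candidate words are invented.

```python
import re

# "organi[zs]e" accepts either spelling: the seventh character may be "z" or "s".
spelling = re.compile(r"organi[zs]e")
print([w for w in ["organize", "organise", "organice"] if spelling.fullmatch(w)])

# "a[^abcde]" accepts "a" followed by any one character outside the set.
negated = re.compile(r"a[^abcde]")
print([w for w in ["at", "as", "ad", "ab", "an"] if negated.fullmatch(w)])
```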

Character ranges: Instead of listing out all values you want to match in a character set, you can also specify a range by using hyphens (“-”). For example:

  • [a-z] means a given character can take on any letter value between “a” and “z”. Regex is case sensitive, so [a-z] is different from [A-Z]. However, note that searches in Everlaw are case insensitive.
  • [1-4] means that a given digit can take on any value between 1 and 4, inclusive (i.e., any of 1, 2, 3, or 4). Note that regex does not have any formal understanding of numbers. Instead, it treats each digit separately as a character that can take on any value between 0 and 9, inclusive. This introduces some non-obvious behaviors when it comes to searching numbers. For example, if you want to find numbers in the range 10 to 59, you may expect the following regex search to be successful: /[10-59]/. But while this is a valid search, it will actually only match the single digits 0, 1, 2, 3, 4, 5, and 9. If you instead want to find numbers in the range 10 to 59, you must use the following regex search: /[1-5][0-9]/. This will match any two-digit number where the first digit is between 1 and 5, and the second digit is between 0 and 9. Thus, the lower bound of this range is 10 and the upper bound is 59.
  • Character ranges can also be concatenated. For example, /[a-zA-Z0-9]X/ will search for any two-character word where the first character is any digit or any lowercase or uppercase letter, and the second character is X. 
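
The difference between /[10-59]/ and /[1-5][0-9]/ is easy to verify. The following minimal sketch uses Python's re module as a stand-in for Everlaw's engine, with re.fullmatch assumed to mirror whole-word matching.

```python
import re

# [10-59] is a set of single characters: "1", the range "0"-"5", and "9".
single_chars = [d for d in "0123456789" if re.fullmatch(r"[10-59]", d)]
print(single_chars)  # ['0', '1', '2', '3', '4', '5', '9']

# [1-5][0-9] is two positions: a tens digit 1-5, then a units digit 0-9.
two_digit = [n for n in range(100) if re.fullmatch(r"[1-5][0-9]", str(n))]
print(two_digit[0], two_digit[-1], len(two_digit))  # 10 59 50
```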

Escaping: Regex has special characters that carry special meaning when used in a search. These special characters include the operators described in the table above. But what if you want to use a special character as the character itself instead of its special search meaning? For example, what if you want to search for a dot/period or a question mark instead of performing a wildcard or matching search? In such cases, you have to “escape” the special character by enclosing it in brackets ("[]"). This tells the system you wish to treat the character as a literal character rather than as a search operator. So, for example, /d[.]d/ will return only the exact string “d.d”, whereas /d.d/ returns any three-character string that begins and ends with d and has any character in between. 
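
The effect of bracket escaping can be seen in a short Python sketch. Python's re module is used as a stand-in for Everlaw's engine, and the candidate strings are made up.

```python
import re

candidates = ["d.d", "dad", "did", "d-d"]

# Inside brackets the dot is a literal period.
literal_dot = [s for s in candidates if re.fullmatch(r"d[.]d", s)]
# Outside brackets the dot matches any single character.
any_char = [s for s in candidates if re.fullmatch(r"d.d", s)]

print(literal_dot)  # ['d.d']
print(any_char)     # ['d.d', 'dad', 'did', 'd-d']
```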

 

Using these operators and components in tandem allows you to create complex regex searches. Here are some useful examples: 

Number Pattern: Assume the numbers you are searching for follow this pattern: XXX.XXX.XXXX. Here is a regex search that will match this number pattern: /[0-9]{3}[.][0-9]{3}[.][0-9]{4}/

Regex_Examples_-_phone_numbers.png

While it may look complicated at first glance, it is much easier to understand once decomposed into its component parts. You can adapt this to fit any unique number pattern, whether it be phone numbers, social id numbers, credit cards, etc. Keep in mind that this search could be under-inclusive in the sense that it only captures patterns where the three chunks of numbers are separated by precisely one dot. A more inclusive search could be constructed as follows: /[0-9]{3}/ /[0-9]{3}/ /[0-9]{4}/. However, this may be over-inclusive, as it would match any of the three chunks independently of the others. Please note that many non-letter characters (such as hyphens) are not indexed by Everlaw, and therefore cannot be searched as such. 
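
Before running the number pattern on a dataset, it can help to test it against a few sample strings. Here is a minimal sketch with Python's re module standing in for Everlaw's engine; the sample numbers are fictional.

```python
import re

# Three digits, a literal dot, three digits, a literal dot, four digits.
NUMBER_PATTERN = re.compile(r"[0-9]{3}[.][0-9]{3}[.][0-9]{4}")

samples = ["415.555.0134", "415-555-0134", "4155550134", "41.555.0134"]
for sample in samples:
    print(sample, bool(NUMBER_PATTERN.fullmatch(sample)))
```

Only the first sample matches, which illustrates the under-inclusiveness noted above: the same digits with other separators, or with no separators, are missed.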

Email addresses:

Regex_Examples_-_email__1_.png

The logic of this search (/[a-z0-9._!%+]+@[a-z0-9]+[.][a-z0-9]{2,4}/) is that there must be at least one character before the [at] symbol for the username, at least one character between the [at] and [dot] symbols for the domain name, and at least two (but no more than four) characters for the top-level domain. You can modify this template regex search to fit your particular needs. For example, if you believe that the top-level domain can have more than four characters, you can relax this constraint. You may also already know the domain and top-level domain, or there may be other special symbols that could appear in the username. 
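
The constraints of the email pattern can be explored with a few sample addresses. This is a sketch using Python's re module rather than Everlaw's engine; re.IGNORECASE is added to mimic Everlaw's case-insensitive searching, and the addresses and the function name are invented.

```python
import re

EMAIL_PATTERN = re.compile(
    r"[a-z0-9._!%+]+@[a-z0-9]+[.][a-z0-9]{2,4}", re.IGNORECASE
)


def looks_like_email(text):
    """True if the whole string fits the template email pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None


for address in ["jane.doe@example.com", "Jane@Example.ORG",
                "jane@example.museum", "jane@mail.example.com",
                "@example.com"]:
    print(address, looks_like_email(address))
```

The third and fourth addresses are rejected because of the four-character limit on the top-level domain and the single-label domain, which are the kinds of constraints you may want to relax.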

Content searches:  

Regular expression can be used to create useful content searches. For example:

  • /[bcg]oat/  will find boat, coat, or goat. Either b, c, or g could occupy the first spot in the text string.
  • /gr[ae]y/ will find grey or gray. Either a or e could occupy the third spot in the text string. 
  • /.?oa.?/ will find words that are between 2 and 4 characters in length and contain “oa”. For example, this search will find words like loan, load, goad, oar, and boa.
  • /.*oa.*/ will find words of any length (two or more characters) that contain “oa”. For example, this search will find words like loan, load, goad, oar, boa, coats, and float.  
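
The difference between the last two searches can be checked on the same word list. This sketch uses Python's re module as a stand-in for Everlaw's engine, with re.fullmatch assumed to mirror whole-word matching.

```python
import re

WORDS = ["loan", "load", "goad", "oar", "boa", "coats", "float"]

# At most one character on each side of "oa".
short = [w for w in WORDS if re.fullmatch(r".?oa.?", w)]
# Any number of characters on each side of "oa".
any_length = [w for w in WORDS if re.fullmatch(r".*oa.*", w)]

print(short)       # ['loan', 'load', 'goad', 'oar', 'boa']
print(any_length)  # all seven words
```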

Prebuilt Searches

To help you find common content patterns for personally identifying information, Everlaw has prebuilt some searches that can be added as hit highlights by project admins. To add these prebuilt hit highlight searches, navigate to the "General > Persistent Highlights" section of Project Settings. 

Screen_Shot_2022-04-15_at_11.49.38_AM.png

 

Limiting your search based on context


You can also search for a certain word or phrase, while excluding specific contexts in which the word or phrase may occur. To do this, use NOT within your contents search. For example, if you want to do a wildcard search for “proto*” but don’t want your search to return any variant of “protocol”, you can build a search that looks like this:

proto_NOT.png

You can do the same for phrase searches. For example, if you want emails with the phrase “secret meeting” in them, but do not want emails returned where “secret meeting” only appears within the phrase “I have no interest in a secret meeting”, you can build a search that looks like this:

secret_meeting_NOT_no_secret_meeting.png

Limiting your search based on context also works for regular expression searches. For example, if you are looking for documents that contain the year 2017 in their contents, but you do not want your results to yield instances of “2017” that are part of phone numbers, you could construct a search that looks like this:

2017_NOT_phone_number.png

Finally, you may want to search for a certain term but ignore places where it occurs in a predictable, common phrase, such as in the footer of many documents. To do this, add a hyphen (-) to your content search immediately before the word or phrase you want to ignore.

For example, if you want to search for the word "privilege" but ignore instances where "privilege" is included in the phrase "attorney client privilege," build your search like this: 

Screen_Shot_2020-05-08_at_1.50.49_PM.png 

Special note on characters indexed by Everlaw

Everlaw's indexing methodology will affect the type of content you can search for. Here are some key points to be aware of as you construct content searches:

  • Most non-letter characters are not indexed, and therefore are treated as spaces by Everlaw. The exceptions are: $#@&%€¢£¥₩₹฿₫₴₪
  • Some characters are treated as spaces when prepending or appending words, but are considered part of the word (and thus indexed) when they appear within a word. Examples of such characters include colons and dots. 
  • The apostrophe (') is indexed. This means, e.g., "John's" can be distinguished from "Johns." However, when an apostrophe occurs in a word in the middle of a phrase that is three words or longer (e.g., "St. John's Road"), the search will return no results. If you need to run such a search, replace the apostrophe with a space (e.g., "St. John s Road").  
