Advanced Content Searches (Wildcard, Proximity, Fuzzy, Regular Expression)

The Contents search term supports regular expression as well as wildcard, fuzzy, and proximity searches.

This article covers search syntax for advanced content searches on Everlaw.

Note

The Contents search term also auto-detects dtSearch® syntax. To learn more about translating from dtSearch and how dtSearch syntax is autodetected, visit dtSearch® Translation.

Wildcard Searches

Everlaw supports single and multiple character wildcard searches for content in documents or metadata (e.g. file path, custodian, etc.):

  • a question mark (“?”) for exactly one character
  • an asterisk (“*”) for zero, one, or multiple characters

Below are some examples for how you can construct wildcard searches. 

To search for words starting with certain characters, append “?” or “*” at the end of the word. For example,

  • Rela?  finds words such as relax and relay
  • Rela*  finds words such as relax, relay, relaxing, relate, and related 

To search for words starting and ending with specified characters, use “?” or “*” in the middle of the word. For example,

  • re?t  finds words such as rent and rest
  • re*t  finds words such as rent, rest, receipt and relevant 
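The matching rules above behave much like shell-style glob patterns. As a rough illustration only (this is not Everlaw's implementation), the following Python sketch uses the standard `fnmatch` module to apply the same patterns to a small word list. The word list and the `wildcard_hits` helper are hypothetical, and case is ignored on the assumption that searches are case insensitive.

```python
from fnmatch import fnmatchcase

# Hypothetical word list standing in for indexed terms.
WORDS = ["relax", "relay", "relaxing", "relate", "related",
         "rent", "rest", "receipt", "relevant"]


def wildcard_hits(pattern, words=WORDS):
    """Return the words matching a ?/* wildcard pattern, ignoring case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


print(wildcard_hits("Rela?"))  # ['relax', 'relay']
print(wildcard_hits("Rela*"))  # ['relax', 'relay', 'relaxing', 'relate', 'related']
print(wildcard_hits("re?t"))   # ['rent', 'rest']
print(wildcard_hits("re*t"))   # ['rent', 'rest', 'receipt', 'relevant']
```

Note that “?” never matches an empty position, which is why Rela? skips the longer words that Rela* picks up.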

Note

You cannot use wildcard searches for terms that include unindexed characters, which include most non-alphanumeric characters. The exception is for email addresses. For example, the search “accounts-receiv*” would not return text referencing “accounts-receivable” as a hit, but it would return a hit for the email “accounts-receivable@company.com.”

Fuzzy Searches

Everlaw supports fuzzy searches, which find words with similar spellings. To do a fuzzy search, use the tilde symbol "~" at the end of a single word term. Fuzzy searches are a good way to find documents with possible misspellings of words or names.

For example, to search for a term similar in spelling to "rise" use the fuzzy search: rise~. This search finds terms like "risk" and "rises".

Fuzzy searching can also be applied to strings within a phrase match. For example, “cell~ phone~”, returns phrases with misspellings of either word in sequence.

An optional parameter specifies the required similarity threshold (e.g., rise~1). The value of the parameter is either 1 or 2; values greater than 2 are set to 2. The parameter signifies the edit distance that is allowed in the fuzzy search. The edit distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. If no edit distance parameter is entered, the search defaults to 2.
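To make the edit distance concrete, here is a minimal Python sketch that computes it exactly as defined above (insertions, deletions, and substitutions) and applies the default and the cap of 2. The function names are hypothetical, and the code only illustrates the definition; it does not claim to reproduce how Everlaw evaluates fuzzy terms internally.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def allowed_edits(param=None):
    """No parameter defaults to 2; values greater than 2 are set to 2."""
    return 2 if param is None else min(param, 2)


def fuzzy_match(term, candidate, param=None):
    return edit_distance(term.lower(), candidate.lower()) <= allowed_edits(param)


print(edit_distance("rise", "risk"))     # 1 (substitute e -> k)
print(edit_distance("rise", "rises"))    # 1 (insert s)
print(fuzzy_match("rise", "raises", 1))  # False: two insertions are needed
print(fuzzy_match("rise", "raises"))     # True: the default allows 2 edits
```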

Proximity Searches

Everlaw supports finding words that are within a specific distance of each other. To perform a proximity search, enclose the words you want to search for in quotation marks, add the tilde symbol "~" after the closing quotation mark, and then specify a word distance.

For example, let's say we want to search for the word cumulative and the word assessment within 3 words of each other (i.e., with no more than 3 words between them) in a document. The search would be represented like this:

"cumulative assessment"~3

We have put the two words (cumulative and assessment) within quotation marks, added a tilde (~), then specified the word distance (3). This searches for all instances where cumulative and assessment have, at most, 3 words between them. The words can appear in either order in the document, meaning that assessment can come before cumulative even though it is listed second in the search.
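The distance rule for "cumulative assessment"~3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Everlaw's implementation: the `within_proximity` helper is hypothetical, and splitting text on runs of word characters is a simplification of how documents are actually indexed.

```python
import re


def within_proximity(text, first, second, max_between):
    """True if the two words occur with at most max_between words between them, in either order."""
    words = re.findall(r"\w+", text.lower())  # assumption: simple word splitting
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(i != j and abs(i - j) - 1 <= max_between
               for i in first_pos for j in second_pos)


# Exactly three words ("risk and exposure") sit between the two terms.
print(within_proximity("The cumulative risk and exposure assessment was filed.",
                       "cumulative", "assessment", 3))  # True
# Reverse order still counts.
print(within_proximity("An assessment of the cumulative impact",
                       "cumulative", "assessment", 3))  # True
# Six words in between is too far.
print(within_proximity("Cumulative totals were reported long before any assessment",
                       "cumulative", "assessment", 3))  # False
```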

You can also do proximity searches with phrases. In addition to being contained in quotation marks, phrases in proximity searches must be surrounded by parentheses. “cookie (“chocolate chip”)”~20 is a correctly formatted search, while “cookie “chocolate chip””~20 is not. Some additional examples are below:

  • “jelly (“peanut butter”)”~30
    • This search retrieves results for jelly within 30 words of “peanut butter.”
  • “(sandwich* cook*) (jelly “peanut butter”)”~30

    • This search can be read as "sandwich* OR cook* within 30 words of jelly OR "peanut butter."" It retrieves results for any or all of the following:

      • sandwich* within 30 words of jelly (“sandwich* jelly”~30)
      • cook* within 30 words of jelly (“cook* jelly”~30)
      • sandwich* within 30 words of “peanut butter” (“sandwich* (“peanut butter”)”~30)
      • cook* within 30 words of “peanut butter” (“cook* (“peanut butter”)”~30)

In Everlaw's query builder, this search looks like this:

proximity_search_2.png

This yields the same results as a search that looks like this:

proximity_search_1.png

  • “sandwich* cook* (jelly “peanut butter”)”~30
    • This search requires all three clauses (sandwich*, cook*, and (jelly OR “peanut butter”)) to appear together within 30 extra words at most. The search retrieves documents for which all of the following are true:
      • sandwich* is within 30 words of jelly OR “peanut butter”
      • cook* is within 30 words of jelly OR “peanut butter”
      • sandwich* is within 30 words of cook*
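The way the grouped search “(sandwich* cook*) (jelly “peanut butter”)”~30 expands into pairwise searches can be sketched with `itertools.product`. The `expand_groups` helper below is hypothetical and only generates the search strings (using straight quotation marks); it does not run them.

```python
from itertools import product


def expand_groups(left, right, distance):
    """List each pairwise proximity search implied by two OR-groups."""
    def fmt(term):
        # Phrases (terms containing a space) are quoted and wrapped in parentheses.
        return f'("{term}")' if " " in term else term

    return [f'"{fmt(a)} {fmt(b)}"~{distance}' for a, b in product(left, right)]


for search in expand_groups(["sandwich*", "cook*"], ["jelly", "peanut butter"], 30):
    print(search)
# "sandwich* jelly"~30
# "sandwich* ("peanut butter")"~30
# "cook* jelly"~30
# "cook* ("peanut butter")"~30
```

A document is returned if it satisfies any one of the generated searches.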

You can also perform nested proximity searches. For example, the search "sandwich ("ham cheese"~10)"~20 looks for the word "sandwich" within 20 words of every instance where "ham" and "cheese" occur within 10 words of each other.

Note

Proximity searches cannot include the logical operators AND or OR. For phrases in quotation marks, the operators are treated as part of the phrase being searched. For example, the search “jelly AND jam (“peanut butter”)”~15 searches for documents that include the exact phrase "jelly and jam," rather than documents that include both "jelly" and "jam."

You can use a space between two words that are not in quotation marks to function as an OR operator. For example, "strawberry (jelly jam)"~5 searches for documents that contain either "strawberry" within 5 words of "jelly", or "strawberry" within 5 words of "jam".

Note

Invalid characters included in a proximity search cause the proximity functionality to be excluded from the search. For example, the search "apple orange"~s returns the same number of results as "apple orange" because the ~s is ignored when creating the search.

Proximity searches with word order specification

You can specify word order in proximity searches by using a double tilde (~~). For example, if you want to search for the word Sacramento and the word California within 3 words of each other, and Sacramento should appear before California, you can set up the search like this:

“Sacramento California”~~3
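The only difference from an unordered proximity search is that the first word must come before the second. A minimal Python sketch of that rule follows; the `ordered_within` helper is hypothetical, and the simple word splitting is an assumption rather than a description of Everlaw's index.

```python
import re


def ordered_within(text, first, second, max_between):
    """True if `first` appears before `second` with at most max_between words between them."""
    words = re.findall(r"\w+", text.lower())  # assumption: simple word splitting
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    # j - i - 1 is the number of words between the pair; it is negative when the order is reversed.
    return any(0 <= j - i - 1 <= max_between for i in first_pos for j in second_pos)


print(ordered_within("Sacramento, the capital of California", "Sacramento", "California", 3))     # True
print(ordered_within("California has its capital in Sacramento", "Sacramento", "California", 3))  # False
```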

 

Negative Proximity Searches

You can also perform a negative proximity search by adding a NOT or exclamation mark ("!") before a proximity search. Such a search excludes documents with proximity hits. This is useful if you are searching for a common word that has a highly specific contextual meaning, but may also appear in other documents under its more common meaning. For example, you may only be interested in documents where "Sun" is used to denote a corporate entity. However "sun" may also be a common shorthand used to denote Sunday. To help exclude documents that may be using "sun" as a shorthand for Sunday, we can construct a negative proximity search that excludes any and all documents where the word "sun" appears within a certain distance of date words, like the days of the week or the months of the year. A search like this in Everlaw could look like the following:

negative_prox_example.png

Regular Expressions

Regular expressions (or regex) allow you to search for text strings that match certain patterns of characters. Common use cases for regular expression searches include finding:

  • Common patterns of personally identifying information, such as Social Security or credit card numbers, without having to know and list out all known permutations that could appear in your dataset
  • Words regardless of spelling variations or common misspellings
  • All emails from a certain domain

Before diving into some examples, let's break down the key components of a basic regular expression search:

Search boundary: Because regular expression is a special type of content search, we must use the forward slash character, “/“ to indicate the boundaries of our search. 

  • For example, “/Everlaw/“ is a regular expression search for the word “Everlaw”

Search operators: Search operators allow you to add additional precision, constraints, or logic to your search. Some notable operators are described in the table below. 

Operator          Description                                                            Example         Matches
|                 OR                                                                     /Prices|Today/  Prices OR Today
.                 Single character wildcard search                                       /c.d/           cad, cbd, ccd, cdd, …
*                 Preceding character is matched zero or more times                      /buz*/          bu, buz, buzz, buzzz, etc.
?                 Preceding character is optional; matched at most once                  /colou?r/       colour, color
+                 Preceding character must be matched at least once                      /ma+d/          mad, maad, maaad, etc.
{n}, {n,}, {n,m}  Preceding character is matched exactly n, at least n, or n to m times  /3{2,5}/        33, 333, 3333, 33333
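You can try the operators from the table with Python's built-in `re` module, which supports the same basic operators. This is only an approximation: Everlaw's regex engine may differ from Python's in other respects, and the use of `re.fullmatch` (the pattern must match a whole term) and case-insensitive matching are assumptions made for this sketch. The candidate terms are invented.

```python
import re

# Patterns from the table, each with a few made-up candidate terms.
EXAMPLES = {
    "Prices|Today": ["Prices", "Today", "Tomorrow"],
    "c.d": ["cad", "cbd", "cd"],
    "buz*": ["bu", "buzzz", "bus"],
    "colou?r": ["color", "colour", "colouur"],
    "ma+d": ["md", "mad", "maaad"],
    "3{2,5}": ["3", "33", "33333", "333333"],
}


def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c, flags=re.IGNORECASE)]


for pattern, candidates in EXAMPLES.items():
    print(f"/{pattern}/ ->", matches(pattern, candidates))
# /Prices|Today/ -> ['Prices', 'Today']
# /c.d/ -> ['cad', 'cbd']
# /buz*/ -> ['bu', 'buzzz']
# /colou?r/ -> ['color', 'colour']
# /ma+d/ -> ['mad', 'maaad']
# /3{2,5}/ -> ['33', '33333']
```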

 

Character sets: Character sets allow you to specify searches where a single character can match any of the characters in the set. To create a character set, enclose the characters you want to match in brackets (“[]”).

  • For example, depending on whether you are using British or American spelling, verbs can end either in “ize” or “ise”. An example of this is “organize” and “organise”. With regular expression, you can easily search for both variations with the following search: “/organi[zs]e/“.
  • Character sets can also be negated by adding the caret character, “^”, after the opening bracket. This indicates that the character can match any character not specified in the set. For example, /a[^abcde]/ searches for any two-letter word where the first character is “a” and the second character is anything that is not “a”, “b”, “c”, “d”, or “e”. 
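A quick way to check a character set is to test it against a few terms in Python. The sketch below assumes whole-term, case-insensitive matching (via `re.fullmatch` and `re.IGNORECASE`), which approximates rather than reproduces Everlaw's behavior; `term_matches` is a hypothetical helper.

```python
import re


def term_matches(pattern, term):
    """True if the pattern matches the whole term, ignoring case."""
    return re.fullmatch(pattern, term, flags=re.IGNORECASE) is not None


# A set accepts any one of its listed characters.
print(term_matches("organi[zs]e", "organize"))  # True
print(term_matches("organi[zs]e", "organise"))  # True
print(term_matches("organi[zs]e", "organite"))  # False

# A negated set accepts any one character that is NOT listed.
print(term_matches("a[^abcde]", "at"))  # True
print(term_matches("a[^abcde]", "ad"))  # False
```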

Character ranges: Instead of listing out every value you want to match in a character set, you can specify a range by using hyphens (“-”). For example:

  • [a-z] means a given character can take on any letter value between “a” and “z”. Regex is case sensitive, so [a-z] is different from [A-Z]. However, note that searches in Everlaw are case insensitive.
  • [1-4] means that a given digit can take on any value between 1 and 4, inclusive (i.e., any of 1, 2, 3, or 4). Note that regex does not have any formal understanding of numbers. Instead, it treats each digit separately as a character that can take on any value between 0 and 9, inclusive. This introduces some non-obvious behaviors when it comes to searching numbers. For example, if you want to find numbers in the range 10 to 59, you may expect the following regex search to be successful: /[10-59]/. But while this is a valid search, it actually only matches the single digits 0, 1, 2, 3, 4, 5, and 9. If you instead want to find numbers in the range 10 to 59, you must use the following regex search: /[1-5][0-9]/. This matches any two-digit number where the first digit is between 1 and 5, and the second digit is between 0 and 9. Thus, the lower bound of this range is 10 and the upper bound is 59.
  • Character ranges can also be concatenated. For example, /[a-zA-Z0-9]X/ searches for any two-character word where the first character is any digit or any lowercase or uppercase letter, and the second character is X. 
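The /[10-59]/ pitfall is easy to verify for yourself. The following Python sketch uses the standard `re` module, whose character-range rules work the same way for this case, to list what each pattern actually matches. The variable names are illustrative only.

```python
import re

# [10-59] is read as: the character "1", the range "0"-"5", and the character "9".
single_digits = [d for d in "0123456789" if re.fullmatch("[10-59]", d)]
print(single_digits)  # ['0', '1', '2', '3', '4', '5', '9']

# It never matches a two-digit number, because a set matches exactly one character.
print(re.fullmatch("[10-59]", "42"))  # None

# [1-5][0-9] matches one digit from 1-5 followed by one digit from 0-9.
two_digit_numbers = [n for n in range(100) if re.fullmatch("[1-5][0-9]", str(n))]
print(two_digit_numbers[0], two_digit_numbers[-1], len(two_digit_numbers))  # 10 59 50
```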

Escaping: Regex has special characters that carry special meaning when used in a search. These special characters include the operators described in the table above. But what if you want to use a special character as the character itself instead of its special search meaning? For example, what if you want to search a dot/period or a question mark instead of performing a wildcard or matching search? In such cases, you have to “escape” the special character by enclosing it in brackets ("[]"). This tells the system you wish to use the character as a character instead of by its special search functionality. So, for example, /d[.]d/ returns only the exact string “d.d” whereas /d.d/ returns any three character string that begins and ends with d and has any character in between. 
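The difference between /d[.]d/ and /d.d/ can be checked with a short Python sketch. It assumes whole-term matching through `re.fullmatch`, and the candidate strings are made up for the example.

```python
import re

candidates = ["d.d", "dad", "d-d", "dd"]

# [.] is a one-character set containing only a literal dot.
literal_dot = [c for c in candidates if re.fullmatch("d[.]d", c)]
print(literal_dot)  # ['d.d']

# An unescaped dot matches any single character.
any_character = [c for c in candidates if re.fullmatch("d.d", c)]
print(any_character)  # ['d.d', 'dad', 'd-d']
```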

Using these operators and components in tandem allows you to create complex regex searches. Here are some useful examples: 

Number Pattern: Assume the numbers you are searching for follow this pattern: XXX.XXX.XXXX. Here is a regex search that matches this number pattern: /[0-9]{3}[.][0-9]{3}[.][0-9]{4}/

Regex_Examples_-_phone_numbers.png

While it may look complicated at first glance, it is much easier to understand once decomposed into its component parts. You can adapt this to fit any unique number pattern, whether it be phone numbers, social ID numbers, credit cards, etc. Keep in mind that this search could be under-inclusive in the sense that it only captures patterns where the three chunks of numbers are separated by precisely one dot. A more inclusive search could be constructed as follows: /[0-9]{3}/ /[0-9]{3}/ /[0-9]{4}/. However, this may be over-inclusive, as it would match any of the three chunks independently of the others. Please note that many non-letter characters (such as hyphens) are not indexed by Everlaw, and therefore cannot be searched as such. 
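Before running a number pattern across a project, it can help to try it on sample text. The Python sketch below applies the XXX.XXX.XXXX pattern with the standard `re` module; the sample sentence and its numbers are invented, and Python scans raw text rather than Everlaw's index, so treat the result as an approximation.

```python
import re

# Same pattern as the search /[0-9]{3}[.][0-9]{3}[.][0-9]{4}/
NUMBER_PATTERN = re.compile(r"[0-9]{3}[.][0-9]{3}[.][0-9]{4}")

sample = "Call 415.555.0134 or 415-555-0199; reference 12.345.6789."

# Only the dot-separated number with 3-3-4 digit groups is found.
print(NUMBER_PATTERN.findall(sample))  # ['415.555.0134']
```

The hyphenated number is missed, which is the under-inclusiveness described above.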

Email addresses:

Regex_Examples_-_email__1_.png

The logic of this search (/[a-z0-9._!%+]+@[a-z0-9]+[.][a-z0-9]{2,4}/) is that there must be at least one character before the [at] symbol for the username, at least one character between the [at] and [dot] symbols for the domain name, and at least two (but no more than four) characters for the top-level domain. You can modify this template regex search to fit your particular needs. For example, if you believe that the top-level domain can have more than four characters, you can relax this constraint. You may also already know the domain and top-level domain, or there may be other special symbols that could appear in the username. 
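To see where the template is strict, you can test it against a few addresses in Python. In this sketch, `re.IGNORECASE` and whole-term matching are assumptions, the example addresses are invented, and `RELAXED_PATTERN` is a hypothetical variation (hyphens allowed in the username, no upper limit on the top-level domain), not a pattern supplied by Everlaw.

```python
import re

# The template search from this article.
EMAIL_PATTERN = re.compile(r"[a-z0-9._!%+]+@[a-z0-9]+[.][a-z0-9]{2,4}", re.IGNORECASE)

# Hypothetical relaxed variant: "-" added to the username set, {2,4} loosened to {2,}.
RELAXED_PATTERN = re.compile(r"[a-z0-9._!%+-]+@[a-z0-9]+[.][a-z0-9]{2,}", re.IGNORECASE)


def is_email_like(term, pattern=EMAIL_PATTERN):
    return pattern.fullmatch(term) is not None


print(is_email_like("jane.doe@example.com"))    # True
print(is_email_like("info@example.museum"))     # False: top-level domain has 6 characters
print(is_email_like("first-last@example.com"))  # False: "-" is not in the username set
print(is_email_like("info@example.museum", RELAXED_PATTERN))     # True
print(is_email_like("first-last@example.com", RELAXED_PATTERN))  # True
```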

Common passport patterns

Here are some regular expressions you can use to search for common passport number patterns. These patterns are not extremely specific, and you are likely to find some false positives amongst the hits:

  • United States: /[a-z0-9][0-9]{8}/
  • United Kingdom: /[a-z0-9][0-9]{8}/
  • Canada: /[a-z]{2}[0-9]{6}/
  • Australia: /[NEDFACUX][0-9]{7}/ OR /R[A-Z][0-9]{7}/ OR /P[ABCDEFUWXZ][0-9]{7}/ 

Content searches:  

Regular expression can be used to create useful content searches. For example:

  • /[bcg]oat/  will find boat, coat, or goat. Either b, c, or g could occupy the first spot in the text string.
  • /gr[ae]y/ will find grey or gray. Either a or e could occupy the third spot in the text string. 
  • /.?oa.?/ will find words that are between 2 and 4 characters in length and contain “oa”. For example, this search will find words like loan, load, goad, oar, and boa.
  • /.*oa.*/ will find words of two or more characters that contain “oa”. For example, this search will find words like loan, load, goad, oar, boa, coats, and float.  
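These four searches can be compared side by side on a small word list. The Python sketch below assumes each pattern must match a whole word (`re.fullmatch`); the word list and the `hits` helper are hypothetical.

```python
import re

WORDS = ["boat", "coat", "goat", "grey", "gray", "loan", "oar", "boa", "coats", "float"]


def hits(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in WORDS if re.fullmatch(pattern, w)]


print(hits("[bcg]oat"))  # ['boat', 'coat', 'goat']
print(hits("gr[ae]y"))   # ['grey', 'gray']
print(hits(".?oa.?"))    # ['boat', 'coat', 'goat', 'loan', 'oar', 'boa']
print(hits(".*oa.*"))    # ['boat', 'coat', 'goat', 'loan', 'oar', 'boa', 'coats', 'float']
```

The last two lines show the effect of swapping “?” for “*”: coats and float have more than one character on one side of “oa”, so only the second pattern finds them.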

Prebuilt Searches

To help you find common content patterns for personally identifying information, Everlaw has prebuilt some searches that project admins can add as hit highlights. To add these prebuilt hit highlight searches, navigate to the General > Persistent Highlights section of Project Settings.

Screen_Shot_2022-04-15_at_11.49.38_AM.png

 

Smart Expressions

Smart expressions are an easy way to search for personally identifiable information (PII) patterns. These expressions allow for faster and easier searching of certain patterns of numbers, letters, and symbols. In addition, smart expressions have built-in checks to reduce the number of false positives that one might see when running searches based on regular expressions. 

  • Social Security numbers: This search returns hits on three groups of numbers of the form three digits, two digits, four digits. A string of numbers is not a valid Social Security number, and is not returned by searches for <ssn>, if it starts with “9”, “666”, or “000”, if the middle group is “00”, or if the last group is “0000”.
  • Phone numbers: This search returns hits for both US phone numbers and non-US phone numbers that contain a country code.
  • Email addresses: Searching for <email> returns email addresses that are found in documents.
  • Employer Identification Numbers (EIN): This search returns hits on two groups of numbers of the form two digits, seven digits. A string of numbers must have a valid EIN prefix to be considered a valid EIN.
  • International Bank Account Numbers (IBAN): IBAN searches return hits on groups of four alphanumeric characters, with the possibility of the last group being one to three characters in length. For an IBAN to be valid, it must have a valid country code and pass a checksum.
  • Credit card numbers: This search returns valid credit card numbers. For a string to be considered a credit card number, it must be the appropriate length and pass a checksum.
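To give a sense of what these built-in checks do, here is a minimal Python sketch of two of them. It is not Everlaw's implementation. The Social Security check encodes only the rules listed above, and the optional separator between groups is an assumption. The article does not name the credit card checksum; the sketch assumes the Luhn algorithm, which is the standard check for payment card numbers. Both function names are hypothetical.

```python
import re


def looks_like_valid_ssn(candidate):
    """Apply the rules listed above to a 3-2-4 digit string (separator handling is assumed)."""
    match = re.fullmatch(r"(\d{3})\D?(\d{2})\D?(\d{4})", candidate)
    if match is None:
        return False
    first, middle, last = match.groups()
    if first.startswith("9") or first in ("666", "000"):
        return False
    return middle != "00" and last != "0000"


def passes_luhn(number):
    """Luhn checksum (assumed here; the article only says 'a checksum')."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for offset, digit in enumerate(reversed(digits)):
        if offset % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(looks_like_valid_ssn("123-45-6789"))  # True
print(looks_like_valid_ssn("666-12-3456"))  # False
print(passes_luhn("4111 1111 1111 1111"))   # True (a widely used test number)
print(passes_luhn("4111 1111 1111 1112"))   # False
```

Checks like these are why a smart expression can reject strings that a plain digit-pattern regex would return as hits.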

Before diving into some examples, let's break down the key components of a basic smart expression search: 

Search boundary: Because smart expressions are a special type of content search, we must enclose the expression in angle brackets (“<” and “>”). 

  • For example, “<ssn>“ is a smart expression search for Social Security numbers.

Search operators: Search operators allow you to add additional precision, constraints, or logic to your search. Some notable operators are described in the table below. 

 

Operator  Description                       Example                 Matches
?         Single character wildcard search  <email=abc?@gmail.com>  abca@gmail.com, abcb@gmail.com, abcd@gmail.com, …
*         Wildcard search                   <phone=661*>            (661) 555-1234, 661 555-2222, 661 123-4567
*         Wildcard search                   <email=*@everlaw.com>   johnb@everlaw.com, alfredg@everlaw.com

 

You can use these search operators in a smart expression to find specific instances of a common pattern. For example, if you are looking for phone numbers, you can search just for <phone>. If you are searching for phone numbers that begin with a specific area code, you can use an equals sign to search for <phone=661*>. The wildcard in this search means that it returns phone numbers with the area code 661, regardless of whether the form is (661) 555-1234, 661-555-1234, or 661 555 1234.
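One way to picture why <phone=661*> matches all three formats is to compare digits only. The Python sketch below does exactly that. How Everlaw normalizes phone numbers internally is not described in this article, so the digit-stripping approach and the `phone_has_prefix` helper are assumptions made for illustration.

```python
import re


def phone_has_prefix(candidate, prefix):
    """Compare digits only, so punctuation and spacing do not matter (assumed behavior)."""
    digits = re.sub(r"\D", "", candidate)
    return digits.startswith(prefix)


for number in ["(661) 555-1234", "661-555-1234", "661 555 1234", "(415) 555-1234"]:
    print(number, phone_has_prefix(number, "661"))
# (661) 555-1234 True
# 661-555-1234 True
# 661 555 1234 True
# (415) 555-1234 False
```

This simple version does not account for country codes, which would appear before the area code.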

 

 

Smart expressions in Persistent Highlights

Smart expressions are used to find patterns on the Persistent Highlights page. Persistent highlights set up before September 15, 2023 highlight terms based on regular expressions and are marked as (Legacy) terms. 

 

We recommend updating your persistent highlights to smart expressions, as they are faster and return fewer false positives than the legacy personal information pattern searches. To update your persistent highlights to include smart expressions, navigate to the Persistent Highlights page of your project settings, delete the legacy terms, and add the smart expressions you wish to use from the Unused or “Add more region-specific personal information patterns” sections.

Limiting your search based on context

You can also search for a certain word or phrase, while excluding specific contexts in which the word or phrase may occur. To do this, use NOT within your contents search. For example, if you want to do a wildcard search for “proto*” but don’t want your search to return any variant of “protocol”, you can build a search that looks like this:

proto_NOT.png

You can do the same for phrase searches. For example, if you want emails with the phrase “secret meeting” in them, but do not want emails returned where “secret meeting” only appears within the phrase “I have no interest in a secret meeting”, you can build a search that looks like this:

secret_meeting_NOT_no_secret_meeting.png

Limiting your search based on context also works for regular expression searches. For example, if you are looking for documents that contain the year 2017 in their contents, but you do not want your results to yield instances of “2017” that are part of phone numbers, you could construct a search that looks like this:

2017_NOT_phone_number.png

Finally, you may want to search for a certain term but ignore places where it occurs in a predictable common phrase, such as in the footer of many docs. You can add a hyphen (-) to your content search preceding the word or phrase you want to ignore. Do not add a space between the hyphen and the phrase you want to exclude.

For example, if you want to search for the word "privilege" but ignore instances where "privilege" is included in the phrase "attorney client privilege," build your search like this: 

"(privilege -"attorney client privilege")"

Screen_Shot_2020-05-08_at_1.50.49_PM.png 
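The effect of the hyphen can be pictured as "count a hit only when it falls outside the ignored phrase." The Python sketch below illustrates that idea on plain strings; it is a conceptual approximation rather than Everlaw's search logic, and the helper name and sample sentences are invented.

```python
import re


def count_hits_outside_phrase(text, term, ignored_phrase):
    """Count occurrences of term that are not part of ignored_phrase."""
    lowered = text.lower()
    # Character spans covered by the phrase to ignore.
    ignored = [m.span() for m in re.finditer(re.escape(ignored_phrase.lower()), lowered)]
    count = 0
    for m in re.finditer(r"\b" + re.escape(term.lower()) + r"\b", lowered):
        inside = any(start <= m.start() and m.end() <= end for start, end in ignored)
        if not inside:
            count += 1
    return count


body = "Attorney client privilege applies. She asserted privilege twice."
footer_only = "Confidential: attorney client privilege."

print(count_hits_outside_phrase(body, "privilege", "attorney client privilege"))         # 1
print(count_hits_outside_phrase(footer_only, "privilege", "attorney client privilege"))  # 0
```

A document like the second one, where the term appears only inside the ignored phrase, would not be returned.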

Considerations regarding characters indexed by Everlaw

Everlaw's indexing methodology affects the type of content you can search for. Here are some key points to be aware of as you construct content searches:

  • Most non-letter characters are not indexed, and therefore are treated as spaces by Everlaw. The exceptions include numbers and the following characters: $#@&%€¢£¥₩₹฿₫₴₪
  • Some characters are treated as spaces when prepending or appending words, but are considered part of the word (and thus indexed) when they appear within a word. Examples of such characters include colons and dots. 
  • The apostrophe (') is indexed. This means that, for example, "John's" can be distinguished from "Johns." However, when an apostrophe occurs in a word in the middle of a phrase that is three words or longer (e.g., "St. John's Road"), the search returns no results. If you need to run such a search, replace the apostrophe with a space (e.g., "St. John s Road").  

Search for numbers and special characters in CSV files

In Excel spreadsheets uploaded prior to June 18, 2021 (that have not been reprocessed since then) or any CSV file, numbers in cells next to each other are combined in the search index. As a result, you should use the wildcard search (*) for all numeric or special character search queries over CSV files.