Why Use E5 Exact Data Matching?

When securing the modern enterprise with Microsoft E5 tooling, organizations rely on Sensitive Information Types (SITs) for their data governance. SITs are often presented as regex patterns, keyword lists, or dictionaries, with little attention to the fact that a single SIT can combine those tools. Worse, Exact Data Matching (EDM) SITs are often left out of the discussion entirely. This is unfortunate, as EDM SITs give a security team a very surgical tool, allowing near 100% confidence in the data being detected.

An EDM SIT is created by uploading a spreadsheet to the Microsoft Purview Exact Data Match creation wizard. In the spreadsheet, each column represents a field, while each row represents an entity. There are limits to the size of the data being uploaded, but those will be discussed in a later post. First, I want to make sure we are both on the same page about what SITs are, and how EDM SITs differ.
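To make the column-and-row layout concrete, here is a minimal Python sketch that builds such a table as CSV text. The field names (PatientID, MRN, LastName) and every value are made up for illustration, and the exact file format and schema the wizard expects are not covered here, so treat this as a picture of the shape of the data rather than an upload-ready file.

```python
import csv
import io

# Hypothetical field names and made-up values, for illustration only.
FIELDS = ["PatientID", "MRN", "LastName"]
ROWS = [
    {"PatientID": "P-000123", "MRN": "MRN-48291", "LastName": "Rivera"},
    {"PatientID": "P-000124", "MRN": "MRN-48292", "LastName": "Okafor"},
]


def build_table(fields, rows):
    """Return CSV text with one column per field and one row per entity."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()    # header row: the field names
    writer.writerows(rows)  # one line per entity
    return buffer.getvalue()


table_text = build_table(FIELDS, ROWS)
print(table_text)
```

Each line after the header describes one entity (here, one patient), and each comma-separated position is one field of that entity.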

A Brief Overview of SITs

Microsoft Purview's SITs are the atoms that build up the Purview toolset. SITs are regex patterns, keyword lists, dictionaries, and, in the case of EDM SITs, spreadsheets that allow an organization to detect sensitive information in documents, emails, and files. SITs can be used across multiple platforms like SharePoint, OneDrive, Teams, Devices, and on-premises repositories. Purview comes pre-equipped with 305 SITs, with new SITs being added nearly monthly. EDM SITs can be used in most, but not all, of the situations where any other SIT can be used. These include:

Microsoft Purview Data Loss Prevention
Auto-labeling (service and client side)
Microsoft Purview Insider Risk Management
Microsoft Purview eDiscovery
Microsoft Defender for Cloud Apps

The most notable tool missing from this list is Microsoft Purview Data Lifecycle Management, which means that EDM SITs cannot be used to auto-apply retention labels, unlike normal SITs.

Now that we’re on the same page about what a SIT is, let’s discuss why one might want to use an EDM SIT.

Ignorance in Data Protection

Detecting sensitive data in a large organization can be difficult for many reasons, but fundamentally the largest challenge comes from the different forms data can take. Let's look at SSNs as an example.

There are millions of unique social security numbers, and very few publicized invalid number combinations to ignore. Maintaining a database of all current SSNs would not only be technically challenging, it would also present a huge security concern. In this case, we are ignorant of the data that we might encounter in our system, meaning that we must look for patterns that suggest a number may be an SSN, and worth protecting. Microsoft does this with their out-of-the-box SIT for SSN detection by looking for nine-digit numbers, either with the proper SSN format (xxx-xx-xxxx or xxx xx xxxx) or without it (xxxxxxxxx), as well as relevant keywords that suggest the number is an SSN (“SSN”, “social security number”, “SS#”, etc). Microsoft lists their complete definition of their out-of-the-box SIT here.
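The pattern-plus-keyword idea can be sketched in a few lines of Python. This is not Microsoft's actual SIT definition: the regex covers only the three formats mentioned above, the keyword list is just the three examples given, and the 300-character proximity window and the "higher"/"lower" labels are assumptions of mine to show how supporting evidence raises confidence.

```python
import re

# Nine digits, formatted (xxx-xx-xxxx or xxx xx xxxx) or unformatted.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|\b\d{3} \d{2} \d{4}\b|\b\d{9}\b")

# Supporting keywords taken from the examples in the text.
KEYWORDS = ("ssn", "social security number", "ss#")

# Hypothetical proximity window, in characters, on each side of a match.
WINDOW = 300


def find_possible_ssns(text):
    """Return (candidate, confidence) pairs for every SSN-shaped number."""
    hits = []
    for match in SSN_PATTERN.finditer(text):
        start = max(0, match.start() - WINDOW)
        end = min(len(text), match.end() + WINDOW)
        nearby = text[start:end].lower()
        # A nearby keyword makes it more likely the number really is an SSN.
        has_keyword = any(keyword in nearby for keyword in KEYWORDS)
        hits.append((match.group(), "higher" if has_keyword else "lower"))
    return hits


print(find_possible_ssns("Employee SSN: 123-45-6789"))
print(find_possible_ssns("Order 123456789 has shipped"))
```

Note that the second call still flags an order number as a possible SSN, only with lower confidence. That ambiguity is exactly what probabilistic matching has to live with.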

When we are ignorant of the exact data we may encounter in our systems, probabilistic matching is the best we can do. What do we do, then, when we know the exact sensitive data we may encounter, especially sensitive data that is generated by our own systems? We use exact data matching (c’mon, that was obvious).

Knowledge in Data Protection

I have worked with many healthcare organizations, all of which generate their own medical record numbers (MRNs) and patient ID numbers (PIDs). These numbers are unique not only in their composition (no two are the same) but also in the sense that this is a very specific piece of data (due to its unique composition) that is generated by the organization (unlike an SSN) and must be protected by the organization. Generally, these organizations use tools like an Epic database to track all of these numbers. However, the database only maintains knowledge of the sensitive data; we still need to do something with that knowledge.

This is where EDM SITs make their much-needed entrance.

EDM SITs allow us to detect these numbers across almost every Microsoft workload using the vast majority of Microsoft Purview’s tools. Suddenly, using EDM SITs, DLP policies can accurately detect MRNs in transit to unauthorized users. Auto-labeling policies can accurately detect PIDs and protect the documents they’re in. All because they know exactly what they’re looking for. Of course, understanding the concept is the easy part, and we must now implement Microsoft Purview Exact Data Matching.
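As a conceptual contrast with pattern matching, the sketch below flags a value only when it appears in a known list. The MRN format, the sample values, and the function name are all hypothetical, and this says nothing about how Purview stores or compares the uploaded data internally. It only illustrates why knowing the exact values removes the guesswork.

```python
import re

# Made-up MRNs standing in for the values an organization has generated.
KNOWN_MRNS = {"MRN-48291", "MRN-48292"}

# Hypothetical MRN format, used only to pull candidates out of the text.
CANDIDATE = re.compile(r"\bMRN-\d{5}\b")


def find_known_mrns(text, known):
    """Return only the candidates that exactly match a known value."""
    return [m.group() for m in CANDIDATE.finditer(text) if m.group() in known]


# MRN-99999 has the right shape but was never issued, so it is ignored.
print(find_known_mrns("Charts MRN-48291 and MRN-99999 attached", KNOWN_MRNS))
```

A number that merely looks like an MRN is not reported; only values the organization actually issued are.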

Disclaimer: The opinions and content are my own and do not necessarily represent Edgile’s position or opinion.