Data privacy regulations, the continuing onslaught of data breaches, and the desire to protect sensitive data all work together to expand the requirements on organizations to protect their sensitive data. One technique for better protecting sensitive data is data masking. In this article, I will examine the issues increasing the need to better protect data, and then describe the benefits that can be achieved with data masking.
Risks to Sensitive Data Are Increasing
Data Breaches
Data breaches are one of the biggest reasons for the uptick in security concerns. Surreptitious access to sensitive data is a big problem that is nowhere close to being resolved. Even if you no longer pay attention to the continual reports of data breaches in the news, they are still occurring quite frequently.
The Privacy Rights Clearinghouse began keeping records on data breaches back on February 15, 2005. That is the date of the ChoicePoint data breach, which was one of the first major newsworthy data breaches. Since then, there have been over 9000 data breaches impacting over 10.5 billion total records containing sensitive personal information. This is an average of 12 data breaches a week.
Another data point comes from Gemalto, a digital security firm, which reports that more than six million records are lost or stolen every day. That works out to more than 4,100 records every minute, or about 69 records every second.
There are numerous negative impacts of a data breach, including legal action, lost customers, and financial losses. According to a recent IBM Security report, in 2022 the average cost of a data breach increased by 2.6% to over 4.3 million US dollars, up from around 4.25 million USD the year before (2021). But keep in mind these are worldwide averages. For the US, which is the country with the highest average cost, the number is more than double at $9.44 million.
So, avoiding and minimizing the impact of data breaches is a core data protection requirement.
Regulatory Compliance
An additional driver of the need to better protect data is the expanding number of industry and governmental regulations. There are many types of regulations that impact your data and the things you need to do to protect that data. There are Corporate Governance regulations that dictate the way companies are directed and controlled, such as Sarbanes-Oxley (SOX), the Gramm-Leach-Bliley Act (GLBA), and the Fair and Accurate Credit Transactions Act of 2003 (FACTA). Then there are Data Privacy & Protection regulations that dictate how data must be secured and protected from access. Some examples of these include the Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI-DSS), and the European General Data Protection Regulation (GDPR). Finally, there are Data Retention & Request regulations specifying the length of time that data must be maintained. The longer the data must be kept, the more controls need to be in place to ensure that the data is protected. Examples of these regulations include the Federal Rules of Civil Procedure (FRCP), the US Food & Drug Administration (FDA) regulations, and the California Consumer Privacy Act (CCPA).
It is not the intention of this article to explain the intent and purpose of all of these regulations. But your organization must understand all of the regulations that apply to it and take steps to comply with them.
It can be costly to run afoul of regulatory compliance requirements. The penalty for non-compliance can be steep. For GDPR, fines can be up to 20 million euros or 4% of total worldwide annual turnover of the preceding year, whichever is higher.
Violations of the CCPA can result in civil penalties of $2,500 for each violation or $7,500 for each intentional violation after notice and a 30-day opportunity to cure have been provided. Violations of the HIPAA Privacy Rule can result in civil penalties ranging from fines of $100 per incident, up to $25,000 per person, and criminal penalties with fines up to $250,000 and imprisonment for up to 10 years.
Aside from financial penalties, many businesses will require their vendors to be fully compliant with pertinent regulations as a condition of doing business. And companies not in compliance may lose business in general as penalties make the news and potential clients avoid the newsmakers.
Data Masking to the Rescue
It makes sense then to put policies and procedures in place to control access to sensitive data that is under attack. Not only will it help you to avoid costly data breaches, but it will help you comply with mandatory regulations.
A popular technique for protecting sensitive data is data masking, a method that creates data that is structurally similar to production data but that is not the same as the actual data. After data is masked it can be used by application systems the same way as the actual, production data. But protected, sensitive data values are not exposed for all to see.
The term PII, or personally identifiable information, has been coined to describe the type of data that permits the identity of an individual to be directly or indirectly inferred, including any information that is linked or linkable to that individual. Examples of PII include names, phone numbers, account numbers, payment card numbers, birth dates, and more. There are many types of sensitive data that qualify as PII, and therefore need to be protected. This is the type of information where data masking can be a worthwhile practice to protect your data from prying eyes.
Data masking is sometimes referred to by other names, such as anonymization or pseudonymization or even data obfuscation. They all basically mean the same thing: concealing sensitive data by de-identifying or masking the data. This masking changes the data while keeping it useful for development, testing and perhaps training purposes.
Masking replaces accurate data with different, useful, but inaccurate data to protect your PII. It can thereby thwart data breaches and improve your compliance efforts and projects.
When it comes to data masking, we are generally not talking about production data, which should be protected by other security and authorization protocols. Of course, some types of situational data masking of production data can make sense, but it is not the core functionality of data masking to protect production data.
Data masking is most appropriate when building test beds of data for developing and testing application programs. Many organizations simply copy production data to a test environment, thereby creating a usable test bed of data. But simply copying data from prod to test exposes the PII. Some of the data is sensitive and should not be accessible by application developers. Furthermore, it is likely that test systems are less rigorously protected than production systems. Given that, additional precautions such as masking the data make sense. You don’t want to expose PII to your programmers, such as salary information and phone numbers of co-workers or customer contact information. Worse yet, you don’t want to expose customer credit card details to everybody.
But you can’t just generate fake data that is a bunch of gibberish either. You need to ensure that referential integrity is maintained in test, even as data values change. You need useful data for test cases. And you may also need to ensure consistent data conversions.
These are not inconsequential requirements.
When you mask data, valid production data is replaced with consistent, usable, referentially-intact, but not accurate data. After masking, the test data is usable just like production, but the information content is secure, because the actual values have been changed.
How Is Data Masked?
The general idea of data masking is to create reasonable test data that can be used like the production data, but without using and therefore exposing the sensitive information. Data masking protects the actual data, but provides a functional substitute for tasks that do not require actual data values.
Data masking is an important component of building any test bed of data — especially when data is copied from production. To comply with pertinent regulations, all PII must be masked or changed, and if it is changed, it should look plausible and work the same as the data it is masking. Think about what this means:
- Referential constraints must be maintained. If primary or foreign keys change (and they may have to, if the original data can be figured out from the key), the data must be changed the same way in both the parent and child tables.
- Unique constraints must be enforced. If a column, or group of columns, is supposed to be unique, then the masked version of the data must also be unique.
- The masked data must conform to the same validity checks that are used on the actual data. For example, a random number will not pass a credit card number check. The same is true of the social insurance number in Canada and the social security number in the US (although each has its own rules).
- And do not forget about related data. For example, city, state, and zip code values are correlated, meaning that a specific zip code aligns with a specific city and state. As such, the masked values should conform to the rules.
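To make the validity-check requirement concrete, here is a minimal Python sketch that generates a replacement card number passing the Luhn check, the checksum commonly used for payment card numbers. The function names are hypothetical, and the sketch makes simplifying assumptions: it ignores issuer prefixes, and because the digits are random it guarantees neither uniqueness nor consistency across tables.

```python
import secrets


def luhn_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_check_digit(partial: str) -> str:
    """Compute the final digit that makes partial + digit pass Luhn."""
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:  # these digits are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def mask_card_number(original: str) -> str:
    """Replace a card number with random digits of the same length
    that still pass the Luhn check (hypothetical helper)."""
    body = "".join(str(secrets.randbelow(10)) for _ in range(len(original) - 1))
    return body + luhn_check_digit(body)


masked = mask_card_number("4111111111111111")
print(masked, luhn_valid(masked))
```

A production-grade tool would layer the other requirements on top of this, such as enforcing uniqueness and masking the same input the same way everywhere it appears.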
Furthermore, the masking process must not be trivial. A useful data masking procedure has three important qualities. The first is permanence: once the data is masked, or anonymized, it should stay masked, with no way to unmask it starting from the masked value. The second is irreversibility: the masking process itself should not be able to be run backward to recover the original. The third is that the original cannot be inferred: it must not be possible to infer or deduce the content of the original, unmasked data from the masked value.
Furthermore, be aware that there are two different types of data masking at a high level: static and dynamic. Static data masking permanently replaces sensitive data by altering data at rest. Dynamic data masking replaces sensitive data in transit, leaving the original at-rest data intact and unaltered. Dynamic data masking is sometimes called on-the-fly data masking because the data on disk is unaltered but is modified on-the-fly when it is accessed.
Static data masking is more useful for creating test beds of data for use by developers, whereas dynamic is more useful if you want to give specific users access to data without exposing PII and without having to convert it on disk beforehand.
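The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only: the row layout, the "auditor" role, and the function names are assumptions, and simple redaction with "#" stands in for whatever masking rule would really be applied.

```python
def redact_digits(value: str, keep_last: int = 4) -> str:
    """Replace every digit with '#' except the final keep_last digits."""
    total_digits = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - keep_last else "#")
        else:
            out.append(ch)  # keep separators such as '-' as they are
    return "".join(out)


def static_mask(rows):
    """Static masking: build a test-bed copy whose stored values are masked."""
    return [{**row, "card": redact_digits(row["card"])} for row in rows]


def dynamic_read(rows, user_role):
    """Dynamic masking: stored rows are untouched; values are masked on the
    fly at read time unless the reader has a privileged (hypothetical) role."""
    for row in rows:
        if user_role == "auditor":
            yield dict(row)
        else:
            yield {**row, "card": redact_digits(row["card"])}


production = [{"id": 1, "card": "4111-1111-1111-1111"}]

test_bed = static_mask(production)                      # masked values are what gets stored
dev_view = list(dynamic_read(production, "developer"))  # masked only when read

print(test_bed[0]["card"])    # ####-####-####-1111
print(dev_view[0]["card"])    # ####-####-####-1111
print(production[0]["card"])  # 4111-1111-1111-1111
```

In the static case the test bed never contains the real value. In the dynamic case the real value is still on disk, so protection depends on every access path going through the masking layer.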
Finally, there is deterministic data masking, which can apply to either static or dynamic data masking techniques. Deterministic data masking ensures the consistency of the masking process. For example, a specific value (say “ABC”) always becomes another specific value (say “XQW”). Deterministic data masking is essential for maintaining the viability of referential integrity.
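One way to get this deterministic behavior, sketched below in Python, is to derive the masked value from a keyed hash (HMAC) of the original. This is an assumption about how it could be done, not a description of any particular tool; the key, the table layouts, and the "C" prefix are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it must be protected and kept out of the
# test environment, since anyone holding it can recompute the masks.
SECRET_KEY = b"replace-with-a-protected-secret"


def mask_key(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically map a key value to a masked token.
    The same input and key always produce the same output."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; truncation makes collisions possible
    # (though unlikely), so a real tool would verify uniqueness.
    return "C" + digest[:12]


customers = [{"customer_id": "ABC"}, {"customer_id": "DEF"}]
orders = [
    {"order_id": 1, "customer_id": "ABC"},
    {"order_id": 2, "customer_id": "DEF"},
    {"order_id": 3, "customer_id": "ABC"},
]

masked_customers = [{**c, "customer_id": mask_key(c["customer_id"])} for c in customers]
masked_orders = [{**o, "customer_id": mask_key(o["customer_id"])} for o in orders]

# Every masked foreign key still points at a masked parent row.
parent_ids = {c["customer_id"] for c in masked_customers}
print(all(o["customer_id"] in parent_ids for o in masked_orders))
```

Because parent and child tables are masked with the same function and key, joins between them keep working even though no original key value survives.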
Guidance for Choosing a Data Masking Solution
A reliable method of automating the process of data masking that understands these issues and solves them is clearly needed. And this typically requires a tool to implement properly. Tools can utilize multiple different types and techniques of masking for different types of data. Some of the masking techniques that can be used include:
- Encryption
- Scrambling
- Shuffling
- Substitution
- Nulling Out
- Value Variance
- Date Aging
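Several of these techniques are simple enough to sketch in Python. The functions below are illustrative only: the names, the default ranges, and the interpretation of each technique are assumptions, and encryption and scrambling are left out for brevity.

```python
import random
from datetime import date, timedelta


def shuffle_column(values, rng):
    """Shuffling: keep the same set of values but reassign them to different rows."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def substitute(_value, replacements, rng):
    """Substitution: swap the real value for one drawn from a list of fakes."""
    return rng.choice(replacements)


def null_out(_value):
    """Nulling out: remove the value entirely."""
    return None


def vary_value(amount, rng, pct=0.10):
    """Value variance: shift a number by a random amount within +/- pct."""
    return round(amount * (1 + rng.uniform(-pct, pct)), 2)


def age_date(d, rng, max_days=30):
    """Date aging: shift a date by a random number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


rng = random.Random()  # pass a seed here if repeatable test data is needed

print(shuffle_column(["Ann", "Bob", "Cy"], rng))
print(substitute("Ann", ["Pat", "Lee", "Sam"], rng))
print(null_out("555-0100"))
print(vary_value(52000.00, rng))
print(age_date(date(1980, 6, 15), rng))
```

Each technique suits different data: shuffling preserves the overall distribution of a column, value variance and date aging keep numbers and dates plausible, and nulling out only works where the application tolerates missing values.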
Many data masking tools support several, or even all, of these techniques for different types of data and PII. Homegrown solutions typically implement basic masking using one or maybe only a few of the aforementioned techniques. Novices sometimes look at the problem and think it should be easy to mask or obfuscate data, but the requirements described earlier show that it is not.
A robust data masking tool will offer multiple algorithms that can be used out-of-the-box as delivered, or with user-implemented customizations to handle specific types of data and use cases.
For example, functions should be provided that can be used to generate names, addresses, credit card numbers, social security numbers, and so on. The tool should allow these functions to be applied to specific columns that contain PII, and mask the values with plausible data, but not the actual data. For example, credit card numbers pass validity checks, addresses have matching street names, zip codes, cities, and states, and so on.
It is common for a quality data masking tool to use hashing functions and lookup tables to generate names and addresses. But the hashing function must be non-invertible so the process cannot be easily reversed, and the lookup tables need to be thorough, protected, and available for any language you need to use.
Some database management systems offer basic data masking capabilities, so you should always investigate the native masking functionality before embarking on using a tool. But most DBMS functionality will be limited, such as just a way of displaying a different value based on a rule for a specific column. In other words, most of the data masking functionality built into the DBMS will be dynamic data masking with simple value replacement. This can be useful if you are just looking for a way to overwrite a payment card number with ####, but it is typically not a sufficient solution for protecting your PII.
You should be able to use the data masking tool to mask your data as it moves from one environment to another (such as from production to test for application development requirements), or to mask the data in place (such as in a test environment that already contains sensitive unmasked data).
Masked Data Is Protected Data — Make It a Priority
The goal should be to mask your sensitive data such that it works like the actual production data, but does not contain any actual data values (or any processing artifacts that make it possible to infer information about the actual data). Masked data is protected data. If it gets exposed, say via a data breach, it won’t matter, because it was not the actual data that was exposed. Thieves won’t be able to use it because the actual values were masked permanently, irreversibly, and in a way that the real value cannot be inferred.
With the continuing growth of data breaches, and the growing requirements of industry and governmental regulations, deploying data masking to protect your data should be a top priority.