A database stores a lot of information. There is the information we explicitly put in there. However, an even bigger wealth of information can be found in the relationships between the records. If you select a product at Amazon, you will be presented with a list of "related" products: "Other people who viewed X also viewed these" and "Frequently Bought Together". That information was not directly stored in the data, but rather materialized by looking at the complete set of all records in the database.
This information is worth a lot of money. In 2006, Netflix launched a competition with the goal of improving its suggestion algorithm by just ten percent. This algorithm suggests movies that customers will probably like. That small improvement was worth one million dollars to Netflix at the time. (It took almost three years before a team reached the goal.) While that story is interesting in itself, I mention it here because it shows how much value is hidden "between the lines" of your data.
If a company goes after hidden information in their own data, for example to gain a competitive edge, we call the process data mining. However, similar processes can be used to reveal information to a person who is not supposed to have access to that information. If used in a malicious context, this same process is called data inference.
Through data inference, "a competitor or adversary may be able to use data that in isolation appears to be properly protected to infer data that is highly sensitive" (Hinke et al., 1997, p. 1).
For example, if the adversary has legitimate access to a factory's purchase history, a sudden spike in the purchasing of a particular material can show that a new product is about to be produced. Famous examples of this type of information gathering appear in the press in the weeks before Apple announces a new product. Indicators commonly used for Apple products are the availability of the prior generation of the new product, or even the availability of shipping space between China and the US. (Apple is known to buy up large portions of the entire air shipping capacity for the days before, and directly following, an announcement.)
Preventing data inference is extremely difficult. If you allow access to aggregated salary data, for example, the salary of a single person could be inferred by using very selective filter criteria. You can prevent this particular scenario by returning results only if the filtered row set contains at least a set minimum number of records, say five.
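To make the minimum-row-count idea concrete, here is a minimal sketch in Python. It is an illustration only, not a complete defense: the names EMPLOYEES, MIN_GROUP_SIZE and average_salary are made up for this example, and the data is invented. The function returns an average only if enough records match the filter, and returns nothing otherwise.

```python
from statistics import mean

# Hypothetical threshold, matching the "say five" from the text above.
MIN_GROUP_SIZE = 5

# Invented sample data for illustration only.
EMPLOYEES = [
    {"name": "Ann", "dept": "Sales", "salary": 50000},
    {"name": "Bob", "dept": "Sales", "salary": 52000},
    {"name": "Cem", "dept": "Sales", "salary": 54000},
    {"name": "Dee", "dept": "Sales", "salary": 56000},
    {"name": "Eve", "dept": "Sales", "salary": 58000},
    {"name": "Fay", "dept": "Legal", "salary": 90000},
]


def average_salary(rows, predicate, min_group_size=MIN_GROUP_SIZE):
    """Return the average salary of the rows matching predicate.

    Returns None when fewer than min_group_size rows match, so that a
    very selective filter cannot be used to read a single salary.
    """
    salaries = [row["salary"] for row in rows if predicate(row)]
    if len(salaries) < min_group_size:
        return None  # too few records: suppress the aggregate
    return mean(salaries)


# Five people work in Sales, so this aggregate is released.
sales_average = average_salary(EMPLOYEES, lambda r: r["dept"] == "Sales")

# Only one person works in Legal, so this aggregate is suppressed.
legal_average = average_salary(EMPLOYEES, lambda r: r["dept"] == "Legal")
```

In a real system the same check would more likely live in the database layer, for example in a view or stored procedure, so that callers cannot bypass it. Where exactly it belongs depends on how your application reaches the data.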
The good thing about inference is that it requires deep domain knowledge to be executed successfully. But you need a similarly deep understanding of your data relationships to be able to prevent it.
While it is difficult to execute, data inference is still one of the most common ways to exploit a vulnerable database. Therefore, you need to review your access permissions regularly. If you store information that is not supposed to be accessible by all users of the system, spend some extra time thinking about how people with restricted access could piece together secret information not intended for their eyes. Naturally, after identifying a gap, you need to close it. The time to start acting on this is now. But protecting against data inference is also an ongoing process. You are not done after a single system review.
Data inference is one of the most commonly encountered database vulnerabilities. In this series of posts, I discuss 10 of them. Below are the ones that are published so far: