Anonymizing smartphone data is no longer enough -- users can be identified with just a few details

Smartphone companies have access to vast amounts of user data. They assure us that this data is anonymized, stripped of personal indicators that could pinpoint individual users. But these assurances are hollow, a new study claims: a skilled attacker can identify individuals in anonymous datasets.

When the pandemic started and lockdowns were enforced, the world seemed to grind to a halt. You could see that easily just by looking around, but the data also confirmed it. For instance, mobility trends published by the likes of Apple and Google showed that a significant part of the population had stopped commuting to work, and that people were increasingly using cars instead of public transit.

At first, users were understandably spooked by the data. Do tech companies know where I go and what I do? That’s not how it works, the companies assured us. The data is anonymized: they know a user went somewhere and did something, but they don’t know who that user is. Other apps also scoop up vast quantities of data from your smartphone, for ad targeting or other purposes, though in many cases they are legally required to anonymize it by removing identifying details like names and phone numbers.

But that’s no longer enough. With just a few details (for instance, how they communicate in an app like WhatsApp), researchers were able to identify many users in anonymized data. Yves-Alexandre de Montjoye, associate professor at Imperial College London and one of the study authors, told AFP it’s time to “reinvent what anonymisation means”.

What is anonymous?

The researchers started by looking at anonymized data from around 40,000 smartphone users, mostly gathered from messaging apps. They then “attacked” the data, mimicking what a malicious actor would do. Essentially, this involved searching for patterns in the data to work out who individual users are.

With only the direct contacts included in the dataset, they were able to pinpoint individual users 15% of the time. When, in addition, further interactions between those primary contacts were included, they were able to identify 52% of the users.
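To make the idea concrete, here is a toy sketch of how interaction patterns alone can single out a user. It is not the method used in the study: the nearest-profile matching, the function names (profile, reidentify) and the tiny dataset are all assumptions made up for illustration. Each pseudonymous user is reduced to a list of message counts per contact, with no names attached, and the attacker looks for the profile closest to what they already know about the target.

```python
from math import sqrt

def profile(contact_counts, size=5):
    """Turn per-contact message counts into a fixed-length vector with no identities."""
    top = sorted(contact_counts, reverse=True)[:size]
    return top + [0] * (size - len(top))

def distance(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reidentify(target_counts, anonymized):
    """Return the pseudonym whose profile is closest to the target's known profile."""
    target = profile(target_counts)
    return min(anonymized, key=lambda pid: distance(target, profile(anonymized[pid])))

# Hypothetical anonymized dataset: pseudonym -> message counts per (unnamed) contact.
anonymized = {
    "user_a91": [40, 12, 3],
    "user_7f2": [8, 7, 7, 6, 5],
    "user_c05": [120, 2],
}

# What the attacker observed about the target in a different time window (assumed).
target_counts = [115, 3, 1]
match = reidentify(target_counts, anonymized)
print(match)
```

Even in this simplified form, a user who messages one contact far more than anyone else stands out without a name or phone number ever appearing in the data.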

This doesn’t mean that we should give up on anonymization, the researchers explain. However, we should hold anonymization to a stricter standard, making sure that the data is indeed anonymous.

“Our results provide evidence that disconnected and even re-pseudonymised interaction data remain identifiable even across long periods of time,” the researchers wrote. “These results strongly suggest that current practices may not satisfy the anonymisation standard set forth by (European regulators) in particular with regard to the linkability criteria.”

“Our results provide strong evidence that disconnected and even re-pseudonymized interaction data can be linked together,” the researchers conclude.

Researchers suggest restricting access to large datasets to simple question-and-answer systems, or using differential privacy systems that add calibrated random noise to protect individual users.
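As a rough illustration of the differential privacy idea, the sketch below answers a counting question with Laplace noise added, a standard textbook mechanism. The article does not say which mechanism the researchers have in mind, so the choice of mechanism, the function names (laplace_noise, private_count), the epsilon value and the sample records are all assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Answer 'how many records match?' with noise instead of the exact count."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one user changes a count by at most 1 (sensitivity 1),
    # so the noise scale is 1 / epsilon. Smaller epsilon means more noise.
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical records; only aggregate questions are answered, never raw rows.
rng = random.Random(0)
records = [{"daily_messages": n} for n in (3, 50, 12, 80, 7, 41)]
noisy = private_count(records, lambda r: r["daily_messages"] > 20, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

The analyst still gets a useful approximate answer, but no single user's presence or absence in the dataset changes the result enough to be detected reliably.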

The study was published in Nature Communications.