Anonymization has long been promoted as a privacy safeguard: remove direct identifiers, coarsen location and time, or add noise, and the data can supposedly be shared safely.
A mid-2025 re-identification attack on the YJMob100K mobility dataset shows how fragile that assumption has become and provides a concrete demonstration of the risks of sharing human mobility data.
The Case: YJMob100K Mobility Dataset
Publisher: LY Corporation (formerly Yahoo Japan)
Dataset: YJMob100K – city-scale human mobility trajectories for 100,000 users over 75 consecutive days
Goal: Enable research in urban mobility while preserving individual privacy
Original Anonymization Measures
According to the paper’s authors (Mishra et al., 2025), the raw GPS traces were transformed before public release:
- Spatial coarsening: raw coordinates mapped to a 500 m × 500 m grid
- Temporal coarsening: timestamps rounded to 30-minute bins
- Context hiding: the dataset omitted the actual city name and the calendar dates
- Filtering: only users with sufficient data inside the anonymized bounding box were included
At first glance, these steps appeared to protect against direct identification.
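To make the transformations concrete, here is a minimal sketch of the kind of coarsening described above. The 500 m cell size and 30-minute bins come from the dataset description; the record format and function names are illustrative, not LY Corporation’s actual pipeline.

```python
from datetime import datetime

# Illustrative sketch of the published coarsening; the record format and
# function names are hypothetical, not LY Corporation's actual pipeline.
CELL_SIZE_M = 500    # 500 m x 500 m spatial grid
BIN_MINUTES = 30     # 30-minute temporal bins (48 slots per day)

def coarsen_point(x_m: float, y_m: float) -> tuple[int, int]:
    """Map projected coordinates (in metres) to a coarse grid cell."""
    return int(x_m // CELL_SIZE_M), int(y_m // CELL_SIZE_M)

def coarsen_time(ts: datetime, day0: datetime) -> tuple[int, int]:
    """Replace a timestamp with (relative day index, half-hour slot),
    which hides the calendar dates while keeping the sequence intact."""
    delta = ts - day0
    return delta.days, (delta.seconds // 60) // BIN_MINUTES
```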
The Attack: Breaking Anonymity at Scale
Mishra et al. demonstrated that these measures were not enough. They successfully re-identified both the hidden city (Nagoya) and the real calendar dates (Sept 15 – Nov 28 2019), and even showed that many individuals could be profiled by their inferred home and work locations.
Key attack components described in the paper:
Spatial inference with public data:
By correlating the anonymized grid’s population density with public census data for Japan’s ten largest cities, the researchers could align the dataset with real-world geography. Figure 2 in the paper shows that Nagoya had the highest Spearman correlation with the anonymized grid.
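A minimal sketch of this correlation step, assuming per-cell activity counts from the released grid and census population densities resampled onto a grid of the same shape for each candidate city; the input names and preprocessing are hypothetical, but `scipy.stats.spearmanr` is the standard tool for the rank correlation the paper reports.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_candidate_cities(grid_activity: np.ndarray,
                          census_by_city: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank candidate cities by how well their census population density
    matches the anonymized grid's spatial activity distribution.

    grid_activity: per-cell visit counts from the released dataset.
    census_by_city: census densities resampled onto a grid of the same
    shape for each candidate city (hypothetical preprocessing).
    """
    scores = []
    for city, census_density in census_by_city.items():
        rho, _pvalue = spearmanr(grid_activity.ravel(), census_density.ravel())
        scores.append((city, rho))
    return sorted(scores, key=lambda s: s[1], reverse=True)  # best match first
```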
Behavioral fingerprinting:
They examined recurring daily and weekly patterns, such as commuting peaks and weekend behavior, and matched them to known real-world cycles. This step was crucial for recovering the true weekday structure (five workdays followed by two weekend days).
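One plausible way to recover that structure is a brute-force search over the seven possible day-of-week alignments, scoring how cleanly each splits the series into five high-activity workdays and two low-activity weekend days. The offset search below illustrates the idea and is not necessarily the paper’s exact procedure.

```python
import numpy as np

def infer_weekend_offset(daily_volume: np.ndarray) -> int:
    """Search the seven possible day-of-week alignments and return the
    offset that maximizes the gap between mean workday activity and
    mean weekend activity (illustrative, not the paper's exact method)."""
    best_offset, best_gap = 0, -np.inf
    dow_base = np.arange(len(daily_volume))
    for offset in range(7):
        dow = (dow_base + offset) % 7          # treat 5 and 6 as weekend
        weekend = daily_volume[dow >= 5].mean()
        workday = daily_volume[dow < 5].mean()
        if workday - weekend > best_gap:
            best_gap, best_offset = workday - weekend, offset
    return best_offset
```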
Event-based temporal alignment:
Using distinctive anomalies, such as the sharp drop in mobility on Day 27 caused by Typhoon Hagibis on Oct 12, 2019, and spikes in activity during major events at Port Messe Nagoya (e.g., the Nagoya Motor Show), they inferred that the dataset started on Sept 15, 2019 and spanned 75 consecutive days.
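The alignment logic itself is simple set intersection: each pairing of a dataset day index with a known real-world event date implies one candidate start date, and independent anomalies must all agree. A sketch, assuming 0-based day indices:

```python
from datetime import date, timedelta

def candidate_start_dates(anomaly_days: dict[int, date]) -> set[date]:
    """Each (day index -> known real-world event date) pair implies one
    start date; independent anomalies must all agree, so intersect them.
    Day indices are assumed to be 0-based here."""
    candidates: set[date] | None = None
    for day_index, event_date in anomaly_days.items():
        implied = {event_date - timedelta(days=day_index)}
        candidates = implied if candidates is None else candidates & implied
    return candidates or set()

# Illustrative: a typhoon-driven mobility drop on day 27 matching
# 2019-10-12 implies a start date of 2019-09-15.
print(candidate_start_dates({27: date(2019, 10, 12)}))   # {datetime.date(2019, 9, 15)}
```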
Anchor-point leakage:
By applying a standard technique that identifies a user’s most common nighttime grid cell as “home” and most common daytime grid cell as “work,” they found that 67,342 users had unique home–work pairs, making them highly vulnerable to linkage attacks.
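A sketch of that anchor-point heuristic, assuming records of the form (day, slot, cell) with 48 half-hour slots per day; the night and working-hours slot windows are my assumption, not the paper’s exact cutoffs.

```python
from collections import Counter

# Assumed record format: (day, slot, cell) with 48 half-hour slots per
# day; the slot windows below are assumptions, not the paper's cutoffs.
NIGHT_SLOTS = set(range(44, 48)) | set(range(0, 14))   # ~22:00-07:00
WORK_SLOTS = set(range(18, 36))                        # ~09:00-18:00

def home_work_anchors(pings):
    """Most-visited night cell becomes 'home', most-visited working-hours
    cell becomes 'work': the standard anchor-point heuristic."""
    night = Counter(cell for _day, slot, cell in pings if slot in NIGHT_SLOTS)
    work = Counter(cell for _day, slot, cell in pings if slot in WORK_SLOTS)
    home_cell = night.most_common(1)[0][0] if night else None
    work_cell = work.most_common(1)[0][0] if work else None
    return home_cell, work_cell
```

A (home, work) cell pair that is unique among all 100,000 users is effectively a fingerprint: any auxiliary source revealing where someone lives and works can be joined against it.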
Results
The researchers concluded that the YJMob100K anonymization was insufficient. Despite removing explicit identifiers and obscuring spatial and temporal reference points, the dataset retained enough structure to enable:
- City-level re-identification with near-perfect correlation
- Exact calendar-date reconstruction
- High-confidence inference of sensitive attributes like home and workplace
Why Conventional Anonymization Breaks Down
The study reinforces several lessons:
- Linkage attacks remain powerful: public records, event calendars, and environmental disruptions provide auxiliary signals that can link anonymized data back to real-world entities.
- Human mobility is uniquely identifying: as earlier work by de Montjoye et al. (Unique in the Crowd, 2013) showed, just a few spatio-temporal points can often single out an individual; the sketch after this list illustrates the test.
- Residual structure leaks signal: even after coarsening or shifting, patterns such as commuting corridors, weekend vs weekday cycles, and activity spikes at specific venues survive and can be exploited.
- Utility–privacy trade-off is fundamental: to preserve analytical usefulness (e.g., for transportation or epidemiological studies), datasets usually retain precisely the structural patterns that enable re-identification.
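To see why a few points suffice, the sketch below estimates uniqueness directly: for each user, it samples k spatio-temporal points from their trajectory and checks whether any other user’s trajectory also contains all of them. The (day, slot, cell) record format matches the earlier sketches and is illustrative.

```python
import random

def uniqueness_given_k_points(trajectories: dict[str, set], k: int = 4,
                              seed: int = 0) -> float:
    """Estimate the fraction of users uniquely identified by k randomly
    drawn points from their own trajectory (the 'Unique in the Crowd'
    style test). trajectories maps user id -> set of (day, slot, cell)."""
    rng = random.Random(seed)
    population = list(trajectories.values())
    unique = evaluated = 0
    for points in population:
        if len(points) < k:
            continue
        evaluated += 1
        sample = set(rng.sample(sorted(points), k))
        matches = sum(1 for other in population if sample <= other)
        unique += (matches == 1)   # only the user themselves matches
    return unique / evaluated if evaluated else 0.0
```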
Implications for Compliance and Privacy Programs
Organizations handling mobility or other high-dimensional behavioral data should recognize that simply “anonymizing” records rarely guarantees that the data is no longer personal data under regulations such as GDPR, HIPAA, CCPA, or APPI.
Recommended mitigations include:
- Apply true differential privacy (DP): allocate a bounded privacy budget (ε) and avoid releasing individual-level trajectories (a minimal sketch of a DP count query follows this list).
- Consider synthetic data: use generative models to capture macro-level statistics without preserving actual traces.
- Provide controlled, audited API access instead of static releases: answer queries with DP-protected aggregates.
- Use advanced privacy-preserving computation: for highly sensitive analytics, employ Fully Homomorphic Encryption (FHE), Secure Multi-Party Computation (MPC), or Trusted Execution Environments (TEEs) to enable computation on protected data without revealing raw inputs.
- Remove or truncate outlier trajectories: highly distinctive patterns are the most vulnerable to re-identification.
- Perform adversarial red-team assessments: attempt internal re-identification before public release to measure risk.
- Enforce purpose limitation and minimize retention: reduce exposure by collecting only what is necessary and keeping it only as long as required.
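As a minimal illustration of the differential-privacy recommendation above, the sketch below answers a single counting query with the Laplace mechanism; the example count, query, and ε value are illustrative.

```python
import numpy as np

_rng = np.random.default_rng()

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy: a counting
    query has sensitivity 1, so Laplace noise with scale 1/epsilon
    yields an epsilon-DP answer."""
    return true_count + _rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Publish how many users visited a given cell in a given slot, rather
# than the trajectories themselves. Repeated queries consume budget, so
# a real deployment must track cumulative epsilon across all releases.
print(dp_count(true_count=1375, epsilon=0.5))   # noise scale b = 2.0
```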
The Compliance Takeaway
For compliance officers and data-sharing stakeholders, the YJMob100K case is a reminder that:
- De-identification is not a safe harbor.
- Data that retains enough structure for useful analytics often retains enough structure for re-identification.
To maintain compliance and safeguard trust, organizations must go beyond classic anonymization and combine strong privacy-preserving mechanisms with robust governance, technical controls, and contractual safeguards.
About DataHubz: DataHubz is an AI-powered compliance automation platform helping organizations achieve and maintain compliance with leading cybersecurity and privacy frameworks. Our platform continuously tracks evolving privacy challenges, including re-identification risks, to keep you ahead of threats.