by Kiran Wattamwar
The GDPR follows the 1995 EU Data Protection Directive (DPD), its consumer privacy predecessor. The Directive came before smartphones, connected devices and AI assistants, and before the Internet reshaped social norms. With strict controls and a nearly global reach over affected companies, the GDPR may lead to an enduring higher standard for data privacy and protection.
In the aftermath of data scandals, from Equifax to Cambridge Analytica, we are reminded of the value of personal data. It is worth considering the material impact the GDPR will have on consumers as well as on software and data companies – the modern institutions of memory.
A deep dive – the right to be forgotten
The GDPR offers several provisions, but their implementation requirements remain ambiguous. One example stems from the challenges of data deletion, a requirement for implementing the “right to be forgotten.”
A provision from Article 17 of the GDPR, it allows EU consumers to request that online search results for their full name that are irrelevant, inadequate, or incorrect be removed. Companies like Google are required to “erase” and “rectify” personal data for valid requests “without undue delay,” although the regulation is ambiguous on what is valid, and to what extent erasure should be completed. With ambiguity embedded in nearly every requirement of this provision, how do companies implement it?
Implementing the right to be forgotten
What happens when a company decides to move forward with an erasure claim? Imagine your data is stored with Company A, a large corporation that handles global data at scale. You file a claim with Company A because this information is inaccurate, and the claim is found to be valid. Company A would need to (1) identify the relevant data and (2) delete it. How do these steps actually work (Figure 1)?
(1) Tracking data (and its trails)
If a claim is valid, Company A must identify any information linked to this case (Figure 2). If Company A is a social network that tags people in pictures, and a claim is filed on an image, it is clear that the image needs to be taken down. But what about the other information linked to the image? Activity like tags, likes and comments on the image might also be removed. Behind the scenes, Company A processed the picture with machine learning to generate tags. In doing so, it stored computational data derived from the image. Whenever the image was loaded, uploaded or altered, it stored logs in several places to keep track of system health, leaving trace amounts of image-related data in millions of log files over time. Should this data also be identified and removed? The extent to which the right to be forgotten applies to personal data is unclear.
Figure 2. The many forms of objectionable data. Data linked to a claim can exist in several forms, and formally defining what constitutes “linked data” remains challenging. Some examples of linked data are user-generated data (a picture or information a user might supply themselves), platform-generated data (likes, comments, information related to the use of the source data), derived data (data generated as a result of processing the source data), and log data (information pertaining to the storage and access of the source data).
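To make the idea of “linked data” concrete, here is a minimal Python sketch of how a company might trace everything connected to one image. It is an illustration only: the Record type, the derived_from links and every identifier below are hypothetical, and it assumes that each piece of data explicitly records what it was created from, which real systems often do not.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    USER_GENERATED = "user-generated"          # e.g. the uploaded picture
    PLATFORM_GENERATED = "platform-generated"  # e.g. likes, comments, tags
    DERIVED = "derived"                        # e.g. machine-learning output
    LOG = "log"                                # e.g. storage and access logs


@dataclass
class Record:
    record_id: str
    kind: Kind
    # Hypothetical assumption: every record lists the records it came from.
    derived_from: list[str] = field(default_factory=list)


def find_linked(records: list[Record], source_id: str) -> dict[Kind, list[str]]:
    """Collect every record reachable from source_id, grouped by kind."""
    # Invert the links so we can walk from a source to what depends on it.
    children: dict[str, list[Record]] = {}
    for rec in records:
        for parent in rec.derived_from:
            children.setdefault(parent, []).append(rec)

    by_id = {rec.record_id: rec for rec in records}
    found: dict[Kind, list[str]] = {kind: [] for kind in Kind}
    seen = {source_id}
    queue = deque([source_id])
    while queue:  # breadth-first walk over the links
        current = queue.popleft()
        if current in by_id:
            found[by_id[current].kind].append(current)
        for child in children.get(current, []):
            if child.record_id not in seen:
                seen.add(child.record_id)
                queue.append(child.record_id)
    return found


records = [
    Record("image-1", Kind.USER_GENERATED),
    Record("like-7", Kind.PLATFORM_GENERATED, ["image-1"]),
    Record("comment-3", Kind.PLATFORM_GENERATED, ["image-1"]),
    Record("face-tags-1", Kind.DERIVED, ["image-1"]),
    Record("access-log-42", Kind.LOG, ["image-1"]),
    Record("tag-notification-9", Kind.PLATFORM_GENERATED, ["face-tags-1"]),
    Record("image-2", Kind.USER_GENERATED),  # unrelated picture
]

linked = find_linked(records, "image-1")
```

In this toy example the search follows links more than one step deep: the notification was generated from the machine-learning tags, not from the image itself, yet it is still found. The unrelated picture is left alone. Whether a real claim should reach that far is exactly the open question.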
Identifying relevant data is half the process. Finding it in Company A’s infrastructure to carry out deletion is next. When you provide your data to Company A, it will probably get stored across more than one database. The record of your data might be replicated across multiple disks – a process known as mirroring. Mirroring might sound redundant, but it’s what allows companies to retrieve information quickly, especially when they are managing huge datasets at scale. Mirroring also creates backups, in case other copies are lost or corrupted. After mirroring, your data could potentially be stored in different cities, states, or countries. Because companies themselves might be operationally fragmented, databases might not be owned by a single department or maintained in a streamlined way.
Because maintaining redundant and quickly accessible data is expensive, Company A might shuffle your information into different places depending on how frequently it needs to be accessed. Old pictures, for example, may not be accessed for months or years while recent emails might need quick retrieval. The old pictures may be consigned to “cold backups,” which are held safely offline but cannot be accessed 24/7. The newer data may be kept online in “hot” or “warm” backups, where it’s easier to pull up efficiently but costlier to store and maintain. As a result, all of these backup repositories must also be searched to find every trace of a given piece of personal data.
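The search across mirrors and backups can be pictured with a small Python sketch. Everything in it is made up for illustration: the Store type, the store names, the cities and the idea that each store can simply report which items it holds. It only shows why one item can turn up in several places, some of which cannot be reached right away.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """One hypothetical place where Company A keeps copies of data."""
    name: str
    tier: str        # "hot", "warm" or "cold"
    region: str
    online: bool     # cold backups are held offline
    keys: frozenset  # identifiers of the items stored here


def locate(stores, key):
    """Return the names of stores holding key: (reachable now, offline)."""
    reachable = [s.name for s in stores if key in s.keys and s.online]
    deferred = [s.name for s in stores if key in s.keys and not s.online]
    return reachable, deferred


stores = [
    Store("primary-db", "hot", "Dublin", True, frozenset({"image-1", "email-9"})),
    Store("mirror-db", "hot", "Virginia", True, frozenset({"image-1", "email-9"})),
    Store("archive", "warm", "Frankfurt", True, frozenset({"image-1"})),
    Store("tape-vault", "cold", "Oregon", False, frozenset({"image-1"})),
    Store("analytics-db", "hot", "Dublin", True, frozenset({"email-9"})),
]

reachable, deferred = locate(stores, "image-1")
```

Here the image lives in three online stores and one offline vault. The online copies could be dealt with immediately, while the copy in the cold backup would have to wait until that backup is brought online.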
(2) Forgetting digital memories
Once relevant personal data is identified and located, there are multiple types of deletion that can be performed on it (Figure 3). One common form is termed soft deletion. This is what happens when you delete a file from your Recycle Bin (or Trash Bin if you’re an Apple person) on a personal computer. Databases work somewhat like a book with a table of contents. They maintain a record of where to find data, and then in a separate location, store the data itself. Soft deletes remove that data’s location from the table of contents (a “pointer” to the data), but do not actually get rid of the data itself. Instead, the space the data occupies is marked as free to be overwritten when new data needs to be stored in its place. While one advantage of this process is that it is reversible, the data is not actually deleted and may remain indefinitely. In contrast, “hard deletes” go one step further by overwriting the data with random information, multiple times, to make sure that it is truly unrecoverable.
Figure 3. Data deletion. Here’s a quick look at deleting “image.jpg” from our system. With no deletion, both the location of the image in the table of contents and the image itself in the pages are preserved. With a soft delete, we lose the location of the image data, but the image still technically exists in the pages. Finally, with hard deletion, both the location and the image data are lost.
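The difference between the two kinds of deletion can be sketched in a few lines of Python. This is a toy model held entirely in memory, not how any particular database or file system works: the ToyStore class, its index (the “table of contents”) and its pages are all invented for illustration, and the three overwrite passes are an arbitrary choice.

```python
import os


class ToyStore:
    """A toy table of contents (index) plus raw pages (one bytearray)."""

    def __init__(self, size: int = 64) -> None:
        self.pages = bytearray(size)
        self.index: dict[str, tuple[int, int]] = {}  # name -> (offset, length)
        self.next_free = 0

    def write(self, name: str, data: bytes) -> None:
        start = self.next_free
        if start + len(data) > len(self.pages):
            raise ValueError("not enough space in the toy store")
        self.pages[start:start + len(data)] = data
        self.index[name] = (start, len(data))
        self.next_free += len(data)

    def soft_delete(self, name: str) -> None:
        # Remove only the pointer; the bytes stay in the pages.
        del self.index[name]

    def hard_delete(self, name: str, passes: int = 3) -> None:
        # Remove the pointer, then overwrite the bytes with random data.
        start, length = self.index.pop(name)
        for _ in range(passes):
            self.pages[start:start + length] = os.urandom(length)


image = b"PIXELS-OF-IMAGE!"

soft = ToyStore()
soft.write("image.jpg", image)
soft.soft_delete("image.jpg")
soft_still_there = image in soft.pages  # pointer gone, bytes remain

hard = ToyStore()
hard.write("image.jpg", image)
hard.hard_delete("image.jpg")
hard_still_there = image in hard.pages  # pointer gone, bytes overwritten
```

After the soft delete, the name “image.jpg” can no longer be looked up, but anyone who scans the pages directly would still find the image. After the hard delete, neither the name nor the original bytes remain.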
What does this all mean?
While this provision of the GDPR can strengthen users’ rights, it may also pave the way for dubious claims to be made against software and data companies. Identifying whether content violates a user’s rights is complex, and how platforms will react to claims remains uncertain.
Companies across industries handling personal data will be affected in different ways. For example, financial services, which handle personal data constantly, will need to maintain a legally defensible strategy for data maintenance. Advertising agencies, which don’t directly interact with the consumers they collect data from, will need direct consent to collect and use personal data. Small companies will need to establish their own strategies to remain compliant, while large corporations may need to devote more resources to their own information governance.
Institutions of memory continue to serve as the gatekeepers of information we interact with in our digital world. Unlike human memory, digital memory sediments and collects over time. Managing this information has become an increasingly important and uncharted task. The recent advent of the EU Privacy Shield and the GDPR frames a growing question: who owns our digital information?
Kiran Wattamwar is a Master’s student at the Harvard University Graduate School of Design.
For more information:
- To learn more about reputation bankruptcy, check out this article
This article is part of the 2018 Special Edition — Tomorrow’s Technology: Silicon Valley and Beyond