Practical tools for quantitative deidentification and return of results


Author(s): Lauren Marie Harmon, Wanding Zhou, Samantha Lent, Tim Triche

Affiliation(s): Van Andel Institute



Epigenomic data, whether from chromatin immunoprecipitation assays, DNA methylation microarrays, or protocols such as ATAC-seq, occupy a continuum from completely deidentified (and completely opaque, often to study participants as well as investigators) to completely identifiable (as with release of raw sequencing reads). DNA methylation arrays are particularly nettlesome: manufacturers typically include high minor allele frequency SNP probes to help detect sample swaps, but because the arrays are implemented as a specific type of genotyping array, they also incidentally detect genetic variation at the cytosines assayed for methylation. This becomes more problematic as atlas-scale data are released for minimally invasive assays and liquid biopsy applications; DNA is a durable molecule with remarkable stability. Any data that provide locus-level results are potentially reidentifiable, while existing deidentification schemes hamper return of results to participants, disincentivizing participation and consent. Most position papers addressing this situation have proposed either siloed access (with little or no vetting of researchers, leading to disastrous results such as the UKBB sexual preferences debacle) or fully open access to raw data (which is not only incompatible with the GDPR, but also reckless towards participants). These issues are not unique to epigenomic assays. However, despite the ready availability of public-key cryptographic schemes routinely used in financial and commercial settings (among others), we find no quantitative assessments in the literature of recoverable stochastic noise injection with key exchange provisions. The `sesame` and `rehash` packages provide a toolchain suitable for both quantitative deidentification (uniquely testable in subject-versus-tissue signal decomposition) and public-key-based reidentification for return of results.
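To illustrate what recoverable stochastic noise injection could look like, the following minimal Python sketch masks integer probe intensities with noise derived from a secret key and then removes that noise exactly. It is not the `sesame` or `rehash` interface. The function names (`keyed_noise`, `mask`, `unmask`), the 16-bit intensity range, the HMAC-SHA256 keystream, and the sample identifier are all assumptions made for this example. In a real exchange the secret key would presumably be wrapped under the participant's public key; that step is omitted here.

```python
import hashlib
import hmac
from typing import List

MODULUS = 1 << 16  # assumption: intensities are unsigned 16-bit integers


def keyed_noise(key: bytes, sample_id: str, n: int, amplitude: int) -> List[int]:
    """Derive n integers in [-amplitude, amplitude] from the key (hypothetical helper).

    HMAC-SHA256 is run in counter mode; each 4-byte word yields one noise value.
    The modulo step introduces a slight bias, acceptable for a sketch only.
    """
    span = 2 * amplitude + 1
    out: List[int] = []
    counter = 0
    while len(out) < n:
        message = f"{sample_id}:{counter}".encode()
        block = hmac.new(key, message, hashlib.sha256).digest()
        for i in range(0, len(block), 4):
            if len(out) == n:
                break
            word = int.from_bytes(block[i:i + 4], "big")
            out.append(word % span - amplitude)
        counter += 1
    return out


def mask(values: List[int], key: bytes, sample_id: str, amplitude: int) -> List[int]:
    """Add keyed noise modulo 2**16 so the operation is exactly invertible."""
    noise = keyed_noise(key, sample_id, len(values), amplitude)
    return [(v + e) % MODULUS for v, e in zip(values, noise)]


def unmask(masked: List[int], key: bytes, sample_id: str, amplitude: int) -> List[int]:
    """Regenerate the same noise from the key and subtract it."""
    noise = keyed_noise(key, sample_id, len(masked), amplitude)
    return [(v - e) % MODULUS for v, e in zip(masked, noise)]


# Illustrative values only; not taken from any real array.
key = b"hypothetical-participant-key"
intensities = [1200, 65535, 0, 30211, 42]
released = mask(intensities, key, "sample-001", amplitude=500)
assert unmask(released, key, "sample-001", amplitude=500) == intensities
```

Modular integer arithmetic is used because floating-point addition and subtraction do not round-trip exactly. One simplification to note: values near the ends of the range can wrap around, which a calibrated scheme would need to handle explicitly.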
We have implemented tools for stochastic randomization of EWAS and cancer genomics data, which can operate on data formats ranging from raw IDAT files and sequencing reads to summarized intensity ratios and blocked modification rates. Previous work has suggested excluding reidentifiable information from such data. Given the tens of thousands of human subject results already deposited in NCI, NHGRI, ENA, and GEO, this implies either attempting to recall all such data ever deposited, or injecting enough noise into existing raw and processed data to deidentify them while still permitting replication of results, with keyed exchange for unaltered return of results. We propose that the latter solution is both more practical and more appropriate for virtually all genomic data currently deposited. It can be extended from mRNA expression assays (where eQTLs and so-called 'memory genes' provide a degree of reidentifiability) to whole-genome sequencing (where we extend existing efforts relying upon stochastic approximation and 'digital twins' to provide a more general and quantitatively calibrated scheme for secure lossless encoding of the original data, suitable for direct return to participants). Our results provide a means to balance replication of experimental results with participant privacy while enabling return of results in a secure and traceable fashion, without resource-intensive immutable logs (chains) or opaque schemes based upon nonlinear encoding. Experimenters and clinicians can specify an anticipated risk of reidentification at a given level of noise injection, much as a clinical trial can specify anticipated statistical power at a given effect size, and both IRBs and (if approved) participants providing informed consent can base their decisions upon the specified levels of expected risk.
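The analogy to power analysis can be made concrete with a small simulation. The sketch below is a hypothetical Monte Carlo estimate, not the authors' method: it builds a synthetic reference panel of genotype-like probe signals, releases a noisy copy of each profile, and counts how often nearest-neighbour matching still finds the right subject. The function name `simulate_reid_rate`, the three-level signal (0, 0.5, 1), the Gaussian noise model, and the squared-distance matching rule are all assumptions chosen for illustration.

```python
import random


def simulate_reid_rate(n_subjects=200, n_snps=40, noise_sd=0.1, seed=0):
    """Estimate the fraction of subjects reidentified at a given noise level.

    Assumptions (illustrative only): each subject has a genotype-like signal
    of 0, 0.5, or 1 at each SNP-sensitive probe, the attacker holds the
    noise-free reference panel, and matching is by smallest squared distance.
    """
    rng = random.Random(seed)
    reference = [
        [rng.choice((0.0, 0.5, 1.0)) for _ in range(n_snps)]
        for _ in range(n_subjects)
    ]
    hits = 0
    for i, profile in enumerate(reference):
        # Released profile = true profile plus injected Gaussian noise.
        released = [v + rng.gauss(0.0, noise_sd) for v in profile]
        dists = [
            sum((a - b) ** 2 for a, b in zip(released, ref))
            for ref in reference
        ]
        # Count a hit when the true subject is (one of) the closest matches.
        if dists[i] <= min(dists):
            hits += 1
    return hits / n_subjects


# Tabulate estimated risk against noise level, as one would tabulate
# power against effect size.
for sd in (0.0, 0.25, 0.5, 1.0, 2.0):
    rate = simulate_reid_rate(n_subjects=100, n_snps=20, noise_sd=sd)
    print(f"noise_sd={sd:<4} estimated reidentification rate={rate:.2f}")
```

With no injected noise every subject matches their own reference profile, so the estimated rate is 1.0. How quickly the rate falls as `noise_sd` grows depends on the cohort size, the number of informative probes, and the attacker model, all of which would need to be calibrated against real data.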
Moreover, the testable rates of recovery before and after noise injection in large cohorts of rare disease patients provide a quantifiable benchmark for improvements in encoding schemes with guaranteed, cryptographically secured provisions for return of lossless results to the participant upon indemnification.