TY - GEN
T1 - The effects of applying cell-suppression and perturbation to aggregated genetic data
AU - Antoniades, Athos
AU - Keane, John
AU - Aristodimou, Aristos
AU - Philipou, Christa
AU - Constantinou, Andreas
AU - Georgousopoulos, Christos
AU - Tozzi, Federica
AU - Kyriacou, Kyriacos
AU - Hadjisavvas, Andreas
AU - Loizidou, Maria
AU - Demetriou, Christiana
AU - Pattichis, Constantinos
PY - 2012
Y1 - 2012
N2 - The key test for confidence in any association discovered within the medical domain is replication testing, that is, the ability of the association to be detected in independent populations. At the same time, to increase the likelihood of discovering statistically significant associations, there is a clear need to increase the statistical power of any given study. A key way to increase statistical power is to include as many subjects as possible that match a study's inclusion criteria. Thus many have attempted to merge data from multiple independent sources, sites or studies that share the same inclusion criteria for subjects, as a way of creating a much larger study with significantly more statistical power. For these approaches to work, though, data from multiple sites need to be made available to a single analysis. This practice is significantly limited by the need to respect legal and ethical requirements that are often complicated, ambiguous and inconsistent across different countries. The common approach to merging data is to share aggregated data rather than subjects' personal data. Aggregated data, however, may still in some cases be reverse engineered; traditionally, therefore, cells with small values within the aggregated data were suppressed, and some or all of the aggregated data were perturbed to add noise, inhibiting any attempt to identify personal information about a specific person or sub-group in the original data. In this paper we study the effects of cell-suppression and perturbation on the results of the data analysis. Each approach is examined on its own as well as in combination, using the typical settings documented in the literature. The tests are based on a real dataset used to look for associations between phenotypes and genetic markers.
This work is part of the Linked2Safety project that aims to dynamically interconnect distributed patients' data to better enable medical research efforts, whilst respecting patients' anonymity, as well as European and national legislation.
KW - Aggregated Data
KW - Anonymisation
KW - Cell-suppression
KW - Noise
KW - Perturbation
UR - http://www.scopus.com/inward/record.url?scp=84872858404&partnerID=8YFLogxK
U2 - 10.1109/BIBE.2012.6399777
DO - 10.1109/BIBE.2012.6399777
M3 - Conference contribution
AN - SCOPUS:84872858404
SN - 9781467343589
T3 - IEEE 12th International Conference on BioInformatics and BioEngineering, BIBE 2012
SP - 644
EP - 649
BT - IEEE 12th International Conference on BioInformatics and BioEngineering, BIBE 2012
T2 - 12th IEEE International Conference on BioInformatics and BioEngineering, BIBE 2012
Y2 - 11 November 2012 through 13 November 2012
ER -