Recently, we learned about a project that shows how the principles of Knowledge Sharing can be applied to the scientific domain, specifically to genomics data. DNAdigest is a not-for-profit organisation based in Cambridge, UK, founded by a group of individuals from diverse backgrounds who all want to see genomics used to its full potential to aid medical research. The objective of DNAdigest is to provide a simple, secure and effective mechanism for sharing genomics data for research without compromising the data privacy of the individual contributors.
From the beginning, this concept sounded very appealing to us. That’s why we contacted Fiona Nielsen, founder of this great initiative, to talk about the goals of the project, its approach to making use of such sensitive data and the current status of data sharing within the scientific community.
DNAdigest is still in development but already shows a promising future. Not only have they been selected for the Wayra UnLtd accelerator programme for social entrepreneurs, they are also working hard on building a community around the idea, organising events like hack days and workshops. Since no one can describe the project better than its creator, we invite you to discover more about it through the following sequence of questions and answers.
1) Fiona, could you first introduce yourself and DNAdigest?
I am a bioinformatics scientist turned entrepreneur. I used to work in a biotech company where I was developing tools for interpretation of next-generation sequencing data and I took part in a number of projects where I was doing the data analysis of cancer sequencing samples. During my work, I realised how difficult it is to find and get access to genomics data for research.
DNAdigest was founded as an entity to provide a novel mechanism for sharing of data, aligning the interests of patients and researchers through a data broker mechanism, enabling easy access to anonymised aggregated data.
2) Why is it important to share genomics data? Quoting your website, the current state of sharing this information is embarrassingly limited. How does DNAdigest address this problem?
The human genome is very complex. It is made up of 3 billion base pairs and varies from individual to individual, so trying to nail down the genetic variation that causes a genetic disease is like looking for a needle in a haystack. The only way to narrow your search is to filter out genetic variation that has been seen before in healthy individuals and to annotate the remaining variation with the disease(s) in which it occurs. This type of comparative analysis requires looking at variants from as many samples as possible. Ideally, you need to compare against tens of thousands of samples for the comparison to approach statistical significance. Accessing thousands of samples today is difficult in terms of permissions, and it is also impractical in terms of storage and network capacity for every team that wants to do a comparison to download huge datasets. DNAdigest is developing a data broker that will allow researchers to submit queries for specific variants and will return only the aggregated information about the selected variants. For example, when examining a specific mutation in cancer, the query could be “what is the frequency of this mutation in cancer samples?” and the result would be returned as a frequency, e.g. 3%. The aim of DNAdigest is to reduce the time needed to discover, access and retrieve the data relevant to genomic comparison.
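As an editorial illustration of the kind of query described above, here is a minimal Python sketch of a frequency lookup. It is not DNAdigest's implementation: the `Sample` type, the `variant_frequency` function, the phenotype labels and the variant identifier are all hypothetical names chosen for this example, and the data is invented.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: one sequenced sample and the variants found in it."""
    sample_id: str
    phenotype: str              # e.g. "cancer" or "healthy" (illustrative labels)
    variants: FrozenSet[str]    # variant identifiers observed in this sample


def variant_frequency(samples: Iterable[Sample], variant: str,
                      phenotype: str) -> Optional[float]:
    """Return the fraction of samples with `phenotype` that carry `variant`.

    Only the aggregate number leaves this function; no per-sample data
    is returned. Returns None when no sample matches the phenotype.
    """
    cohort = [s for s in samples if s.phenotype == phenotype]
    if not cohort:
        return None
    carriers = sum(1 for s in cohort if variant in s.variants)
    return carriers / len(cohort)


# Invented toy data: four cancer samples (one carrier) and one healthy sample.
VARIANT = "chr1:12345A>G"  # hypothetical identifier, not a real annotation
samples = [
    Sample("s1", "cancer", frozenset({VARIANT})),
    Sample("s2", "cancer", frozenset()),
    Sample("s3", "cancer", frozenset({"chr2:999C>T"})),
    Sample("s4", "cancer", frozenset()),
    Sample("s5", "healthy", frozenset({VARIANT})),
]

cancer_frequency = variant_frequency(samples, VARIANT, "cancer")
print(f"Frequency in cancer samples: {cancer_frequency:.0%}")
```

The same function can be called with a different phenotype label to compare against healthy samples, which is the filtering step the answer describes.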
3) Your idea seems quite revolutionary and much needed. How has the scientific community reacted to your initiative so far? Are the principles behind sharing and opening data something new for scientists?
Similar approaches have been suggested and a handful have been prototyped within the academic community before. However, all of the projects for sharing data in an academic setting have ultimately faced the same problems: they do not have the resources to scale up their solution to work for the entire community, and even if they had the ambition to scale up, they would find it extremely difficult to fund infrastructure projects from traditional research funding. In general, there is a positive attitude towards data sharing in research. However, the immediate concerns of researchers revolve around writing papers rather than building common infrastructure.
Based on this knowledge of the community, I realised that a separate entity is needed to take the initiative in developing a solution, drawing on the knowledge generated in academia, and building an organisation that can do independent fundraising and collaborate across institutions. We have registered DNAdigest as a charity so that we can function as an independent and trusted third party to provide the community with a feasible solution.
4) What do researchers have to do in order to access genomics data on DNAdigest.org? Can individuals share their genomics information directly on the platform?
We are still designing and developing the platform, so I cannot yet give you the exact user guide. Our objective is not to store entire datasets, but to connect to existing data repositories and data management systems through a common API. The API will allow queries into the metadata to select samples and, for the samples for which patient consent is available, queries into the genetic data to provide aggregated statistics collected across datasets.
We have no plans at this point to provide storage capacity for individual genomic data. For this purpose, an individual would currently have to find an associated repository, for example through their patient community, that allows storage of their genomic data.
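The flow described in this answer (select samples by metadata, restrict to consented samples, aggregate across repositories) could be sketched roughly as follows. This is an editorial sketch under stated assumptions, not the DNAdigest API, which was still being designed at the time of the interview: `Record`, `Repository`, `broker_query` and the metadata keys are hypothetical, and the data is invented.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical sample record held by a repository."""
    metadata: Dict[str, str]    # e.g. {"disease": "disease-x"} (illustrative)
    consented: bool             # whether patient consent is available
    variants: FrozenSet[str]    # variant identifiers observed in the sample


class Repository:
    """Hypothetical repository that answers count queries but never
    hands out individual records."""

    def __init__(self, records: Iterable[Record]) -> None:
        self._records: List[Record] = list(records)

    def count(self, variant: str,
              metadata_filter: Dict[str, str]) -> Tuple[int, int]:
        """Return (carriers, selected) over consented, matching samples."""
        selected = [
            r for r in self._records
            if r.consented
            and all(r.metadata.get(k) == v for k, v in metadata_filter.items())
        ]
        carriers = sum(1 for r in selected if variant in r.variants)
        return carriers, len(selected)


def broker_query(repositories: Iterable[Repository], variant: str,
                 metadata_filter: Dict[str, str]) -> Dict[str, Optional[float]]:
    """Sum the per-repository counts and return aggregate statistics only."""
    carriers = 0
    total = 0
    for repo in repositories:
        c, n = repo.count(variant, metadata_filter)
        carriers += c
        total += n
    frequency = carriers / total if total else None
    return {"carriers": carriers, "samples": total, "frequency": frequency}


# Invented toy data spread over two repositories.
V = "chr1:12345A>G"  # hypothetical variant identifier
repo_a = Repository([
    Record({"disease": "disease-x"}, True, frozenset({V})),
    Record({"disease": "disease-x"}, True, frozenset()),
    Record({"disease": "disease-x"}, False, frozenset({V})),  # no consent: skipped
])
repo_b = Repository([
    Record({"disease": "disease-x"}, True, frozenset()),
    Record({"disease": "disease-x"}, True, frozenset()),
    Record({"disease": "disease-y"}, True, frozenset({V})),   # filtered by metadata
])

result = broker_query([repo_a, repo_b], V, {"disease": "disease-x"})
print(result)
```

In this sketch each repository reports only two counts, so the broker never sees individual genomes; a real system would need further safeguards that are outside the scope of this illustration.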
5) Sharing such private information is a big concern for many people nowadays. How do you approach the privacy issue? What is your solution for this?
Our approach to privacy is to provide anonymization through aggregation. We will provide an API from which it is possible to query for summary statistics over selections of the available data. For example, for a researcher interpreting a specific mutation for a patient with a genetic disease, the associated query for DNAdigest would be “what is the frequency of this mutation for patients with this genetic disease?”. The query could also be used to look for mutation frequencies in healthy individuals or in patients with related diseases.
6) Which kind of projects could profit from DNAdigest.org?
DNAdigest is still at an early stage and we have a lot of work still to do in designing and implementing the secure query platform. The projects most likely to benefit from the data that DNAdigest will make easily accessible are data analysis and interpretation of genetic variants in connection with rare diseases and other genetics research. In the bigger picture, a future of genomic medicine where diagnosis from genome sequencing is commonplace will only be possible if the means for interpretation, namely data access across patient groups and across repositories, become available.
7) We read on your blog that DNAdigest.org has been selected for the Wayra UnLtd accelerator. Congratulations on that! Do you benefit from other sources of support? And, in general, to what extent do investors support social enterprises and non-profit-oriented ideas?
We are very happy that we were selected for the Wayra UnLtd accelerator at this early stage of our project. The accelerator is not just an office space, but a community of startups and business-savvy people helping each other develop sustainable businesses. So far, we have been bootstrapping our initiative with volunteer participation and charitable donations.
8) You have also organised a hack day in Cambridge and even workshops, thus building an expanding community. How does DNAdigest.org benefit from these encounters? How are the results so far?
Engaging the community in our project is essential if we want to develop a new mechanism to change the existing culture and structure of data sharing. The stakeholders from academia, industry and patient groups all have very different priorities with regard to sharing of data. Through our hack day, we arrived at a more complete understanding of the stakeholder interests and of the sustainable development models and technical implementations that may be feasible in the short and the long term.
9) As you might know, we are particularly interested in Open Data. By accessing open information, developers are creating apps that solve specific problems. Is there already any app using open genomics data? If not, what could such an app look like?
Sensitive information like medical records and genetic sequences is unlikely to be released as Open Data. However, the knowledge generated from the data, such as statistics, can and should be made both public and easily available for the scientific community to build on. In addition, the metadata describing existing datasets currently residing at research repositories could be made openly available at no risk to privacy. A common problem in the research community, though, is that it is difficult to provide incentives for researchers to spend time and effort registering their data in public repositories. Luckily, there is an increasing push from funding agencies to require that data produced with public funding be made publicly available.
Regarding apps: in the bioinformatics community, many tools are being developed to analyse proprietary data, and many others are developed to make use of data made openly available through public databases. For two such sources of public data (but not patient data), see the UCSC Genome Browser and the Ensembl Genome Browser.
10) In your opinion, how can the scientific community benefit from Open Data?
It would be ideal if there could be a real shift in research practices such that researchers register the existence of datasets even before publication (i.e. making the metadata Open Data), so that other researchers have every opportunity to find and identify potential collaborators and sources of data for their research. For sensitive data, such as the genetic information and medical health record details of individual patients, we believe that a common interface is needed to make use of the wealth of data that is being produced today. We propose that DNAdigest can provide such alternative data access by working as the discovery and aggregation mechanism that lets you query across sensitive datasets.
Read more about DNAdigest and sign up for the newsletter at DNAdigest.org