Exploring Open Science n°4: DNAdigest interviews Nowomics

This week I would like to introduce you to Richard Smith, founder and software developer of Nowomics. He kindly agreed to answer some questions for our blog post series, and here it is – first-hand information on Nowomics. Keep reading to find out more about this company.


Richard Smith, founder and software developer of Nowomics

1. Could you please give us a short introduction to Nowomics (goals, interests, mission)?

Nowomics is a free website to help life scientists keep up with the latest papers and data relevant to their research. It lets researchers ‘follow’ genes and keywords to build their own news feed of what’s new and popular in their field. The aim is to help scientists discover the most useful information and avoid missing important journal articles, but without spending a lot of their time searching websites.

2. What makes Nowomics unique?

Nowomics tracks new papers, but also other sources of curated biological annotation and experimental data. It can tell you if a gene you work on has new annotation added or has been linked to a disease in a recent study. The aim is to build knowledge of these biological relationships into the software to help scientists navigate and discover information, rather than recommending papers simply by text similarity.

3. When did you realise that a tool such as Nowomics would be of great help to the genomic research community?

I’ve been building websites and databases for biologists for a long time and have heard from many scientists how hard it is to keep up with the flood of new information. Around 20,000 biomedical journal articles are published every week, and there are hundreds of sources of data online; receiving lots of emails with lists of paper titles isn’t a great solution. In social media, interactive news feeds that adapt to an individual are now commonly used as an excellent way to consume large amounts of new information, and I wanted to apply these principles to tracking biology research.

4. Which part of developing the tool did you find most challenging?

As with a lot of software, making sure Nowomics is as useful as possible to users has been the hardest part. It’s quite straightforward to identify a problem and build some software, but making sure the two are correctly aligned to provide maximum value to users has been the difficult part. It has meant trying many things, demonstrating ideas and listening to a lot of feedback. Handling large amounts of data and writing text mining software to identify thousands of biological terms is simple by comparison!

5. What are your plans for the future of Nowomics? Are you working on adding new features/apps?

There are lots of new features planned. Currently Nowomics focuses on genes/proteins and selected organisms. We’ll soon make this much broader, so scientists will be able to follow diseases, pathways, species, processes and many other keywords. We’re working on how these terms can be combined together for fine grained control of what appears in news feeds. It’s also important to make sharing with colleagues and recommending research extremely simple.

6. Can you think of examples of how Nowomics supports data access and knowledge dissemination within the genomics community?

The first step to sharing data sets and accessing research is for the right people to know they exist. This is exactly what Nowomics was set up to achieve, to benefit both scientists who need to be alerted to useful information and for those generating or funding research to reach the best possible audience. Hopefully Nowomics will also alert people to relevant shared genomics data in future.

7. What does ethical data sharing mean to you?

For data that can advance scientific and medical research the most ethical thing to do is to share it with other researchers to help make progress. This is especially true for data resulting from publicly funded research. However, with medical and genomics data the issues of confidentiality and privacy must take priority, and individuals must be aware what their information may be used for.

8. What are the most important things that you think should be done in the field of genetic data sharing?

The challenge is to find a way to unlock the huge potential of sharing genomics data for analysis while respecting the very real privacy concerns. A platform that enables sharing in a secure, controlled manner which preserves privacy and anonymity seems essential; I’m very interested in what DNAdigest are doing in this regard.


Exploring Open Science n°2: DNAdigest interviews SolveBio

DNAdigest continues with the series of interviews. Here we would like to introduce you to Mr Mark Kaganovich, CEO of SolveBio, who agreed to an interview with us. He shared a lot about what SolveBio does and discussed with us the importance of genomic data sharing.


Mark Kaganovich, CEO of SolveBio

1) Could you describe what SolveBio does?

SolveBio delivers the critical reference data used by hospitals and companies to run genomic applications. These applications use SolveBio’s data to predict the effects of slight DNA variants on a person’s health. SolveBio has designed a secure platform for the robust delivery of complex reference datasets. We make the data easy to access so that our customers can focus on building clinical grade molecular diagnostics applications, faster.

2) How did you come up with the idea of building a system that integrates genomic reference data into diagnostic and research applications? And what was the crucial moment when you realised the importance of creating it?

As a graduate student I spent a lot of time parsing, re-formatting, and integrating data just to answer some basic questions in genomics. At the same time (this was about two years ago) it was becoming clear that genomics was going to be an important industry with a yet unsolved IT component. David Caplan (SolveBio’s CTO) and I started hacking away at ways to simplify genome analysis in anticipation that interpreting DNA would be a significant problem in both research and the clinic. One thing we noticed was that there were no companies or services out there to help out guys like us – people who were programming with genomic data. There were a few attempts at kludgy interfaces for bioinformatics, and a number of people were trying to solve the read-mapping computing infrastructure problem, but there were no “developer tools” for integrating genomic data. In part, that was because a couple of years ago there wasn’t that much data out there, so parsing, formatting, cleaning, indexing, updating, and integrating data wasn’t as big a problem as it is now (or will be in a few years). We set out to build an API to the world’s genomic data so that other programmers could build amazing applications with the data without having to repeat painful, meaningless tasks.

As we started talking to people about our API we realized how valuable a genomic data service is for the clinic. Genomics is no longer solely an academic problem. When we started talking to hospitals and commercial diagnostic labs, that’s when we realized that this is a crucial problem. That’s also when we realized that an API to public data is just the tip of the iceberg. Access to clinical genomic information that can be used as reference data is the key to interpreting DNA as a clinical metric.

3) After the molecular technology revolution made it possible for us to collect large amounts of precise medical data at low cost, another problem appeared to take over. How do you see solving the problem that the data are not in a language doctors can understand?

The molecular technology revolution will make it possible to move from “Intuitive Medicine” to “Precision Medicine”, in the language of Clay Christensen and colleagues in “The Innovator’s Prescription”. Molecular markers are much closer to being unique fingerprints of the individual than whatever can be expressed by the English language in a doctor’s note. If these markers can be conclusively associated with diagnosis and treatment, medicine will be an order of magnitude better, faster and cheaper than it is now. Doctors can’t possibly be expected to read the three billion or so base pairs that make up the genome of every patient and recall which diagnosis and treatment is the best fit in light of the genetic information. This is where the digital revolution – i.e. computing – comes in. Aggregating siloed data while maintaining the privacy of the patients, using bleeding-edge software, will allow doctors to use clinical genomic data to improve medicine.

4) What are your plans for the future of SolveBio? Are you working on developing more tools/apps?

Our goal is to be the data delivery system for genomic medicine. We’ve built the tools necessary to integrate data into a genomic medical application, such as a diagnostic tool or variant annotator. We are now building some of these applications to make life easier for people running genetic tests.

5) Do you recognise the problem of limited sharing of genomics data for research and diagnosis? Can you think of an example of how the work of SolveBio supports data access and knowledge sharing within the genomics community?

The information we can glean from DNA sequence is only as good as the reference data that is used for research and diagnostic applications. We are particularly interested in genomics data from the perspective of how linking data from different sources creates the best possible reference for clinical genomics. This is, in a way, a data sharing problem.

I would add though that a huge disincentive to distributing data is the privacy, security, liability, and branding concern that clinical and commercial outfits are right to take into account. As a result, we are especially tailoring our platform to address those concerns.

However, even the data that is currently being “shared” openly, largely as a product of the taxpayer funded academic community, is very difficult and costly to access. Open data isn’t free. It involves building and maintaining substantial infrastructure to make sure the data is up-to-date and to verify quality. SolveBio solves that problem. Developers building DNA interpretation tools no longer have to worry about setting up their data infrastructure. They can integrate data with a few lines of code through SolveBio.
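
As a rough illustration of what “a few lines of code” could look like, here is a minimal sketch of querying a hosted reference dataset over HTTP. The endpoint URL, dataset name, filter fields and authentication scheme are assumptions made for this example, not SolveBio’s actual API.

```python
import requests

# Hypothetical endpoint and token: illustrative assumptions, not SolveBio's real API.
API_URL = "https://api.example-refdata.org/v1/datasets/clinical-variants/query"
API_TOKEN = "YOUR_TOKEN_HERE"


def lookup_variant(gene, protein_change):
    """Fetch curated annotations for one variant from a hosted reference dataset."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Token {API_TOKEN}"},
        json={"filters": {"gene_symbol": gene, "protein_change": protein_change}},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])


if __name__ == "__main__":
    # Example: look up a commonly cited variant notation (illustrative values only).
    for record in lookup_variant("BRAF", "p.V600E"):
        print(record.get("clinical_significance"), "-", record.get("review_status"))
```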

6) Which is the most important thing that should be done in the field of genetic data sharing and what does ethical data sharing mean to you?

Ethical data sharing means keeping patient data private and secure. If data is used for research or diagnostic purposes and needs to be transferred among doctors, scientists, or engineers then privacy and security is a key concern. Without privacy and security controls genomic data will never benefit from the aggregate knowledge of programmers and clinicians because patients will be rightly opposed to measuring, let alone distributing, their genomic information. Patient data belongs to the patient. Sometimes clinicians and researchers forget that. I definitely think the single most important thing to get right is the data privacy and security standard. The entire field depends upon it.


Open Spending: Tracking Financial Data worldwide

If you have followed the activities of the OKFN over the last few years, you probably already know Open Spending, the community-driven project initiated in 2007, which has grown considerably since then. The idea started with Where Does My Money Go?, a database for UK public financial data financed by the 4IP (4 Innovation for the Public) fund of the British broadcaster Channel 4. A few years later, in 2011, the initiative was internationalised and Open Spending was born, a worldwide platform that has gone far beyond British borders. Today, the site shows data from 73 countries, from Bosnia to Uganda, and the visualisation tool Spending Stories was developed at the same time, thanks to a grant from the Knight Foundation. Speaking of funding, we should not forget the Open Society Foundations, which supports the community-building work, and the Omidyar Network, which funded the research behind the report “Technology for Transparent and Accountable Public Finance”. You guessed it: everything is Open Source.


Open Spending does not just aggregate worldwide public financial data such as budgets, spending, balance sheets, procurement and employee salaries, giving information on how public money has been spent all over the world and in your own city. It also allows users to visualise the available data directly via Spending Stories and to add new datasets. The community members using and developing the tools come from various backgrounds, and everyone is invited to join. Additionally, articles are regularly posted on the blog to encourage people to share knowledge with each other.

The results so far are very good: numerous administrations and media outlets have already used the visualisations, such as the city of Berlin and the Guardian. Besides them, independent journalists, activists from civil society, students and engaged citizens also take advantage of the datasets, allowing a better understanding of how public money is spent.


DNAdigest Symposium: A tour of Open Science in human genomics research

This past weekend, DNAdigest organized a Symposium on the topic “Open Science in human genomics research – challenges and inspirations”. The event brought together enthusiastic people with a keen interest in the topic, along with the DNAdigest team. We are very pleased to say that the day turned out to be a success, with both participants and organizers enjoying the excellent talks of our speakers and the discussion sessions.

The day started with a short introduction on the topic by Fiona Nielsen.


Then our first speaker, Manuel Corpas, was a source of inspiration to all participants, talking us through the process he went through in order to fully sequence the whole genomes of his family and himself and to share this data widely with the whole world. Here is a link to the presentation he gave on the day.

The Symposium was organized in the format of an Open Space conference, where everybody could suggest topics related to Open Science or join whichever sounded most interesting. Again, we used HackPad to take notes and capture interesting thoughts throughout the discussions. You can take a look at it here.


We had three more speakers invited to our Symposium: Tim Hubbard (slides) talked about how Genomics England engages the research community, namely genomic scientists and patient communities, to collaborate on both data generation and data analysis of the 100k Genomes Project for the public benefit. Julia Wilson (slides) came as a representative of the Global Alliance. She introduced us to the GA4GH and explained how their work helps to implement standards for data sharing across genomics and health. Last, but not least, was Nick Sireau (slides). He walked us through an eight-step process to show us how exactly the scientific community and the patient community can engage in collaborations, and how Open Science (sharing of hypotheses, methods and results throughout the science process) may be either beneficial or challenging in this context.


The event came to its end with a summary of learning points and a rounding up by Fiona Nielsen.

We have also made a Storify summary where you can find a collection of all the tweets and most of the photos from the day. There is also a gallery including all the pictures taken by our team members.

Now, to all former and future participants: if you enjoy participating in these events, please donate to DNAdigest by texting DNAD14 £10 to 70070, so that we can continue organizing more of these interactive and exciting events in the future. You can also buy some of our cool DNAdigest T-shirts and mugs from our website shop.

It was great to see you all, and we look forward to welcoming you again for our next events!

DNAdigest team: Fiona, Adrian, Margi, Francis, Sebastian, Xocas and Tim

This event would not have been possible without the contributions of our generous sponsors.


Exploring Open Science: DNAdigest interviews Aridhia

As promised last week in the DNAdigest newsletter, we are bringing our first blog post interview to life. Meet Mr Rodrigo Barnes, part of the Aridhia team. He kindly agreed to answer our questions about Aridhia and their views on genomic data sharing.


Mr Rodrigo Barnes, CTO of Aridhia

1. You are a part of the Aridhia team. Please, tell us what the goals and the interests of the company are?

Aridhia started with the objective of using health informatics and analytics to improve efficiency and service delivery for healthcare providers, support the management of chronic disease and personalised medicine, and ultimately improve patient outcomes.

Good outcomes had already started to emerge in diabetes and other chronic diseases, through some of the work undertaken by the NHS in Scotland and led by one of our founders, Professor Andrew Morris. This included providing clinicians and patients with access to up-to-date, rich information from different parts of the health system.

Aridhia has since developed new products and services to solve informatics challenges in the clinical and operational aspects of health. As a commercial organisation, we have worked on these opportunities in collaboration with healthcare providers, universities, innovation centres and other industry partners, to ensure that the end products are fit for purpose, and the benefits can be shared between our diverse stakeholders. We have always set high standards for ourselves, not just technically, but particularly when it comes to respecting people’s privacy and doing business with integrity.

2. What is your role in the organisation and how does your work support the mission of the company?

Although my background is in mathematics, I’ve worked as a programmer in software start-ups for the majority of my career. Since joining Aridhia as one of its first employees, I have designed and developed software for clinical data, often working closely with NHS staff and university researchers. This has been a great opportunity to work on (ethically) good problems and participate in multidisciplinary projects with some very smart, committed and hard-working people.

In the last year, I took on the CTO (Chief Technology Officer) role, which means I have to take a more strategic perspective on the business of health informatics. But I still work directly with customers and enjoy helping them develop new products.

3. What makes Aridhia unique?

We put collaboration at the very heart of everything we do. We work really hard to understand the different perspectives and motivations people bring to a project, and acknowledge expertise in others, but we’re also happy to assert our own contribution. We have also been lucky to have investors who recognise the challenges in this market and support our vision for addressing them.

4. Aridhia has recently won a competition for helping businesses develop new technology to map and analyse genes, and more specifically to support the efforts of the NHS to map the whole genomes of patients with rare diseases or cancer. What phase are you in now, and have you developed an idea (or even a prototype) that you can tell us more about?

It’s a little early to say too much about our product plans, but we have identified a number of aspects within genomic medicine that we feel need to be addressed. Based on our extensive experience in the health field, we think a one size fits all approach won’t work when it comes to annotating genomes and delivering that information usefully into the NHS (and similar healthcare settings). There will be different user needs, of course, but there are also IT procurement and deployment challenges to tackle before any smart solution can become common practice in the NHS.

We strongly believe that there is a new generation of annotation products and services waiting to emerge from academic/health collaborations. We believe that clinical groups have the depth of knowledge and the databases of cases that are needed to provide real insight into complex diseases with genetic factors, and we are keen to help these SMEs and spin outs validate their technology and get them ‘to market’ in the NHS and healthcare settings around the world.

Overall, our initial objective is to help take world-class annotations out of research labs and into operational use in the NHS. Both of these goals are very much in line with Genomics England’s mandate to improve health and wealth in the UK.

5. Aridhia is a part of The Kuwait Scotland eHealth Innovation Network (KSeHIN). Can you tell us something more about this project and what your plans for further development are?

Kuwait has one of the highest rates of obesity and diabetes in the world, and the Kuwait Ministry of Health has responsibility for tackling this important issue. We’ve worked with the Dasman Diabetes Centre in Kuwait and the University of Dundee to bring informatics, education and resources to improve diabetes care. The challenge from the initial phase is to scale up to a national system. We think there are good opportunities to work with the Ministry of Health in Kuwait to achieve their goals as well as working with the Dasman’s own genomics and research programmes. This project is an excellent example of the combination of skills and resources needed to make an impact on the burden of chronic disease.

6. Do you recognise the problem of limited sharing of genomics data for research and diagnosis? How does the work of Aridhia support data access and knowledge sharing within the genomics community?

This is a sensitive subject of course, and we have to acknowledge that this is data that can’t readily be anonymised. Sharing, if it’s permissible, won’t follow the patterns we are used to with other types of data. That’s why we took an interest in the work DNAdigest is doing.

Earlier in the year, Aridhia launched its collaborative data science platform, AnalytiXagility, which takes a tiered approach to the managed sharing of sensitive data. We make sure that we offer data owners and controllers what they need to feel comfortable in sharing data. AnalytiXagility delivers a protocol for negotiation and sharing, backed by a ‘life-cycle’ or ‘lease’ approach to the sharing, and audit systems to verify compliance. This has been used primarily for clinical, imaging and genomics data to date.

In a ‘Research Safe Haven’ model, the analysts come to the data, and have access to that for the intended purpose and duration of their project. This system is in place at the Stratified Medicine Scotland – Innovation Centre, which already supports projects using genomic and clinical data. The model we are developing for genomic data extends that paradigm of bringing computing to the data. We are taking this step by step and working with partners and customers to strengthen the system.

From a research perspective, the challenges are likely to be related to having enough linked clinical data, but also having enough samples and controls to get a meaningful result. So we think we will see standards emerging for federated models – research groups will try to apply their analysis against raw genomic data at multiple centres using something like the Global Alliance for Genomics and Health (GA4GH) API, and then collate results for analysis under a research safe haven model. We recently joined the Global Alliance and will bring our experience of working with electronic patient records and clinical informatics to the table.
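
To make the federated pattern concrete, here is a minimal sketch of querying several centres for the presence of a single allele and collating the answers, with no raw genomic data leaving any site. The endpoint URLs are hypothetical, and the query parameters follow the style of the GA4GH Beacon API but are illustrative only.

```python
import requests

# Hypothetical Beacon-style endpoints at three centres (illustrative URLs only).
BEACON_ENDPOINTS = [
    "https://beacon.centre-a.example.org/query",
    "https://beacon.centre-b.example.org/query",
    "https://beacon.centre-c.example.org/query",
]

# Allele query in the style of the GA4GH Beacon API; values are placeholders.
QUERY = {
    "assemblyId": "GRCh38",
    "referenceName": "7",
    "start": 140753335,
    "referenceBases": "A",
    "alternateBases": "T",
}


def federated_allele_query(endpoints, params):
    """Ask each centre whether the allele has been observed; collate yes/no answers."""
    results = {}
    for url in endpoints:
        try:
            resp = requests.get(url, params=params, timeout=10)
            resp.raise_for_status()
            results[url] = resp.json().get("exists")
        except requests.RequestException as err:
            results[url] = f"unreachable ({err})"
    return results


if __name__ == "__main__":
    for centre, answer in federated_allele_query(BEACON_ENDPOINTS, QUERY).items():
        print(centre, "->", answer)
```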

7. What are your thoughts on the most important thing that should be done in the field of genetic data sharing?

Trust and transparency are important factors. I am interested in seeing what could be done to establish protocols and accreditations that would give participants visibility of how data is being used and how the benefits are shared.


Giving research data the credit it’s due

In many ways, the currency of the scientific world is publications. Published articles are seen as proof – often by colleagues and future employers – of the quality, relevance and impact of a researcher’s work. Scientists read papers to familiarize themselves with new results and techniques, and then they cite those papers in their own publications, increasing the recognition and spread of the most useful articles. However, while there is undoubtedly a role for publishing a nicely-packaged, (hopefully) well-written interpretation of one’s work, are publications really the most valuable product that we as scientists have to offer one another?

As biology moves more and more towards large-scale, high-throughput techniques – think all of the ‘omics – an increasingly large proportion of researchers’ time and effort is spent generating, processing and analyzing datasets. In genomics, large sequencing consortia like the Human Genome Project or ENCODE  were funded in part to generate public resources that could serve as roadmaps to guide future scientists. However, in smaller labs, all too often after a particular set of questions is answered, large datasets end up languishing on a dusty server somewhere. Even for projects whose express purpose is to create a resource for the community, the process of curating, annotating and making data available is a time-consuming and often thankless task.


Current genomics data repositories like GEO and ArrayExpress serve an important role in making datasets available to the public, but they typically contain data that is already described in a published article; citing the dataset is typically secondary to citing the paper. If more, easier-to-use platforms existed for publishing datasets themselves, alongside methods to quantify the use and impact of these datasets, it might help drive a shift away from the mindset of ascribing value purely to journal articles towards a more holistic approach where the actual products of research projects – including datasets as well as code or software tools used to analyse them, in addition to articles – are valued. Such a shift could bring benefits to all levels of biological research, from ensuring that students who toiled for years to produce a dataset get adequate credit for their work, to encouraging greater sharing and reuse of data that might not have made it into a paper but still has the potential to yield scientific insights.

Tools and platforms to do just this are gradually emerging and gaining recognition in the biological community. Figshare is a particularly promising platform that allows for the sharing and discovery of many types of research outputs, including datasets as well as papers, posters and various media formats. Importantly, items uploaded to Figshare are assigned a Digital Object Identifier (DOI), which provides a unique and persistent link to each item and allows it to be easily cited. This is analogous to the treatment of articles on preprint servers such as arXiv and bioRxiv, whose use is also growing in biological disciplines; however, Figshare is more flexible in terms of the types of research output it accepts. In addition to the space and ability to share and cite data, the research community could benefit from better quantification of data citation and impact. Building on the altmetrics movement, which attempts to provide alternative measures of the impact of scientific articles besides the traditional journal impact factor, a new Data-Level Metrics pilot project has recently been announced as a collaboration between PLOS, the California Digital Library and DataONE. The goal of this project is to create a new set of metrics that quantify usage and impact of shared datasets.

Although slow at times, the biological research community is gradually adapting to the new needs and possibilities that come along with high-throughput datasets. Particularly in the field of genomics, I hope that researchers will continue to push for and embrace innovative ways of sharing their data. If data citation becomes the new standard, it could facilitate collaboration and reproducibility while helping to diversify the range of outputs that scientists consider valuable. Hopefully, the combination of easy-to-use platforms and metrics that capture the impact of non-traditional research outputs will provide incentives to researchers to make their data available and encourage the continued growth of sharing, recognizing and citing biological datasets.

Interview with Fiona Nielsen, DNAdigest.org, Cambridge, UK

Recently, we learned about a project which shows how the principles of Knowledge Sharing can be applied to the scientific domain, specifically to genomics data. DNAdigest is a Not-for-Profit Organisation founded and located in Cambridge, UK, by a group of individuals from diverse backgrounds who all want to see genomics used to its full potential to aid medical research. The objective of DNAdigest is to provide a simple, secure and effective mechanism for sharing genomics data for research without compromising the data privacy of the individual contributors.

From the beginning, this concept sounded very appealing to us. That’s why we contacted Fiona Nielsen, founder of this great initiative, to talk about the goals of the project, its approach to making use of such sensitive data and the current status of data sharing within the scientific community.

DNAdigest is still in the development process but already shows a promising future. Not only have they been selected for the Wayra UnLtd accelerator programme for social entrepreneurs, they are also working hard on building a community around the idea, organising events like hack days and workshops. Since no one can describe the project better than its creator, we invite you to discover more about it through the following sequence of questions and answers.


1) Fiona, could you first introduce yourself and DNAdigest?

I am a bioinformatics scientist turned entrepreneur. I used to work in a biotech company where I was developing tools for interpretation of next-generation sequencing data and I took part in a number of projects where I was doing the data analysis of cancer sequencing samples. During my work, I realised how difficult it is to find and get access to genomics data for research.

DNAdigest was founded as an entity to provide a novel mechanism for sharing of data, aligning the interests of patients and researchers through a data broker mechanism, enabling easy access to anonymised aggregated data.

2) Why is it important to share genomics data? Quoting your website, the current state of sharing this information is embarrassingly limited. How does DNAdigest address this problem?

The human genome is very complex: it is made up of 3 billion base pairs and varies from individual to individual, so when you as a researcher attempt to nail down the genetic variation that is causing a genetic disease, it is like looking for a needle in a haystack. The only way to narrow your search is by filtering out genetic variation that has been seen before in healthy individuals and annotating the variation that is left with the disease(s) in which it occurs. This type of comparative analysis requires looking at variants from as many samples as possible. Ideally you will need to compare to tens of thousands of samples for your comparison to approach statistical significance. Accessing thousands of samples today is not only difficult in terms of permissions; in terms of sheer storage and network capacity it is also not practical to download huge datasets for every team that wants to do a comparison. DNAdigest is developing a data broker which will allow the researcher to submit queries for specific variants, with only the aggregated information about the selected variants returned as a result. For example, when examining a specific mutation in cancer, the query could be “what is the frequency of this mutation in cancer samples?” and the result would be returned as a frequency, e.g. 3%. The aim of DNAdigest is to reduce the time to discover, access and retrieve the data relevant to genomic comparison.
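
As a minimal sketch of what such a data-broker query might look like from the researcher’s side, consider the snippet below. The endpoint, parameter names and response format are hypothetical assumptions made for illustration; they are not DNAdigest’s actual interface.

```python
import requests

# Hypothetical aggregation endpoint; URL and field names are assumptions for this sketch.
BROKER_URL = "https://broker.example.org/v1/variant-frequency"


def variant_frequency(chromosome, position, ref, alt, cohort):
    """Return only an aggregate frequency for a variant, never individual genotypes."""
    resp = requests.get(
        BROKER_URL,
        params={
            "chromosome": chromosome,
            "position": position,
            "ref": ref,
            "alt": alt,
            "cohort": cohort,  # e.g. "cancer" or "healthy"
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["frequency"]


if __name__ == "__main__":
    # "What is the frequency of this mutation in cancer samples?" -> e.g. 0.03 (3%)
    freq = variant_frequency("17", 7579472, "G", "C", cohort="cancer")
    print(f"Observed frequency in cancer samples: {freq:.1%}")
```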

3) Your idea seems quite revolutionary and, actually, very needed. What has the reaction of the scientific community towards your initiative been so far? Are the principles behind sharing and opening data something new for scientists?

Similar approaches have been suggested and a handful of approaches have been prototyped within the academic community before. However, all of the projects for sharing data in an academic setting have ultimately faced the same problems: they do not have the resources to scale up their solution to work for the entire community, and even if they had the ambition to scale up the solution, they would find that it is extremely difficult to obtain funding for infrastructure projects from traditional research funding. In general, there is a positive attitude towards data sharing in research. However, the immediate concerns of researchers revolve around writing papers and not so much around building common infrastructure.
Based on this knowledge of the community, I realised that a separate entity is needed to take the initiative in developing a solution, drawing on the knowledge generated in academia, and building an organisation that can do independent fundraising and collaborate across institutions. We have registered DNAdigest as a charity so that we can function as an independent and trusted third party to provide the community with a feasible solution.

4) What do researchers have to do in order to access genomics data on DNAdigest.org? Can individuals share their genomics information directly on the platform?

We are still designing and developing the platform, so I cannot yet give you the exact user guide. Our objective is not to store entire datasets, but to connect to existing data repositories and data management systems with a common API that allows queries into the metadata to select samples and, for the samples for which patient consent is available, to query into the genetic data to provide aggregated statistics collected across datasets.

We have no plans at this point to provide storage capacity for individual genomic data; currently, for this purpose, an individual would have to find an associated repository, for example through their patient community, which will allow storage of their genomic data.

5) Sharing such private information is a big concern for many people nowadays. How do you approach the privacy issue? What is your solution for this?

Our approach to privacy is to provide anonymization through aggregation. We will provide an API from which it is possible to query for summary statistics over selections of the available data. For example, for a researcher interpreting a specific mutation for a patient with a genetic disease, the associated query for DNAdigest would be “what is the frequency of this mutation for patients with this genetic disease?”. The query could also be used to look for mutation frequencies in healthy individuals or in patients with related diseases.

6) Which kind of projects could profit from DNAdigest.org?

DNAdigest is still at an early stage and we have a lot of work still to do in designing and implementing the secure query platform. The projects that are most likely to benefit from the resource of data that DNAdigest will make easily accessible are data analysis and interpretation of genetic variants in connection with rare diseases and other genetics research. In the bigger picture, a future of genomic medicine where diagnosis from genome sequencing is commonplace will only be possible if the means for interpretation, namely data access across patient groups and across repositories, becomes available.

7) We read from your blog that DNAdigest.org has been selected for the Wayra UnLtd accelerator. Congratulations on that! Do you benefit from other support sources? And, in general, to what extent are investors supporting social enterprises and non-profit-oriented ideas?

We are very happy that we were selected for the Wayra UnLtd accelerator at this early stage of our project. The accelerator is not just an office space, but a community of startups and business-savvy people helping each other develop sustainable businesses. So far, DNAdigest has been bootstrapping our initiative with volunteer participation and charitable donations.

8) You have also organised a hack day in Cambridge and even workshops, thus building an expanding community. How does DNAdigest.org benefit from these encounters? What are the results so far?

Engaging the community in our project is essential if we want to develop a new mechanism to change the existing culture and structure of data sharing. The stakeholders from academia, industry and patient groups all have very different priorities with regards to sharing of data. Through our hack day, we arrived at a more complete understanding of the stakeholder interests and of the potential sustainable development models and technical implementations that may be feasible in the short and the long term.

9) As you might know, we are particularly interested in Open Data. By accessing open information, developers are creating apps which solve certain problems. Is there already any app using open genomics data? If not, what could such an app look like?

Sensitive information like medical records and genetic sequences is unlikely to be released as Open Data; however, the knowledge generated from the data, such as statistics, can and should be made both public and easily available for the scientific community to build on. In addition, the metadata describing existing datasets currently residing at research repositories could be made openly available at no risk to privacy. However, a common problem in the research community is that it is difficult to provide incentives for researchers to spend time and effort to register their data in public repositories. Luckily, there is an increasing push from funding agencies to require that data produced with public funding be made publicly available.

Regarding apps: in the bioinformatics community there are many many tools being developed to analyse proprietary data and many tools are developed to make use of data made openly available through public databases. For two such sources of public data (but not patient data), see the UCSC Genome Browser and the Ensembl Genome Browser.

10) In your opinion, how can the scientific community profit from Open Data?

It would be ideal if there could be a real shift in research practices so that researchers would register the existence of datasets even before publication (i.e. making the metadata Open Data), so that other researchers would have every opportunity to find and identify potential collaborators and sources of data for their research. For sensitive data, such as the genetic information and medical health record details of individual patients, we believe that a common interface is needed to make use of the wealth of data being produced today. We propose that DNAdigest can provide such an alternative data access by acting as the discovery and aggregation mechanism that will let you query across sensitive datasets.

Many thanks!

Read more about DNAdigest and sign up for the newsletter at DNAdigest.org