Exploring Open Science n°4: DNAdigest interviews Nowomics

This week I would like to introduce you to Richard Smith, founder and software developer of Nowomics. He kindly agreed to answer some questions for our blog post series, and here it is: first-hand information on Nowomics. Keep reading to find out more about this company.


Richard Smith, founder and software developer of Nowomics

1. Could you please give us a short introduction to Nowomics (goals, interests, mission)?

Nowomics is a free website to help life scientists keep up with the latest papers and data relevant to their research. It lets researchers ‘follow’ genes and keywords to build their own news feed of what’s new and popular in their field. The aim is to help scientists discover the most useful information and avoid missing important journal articles, but without spending a lot of their time searching websites.

2. What makes Nowomics unique?

Nowomics tracks new papers, but also other sources of curated biological annotation and experimental data. It can tell you if a gene you work on has new annotation added or has been linked to a disease in a recent study. The aim is to build knowledge of these biological relationships into the software to help scientists navigate and discover information, rather than recommending papers simply by text similarity.

3. When did you realise that a tool such as Nowomics would be of great help to the genomic research community?

I’ve been building websites and databases for biologists for a long time and have heard from many scientists how hard it is to keep up with the flood of new information. Around 20,000 biomedical journal articles are published every week, and there are hundreds of sources of data online; receiving lots of emails with lists of paper titles isn’t a great solution. In social media, interactive news feeds that adapt to an individual are now a common and excellent way to consume large amounts of new information, and I wanted to apply these principles to tracking biology research.

4. Which part of developing the tool did you find most challenging?

As with a lot of software, making sure Nowomics is as useful as possible to users has been the hardest part. It’s quite straightforward to identify a problem and build some software, but making sure the two are correctly aligned to provide maximum value to users has been the difficult part. It has meant trying many things, demonstrating ideas and listening to a lot of feedback. Handling large amounts of data and writing text mining software to identify thousands of biological terms is simple by comparison!

5. What are your plans for the future of Nowomics? Are you working on adding new features/apps?

There are lots of new features planned. Currently Nowomics focuses on genes/proteins and selected organisms. We’ll soon make this much broader, so scientists will be able to follow diseases, pathways, species, processes and many other keywords. We’re working on how these terms can be combined for fine-grained control of what appears in news feeds. It’s also important to make sharing with colleagues and recommending research extremely simple.

6. Can you think of examples of how Nowomics supports data access and knowledge dissemination within the genomics community?

The first step to sharing data sets and accessing research is for the right people to know they exist. This is exactly what Nowomics was set up to achieve, to benefit both scientists who need to be alerted to useful information and for those generating or funding research to reach the best possible audience. Hopefully Nowomics will also alert people to relevant shared genomics data in future.

7. What does ethical data sharing mean to you?

For data that can advance scientific and medical research the most ethical thing to do is to share it with other researchers to help make progress. This is especially true for data resulting from publicly funded research. However, with medical and genomics data the issues of confidentiality and privacy must take priority, and individuals must be aware what their information may be used for.

8. What are the most important things that you think should be done in the field of genetic data sharing?

The challenge is to find a way to unlock the huge potential of sharing genomics data for analysis while respecting the very real privacy concerns. A platform that enables sharing in a secure, controlled manner which preserves privacy and anonymity seems essential; I’m very interested in what DNAdigest is doing in this regard.


Exploring Open Science n°3: DNAdigest interviews NGS Logistics

NGS-Logistics is the next project featured in our blog interviews. We have interviewed Amin Ardeshirdavani, a PhD student involved in the creation of this web-based application. Take a look at the interview to find out why this tool has become very popular within KU Leuven.


1. What is NGS logistics?

NGS-Logistics is a web-based application which accelerates the federated analysis of Next Generation Sequencing data across different centres. NGS-Logistics acts like a real logistics company: you order something on the Internet, the owner processes your request and then ships it through a safe and trusted carrier. In the case of NGS-Logistics, the goods are human sequence data, and researchers ask about possible variations and their frequency across the whole population. We try to deliver the answers in the fastest and safest possible way.
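The fan-out-and-aggregate pattern this analogy describes can be sketched roughly as follows. Everything here is invented for illustration – the centre names, the `query_centre` stub and the counts are hypothetical stand-ins, not NGS-Logistics’ actual interfaces or data:

```python
# Minimal sketch of a federated variant-frequency query: a query manager
# asks every centre the same question, and only aggregate counts travel
# back -- the raw sequence data never leaves its home centre.

# Hypothetical per-centre allele counts for one variant (stand-ins for
# the query each centre would run against its own local database).
LOCAL_COUNTS = {
    "centre_a": {"carriers": 12, "samples": 500},
    "centre_b": {"carriers": 3, "samples": 250},
}

def query_centre(centre: str, variant: str) -> dict:
    """Stub for a remote call; a real deployment would query the centre's
    own database and return only these two integers."""
    return LOCAL_COUNTS[centre]

def federated_frequency(variant: str, centres: list[str]) -> float:
    """Aggregate the carrier frequency across centres from summary counts."""
    carriers = samples = 0
    for centre in centres:
        counts = query_centre(centre, variant)
        carriers += counts["carriers"]
        samples += counts["samples"]
    return carriers / samples

print(federated_frequency("chr1:g.12345A>G", ["centre_a", "centre_b"]))
```

The point of the pattern is that the “shipment” is a pair of integers per centre, which is what makes the fast-and-safe delivery described above possible.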

2. What is your part in NGS logistics?

Right now I am a PhD student at KU Leuven, and the whole idea of my PhD project is designing and developing new data structures for analysing the massive amounts of data produced by Next Generation Sequencing machines. NGS-Logistics is exactly that. I have done the whole design and development of the application and database. I would also like to acknowledge all the people from KU Leuven, the ESAT IT department, the UZ Leuven IT department and the UZ Genomics Core department who assisted me on this project, especially Erika Souche, for their kind support.

3. When did you first start working on the idea of creating NGS logistics and what made you think it would be something useful?

It was almost three years ago, when I had a meeting with my promotor, Professor Yves Moreau, who had the idea of somehow connecting sequencing centres and querying their data without moving them into one repository. As a person with an IT background, it wasn’t that difficult for me to develop an application, but there were lots of practical issues that needed to be taken care of. The majority of these issues relate to protecting the privacy of individuals, because the data we deal with come from human genome sequencing experiments, and people are rightfully worried about how this data will be used and protected. At the time of my first meeting there was no system in place to share this data, but many people understood the need for this kind of structure and for us to start working on it. As we know, information can be a true scientific goldmine, and by having access to more data we are able to produce more useful information. The novelty of the data, the possibility of sharing this wealth of information, and the complexity of this kind of application make me eager to work on this project.

4. How does your open source tool work and who is it designed for?

NGS-Logistics has three modules: the web interface, the access control list and the query manager. The source code of each of these modules, plus the database structure behind them, is available upon simple request. As the modules are being upgraded continuously, I have not yet set up a public repository for the source code. However, if someone is interested in gaining access to the source code, it will be our pleasure to provide it, though I do think that the whole idea of data sharing is more important than the source code itself. In any case, we are happy to share with others our experience of the various problems and issues we had to tackle during the past three years. In general, NGS-Logistics is designed to help researchers save time when they need access to more data. It helps them get a better overview of their questions and, if they need access to the actual data, it helps them find the data sets that best match their cases.

5. Who has access to the system and how do you manage access permissions?

Researchers with a valid email address and affiliation are welcome to register and use the application. This means we need to know who is querying the data, to prevent structured queries that might lead to identifying an individual. I spent almost 20 months on the Access Control List (ACL) module. Most of the tasks are controlled and automatically updated by the system itself. Centre admins are responsible for updating the list of samples they want to share with others. PIs and their power users are responsible for grouping the samples into data sets and assigning them to users and groups. The ACL has a rich, user-friendly interface that makes it very easy to learn and use.
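The grouping model described here (centre admins publish sample lists, PIs bundle samples into data sets and grant them to users) boils down to a simple permission check. A minimal sketch, with all names and the data layout invented for illustration rather than taken from the actual ACL module:

```python
# Sketch of a dataset-level access check: a query is only allowed to
# touch samples the user can reach through a dataset granted to them.

# Hypothetical datasets, as grouped by a PI from centre-admin sample lists.
DATASETS = {
    "rare_disease_trios": {"s001", "s002", "s003"},
    "cancer_cohort": {"s010", "s011"},
}

# Hypothetical grants: which datasets each user has been assigned.
GRANTS = {
    "alice": {"rare_disease_trios"},
    "bob": {"rare_disease_trios", "cancer_cohort"},
}

def accessible_samples(user: str) -> set[str]:
    """Union of the samples in every dataset granted to the user."""
    samples: set[str] = set()
    for dataset in GRANTS.get(user, set()):
        samples |= DATASETS[dataset]
    return samples

def may_query(user: str, requested: set[str]) -> bool:
    """Allow a query only if every requested sample is accessible."""
    return requested <= accessible_samples(user)

print(may_query("alice", {"s001", "s003"}))  # within alice's grant
print(may_query("alice", {"s010"}))          # cancer cohort not granted to alice
```

A real system layers auditing and rate limiting on top of a check like this, which is what makes it possible to spot the structured query patterns mentioned above.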

6. In what way do you think data sharing should be further improved?

Because of all the concerns around the term “data sharing”, I prefer the term “result sharing”. In our framework, we mostly try to answer very high-level questions, like the prevalence of a certain mutation in different populations, preventing any private information from leaking out. By having more access to data we can gain more insight and produce more useful information; as Aristotle said: “The whole is greater than the sum of its parts.” On the other hand, we always have to be careful about the consequences of sharing.

7. What does ethical data sharing mean to you?

It means everything and nothing. Why? Because ethics really depends on the subject and the location we are talking about. If we talk about sharing weather forecast data, I would say it is not important and does not have much meaning. But when we talk about data produced from human genomes, then we have to be careful. Legal frameworks differ a lot between countries. Some are very restrictive when it comes to dealing with sensitive and private data, whereas others are much less so. Mostly this is because they have different definitions of private data. In most cases, any information that allows us to uniquely identify a person is defined as private information, and as we know, it is possible to identify a person by his or her genome sequence. Therefore, I feel it is very important to keep track of what data is being used by whom, when, at which level and for what reason.


Amin Ardeshirdavani et al. have published this work in Genome Medicine 6:71: “NGS-Logistics: federated analysis of NGS sequence variants across multiple locations”. You can take a look at it here.

Exploring Open Science n°2: DNAdigest interviews SolveBio

DNAdigest continues with the series of interviews. Here we would like to introduce you to Mr Mark Kaganovich, CEO of SolveBio, who agreed to an interview with us. He shared a lot about what SolveBio does and discussed with us the importance of genomic data sharing.


Mark Kaganovich, CEO of SolveBio

1) Could you describe what SolveBio does?

SolveBio delivers the critical reference data used by hospitals and companies to run genomic applications. These applications use SolveBio’s data to predict the effects of slight DNA variants on a person’s health. SolveBio has designed a secure platform for the robust delivery of complex reference datasets. We make the data easy to access so that our customers can focus on building clinical grade molecular diagnostics applications, faster.

2) How did you come up with the idea of building a system that integrates genomic reference data into diagnostic and research applications? And what was the crucial moment when you realised the importance of creating it?

As a graduate student I spent a lot of time parsing, re-formatting, and integrating data just to answer some basic questions in genomics. At the same time (this was about two years ago) it was becoming clear that genomics was going to be an important industry with an as-yet unsolved IT component. David Caplan (SolveBio’s CTO) and I started hacking away at ways to simplify genome analysis in anticipation that interpreting DNA would be a significant problem in both research and the clinic. One thing we noticed was that there were no companies or services out there to help out people like us – people who were programming with genomic data. There were a few attempts at kludgy interfaces for bioinformatics, and a number of people were trying to solve the read-mapping computing infrastructure problem, but there were no “developer tools” for integrating genomic data. In part, that was because a couple of years ago there wasn’t that much data out there, so parsing, formatting, cleaning, indexing, updating, and integrating data wasn’t as big a problem as it is now (or will be in a few years). We set out to build an API to the world’s genomic data so that other programmers could build amazing applications with the data without having to repeat painful, meaningless tasks.

As we started talking to people about our API we realized how valuable a genomic data service is for the clinic. Genomics is no longer solely an academic problem. When we started talking to hospitals and commercial diagnostic labs, that’s when we realized that this is a crucial problem. That’s also when we realized that an API to public data is just the tip of the iceberg. Access to clinical genomic information that can be used as reference data is the key to interpreting DNA as a clinical metric.

3) The molecular technology revolution has made it possible to collect large amounts of precise medical data at low cost, but another problem has taken its place: the data are not in a language doctors can understand. How do you see this problem being solved?

The molecular technology revolution will make it possible to move from “Intuitive Medicine” to “Precision Medicine”, in the language of Clay Christensen and colleagues in “The Innovator’s Prescription”. Molecular markers are much closer to being unique fingerprints of the individual than whatever can be expressed by the English language in a doctor’s note. If these markers can be conclusively associated with diagnosis and treatment, medicine will be an order of magnitude better, faster and cheaper than it is now. Doctors can’t possibly be expected to read the three billion or so base pairs that make up the genome of every patient and recall which diagnosis and treatment is the best fit in light of the genetic information. This is where the digital revolution – i.e. computing – comes in. Aggregating siloed data while maintaining the privacy of the patients using bleeding-edge software will allow doctors to use clinical genomic data to improve medicine.

4) What are your plans for the future of SolveBio? Are you working on developing more tools/apps?

Our goal is to be the data delivery system for genomic medicine. We’ve built the tools necessary to integrate data into a genomic medical application, such as a diagnostic tool or variant annotator. We are now building some of these applications to make life easier for people running genetic tests.

5) Do you recognise the problem of limited sharing of genomics data for research and diagnosis? Can you think of an example of how the work of SolveBio supports data access and knowledge sharing within the genomics community?

The information we can glean from DNA sequence is only as good as the reference data that is used for research and diagnostic applications. We are particularly interested in genomics data from the perspective of how linking data from different sources creates the best possible reference for clinical genomics. This is, in a way, a data sharing problem.

I would add though that a huge disincentive to distributing data is the privacy, security, liability, and branding concern that clinical and commercial outfits are right to take into account. As a result, we are especially tailoring our platform to address those concerns.

However, even the data that is currently being “shared” openly, largely as a product of the taxpayer funded academic community, is very difficult and costly to access. Open data isn’t free. It involves building and maintaining substantial infrastructure to make sure the data is up-to-date and to verify quality. SolveBio solves that problem. Developers building DNA interpretation tools no longer have to worry about setting up their data infrastructure. They can integrate data with a few lines of code through SolveBio.

6) Which is the most important thing that should be done in the field of genetic data sharing and what does ethical data sharing mean to you?

Ethical data sharing means keeping patient data private and secure. If data is used for research or diagnostic purposes and needs to be transferred among doctors, scientists, or engineers then privacy and security is a key concern. Without privacy and security controls genomic data will never benefit from the aggregate knowledge of programmers and clinicians because patients will be rightly opposed to measuring, let alone distributing, their genomic information. Patient data belongs to the patient. Sometimes clinicians and researchers forget that. I definitely think the single most important thing to get right is the data privacy and security standard. The entire field depends upon it.


DNAdigest Symposium: A tour in Open Science in human genomics research

This past weekend, DNAdigest organized a Symposium on the topic “Open Science in human genomics research – challenges and inspirations”. The event brought together the DNAdigest team and a crowd of enthusiastic people with a keen interest in the topic. We are very pleased to say that the day turned out to be a success, with both participants and organizers enjoying the amazing talks of our speakers and the discussion sessions.

The day started with a short introduction on the topic by Fiona Nielsen.


Then our first speaker, Manuel Corpas, was a source of inspiration to all participants, talking us through his experience of sequencing the whole genomes of himself and his family and sharing this data widely with the whole world. Here is a link to the presentation he gave on the day.

The Symposium was organized in an Open Space conference format, where everybody could suggest topics related to Open Science or join whichever discussion sounded most interesting. Again, we used HackPad to take notes and capture interesting thoughts throughout the discussions. You can take a look at it here.


We had three more speakers invited to our Symposium. Tim Hubbard (slides) talked about how Genomics England engages the research community, both genomic scientists and patient communities, to collaborate on the data generation and data analysis of the 100k Genomes Project for the public benefit. Julia Wilson (slides) came as a representative of the Global Alliance. She introduced us to the GA4GH and explained how their work helps to implement standards for data sharing across genomics and health. Last, but not least, was Nick Sireau (slides). He walked us through an eight-step process showing exactly how the scientific community and the patient community can engage in collaborations, and how Open Science (sharing hypotheses, methods and results throughout the scientific process) may be either beneficial or challenging in this context.


The event came to its end with a summary of learning points and a rounding up by Fiona Nielsen.

We have also made a Storify summary where you can find a collection of all the tweets and most of the photos covering the day. There is also a gallery including all the pictures taken by our team members.

Now, to all former and future participants: if you enjoyed participating in these events, please donate to DNAdigest by texting DNAD14 £10 to 70070, so that we can continue organizing more of these interactive and exciting events in the future. You can also buy some of our cool DNAdigest T-shirts and mugs from our website shop.

It was great to see you all, and we look forward to welcoming you again for our next events!

DNAdigest team: Fiona, Adrian, Margi, Francis, Sebastian, Xocas and Tim

This event would not have been possible without the contributions of our generous sponsors.

Exploring Open Science: DNAdigest interviews Aridhia

As promised last week in the DNAdigest newsletter, we are giving life to our first blog post interview. Meet Mr Rodrigo Barnes, part of the Aridhia team. He kindly agreed to answer our questions about Aridhia and their views on genomic data sharing.


Mr Rodrigo Barnes, CTO of Aridhia

1. You are a part of the Aridhia team. Please tell us: what are the goals and interests of the company?

Aridhia started with the objective of using health informatics and analytics to improve efficiency and service delivery for healthcare providers, support the management of chronic disease and personalised medicine, and ultimately improve patient outcomes.

Good outcomes had already started to emerge in diabetes and other chronic diseases, through some of the work undertaken by the NHS in Scotland and led by one of our founders, Professor Andrew Morris. This included providing clinicians and patients with access to up-to-date, rich information from different parts of the health system.

Aridhia has since developed new products and services to solve informatics challenges in the clinical and operational aspects of health. As a commercial organisation, we have worked on these opportunities in collaboration with healthcare providers, universities, innovation centres and other industry partners, to ensure that the end products are fit for purpose, and the benefits can be shared between our diverse stakeholders. We have always set high standards for ourselves, not just technically, but particularly when it comes to respecting people’s privacy and doing business with integrity.

2. What is your role in the organisation and how does your work support the mission of the company?

Although my background is in mathematics, I’ve worked as a programmer in software start-ups for the majority of my career. Since joining Aridhia as one of its first employees, I have designed and developed software for clinical data, often working closely with NHS staff and university researchers. This has been a great opportunity to work on (ethically) good problems and participate in multidisciplinary projects with some very smart, committed and hard-working people.

In the last year, I took on the CTO (Chief Technology Officer) role, which means I have to take a more strategic perspective on the business of health informatics. But I still work directly with customers and enjoy helping them develop new products.

3. What makes Aridhia unique?

We put collaboration at the very heart of everything we do. We work really hard to understand the different perspectives and motivations people bring to a project, and acknowledge expertise in others, but we’re also happy to assert our own contribution. We have also been lucky to have investors who recognise the challenges in this market and support our vision for addressing them.

4. Aridhia recently won a competition for helping businesses develop new technology to map and analyse genes, and more specifically to support the NHS’s efforts to map the whole genomes of patients with rare diseases or cancer. What phase are you at now, and have you developed an idea (or even a prototype) that you can tell us more about?

It’s a little early to say too much about our product plans, but we have identified a number of aspects within genomic medicine that we feel need to be addressed. Based on our extensive experience in the health field, we think a one size fits all approach won’t work when it comes to annotating genomes and delivering that information usefully into the NHS (and similar healthcare settings). There will be different user needs, of course, but there are also IT procurement and deployment challenges to tackle before any smart solution can become common practice in the NHS.

We strongly believe that there is a new generation of annotation products and services waiting to emerge from academic/health collaborations. We believe that clinical groups have the depth of knowledge and the databases of cases that are needed to provide real insight into complex diseases with genetic factors, and we are keen to help these SMEs and spin outs validate their technology and get them ‘to market’ in the NHS and healthcare settings around the world.

Overall, our initial objective is to help take world-class annotations out of research labs and into operational use in the NHS. These goals are very much in line with Genomics England’s mandate to improve health and wealth in the UK.

5. Aridhia is a part of The Kuwait Scotland eHealth Innovation Network (KSeHIN). Can you tell us something more about this project and what your plans for further development are?

Kuwait has one of the highest rates of obesity and diabetes in the world, and the Kuwait Ministry of Health has responsibility for tackling this important issue. We’ve worked with the Dasman Diabetes Centre in Kuwait and the University of Dundee to bring informatics, education and resources to improve diabetes care. The challenge from the initial phase is to scale up to a national system. We think there are good opportunities to work with the Ministry of Health in Kuwait to achieve their goals as well as working with the Dasman’s own genomics and research programmes. This project is an excellent example of the combination of skills and resources needed to make an impact on the burden of chronic disease.

6. Do you recognise the problem of limited sharing of genomics data for research and diagnosis? How does the work of Aridhia support data access and knowledge sharing within the genomics community?

This is a sensitive subject of course, and we have to acknowledge that this is data that can’t readily be anonymised. Sharing, if it’s permissible, won’t follow the patterns we are used to with other types of data. That’s why we took an interest in the work DNAdigest is doing.

Earlier in the year, Aridhia launched its collaborative data science platform, AnalytiXagility, which takes a tiered approach to the managed sharing of sensitive data. We make sure that we offer data owners and controllers what they need to feel comfortable sharing data. AnalytiXagility delivers a protocol for negotiation and sharing, backed by a ‘life-cycle’ or ‘lease’ approach to sharing, with audit systems to verify compliance. To date, this has primarily been used for clinical, imaging and genomics data.

In a ‘Research Safe Haven’ model, the analysts come to the data, and have access to that for the intended purpose and duration of their project. This system is in place at the Stratified Medicine Scotland – Innovation Centre, which already supports projects using genomic and clinical data. The model we are developing for genomic data extends that paradigm of bringing computing to the data. We are taking this step by step and working with partners and customers to strengthen the system.

From a research perspective, the challenges are likely to be related to having enough linked clinical data, but also having enough samples and controls to get a meaningful result. So we think we will see standards emerging for federated models – research groups will try to apply their analysis against raw genomic data at multiple centres using something like the Global Alliance for Genomics and Health (GA4GH) API, and then collate results for analysis under a research safe haven model. We recently joined the Global Alliance and will bring our experience of working with electronic patient records and clinical informatics to the table.

7. What are your thoughts on the most important thing that should be done in the field of genetic data sharing?

Trust and transparency are important factors. I am interested in seeing what could be done to establish protocols and accreditations that would give participants visibility of how data is being used and how the benefits are shared.


Giving research data the credit it’s due

In many ways, the currency of the scientific world is publications. Published articles are seen as proof – often by colleagues and future employers – of the quality, relevance and impact of a researcher’s work. Scientists read papers to familiarize themselves with new results and techniques, and then they cite those papers in their own publications, increasing the recognition and spread of the most useful articles. However, while there is undoubtedly a role for publishing a nicely-packaged, (hopefully) well-written interpretation of one’s work, are publications really the most valuable product that we as scientists have to offer one another?

As biology moves more and more towards large-scale, high-throughput techniques – think all of the ‘omics – an increasingly large proportion of researchers’ time and effort is spent generating, processing and analyzing datasets. In genomics, large sequencing consortia like the Human Genome Project or ENCODE  were funded in part to generate public resources that could serve as roadmaps to guide future scientists. However, in smaller labs, all too often after a particular set of questions is answered, large datasets end up languishing on a dusty server somewhere. Even for projects whose express purpose is to create a resource for the community, the process of curating, annotating and making data available is a time-consuming and often thankless task.


Current genomics data repositories like GEO and ArrayExpress serve an important role in making datasets available to the public, but they typically contain data that is already described in a published article; citing the dataset is typically secondary to citing the paper. If more, easier-to-use platforms existed for publishing datasets themselves, alongside methods to quantify the use and impact of these datasets, it might help drive a shift away from the mindset of ascribing value purely to journal articles towards a more holistic approach where the actual products of research projects – including datasets as well as code or software tools used to analyse them, in addition to articles – are valued. Such a shift could bring benefits to all levels of biological research, from ensuring that students who toiled for years to produce a dataset get adequate credit for their work, to encouraging greater sharing and reuse of data that might not have made it into a paper but still has the potential to yield scientific insights.

Tools and platforms to do just this are gradually emerging and gaining recognition in the biological community. Figshare is a particularly promising platform that allows for the sharing and discovery of many types of research outputs, including datasets as well as papers, posters and various media formats. Importantly, items uploaded to Figshare are assigned a Digital Object Identifier (DOI), which provides a unique and persistent link to each item and allows it to be easily cited. This is analogous to the treatment of articles on preprint servers such as arXiv and bioRxiv, whose use is also growing in biological disciplines; however, Figshare is more flexible in terms of the types of research output it accepts. In addition to the space and ability to share and cite data, the research community could benefit from better quantification of data citation and impact. Building on the altmetrics movement, which attempts to provide alternative measures of the impact of scientific articles besides the traditional journal impact factor, a new Data-Level Metrics pilot project has recently been announced as a collaboration between PLOS, the California Digital Library and DataONE. The goal of this project is to create a new set of metrics that quantify usage and impact of shared datasets.
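The persistent-link property of a DOI mentioned above is mechanically simple: whatever repository hosts the item, prefixing the identifier with the doi.org resolver yields a stable, citable URL. A minimal sketch (the DOI and metadata below are made-up placeholders, not a real Figshare deposit):

```python
# Sketch of how a DOI becomes a persistent, citable link to a dataset.

def doi_url(doi: str) -> str:
    """Turn a bare DOI into its persistent resolver URL."""
    return f"https://doi.org/{doi}"

def cite_dataset(authors: str, year: int, title: str, doi: str) -> str:
    """Assemble a simple dataset citation ending in the DOI link."""
    return f"{authors} ({year}). {title} [Data set]. {doi_url(doi)}"

print(cite_dataset("Smith, J.", 2015,
                   "Example RNA-seq dataset",
                   "10.1234/figshare.0000000"))
```

Because the resolver redirects to wherever the item currently lives, citations built this way keep working even if the hosting platform reorganises its URLs, which is exactly what makes datasets citable on the same footing as articles.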

Although slow at times, the biological research community is gradually adapting to the new needs and possibilities that come along with high-throughput datasets. Particularly in the field of genomics, I hope that researchers will continue to push for and embrace innovative ways of sharing their data. If data citation becomes the new standard, it could facilitate collaboration and reproducibility while helping to diversify the range of outputs that scientists consider valuable. Hopefully, the combination of easy-to-use platforms and metrics that capture the impact of non-traditional research outputs will provide incentives to researchers to make their data available and encourage the continued growth of sharing, recognizing and citing biological datasets.

The value of sharing your know-how openly

In June 2010 I graduated from the University of Sheffield with a Doctor of Philosophy degree in Electronic Engineering and quickly embarked upon the typical academic career trajectory: I participated in conferences in the US and Asia, and took part in the race to publish papers in the best-regarded academic journals in my field. Over time I achieved a respectable standing amongst my peers, but I could not shake the feeling that there was more to be done to propel my career and give it a stronger aim.


How I discovered academic papers are not the only solution to progress my career

Sometime in the fall of 2012 I attended a workshop aimed at helping scientists promote themselves, called “Making the most of your Postdoc”. Among the various pieces of advice offered to us, one in particular stuck with me: “raise your profile by creating a profile”. The person leading the workshop gave the example of a fellow researcher who had created an “about me” profile page that stated his area of interest and listed some useful information such as past publications, presentations and grants he had obtained.

A few days later I was wondering how I could best advertise some of my non-peer-reviewed but equally important practical know-how, such as “troubleshooting problems in order to keep my research equipment operational” or “knowing what every single wire does inside that hardware rack”. In fact, I had acquired a vast amount of non-peer-reviewed knowledge in order to successfully create my peer-reviewed output. The act of designing, building, re-designing, fixing and improving things had become so routine that I hardly noticed how impressive it was to an outsider until I started trying to explain what I was doing to my first PhD students. By the time my third PhD student had arrived, and I was explaining the same concepts and ideas, I realised my knowledge could well be extremely useful to others as well. And then it struck me: why not create a blog to share all those bits of knowledge with those who might find them useful? This “Eureka!” moment led to the inception of my blog, which was inaugurated in November 2012 with my first series of know-how posts.

Blogging allowed me to reach a whole new level of recognition among my peers

When I started http://faebianbastiman.wordpress.com/ I had of course expected some interest from my fellow colleagues and PhD students. However, the positive reaction was truly a surprise to me: my visitor numbers climbed steadily over the first few months, and by mid-2013 I was getting 400 unique visitors per month. I also started to get comments on my posts, as well as questions from other researchers in academia and industry. I answered those questions dutifully and wrote a new series of articles to cover the missing content. It was not long before the first consulting requests reached my mailbox. It occurred to me that my knowledge was not only useful to others; that usefulness gave it an inherent value.

Now, not only am I able to draw a regular income from my consulting services, but I am on the way to doubling my previous income as an academic researcher with consulting alone. Additionally, working with industry provides me with a pleasant break from my closeted research existence and the opportunity to meet many interesting new people who recognize me for my expertise.

Faebian Bastiman

HackYourPhD: reporting on Open Science from the US @ Boca Raton/Paris, USA/France

A summer trip through the US to discover and document Open Science projects? When we first heard about HackYourPhD, we were excited to notice how similar the concept of their project was to our own. The idea was initiated last year by two young French researchers, Célya Gruson-Daniel & Guillaume Dumas, and “aims to bring more collaboration, transparency, and openness in the current practices of research.” Célya travelled for three months from Boca Raton (Florida) to Washington DC, gathering information and meeting people and groups active in the Open Science scene.

While this round trip in the US is now over, HackYourPhD is still active and has become an online community where the research continues. Read below the interview with the two people behind this fantastic initiative and discover how the idea came to life, the insights from the trip and what is coming next.

1) Hi Célya & Guillaume, you both co-founded HackYourPhD, a community focused on Open Science which launched a globetrotting initiative in the US last year. We are really curious how you got this idea and would love to know more about it. Don’t forget to introduce yourselves and the concept of Open Science too!

Hi Margo & Alex, thanks for this interview. We discovered your great project a few months ago, and now we are very happy to help you, since it is closely related to what we tried to do last summer with “HackYourPhD aux States”. But before speaking about this Open Science tour across the USA, let us first recall the genesis and the aim of HackYourPhD in general. HackYourPhD is a community which gathers young researchers, PhD and master's students, designers, social entrepreneurs, etc. around the issues raised by the Open Science movement. We co-founded this initiative a year ago. The idea of this community emerged from our mutual interest in research and its current practices. Guillaume is a postdoc in cognitive science and complex systems. He is also involved in art-science collaborative projects and scientific outreach. Célya specialises in science communication. After two years as community manager for a scientific social network based on Open Access, she is now working in science communication on different projects related to MOOCs and higher education. We are both strong advocates of Open Science, and that is mainly why we came up with HackYourPhD. While Guillaume has tried to integrate Open Science into his practice, Célya wanted to explore its different facets through a PhD. But first, she wanted to meet the multiple actors behind this umbrella word. This is what motivated “HackYourPhD aux States”, the globetrotting initiative per se.

2) Why did it make sense especially in the US to follow and report Open Science projects? Could you imagine yourself doing it in other countries? What about France?

Because it was in English-speaking countries that the Open Science movement started. That is thus also where it is the most developed to date, from Open Access (e.g. PLoS) to hackerspaces (e.g. Noisebridge). There is also a big network of entrepreneurs in Open Science, which is specifically an aspect we were interested in. Célya thus decided to first look at the source of the movement and take her time (three months) before doing a similar exploration in Europe with shorter missions (e.g. one week). Concerning France, we have already begun to monitor what is taking off, from citizen science to open data and open access. While we certainly have a better view now, the movement is still embryonic. But the movement will also take other forms, and that is also what we are interested in. Célya is thinking of doing her PhD in an action-research mode, being both observer and actor in this dynamic construction of the French Open Science movement.

3) From our experience, we could schedule our encounters and events both before starting the journey and on the way. Was that the same for you? How did you select your stops, the projects documented and the persons interviewed? Is Open Science a widespread topic, or was it actually difficult to find cases for your research?

Célya already had a blueprint of the big cities and the main path to follow. With the help of the HackYourPhD community, she gathered many contacts and put together a first database of locations to visit and people to meet. Before starting, the first step (San Diego and the Bay Area) was almost fully scheduled. Then the rest of the trip was set up on the way. A few important meetings were already scheduled, of course (e.g. the Center for Open Science, the Mozilla Science Lab, etc.), but throughout the trip new contacts were offered spontaneously by the people interviewed. Serendipity is your friend there! Regarding difficulties in finding cases, this really depends on the city. While San Francisco was really easy, Boston, for example, which is full of nice projects, was nevertheless more challenging.

4) We know it is difficult to point out just one of them … but could you tell us what is your favourite or one of the most relevant Open Science initiatives you have discovered?

When Célya was in Cambridge, she visited the Institute for Quantitative Social Science. She met the Director of Data Science, Mercè Crosas, and her team, and discovered the Dataverse Network project. It is one of the most relevant Open Science initiatives she encountered. Indeed, this project combines multiple facets of Open Science: it consists of building a platform allowing any researcher to archive, share and cite their data, and it has many functionalities cleverly linking it to other aspects of Open Science (open access journals with OJS, citation, altmetrics, etc.). Here is the interview with Mercè Crosas.

5) As we discussed previously with Fiona Nielsen, sharing knowledge in the scientific domain has a positive impact. After your research, why does Open Science matter and how does it change the way scientists have been working till now?

Open Science provides many ways to increase efficiency in scientific practices. For example, Open Data allows researchers to collaborate better; while this solution seems obvious to many, it appears as a necessity when it comes to big science (e.g. CERN, ENCODE, Blue Brain, etc.). Open Data also means more transparency, which is critical to address the lack of reproducibility and even fraud.

Open Access presents several advantages, but the main one remains guaranteeing everyone access to scientific papers. As a journalist, Célya has faced the issue of paywalls many times, and it is always frustrating. Last but not least, Open Science opens up new possibilities for collaboration between academia and other spheres (entrepreneurs, civil society, NGOs, etc.). Science is a social and collective endeavour; it thus needs contact with society and must leave its ivory tower. The Open Science movement is profoundly moving in that direction, and that is why it matters.

6) As you know, Open Steps focuses on Open Data related projects. Quoting you: “In Seattle, I noticed a strong orientation of Open Science issues around Open Data.” Could you tell us more about this relation and the current situation in the US? Could you point us to any relevant Open Data initiative that we might want to document?

The role of Open Data depends on the scientific field. Indeed, Seattle was a rich environment on that topic, but this is certainly due to the software culture in the city (Amazon, Microsoft, etc.). The Open Data topic is related to Big Data; thus, the key domains are genetics, neuroscience, and health in general. Many projects are interesting. We already mentioned the Dataverse Network, but you may also enjoy the DELSA Global project (interview with Eugene Kolker) or Sage Bionetworks.

7) There are a lot of sponsors supporting you. Was it easy to convince them? Is that how you finance 100% of the project, or do you have other sources of income?

All the sponsorships came through the crowdfunding campaign on KissKissBankBank. It was not a question of convincing them; they simply demonstrated the need to cover the topic of Open Science in France. Their financial help represents 36% of the total amount collected.

There were no other sources of income. The travel was not expensive, since Célya used collaborative-economy solutions (couchsurfing, carpooling, etc.).

8) Now the trip is over… but HackYourPhD is still running. How is it going now?

We are pursuing the daily collaborative curation, with almost a thousand people in our Facebook group. We are also organising several events, mainly in Paris but with a growing network in other cities and even other countries. The community is self-organised but needs some structure. We are currently thinking about this specific issue and hope 2014 will be a great year for the project!

Merci à vous deux! (Thank you both!)

Interview with Fiona Nielsen, DNAdigest.org, Cambridge, UK

Recently, we learned about a project which shows how the principles of Knowledge Sharing can be applied to the scientific domain, specifically to genomics data. DNAdigest is a Not-for-Profit Organisation founded and located in Cambridge, UK, by a group of individuals from diverse backgrounds who all want to see genomics used to its full potential to aid medical research. The objective of DNAdigest is to provide a simple, secure and effective mechanism for sharing genomics data for research without compromising the data privacy of the individual contributors.

From the beginning, this concept sounded very appealing to us. That’s why we contacted Fiona Nielsen, founder of this great initiative, to talk about the goals of the project, its approach to making use of such sensitive data, and the current status of data sharing within the scientific community.

DNAdigest is still in the development process but already shows a promising future. Not only have they been selected for the Wayra UnLtd accelerator programme for social entrepreneurs, but they are also working hard on building a community around the idea, organising events like hack days and workshops. Since no one can describe the project better than its creator, we invite you to discover more about it through the following sequence of questions and answers.


1) Fiona, could you first introduce yourself and DNAdigest?

I am a bioinformatics scientist turned entrepreneur. I used to work in a biotech company where I was developing tools for interpretation of next-generation sequencing data and I took part in a number of projects where I was doing the data analysis of cancer sequencing samples. During my work, I realised how difficult it is to find and get access to genomics data for research.

DNAdigest was founded as an entity to provide a novel mechanism for sharing of data, aligning the interests of patients and researchers through a data broker mechanism, enabling easy access to anonymised aggregated data.

2) Why is it important to share genomics data? Quoting your website, the current state of sharing this information is “embarrassingly limited”. How does DNAdigest address this problem?

The human genome is very complex. It is made up of 3 billion base pairs and varies from individual to individual, so when you as a researcher attempt to nail down the genetic variation that is causing a genetic disease, it is equivalent to looking for a needle in a haystack. The only way to narrow your search is by filtering out genetic variation that has been seen before in healthy individuals and annotating the variation that is left with the disease(s) in which it occurs. This type of comparative analysis requires looking at variants from as many samples as possible; ideally you need to compare against tens of thousands of samples for your comparison to approach statistical significance. Accessing thousands of samples today is not only difficult in terms of permissions; in terms of mere storage and network capacity, it is simply not practical for every team that wants to do a comparison to download huge datasets. DNAdigest is developing a data broker which will allow the researcher to submit queries for specific variants, with only the aggregated information about the selected variants returned as a result. For example, when examining a specific mutation in cancer, the query could be “what is the frequency of this mutation in cancer samples?” and the result would be returned as a frequency, e.g. 3%. The aim of DNAdigest is to reduce the time to discover, access and retrieve the data relevant to genomic comparison.
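To make the broker idea concrete, here is a minimal sketch in Python of an aggregation-only query of the kind described above. Everything here is illustrative: the sample records, the variant identifier and the function name are assumptions of ours, not DNAdigest's actual API, which was still under development at the time of this interview.

```python
# Illustrative sketch of an aggregation-only data broker query.
# Individual records stay behind the broker; callers receive only
# an aggregate frequency, never per-sample data.

def variant_frequency(samples, variant_id, disease=None):
    """Return the percentage of samples (optionally restricted to a
    disease cohort) carrying the given variant, or None if the cohort
    is empty."""
    cohort = [s for s in samples if disease is None or s["disease"] == disease]
    if not cohort:
        return None
    carriers = sum(1 for s in cohort if variant_id in s["variants"])
    return round(100.0 * carriers / len(cohort), 1)

# Hypothetical anonymised dataset held by the broker.
samples = [
    {"disease": "cancer", "variants": {"chr17:g.7578406C>T"}},
    {"disease": "cancer", "variants": set()},
    {"disease": "healthy", "variants": set()},
]

# "What is the frequency of this mutation in cancer samples?"
print(variant_frequency(samples, "chr17:g.7578406C>T", disease="cancer"))  # 50.0
```

The key design point is that the raw `samples` list never leaves the broker; only the computed frequency does, which is what makes anonymisation through aggregation possible.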

3) Your idea seems quite revolutionary and very much needed. What has the reaction of the scientific community been towards your initiative so far? Are the principles behind sharing and opening data something new for scientists?

Similar approaches have been suggested, and a handful have been prototyped within the academic community before. However, all of the projects for sharing data in an academic setting have ultimately faced the same problems: they do not have the resources to scale up their solution to work for the entire community, and even if they had the ambition to scale up, they would find it extremely difficult to obtain funding for infrastructure projects from traditional research funding. In general, there is a positive attitude towards data sharing in research. However, the immediate concerns of researchers revolve around writing papers and not so much around building common infrastructure.
Based on this knowledge of the community, I realised that a separate entity is needed to take initiative for developing a solution, drawing on the knowledge generated in academia, and building an organisation that can do independent fundraising and collaborate across institutions. We have registered DNAdigest as a charity so that we can function as an independent and trusted third party to provide the community with a feasible solution.

4) What do researchers have to do in order to access genomics data on DNAdigest.org? Can individuals share their genomics information directly on the platform?

We are still designing and developing the platform, so I cannot yet give you the exact user guide. Our objective is not to store entire datasets, but to connect to existing data repositories and data management systems with a common API that allows queries into the metadata to select samples and, for the samples for which patient consent is available, to query into the genetic data to provide aggregated statistics collected across datasets.

We have no plans at this point to provide storage capacity for individual genomic data. Currently, for this purpose, an individual would have to find an associated repository, for example through their patient community, which would allow storage of their genomic data.

5) Sharing such private information is a big concern for many people nowadays. How do you approach the privacy issue? What is your solution for this?

Our approach to privacy is to provide anonymisation through aggregation. We will provide an API from which it is possible to query for summary statistics over selections of the available data. For example, for a researcher interpreting a specific mutation for a patient with a genetic disease, the associated query for DNAdigest would be “what is the frequency of this mutation for patients with this genetic disease?”. The query could also be used to look for mutation frequencies in healthy individuals or in patients with related diseases.

6) Which kinds of projects could profit from DNAdigest.org?

DNAdigest is still at an early stage and we have a lot of work still to do in designing and implementing the secure query platform. The projects that are most likely to benefit from the resource of data that DNAdigest will make easily accessible are data analysis and interpretation of genetic variants in connection with rare diseases and other genetics research. In the bigger picture, a future of genomic medicine where diagnosis from genome sequencing is commonplace will only be possible if the means for interpretation, namely data access across patient groups and across repositories, becomes available.

7) We read on your blog that DNAdigest.org has been selected for the Wayra UnLtd accelerator. Congratulations on that! Do you benefit from other sources of support? And, in general, to what extent are investors supporting social enterprises and non-profit-oriented ideas?

We are very happy that we were selected for the Wayra UnLtd accelerator at this early stage of our project. The accelerator is not just an office space, but a community of startups and business-savvy people helping each other develop sustainable businesses. So far, DNAdigest has been bootstrapping our initiative with volunteer participation and charitable donations.

8) You have also organised a hack day in Cambridge, and even workshops, thus building an expanding community. How does DNAdigest.org benefit from these encounters? What are the results so far?

Engaging the community in our project is essential if we want to develop a new mechanism to change the existing culture and structure of data sharing. The stakeholders from academia, industry and patient groups all have very different priorities with regard to sharing of data. Through our hack day, we arrived at a more complete understanding of the stakeholder interests, and of the potential sustainable development models and technical implementations that may be feasible in the short and long term.

9) As you might know, we are particularly interested in Open Data. By accessing open information, developers are creating apps which solve certain problems. Is there already any app using open genomics data? If not, what could such an app look like?

Sensitive information like medical records and genetic sequences is unlikely to be released as Open Data; however, the knowledge generated from the data, such as statistics, can and should be made both public and easily available for the scientific community to build on. In addition, the metadata describing existing datasets currently residing in research repositories could be made openly available at no risk to privacy. However, a common problem in the research community is that it is difficult to provide incentives for researchers to spend time and effort registering their data in public repositories. Luckily, there is an increasing push from funding agencies to require that data produced with public funding be made publicly available.

Regarding apps: in the bioinformatics community there are many, many tools being developed to analyse proprietary data, and many tools are developed to make use of data made openly available through public databases. For two such sources of public data (but not patient data), see the UCSC Genome Browser and the Ensembl Genome Browser.

10) In your opinion, how can the scientific community profit from Open Data?

It would be ideal if there were a real shift in research practices such that researchers would register the existence of datasets even before publication (i.e. making the metadata Open Data), so that other researchers would have every opportunity to find and identify potential collaborators and sources of data for their research. For sensitive data, such as the genetic information and medical health record details of individual patients, we believe that a common interface is needed to make use of the wealth of data that is being produced today. We propose that DNAdigest can provide such alternative data access by working as the discovery and aggregation mechanism that will let you query across sensitive datasets.

Many thanks!

Read more about DNAdigest and sign up for the newsletter at DNAdigest.org