7 Predictions for “Open Data” in 2015

What’s going to happen to the “open data” movement in 2015? Here are Dennis D. McDonald’s predictions:

  1. Some high profile open data web sites are going to die. At some sites the lack of updates and lack of use will catch up with them. Others will see highly publicized discussions of errors and omissions. For some in the industry this will be a black eye. For others it will be an “I told you so” moment, causing great soul-searching and a re-emphasis on the need for effective program planning.
  2. Greater attention paid to cost, governance, and sustainability. In parallel with the above, there will be more attention paid to open data costs, governance, and program sustainability. Partly this will be in response to the issues raised in (1), and partly because the “movement” is maturing. As people move beyond the low-hanging-fruit and cherry-picking stage, they will give more thought to what it takes to manage an open data program effectively.
  3. Greater emphasis on standards, open source, and APIs. This is another aspect of the natural evolution of the movement. Much of the open data movement has relied on “bottom up” innovation and the enthusiasm of a developer community accustomed to operating on the periphery of the tech establishment. Some of this is generational as younger developers move into positions of authority. Some is due to the ease with which data and tools can be obtained and combined by individuals and groups working remotely and collaborating via systems like GitHub.
  4. More focus on economic impacts of open data in developed and developing countries alike. While many open data programs have been justified on the basis of laudable goals such as “transparency” and “civic engagement,” sponsors will inevitably ask questions about “impact” as update costs begin to roll in.  Some of the most important questions are also the simplest to ask but the hardest to answer, such as, “Are the people we hoped would use the data actually using the data?” and “Is using the data doing any good?”
  5. More blurring of the distinctions between public sector and private sector data. One of the basic ideas behind making government data “open” is to allow the public and entrepreneurs to use and combine public data with other data in new and useful ways. It is inevitable that private sector data will come into the mix. When public and private data are combined some interesting intellectual property, ownership, and pricing questions will be raised. Managers must be ready to address questions such as, “Why should I have to pay for a product that contains data I paid to collect via my tax dollars?”
  6. Inclusion of open data features in mainstream ERP, database, middleware, and CRM products. Just as vendors have incorporated social networking and collaboration features with older products, so too will open data features be added to mainstream enterprise products to enable access via file downloads, visualization, and documented APIs. Such features will be justified by the extra utility and engagement they support. Some vendors will incorporate monetization features to make it easier to track and charge for data the new tools expose.
  7. Continued challenges to open data ROI and impact measurement. As those experienced with usage metrics will tell you, it’s not just usage that’s important; it’s the impact of usage that really counts. In the coming year this focus on open data impact measurement will continue to grow. I take that as a good sign. I also predict that open data impact measurement will continue to be a challenge. Just as in the web site world it’s easier to measure pageviews than the impacts of the information those pageviews communicate, so too will it remain easier to measure data file downloads and API calls than the impacts of the data thus obtained (see the sketch below).
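
To make predictions 6 and 7 concrete, here is a minimal sketch, not any vendor’s actual product, of such an “open data feature”: a documented API that serves a dataset file and counts usage. The route paths, dataset names, and storage layout are hypothetical. The counters show why downloads and API calls are the easy half of the measurement problem; impact is invisible to code like this.

```python
# A hedged sketch of an "open data feature" in a mainstream product:
# a documented API that serves a dataset file and counts usage.
# Route paths, dataset names, and storage layout are hypothetical.
from collections import Counter

from flask import Flask, jsonify, send_file

app = Flask(__name__)
usage = Counter()  # usage events are trivially countable

@app.route("/api/v1/datasets/<name>")
def dataset_metadata(name):
    usage[f"api:{name}"] += 1  # one more API call recorded
    return jsonify({
        "name": name,
        "format": "csv",
        "download_url": f"/api/v1/datasets/{name}/download",
    })

@app.route("/api/v1/datasets/<name>/download")
def dataset_download(name):
    usage[f"download:{name}"] += 1  # one more file download recorded
    return send_file(f"data/{name}.csv", mimetype="text/csv")

# These counters answer "how much was it used?" cheaply. They say nothing
# about who used it or whether the use did any good: the impact questions
# in predictions 4 and 7.

if __name__ == "__main__":
    app.run()
```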

By Dennis D. McDonald, Ph.D.

Exploring Open Science n°4: DNAdigest interviews Nowomics

This week I would like to introduce you to Richard Smith, founder and software developer of Nowomics. He kindly agreed to answer some questions for our blog post series, so here it is – first-hand information on Nowomics. Keep reading to find out more about this company.

Richard Smith, founder and software developer of Nowomics

1. Could you please give us a short introduction to Nowomics (goals, interests, mission)?

Nowomics is a free website to help life scientists keep up with the latest papers and data relevant to their research. It lets researchers ‘follow’ genes and keywords to build their own news feed of what’s new and popular in their field. The aim is to help scientists discover the most useful information and avoid missing important journal articles, without spending a lot of their time searching websites.

2. What makes Nowomics unique?

Nowomics tracks new papers, but also other sources of curated biological annotation and experimental data. It can tell you if a gene you work on has new annotation added or has been linked to a disease in a recent study. The aim is to build knowledge of these biological relationships into the software to help scientists navigate and discover information, rather than recommending papers simply by text similarity.
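
As a rough editorial illustration of that design, here is a minimal sketch in Python. The relationship table, weights, and example data are invented for the example and are not Nowomics’ actual code or data; the point is ranking by curated biological relationships rather than by text similarity alone.

```python
# A minimal sketch of relationship-aware ranking: score a paper for a user
# by the genes they follow plus curated annotation linked to those genes,
# instead of plain text similarity. The table, weights, and example data
# are invented for illustration.

# Curated annotation: gene -> related terms (diseases, pathways, interactors).
RELATED = {
    "TP53": {"li-fraumeni syndrome", "apoptosis", "MDM2"},
    "BRCA1": {"breast cancer", "dna repair", "BARD1"},
}

def score(paper_terms, followed_genes):
    """Direct mentions of a followed gene count most; mentions of its
    curated relationships still count; unrelated text counts for nothing."""
    total = 0.0
    for gene in followed_genes:
        if gene in paper_terms:
            total += 1.0  # the followed gene itself is mentioned
        total += 0.5 * len(paper_terms & RELATED.get(gene, set()))
    return total

paper = {"BRCA1", "dna repair", "ovarian cancer"}
print(score(paper, {"BRCA1", "TP53"}))  # 1.5: one gene hit, one related term
```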

3. When did you realise that a tool such as Nowomics would be of great help to the genomic research community?

I’ve been building websites and databases for biologists for a long time and have heard from many scientists how hard it is to keep up with the flood of new information. Around 20,000 biomedical journal articles are published every week, and there are hundreds of sources of data online; receiving lots of emails with lists of paper titles isn’t a great solution. In social media, interactive news feeds that adapt to an individual are now commonly used as an excellent way to consume large amounts of new information, and I wanted to apply these principles to tracking biology research.

4. Which part of developing the tool did you find most challenging?

As with a lot of software, making sure Nowomics is as useful as possible to users has been the hardest part. It’s quite straightforward to identify a problem and build some software, but making sure the two are correctly aligned to provide maximum value to users has been the difficult part. It has meant trying many things, demonstrating ideas and listening to a lot of feedback. Handling large amounts of data and writing text mining software to identify thousands of biological terms is simple by comparison!

5. What are your plans for the future of Nowomics? Are you working on adding new features/apps?

There are lots of new features planned. Currently Nowomics focuses on genes/proteins and selected organisms. We’ll soon make this much broader, so scientists will be able to follow diseases, pathways, species, processes and many other keywords. We’re working on how these terms can be combined together for fine grained control of what appears in news feeds. It’s also important to make sharing with colleagues and recommending research extremely simple.

6. Can you think of examples of how Nowomics supports data access and knowledge dissemination within the genomics community?

The first step to sharing data sets and accessing research is for the right people to know they exist. This is exactly what Nowomics was set up to achieve: to benefit both scientists who need to be alerted to useful information and those generating or funding research who want to reach the best possible audience. Hopefully Nowomics will also alert people to relevant shared genomics data in future.

7. What does ethical data sharing mean to you?

For data that can advance scientific and medical research the most ethical thing to do is to share it with other researchers to help make progress. This is especially true for data resulting from publicly funded research. However, with medical and genomics data the issues of confidentiality and privacy must take priority, and individuals must be aware what their information may be used for.

8. What are the most important things that you think should be done in the field of genetic data sharing?

The challenge is to find a way to unlock the huge potential of sharing genomics data for analysis while respecting the very real privacy concerns. A platform that enables sharing in a secure, controlled manner which preserves privacy and anonymity seems essential; I’m very interested in what DNAdigest is doing in this regard.


Exploring Open Science n°3: DNAdigest interviews NGS logistics

NGS logistics is the next project featured in our blog interviews. We have interviewed Amin Ardeshirdavani, a PhD student involved in the creation of this web-based application. Take a look at the interview to find out why this tool has become very popular within KU Leuven.


1. What is NGS logistics?

NGS-Logistics is a web-based application which accelerates the federated analysis of Next Generation Sequencing data across different centres. NGS-Logistics acts like a real logistics company: you order something from the Internet, the owner processes your request, and then ships it through a safe and trusted logistics company. In the case of NGS-Logistics, the goods are human sequence data, and researchers ask for possible variations and their frequency across the whole population. We try to deliver the answers in the fastest and safest possible way.
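
Here is a minimal sketch of the federated pattern Amin describes, with hypothetical centre endpoints and response fields: each centre answers a variant query with aggregate counts for its own samples, and only those counts travel over the network, so raw sequence data never leaves a centre.

```python
# A minimal sketch of the federated query pattern, with hypothetical centre
# endpoints and response fields: every centre answers with aggregate counts
# for its own samples, and only those counts travel over the network.
import requests

CENTRES = ["https://centre-a.example/api", "https://centre-b.example/api"]

def variant_frequency(chrom, pos, ref, alt):
    """Ask each centre for local carrier counts and combine them."""
    carriers = samples = 0
    for base in CENTRES:
        resp = requests.get(
            f"{base}/variant",
            params={"chrom": chrom, "pos": pos, "ref": ref, "alt": alt},
            timeout=10,
        )
        counts = resp.json()  # e.g. {"carriers": 12, "samples": 4000}
        carriers += counts["carriers"]
        samples += counts["samples"]
    return carriers / samples if samples else 0.0

# variant_frequency("17", 41276045, "C", "T") would return the combined
# frequency without any raw sequence data leaving a centre.
```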

2. What is your part in NGS logistics?

Right now I am a PhD student at KU Leuven, and the whole idea of my PhD project is designing and developing new data structures for analysing the massive amounts of data produced by Next Generation Sequencing machines. NGS logistics is exactly that. I have done the whole design and development of the application and database. I would also like to acknowledge all the people from KU Leuven, the ESAT IT department, the UZ Leuven IT department, and the UZ Genomics Core department who assisted me on this project, especially Erika Souche, for their kind support.

3. When did you first start working on the idea of creating NGS logistics and what made you think it would be something useful?

It was almost three years ago, when I had a meeting with my promotor, Professor Yves Moreau, and he had an idea to somehow connect sequencing centres and query their data without moving them into one repository. As a person with an IT background it wasn’t that difficult for me to develop an application, but there were lots of practical issues that needed to be taken care of. The majority of these issues are related to protecting the privacy of the individuals, because the data we deal with come from human genome sequencing experiments, and people are rightfully worried about how this data will be used and protected. At the time of my first meeting there was no system in place to share this data, but many people understood the need for this kind of structure and for us to start working on it. As we know, information can be a true scientific goldmine, and by having access to more data we are able to produce more useful information. The novelty of the data, the possibility of sharing this wealth of information, and the complexity of this kind of application make me eager to work on this project.

4. How does your open source tool work, and who is it designed for?

NGS-Logistics has three modules: the web interface, the access control list, and the query manager. The source code of each of these modules, plus the database structure behind them, is available upon simple request. As the modules are being upgraded continuously, I have not made a public repository for the source code yet. However, if someone is interested in gaining access to the source code, it will be our pleasure to give it to them; I do think the whole idea of data sharing is more important than the source code itself. It is also our pleasure to share with others our experience of the problems and issues we had to tackle during the past three years. In general, NGS-Logistics is designed to help researchers save time when they need to have access to more data. It will help them get a better overview of their questions and, if they need access to the actual data, it will help them get the most useful data sets that match their cases.
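
For readers who want a feel for how the three modules fit together, here is a thin illustrative sketch; the class and method names are invented and do not come from the actual NGS-Logistics source.

```python
# A thin, invented sketch of how the three modules could fit together;
# class and method names do not come from the NGS-Logistics source.

class AccessControlList:
    """Knows which data sets each registered user may query."""
    def __init__(self, grants):
        self.grants = grants  # user -> set of data set ids

    def datasets_for(self, user):
        return self.grants.get(user, set())

class QueryManager:
    """Runs a query against only the data sets the ACL permits."""
    def __init__(self, acl):
        self.acl = acl

    def run(self, user, query):
        allowed = self.acl.datasets_for(user)
        if not allowed:
            raise PermissionError(f"{user} has no data set access")
        return {"query": query, "datasets": sorted(allowed)}

# The web interface would sit in front of this: it authenticates the user,
# forwards the query, and renders the aggregate result.
acl = AccessControlList({"alice@example.org": {"ds1", "ds42"}})
print(QueryManager(acl).run("alice@example.org", "chr17:41276045 C>T"))
```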

5. Who has access to the system and how do you manage access permissions?

Researchers with a valid email address and affiliation are welcome to register and use the application. This means that we need to know who is querying the data, to prevent structural queries which may lead to identifying an individual. I spent almost 20 months on the Access Control List (ACL) module. Most of the tasks are controlled and automatically updated by the system itself. Centre admins are responsible for updating the list of samples they want to share with others. PIs and their power users are responsible for grouping the samples into data sets and assigning them to users and groups. The ACL has a very rich and user-friendly interface that makes it very easy to learn and use.
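
The roles described above suggest a simple data model. The sketch below is an invented illustration of those responsibilities, not the ACL module itself: centre admins decide which samples are shared, and PIs group shared samples into data sets assigned to users.

```python
# An invented data-model sketch of the roles above, not the ACL module
# itself: centre admins decide which samples are shared; PIs group shared
# samples into data sets and assign them to users.
from dataclasses import dataclass, field

@dataclass
class Centre:
    shared_samples: set = field(default_factory=set)  # maintained by the centre admin

@dataclass
class Dataset:
    samples: set
    users: set = field(default_factory=set)  # assigned by the PI or power users

centre = Centre()
centre.shared_samples |= {"S001", "S002", "S003"}  # admin shares samples

ds = Dataset(samples={"S001", "S002"})
ds.users.add("alice@example.org")  # PI grants a user access

def may_query(user, sample, dataset, centre):
    """A query may touch a sample only if the centre shares it AND the user
    was assigned a data set containing it."""
    return (user in dataset.users
            and sample in dataset.samples
            and sample in centre.shared_samples)

print(may_query("alice@example.org", "S001", ds, centre))  # True
print(may_query("alice@example.org", "S003", ds, centre))  # False: not in her data set
```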

6. In what way do you think data sharing should be further improved?

Because of all the concerns around the term “Data Sharing”, I prefer to use the term “Result Sharing”. In our framework we mostly try to answer very high-level questions, such as “What is the prevalence of a certain mutation in different populations?”, while preventing any private information from leaking out. By having more access to data we can gain more insight and produce more useful information; as Aristotle said, “The whole is greater than the sum of its parts.” On the other hand, we always have to be careful about the consequences of sharing.
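
One common way to implement “Result Sharing” of this kind is to return only aggregate frequencies and to suppress counts small enough to single out individuals. The sketch below is a generic illustration of that idea; the threshold value and function names are assumptions, not NGS-Logistics internals.

```python
# A generic sketch of "result sharing" with a re-identification guard;
# the threshold value and names are assumptions, not NGS-Logistics internals.

MIN_CARRIERS = 5  # below this, a nonzero answer could point to individuals

def shared_result(carriers, total_samples):
    """Return only an aggregate frequency, suppressing very small counts."""
    if total_samples == 0:
        return None
    if 0 < carriers < MIN_CARRIERS:
        return {"frequency": None, "note": "suppressed: count too small"}
    return {"frequency": carriers / total_samples}

print(shared_result(2, 10_000))    # suppressed, protects the two carriers
print(shared_result(137, 10_000))  # {'frequency': 0.0137}
```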

7. What does ethical data sharing mean to you?

It means everything and nothing. Why? Because ethics really depends on the subject and the location we are talking about. If we talk about sharing weather forecast data, I would say it is not important and does not have much meaning. But when we talk about data produced from human genomes, then we have to be careful. Legal frameworks differ a lot between countries. Some are very restrictive when it comes to dealing with sensitive and private data, whereas others are much less restrictive, mostly because they have different definitions of private data. In most cases, any information that allows us to uniquely identify a person is defined as private information, and as we know, it is possible to identify a person by his or her genome sequence. Therefore, I feel it is very important to keep track of what data is being used, by whom, when, at which level, and for what reason.


Amin Ardeshirdavani et al. have published this work in Genome Medicine 6:71: “NGS-Logistics: federated analysis of NGS sequence variants across multiple locations”. You can take a look at it here.