- Some high-profile open data web sites are going to die. At some sites the lack of updates and lack of use will catch up with them. Others will see highly publicized discussions of errors and omissions. For some in the industry this will be a black eye. For others it will be an "I told you so" moment, causing great soul-searching and a re-emphasis on the need for effective program planning.
- Greater attention paid to cost, governance, and sustainability. In parallel with the above, there will be more attention paid to open data costs, governance, and program sustainability. Partly this will be in response to the issues raised in (1) and partly because the "movement" is maturing. As people move beyond the low-hanging-fruit and cherry-picking stage they will be giving more thought to what it takes to manage an open data program effectively.
- Greater emphasis on standards, open source, and APIs. This is another aspect of the natural evolution of the movement. Much of the open data movement has relied on "bottom up" innovation and the enthusiasm of a developer community accustomed to operating on the periphery of the tech establishment. Some of this is generational, as younger developers move into positions of authority. Some is due to the ease with which data and tools can be obtained and combined by individuals and groups working remotely and collaborating via systems like GitHub.
- More focus on economic impacts of open data in developed and developing countries alike. While many open data programs have been justified on the basis of laudable goals such as "transparency" and "civic engagement," sponsors will inevitably ask questions about "impact" as update costs begin to roll in. Some of the most important questions are also the simplest to ask but the hardest to answer, such as, "Are the people we hoped would use the data actually using the data?" and "Is using the data doing any good?"
- More blurring of the distinctions between public sector and private sector data. One of the basic ideas behind making government data "open" is to allow the public and entrepreneurs to use and combine public data with other data in new and useful ways. It is inevitable that private sector data will come into the mix. When public and private data are combined, some interesting intellectual property, ownership, and pricing questions will be raised. Managers must be ready to address questions such as, "Why should I have to pay for a product that contains data I paid to collect via my tax dollars?"
- Inclusion of open data features in mainstream ERP, database, middleware, and CRM products. Just as vendors have incorporated social networking and collaboration features into older products, so too will open data features be added to mainstream enterprise products to enable access via file downloads, visualization, and documented APIs. Such features will be justified by the extra utility and engagement they support. Some vendors will incorporate monetization features to make it easier to track and charge for the data the new tools expose.
- Continued challenges to open data ROI and impact measurement. As those experienced with usage metrics will tell you, it's not just usage that's important; it's the impact of usage that really counts. In the coming year this focus on open data impact measurement will continue to grow. I take that as a good sign. I also predict that open data impact measurement will continue to be a challenge. Just as in the web site world it's easier to measure pageviews than to measure the impacts of the information communicated via those pageviews, so too will it continue to be easier to measure data file downloads and API calls than the impacts of using the data thus obtained.
This week I would like to introduce you to Richard Smith, founder and software developer of Nowomics. He kindly agreed to answer some questions for our blog post series, and here it is: first-hand information on Nowomics. Keep reading to find out more about this company.
Richard Smith, founder and software developer of Nowomics
1. Could you please give us a short introduction to Nowomics (goals, interests, mission)?
Nowomics is a free website to help life scientists keep up with the latest papers and data relevant to their research. It lets researchers 'follow' genes and keywords to build their own news feed of what's new and popular in their field. The aim is to help scientists discover the most useful information and avoid missing important journal articles, but without spending a lot of their time searching websites.
2. What makes Nowomics unique?
Nowomics tracks new papers, but also other sources of curated biological annotation and experimental data. It can tell you if a gene you work on has new annotation added or has been linked to a disease in a recent study. The aim is to build knowledge of these biological relationships into the software to help scientists navigate and discover information, rather than recommending papers simply by text similarity.
3. When did you realise that a tool such as Nowomics would be of great help to the genomic research community?
I've been building websites and databases for biologists for a long time and have heard from many scientists how hard it is to keep up with the flood of new information. There are around 20,000 biomedical journal articles published every week and hundreds of sources of data online; receiving lots of emails with lists of paper titles isn't a great solution. In social media, interactive news feeds that adapt to an individual are now commonly used as an excellent way to consume large amounts of new information, and I wanted to apply these principles to tracking biology research.
4. Which part of developing the tool did you find most challenging?
As with a lot of software, making sure Nowomics is as useful as possible to users has been the hardest part. It's quite straightforward to identify a problem and build some software, but making sure the two are correctly aligned to provide maximum value to users has been the difficult part. It has meant trying many things, demonstrating ideas and listening to a lot of feedback. Handling large amounts of data and writing text mining software to identify thousands of biological terms is simple by comparison!
5. What are your plans for the future of Nowomics? Are you working on adding new features/apps?
There are lots of new features planned. Currently Nowomics focuses on genes/proteins and selected organisms. We'll soon make this much broader, so scientists will be able to follow diseases, pathways, species, processes and many other keywords. We're working on how these terms can be combined together for fine-grained control of what appears in news feeds. It's also important to make sharing with colleagues and recommending research extremely simple.
6. Can you think of examples of how Nowomics supports data access and knowledge dissemination within the genomics community?
The first step to sharing data sets and accessing research is for the right people to know they exist. This is exactly what Nowomics was set up to achieve, to benefit both scientists who need to be alerted to useful information and those generating or funding research who want to reach the best possible audience. Hopefully Nowomics will also alert people to relevant shared genomics data in future.
7. What does ethical data sharing mean to you?
For data that can advance scientific and medical research, the most ethical thing to do is to share it with other researchers to help make progress. This is especially true for data resulting from publicly funded research. However, with medical and genomics data the issues of confidentiality and privacy must take priority, and individuals must be aware what their information may be used for.
8. What are the most important things that you think should be done in the field of genetic data sharing?
The challenge is to find a way to unlock the huge potential of sharing genomics data for analysis while respecting the very real privacy concerns. A platform that enables sharing in a secure, controlled manner which preserves privacy and anonymity seems essential; I'm very interested in what DNADigest are doing in this regard.
NGS-Logistics is the next project featured in our blog interviews. We have interviewed Amin Ardeshirdavani, a PhD student involved in the creation of this web-based application. Take a look at the interview to find out why this tool has become very popular within KU Leuven.
1. What is NGS-Logistics?
NGS-Logistics is a web-based application which accelerates the federated analysis of Next Generation Sequencing data across different centres. NGS-Logistics acts like a real logistics company: you order something from the Internet; the owner processes your request and then ships it through a safe and trusted logistics company. In the case of NGS-Logistics, the goods are human sequence data, and researchers ask about possible variants and their frequency across the whole population. We try to deliver the answers in the fastest and safest possible way.
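The request/response flow described above can be sketched roughly as follows. This is a minimal illustration of a federated count query, not the actual NGS-Logistics code; the centre names, data layout, and function names are all hypothetical:

```python
# Sketch of a federated variant-frequency query: each centre answers a count
# query locally, and only aggregated counts leave the federation.
# All names and data structures here are illustrative, not NGS-Logistics code.

def query_centre(samples, chrom, pos, alt):
    """Count carriers of a variant among one centre's samples (runs locally)."""
    carriers = sum(1 for genotypes in samples.values()
                   if (chrom, pos, alt) in genotypes)
    return carriers, len(samples)

def federated_frequency(centres, chrom, pos, alt):
    """Aggregate per-centre counts into an overall carrier frequency."""
    total_carriers = total_samples = 0
    for samples in centres.values():
        carriers, n = query_centre(samples, chrom, pos, alt)
        total_carriers += carriers
        total_samples += n
    return total_carriers / total_samples if total_samples else 0.0

# Toy data: two centres, each holding a set of variants per anonymised sample.
centres = {
    "centre_A": {"s1": {("chr1", 12345, "T")}, "s2": set()},
    "centre_B": {"s3": {("chr1", 12345, "T")}, "s4": set(), "s5": set()},
}
print(federated_frequency(centres, "chr1", 12345, "T"))  # 2 carriers / 5 samples = 0.4
```

The key design point matching the "logistics" analogy is that raw per-sample data never leaves each centre; only the counts are shipped back and combined.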
2. What is your part in NGS-Logistics?
Right now I am a PhD student at KU Leuven, and the whole idea of my PhD project is designing and developing new data structures for analysing the massive amounts of data produced by Next Generation Sequencing machines. NGS-Logistics is exactly that. I have done the whole design and development of the application and database. I would also like to acknowledge all the people from KU Leuven, the ESAT IT department, the UZ Leuven IT department, and the UZ Genomics Core department who assisted me on this project, for their kind support, especially Erika Souche.
3. When did you first start working on the idea of creating NGS logistics and what made you think it would be something useful?
It was almost three years ago when I had a meeting with my promotor, Professor Yves Moreau, and he had an idea to somehow connect sequencing centres and query their data without moving them into one repository. As a person with an IT background it wasn't that difficult for me to develop an application, but there were lots of practical issues that needed to be taken care of. The majority of these issues are related to protecting the privacy of the individuals, because the data we deal with come from human genome sequencing experiments and people are rightfully worried about how these data will be used and protected. At the time of my first meeting there was no system in place to share this data, but many people understood the need for this kind of structure and for us to start working on it. As we know, information can be a true scientific goldmine, and by having access to more data we are able to produce more useful information. The novelty of the data, the possibility of sharing this wealth of information, and the complexity of this kind of application make me so eager to work on this project.
4. How does your open source tool work and who is it designed for?
NGS-Logistics has three modules: the web interface, the access control list and the query manager. The source code of each of these modules, plus the database structure behind them, is available upon simple request. As the modules are being upgraded continuously, I have not made a public repository for the source code yet. However, if someone is interested in gaining access to the source code it will be our pleasure to share it, though I do think that the whole idea of data sharing is more important than the source code itself. In any case, it is our pleasure to share with others our experience of the different problems and issues we had to tackle during the past three years. In general, NGS-Logistics is designed to help researchers save time when they need access to more data. It will help them get a better overview of their questions and, if they need access to the actual data, help them find the data sets that best match their cases.
5. Who has access to the system and how do you manage access permissions?
Researchers with a valid email address and affiliation are welcome to register and use the application. This means that we need to know who is querying the data, to prevent structural queries which might make it possible to identify an individual. I spent almost 20 months on the Access Control List (ACL) module. Most of the tasks are controlled and automatically updated by the system itself. Centre admins are responsible for updating the list of samples they want to share with others. PIs and their power users are responsible for grouping the samples into data sets and assigning them to users and groups. The ACL has a very rich and user-friendly interface that makes it very easy to learn and use.
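The dataset-level permission model described above (centre admins list shareable samples, PIs group them into data sets and grant those sets to users) could be sketched like this. The structures and names are purely illustrative, not the actual ACL implementation:

```python
# Hypothetical sketch of the ACL model: a user may query only samples that
# (a) a centre admin has marked as shared, and (b) a PI has granted to the
# user via a data set. All names and data here are illustrative.

shared_samples = {"centre_A": {"s1", "s2", "s3"}}   # maintained by centre admins
data_sets = {"exomes_2014": {"s1", "s2"}}           # grouped by PIs / power users
grants = {"alice": {"exomes_2014"}}                 # user -> granted data sets

def accessible_samples(user):
    """Samples a user may query: union of granted sets, limited to shared ones."""
    shared = set().union(*shared_samples.values())
    allowed = set()
    for ds in grants.get(user, set()):
        allowed |= data_sets.get(ds, set())
    return allowed & shared

print(sorted(accessible_samples("alice")))  # ['s1', 's2']
print(sorted(accessible_samples("bob")))    # []
```

Keeping the two conditions separate means either a centre admin or a PI can revoke access independently, which fits the division of responsibility described in the answer.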
6. In what way do you think data sharing should be further improved?
Because of all the concerns around the term "data sharing", I prefer to use the term "result sharing". In our framework, we mostly try to answer very high-level questions, like the prevalence of a certain mutation in different populations, preventing any private information from leaking out. By having more access to data we can gain more insight and produce more useful information; as Aristotle said: "The whole is greater than the sum of its parts." On the other hand, we always have to be careful about the consequences of sharing.
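The "result sharing" idea can be illustrated with a small sketch: only an aggregate answer is returned, and answers computed over too few samples are suppressed so that a query cannot single out an individual. The threshold value and function names are assumptions for illustration, not part of NGS-Logistics:

```python
# Sketch of "result sharing": return only an aggregate prevalence, and
# suppress answers based on groups too small to be safely anonymous.
# The threshold of 5 is an illustrative choice, not a documented value.

MIN_SAMPLE_COUNT = 5  # hypothetical suppression threshold

def prevalence(carrier_count, sample_count):
    """Return the mutation prevalence, or None if the group is too small."""
    if sample_count < MIN_SAMPLE_COUNT:
        return None  # suppress: a tiny group could identify individuals
    return carrier_count / sample_count

print(prevalence(12, 400))  # 0.03
print(prevalence(1, 3))     # None
```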
7. What does ethical data sharing mean to you?
It means everything and nothing. Why? Because ethics really depends on the subject and the location we are talking about. If we talk about sharing weather forecast data, I would say it is not important and does not have any meaning. But when we talk about data produced from human genomes, then we have to be careful. Legal frameworks differ a lot between countries. Some are very restrictive when it comes to dealing with sensitive and private data, whereas others are much less restrictive. Mostly this is because they have different definitions of private data. In most cases, any information that allows us to uniquely identify a person is defined as private information, and as we know it is possible to identify a person by his or her genome sequence. Therefore, I feel that it is very important to keep track of what data is being used by whom, when, at which level and for what reason.
Amin Ardeshirdavani et al. have published this work in Genome Medicine 6:71: "NGS-Logistics: federated analysis of NGS sequence variants across multiple locations". You can take a look at it here.