We have a huge emerging problem centered on big data. Scientists around the world are collecting massive amounts of agricultural data, but the planning and infrastructure needed to catalog those data and make them accessible for future research and synthesis are missing.
Data permeate all facets of agricultural research and agricultural production. Properly curated data can be used well beyond the initial experiments, providing inexpensive yet valuable opportunities to improve breeding and management knowledge through analyses the original experiments were never designed to support.
Not many people are thinking about this, but going back to find data later will be a real pain; we need to address it now. Right now people generate data, keep it on their computers, write a paper and then put the data somewhere. That “somewhere” remains unclear in many cases, and there is no organized effort to plan for it.
Think of the world of genomics. Those data are accessible through the National Center for Biotechnology Information, where genomic data for nearly any species can be found. We should really learn from the genomics field – it’s the gold standard in publicly pooling data. One big lesson of the genomics-data explosion was the need to be prepared for the volume of data before it arrives.
Think about data use and reuse with unmanned aerial vehicles. Data collection is the easiest part of the process. The more difficult part follows with data analysis – stitching the images together like a camera’s panorama feature, building three-dimensional point clouds, and extracting data for each plot before exporting the results from geographic-information-system (GIS) software to statistical software.
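To make the per-plot extraction step concrete, here is a minimal sketch in Python. It assumes the stitched imagery has already been reduced to a canopy-height grid; the grid values, plot boundaries and column names are invented for illustration, and a real pipeline would work with georeferenced rasters and plot polygons from GIS software.

```python
import csv
import io
import statistics

# Hypothetical canopy-height model (CHM), in meters, derived from stitched
# UAV imagery. Real pipelines use georeferenced rasters, not plain lists.
chm = [
    [0.9, 1.0, 1.1, 1.6, 1.7, 1.8],
    [0.8, 1.0, 1.2, 1.5, 1.6, 1.9],
    [1.0, 1.1, 0.9, 1.7, 1.8, 1.6],
]

# Plot boundaries as (plot_id, row_start, row_stop, col_start, col_stop);
# in practice these would come from a shapefile of plot polygons.
plots = [
    ("plot_001", 0, 3, 0, 3),
    ("plot_002", 0, 3, 3, 6),
]

def summarize_plot(grid, r0, r1, c0, c1):
    """Mean and max canopy height over one plot's pixels."""
    values = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return statistics.mean(values), max(values)

# Export one row per plot, ready to load into statistical software.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["plot_id", "mean_height_m", "max_height_m"])
for plot_id, r0, r1, c0, c1 in plots:
    mean_h, max_h = summarize_plot(chm, r0, r1, c0, c1)
    writer.writerow([plot_id, round(mean_h, 2), round(max_h, 2)])

print(buf.getvalue())
```

The point of the sketch is the shape of the workflow – raster in, one tidy row per plot out – not the specific statistics, which any project would choose for itself.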
We need to save it all, because in 10 years algorithms will have improved and we will be able to do a better job of extracting useful information. Within agricultural research we need to develop data-sharing standards, incentivize researchers to share data and build a data-sharing infrastructure.
I participate in the “Genomes to Fields” program, in which 35 professors grow the same genetic populations of corn and collect data in exactly the same way. As a team we have evaluated more than 180,000 plots – 2,500 hybrids across 162 environments. How do we deal with that much data? First, hire a program coordinator and agree on standards up front, because multiple terabytes of data must be handled quickly.
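A data standard of the kind agreed on up front can be as simple as a shared record layout that every site’s submissions must pass. The sketch below is hypothetical – the field names, units and sample records are invented for illustration, not taken from the Genomes to Fields standard itself.

```python
# Hypothetical shared standard: every site submits plot records with the
# same field names, types, and units, and a validator flags nonconforming
# rows before they enter the pooled dataset.

REQUIRED_FIELDS = {
    "site": str,               # location identifier
    "year": int,
    "hybrid": str,             # genotype identifier
    "plot_id": str,
    "plant_height_cm": float,
    "grain_yield_t_ha": float,
}

def validate_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"{field}: expected {ftype.__name__}")
    return problems

good = {"site": "IA-Ames", "year": 2023, "hybrid": "B73xMo17",
        "plot_id": "0001", "plant_height_cm": 231.0, "grain_yield_t_ha": 11.2}
bad = {"site": "IA-Ames", "year": "2023", "hybrid": "B73xMo17"}

print(validate_record(good))  # []
print(validate_record(bad))
```

Checking records at submission time, rather than discovering mismatched units and column names years later, is exactly the kind of coordination a program coordinator and agreed standards buy.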
With coordination, the data are available and can be reused in corn-improvement studies. That is especially true with unmanned aerial systems that automate routine measurements such as plant height, which can be used to estimate grain yield, detect disease, identify the most promising varieties or flag stress signatures for farmer management.
This will mean collaborating with a data scientist from the beginning. It will mean treating data management as a core need, not an afterthought. And it will take money – taking part of a research budget and putting it into curating and communicating data, and perhaps hiring people who can use old data to inform decisions. People are already synthesizing data in studies the original experiments were never designed to support, and such synthesis studies will produce the most reliable and impactful findings going forward.