An investigation of how researchers in data-intensive scientific fields use, process and curate digital data.

Yvonne Nobis

A recent editorial (February 2011) in the journal Science highlighted several potential problems in dealing with data, now that more data is being collected than we can physically store.

To date, most information science and library-based research on the data outputs of scientific research has concentrated on the storage of, and access to, scientific data, and on the preservation issues arising.

In this vein, ‘What researchers want’ -- a comprehensive review of literature in the area -- drew the conclusion that “it makes sense to invest in better data management during the research phase because doing so will improve data preservation once the research phase has ended”. Several other studies have looked at potential methodologies for handling data and whether these should be made mandatory by the Research Councils. A 2008 RIN study drew attention to the gap between the specialist roles of informaticians, statisticians, modellers and curators, and the information skills of life sciences researchers. It concluded that engagement with information professionals could add to the efficiency and effectiveness of research in the life sciences. The JISC-funded Incremental project, currently running at Cambridge and Glasgow, aims to identify current practices in managing digital research data, and to assist researchers in managing their data in the future.

However, none of these studies has specifically considered the techniques used by researchers in handling data, or the potential re-use of these techniques by researchers in other disciplines.

The emphasis of the current proposal is somewhat different: the aim is to examine how research data is currently handled by research groups in large scientific collaborations, and to question at what stage (if any) input from information professionals would be of use. The final result of much scientific research is most often a published paper. However, to reach this stage, many techniques will be employed in the manipulation of raw data, and often these are not recorded or, if recorded, are not made available outside the specific research field, thus limiting the possibility of multidisciplinary application. This is increasingly an acknowledged problem. For example, the journal Science is extending its data access requirements to include computer codes involved in the creation or analysis of data.

Such re-use of techniques employed in analysing data is occurring, but only in a limited way. An example is PathGrid: a collaboration between astronomers at Cambridge’s Institute of Astronomy and Cancer Research UK’s Cambridge Research Institute, in which image-analysis software developed for astronomy is being used to automate the study of pathology slides. In this case the relevant researchers met at a local e-science forum. Many other such opportunities for collaboration must exist, and could be realised if the data-handling techniques being used were recorded and made available.