Grid computing, in which clusters of computers or vast distributed networks are connected, often through links far faster than conventional internet pipes, is now allowing scientists, engineers, clinicians, designers, and others to access distributed databases, powerful computing resources, and instrumentation, creating opportunities for faster, better, or different approaches to research.
Grid computing allows scientists to collaborate more effectively than ever before, sharing data and knowledge. Until recently, however, much of the development had focused on storage, computation, and resource management. The myGrid project concentrates on the higher-level components and services that will support the scientific research process itself, as well as enabling core scientific collaboration. It will allow dynamic groupings to tackle research problems and provide a workbench for the e-scientist that will underpin the scientific process for many researchers.
Writing today in the International Journal of Bioinformatics Research and Applications (2007, 3, 303-325), Katy Wolstencroft, Pinar Alper, Duncan Hull, Robert Stevens, and Carole Goble of the School of Computer Science at the University of Manchester, Christopher Wroe of BT Global Services, London (previously at Manchester), and Philip Lord of the School of Computing Science at the University of Newcastle upon Tyne explain how myGrid supports in silico life sciences research, in other words, computational experiments. The project essentially focuses on bioinformatics, which uses the rich data of genomes and gene function to help scientists understand life.
Bioinformaticians chain database searches and analytical tools together, often using complex scripts or workflows, to extract knowledge and information. The resources used in these “in silico” experiments have different interfaces and data formats, but myGrid uses a special workflow language (SCUFL) to capture data and bring disparate resources together. It does this by presenting them directly as Web Services or as components within a single framework. Workflows can then be shared, reused, and adapted by collaborating users. A single query will meet a user’s needs once all the relevant resources are Grid-enabled.
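The idea of chaining a database search and an analysis tool into a reusable workflow can be illustrated with a minimal Python sketch. This is not SCUFL and not the myGrid software; the function names, the toy database, and the gene identifiers are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, List, Tuple

# A step is a name plus a function that maps one input to one output.
Step = Tuple[str, Callable[[Any], Any]]


def fetch_sequence(gene_id: str) -> str:
    """Hypothetical stand-in for a sequence database lookup."""
    toy_database = {"GENE1": "ATGGCC", "GENE2": "ATGTTT"}
    return toy_database[gene_id]


def gc_content(sequence: str) -> float:
    """Hypothetical stand-in for an analysis tool: fraction of G and C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def run_workflow(steps: List[Step], initial_input: Any) -> Any:
    """Feed each step's output into the next step, in order."""
    value = initial_input
    for _name, func in steps:
        value = func(value)
    return value


# The workflow is plain data, so a collaborator can reuse or adapt it.
workflow: List[Step] = [
    ("fetch_sequence", fetch_sequence),
    ("gc_content", gc_content),
]

result_gene1 = run_workflow(workflow, "GENE1")
result_gene2 = run_workflow(workflow, "GENE2")
```

Because the workflow is an ordinary list of named steps, the same definition can be run against different inputs or extended with further steps, which is the kind of sharing and adaptation described above.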
In addition, the myGrid Information Repository (mIR) is a central component that allows a team to preserve, organize and share their data and metadata – data about data. Principal among the data are the laboratory results that will comprise the input to workflows, the workflows themselves and their outputs. Provenance information – the “who, how, what, where, and when” information – is just as important as workflows and their results. Such information is routinely collected in bench experiments but now myGrid allows users to capture provenance information as metadata stored in the mIR for their in silico experiments. The workflow enactor builds a provenance record identifying precise service details, date and time of execution and intermediate results.
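As a rough illustration of what a provenance record might contain, the following Python sketch logs who ran each step, which service was called, when it ran, and the intermediate result it produced. The record layout, the `enact` function, and the toy services are assumptions made for this example; they do not reproduce the actual mIR schema or the myGrid workflow enactor.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, List, Tuple


@dataclass
class ProvenanceEntry:
    """One hypothetical provenance record for a single workflow step."""
    user: str             # who ran the step
    service: str          # what was run
    started_at: datetime  # when it ran (UTC)
    input_value: Any      # what went in
    output_value: Any     # intermediate result that came out


def enact(
    steps: List[Tuple[str, Callable[[Any], Any]]],
    initial_input: Any,
    user: str,
) -> Tuple[Any, List[ProvenanceEntry]]:
    """Run the steps in order and record provenance for each one."""
    log: List[ProvenanceEntry] = []
    value = initial_input
    for name, func in steps:
        started = datetime.now(timezone.utc)
        output = func(value)
        log.append(ProvenanceEntry(user, name, started, value, output))
        value = output
    return value, log


# Toy services standing in for real bioinformatics tools.
steps = [("normalise", str.upper), ("count_bases", len)]
final, provenance = enact(steps, "atggcc", user="researcher_a")

for entry in provenance:
    print(entry.service, entry.started_at.isoformat(), entry.output_value)
```

Keeping the intermediate results alongside the service name and timestamp means a later reader can trace how a final output was derived, much as a laboratory notebook does for bench work.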
As a practical application of myGrid, the team worked with researchers at Newcastle University’s Centre for Life to investigate the genetic background to Graves’ disease, the leading cause of hyperthyroidism. This study of Graves’ disease combines laboratory experiments with bioinformatics experiments, in the form of workflows performed in silico. The myGrid team has developed a set of workflows that take the results of bench experiments and help the researchers choose follow-up experiments.
In the current research paper, the team points out that there are about 3000 services (including web services) offering programmatic access to bioinformatics resources. “The distribution and frequent lack of documentation, however, creates the requirement for easy service discovery,” they explain. In other words, there are thousands of services available for solving problems in the life sciences, but most of the scientists who might make use of them in their research do not know that many of them even exist. “If services are available but unknown to the user, the advantages gained from using web service technology could be lost,” Wolstencroft and colleagues explain. “This drives the need for the myGrid ontology of services.” In computing, an ontology is basically an annotated list of everything and how each item is related to the others.
The team has designed the myGrid ontology to help life scientists find all these various web services in the area of bioinformatics. The ontology has evolved since inception and new ways to exploit and adapt it have been discovered along the way. The main conclusion the researchers draw from their efforts is that ontologies are an essential component for service discovery and interpretation. Without ontologies, many of the resources and services out there will remain hidden and essentially unused by the life sciences community.
In practice, the automated discovery of services usually requires a fine-grained description, ontology reasoning, and ranking of discovered matches. In contrast, a scientist doing the search requires coarse-grained matches and a short list of services that they can then investigate further until they find the best hit for their research project. Underpinning this is expert curation of the services, and key to the success of that curation is a way to standardise the annotations associated with each entry in the ontologies, or at the very least a standardised way of interpreting those annotations. “Our experiences have shown that using an ontology for service discovery is not a luxury, but a requirement,” the researchers conclude.
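A toy Python sketch can show how an ontology supports coarse-grained discovery: services are annotated with specific terms, and a search on a broader term returns every service whose annotation falls under it. The terms, the hierarchy, and the service names here are hypothetical and are not taken from the actual myGrid ontology.

```python
from typing import Dict, List, Optional

# Hypothetical "is-a" hierarchy: each term maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "analysis": None,
    "alignment": "analysis",
    "pairwise_alignment": "alignment",
    "multiple_alignment": "alignment",
    "gene_prediction": "analysis",
}

# Hypothetical curated annotations: service name -> the task term assigned to it.
ANNOTATIONS: Dict[str, str] = {
    "service_a": "pairwise_alignment",
    "service_b": "multiple_alignment",
    "service_c": "gene_prediction",
}


def is_a(term: str, ancestor: str) -> bool:
    """Return True if `term` equals `ancestor` or lies beneath it in the hierarchy."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT[current]
    return False


def discover(query_term: str) -> List[str]:
    """List services whose annotation is the query term or a more specific term."""
    return sorted(
        name for name, term in ANNOTATIONS.items() if is_a(term, query_term)
    )


print(discover("alignment"))
print(discover("pairwise_alignment"))
```

A scientist searching on the broad term gets a short list to inspect by hand, while an automated tool could query the most specific term directly. Both depend on the annotations being applied consistently, which is why curation matters.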
Further Reading: Workflows for E-science: Scientific Workflows for Grids