An Introduction to the Analytic Web

The way in which science is done has been changed irrevocably by the Internet. Huge volumes of data, stored on computers all over the world, are now available to scientists anywhere. As a result, observations taken around the globe can be accessed rapidly by scientists, raising the prospect of accelerated formulations and validation of scientific hypotheses. Extensive computing power, mass storage, and fast Internet access seem poised to foster rapid acceleration of scientific knowledge.

But there are risks associated with this attractive scenario. Those who use scientific data must understand how it was acquired and processed. Failure to take this information into account can lead to misuses that, in turn, produce misleading or incorrect results. Raw data sets, for example, must often be cleansed to remove data items that are anomalous or tainted. But some techniques for cleansing data render the resulting datasets unsuitable for certain kinds of subsequent statistical analyses. In the UMass Computer Science Department, David Jensen and Tim Oates have made substantial contributions to the understanding of such statistical incompatibilities, identifying cases where they have led to erroneous scientific results.
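
To make the danger concrete, the following small Python sketch (a purely hypothetical example, not one drawn from Jensen and Oates's studies) shows how an innocent-looking cleansing step can bias a later statistical estimate:

```python
# Illustrative only: how a cleansing step can bias a later analysis.
# (Hypothetical example; not taken from the Jensen/Oates studies.)
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=400.0, scale=10.0, size=10_000)  # synthetic sensor readings

# Cleansing step: discard readings more than 2 standard deviations from the mean.
mean, sd = raw.mean(), raw.std()
cleansed = raw[np.abs(raw - mean) < 2 * sd]

# A later analysis that estimates variability from the cleansed data will be
# systematically low, because the trimming removed legitimate tail values.
print(f"raw std:      {raw.std():.2f}")
print(f"cleansed std: {cleansed.std():.2f}  (biased downward by the cleansing)")
```

A downstream analysis that assumes the full distribution would silently inherit this bias, which is exactly the kind of incompatibility that needs to be documented and detected.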

The stakes here are huge. Major public policy debates, such as those over global warming and risk levels from various carcinogens, must be informed by scientific results inferred from data that has undergone long chains of cleansing and statistical inferences. Process and statistical incompatibilities could potentially result in misinformation, leading public policy makers to make decisions that could have disastrous environmental or public health consequences. Although these risks have always existed, the increased availability of data and scientific processing via the Internet has exacerbated the problem.

We observe that to manage these risks, it is not enough simply to make scientific datasets more readily available. It is also necessary to accompany each dataset with metadata (i.e., data describing the data) that documents the processes by which it was created. Going further, it is also important to evaluate these processes to determine whether unsound analyses, such as those documented by Jensen and Oates, have been inadvertently applied, thereby undermining the validity of the results.
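
As a rough illustration of the kind of metadata we have in mind, a provenance record might carry fields like these (the field names and structure below are our own invention, not an adopted standard):

```python
# A minimal, hypothetical sketch of provenance metadata for a derived dataset.
# Field names are illustrative, not a standard adopted by the project.
from dataclasses import dataclass, field


@dataclass
class ProcessingStep:
    name: str          # e.g. "despike", "gap-fill", "aggregate to 30-min means"
    tool: str          # program or script that performed the step
    parameters: dict   # settings needed to repeat the step exactly


@dataclass
class DatasetMetadata:
    dataset_id: str
    derived_from: list[str]                    # ids of the input datasets
    steps: list[ProcessingStep] = field(default_factory=list)

    def lineage(self) -> str:
        """Human-readable summary of how this dataset was produced."""
        chain = " -> ".join(s.name for s in self.steps)
        return f"{', '.join(self.derived_from)} --[{chain}]--> {self.dataset_id}"
```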

The Analytic Web project, funded by the NSF ITR program, is investigating computer support for web-based scientific processes. This project brings together researchers from the Computer Science Department and from Harvard Forest to explore automated support for defining, analyzing, and automating scientific processes. Lee Osterweil and Lori Clarke are leading the software engineering effort, David Jensen is providing statistical analysis expertise, and Ed Riseman, Al Hanson, and Howard Schultz are working closely with the Harvard Forest researchers on collecting data and carefully defining ecological processes.

The ecologists from Harvard Forest, Emery Boose, Aaron Ellison, David Foster, and Julian Hadley, are concerned with measuring and predicting forest carbon dioxide sequestration. The ecologists gather data from a flux tower, a 10-meter structure located in Petersham, MA, in the midst of Harvard Forest. The flux tower draws in ambient air and measures the percentage of carbon dioxide in the air five times per second. These measurements are affected by various natural phenomena, such as temperature, wind speed, and tree species (identified by aerial photographs that are evaluated by the Vision lab). Thus, a number of cleansing, estimation, and statistical processes are applied and evaluated. Ultimately, using such processes, the ecologists hope to develop a model of forest carbon dioxide sequestration. Such findings can have a substantial impact on policies aimed at controlling the greenhouse gases that contribute to global warming. These processes also serve as excellent case studies for our investigation into support for the Analytic Web.
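
To give a flavor of such a cleansing step, here is a small hypothetical sketch of a despiking pass over high-frequency CO2 readings; the thresholds and function are invented for illustration and are not the procedure actually used at Harvard Forest:

```python
# Hypothetical despiking pass over high-frequency CO2 readings (5 Hz).
# Thresholds are invented for illustration; they are not the Harvard Forest values.
import numpy as np


def despike(co2: np.ndarray, window: int = 50, n_sigmas: float = 4.0) -> np.ndarray:
    """Replace readings far from a local running median with NaN."""
    cleaned = co2.astype(float)
    for i in range(len(co2)):
        lo, hi = max(0, i - window), min(len(co2), i + window)
        local = co2[lo:hi]
        median, spread = np.median(local), np.std(local)
        if spread > 0 and abs(co2[i] - median) > n_sigmas * spread:
            cleaned[i] = np.nan   # flag the spike rather than silently dropping it
    return cleaned
```

Even a step this small embeds decisions (window size, threshold, flagging versus deletion) that later analyses need to know about, which is why the processing itself must be documented.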

Central to this investigation is a careful study of the models needed to represent scientific processes effectively. This aspect of the work builds upon Lee Osterweil’s ongoing research aimed at developing languages for the specification of processes. Originally focused on languages for defining software development processes, this work has recently widened its focus to address processes in such diverse areas as medical procedures, government functions, and electronic commerce. This work has led to the development of Little-JIL, a graphical language that incorporates representations of such semantic issues as exception management, resource utilization, timing constraints, and concurrency control. These are all essential to the articulate definition of processes, but most are absent from current process definition languages. Thus our intention is to use Little-JIL as a starting point in our efforts to model scientific processes, expecting that experience will point the way towards modifications and enhancements needed to support working scientists.
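
Little-JIL is a graphical language, but the kind of information each step carries can be suggested by a simplified textual sketch of our own (not the language's actual syntax or semantics):

```python
# A rough, simplified sketch of the information a Little-JIL-like step might carry.
# This is our own illustration, not the actual Little-JIL notation or semantics.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    name: str
    substeps: list["Step"] = field(default_factory=list)
    sequencing: str = "sequential"            # e.g. sequential, parallel, choice
    required_resources: list[str] = field(default_factory=list)
    deadline_seconds: Optional[float] = None  # timing constraint, if any
    exception_handlers: dict = field(default_factory=dict)  # exception name -> handler Step


cleanse = Step(
    name="CleanseFluxData",
    substeps=[
        Step(name="Despike", required_resources=["flux tower archive"]),
        Step(name="GapFill",
             exception_handlers={"MissingCalibration": Step(name="Recalibrate")}),
    ],
)
```

The point of the sketch is that exception handling, resources, timing, and sequencing are first-class parts of a step rather than afterthoughts.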

The figure above shows a small part of a Little-JIL process for cleansing the carbon dioxide data collected from the flux tower. One of our first findings has been the need to complement the process model with a derivation model. A derivation model is similar to a data-flow model or state diagram in that it shows how types of data are processed, but it must also distinguish individual data instances, as illustrated below.
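
As a rough textual stand-in for such a diagram, a derivation model might record which concrete dataset instances were derived from which, and by what process (the names below are hypothetical):

```python
# Hypothetical sketch: a derivation model records relationships between concrete
# dataset *instances*, not just between data types. Names are illustrative.
derivations = {
    # derived instance           : (process applied,  input instances)
    "co2_2003_07_cleansed_v2"    : ("CleanseFluxData", ["co2_2003_07_raw"]),
    "co2_2003_07_30min_means_v1" : ("Aggregate",       ["co2_2003_07_cleansed_v2"]),
}


def provenance(instance: str) -> list[str]:
    """Walk back through the derivation records for one dataset instance."""
    if instance not in derivations:
        return [instance]                     # a raw, underived instance
    process, inputs = derivations[instance]
    lines = [f"{instance} <- {process}({', '.join(inputs)})"]
    for inp in inputs:
        lines.extend(provenance(inp))
    return lines
```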

Together, the derivation model and the process model carefully document the processing applied to particular instances of the datasets. The description is detailed enough to serve as a basis for execution; thus, in documenting their processes, scientists are also provided with an execution framework. Although considerable work remains on the models and on the user interface for such a framework, the ecologists have already found it preferable to their current programming environment (don't ask!). In the future, we plan to investigate using such models to support automatic rederivation and configuration management.
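
A minimal sketch of what automatic rederivation might check, under the simple assumption that every dataset instance carries a timestamp (our illustration, not a design decision of the project):

```python
# Hypothetical sketch of automatic rederivation: re-run a recorded process when
# any of its inputs is newer than the derived result. Names and times are
# illustrative only.
def needs_rederivation(derived_time: float, input_times: list[float]) -> bool:
    """True if any input instance was produced after the derived instance."""
    return any(t > derived_time for t in input_times)


# Example: a cleansed dataset predates a re-collected raw file, so the
# recorded cleansing process should be re-executed on the new input.
print(needs_rederivation(derived_time=1058.0, input_times=[1060.0]))   # True
```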

Another central theme in this project is analyzing the soundness of scientific processes. Lori Clarke is leading this effort, building upon her previous research in finite-state verification. In that research, Clarke and her colleagues are developing an analyzer, called FLAVERS, capable of determining whether the event sequences described by user-specified properties, characterizing desirable (or undesirable) behavior, can occur on any execution of a concurrent system. In this project, we are investigating how such analysis techniques can be applied to Little-JIL process models. Eventually we would like to build upon the work of Jensen and Oates to specify and detect unreliable statistical processes. We are also exploring the consistency relationships between the process and derivation models.
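
As a loose illustration of the underlying idea, though emphatically not the FLAVERS algorithm itself, a property can be thought of as a small state machine run over the events a process generates; FLAVERS reasons about all possible executions of a concurrent system, whereas this toy checks a single trace:

```python
# Toy illustration of checking an event-sequence property against one process
# trace. The property and event names are hypothetical.
PROPERTY = {
    # (state, event)           -> next state
    ("start",    "cleanse"):      "cleansed",
    ("cleansed", "analyze"):      "cleansed",
    ("start",    "analyze"):      "violation",   # analysis before cleansing
}


def check(trace: list[str]) -> str:
    state = "start"
    for event in trace:
        state = PROPERTY.get((state, event), state)
        if state == "violation":
            return f"property violated at event '{event}'"
    return "property holds on this trace"


print(check(["cleanse", "analyze"]))   # property holds on this trace
print(check(["analyze"]))              # property violated at event 'analyze'
```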

Although we are still in the early stages of this project, we have successfully defined and automated a few of the carbon dioxide sequestration measurement processes. Visualizations, executions, and easy modification of these processes have been demonstrated at an ecological conference. Work on improving the model representations and the associated analyses is underway. Eventually we hope to make the Analytic Web framework available to the general scientific community. Through this framework, we hope to provide support for defining, executing, and analyzing scientific processes that should foster safe reuse of data and processes and facilitate scientific discovery. Ultimately we hope to see these scientific processes made available to students in universities, colleges, and high schools, in order to bring the challenges and excitement of scientific discovery into laboratories and classrooms around the country and the world.

 
