An Introduction to the Analytic Web
The way in which science is done has been changed irrevocably by the
Internet. Huge volumes of data, stored on computers all over the world,
are now available to scientists anywhere. As a result, observations
taken around the globe can be accessed rapidly by scientists, raising
the prospect of accelerated formulation and validation of scientific
hypotheses. Extensive computing power, mass storage, and fast Internet
access seem poised to foster rapid acceleration of scientific knowledge.
But there are risks associated with this attractive scenario. The
ways in which scientists acquire and process data must be understood
by those who use the data. Failure to take this information into account
can lead to misuses that, in turn, produce misleading or incorrect
results. Raw data sets, for example, must often be cleansed to remove
data items that are anomalous or tainted. But some techniques for cleansing
data render the resulting datasets unsuitable for certain kinds of
subsequent statistical analyses. In the UMass Computer Science Department,
David Jensen and Tim Oates have made substantial contributions to the
understanding of statistical incompatibility, identifying cases where
these incompatibilities have led to erroneous scientific results.
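To make the incompatibility concrete, consider a deliberately simplified Python sketch (with invented numbers, and not one of the cases Jensen and Oates actually studied): a routine cleansing step that discards readings far from the mean quietly shrinks the variance of what remains, biasing any subsequent analysis that depends on that variance.

    import numpy as np

    rng = np.random.default_rng(0)
    raw = rng.normal(loc=400.0, scale=10.0, size=10_000)  # synthetic sensor readings

    # A common cleansing step: drop readings more than 2 standard deviations
    # from the mean before doing anything else with the data.
    mean, std = raw.mean(), raw.std()
    cleansed = raw[np.abs(raw - mean) < 2 * std]

    # Truncation shrinks the spread of the surviving data, so any later test
    # that treats the cleansed sample's variance as unbiased is quietly wrong.
    print(f"raw std:      {raw.std():.2f}")       # close to 10
    print(f"cleansed std: {cleansed.std():.2f}")  # noticeably smaller than 10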
The stakes here are huge. Major public policy debates, such as those
over global warming and risk levels from various carcinogens, must
be informed by scientific results inferred from data that has undergone
long chains of cleansing and statistical inferences. Process and statistical
incompatibilities could potentially result in misinformation, leading
public policy makers to make decisions that could have disastrous environmental
or public health consequences. Although these risks have always existed,
the increased availability of data and scientific processing via the
Internet has exacerbated the problem.
We observe that to manage these risks, it is not enough to simply
make scientific datasets more readily available. It is also necessary
to accompany each dataset with metadata (i.e., data describing the data)
that documents the processes by which it was created. Going further,
it is also important to evaluate these processes to determine if unsound
analyses, such as those documented by Jensen and Oates, have been inadvertently
applied, thereby undermining the validity of the results.
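What such metadata might look like is easiest to see in a small sketch. The Python record below is purely illustrative; the field names and values are our own assumptions, not a schema the project has adopted. The point is simply that a dataset carries with it the chain of processing steps that produced it, in enough detail that each step could be examined or repeated.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ProcessingStep:
        name: str          # e.g., "despike" or "hourly average"
        tool: str          # the program or script that performed the step
        parameters: dict   # the settings needed to repeat the step exactly

    @dataclass
    class DatasetMetadata:
        source: str        # instrument or upstream dataset
        collected: str     # collection period
        steps: List[ProcessingStep] = field(default_factory=list)

    record = DatasetMetadata(
        source="flux tower, Harvard Forest",
        collected="2002 field season",
        steps=[ProcessingStep("despike", "cleanse.py", {"threshold_sd": 2})],
    )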
The Analytic Web project, funded by the NSF ITR program, is investigating
computer support for web-based scientific processes. This project brings
together researchers from the Computer Science Department and from Harvard
Forest to explore automated support for defining, analyzing, and automating
scientific processes. Lee Osterweil and Lori Clarke are leading the
software engineering effort, David Jensen is providing statistical
analysis expertise, and Ed Riseman, Al Hanson, and Howard Schultz are
working closely with the Harvard Forest researchers on collecting data
and carefully defining ecological processes.
The ecologists from Harvard Forest, Emery Boose, Aaron Ellison, David
Foster, and Julian Hadley, are concerned with measuring and predicting
forest carbon dioxide sequestration. The ecologists gather data from
a flux tower, a 10-meter structure located in Petersham, MA, in the
midst of Harvard Forest. The flux tower sucks in ambient air and measures
the percentage of carbon dioxide in the air five times per second.
These measurements are affected by various natural phenomena, such
as temperature, wind speed, and tree species (identified by aerial
photographs that are evaluated by the Vision lab). Thus, there are
a number of cleansing, estimation, and statistical processes that are
applied and evaluated. Ultimately, using such processes, ecologists
hope to determine a model of forest carbon dioxide sequestration. It
is clear that such findings can have a substantial impact upon policies
aimed at controlling greenhouse gases, which contribute to
global warming. These processes also serve as excellent case studies
for our investigation into support for the Analytic Web.
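As a rough indication of what the earliest of these cleansing steps involves, here is a small Python sketch of a first pass over the raw readings. The plausibility range, interval length, and coverage requirement are our own illustrative assumptions, not the ecologists' actual procedure.

    import math

    PLAUSIBLE_RANGE = (300.0, 1000.0)    # assumed plausible readings, in instrument units
    SAMPLES_PER_HALF_HOUR = 5 * 60 * 30  # five readings per second for 30 minutes
    MIN_COVERAGE = 0.9                   # keep an interval only if 90% of samples are valid

    def clean_interval(samples):
        """Average one half-hour block of readings, or return NaN if too sparse."""
        valid = [s for s in samples
                 if s is not None and PLAUSIBLE_RANGE[0] <= s <= PLAUSIBLE_RANGE[1]]
        if len(valid) < MIN_COVERAGE * SAMPLES_PER_HALF_HOUR:
            return math.nan  # later steps may estimate (gap-fill) such intervals
        return sum(valid) / len(valid)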
Central to this investigation is a careful study of the models needed
to represent scientific processes effectively. This aspect of the work
builds upon Lee Osterweil’s ongoing research aimed at developing
languages for the specification of processes. Originally focused on
languages for defining software development processes, this work has
recently widened its focus to address processes in such diverse areas
as medical procedures, government functions, and electronic commerce.
This work has led to the development of Little-JIL, a graphical language
that incorporates representations of such semantic issues as exception
management, resource utilization, timing constraints, and concurrency
control. These are all essential to the articulate definition of processes,
but most are absent from current process definition languages. Thus
our intention is to use Little-JIL as a starting point in our efforts
to model scientific processes, expecting that experience will point
the way towards modifications and enhancements needed to support working
scientists.
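Because Little-JIL is a graphical language, we cannot reproduce a process fragment in running text, but the Python sketch below hints at the kinds of constructs it provides. The sketch is purely illustrative: the names are ours, and the real language's treatment of exceptions, resources, timing, and concurrency is considerably richer.

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class Step:
        name: str
        substeps: List["Step"] = field(default_factory=list)
        action: Optional[Callable[[], None]] = None
        handler: Optional[Callable[[Exception], None]] = None  # exception management
        resources: List[str] = field(default_factory=list)     # resource utilization

        def run(self) -> None:
            try:
                if self.action:
                    self.action()
                for s in self.substeps:   # in this sketch, substeps run sequentially
                    s.run()
            except Exception as exc:
                if self.handler is None:
                    raise                 # propagate to an enclosing step
                self.handler(exc)         # handled at the step that declared a handler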
The accompanying figure shows a small part of a Little-JIL process
for cleansing the carbon dioxide data collected from the flux tower.
One of our first findings has been the need to complement the process
model with a derivation model. A derivation model is similar to a data-flow
model or state diagram, in that it shows how types of data are processed,
but it must also distinguish data instances, as illustrated below.
The derivation model and process model together carefully
document the processing applied to various instances of the datasets.
The resulting description is precise enough to serve as the basis for execution.
Thus, simply by documenting their processes, scientists are provided with
an execution framework. Although there is considerable work to be done
on the models and on the user interface for such a framework, the ecologists
have already found it preferable to their current programming environment
(don’t ask!). In the future, we plan to investigate using such
models to support automatic rederivation and configuration management.
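The sketch below, again in Python with invented names, captures the core idea of a derivation model: a record of which concrete dataset instances were derived from which others, and by what step, so that anything affected by an upstream change can be found and rederived.

    derivations = {
        "co2_raw":      {"from": [], "step": "flux tower download"},
        "co2_cleansed": {"from": ["co2_raw"], "step": "despike"},
        "co2_hourly":   {"from": ["co2_cleansed"], "step": "hourly average"},
    }

    def ancestry(instance):
        """List every upstream instance this one was derived from."""
        result = []
        for parent in derivations[instance]["from"]:
            result.extend(ancestry(parent))
            result.append(parent)
        return result

    print(ancestry("co2_hourly"))  # ['co2_raw', 'co2_cleansed']

Read in the other direction, the same records identify everything downstream of a change, which is what automatic rederivation and configuration management would rely on.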
Another central theme in this project is analyzing the
soundness of scientific processes. Lori Clarke is leading this effort,
building upon her previous research in finite-state verification. In
this research, Clarke and her colleagues are developing an analyzer,
called FLAVERS, capable of determining whether or not the sequences of events
described by user-specified properties (desirable or undesirable ones)
can occur on any execution of a concurrent system. In this project,
we are investigating how such analysis techniques can be applied to
Little-JIL process models. Eventually we would like to build upon the
work of Jensen and Oates to specify and detect unreliable statistical
processes. We also are exploring the consistency relationships between
the process and derivation models.
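As a hint of what such a property looks like, the Python fragment below checks one invented ordering rule against a single recorded trace of events. FLAVERS goes much further, reasoning about all possible executions rather than a single trace, and the particular rule here is our own example rather than one drawn from Jensen and Oates.

    # Invented rule: a t-test must not be applied to data from which outliers
    # were removed, since the removal biases the variance the test relies on.
    FORBIDDEN_AFTER = {"run_t_test": {"remove_outliers"}}

    def check_trace(events):
        """Return a violation message if a forbidden ordering occurs, else 'ok'."""
        seen = set()
        for event in events:
            for earlier in FORBIDDEN_AFTER.get(event, ()):
                if earlier in seen:
                    return f"violation: '{event}' after '{earlier}'"
            seen.add(event)
        return "ok"

    print(check_trace(["load_data", "remove_outliers", "run_t_test"]))  # violation
    print(check_trace(["load_data", "gap_fill", "run_t_test"]))         # ok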
Although we are still in the early stages of this project,
we have successfully defined and automated a few of the carbon dioxide
sequestration measurement processes. Visualizations, executions, and
easy modification of these processes have been demonstrated at an ecological
conference. Work on improving the model representations and the associated
analyses is underway. Eventually we hope to make the Analytic Web
framework available to the general scientific community. Through this
framework, we hope to provide support for defining, executing, and
analyzing scientific processes that should foster safe reuse of data
and processes and facilitate scientific discovery. Ultimately we hope
to see these scientific processes made available to students in universities,
colleges, and high schools, in order to bring the challenges and excitement
of scientific discovery into laboratories and classrooms around the
country and the world.