% Copyright 2019 by Till Tantau
%
% This file may be distributed and/or modified
%
% 1. under the LaTeX Project Public License and/or
% 2. under the GNU Free Documentation License.
%
% See the file doc/generic/pgf/licenses/LICENSE for more details.


\section{Introduction to Data Visualization}

\emph{Data visualization} is the process of converting \emph{data points,}
which typically consist of multiple numerical values, into a graphical
representation. Examples include the well-known function plots, but pie
charts, bar diagrams, box plots, and vector fields are also data
visualizations.

The data visualization subsystem of \pgfname\ takes a general, open approach
to data visualization. Like everything else in \pgfname, there is a powerful,
but not-so-easy-to-use basic layer in the data visualization system and a
less flexible, but much simpler-to-use frontend layer. The present section
gives an overview of the basic ideas behind the data visualization system.


\subsection{Concept: Data Points}
\label{section-dv-intro-data-points}

The most important input for a data visualization is always raw data. This
data is typically present in different formats, and the data visualization
subsystem provides methods for reading such formats as well as for defining
new input formats. Independently of the input format, however, we may ask
what kind of data the data visualization subsystem should be able to
process. For two-dimensional plots we need lists of pairs of real numbers.
For a bar plot we usually need a list of numbers, possibly together with
some colors and labels. For a surface plot we need a matrix of triples of
real numbers. For a vector field we need even more complex data.

The data visualization subsystem makes no assumption concerning which kind
of data is being processed. Instead, the whole ``rendering pipeline'' is
centered around a concept called the \emph{data point}. Conceptually, a data
point is an arbitrarily complex record that represents one piece of data
that should be visualized. Data points are \emph{not} just coordinates in
the plane or the numerical values that need to be visualized. Rather, they
represent the basic units of the data that needs to be visualized.

Consider the following example: In an experiment we drive a car along a road
and have different measurement instruments installed. We measure the
position of the car, the time, the speed, the direction the car is heading,
the acceleration, and perhaps some further values. A data point would be a
record containing a timestamp together with the current position of the car
(presumably two or three numbers), the speed vector (another two or three
numbers), the acceleration (another two or three numbers), and perhaps the
label text of the current experiment.

Data points should be ``information rich''; they may even contain more
information than what will actually be visualized. It is the job of the
rendering pipeline to pick out the information relevant to one particular
data visualization -- another visualization of the same data might pick
different aspects of the data points, thereby hopefully allowing new
insights into the data.

Technically, there is no special data structure for data points. Rather,
when a special macro called |\pgfdatapoint| is called, the ``totality'' of
all currently set keys with the |/data point/| prefix in the current scope
forms the data point. This is both a very general approach and quite fast
since no extra data structures need to be created.
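As a minimal sketch of this mechanism at the basic layer, one could form a
data point of the car experiment as follows (the attribute names |time| and
|speed| are purely illustrative -- a data point may carry arbitrary
attributes):
%
\begin{codeexample}[code only]
\pgfkeys{
  /data point/time=2.0,   % timestamp of the measurement
  /data point/x=30.5,     % position of the car
  /data point/y=12.2,
  /data point/speed=15.3} % current speed
% The totality of the currently set /data point/ keys
% now forms one data point:
\pgfdatapoint
\end{codeexample}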
\subsection{Concept: Visualization Pipeline}

The \emph{visualization pipeline} is a series of actions that are performed
on the to-be-visualized data. The data is presented to the visualization
pipeline in the form of a long stream of complex data points. The
visualization pipeline makes several passes over this stream of data points.
During the first pass(es), called the \emph{survey phase(s)}, information is
gathered about the data points such as minimal and maximal values, which can
be useful for automatically fitting the data into a given area. In the main
pass over the data, called the \emph{visualization phase}, the data points
are actually visualized, for instance in the form of lines or points.

As with data points, the visualization pipeline makes no assumptions
concerning what kind of visualization is desired. Indeed, one could even use
it to produce a plain-text table. This flexibility is achieved by extensive
use of objects and signals: When a data visualization starts, a number of
signals (see Section~\ref{section-signals} for an introduction to signals)
are initialized. Then, numerous ``visualization objects'' are created that
listen to these signals. These objects are all involved in processing the
data points. For instance, the job of an |interval mapper| object is to map
one attribute of a data point, such as a car's velocity, to another, such as
the $y$-axis of a plot. For each data point the different signals are raised
in a certain order and the different visualization objects then get a chance
to prepare the data point for the actual visualization. Continuing the above
example, there might be a second |interval mapper| that takes the computed
$y$-position and applies a logarithm to it, because a log-plot was
requested. Then another mapper, this time a |polar mapper|, might be used to
map everything to polar coordinates. Following this, a |plot mark
visualizer| might actually draw something at the computed position.

The whole idea behind the rendering pipeline is that new kinds of data
visualizations can be implemented, ideally, just by adding one or two new
objects to the visualization pipeline. Furthermore, different kinds of plots
can be combined in novel ways in this manner, which is usually very hard to
do otherwise. For instance, the visualization pipeline makes it easy to
create, say, polar-semilog-box-plots. At first sight, such new kinds of
plots may seem frivolous, but data visualization is all about gaining
insights into the data from as many different angles as possible.

Naturally, creating new classes and objects for the rendering pipeline is
not trivial, so most users will just use the existing classes, which should,
thus, be as flexible as possible. But even when one only intends to use
existing classes, it is still tricky to set up the pipeline correctly since
the ordering is obviously important and since things like axes and ticks
need to be configured and taken care of.
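To get a feeling for what this architecture buys you, consider logarithmic
plots: conceptually, requesting one merely splices an additional logarithm
mapper into the pipeline. Using the frontend syntax introduced below, this
might look as follows (a sketch, assuming the |logarithmic| axis option
performs exactly this splicing behind the scenes):
%
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization.formats.functions}}]
\begin{tikzpicture}[scale=.7]
  \datavisualization [scientific axes,
                      y axis={logarithmic},
                      visualize as smooth line]
  data [format=function] {
    var x : interval [1:10];
    func y = \value x*\value x;
  };
\end{tikzpicture}
\end{codeexample}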
Because setting up the pipeline by hand is tricky, the frontend libraries
provide preconfigured rendering pipelines so that one can simply say that a
data visualization should look like a |line plot| with |school book axes| or
with |scientific axes|, which selects a certain visualization pipeline that
is appropriate for this kind of plot:
%
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization.formats.functions}}]
\begin{tikzpicture}[scale=.7]
  \datavisualization [school book axes, visualize as smooth line]
  data [format=function] {
    var x : interval [-2:2];
    func y = \value x*\value x + 1;
  };
\end{tikzpicture}
\end{codeexample}
%
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization.formats.functions}}]
\begin{tikzpicture}[scale=.7]
  \datavisualization [scientific axes, visualize as smooth line]
  data [format=function] {
    var x : interval [-2:2];
    func y = \value x*\value x + 1;
  };
\end{tikzpicture}
\end{codeexample}
%
One must still configure such a plot (choose styles and themes and also
specify which attributes of a data point should be used), but on the whole
the plot is quite simple to specify.
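For instance, the following sketch tells the axes which attributes of the
data points to use (the attribute names |time| and |speed| are again purely
illustrative, as are the measurement values):
%
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization}}]
\begin{tikzpicture}[scale=.7]
  \datavisualization [scientific axes,
                      x axis={attribute=time},
                      y axis={attribute=speed},
                      visualize as line]
  data { % default format: first line names the attributes
    time, speed
    0, 0
    1, 10
    2, 17
    3, 22
    4, 25
  };
\end{tikzpicture}
\end{codeexample}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "pgfmanual"
%%% End: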