% Copyright 2019 by Till Tantau
%
% This file may be distributed and/or modified
%
% 1. under the LaTeX Project Public License and/or
% 2. under the GNU Free Documentation License.
%
% See the file doc/generic/pgf/licenses/LICENSE for more details.


\section{Providing Data for a Data Visualization}
\label{section-dv-formats}

\subsection{Overview}

The data visualization system needs a stream of data points as input. These
data points can be directly generated by repeatedly calling the |\pgfdatapoint|
command, but usually data is available in some special (text) format and one
would like to visualize this data. The present section explains how data in
some specific format can be fed to the data visualization system.

This section starts with an explanation of the main concepts. Then, the
standard formats are listed in the reference section. It is also possible to
define new formats, but this is an advanced concept which requires an
understanding of some of the internals of the parsing mechanism, explained in
Section~\ref{section-dv-parsing}, and the usage of a rather low-level command,
explained in Section~\ref{section-dv-declaring-formats}.


\subsection{Concepts}

For the purposes of this section, let us call a \emph{data format} some
standardized way of writing down a list of data points. A simple example of a
data format is the \textsc{csv} format (the acronym stands for \emph{comma
separated values}), where each line contains a data point, specified by values
separated by commas. A different format is the \emph{key--value format}, where
data points are specified by lists of key--value pairs. A far more complex
format is the \textsc{pdb}-format used by the protein database to describe
molecules.

The data visualization system does not use any specific format. Instead,
whenever data is read by the data visualization system, you must specify a
format parser (or it is chosen automatically for you). It is the job of the
parser to read (parse) the data lines and to turn them into data points, that
is, to set up appropriate subkeys of |/data point/|.

To give a concrete example, suppose a file contains the following
lines:
%
\begin{codeexample}[code only]
x, y, z
0, 0, 0
1, 1, 0
1, 1, 0.5
0, 1, 0.5
\end{codeexample}
%
This file is in the \textsc{csv}-format. This format can be read by the |table|
parser (which is called thus, rather than ``|csv|'', since it can also read
files in which the columns are separated by, say, a semicolon or a space). The
|table| parser will then read the data and, for each line of the data except
for the headline, it will produce one data point. For instance, for
the last data point the key |/data point/x| will be set to |0|, the key
|/data point/y| will be set to |1|, and the key |/data point/z| will be set to
|0.5|.
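
A minimal sketch of how such a file might be visualized is shown below. The
file name |points.csv| is hypothetical and is assumed to contain exactly the
lines shown above. The sketch also assumes that \tikzname's |data| accepts the
|read from file| key described in Section~\ref{section-dv-parsing}. Since only
two axes are set up, the |z| attribute is assumed to be simply ignored here.
%
\begin{codeexample}[code only]
\begin{tikzpicture}
  % points.csv is a hypothetical file holding the five lines shown above
  \datavisualization [school book axes, visualize as scatter]
    data [format=table, read from file=points.csv];
\end{tikzpicture}
\end{codeexample}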

All parsers are basically line-oriented. This means that, normally, each line
in the input data should contain one data point. This rule does not always
apply: for instance, empty lines are typically ignored and sometimes a data
point may span several lines. However, deviating from this ``one data point
per line'' rule makes parsers harder to program.


\subsection{Reference: Built-In Formats}

The following format is the default format, when no |format=...| is specified.

\begin{dataformat}{table}
    This format is used to parse data that is formatted in the following
    manner: Basically, each line consists of \emph{values} that are separated
    by a \emph{separator} like a comma or a space. The values are stored in
    different \emph{attributes}, that is, subkeys of |/data point| like
    |/data point/x|. In order to decide which attribute is chosen for a given
    value, the headline is important. This is the first non-empty line of a
    table. It is formatted in the same way as normal data lines (values
    separated by the separator), but the meaning of the values is different:
    The first value in the headline is the name of the attribute where the
    first values in the following lines should go each time. Similarly, the
    second value in the headline is the name of the attribute for the second
    values in the following lines, and so on.

    A simple example is the following:
    %
\begin{codeexample}[code only]
angle, radius
0, 1
45, 2
90, 3
135, 4
\end{codeexample}
    %
    The headline states that the values in the first column should be stored in
    the |angle| attribute (|/data point/angle| to be precise) and that the
    values in the second column should be stored in the |radius| attribute.
    There are four data points in this data set.

    The format will tolerate too few or too many values in a line. If there are
    fewer values in a line than in the headline, the last attributes will
    simply be empty. If there are more values in a line than in the headline,
    the surplus values are stored in attributes called
    |/data point/attribute |\meta{column number}, where the first value of a
    line gets \meta{column number} equal to |1| and so on.
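
    As a sketch of this behavior, consider the following (made-up) data, in
    which the second data line has too few values and the third has too many.
    Following the rules above, the second data line should leave |y| empty,
    and the third data line should store the surplus value |7| in
    |/data point/attribute 3|:
    %
\begin{codeexample}[code only]
x, y
0, 0
1
2, 1, 7
\end{codeexample}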

    The |table| format can be configured using the following options:
    %
    \begin{key}{/pgf/data/separator=\meta{character} (initially ,)}
        Use this key to change which character is used to separate values in
        the headline and in the data lines. To set the separator to a space,
        either set this key to an empty value or say |separator=\space|. Note
        that you must surround a comma by curly braces if you wish to (re)set
        the separator character to a comma.
        %
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization}}]
\begin{tikzpicture}
  \datavisualization [school book axes, visualize as line]
    data [separator=\space] {
      x y
      0 0
      1 1
      2 1
      3 0
    }
    data [separator=;] {
      x; y; z
      3; 1; 0
      2; 2; 0
    };
\end{tikzpicture}
\end{codeexample}
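        %
        The following sketch illustrates the bracing of a comma: the first
        data block uses spaces as separators, while the second one explicitly
        sets the separator to a comma, written as |{,}| so that the comma is
        not mistaken for the end of the option. The data values are made up
        for illustration.
        %
\begin{codeexample}[code only]
\begin{tikzpicture}
  \datavisualization [school book axes, visualize as line]
    data [separator=\space] {
      x y
      0 0
      1 1
    }
    data [separator={,}] {
      x, y
      2, 1
      3, 0
    };
\end{tikzpicture}
\end{codeexample}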
    \end{key}
    %
    \begin{key}{/pgf/data/headline=\meta{headline}}
        When this key is set to a non-empty value, the value of \meta{headline}
        is used as the headline and the first line of the data is treated as a
        normal line rather than as a headline.
        %
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization}}]
\begin{tikzpicture}
  \datavisualization [school book axes, visualize as line]
    data [headline={x, y}] {
      0, 0
      1, 1
      2, 1
      3, 0
    };
\end{tikzpicture}
\end{codeexample}
    \end{key}
\end{dataformat}

\begin{dataformat}{named}
    Basically, each line of the data must consist of a comma-separated sequence
    of attribute--value pairs like |x=5, lo=500|. This will cause the
    attribute |/data point/x| to be set to |5| and |/data point/lo| to be set
    to |500|.
    %
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization}}]
\begin{tikzpicture}
  \datavisualization [school book axes, visualize as line]
    data [format=named] {
      x=0, y=0
      x=1, y=1
      x=2, y=1
      x=3, y=0
    };
\end{tikzpicture}
\end{codeexample}
    %
    However, instead of just specifying a single value for an attribute as in
    |x=5|, you may also specify a whole set of values as in |x={1,2,3}|. In
    this case, three data points will be created, one for each value in the
    list. Indeed, the |\foreach| statement is used to iterate over the list of
    values, so you can write things like |x={1,...,5}|.
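    %
    For instance, the following sketch (with made-up values) should create
    five data points with |x| ranging from |1| to |5|, all of which have |y|
    set to |2|:
    %
\begin{codeexample}[code only]
\tikz \datavisualization [school book axes, visualize as scatter]
  data [format=named] {
    x={1,...,5}, y=2
  };
\end{codeexample}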

    It is also permissible to specify lists of values for more than one
    attribute. In this case, a data point is created for each possible
    combination of values in the different lists:
    %
\begin{codeexample}[
    width=7cm,
    preamble={\usetikzlibrary{datavisualization}},
]
\tikz \datavisualization
  [scientific axes=clean,
   visualize as scatter/.list={a,b,c},
   style sheet=cross marks]
data [format=named] {
  x=0,       y={1,2,3},        set=a
  x={2,3,4}, y={3,4,5,7},      set=b
  x=6,       y={5,7,...,15},   set=c
};
\end{codeexample}
    %
\end{dataformat}

\begin{dataformat}{TeX code}
    This format will simply execute each line of the data, each of which should
    contain some normal TeX code. Note that at the end of each line control
    returns to the format handler, so for instance the arguments of a command
    may not be spread over several lines. However, not each line needs to
    produce a data point.
    %
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization}}]
\begin{tikzpicture}
  \datavisualization [school book axes, visualize as line]
    data [format=TeX code] {
      \pgfkeys{/data point/.cd,x=0, y=0} \pgfdatapoint
      \pgfkeys{/data point/.cd,x=1, y=1} \pgfdatapoint
      \pgfkeys{/data point/x=2}          \pgfdatapoint
      \pgfkeyssetvalue{/data point/x}{3}
      \pgfkeyssetvalue{/data point/y}{0} \pgfdatapoint
    };
\end{tikzpicture}
\end{codeexample}
    %
\end{dataformat}


\subsection{Reference: Advanced Formats}

\begin{tikzlibrary}{datavisualization.formats.functions}
    This library defines the formats described in the following, which allow
    you to specify the data points indirectly, namely via a to-be-evaluated
    function.

    \begin{dataformat}{function}
        This format allows you to specify a function that is then evaluated in
        order to create the desired data points. In other words, the data lines
        do not contain the data itself, but rather a functional description of
        the data.

        The format used to specify the function works as follows: Each nonempty
        line of the data should contain at least one of either a \emph{variable
        declaration} or a \emph{function declaration}. A variable declaration
        signals that a certain attribute will range over a given interval. The
        function declarations will then, later, be evaluated for values inside
        this interval. The syntax for a variable declaration is one of the
        following:
        %
        \begin{enumerate}
            \item |var |\declare{\meta{variable}}| : interval[|\meta{low}|:|\meta{high}|]|
                \opt{|samples |\meta{number}}|;|
            \item |var |\declare{\meta{variable}}| : interval[|\meta{low}|:|\meta{high}%
                |] step |\meta{step}|;|
            \item |var |\declare{\meta{variable}}| : {|\meta{values}|};|
        \end{enumerate}
        %
        In the first case, if the optional |samples| part is missing, the
        number of |samples| is taken from the value stored in the following
        key:
        %
        \begin{key}{/pgf/data/samples=\meta{number} (initially 25)}
            Sets the number of samples to be used when no sample number is
            specified.
        \end{key}
        %
        The meaning of declaring a variable to range over an |interval| is
        that the attribute named \meta{variable}, that is, the key
        |/data point/|\meta{variable}, will range over the interval
        $[\meta{low},\meta{high}]$. If the number of |samples| is given
        (directly or indirectly), the interval is evenly divided into
        \meta{number} many points and the attribute is set to each of these
        values. Similarly, when a \meta{step} is specified, this stepping is
        used to increase \meta{low} iteratively up to the largest value that is
        still less than or equal to \meta{high}.

        The meaning of declaring a variable using a list of \meta{values} is
        that the variable will simply iterate over the values using |\foreach|.
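        %
        The three forms can be sketched as follows; the variable names and
        numbers are chosen only for illustration. According to the rules
        above, the first line should yield five sample points in $[0,2]$, the
        second line should yield the values $0$, $0.5$, $1$, $1.5$, and $2$,
        and the third line should iterate over the four listed values.
        %
\begin{codeexample}[code only]
var x : interval [0:2] samples 5;
var t : interval [0:2] step 0.5;
var n : {1,2,4,8};
\end{codeexample}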

        You can specify more than one variable. In this case, each variable is
        varied independently of the other variables. For instance, if you
        declare an $x$-variable to range over the interval $[0,1]$ in $25$
        steps and you also declare a $y$-variable to range over the same
        interval, you get a total of $625$ value pairs.
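        %
        As a sketch with smaller (made-up) numbers, the following declares two
        variables with five samples each, so $5 \cdot 5 = 25$ data points
        should be generated, one for each pair of values. The |func| line uses
        the function declaration syntax explained below; the attribute |z|
        merely illustrates a value depending on both variables and is not used
        by the axes in this sketch.
        %
\begin{codeexample}[code only]
\tikz \datavisualization [school book axes, visualize as scatter]
  data [format=function] {
    var x : interval [0:1] samples 5;
    var y : interval [0:1] samples 5;
    func z = \value x * \value y;
  };
\end{codeexample}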

        The variable declarations specify which (input) variables will take
        which values. It is the job of the \emph{function declarations} to
        specify how some additional attributes are to be computed. The syntax
        of a function declaration is as follows:
        %
        \begin{quote}
            |func |\declare{\meta{attribute}}| = |\meta{expression}|;|
        \end{quote}
        %
        The meaning of such a declaration is the following: For each setting of
        the input variables (the variables specified using the |var|
        declaration), evaluate the \meta{expression} using the standard
        mathematical parser of \tikzname. The resulting value is then stored in
        |/data point/|\meta{attribute}.

        Inside \meta{expression} you can reference data point attributes using
        the following command, which is only defined inside such an expression:
        %
        \begin{command}{\value\marg{variable}}
            This expands to the current value of the key
            |/data point/|\meta{variable}.
        \end{command}

        There can be multiple function declarations in a single data
        specification. In this case, all of these functions will be evaluated
        for each setting of input variables.
        %
\begin{codeexample}[preamble={\usetikzlibrary{datavisualization.formats.functions}}]
\tikz
  \datavisualization [school book axes, visualize as smooth line]
    data [format=function] {
      var x : interval [-1.5:1.5];

      func y = \value x * \value x;
    };
\end{codeexample}
        %
\begin{codeexample}[
    width=6cm,
    preamble={\usetikzlibrary{datavisualization.formats.functions}},
]
\tikz \datavisualization [
  school book axes,
  all axes={unit length=5mm, ticks={step=2}},
  visualize as smooth line]
data [format=function] {
  var t : interval [0:2*pi];

  func x = \value t * cos(\value t r);
  func y = \value t * sin(\value t r);
};
\end{codeexample}
        %
\begin{codeexample}[
    width=7cm,
    preamble={\usetikzlibrary{datavisualization.formats.functions}},
]
\tikz \datavisualization [
  scientific axes=clean,
  y axis={ticks={style={
        /pgf/number format/fixed,
        /pgf/number format/fixed zerofill,
        /pgf/number format/precision=2}}},
  x axis={ticks={tick suffix=${}^\circ$}},
  visualize as smooth line/.list={1,2,3,4,5,6},
  style sheet=vary hue]
data [format=function] {
  var set : {1,...,6};
  var x : interval [0:50];
  func y = sin(\value x * (\value{set}+10))/(\value{set}+5);
};
\end{codeexample}
    \end{dataformat}
\end{tikzlibrary}


\subsection{Advanced: The Data Parsing Process}
\label{section-dv-parsing}

Whenever data is fed to the data visualization system, it will be handled by
the |\pgfdata| command, declared in the |datavisualization| module. The command
is used both to parse data stored in external sources (that is, in external
files or produced on the fly by calling an external command) and to parse
data given inline. A data format does not need to know whether data comes
from a file or is given inline; the |\pgfdata| command will take care of this.

Since \TeX\ will always read files in a line-wise fashion, data is always fed
to data format parsers in such a fashion. Thus, even if it would make more
sense for a format to ignore line breaks, the parser must still handle data
given line by line.

Let us now have a look at how |\pgfdata| works.

\begin{command}{\pgfdata\opt{\oarg{options}\marg{inline data}}}
    This command is used to feed data to the visualization pipeline. This
    command can only be used when a data visualization object has been properly
    set up, see Section~\ref{section-dv-main-setup}.


    \medskip
    \textbf{Basic options.}
    The |\pgfdata| command may be followed by \meta{options}, which are
    executed with the path |/pgf/data/|. Depending on these options, the
    \meta{options} may either be followed by \meta{inline data} or,
    alternatively, no \meta{inline data} is present and the data is read from
    an external source.

    The first important option is |read from file|, which governs which of
    these two alternatives applies:
    %
    \begin{key}{/pgf/data/read from file=\meta{filename} (initially \normalfont empty)}
        If you set the |read from file| attribute to a non-empty
        \meta{filename}, the data will be read from this file. In this case, no
        \meta{inline data} may be present; not even empty curly braces should
        be provided. If |read from file| is empty, the data must directly
        follow as \meta{inline data}.
        %
\begin{codeexample}[code only]
% Data is read from two external files:
\pgfdata[format=table, read from file=file1.csv]
\pgfdata[format=table, read from file=file2.csv]
\end{codeexample}
        %
\begin{codeexample}[code only]
% Data is given inline:
\pgfdata[format=table]
{
  x, y
  1, 2
  2, 3
}
\end{codeexample}
    \end{key}
    %
    \begin{key}{/pgf/data/inline}
        This is a shorthand for |read from file={}|. You can add this to make
        it clear(er) to the reader that data follows inline.
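        %
        A brief sketch, reusing the made-up inline data from the example
        above with the source made explicit:
        %
\begin{codeexample}[code only]
% Data is given inline, and the option list says so explicitly:
\pgfdata[inline, format=table]
{
  x, y
  1, 2
  2, 3
}
\end{codeexample}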
    \end{key}
    %
    The second important key is |format|, which is used to specify the data
    format:
    %
    \begin{key}{/pgf/data/format=\meta{format} (initially table)}
        Use this key to locally set the format used for parsing the data. The
        \meta{format} must be a format that has been previously declared using
        the |\pgfdeclaredataformat| command. See the reference section for a
        list of the predefined formats.
    \end{key}
    %
    In case all your data is in a certain format, you may wish to generally set
    the above key somewhere at the beginning of your file. Alternatively, you
    can use the following style to set up the |format| key and possibly further
    keys concerning the data format:
    %
    \begin{stylekey}{/pgf/every data}
        This style is executed by |\pgfdata| before the \meta{options} are
        parsed.

        Note that the path of this key is just |/pgf/|, not |/pgf/data/|. Also
        note that \tikzname\ internally sets the value of this key up in such a
        way that the keys |/tikz/every data| and also
        |/tikz/data visualization/every data| are executed. The bottom line of
        this is that when using \tikzname, you should not set this key
        directly, set |/tikz/every data| instead.
    \end{stylekey}
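    %
    A sketch of how this might look in \tikzname\ is given below. It assumes
    that all data in the document uses the |named| format; the key is given
    with its full path |/pgf/data/format| so that the sketch does not depend
    on which default path is active when the style is executed. The data
    values are made up.
    %
\begin{codeexample}[code only]
% Hypothetical document-wide setup: parse all data using the named format
\tikzset{every data/.style={/pgf/data/format=named}}

\tikz \datavisualization [school book axes, visualize as line]
  data {
    x=0, y=0
    x=1, y=1
    x=2, y=1
  };
\end{codeexample}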

    \medskip
    \textbf{Gathering of the data.}
    Once the data format and the source have been decided upon, the data is
    ``gathered''. During this phase the data is not actually parsed in detail,
    but just gathered so that it can later be parsed during the visualization.
    There are two different ways in which the data is gathered:
    %
    \begin{itemize}
        \item In case you have specified an external source, the data
            visualization object is told (by means of invoking the |add data|
            method) that it should (later) read data from the file specified
            by the |read from file| key using the format specified by the
            |format| key. The file is not read at this point, but only later
            during the actual visualization.
        \item Otherwise, namely when data is given inline, depending on which
            format is used, some catcodes get changed. This is necessary since
            \TeX's special characters are often not-so-special in a certain
            format.

            Independently of the format, the end-of-line character (carriage
            return) is made an active character.

            Finally, the \meta{inline data} is then read as a normal argument
            and the data visualization object is told that later on it should
            parse this data using the given format parser. Note that in this
            case the data visualization object must store the whole data
            internally.
    \end{itemize}
    %
    In both cases the ``data visualization object'' is the object stored
    in the |/pgf/data visualization/obj| key.


    \medskip
    \textbf{Parsing of the data.}
    During the actual data visualization, all code that has been added to the
    data visualization object by means of the |add data| method is executed
    several times. It is the job of this code to call the |\pgfdatapoint|
    method for all data points present in the data.

    When the |\pgfdata| method calls |add data|, the code that is passed to the
    data visualization object is just a call to internal macros of |\pgfdata|,
    which are able to parse the data stored in an external file or in the
    inlined data. Independently of where the data is stored, these macros
    always do the following:
    %
    \begin{enumerate}
        \item The catcodes are set up according to what the data format
            requires.
        \item Format-specific startup code gets called, which can initialize
            internal variables of the parsing process. (The catcode changes are
            not part of the startup code: in order to read inline data,
            |\pgfdata| must be able to temporarily set up the catcodes needed
            later on by the parsers, but since no parsing is done at that
            point, no startup code should be called then.)
        \item For each line of the data a format-specific code handler, which
            depends on the data format, is called. This handler gets the
            current line as input and should call |\pgfdatapoint| once for each
            data point that is encoded by this line (a line might define
            multiple data points or none at all). Empty lines are handled by
            special format-specific code.
        \item At the end, format-specific end code is executed.
    \end{enumerate}
    %
    For an example of how this works, see the description of the
    |\pgfdeclaredataformat| command.


    \medskip
    \textbf{Data sets.}
    There are three options that allow you to create \emph{data sets}. Such a
    data set is essentially a macro that stores a pre-parsed set of data that
    can be used multiple times in subsequent visualizations (or even in the
    same visualization).
    %
    \begin{key}{/pgf/data/new set=\meta{name}}
        Creates an empty data set called \meta{name}. If a data set of the same
        name already exists, it is overwritten and made empty. Data sets are
        global.
    \end{key}
    %
    \begin{key}{/pgf/data/store in set=\meta{name}}
        When this key is set to any non-empty \meta{name} and if this
        \meta{name} has previously been used with the |new set| key, then the
        following happens: For the current |\pgfdata| command, all parsed data
        is not passed to the rendering pipeline. Instead, the parsed data is
        appended to the data set \meta{name}. This includes all options passed
        to the |\pgfdata| command, which is why neither this key nor the
        previous key should be passed as options to a |\pgfdata| command.
    \end{key}
    %
    \begin{key}{/pgf/data/use set=\meta{name}}
        This works similarly to |read from file|. When this key is used with a
        |\pgfdata| command, no inline data may follow. Instead, the data stored
        in the data set \meta{name} is used.
    \end{key}
\end{command}


\subsection{Advanced: Defining New Formats}
\label{section-dv-declaring-formats}

In order to define a new data format you can use the following command, which
is a basic layer command defined in the module |datavisualization|:

\begin{command}{\pgfdeclaredataformat\marg{format name}\marg{catcode
    code}\marg{startup code}\marg{line arguments}\\\marg{line
    code}\marg{empty line code}\marg{end code}%
}
    This command defines a new data format called \meta{format name}, which can
    subsequently be used in the |\pgfdata| command. (In \tikzname, |data|
    maps directly to |\pgfdata|, so the following applies to \tikzname\ as
    well.)

    As explained in the description of the |\pgfdata| command, when data is
    being parsed that is formatted according to \meta{format name}, the
    following happens:
    %
    \begin{enumerate}
        \item The \meta{catcode code} is executed. This code should just
            contain catcode changes. The \meta{catcode code} will also be
            executed when inline data is read.
        \item Next, the \meta{startup code} is executed.
        \item Next, for each non-empty line of the data, the line is passed to
            a macro whose argument list is given by \meta{line arguments} and
            whose body is given by \meta{line code}. The idea is that you can
            use \TeX's powerful pattern matching capabilities to parse the
            non-empty lines. See also the example below.
        \item Empty lines are not processed by the \meta{line code}, but rather
            by the \meta{empty line code}. Typically, empty lines can simply be
            ignored and in this case you can let this parameter be empty.
        \item At the end of the data, the \meta{end code} is executed.
    \end{enumerate}

    As an example, let us now define a simple data format for reading files
    formatted in the following manner: Each line should contain a coordinate
    pair as in |(1.2,3.2)|, so two numbers separated by a comma and surrounded
    by parentheses. To make things more interesting, suppose that the hash mark
    symbol can be used to indicate comments. Here is an example of some data
    given in this format:
    %
\begin{codeexample}[code only]
# This is some data formatted according to the "coordinates" format
(0,0)
(0.5,0.25)
(1,1)
(1.5,2.25)
(2,4)
\end{codeexample}

    A format parser for this format could be defined as follows:
    %
\begin{codeexample}[code only]
\pgfdeclaredataformat{coordinates}
% First comes the catcode argument. We turn the hash mark into a comment character.
{\catcode`\#=14\relax}
% Second comes the startup code. Since we do not need to set anything up, we can leave
% it empty. Note that we could also set it to something like \begingroup, provided we
% put an \endgroup in the end code.
{}
% Now come the arguments for non-empty lines. Well, these should be of the form
% (#1,#2), so we specify that:
{(#1,#2)}
% Now we must do something with a line of this form. We store the #1 argument in
% /data point/x and #2 in /data point/y. Then we call \pgfdatapoint to create a data point.
{
  \pgfkeyssetvalue{/data point/x}{#1}
  \pgfkeyssetvalue{/data point/y}{#2}
  \pgfdatapoint
}
% We ignore empty lines:
{}
% And we also have no end code.
{}
\end{codeexample}
    %
    This format could now be used as follows:
    %
\begin{codeexample}[code only]
\begin{tikzpicture}
  \datavisualization[school book axes, visualize as smooth line]
  data [format=coordinates] {
    # This is some data formatted according
    # to the "coordinates" format
    (0,0)
    (0.5,0.25)
    (1,1)
    (1.5,2.25)
    (2,4)
  };
\end{tikzpicture}
\end{codeexample}
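    %
    The same mechanism can be adapted to other line patterns. The following
    sketch declares a hypothetical format named |pairs| (not part of
    \pgfname) in which each line has the form \meta{x}|:|\meta{y}. It assumes
    that the colon has its normal catcode, so no catcode changes are needed.
    %
\begin{codeexample}[code only]
\pgfdeclaredataformat{pairs}
% No catcode changes are needed for this (hypothetical) format.
{}
% No startup code.
{}
% Non-empty lines are matched against the pattern #1:#2.
{#1:#2}
% Store #1 in /data point/x and #2 in /data point/y, then create a data point.
{
  \pgfkeyssetvalue{/data point/x}{#1}
  \pgfkeyssetvalue{/data point/y}{#2}
  \pgfdatapoint
}
% Empty lines are ignored.
{}
% No end code.
{}
\end{codeexample}
    %
    Data lines such as |0:0|, |1:1|, and |2:4| could then be passed to
    |data [format=pairs]| in the same way as in the example above.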
    %
\end{command}