---
title: "Enabling integration of Python libraries and R packages for combined mass spectrometry data analysis"
package: SpectriPy
format:
  html:
    minimal: true
    theme: flatly
vignette: >
  %\VignetteIndexEntry{Enabling integration of Python libraries and R packages for combined mass spectrometry data analysis}
  %\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative}
  %\VignettePackage{SpectriPy}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{quarto::html}
  %\VignetteDepends{Spectra,BiocStyle,SpectriPy,reticulate,MsBackendMgf,MsDataHub,mzR}
---

**Compiled**: `r date()`

**Note**: since version 0.99.6 *SpectriPy* uses the newer, recommended approach
to install and configure required Python libraries (i.e., through *reticulate*'s
`py_require()` functionality). This affects also the way a pre-defined Python
environment can be used. See section @sec-python for updated information.

# Introduction

Powerful software libraries for mass spectrometry (MS) data are available in
both Python and R, covering specific and sometimes distinct aspects in the
analysis of proteomics and metabolomics data. R's *reticulate* package converts
basic data types between Python and R and enables a seamless interoperability of
both programming languages. The *SpectriPy* package extends *reticulate*
allowing to translate MS data data structures between R and Python,
specifically, between R's `Spectra` objects and Python MS data structures from
the [*matchms*](https://github.com/matchms/matchms) and
[*spectrum_utils*](https://github.com/bittremieux-lab/spectrum_utils) Python
libraries. In addition, functionality from Python's *matchms* library is
directly wrapped into R functions allowing a seamless integration into R based
workflows. *SpectriPy* thus enables powerful proteomics or metabolomics analysis
workflows combining the strengths of Python and R MS libraries.

This vignette provides information on how to share and translate MS data
structures between R and Python to enable combined Python and R-based analysis
workflows. For an example use case analysis, see the *SpectriPy tutorial:
Annotation of LC-MS/MS spectra* vignette.


# System requirements

The [*reticulate*](https://rstudio.github.io/reticulate/) package enables a
seamless integration of Python with R by translating the core data structures
between the two programming languages and sharing a Python (respectively R)
runtime environment, including shared variables, that can be accessed from the
other programming language. If the *reticulate* package is not already
available, it will be installed during the installation of *SpectriPy*.
Alternatively, it can be installed using `install.packages("reticulate")`.

The *SpectriPy* package builds on *reticulate* providing in addition the
functionality to translate data structures for MS data between the two
languages. Specifically, the package translates between R's `Spectra` objects
(from the `r BiocStyle::Biocpkg("Spectra")` package) and Python's
`matchms.Spectrum` and `spectrum_utils.spectrum.MsmsSpectrum` objects from the
[*matchms*](https://github.com/matchms/matchms) and
[*spectrum_utils*](https://github.com/bittremieux-lab/spectrum_utils) Python
libraries, respectively.


# Installation

If the *BiocManager* package is not already available, please install it with
`install.packages("BiocManager")`. As a system dependency, the package requires
Python (version >= 2.12) to be available. During package installation,
*SpectriPy* will by default install all required Python libraries
automatically. See section @sec-python for information on manual library
installation or usage of a pre-defined or system Python environment.

To install the package use the code below:

```{r}
#| eval: false
install.packages("BiocManager")
BiocManager::install("SpectriPy")
```


# Translating data structures between R and Python

## Library loading and system setup

Below we load all required packages. By loading *SpectriPy*, the package will
evaluate if the required Python libraries (i.e., *matchms* version 0.30.0 and
*spectrum_utils* version 0.3.2) are available. If they are not available,
*SpectriPy* will install them using functionality from the *reticulate* R
package, in particular, *reticulate*'s `py_require()` function (see also
[here](https://rstudio.github.io/reticulate/articles/versions.html) for
details). See section @sec-python for more configuration options of *SpectriPy*.

The *reticulate* package will be loaded by *SpectriPy*, ensuring the R/Python
integration provided by that package to be available as well. To better
discriminate between R and Python code chunks, we add the comment `#' R
session:` or `#' Python session:` to label the R and Python code chunks,
respectively.

```{r}
#| label: libraries
#| message: false
#' R session:

library(Spectra)
library(SpectriPy)
```


## Converting MS data from R to Python

In this section we show how MS data can be converted from R to Python. Below we
first define the name and path of a data file in mzML format (provided by the `r
BiocStyle::Biocpkg("MsDataHub")` package) and load that data as an R `Spectra`
object. We thus first ensure that the package *MsDataHub* is installed. If the
line below results in an error you need to first install *MsDataHub* using
`BiocManager::install("MsDataHub")`.

```{r}
stopifnot(require("MsDataHub"))
```

Next, we define the file path to the test file and load the data as a `Spectra`
object.

```{r}
#| label: load-data
#| message: false
#' R session:

#' Loading the data from a mzML file as a `Spectra` object
library(MsDataHub)
fl <- MsDataHub::PestMix1_DDA.mzML()
mzml_r <- Spectra(fl)
```

We next restrict the data to MS level 2 and remove spectra with less than 3
(fragment) peaks.

```{r}
#| label: filter MS level
#' R session:

#' Restrict to MS level 2 spectra
mzml_r <- filterMsLevel(mzml_r, 2)
mzml_r <- mzml_r[lengths(mzml_r) >= 3]
mzml_r
```

This `Spectra` object can now be converted to equivalent data structures in
Python using the `rspec_to_pyspec()` function:

```{r}
#| label: convert spectra to matchms
#' R session:

#' Convert the R Spectra to a list of Python matchms.Spectrum objects
tmp <- rspec_to_pyspec(mzml_r)
```

The `tmp` variable is now a Python list of `matchms.Spectrum` objects:

```{r}
#' R session:

#' Class of the converted variable
class(tmp)

#' First element
class(tmp[1])
```

Note that this Python data structure is now stored within the R session.
Therefore, we have two full copies of the data in memory, i.e. an R
`Spectra` object and a Python `matchms.Spectrum` object. We can access the
variable from the associated Python environment through *reticulate*'s special
`r` attribute using `r.<variable name in R>`.

```{python}
#' Python session:

#' Access the Python data structure stored in the R session
r.tmp
```

While it is thus possible to access the variable, it is suggested, and has also
performance advantages, if the Python data is stored directly in the
Python environment. Similar to the special `r` attribute, *reticulate* provides
the `py` variable in R that allows access to the main Python environment
associated with the R session. Attributes can be assigned to the Python
environment with the `py_set_attr()` function. We repeat our data conversion
operation but assign the result to a Python attribute `"mzml_py"`:

```{r}
#| label: assign variable to Python
#' R session:

#' Assign the data to a variable in the Python environment
py_set_attr(py, "mzml_py", rspec_to_pyspec(mzml_r))
names(py)
```

We can now access this data directly from within Python:

```{python}
#| label: data types of variables
#' Python session:

#' Data type of the variable:
type(mzml_py)

#' The length of the list:
len(mzml_py)

#' Data type of the first element:
type(mzml_py[0])

#' Intensities of the first spectrum:
mzml_py[0].peaks.intensities
```

This allows us to analyze and process the MS data directly in Python. As an
example we load the *matchms.filtering* library and scale the intensity
values of each spectrum with the `normalize_intensities()` function such that
their total sum is 1.

```{python}
#| label: normalize intensities with matchms
#' Python session:

import matchms.filtering as mms_filt

#' Iterate over the Spectrum list and scale the intensities
for i in range(len(mzml_py)):
    mzml_py[i] = mms_filt.normalize_intensities(mzml_py[i])

#' Intensities for the first spectrum
mzml_py[0].peaks.intensities
```

We can also access the changed data from R. The `py_get_attr()` function can be
used to retrieve the variable with the changed Python data. Through the
*reticulate* package it is then also possible to call attributes and Python
functions directly from R. To extract the intensities of the first spectrum we
can use the same code as above, just replacing `.` with `$`. Note also that,
since the variable returned by `py_get_attr()` is a Python object, we need to
use index `0` to access the first element.

```{r}
#| label: inspect changed intensities in R
#' R session:

#' Access the intensities of the first spectrum
py_get_attr(py, "mzml_py")[0]$peaks$intensities
```

Note that with the `rspec_to_pyspec()` function, we created a copy of the
original data in Python. We have thus now two variables, the `mzml_r` variable
in R with the original, unchanged, intensity values, and the `mzml_py` attribute
in Python with the scaled intensity values. To get the processing results back
to R we need to convert the data from Python to R. This can be done with the
`pyspec_to_rspec()` function which translates Python MS data structures into an
R `Spectra` object (see the following @sec-py-to-r section for more
information). A more elegant way is to use the `MsBackendPy` backend for
R' `Spectra` objects (see section @sec-backend).

With *reticulate* we can use the `py` special variable in R to access attributes
defined in the associated Python environment. Similarly, *reticulate* defines an
attribute `r` in the Python session that allows to access variables in R. When
variables are accessed this way, they are automatically converted to the
corresponding data types in the other programming language if an `r_to_py()`
method (or `py_to_r()` method) is implemented for them. Such methods are
defined for the basic data types, so, when accessing for example the `fl`
R variable from Python, it is converted from the R `character` data type to the
equivalent Python `str` data type:

```{python}
#' Python session:

#' Access the `fl` variable from the R session:
r.fl

type(r.fl)
```

*SpectriPy* implements an `r_to_py()` method for `Spectra` objects, so, when
`Spectra` objects are accessed from Python, they are also automatically
translated to a Python list of `matchms.Spectrum` objects:

```{python}
#' Python session:

#' Access the `Spectra` object with the original data from R; the data
#' gets directly translated on-the-fly
r.mzml_r

#' Access the intensities of the first spectrum
r.mzml_r[0].peaks.intensities
```

While this automatic conversion is convenient in some cases, the manual
translation with `rspec_to_pyspec()` and `pyspec_to_rspec()` is preferred, as it
allows to configure the handling of the spectra variables (i.e. metadata) and
avoids eventual repeated translation of the data.


## Converting MS data from Python to R {#sec-py-to-r}

To show conversion of MS data from Python to R, we import a test data file in
MGF format using the Python *matchms* library. This MGF file is provided within
the *SpectriPy* package so we first define its file name and path in R.

```{r}
#| label: define MGF file
#' R session:

f_mgf <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
```

We next load the required Python library and import the data in Python. We can
access the variable (defined in the R session) with the file name to import the
data from through the `r.<variable name>`.

```{python}
#| label: load MGF in Python
#| warning: false
#' Python session

import matchms
from matchms.importing import load_from_mgf

mgf_py = list(load_from_mgf(r.f_mgf))
mgf_py
```

The MS data from the MGF is now loaded in Python as a list of
`matchms.Spectrum` objects. We can also directly access this variable from the R
session through `py$<variable name in Python>`. Below we access the first
spectrum in that list:

```{r}
#' R session:

#' Access the first spectrum
py$mgf_py[[1]]
```

The data is thus provided in a *matchms.Spectrum* object. We extract the *m/z*
and intensity peak matrix from that spectrum using the built-in functionality
from the *reticulate* package that allows to call Python functions from R or
translate between basic data types. As an example we below get the `peaks`
attribute from the first spectrum and convert that to a Python `numpy`
array. This array can then be translated to an R `matrix` with the `py_to_r()`
function (see also the [*Calling Python from
R*](https://rstudio.github.io/reticulate/articles/calling_python.html) vignette
from the *reticulate* package for examples on accessing data from Python):

```{r}
#' R session:

#' Extract the peaks matrix from the first spectrum
py_to_r(py$mgf_py[[1]]$peaks$to_numpy)
```

While such functionality can thus be used to extract the MS data from Python,
it is for MS data analysis in R more convenient to transform the full MS data to
an R `Spectra` object using *SpectriPy*'s `pyspec_to_rspec()` function. Note
that below we use the `py_get_attr()` function to get the Python variable with
the MS data instead of `py$mgf_py`. Accessing the Python attribute through
`py$mgf_py` immediately translates the Python list into an R `list` through the
default `py_to_r()` function and `pyspec_to_rspec()` thus iterates over the R
`list` (in R). With `py_get_attr()` the attribute is accessed in its
native Python data type and `pyspec_to_rspec()` hence iterates over the data in
Python, which has a minimal performance advantage.

```{r}
#' R session:

#' Convert and copy the data to R
mgf_r <- pyspec_to_rspec(py_get_attr(py, "mgf_py"))
mgf_r

#' Extract intensities
mgf_r$intensity
```

The full MS data is thus now available as a `Spectra` object. Note however that
with `pyspec_to_rspec()` the complete MS data gets **copied**. We have now
two (detached) variables containing the same MS data, one in R and one in
Python.


### Using a dedicated MS data *backend* for MS data in Python {#sec-backend}

An alternative, and more elegant, approach to keeping multiple copies of
spectral objects (i.e. `Spectra`, `matchms.Spectrum`) in memory, is to use a
dedicated data *backend* for the `Spectra` object.
The R's `Spectra` object separates by design the functionality to analyze MS
data from the code to *represent* or retrieve the MS data, which is provided by
dedicated data *backends*. These allow for example to import MS data from
different file formats, or to store and access the data with different storage
modes, respectively (see also [this
tutorial](https://jorainer.github.io/SpectraTutorials/articles/Spectra-backends.html)
for more information). The *SpectriPy* packages defines such a backend, the
`MsBackendPy`, that allows to directly access MS data stored in the respective
data formats in Python.

As an example we create below a `Spectra` object for the MS data (in Python)
previously imported from the MGF file (i.e. `"mgf_py"`) and using the
`MsBackendPy` as the data source (i.e. backend).

```{r}
#| label: initialize MsBackendPy for Python variable
#' R session:

#' Create a Spectra object with a MsBackendPy backend for the
#' attribute "mgf_py"
mgf <- Spectra("mgf_py", source = MsBackendPy())
mgf
```

In contrast to using the `pyspec_to_rspec()` function, no data was copied or
converted by this call. The `Spectra` object (through its backend) does only
keep a *reference* to the original data attribute in Python, but no data. Data
is retrieved and translated on-the-fly each time it is requested from the
`Spectra` object. Thus, calling e.g. `msLevel()` or `intensity()` causes the
backend to iterate over the MS data in Python, extract the respective
information, translate it and return it to the user.

```{r}
#| label: access MS level and intensity through MsBackendPy
#' R session:

#' Extract MS level
msLevel(mgf)

#' Extract intensity values
intensity(mgf)
```

While the performance is a little lower, compared to translating all data to R
using the `pyspec_to_rspec()` function, this approach is much more memory
efficient, because no additional copies of MS data are generated and only the MS
data currently required for a certain analysis task (i.e. a subset of the data)
is loaded and translated at a time.

Also, because data is always retrieved on-the-fly from Python, any changes to
the MS data attribute in Python are also immediately reflected in the respective
`Spectra` object. To illustrate this we below scale the intensities of the mass
peaks in Python:

```{python}
#| label: normalize intensities of MGF in Python
#' Python session:

#' Scale intensities
for i in range(len(mgf_py)):
    mgf_py[i] = mms_filt.normalize_intensities(mgf_py[i])

```

The intensities retrieved from the `Spectra` object are now also scaled.

```{r}
#| label: get intensities after modifying in Python
#' R session:

#' Get intensities after scaling the intensities in Python:
intensity(mgf)
```


## Conversion of spectra variables

Conversion of the MS peaks data (i.e. the *m/z* and intensity values) is always
performed by the `rspec_to_pyspec()` and `pyspec_to_rspec()` functions. But next
to the peaks data, also additional information are available for individual
spectra. In R/*Spectra* these variables are called *spectra variables* while in
*matchms* they are stored as a *metadata* attribute to a `matchms.Spectrum`
object. The *SpectriPy* package defines a core set of spectra variables
that are by default converted by the `rspec_to_pyspec()` and `pyspec_to_rspec()`
function. These default variables can be accessed using the
`defaultSpectraVariableMapping()` function:

```{r}
#' R session:

#' List the *default* spectra variable mapping in R and python, respectively
defaultSpectraVariableMapping()
```

These variables, if present in the respective data object, are transferred (and
renamed) to the MS data structure of the other programming language. The names
of this character vector represent the name of the spectra variable in the
`Spectra` object, the elements (values) the names of the respective metadata
keys in Python's `matchms.Spectrum` class. This default mapping thus transfers
for example the `precursorMz()` spectra variable of R to the `"precursor_mz"`
spectrum metadata in Python (and *vice versa*). Note that any spectra variable
or metadata **not** being part of such a `mapping` will be ignored and hence not
converted.

Below we inspect the available metadata in the `matchms.Spectrum` objects that
were imported from the MGF file.

```{python}
#' Python session:

#' Available metadata for the first spectrum
mgf_py[0].metadata.keys()
```

Several additional metadata variables, such as `"smiles"`, `"inchi"` or
`"compound_name"`, not part of the default variables, are available. To
transfer also these, we create below a custom mapping adding in addition also a
mapping for these 3 variables:

```{r}
#' R session:

#' Add mapping for additional spectra variables to the default mapping in R and
#' python, respectively
map <- c(defaultSpectraVariableMapping(), smiles = "smiles",
         inchi = "inchi", name = "compound_name")
map
```

Such custom mapping can be passed with the parameter `mapping` to the
`pyspec_to_rspec()` (and also the `rspec_to_pyspec()`) function which will then
convert the full data to R.

```{r}
#| label: convert R to Python with spectra variables
#' R session:

#' Convert the Python MS data structures to an R `Spectra`
mgf_r <- pyspec_to_rspec(py_get_attr(py, "mgf_py"), mapping = map)
spectraVariables(mgf_r)
```

The respective metadata values have thus been added as new spectra variables to
our `Spectra` object and can also be extracted:

```{r}
#| label: access spectra variables from Python in R
#' R session:

#' Show the first values for the spectra variable "name"
mgf_r$name |>
    head()

#' Show the first values for the spectra variable "inchi"
mgf_r$inchi |>
    head()
```

When using a `MsBackendPy`, all metadata attributes from the `matchms.Spectrum`
objects can be accessed and extracted with the `spectraData()` function. The
`spectraVariables()` function lists all available columns:

```{r}
#' R session:

#' List available spectra variables
spectraVariables(mgf)
```

We thus have already access to e.g. the `"inchi"` variable:

```{r}
#' R session:

#' Get the first entries from the inchi variable
mgf$inchi |>
    head()
```

A `spectraVariableMapping()` can however be used to rename variables. Below we
add for example a mapping of the `matchms.Spectrum` metadata attribute
`"pepmassint"` to the (core) variable `"precursorIntensity"` using the
`MsBackendPy`:

```{r}
#| label: add additional mapping
#' R session:

#' Add mapping for additional spectra variables to the `MsBackendPy`
m <- defaultSpectraVariableMapping()
m["precursorIntensity"] <- "pepmassint"
spectraVariableMapping(mgf) <- m

#' List available spectra variables
spectraVariables(mgf)

#' Get spectra data for two variables
spectraData(mgf, c("precursorMz", "precursorIntensity"))
```

The `r_to_py()` methods do not support additional parameters, thus, in order to
use a similar mapping also with the `r_to_py()` method for `Spectra` the
*global* spectra variable mapping need to be changed. See the help of the
`setSpectraVariableMapping()` function for more details.


## Combined MS data analysis

With the functionality to translate between R and Python MS data structures, the
*SpectriPy* package enables thus a MS data analysis combining functionality
provided by both R and Python libraries. Some of the functionality of the
*matchms* Python library are directly wrapped into R functions simplifying
their use and inclusion in R-based workflows (see the help of
`compareSpectraPy()` and `filterSpectriPy()` functions). Combined analyses with
code chunks in both programming languages has however the advantage to use the
respective functionality provided by the original package/library.

As a simple use case we calculate below spectra similarities using two different
similarity scores between spectra from the mzML and the MGF files.

In R we can use the `compareSpectra()` function that by default calculates the
normalized dot product similarity between the compared spectra. The MS data from
the mzML file was processed in Python. To use this data we first create a
`Spectra` object with a `MsBackendPy` backend. For the MS data from the MGF file
we re-use the `mgf` variable, which is also a `Spectra` with a `MsBackendPy`
backend referencing the MS data imported and processed in Python.

```{r}
#| label: compareSpectra with mzML and MGF
#' R session:

#' Create a `Spectra` for the scaled MS data in Python
mzml <- Spectra("mzml_py", source = MsBackendPy())

#' Calculate the pairwise similarity between all spectra
sim <- compareSpectra(mzml, mgf, tolerance = 0.1)
dim(sim)
```

We can calculate the *Cosine Hungarian* similarity score in Python using the
functionality from the *matchms* library. Here we use the *original* attributes
available in Python, i.e. `mzml_py` and `mgf_py`.

```{python}
#| label: Cosine Hungarian similarity in Python
#' Python session:

import matchms.similarity as mms_similarity

#' Calculate similarity scores
scores = matchms.calculate_scores(
    mzml_py, mgf_py, mms_similarity.CosineHungarian(tolerance = 0.1))

#' Extract similarity scores
sim = scores.to_array()["CosineHungarian_score"]
```

Alternatively, we can also directly use any `Spectra` object from R as the
data is converted automatically when accessed from Python. We can for
example use `r.mzml` instead of `mzml_py` in the call above, which converts
the `Spectra` object `mzml` in R to Python before calculating the similarities.

We can also directly compare the scores calculated using the two different
algorithms.

```{r}
#| label: compare dot product vs Cosine Hungarian
#' R session:

#' Plot the similarity scores against each other
plot(sim, py$sim, pch = 21, col = "#000000ce", bg = "#00000060",
     xlab = "Dot product", ylab = "Cosine Hungarian")
grid()
```


## Summary

By translating between R and Python MS data structures, the *SpectriPy* package
enables data analyses that combine functionalities from both programming
language. In terms of (memory) efficiency, usage of *SpectriPy*'s `MsBackendPy`
backend for `Spectra` objects has clear advantages over the repeated translation
and copying of the MS data. See also section @sec-comments for general comments.


# Appendix

## General comments {#sec-comments}

- Be careful accessing Python attributes using `py$<attribute name>`: base
  Python data types will be automatically converted to the equivalent R data
  type. For MS data, it might be better to get the attributes using the
  `py_get_attr()` functions.

- Since MS data can be large, it is suggested the user converts MS data mostly
  manually, and only if/when needed, using `rspec_to_pyspec()` and
  `pyspec_to_rspec()`.

- The `rspec_to_pyspec()` and `pyspec_to_rspec()` functions **copy** the data
  while transferring. Thus, there will be eventually two (detached) copies of
  the same data in Python and R.

- The `MsBackendPy` backend allows to directly interface MS data from
  Python. Data will be converted on-the-fly, so no additional copies of the data
  exist.

- Be aware that, since the `MsBackendPy` backend does not contain any MS data
  but simply interfaces the MS data in Python, any changes to this data in
  Python affect also the `Spectra` object using that backend.


## Startup and Python configuration {#sec-python}

The way Python dependencies are defined and managed has changed in *SpectriPy*
beginning with version 0.99.6. *SpectriPy* now declares Python dependencies
using the `py_require()` function from the *reticulate* package. These Python
package dependencies requested via `py_require()` will automatically be
provisioned and made available for the user when the *SpectriPy* package is
loaded, *via* an ephemeral Python virtual environment. Eventually missing
libraries are downloaded and installed automatically.

The previous *virtualenv* or *conda*-based setup is now replaced by the
preferred `py_require()` approach, hence the previous package's options and
environment variables `"spectripy.use_system"`, `SPECTRIPY_USE_SYSTEM`,
`"spectripy.use_conda"`, `SPECTRIPY_USE_CONDA`, `"spectripy.env"`,
`SPECTRIPY_ENV` are now ignored.

- NOTE: still define SPECTRIPY_ENV/"spectripy.env"? maybe to be used in `import`
  parameter `delay_load`?

A pre-defined Python environment can be used by pointing the `RETICULATE_PYTHON`
environment variable to the respective Python binary or the
`RETICULATE_PYTHON_ENV` to the path of the environment. All Python libraries
required by *SpectriPy* need however to be installed and available in that
Python environment, as *SpectriPy* respectively *reticulate*'s `py_require()`
functionality will be bypassed. These Python libraries are:

- *matchms* >= 0.30.0
- *spectrum_utils* >= 0.3.2
- *numpy* >= 2.2.0

See also the help of the *reticulate* package for more information on
configuring Python with R.


# Session information

```{r}
#' R session:

sessionInfo()
```