--- title: "Enabling integration of Python libraries and R packages for combined mass spectrometry data analysis" package: SpectriPy format: html: minimal: true theme: flatly vignette: > %\VignetteIndexEntry{Enabling integration of Python libraries and R packages for combined mass spectrometry data analysis} %\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative} %\VignettePackage{SpectriPy} %\VignetteEncoding{UTF-8} %\VignetteEngine{quarto::html} %\VignetteDepends{Spectra,BiocStyle,SpectriPy,reticulate,MsBackendMgf,MsDataHub,mzR} --- **Compiled**: `r date()` **Note**: since version 0.99.6 *SpectriPy* uses the newer, recommended approach to install and configure required Python libraries (i.e., through *reticulate*'s `py_require()` functionality). This affects also the way a pre-defined Python environment can be used. See section @sec-python for updated information. # Introduction Powerful software libraries for mass spectrometry (MS) data are available in both Python and R, covering specific and sometimes distinct aspects in the analysis of proteomics and metabolomics data. R's *reticulate* package converts basic data types between Python and R and enables a seamless interoperability of both programming languages. The *SpectriPy* package extends *reticulate* allowing to translate MS data data structures between R and Python, specifically, between R's `Spectra` objects and Python MS data structures from the [*matchms*](https://github.com/matchms/matchms) and [*spectrum_utils*](https://github.com/bittremieux-lab/spectrum_utils) Python libraries. In addition, functionality from Python's *matchms* library is directly wrapped into R functions allowing a seamless integration into R based workflows. *SpectriPy* thus enables powerful proteomics or metabolomics analysis workflows combining the strengths of Python and R MS libraries. This vignette provides information on how to share and translate MS data structures between R and Python to enable combined Python and R-based analysis workflows. For an example use case analysis, see the *SpectriPy tutorial: Annotation of LC-MS/MS spectra* vignette. # System requirements The [*reticulate*](https://rstudio.github.io/reticulate/) package enables a seamless integration of Python with R by translating the core data structures between the two programming languages and sharing a Python (respectively R) runtime environment, including shared variables, that can be accessed from the other programming language. If the *reticulate* package is not already available, it will be installed during the installation of *SpectriPy*. Alternatively, it can be installed using `install.packages("reticulate")`. The *SpectriPy* package builds on *reticulate* providing in addition the functionality to translate data structures for MS data between the two languages. Specifically, the package translates between R's `Spectra` objects (from the `r BiocStyle::Biocpkg("Spectra")` package) and Python's `matchms.Spectrum` and `spectrum_utils.spectrum.MsmsSpectrum` objects from the [*matchms*](https://github.com/matchms/matchms) and [*spectrum_utils*](https://github.com/bittremieux-lab/spectrum_utils) Python libraries, respectively. # Installation If the *BiocManager* package is not already available, please install it with `install.packages("BiocManager")`. As a system dependency, the package requires Python (version >= 2.12) to be available. During package installation, *SpectriPy* will by default install all required Python libraries automatically. See section @sec-python for information on manual library installation or usage of a pre-defined or system Python environment. To install the package use the code below: ```{r} #| eval: false install.packages("BiocManager") BiocManager::install("SpectriPy") ``` # Translating data structures between R and Python ## Library loading and system setup Below we load all required packages. By loading *SpectriPy*, the package will evaluate if the required Python libraries (i.e., *matchms* version 0.30.0 and *spectrum_utils* version 0.3.2) are available. If they are not available, *SpectriPy* will install them using functionality from the *reticulate* R package, in particular, *reticulate*'s `py_require()` function (see also [here](https://rstudio.github.io/reticulate/articles/versions.html) for details). See section @sec-python for more configuration options of *SpectriPy*. The *reticulate* package will be loaded by *SpectriPy*, ensuring the R/Python integration provided by that package to be available as well. To better discriminate between R and Python code chunks, we add the comment `#' R session:` or `#' Python session:` to label the R and Python code chunks, respectively. ```{r} #| label: libraries #| message: false #' R session: library(Spectra) library(SpectriPy) ``` ## Converting MS data from R to Python In this section we show how MS data can be converted from R to Python. Below we first define the name and path of a data file in mzML format (provided by the `r BiocStyle::Biocpkg("MsDataHub")` package) and load that data as an R `Spectra` object. We thus first ensure that the package *MsDataHub* is installed. If the line below results in an error you need to first install *MsDataHub* using `BiocManager::install("MsDataHub")`. ```{r} stopifnot(require("MsDataHub")) ``` Next, we define the file path to the test file and load the data as a `Spectra` object. ```{r} #| label: load-data #| message: false #' R session: #' Loading the data from a mzML file as a `Spectra` object library(MsDataHub) fl <- MsDataHub::PestMix1_DDA.mzML() mzml_r <- Spectra(fl) ``` We next restrict the data to MS level 2 and remove spectra with less than 3 (fragment) peaks. ```{r} #| label: filter MS level #' R session: #' Restrict to MS level 2 spectra mzml_r <- filterMsLevel(mzml_r, 2) mzml_r <- mzml_r[lengths(mzml_r) >= 3] mzml_r ``` This `Spectra` object can now be converted to equivalent data structures in Python using the `rspec_to_pyspec()` function: ```{r} #| label: convert spectra to matchms #' R session: #' Convert the R Spectra to a list of Python matchms.Spectrum objects tmp <- rspec_to_pyspec(mzml_r) ``` The `tmp` variable is now a Python list of `matchms.Spectrum` objects: ```{r} #' R session: #' Class of the converted variable class(tmp) #' First element class(tmp[1]) ``` Note that this Python data structure is now stored within the R session. Therefore, we have two full copies of the data in memory, i.e. an R `Spectra` object and a Python `matchms.Spectrum` object. We can access the variable from the associated Python environment through *reticulate*'s special `r` attribute using `r.`. ```{python} #' Python session: #' Access the Python data structure stored in the R session r.tmp ``` While it is thus possible to access the variable, it is suggested, and has also performance advantages, if the Python data is stored directly in the Python environment. Similar to the special `r` attribute, *reticulate* provides the `py` variable in R that allows access to the main Python environment associated with the R session. Attributes can be assigned to the Python environment with the `py_set_attr()` function. We repeat our data conversion operation but assign the result to a Python attribute `"mzml_py"`: ```{r} #| label: assign variable to Python #' R session: #' Assign the data to a variable in the Python environment py_set_attr(py, "mzml_py", rspec_to_pyspec(mzml_r)) names(py) ``` We can now access this data directly from within Python: ```{python} #| label: data types of variables #' Python session: #' Data type of the variable: type(mzml_py) #' The length of the list: len(mzml_py) #' Data type of the first element: type(mzml_py[0]) #' Intensities of the first spectrum: mzml_py[0].peaks.intensities ``` This allows us to analyze and process the MS data directly in Python. As an example we load the *matchms.filtering* library and scale the intensity values of each spectrum with the `normalize_intensities()` function such that their total sum is 1. ```{python} #| label: normalize intensities with matchms #' Python session: import matchms.filtering as mms_filt #' Iterate over the Spectrum list and scale the intensities for i in range(len(mzml_py)): mzml_py[i] = mms_filt.normalize_intensities(mzml_py[i]) #' Intensities for the first spectrum mzml_py[0].peaks.intensities ``` We can also access the changed data from R. The `py_get_attr()` function can be used to retrieve the variable with the changed Python data. Through the *reticulate* package it is then also possible to call attributes and Python functions directly from R. To extract the intensities of the first spectrum we can use the same code as above, just replacing `.` with `$`. Note also that, since the variable returned by `py_get_attr()` is a Python object, we need to use index `0` to access the first element. ```{r} #| label: inspect changed intensities in R #' R session: #' Access the intensities of the first spectrum py_get_attr(py, "mzml_py")[0]$peaks$intensities ``` Note that with the `rspec_to_pyspec()` function, we created a copy of the original data in Python. We have thus now two variables, the `mzml_r` variable in R with the original, unchanged, intensity values, and the `mzml_py` attribute in Python with the scaled intensity values. To get the processing results back to R we need to convert the data from Python to R. This can be done with the `pyspec_to_rspec()` function which translates Python MS data structures into an R `Spectra` object (see the following @sec-py-to-r section for more information). A more elegant way is to use the `MsBackendPy` backend for R' `Spectra` objects (see section @sec-backend). With *reticulate* we can use the `py` special variable in R to access attributes defined in the associated Python environment. Similarly, *reticulate* defines an attribute `r` in the Python session that allows to access variables in R. When variables are accessed this way, they are automatically converted to the corresponding data types in the other programming language if an `r_to_py()` method (or `py_to_r()` method) is implemented for them. Such methods are defined for the basic data types, so, when accessing for example the `fl` R variable from Python, it is converted from the R `character` data type to the equivalent Python `str` data type: ```{python} #' Python session: #' Access the `fl` variable from the R session: r.fl type(r.fl) ``` *SpectriPy* implements an `r_to_py()` method for `Spectra` objects, so, when `Spectra` objects are accessed from Python, they are also automatically translated to a Python list of `matchms.Spectrum` objects: ```{python} #' Python session: #' Access the `Spectra` object with the original data from R; the data #' gets directly translated on-the-fly r.mzml_r #' Access the intensities of the first spectrum r.mzml_r[0].peaks.intensities ``` While this automatic conversion is convenient in some cases, the manual translation with `rspec_to_pyspec()` and `pyspec_to_rspec()` is preferred, as it allows to configure the handling of the spectra variables (i.e. metadata) and avoids eventual repeated translation of the data. ## Converting MS data from Python to R {#sec-py-to-r} To show conversion of MS data from Python to R, we import a test data file in MGF format using the Python *matchms* library. This MGF file is provided within the *SpectriPy* package so we first define its file name and path in R. ```{r} #| label: define MGF file #' R session: f_mgf <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy") ``` We next load the required Python library and import the data in Python. We can access the variable (defined in the R session) with the file name to import the data from through the `r.`. ```{python} #| label: load MGF in Python #| warning: false #' Python session import matchms from matchms.importing import load_from_mgf mgf_py = list(load_from_mgf(r.f_mgf)) mgf_py ``` The MS data from the MGF is now loaded in Python as a list of `matchms.Spectrum` objects. We can also directly access this variable from the R session through `py$`. Below we access the first spectrum in that list: ```{r} #' R session: #' Access the first spectrum py$mgf_py[[1]] ``` The data is thus provided in a *matchms.Spectrum* object. We extract the *m/z* and intensity peak matrix from that spectrum using the built-in functionality from the *reticulate* package that allows to call Python functions from R or translate between basic data types. As an example we below get the `peaks` attribute from the first spectrum and convert that to a Python `numpy` array. This array can then be translated to an R `matrix` with the `py_to_r()` function (see also the [*Calling Python from R*](https://rstudio.github.io/reticulate/articles/calling_python.html) vignette from the *reticulate* package for examples on accessing data from Python): ```{r} #' R session: #' Extract the peaks matrix from the first spectrum py_to_r(py$mgf_py[[1]]$peaks$to_numpy) ``` While such functionality can thus be used to extract the MS data from Python, it is for MS data analysis in R more convenient to transform the full MS data to an R `Spectra` object using *SpectriPy*'s `pyspec_to_rspec()` function. Note that below we use the `py_get_attr()` function to get the Python variable with the MS data instead of `py$mgf_py`. Accessing the Python attribute through `py$mgf_py` immediately translates the Python list into an R `list` through the default `py_to_r()` function and `pyspec_to_rspec()` thus iterates over the R `list` (in R). With `py_get_attr()` the attribute is accessed in its native Python data type and `pyspec_to_rspec()` hence iterates over the data in Python, which has a minimal performance advantage. ```{r} #' R session: #' Convert and copy the data to R mgf_r <- pyspec_to_rspec(py_get_attr(py, "mgf_py")) mgf_r #' Extract intensities mgf_r$intensity ``` The full MS data is thus now available as a `Spectra` object. Note however that with `pyspec_to_rspec()` the complete MS data gets **copied**. We have now two (detached) variables containing the same MS data, one in R and one in Python. ### Using a dedicated MS data *backend* for MS data in Python {#sec-backend} An alternative, and more elegant, approach to keeping multiple copies of spectral objects (i.e. `Spectra`, `matchms.Spectrum`) in memory, is to use a dedicated data *backend* for the `Spectra` object. The R's `Spectra` object separates by design the functionality to analyze MS data from the code to *represent* or retrieve the MS data, which is provided by dedicated data *backends*. These allow for example to import MS data from different file formats, or to store and access the data with different storage modes, respectively (see also [this tutorial](https://jorainer.github.io/SpectraTutorials/articles/Spectra-backends.html) for more information). The *SpectriPy* packages defines such a backend, the `MsBackendPy`, that allows to directly access MS data stored in the respective data formats in Python. As an example we create below a `Spectra` object for the MS data (in Python) previously imported from the MGF file (i.e. `"mgf_py"`) and using the `MsBackendPy` as the data source (i.e. backend). ```{r} #| label: initialize MsBackendPy for Python variable #' R session: #' Create a Spectra object with a MsBackendPy backend for the #' attribute "mgf_py" mgf <- Spectra("mgf_py", source = MsBackendPy()) mgf ``` In contrast to using the `pyspec_to_rspec()` function, no data was copied or converted by this call. The `Spectra` object (through its backend) does only keep a *reference* to the original data attribute in Python, but no data. Data is retrieved and translated on-the-fly each time it is requested from the `Spectra` object. Thus, calling e.g. `msLevel()` or `intensity()` causes the backend to iterate over the MS data in Python, extract the respective information, translate it and return it to the user. ```{r} #| label: access MS level and intensity through MsBackendPy #' R session: #' Extract MS level msLevel(mgf) #' Extract intensity values intensity(mgf) ``` While the performance is a little lower, compared to translating all data to R using the `pyspec_to_rspec()` function, this approach is much more memory efficient, because no additional copies of MS data are generated and only the MS data currently required for a certain analysis task (i.e. a subset of the data) is loaded and translated at a time. Also, because data is always retrieved on-the-fly from Python, any changes to the MS data attribute in Python are also immediately reflected in the respective `Spectra` object. To illustrate this we below scale the intensities of the mass peaks in Python: ```{python} #| label: normalize intensities of MGF in Python #' Python session: #' Scale intensities for i in range(len(mgf_py)): mgf_py[i] = mms_filt.normalize_intensities(mgf_py[i]) ``` The intensities retrieved from the `Spectra` object are now also scaled. ```{r} #| label: get intensities after modifying in Python #' R session: #' Get intensities after scaling the intensities in Python: intensity(mgf) ``` ## Conversion of spectra variables Conversion of the MS peaks data (i.e. the *m/z* and intensity values) is always performed by the `rspec_to_pyspec()` and `pyspec_to_rspec()` functions. But next to the peaks data, also additional information are available for individual spectra. In R/*Spectra* these variables are called *spectra variables* while in *matchms* they are stored as a *metadata* attribute to a `matchms.Spectrum` object. The *SpectriPy* package defines a core set of spectra variables that are by default converted by the `rspec_to_pyspec()` and `pyspec_to_rspec()` function. These default variables can be accessed using the `defaultSpectraVariableMapping()` function: ```{r} #' R session: #' List the *default* spectra variable mapping in R and python, respectively defaultSpectraVariableMapping() ``` These variables, if present in the respective data object, are transferred (and renamed) to the MS data structure of the other programming language. The names of this character vector represent the name of the spectra variable in the `Spectra` object, the elements (values) the names of the respective metadata keys in Python's `matchms.Spectrum` class. This default mapping thus transfers for example the `precursorMz()` spectra variable of R to the `"precursor_mz"` spectrum metadata in Python (and *vice versa*). Note that any spectra variable or metadata **not** being part of such a `mapping` will be ignored and hence not converted. Below we inspect the available metadata in the `matchms.Spectrum` objects that were imported from the MGF file. ```{python} #' Python session: #' Available metadata for the first spectrum mgf_py[0].metadata.keys() ``` Several additional metadata variables, such as `"smiles"`, `"inchi"` or `"compound_name"`, not part of the default variables, are available. To transfer also these, we create below a custom mapping adding in addition also a mapping for these 3 variables: ```{r} #' R session: #' Add mapping for additional spectra variables to the default mapping in R and #' python, respectively map <- c(defaultSpectraVariableMapping(), smiles = "smiles", inchi = "inchi", name = "compound_name") map ``` Such custom mapping can be passed with the parameter `mapping` to the `pyspec_to_rspec()` (and also the `rspec_to_pyspec()`) function which will then convert the full data to R. ```{r} #| label: convert R to Python with spectra variables #' R session: #' Convert the Python MS data structures to an R `Spectra` mgf_r <- pyspec_to_rspec(py_get_attr(py, "mgf_py"), mapping = map) spectraVariables(mgf_r) ``` The respective metadata values have thus been added as new spectra variables to our `Spectra` object and can also be extracted: ```{r} #| label: access spectra variables from Python in R #' R session: #' Show the first values for the spectra variable "name" mgf_r$name |> head() #' Show the first values for the spectra variable "inchi" mgf_r$inchi |> head() ``` When using a `MsBackendPy`, all metadata attributes from the `matchms.Spectrum` objects can be accessed and extracted with the `spectraData()` function. The `spectraVariables()` function lists all available columns: ```{r} #' R session: #' List available spectra variables spectraVariables(mgf) ``` We thus have already access to e.g. the `"inchi"` variable: ```{r} #' R session: #' Get the first entries from the inchi variable mgf$inchi |> head() ``` A `spectraVariableMapping()` can however be used to rename variables. Below we add for example a mapping of the `matchms.Spectrum` metadata attribute `"pepmassint"` to the (core) variable `"precursorIntensity"` using the `MsBackendPy`: ```{r} #| label: add additional mapping #' R session: #' Add mapping for additional spectra variables to the `MsBackendPy` m <- defaultSpectraVariableMapping() m["precursorIntensity"] <- "pepmassint" spectraVariableMapping(mgf) <- m #' List available spectra variables spectraVariables(mgf) #' Get spectra data for two variables spectraData(mgf, c("precursorMz", "precursorIntensity")) ``` The `r_to_py()` methods do not support additional parameters, thus, in order to use a similar mapping also with the `r_to_py()` method for `Spectra` the *global* spectra variable mapping need to be changed. See the help of the `setSpectraVariableMapping()` function for more details. ## Combined MS data analysis With the functionality to translate between R and Python MS data structures, the *SpectriPy* package enables thus a MS data analysis combining functionality provided by both R and Python libraries. Some of the functionality of the *matchms* Python library are directly wrapped into R functions simplifying their use and inclusion in R-based workflows (see the help of `compareSpectraPy()` and `filterSpectriPy()` functions). Combined analyses with code chunks in both programming languages has however the advantage to use the respective functionality provided by the original package/library. As a simple use case we calculate below spectra similarities using two different similarity scores between spectra from the mzML and the MGF files. In R we can use the `compareSpectra()` function that by default calculates the normalized dot product similarity between the compared spectra. The MS data from the mzML file was processed in Python. To use this data we first create a `Spectra` object with a `MsBackendPy` backend. For the MS data from the MGF file we re-use the `mgf` variable, which is also a `Spectra` with a `MsBackendPy` backend referencing the MS data imported and processed in Python. ```{r} #| label: compareSpectra with mzML and MGF #' R session: #' Create a `Spectra` for the scaled MS data in Python mzml <- Spectra("mzml_py", source = MsBackendPy()) #' Calculate the pairwise similarity between all spectra sim <- compareSpectra(mzml, mgf, tolerance = 0.1) dim(sim) ``` We can calculate the *Cosine Hungarian* similarity score in Python using the functionality from the *matchms* library. Here we use the *original* attributes available in Python, i.e. `mzml_py` and `mgf_py`. ```{python} #| label: Cosine Hungarian similarity in Python #' Python session: import matchms.similarity as mms_similarity #' Calculate similarity scores scores = matchms.calculate_scores( mzml_py, mgf_py, mms_similarity.CosineHungarian(tolerance = 0.1)) #' Extract similarity scores sim = scores.to_array()["CosineHungarian_score"] ``` Alternatively, we can also directly use any `Spectra` object from R as the data is converted automatically when accessed from Python. We can for example use `r.mzml` instead of `mzml_py` in the call above, which converts the `Spectra` object `mzml` in R to Python before calculating the similarities. We can also directly compare the scores calculated using the two different algorithms. ```{r} #| label: compare dot product vs Cosine Hungarian #' R session: #' Plot the similarity scores against each other plot(sim, py$sim, pch = 21, col = "#000000ce", bg = "#00000060", xlab = "Dot product", ylab = "Cosine Hungarian") grid() ``` ## Summary By translating between R and Python MS data structures, the *SpectriPy* package enables data analyses that combine functionalities from both programming language. In terms of (memory) efficiency, usage of *SpectriPy*'s `MsBackendPy` backend for `Spectra` objects has clear advantages over the repeated translation and copying of the MS data. See also section @sec-comments for general comments. # Appendix ## General comments {#sec-comments} - Be careful accessing Python attributes using `py$`: base Python data types will be automatically converted to the equivalent R data type. For MS data, it might be better to get the attributes using the `py_get_attr()` functions. - Since MS data can be large, it is suggested the user converts MS data mostly manually, and only if/when needed, using `rspec_to_pyspec()` and `pyspec_to_rspec()`. - The `rspec_to_pyspec()` and `pyspec_to_rspec()` functions **copy** the data while transferring. Thus, there will be eventually two (detached) copies of the same data in Python and R. - The `MsBackendPy` backend allows to directly interface MS data from Python. Data will be converted on-the-fly, so no additional copies of the data exist. - Be aware that, since the `MsBackendPy` backend does not contain any MS data but simply interfaces the MS data in Python, any changes to this data in Python affect also the `Spectra` object using that backend. ## Startup and Python configuration {#sec-python} The way Python dependencies are defined and managed has changed in *SpectriPy* beginning with version 0.99.6. *SpectriPy* now declares Python dependencies using the `py_require()` function from the *reticulate* package. These Python package dependencies requested via `py_require()` will automatically be provisioned and made available for the user when the *SpectriPy* package is loaded, *via* an ephemeral Python virtual environment. Eventually missing libraries are downloaded and installed automatically. The previous *virtualenv* or *conda*-based setup is now replaced by the preferred `py_require()` approach, hence the previous package's options and environment variables `"spectripy.use_system"`, `SPECTRIPY_USE_SYSTEM`, `"spectripy.use_conda"`, `SPECTRIPY_USE_CONDA`, `"spectripy.env"`, `SPECTRIPY_ENV` are now ignored. - NOTE: still define SPECTRIPY_ENV/"spectripy.env"? maybe to be used in `import` parameter `delay_load`? A pre-defined Python environment can be used by pointing the `RETICULATE_PYTHON` environment variable to the respective Python binary or the `RETICULATE_PYTHON_ENV` to the path of the environment. All Python libraries required by *SpectriPy* need however to be installed and available in that Python environment, as *SpectriPy* respectively *reticulate*'s `py_require()` functionality will be bypassed. These Python libraries are: - *matchms* >= 0.30.0 - *spectrum_utils* >= 0.3.2 - *numpy* >= 2.2.0 See also the help of the *reticulate* package for more information on configuring Python with R. # Session information ```{r} #' R session: sessionInfo() ```