Type: Package
Title: Determining the Number of Factors in Poisson Factor Models via Thinning Cross-Validation
Version: 0.1.0
Date: 2025-09-01
Description: Implements methods for selecting the number of factors in Poisson factor models, with a primary focus on Thinning Cross-Validation (TCV). The TCV method is based on the 'data thinning' technique, which probabilistically partitions each count observation into training and test sets while preserving the underlying factor structure. The Poisson factor model is then fit on the training set, and model selection is performed by comparing predictive performance on the test set. This toolkit is designed for researchers working with high-dimensional count data in fields such as genomics, text mining, and social sciences. The data thinning methodology is detailed in Dharamshi et al. (2025) <doi:10.1080/01621459.2024.2353948> and Wang et al. (2025) <doi:10.1080/01621459.2025.2546577>.
License: GPL (>= 3)
Encoding: UTF-8
Imports: stats, GFM, countsplit, irlba
LinkingTo: Rcpp, RcppArmadillo
Suggests: knitr, rmarkdown, testthat (>= 3.0.0)
SystemRequirements: C++17
RoxygenNote: 7.3.2
URL: https://github.com/Wangzhijingwzj/tcv
BugReports: https://github.com/Wangzhijingwzj/tcv/issues
NeedsCompilation: yes
Packaged: 2025-09-18 11:30:24 UTC; clswt-wangzhijing
Author: Zhijing Wang [aut, cre], Heng Peng [aut], Peirong Xu [aut]
Maintainer: Zhijing Wang <wangzhijing@sjtu.edu.cn>
Repository: CRAN
Date/Publication: 2025-09-23 07:40:02 UTC

Enforce Identifiability Constraints on Factor Model Components

Description

Post-processes the factor scores (H), loadings (B), and intercept (mu) with an SVD-based rotation so that the decomposition is unique. The rotation typically enforces orthogonality constraints.

Usage

add_identifiability(H, B, mu)

Arguments

H

A numeric matrix of factor scores (n x q).

B

A numeric matrix of factor loadings (p x q).

mu

A numeric vector of intercept (mean) terms, one per feature (length p).

Value

A list containing the transformed H, B, and mu that satisfy identifiability constraints.
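
One way such an SVD-based post-processing step could look is sketched below in base R. It is an illustration only, not the package's implementation: the helper name identify_sketch is hypothetical, and the specific constraints (column-centered H with H'H/n equal to the identity, and B'B diagonal) are an assumption borrowed from common practice for generalized factor models.

```r
# Hypothetical sketch; NOT the implementation of add_identifiability().
# Assumed constraints: colMeans(H) = 0, H'H / n = I_q, B'B diagonal.
identify_sketch <- function(H, B, mu) {
  n <- nrow(H)
  q <- ncol(H)
  h_bar <- colMeans(H)
  # Move the mean of the scores into the intercept
  mu_new <- as.vector(mu + B %*% h_bar)
  H_centered <- sweep(H, 2, h_bar)
  # Rank-q SVD of the centered low-rank part H B'
  s <- svd(H_centered %*% t(B), nu = q, nv = q)
  H_new <- sqrt(n) * s$u
  B_new <- s$v %*% diag(s$d[seq_len(q)], q, q) / sqrt(n)
  list(H = H_new, B = B_new, mu = mu_new)
}

set.seed(1)
n <- 20; p <- 6; q <- 2
H <- matrix(rnorm(n * q), n, q)
B <- matrix(rnorm(p * q), p, q)
mu <- runif(p)
out <- identify_sketch(H, B, mu)

# The natural parameter matrix should be unchanged by the transformation
theta_before <- H %*% t(B) + rep(1, n) %*% t(mu)
theta_after <- out$H %*% t(out$B) + rep(1, n) %*% t(out$mu)
max(abs(theta_before - theta_after))
```

The sketch only re-expresses the same fitted values in a canonical basis, so any downstream prediction is unaffected by it.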


Estimate the Number of Factors by the Eigenvalue Ratio of the Natural Parameter Matrix in a Generalized Factor Model

Description

Estimates the number of factors in a generalized factor model from the eigenvalue ratios of the natural parameter matrix.

Usage

chooseFacNumber_ratio(
  XList,
  types,
  q_set = 1:5,
  select_method = c("SVR", "IC"),
  offset = FALSE,
  dc_eps = 1e-04,
  maxIter = 30,
  verbose = FALSE,
  parallelList = NULL
)

Arguments

XList

A list containing an n by p matrix, where n is the number of samples and p is the number of features.

types

The type of data. In Poisson factor models, the type is "poisson".

q_set

An integer vector of candidate numbers of factors; its largest value is the maximum number of factors considered by the ratio method. Default is 1:5.

select_method

The method used to select the number of factors, either "SVR" or "IC" (see Usage).

offset

A logical value indicating whether to include an offset term. Default is FALSE.

dc_eps

The tolerance for convergence. Default is 1e-4.

maxIter

The maximum number of iterations. Default is 30.

verbose

A logical value indicating whether to print progress information. Default is FALSE.

parallelList

Settings that control parallel computation. Default is NULL.

Value

The number of factors estimated by the ratio method.
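
The idea behind a ratio-based selector can be illustrated with a short base-R sketch. It assumes an estimate of the natural parameter matrix is already available and picks the index at which the ratio of consecutive singular values is largest. The helper ratio_sketch and its arguments are hypothetical and do not reproduce the internals of chooseFacNumber_ratio.

```r
# Hypothetical sketch of a singular value ratio rule; NOT the package's code.
# Theta: an (assumed) estimate of the n x p natural parameter matrix.
# q_max: the largest number of factors to consider.
ratio_sketch <- function(Theta, q_max) {
  stopifnot(q_max + 1 <= min(dim(Theta)))
  # Remove column means so the intercept does not count as a factor
  Theta_centered <- sweep(Theta, 2, colMeans(Theta))
  d <- svd(Theta_centered, nu = 0, nv = 0)$d
  # Ratio of the k-th to the (k+1)-th singular value, k = 1, ..., q_max
  ratios <- d[1:q_max] / d[2:(q_max + 1)]
  which.max(ratios)
}

# Toy check: a rank-3 signal plus very small noise
set.seed(42)
n <- 100; p <- 40; q_true <- 3
H <- matrix(rnorm(n * q_true), n, q_true)
B <- matrix(rnorm(p * q_true), p, q_true)
Theta <- H %*% t(B) + matrix(rnorm(n * p, sd = 0.01), n, p)
q_hat <- ratio_sketch(Theta, q_max = 5)
q_hat
```

With a clear gap between signal and noise, the largest ratio sits at the true rank; with weak factors the gap can be less pronounced and the choice less stable.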


Perform Thinning Cross-Validation to Select Factor Number

Description

This function implements a K-fold cross-validation scheme based on data thinning (count splitting) to determine the optimal number of factors for a Poisson matrix factorization model.

Usage

multiDT(x, K = 5, rmax = 8)

Arguments

x

A numeric matrix of count data (n x p).

K

An integer, the number of folds for cross-validation. Default is 5.

rmax

An integer, the maximum number of factors to test. Default is 8.

Value

A list containing two elements:

- TCV: A numeric vector of total cross-validation error for each number of factors.
- TICV: A numeric vector of the natural logarithm of TCV.
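
The thinning step that underlies this cross-validation scheme can be sketched in base R. For a Poisson count, splitting it across K folds with equal probabilities yields K independent Poisson counts, each with mean equal to 1/K of the original mean. The helper thin_counts below is hypothetical: it is a minimal illustration of that property, it does not use the countsplit package listed in Imports, and it is not the code path taken by multiDT.

```r
# Hypothetical sketch of K-fold Poisson thinning; NOT the package's code.
# Each count is split multinomially (equal probabilities) into K folds,
# implemented as a sequence of conditional binomial draws.
thin_counts <- function(x, K) {
  folds <- vector("list", K)
  remaining <- x
  for (k in seq_len(K - 1)) {
    # Given what is left, fold k takes a Binomial(remaining, 1 / (K - k + 1)) share
    draw <- rbinom(length(x), size = remaining, prob = 1 / (K - k + 1))
    folds[[k]] <- matrix(draw, nrow(x), ncol(x))
    remaining <- remaining - folds[[k]]
  }
  folds[[K]] <- remaining
  folds
}

set.seed(7)
x_demo <- matrix(rpois(12, lambda = 5), nrow = 3, ncol = 4)
folds <- thin_counts(x_demo, K = 3)

# The folds add back up to the original counts
all(Reduce(`+`, folds) == x_demo)

# Fold 1 as test set, the remaining folds summed as training set
x_test <- folds[[1]]
x_train <- x_demo - x_test
```

In a cross-validation loop, each fold would take a turn as the test set while a model with a given number of factors is fit on the corresponding training counts.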

Examples

# 1. Set parameters for data generation
# Use smaller dimensions for a quick example
n <- 50 # Number of samples
p <- 30 # Number of features
true_q <- 2  # True number of factors

# 2. Generate data from a Poisson factor model
set.seed(123) # For reproducibility

# Factor matrix (scores)
FF <- matrix(rnorm(n * true_q), nrow = n, ncol = true_q)

# Loading matrix
BB <- matrix(runif(p * true_q, min = -1, max = 1), nrow = p, ncol = true_q)

# Intercept term
a <- runif(p, min = 0, max = 1)

# Enforce identifiability for a unique generating model
# Call add_identifiability() once and reuse the result
ident <- add_identifiability(FF, BB, a)
FF0 <- ident$H
BB0 <- ident$B
alpha <- ident$mu

# Calculate the mean matrix (lambda) with some noise
lambda <- exp(FF0 %*% t(BB0) + rep(1, n) %*% t(alpha) + matrix(rnorm(n*p, 0, 0.5), n, p))

# Generate the final count data matrix 'x'
x <- matrix(rpois(n * p, lambda = as.vector(lambda)), nrow = n, ncol = p)

# 3. Run multiDT to find the best number of factors
# Use small K and rmax for a quick example run
cv_results <- multiDT(x, K = 2, rmax = 4)

# 4. Print results and select the best 'r' based on the minimum TCV
print(cv_results$TCV)
best_r <- which.min(cv_results$TCV)
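
The TCV values above are cross-validation errors measured on held-out counts. One plausible way to score held-out Poisson counts, shown here purely as an assumption about the form such an error could take, is the negative Poisson log-likelihood. The helper poisson_nll is hypothetical and is not part of the package.

```r
# Hypothetical scoring helper; NOT the error definition used by multiDT().
# x_test:     held-out counts
# lambda_hat: predicted Poisson means for the same entries
poisson_nll <- function(x_test, lambda_hat) {
  -sum(dpois(x_test, lambda = lambda_hat, log = TRUE))
}

# Toy comparison: the true means should score better than a flat guess
set.seed(11)
lambda_true <- runif(1000, min = 1, max = 5)
x_new <- rpois(1000, lambda = lambda_true)
nll_true <- poisson_nll(x_new, lambda_true)
nll_flat <- poisson_nll(x_new, rep(mean(lambda_true), 1000))
c(true = nll_true, flat = nll_flat)
```

A lower value indicates better predictive fit, which matches the use of which.min to pick the number of factors.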