| Type: | Package | 
| Title: | Unsupervised Feature Selection using the Heterogeneous Correlation Matrix | 
| Version: | 1.0 | 
| Description: | Unsupervised multivariate filter feature selection using the UFS-rHCM or UFS-cHCM algorithms based on the heterogeneous correlation matrix (HCM). The HCM consists of Pearson's correlations between numerical features, polyserial correlations between numerical and ordinal features, and polychoric correlations between ordinal features. Tortora C., Madhvani S., Punzo A. (2025). "Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix." International Statistical Review. Forthcoming. | 
| License: | GPL-2 | 
| Imports: | polycor, dplyr, cluster, graphics, psych | 
| Depends: | R (≥ 3.5.0) | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| RoxygenNote: | 7.3.1 | 
| NeedsCompilation: | no | 
| Packaged: | 2025-10-23 09:22:44 UTC; cristina | 
| Author: | Cristina Tortora [aut, cre, fnd], Antonio Punzo [aut], Shaam Madhvani [aut] | 
| Maintainer: | Cristina Tortora <grikris1@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-10-28 08:40:02 UTC | 
Employee Satisfaction Index (ESI) Data Set
Description
The Employee Satisfaction Index (ESI) data set, from Kaggle (Harris, 2023), is a fictional data set that measures employee satisfaction.
Usage
data(ESI)
Format
A data frame with 500 rows and 10 features.
- emp_id: label.
- age: continuous from 23 to 45.
- Dept: categorical.
- location: binary.
- education: binary.
- recruitment_type: categorical.
- job_level: ordinal from 1 to 5.
- rating: ordinal from 1 to 5.
- onsite: binary.
- awards: number of awards, 0-9.
- certifications: binary.
- salary: continuous from 24.1 to 86.8.
- satisfied: binary.
Source
Harris, M. (2023). Employee Satisfaction Index Dataset. Evanston, Illinois: Kaggle. Version 1.
Feature importance bar plot
Description
Displays retained features for different values of alpha in a bar plot.
Usage
FS_barplot(
  data = NULL,
  grid.alpha = seq(0.01, 0.99, by = 0.01),
  missing = FALSE,
  pv_adj = "none",
  smooth.tol = 10^-12,
  method = "c"
)
Arguments
| data | A data frame. Values of type 'numeric' or 'integer' are treated as numerical. | 
| grid.alpha | A vector of alpha values to be plotted, default = seq(0.01,0.99,by=0.01). | 
| missing | Pairwise complete by default, set to TRUE for complete deletion. | 
| pv_adj | Correction method for p-value, "none" by default. For options see p.adjust. | 
| smooth.tol | Minimum acceptable eigenvalue for the smoothing, default 10^-12. | 
| method | Algorithm used. c (cell-wise) by default, r (row-wise) as the alternative. | 
Value
Displays a bar plot showing which features are selected at each value of alpha (multiplied by 100) and returns a list with elements:
| survivors | Vector giving, for each feature, the number of alpha values at which it is selected | 
| data_names | Vector depicting the corresponding names of the features | 
References
Tortora C., Madhvani S., Punzo A. (2025). Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix. International Statistical Review. https://doi.org/10.1111/insr.70016
Examples
 
data(ESI)
data=ESI[,-c(1,3,4,6,9)]##removing categorical features
FS_barplot(data, pv_adj='BH') #using BH adjustment for the p-values
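The pv_adj argument refers to the methods of p.adjust from base R. As a minimal sketch of what the 'BH' option does to a set of p-values, the snippet below applies the Benjamini-Hochberg adjustment to a few made-up values. The numbers are hypothetical and are not produced by this package; how the adjusted p-values are used internally is not shown here.

```r
# Hypothetical raw p-values, made up for illustration only.
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.27)

# Benjamini-Hochberg adjustment, the method named by pv_adj = "BH".
p_bh <- p.adjust(p_raw, method = "BH")

# Compare raw and adjusted values and flag those at or below 0.05.
print(data.frame(p_raw = p_raw, p_bh = p_bh, below_0.05 = p_bh <= 0.05))

# Names of all adjustment methods accepted by p.adjust().
print(p.adjust.methods)
```

Adjusted p-values are never smaller than the raw ones, so an adjustment other than "none" can only make the tests more conservative at a given alpha.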
Heterogeneous correlation and p-value matrices
Description
Extends the traditional correlation matrix (between numerical features) to also include binary and ordinal categorical features, and computes the p-values for the tests of zero correlation.
Usage
HCPM(data = NULL)
Arguments
| data | A data frame. Values of type 'numeric' or 'integer' are treated as numerical. | 
Value
A list with elements:
| cor_mat | The heterogeneous correlation matrix | 
| p_value | The matrix of p-values for the tests of zero correlation | 
References
Tortora C., Madhvani S., Punzo A. (2025). Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix. International Statistical Review. https://doi.org/10.1111/insr.70016
Examples
 
data(ESI)
data=ESI[,-c(1,3,4,6,9)]##removing categorical features
HCPM(data)
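To illustrate which kind of correlation a heterogeneous correlation matrix holds for each pair of features, the sketch below uses hetcor from the polycor package (listed under Imports) on simulated toy data. It is an assumption that this mirrors what HCPM computes; the toy data and the variable names x1, x2, o1, o2 are hypothetical.

```r
# Toy mixed-type data, simulated for illustration (not the ESI data).
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
# Ordinal features obtained by cutting numeric variables into levels.
o1 <- cut(0.5 * x1 + rnorm(n), breaks = c(-Inf, -0.5, 0.5, Inf),
          labels = 1:3, ordered_result = TRUE)
o2 <- cut(rnorm(n), breaks = c(-Inf, 0, Inf),
          labels = 1:2, ordered_result = TRUE)
toy <- data.frame(x1, x2, o1, o2)

# Run only if polycor is installed.
if (requireNamespace("polycor", quietly = TRUE)) {
  hc <- polycor::hetcor(toy)
  print(round(hc$correlations, 2))  # the heterogeneous correlation matrix
  print(hc$type)                    # Pearson, Polyserial or Polychoric per pair
}
```

The type matrix shows Pearson for pairs of numerical features, Polyserial for numerical-ordinal pairs, and Polychoric for ordinal-ordinal pairs, matching the composition of the HCM given in the package description.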
Jaccard Rate
Description
Computes the Jaccard index using Gower's dissimilarity.
Usage
JaccardRate(
  data, 
  data_red,
  k=6
)
Arguments
| data | A data frame. Values of type 'numeric' or 'integer' are treated as numerical. | 
| data_red | A data frame. A subset of data with the selected features. | 
| k | Number of neighbors, default = 6. | 
Value
| Jaccard Index | numeric | 
References
Zhao, Z., L. Wang, and H. Liu (2010). Efficient spectral feature selection with minimum redundancy. In Proceedings of the AAAI conference on artificial intelligence, Volume 24, pp. 673–678.
Examples
data(ESI)
data=ESI[,-c(1,3,4,6,9)] ##removing categorical features
out=UFS(data,alpha=0.01,method='c',pv_adj='BH')
JR=JaccardRate(data,out$selected.features)
JR #visualize the index
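As a rough sketch of the idea behind a neighborhood-based Jaccard index, the code below compares, for every observation, its k nearest neighbors under Gower's dissimilarity in the full data with those in the reduced data, and averages the Jaccard similarity of the two neighbor sets. This follows the general definition in Zhao et al. (2010) and is an assumption about the computation, not the package's own code. The helpers knn_gower and jaccard_sketch and the toy data are hypothetical.

```r
library(cluster)  # daisy() provides Gower's dissimilarity

# Indices of the k nearest neighbors of every row under Gower's dissimilarity.
knn_gower <- function(df, k) {
  d <- as.matrix(daisy(df, metric = "gower"))
  diag(d) <- Inf  # a point is not its own neighbor
  lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)])
}

# Average Jaccard similarity between neighborhoods in the full and reduced data.
jaccard_sketch <- function(data, data_red, k = 6) {
  nb_full <- knn_gower(data, k)
  nb_red  <- knn_gower(data_red, k)
  mean(mapply(function(a, b) length(intersect(a, b)) / length(union(a, b)),
              nb_full, nb_red))
}

# Toy mixed-type data, simulated for illustration.
set.seed(1)
toy <- data.frame(x = rnorm(50), y = rnorm(50),
                  g = factor(sample(1:3, 50, replace = TRUE), ordered = TRUE))

j_same <- jaccard_sketch(toy, toy)       # identical feature sets give 1
j_red  <- jaccard_sketch(toy, toy["x"])  # neighborhoods from one feature only
print(c(same = j_same, reduced = j_red))
```

Values close to 1 indicate that the selected features preserve the local neighborhood structure of the full data.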
Redundancy Rate
Description
Computes the Redundancy Rate using the heterogeneous correlation matrix.
Usage
RedRate(
  data_red
)
Arguments
| data_red | A data frame. A subset of data with the selected features. | 
Value
| Redundancy Rate | numeric | 
References
Zhao, Z., L. Wang, and H. Liu (2010). Efficient spectral feature selection with minimum redundancy. In Proceedings of the AAAI conference on artificial intelligence, Volume 24, pp. 673–678.
Examples
data(ESI)
data=ESI[,-c(1,3,4,6,9)] ##removing categorical features
out=UFS(data,alpha=0.01,method='c',pv_adj='BH')
RR=RedRate(out$selected.features)
RR #visualize the index
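One simple way to summarize redundancy among selected features is the average absolute pairwise correlation. The sketch below computes it on simulated numerical data, where the heterogeneous correlation matrix reduces to Pearson's correlations. It is an assumption that RedRate behaves similarly, and its exact normalization may differ; the helper mean_abs_offdiag and the toy data are hypothetical.

```r
# Average absolute off-diagonal entry of a correlation matrix.
# Assumption: a common redundancy summary, not necessarily RedRate()'s formula.
mean_abs_offdiag <- function(R) {
  mean(abs(R[upper.tri(R)]))
}

# Toy numerical data, simulated for illustration.
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.1),  # nearly a copy of x1
                  x3 = rnorm(100))                 # unrelated feature

red_all  <- mean_abs_offdiag(cor(toy))                   # x1 and x2 are redundant
red_drop <- mean_abs_offdiag(cor(toy[, c("x1", "x3")]))  # after dropping x2
print(c(all = red_all, without_x2 = red_drop))
```

A lower value suggests that the retained features carry less overlapping information.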
Unsupervised Feature Selection
Description
Performs unsupervised feature selection for mixed-type data using either the cell-wise (UFS-cHCM) or the row-wise (UFS-rHCM) algorithm. Both algorithms are based on the heterogeneous correlation matrix.
Usage
UFS(
  data = NULL,
  alpha = 0.05,
  missing = FALSE,
  pv_adj = "none",
  smooth.tol = 10^-12,
  method = "c"
)
Arguments
| data | A data frame. Values of type 'numeric' or 'integer' are treated as numerical, factors as ordinal categorical. | 
| alpha | Significance level to be used for testing, default = 0.05. | 
| missing | Pairwise complete by default, set to TRUE for complete deletion. | 
| pv_adj | Correction method for p-value, "none" by default. For options see p.adjust. | 
| smooth.tol | Minimum acceptable eigenvalue for the smoothing, default = 10^-12. | 
| method | Algorithm used. c (cell-wise) by default, r (row-wise) as the alternative. | 
Value
A list with elements:
| rearranged.data.set | Original data frame with numerical features first | 
| selected.features | A data frame of the selected features | 
| feature.indices | The indices of the selected features from the original data frame | 
| original.corr.matrix | The  | 
| corr.matrix | The  | 
| original.p.value.matrix | The  | 
| p.value.matrix | The  | 
References
Tortora C., Madhvani S., Punzo A. (2025). Designing unsupervised mixed-type feature selection techniques using the heterogeneous correlation matrix. International Statistical Review. https://doi.org/10.1111/insr.70016
Examples
data(ESI)#Loading the data
data = ESI[,-c(1,3,4,6,9)]##removing categorical features
res = UFS(data)
### visualize selected features
colnames(res$selected.features)