This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We’ll cover the basic workflow, input data requirements, and a simple example to get you started.
The mLLMCelltype workflow consists of these main steps:
First, load the mLLMCelltype package with library(mLLMCelltype).
Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:
# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key") # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key") # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key") # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models
You can obtain API keys from:
- Anthropic: https://console.anthropic.com/
- OpenAI: https://platform.openai.com/
- Google (Gemini): https://ai.google.dev/
- OpenRouter: https://openrouter.ai/keys
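Before making any calls, it can be useful to confirm which keys are visible to the current R session. The following base-R sketch only checks whether each environment variable is non-empty; it does not validate the keys with the providers.

```r
# Environment variable names used in this guide
key_vars <- c(
  "ANTHROPIC_API_KEY",
  "OPENAI_API_KEY",
  "GEMINI_API_KEY",
  "OPENROUTER_API_KEY"
)

# TRUE where a non-empty value is set, FALSE otherwise
key_status <- nzchar(Sys.getenv(key_vars))
names(key_status) <- key_vars
print(key_status)
```

Only the keys for the providers you plan to use need to be set.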
Alternatively, you can provide API keys directly in function calls:
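As a minimal sketch, the key can be passed through the api_key argument of annotate_cell_types(). The wrapper name annotate_with_key is hypothetical and used only for illustration; the argument names are the ones listed in this guide's parameter table.

```r
# Hypothetical helper: passes the key explicitly instead of relying on
# environment variables. It is only defined here, not called, because a
# real call needs the package, network access, and a valid key.
annotate_with_key <- function(markers, api_key) {
  mLLMCelltype::annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = "gpt-5",
    api_key = api_key
  )
}

# Usage:
# results <- annotate_with_key(markers, "your-openai-api-key")
```

Avoid committing scripts that contain literal keys; environment variables are usually the safer option.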
mLLMCelltype accepts marker gene data in several formats:
A data frame with the following columns:
- cluster: Cluster ID (preserved as-is from your data)
- gene: Gene name/symbol
- avg_log2FC or similar metric: Log fold change
- p_val_adj or similar metric: Adjusted p-value
Example:
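A small data frame in this layout might look as follows. The values are illustrative only, and the final check simply confirms that the expected columns are present.

```r
# Illustrative marker table with two clusters and two genes each
markers_example <- data.frame(
  cluster = c(0, 0, 1, 1),
  gene = c("CD3D", "CD3E", "CD14", "LYZ"),
  avg_log2FC = c(2.5, 2.3, 3.1, 2.8),
  p_val_adj = c(0.001, 0.001, 0.0001, 0.0002)
)

# Confirm that the columns described above exist
required_cols <- c("cluster", "gene", "avg_log2FC", "p_val_adj")
missing_cols <- setdiff(required_cols, names(markers_example))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
print(markers_example)
```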
You can directly use the output from Seurat’s
FindAllMarkers() function:
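As a sketch, the markers can be computed from a clustered Seurat object and passed on unchanged. The helper name markers_from_seurat and the threshold values are illustrative assumptions, not settings required by mLLMCelltype.

```r
# Hypothetical helper: computes cluster markers with Seurat.
# Assumes `seurat_obj` is a Seurat object that has already been clustered.
markers_from_seurat <- function(seurat_obj) {
  Seurat::FindAllMarkers(
    seurat_obj,
    only.pos = TRUE,        # keep only genes upregulated in each cluster
    min.pct = 0.25,         # illustrative threshold
    logfc.threshold = 0.25  # illustrative threshold
  )
}

# Usage (requires Seurat, mLLMCelltype, and a valid API key):
# markers <- markers_from_seurat(seurat_obj)
# results <- annotate_cell_types(
#   input = markers,
#   tissue_name = "human PBMC",
#   model = "gpt-5",
#   api_key = Sys.getenv("OPENAI_API_KEY")
# )
```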
A path to a CSV file containing marker gene data:
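The sketch below writes an illustrative marker table to a temporary CSV file and reads it back. It assumes the CSV uses the same column names as the data frame format; the file location and values are made up for the example.

```r
# Illustrative marker table
marker_table <- data.frame(
  cluster = c(0, 0, 1, 1),
  gene = c("CD3D", "CD3E", "CD14", "LYZ"),
  avg_log2FC = c(2.5, 2.3, 3.1, 2.8),
  p_val_adj = c(0.001, 0.001, 0.0001, 0.0002)
)

# Write it to a temporary CSV file without row names
markers_csv <- tempfile(fileext = ".csv")
write.csv(marker_table, markers_csv, row.names = FALSE)

# Read it back to confirm the header row
csv_check <- read.csv(markers_csv)
print(names(csv_check))

# The path could then be supplied as `input`:
# results <- annotate_cell_types(
#   input = markers_csv,
#   tissue_name = "human PBMC",
#   model = "gpt-5",
#   api_key = Sys.getenv("OPENAI_API_KEY")
# )
```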
The annotate_cell_types function has the following
parameters:
| Parameter | Description | Default Value |
|---|---|---|
| input | Marker gene data (data frame, list, or file path) | (required) |
| tissue_name | Tissue name (e.g., “human PBMC”, “mouse brain”) | NULL |
| model | LLM model to use | "gpt-5" |
| api_key | API key (if not set in environment) | NA |
| top_gene_count | Number of top genes per cluster to use | 10 |
| debug | Whether to print debugging information | FALSE |
Note: If api_key is set to NA, the function
will return the generated prompt without making an API call, which is
useful for reviewing the prompt before sending it to the API.
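A prompt preview along these lines could be wrapped in a small helper. The name preview_prompt is hypothetical, and the sketch relies on the api_key = NA behaviour described in the note above.

```r
# Hypothetical helper: builds the prompt without contacting any provider.
# It is only defined here; calling it requires mLLMCelltype to be installed.
preview_prompt <- function(markers, tissue_name = "human PBMC") {
  mLLMCelltype::annotate_cell_types(
    input = markers,
    tissue_name = tissue_name,
    model = "gpt-5",
    api_key = NA,        # NA: return the prompt instead of calling the API
    top_gene_count = 10
  )
}

# Usage:
# prompt <- preview_prompt(markers)
# print(prompt)
```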
Here’s a simple example using a single LLM model for annotation:
# Example marker data
markers <- data.frame(
cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002)
)
# Run annotation with a single model
results <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = "claude-sonnet-4-5-20250929",
api_key = Sys.getenv("ANTHROPIC_API_KEY"),
top_gene_count = 10,
debug = FALSE # Set to TRUE for more detailed output
)
# Print results
print(results)
For more reliable annotations, you can use multiple models and create a consensus:
# Define models to use
models <- c(
"claude-sonnet-4-5-20250929", # Anthropic
"gpt-5", # OpenAI
"gemini-2.5-pro" # Google
)
# API keys for different providers
api_keys <- list(
anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
openai = Sys.getenv("OPENAI_API_KEY"),
gemini = Sys.getenv("GEMINI_API_KEY")
)
# Run annotation with multiple models
results <- list()
for (model in models) {
provider <- get_provider(model)
api_key <- api_keys[[provider]]
results[[model]] <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = model,
api_key = api_key,
top_gene_count = 10
)
}
# Create consensus
consensus_results <- interactive_consensus_annotation(
input = markers,
tissue_name = "human PBMC",
models = models, # Use all the models defined above
api_keys = api_keys,
controversy_threshold = 0.7,
entropy_threshold = 1.0,
consensus_check_model = "claude-sonnet-4-5-20250929"
)
The function automatically prints a summary upon completion:
Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters
Cluster 0:
Final annotation: T cells
Consensus proportion: 1.0
Entropy: 0.0
Model predictions:
- claude-sonnet-4-5-20250929: T cells
- gpt-5: T cells
- gemini-2.5-pro: T cells
Cluster 1:
Final annotation: Monocytes
Consensus proportion: 1.0
Entropy: 0.0
Model predictions:
- claude-sonnet-4-5-20250929: Monocytes
- gpt-5: Monocytes
- gemini-2.5-pro: Monocytes
To add the annotations to your Seurat object:
# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)
# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
x = as.character(Idents(seurat_obj)),
from = names(consensus_results$final_annotations),
to = consensus_results$final_annotations
)
# Extract consensus metrics from the consensus results
# Note: These metrics are available in the consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) {
metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
return(list(
cluster = cluster_id,
consensus_proportion = metrics$consensus_proportion,
entropy = metrics$entropy
))
})
# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))
# Add consensus proportion to Seurat object
# (mapvalues() on a character vector returns characters, so convert back to numeric)
seurat_obj$consensus_proportion <- as.numeric(plyr::mapvalues(
x = as.character(Idents(seurat_obj)),
from = metrics_df$cluster,
to = metrics_df$consensus_proportion
))
# Add entropy to Seurat object
seurat_obj$entropy <- as.numeric(plyr::mapvalues(
x = as.character(Idents(seurat_obj)),
from = metrics_df$cluster,
to = metrics_df$entropy
))
Here’s a simple visualization of the results using Seurat:
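A minimal plotting sketch, assuming seurat_obj already has a dimensional reduction (for example UMAP) computed and carries the cell_type_consensus column added above. The helper name plot_consensus and the plot title are hypothetical.

```r
# Hypothetical helper: plots cells coloured by their consensus annotation.
# It is only defined here; calling it requires Seurat and ggplot2.
plot_consensus <- function(seurat_obj) {
  Seurat::DimPlot(
    seurat_obj,
    group.by = "cell_type_consensus",  # metadata column added earlier
    label = TRUE,                      # draw cluster labels on the plot
    repel = TRUE                       # keep labels from overlapping
  ) + ggplot2::ggtitle("Consensus cell type annotations")
}

# Usage:
# plot_consensus(seurat_obj)
```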
The output of annotate_cell_types() is a vector of cell
type annotations, where each element corresponds to a cluster.
The output of interactive_consensus_annotation() is a list containing:
- final_annotations: Final consensus cell type annotations
- initial_results: Initial predictions from each model
- controversial_clusters: List of clusters that required discussion
- discussion_logs: Detailed logs of the discussion process
- session_id: Unique identifier for the annotation session
When using consensus annotation, two key metrics, consensus proportion and entropy, help evaluate the reliability of annotations.
Clusters with low consensus proportion or high entropy may require manual review.
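To make the two metrics concrete, the sketch below computes them for a set of per-model labels. It assumes consensus proportion is the share of models that agree with the most common label and entropy is the Shannon entropy (base 2) of the label distribution. The package's internal calculation may differ, for example by grouping equivalent labels first. The function names are hypothetical, and the thresholds reuse the controversy_threshold and entropy_threshold values from the consensus example.

```r
# Hypothetical helper: agreement metrics for one cluster's predictions
label_agreement <- function(predictions) {
  counts <- table(predictions)
  p <- as.numeric(counts) / sum(counts)
  list(
    top_label = names(counts)[which.max(counts)],
    consensus_proportion = max(p),
    entropy = -sum(p * log2(p))
  )
}

# Hypothetical helper: flag a cluster for manual review
needs_review <- function(m, controversy_threshold = 0.7, entropy_threshold = 1.0) {
  m$consensus_proportion < controversy_threshold || m$entropy > entropy_threshold
}

unanimous <- label_agreement(c("T cells", "T cells", "T cells"))
divided <- label_agreement(c("T cells", "NK cells", "T cells"))

print(needs_review(unanimous))  # FALSE: all three models agree
print(needs_review(divided))    # TRUE: only 2 of 3 models agree
```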
If you don’t have access to paid API keys, you can use OpenRouter’s free models:
# Set OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")
# Use a free model
free_results <- annotate_cell_types(
input = markers,
tissue_name = "human PBMC",
model = "meta-llama/llama-4-maverick:free", # Note the :free suffix
api_key = Sys.getenv("OPENROUTER_API_KEY"),
top_gene_count = 10
)
# Print results
print(free_results)
Available free models (Updated Oct 2025):
- meta-llama/llama-4-maverick:free - Meta Llama 4 Maverick (256K context, best performance)
- deepseek/deepseek-r1:free - DeepSeek R1 (advanced reasoning)
- meta-llama/llama-3.3-70b-instruct:free - Meta Llama 3.3 70B (reliable)
- venice/uncensored:free - Venice Uncensored (new model)
- minimax/minimax-m2:free - MiniMax M2 (optimized for coding)
- z-ai/glm-4.5-air:free - GLM 4.5 Air (lightweight)
Important: OpenRouter reduced free tier limits in 2025:
- Free accounts: 50 requests/day (down from 200), 20 requests/minute
- Accounts with $10+ credits: 1000 requests/day
- Some models removed: NVIDIA Nemotron and others have exited the free tier
- For production use: Consider using paid models for better reliability
API Key Not Found:
Solution: Ensure you’ve set the correct API key environment variable or provided it directly in the function call.
Rate Limiting:
Solution: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.
Invalid Model Name:
Solution: Check that you’re using a supported model name and that it’s spelled correctly.
Network Issues:
Solution: Check your internet connection and try again. If the problem persists, the API service might be down.
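For the rate-limiting and transient network cases above, one common approach is to retry with exponentially growing waits. The sketch below is generic base R, not part of mLLMCelltype; with_retry is a hypothetical helper, and the failing function is a stand-in used only to demonstrate the behaviour.

```r
# Hypothetical helper: retries `fun` with exponentially growing waits
with_retry <- function(fun, max_attempts = 3, base_wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_attempts) {
      stop(result)  # give up and re-raise the last error
    }
    wait <- base_wait * 2^(attempt - 1)
    message(sprintf("Attempt %d failed (%s); retrying in %.2f s",
                    attempt, conditionMessage(result), wait))
    Sys.sleep(wait)
  }
}

# Demonstration with a stand-in function that fails twice, then succeeds
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("rate limit exceeded")
  "ok"
}
retry_result <- with_retry(flaky_call, max_attempts = 5, base_wait = 0.01)
print(retry_result)
```

In practice the annotation call would be wrapped in an anonymous function, for example with_retry(function() annotate_cell_types(input = markers, tissue_name = "human PBMC", model = "gpt-5", api_key = Sys.getenv("OPENAI_API_KEY"))). Note that this sketch retries on any error, so an invalid key or model name would also be retried before failing.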
Now that you understand the basics of mLLMCelltype, you can explore:
If you encounter any issues, check the FAQ or open an issue on our GitHub repository.