---
title: "Best Practices and Troubleshooting"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Best Practices and Troubleshooting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
```

## Best Practices

This vignette provides recommendations for optimal use of MultiK and solutions to common issues.

## 1. Data Preprocessing

### 1.1 Quality Control

Before running MultiK, ensure your data has been properly quality-controlled:

```{r qc, eval=FALSE}
library(Seurat)

# Standard QC metrics
seu <- PercentageFeatureSet(seu, pattern = "^MT-", col.name = "percent.mt")

# Filter cells
seu <- subset(seu,
  subset = nFeature_RNA > 200 &
    nFeature_RNA < 5000 &
    percent.mt < 20
)
```

### 1.2 Recommended Preprocessing

MultiK handles normalization internally, but for consistency:

```{r preprocess, eval=FALSE}
# If you want to pre-process
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, nfeatures = 2000)

# Pass to MultiK - it will scale and run PCA
result <- MultiK(seu, reps = 100)
```

## 2. Parameter Selection

### 2.1 Number of Repetitions (`reps`)

| Dataset Size | Recommended `reps` |
|--------------|-------------------|
| < 5,000 cells | 100-150 |
| 5,000-20,000 cells | 100 |
| > 20,000 cells | 50-100 |

**Rule of thumb**: More repetitions = more stable results, but longer runtime.

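
The table above can be written as a small lookup. This is only a sketch: `suggest_reps()` is a hypothetical helper, not a MultiK function, and it simply returns the lower and upper bounds listed in the table for a given cell count.

```{r reps-helper, eval=FALSE}
# Hypothetical helper (not part of MultiK): return the `reps` range
# recommended in the table above for a dataset with `n_cells` cells.
suggest_reps <- function(n_cells) {
  stopifnot(is.numeric(n_cells), length(n_cells) == 1, n_cells > 0)
  if (n_cells < 5000) {
    c(min = 100, max = 150)
  } else if (n_cells <= 20000) {
    c(min = 100, max = 100)
  } else {
    c(min = 50, max = 100)
  }
}

# Example: look up the range for a 3,000-cell dataset
suggest_reps(3000)
```

With a Seurat object, `ncol(seu)` gives the cell count to pass in; which end of the range to use depends on how much runtime you can afford.
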
### 2.2 Subsampling Proportion (`pSample`)

```{r psample, eval=FALSE}
# Default: 80% of cells
result <- MultiK(seu, pSample = 0.8)

# For very small datasets (< 500 cells), consider higher
result <- MultiK(seu, pSample = 0.9)

# For large datasets, 80% is usually sufficient
```

### 2.3 Resolution Range

```{r resolution, eval=FALSE}
# Default covers most use cases
result <- MultiK(seu, resolution = seq(0.05, 2, 0.05))

# If you expect few clusters (< 5)
result <- MultiK(seu, resolution = seq(0.05, 1, 0.05))

# If you expect many clusters (> 15)
result <- MultiK(seu, resolution = seq(0.1, 3, 0.1))
```

### 2.4 PCA Dimensions (`nPC`)

```{r npc, eval=FALSE}
# Determine optimal PCs using elbow plot
ElbowPlot(seu, ndims = 50)

# Use ~80-90% of variance captured
result <- MultiK(seu, nPC = 30) # Default is often good
```

## 3. Computational Considerations

### 3.1 Parallel Processing

```{r parallel, eval=FALSE}
# Check available cores
parallel::detectCores()

# Use all but one core
result <- MultiK(seu, cores = parallel::detectCores() - 1)

# For HPC/cluster environments with limited memory per core
result <- MultiK(seu, cores = 8)
```

### 3.2 Memory Management

For large datasets:

```{r memory, eval=FALSE}
# Reduce features to save memory
seu <- FindVariableFeatures(seu, nfeatures = 1500)

# Reduce PCA dimensions
result <- MultiK(seu, nPC = 20)

# Reduce reps if memory-constrained
result <- MultiK(seu, reps = 50)
```

### 3.3 Runtime Estimates

| Cells | Reps | Cores | Approximate Time |
|-------|------|-------|------------------|
| 2,000 | 100 | 4 | 5-10 min |
| 10,000 | 100 | 8 | 20-40 min |
| 50,000 | 50 | 16 | 1-2 hours |

## 4. Interpreting Results

### 4.1 Clear Optimal K

**Ideal scenario:**

- Single peak in K frequency distribution
- Low rPAC at that K
- Pareto-optimal point stands out

```{r clear, eval=FALSE}
# Result is straightforward
optK <- result$optimal_k
clusters <- getClusters(seu, optK = optK)
```

### 4.2 Multiple Candidate K Values

When multiple K values appear Pareto-optimal:

```{r multiple, eval=FALSE}
# Consider biological context
# Lower K: Major cell types
# Higher K: Subtypes/states

# Examine both
clusters_low <- getClusters(seu, optK = 3)
clusters_high <- getClusters(seu, optK = 5)

# Use SigClust to help decide
pval_low <- CalcSigClust(seu, clusters_low$clusters[, 1])
pval_high <- CalcSigClust(seu, clusters_high$clusters[, 1])
```

### 4.3 Hierarchical Relationships

```{r hierarchy, eval=FALSE}
# If K=5 is optimal but K=3 also looks good
# Check if 5 clusters = 3 major + 2 subtypes

# Run at both K values
PlotSigClust(seu, clusters_low$clusters[, 1], pval_low)
PlotSigClust(seu, clusters_high$clusters[, 1], pval_high)
```

## 5. Troubleshooting

### 5.1 "No valid consensus matrices found"

**Cause**: All clustering runs produced the same K.

**Solutions:**

```{r trouble1, eval=FALSE}
# Expand resolution range
result <- MultiK(seu, resolution = seq(0.01, 3, 0.02))

# Increase reps
result <- MultiK(seu, reps = 200)
```

### 5.2 High PAC for All K

**Cause**: Data may have continuous structure rather than discrete clusters.

**Solutions:**

```{r trouble2, eval=FALSE}
# Check for batch effects
DimPlot(seu, group.by = "batch")

# Consider trajectory analysis instead
# Or accept that data may have transitional populations
```

### 5.3 SigClust Returns NA

**Cause**: Too few cells in a cluster.

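
To confirm this cause, it can help to tabulate cluster sizes first. The sketch below uses base R only and assumes the cluster labels are available as a vector (for example `clusters$clusters[, 1]` from `getClusters()`). `small_clusters()` and its `min_cells` threshold of 10 are illustrative assumptions, not MultiK functions or defaults.

```{r cluster-sizes, eval=FALSE}
# Hypothetical helper (not part of MultiK): return the sizes of clusters
# that contain fewer than `min_cells` cells. The default threshold is an
# arbitrary illustration, not a MultiK or SigClust requirement.
small_clusters <- function(labels, min_cells = 10) {
  sizes <- table(labels)
  sizes[sizes < min_cells]
}

# Toy labels standing in for clusters$clusters[, 1]
labels <- c(rep("0", 120), rep("1", 45), rep("2", 4))
small_clusters(labels)
```
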
**Solutions:**

```{r trouble3, eval=FALSE}
# Use lower K to get larger clusters
clusters <- getClusters(seu, optK = result$optimal_k - 1)

# Or increase nsim
pval <- CalcSigClust(seu, clusters$clusters[, 1], nsim = 500)
```

### 5.4 Long Runtime

**Solutions:**

```{r trouble4, eval=FALSE}
# Reduce resolution granularity
result <- MultiK(seu, resolution = seq(0.1, 2, 0.1))

# Use more cores
result <- MultiK(seu, cores = parallel::detectCores())

# Reduce reps (minimum ~50 for stability)
result <- MultiK(seu, reps = 50)

# Subsample large datasets first
seu_sub <- seu[, sample(ncol(seu), 10000)]
result <- MultiK(seu_sub, reps = 100)
```

## 6. Validation Strategies

### 6.1 Biological Validation

```{r bio, eval=FALSE}
# Check known markers
FeaturePlot(seu, features = c("CD3D", "CD14", "MS4A1"))

# Compare to reference
# (if you have annotated reference data)
```

### 6.2 Technical Validation

```{r tech, eval=FALSE}
# Repeat the analysis with different seeds
set.seed(123)
results <- lapply(1:10, function(i) {
  MultiK(seu, reps = 100, seed = i)$optimal_k
})

# Check consistency
table(unlist(results))
```

### 6.3 Cross-Validation

```{r cv, eval=FALSE}
# Split data into two halves and check consistency
idx <- sample(ncol(seu), floor(ncol(seu) / 2))
seu1 <- seu[, idx]
seu2 <- seu[, -idx]

result1 <- MultiK(seu1, reps = 100)
result2 <- MultiK(seu2, reps = 100)

# Compare optimal K
c(result1$optimal_k, result2$optimal_k)
```

## 7. Reporting Guidelines

When publishing results using MultiK, report:

1. **Parameters used**: reps, pSample, resolution range, nPC
2. **Optimal K selected** and rationale
3. **Diagnostic plots** (included as a figure)
4. **SigClust p-values** for cluster validation
5. **MultiK version**: `packageVersion("MultiK")`

### Example Methods Text

> *"Optimal cluster number was determined using the MultiK algorithm (v1.0.0, Liu 2025). We performed 100 subsampling iterations with 80% cell sampling, testing resolution parameters from 0.05 to 2.0 in increments of 0.05. The optimal K was selected based on the Pareto frontier of frequency and stability (rPAC). Cluster significance was validated using pairwise SigClust tests with 100 simulations."*

## Author

**Zaoqu Liu, PhD**

- Email: liuzaoqu@163.com
- GitHub: [Zaoqu-Liu](https://github.com/Zaoqu-Liu)