Performance Tuning
Optimize GapClean for your specific use case.
Benchmarks
Typical performance on modern hardware (M1 Mac, 16 GB RAM):
| Alignment Size | Sequences | Positions | Time | Memory |
|---|---|---|---|---|
| Small | 100 | 1,000 | < 1s | < 100 MB |
| Medium | 1,000 | 10,000 | ~5s | < 500 MB |
| Large | 10,000 | 100,000 | ~60s | < 2 GB |
| Very Large | 100,000 | 100,000 | ~15min | < 5 GB |
Speed Optimization
Choose the Right Mode
Fastest to slowest:
- Seed mode: Only checks one sequence
- Threshold mode: Counts gaps across all sequences
- Entropy mode: Calculates entropy for each column
# Fastest
gapclean -i alignment.fa -o output.fa -s 0
# Fast
gapclean -i alignment.fa -o output.fa -t 75
# Slower (but still efficient)
gapclean -i alignment.fa -o output.fa -e 1.5
Optimize Chunk Sizes
For Speed (If You Have RAM)
# Larger chunks = fewer iterations = faster
gapclean -i alignment.fa -o output.fa -t 75 \
--row-chunk-size 10000 --col-chunk-size 10000
For Memory (If RAM is Limited)
# Smaller chunks = more iterations = slower but less memory
gapclean -i alignment.fa -o output.fa -t 75 \
--row-chunk-size 1000 --col-chunk-size 1000
Use Threshold Mode When Possible
If your goal is just to remove gappy columns, threshold mode is faster than entropy mode:
# Instead of this (slower):
gapclean -i alignment.fa -o output.fa -e 0.1
# Consider this (faster):
gapclean -i alignment.fa -o output.fa -t 75
Memory Optimization
Estimate Memory Usage
Peak memory ≈ row_chunk_size × col_chunk_size × 2 bytes
# 5000 × 5000 × 2 = 50 MB peak memory
gapclean -i alignment.fa -o output.fa -t 75 \
--row-chunk-size 5000 --col-chunk-size 5000
# 1000 × 1000 × 2 = 2 MB peak memory
gapclean -i alignment.fa -o output.fa -t 75 \
--row-chunk-size 1000 --col-chunk-size 1000
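The estimate above is plain arithmetic, so it can be checked before a run. A minimal sketch, assuming the 2-bytes-per-cell formula stated in this section; `estimate_peak_bytes` is a hypothetical helper, not a GapClean command:

```shell
# Hypothetical helper (not part of GapClean): apply the formula
# peak ≈ row_chunk_size × col_chunk_size × 2 bytes.
estimate_peak_bytes() {
  rows=$1
  cols=$2
  echo $(( rows * cols * 2 ))
}

estimate_peak_bytes 5000 5000   # 50000000 bytes (50 MB)
estimate_peak_bytes 1000 1000   # 2000000 bytes (2 MB)
```

The result is only the size of the chunk buffer implied by the formula; the interpreter and other working data add to it.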
For Very Large Alignments
# 100,000 sequences × 500,000 positions
# Use conservative chunk sizes
gapclean -i huge.fa -o cleaned.fa -t 75 \
--row-chunk-size 2000 --col-chunk-size 2000
Estimated peak memory: ~8 MB (2000 × 2000 × 2 bytes), versus ~100 GB (100,000 × 500,000 × 2 bytes) without chunking.
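Going the other way, a memory budget can be turned into a chunk size by inverting the same formula. This is a sketch under the assumption that square chunks are used and that 1 MB means 10^6 bytes; `chunk_size_for_budget_mb` is a hypothetical helper:

```shell
# Hypothetical helper: largest square chunk size (rows = cols) whose
# estimated peak (size × size × 2 bytes) fits in a budget given in MB.
chunk_size_for_budget_mb() {
  awk -v mb="$1" 'BEGIN { print int(sqrt(mb * 1000000 / 2)) }'
}

chunk_size_for_budget_mb 8    # 2000
chunk_size_for_budget_mb 50   # 5000
```

The returned value can be passed to both `--row-chunk-size` and `--col-chunk-size`.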
Monitor Memory Usage
On Unix systems:
On macOS:
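One common way to measure peak memory is the system `time` utility, which reports the maximum resident set size of a command. The sketch below assumes `/usr/bin/time` is GNU time on Linux (flag `-v`) and BSD time on macOS (flag `-l`); the wrapper names are hypothetical and this may not be the approach the authors intended here:

```shell
# Hypothetical helpers (not part of GapClean).
#   GNU time (Linux):  /usr/bin/time -v  reports "Maximum resident set size (kbytes)"
#   BSD time (macOS):  /usr/bin/time -l  reports "maximum resident set size" (bytes)
time_flag() {
  case "$(uname -s)" in
    Darwin) printf '%s\n' "-l" ;;
    *)      printf '%s\n' "-v" ;;   # assumption: any other system has GNU time
  esac
}

# Run any command under time with the memory-reporting flag.
run_with_peak_memory() {
  /usr/bin/time "$(time_flag)" "$@"
}

# Usage (not run here):
#   run_with_peak_memory gapclean -i alignment.fa -o output.fa -t 75
```

Note the unit difference between the two reports: kilobytes on Linux, bytes on macOS.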
Disk I/O Optimization
Use Fast Storage
- SSD: typically 5-10× faster than HDD
- NVMe SSD: typically about 2× faster than SATA SSD
- RAM disk: Fastest (for temp files), but it consumes RAM
Reduce Disk I/O
GapClean uses temporary files. Point the temporary directory at fast storage:
# Linux/macOS
export TMPDIR=/path/to/fast/ssd
gapclean -i alignment.fa -o output.fa -t 75
# Windows (cmd.exe)
set TEMP=C:\fast\ssd
gapclean -i alignment.fa -o output.fa -t 75
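For the RAM disk option, most Linux systems already expose a RAM-backed directory at `/dev/shm`. The following is a sketch, assuming GapClean honors `TMPDIR` as described above; `pick_fast_tmpdir` is a hypothetical helper:

```shell
# Hypothetical helper: prefer the RAM-backed /dev/shm when it exists and is
# writable; otherwise keep the current TMPDIR, falling back to /tmp.
pick_fast_tmpdir() {
  if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    printf '%s\n' /dev/shm
  else
    printf '%s\n' "${TMPDIR:-/tmp}"
  fi
}

# Usage (not run here):
#   TMPDIR="$(pick_fast_tmpdir)" gapclean -i alignment.fa -o output.fa -t 75
```

Files on a RAM disk count against available memory, so this suits alignments whose temporary files fit comfortably in RAM.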
Parallel Processing
The current version processes each alignment serially.
To process multiple alignments, parallelize externally:
# GNU Parallel
parallel -j 4 'gapclean -i {} -o {.}_cleaned.fa -t 75' ::: *.fa
# Simple shell loop with background jobs (starts one job per file, all at once)
for file in *.fa; do
gapclean -i "$file" -o "${file%.fa}_cleaned.fa" -t 75 &
done
wait
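If GNU Parallel is not installed and an unbounded background loop would start too many jobs, `xargs -P` (available in GNU and BSD xargs) can cap concurrency. A minimal sketch; `clean_all` and the `GAPCLEAN` override variable are hypothetical names introduced here, the latter only to allow a dry run:

```shell
# Hypothetical helper: run at most $1 jobs at a time over the remaining
# arguments. -0 reads NUL-separated names, -n 1 passes one file per job,
# -P caps the number of concurrent jobs.
# GAPCLEAN is an assumed override (e.g. GAPCLEAN=echo for a dry run).
clean_all() {
  max_jobs=$1
  shift
  printf '%s\0' "$@" |
    GAPCLEAN="${GAPCLEAN:-gapclean}" xargs -0 -n 1 -P "$max_jobs" sh -c \
      '"$GAPCLEAN" -i "$1" -o "${1%.fa}_cleaned.fa" -t 75' _
}

# Usage (not run here):
#   clean_all 4 *.fa
```

Each job still holds its own chunk buffers, so total memory grows with the number of concurrent jobs.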
Profiling
Check Which Step is Slow
GapClean shows progress for each step:
[GAPCLEAN] (1) Flattening input FASTA... ← Fast
[GAPCLEAN] (2) Splitting headers... ← Fast
[GAPCLEAN] (3) It's Gappin' Time...
[GAPCLEAN] Counting gaps: 100% ← May be slow for large alignments
[GAPCLEAN] Cleaning gaps: 100% ← May be slow for large alignments
[GAPCLEAN] (4) Stitching back headers... ← Fast
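To see how long each step takes, the progress lines can be prefixed with a timestamp. This is a sketch that assumes the progress messages are written line by line to stdout or stderr; `stamp_lines` is a hypothetical helper:

```shell
# Hypothetical helper: prefix every line read from stdin with the wall-clock time.
stamp_lines() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
  done
}

# Usage (not run here); 2>&1 also captures progress written to stderr:
#   gapclean -i alignment.fa -o output.fa -t 75 2>&1 | stamp_lines
```

Progress bars that redraw in place with carriage returns may arrive as a single long line, in which case only the step headers get useful timestamps.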
Performance Tips Summary
For Maximum Speed
- Use seed mode if possible
- Use large chunk sizes (if RAM allows)
- Use SSD storage
- Process multiple files in parallel
For Minimum Memory
- Use small chunk sizes (1000 × 1000)
- Process one file at a time
- Close other applications
For Balance
- Use default settings (5000 × 5000)
- Monitor first run to adjust if needed
- Threshold mode for most use cases
Expected Performance
Scaling Behavior
- Sequences: Linear scaling O(n)
- Positions: Linear scaling O(m)
- Overall: O(n × m) for threshold/entropy modes
Doubling either the number of sequences or the number of positions roughly doubles runtime; doubling both roughly quadruples it.
Comparison with Other Tools
GapClean is competitive with specialized tools:
- Faster than pure Python tools (uses NumPy)
- Comparable to C-based tools for typical alignments
- Better memory efficiency than most alternatives
Troubleshooting Slow Performance
Problem: Very slow on small alignment
Likely cause: Disk I/O (slow storage)
Solution: Move files to SSD or use RAM disk
Problem: Slow on large alignment
Likely cause: Normal behavior for large data
Solutions:
- Increase chunk sizes if you have RAM
- Use threshold mode instead of entropy mode
- Ensure you're using SSD storage
Problem: Out of memory error
Solution: Reduce chunk sizes
Problem: Taking too long
Check: Is your alignment unexpectedly large?
Expected time: ~1 minute per 1,000,000,000 characters (sequences × positions) on modern hardware.
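The rule of thumb above can be applied to a specific alignment before running it. A minimal sketch, assuming a standard FASTA file where every sequence starts with a `>` header line; both helpers are hypothetical and not GapClean commands:

```shell
# Hypothetical helper: number of sequences = number of FASTA header lines.
count_sequences() {
  grep -c '^>' "$1"
}

# Hypothetical helper: rough runtime in seconds, using ~60 s per 10^9 characters.
estimate_seconds() {
  seqs=$1
  positions=$2
  echo $(( seqs * positions * 60 / 1000000000 ))
}

estimate_seconds 10000 100000   # 60

# Usage (not run here):
#   count_sequences alignment.fa
```

The estimate ignores fixed startup cost, so small alignments will take longer than it predicts.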