Memory Management

Understanding and optimizing GapClean's memory usage.

Memory Architecture

GapClean uses three memory areas:

Input buffer: Reading sequences from disk
Processing buffer: 2D chunks during gap removal
Output buffer: Writing cleaned sequences

The largest is the processing buffer, controlled by chunk sizes.

Memory Requirements

Minimum

RAM = row_chunk_size × col_chunk_size × 2 bytes + overhead

Overhead includes: - Python interpreter: ~100 MB - NumPy arrays: ~50 MB - Temporary files (on disk, not RAM) - Input/output buffers: ~10 MB

Example with defaults (5000 × 5000):

RAM = 5000 × 5000 × 2 + 160 MB
    = 50 MB + 160 MB  
    = ~210 MB total

Memory Profiles by Alignment Size

Small Alignments (< 1 MB)

Sequences: < 1,000
Positions: < 10,000
Memory: < 256 MB
Settings: Use defaults

gapclean -i small.fa -o cleaned.fa -t 75

Medium Alignments (1-100 MB)

Sequences: 1,000 - 10,000
Positions: 10,000 - 100,000
Memory: 256 MB - 2 GB
Settings: Use defaults or slightly larger

gapclean -i medium.fa -o cleaned.fa -t 75 \
  --row-chunk-size 5000 --col-chunk-size 5000

Large Alignments (100 MB - 1 GB)

Sequences: 10,000 - 100,000
Positions: 100,000 - 1,000,000
Memory: 512 MB - 4 GB
Settings: Adjust based on available RAM

# If you have 8 GB RAM
gapclean -i large.fa -o cleaned.fa -t 75 \
  --row-chunk-size 5000 --col-chunk-size 5000

# If you have 4 GB RAM
gapclean -i large.fa -o cleaned.fa -t 75 \
  --row-chunk-size 3000 --col-chunk-size 3000

Very Large Alignments (> 1 GB)

Sequences: > 100,000
Positions: > 1,000,000
Memory: 512 MB - 8 GB
Settings: Conservative chunks

# Conservative settings
gapclean -i huge.fa -o cleaned.fa -t 75 \
  --row-chunk-size 2000 --col-chunk-size 2000

Calculating Optimal Chunk Sizes

Formula

chunk_size = sqrt(available_RAM_MB × 500,000)

Then use same size for both row and column chunks.

Examples:

# 2 GB available RAM
chunk = sqrt(2000 × 500,000) = sqrt(1,000,000,000) ≈ 31,600

# 512 MB available RAM
chunk = sqrt(512 × 500,000) = sqrt(256,000,000) ≈ 16,000

# 256 MB available RAM
chunk = sqrt(256 × 500,000) = sqrt(128,000,000) ≈ 11,300

Quick Reference Table

Available RAM	Recommended Chunk Size
256 MB	1,000 × 1,000
512 MB	3,000 × 3,000
1 GB	5,000 × 5,000 (default)
2 GB	10,000 × 10,000
4 GB	15,000 × 15,000
8 GB	20,000 × 20,000

Monitoring Memory Usage

During Execution

Linux

# Monitor memory in real-time
watch -n 1 'ps aux | grep gapclean'

# Or use top
top -p $(pgrep -f gapclean)

macOS

# Monitor memory
top | grep gapclean

# Or use Activity Monitor (GUI)

Windows

# Task Manager or:
Get-Process python | Select-Object -Property Name,CPU,Memory

Post-Execution Analysis

# Linux
/usr/bin/time -v gapclean -i alignment.fa -o output.fa -t 75

# macOS
/usr/bin/time -l gapclean -i alignment.fa -o output.fa -t 75

Look for "Maximum resident set size".

Memory Optimization Strategies

Strategy 1: Reduce Chunk Sizes

When: Running out of memory

# Current settings cause OOM
gapclean -i alignment.fa -o output.fa -t 75 \
  --row-chunk-size 10000 --col-chunk-size 10000  # ✗ OOM

# Reduce chunks
gapclean -i alignment.fa -o output.fa -t 75 \
  --row-chunk-size 2000 --col-chunk-size 2000    # ✓ Works

Strategy 2: Use Disk Temp Space

When: Have fast SSD but limited RAM

# Use SSD for temp files
export TMPDIR=/mnt/fast_ssd/tmp
gapclean -i alignment.fa -o output.fa -t 75 \
  --row-chunk-size 1000 --col-chunk-size 1000

Strategy 3: Split Large Alignments

When: Alignment is huge and you have very limited RAM

# Split alignment into chunks
split -l 20000 alignment.fa chunk_

# Process each chunk
for chunk in chunk_*; do
  gapclean -i "$chunk" -o "cleaned_$chunk" -t 75
done

# Merge results
cat cleaned_chunk_* > final_cleaned.fa

Strategy 4: Use Seed Mode

When: Don't need gap statistics, just reference-based cleaning

# Seed mode uses less memory
gapclean -i alignment.fa -o output.fa -s 0

Memory Troubleshooting

Problem: Out of Memory (OOM) Error

Symptoms:

MemoryError: Unable to allocate array

Solutions (in order of preference):

Reduce chunk sizes:

--row-chunk-size 1000 --col-chunk-size 1000

Close other applications
Use seed mode instead of threshold mode
Split the alignment into smaller pieces

Problem: Swap Thrashing

Symptoms: - System becomes very slow - Disk activity LED constantly on - Process takes 10× longer than expected

Solutions:

Reduce chunk sizes significantly:

--row-chunk-size 500 --col-chunk-size 500

Close other applications
Increase system swap space (if possible)

Problem: Python Crashes Without Error

Symptoms: - Process terminates suddenly - No error message - System log shows OOM killer