Skip to content

Entropy-based Gap Removal

Remove columns based on Shannon entropy (information content).

Two Approaches

GapClean offers two entropy-based filtering modes:

Remove Conserved Regions (--entropy-min)

gapclean -i alignment.fa -o output.fa --entropy-min THRESHOLD

Removes columns with entropy < threshold (removes conserved, keeps variable).

Use case: SNP detection, diversity studies, finding hypervariable regions

Remove Variable Regions (--entropy-max)

gapclean -i alignment.fa -o output.fa --entropy-max THRESHOLD

Removes columns with entropy > threshold (removes variable, keeps conserved).

Use case: Alignment cleaning, focusing on conserved functional regions

What is Entropy?

Entropy measures the diversity of characters in a column:

  • Low entropy (0): All characters the same (conserved)
  • High entropy: Many different characters (variable)

Shannon entropy formula: H = -Σ p(char) * log2(p(char))

How It Works

GapClean calculates Shannon entropy for each column:

Example columns:

Column 1: AAAA (all same) → entropy = 0 bits (conserved)
Column 2: AATT (half A, half T) → entropy = 1.0 bits (moderate)
Column 3: ATCG (equal mix) → entropy = 2.0 bits (variable)

With --entropy-min 1.5 (remove conserved, keep variable): - Column 1 removed (0 < 1.5) - too conserved - Column 2 removed (1.0 < 1.5) - too conserved - Column 3 kept (2.0 >= 1.5) - variable enough

With --entropy-max 1.5 (remove variable, keep conserved): - Column 1 kept (0 < 1.5) - conserved - Column 2 kept (1.0 < 1.5) - conserved - Column 3 removed (2.0 > 1.5) - too variable

Entropy Ranges

DNA/RNA Alignments

Maximum entropy: ~2.0 bits (4 nucleotides)

For finding variable regions (SNPs, diversity):

# Keep only variable positions
gapclean -i dna.fa -o variable.fa --entropy-min 1.0

# Keep highly variable positions only
gapclean -i dna.fa -o hypervariable.fa --entropy-min 1.5

For alignment cleaning (keep conserved):

# Remove noisy/hypervariable columns
gapclean -i dna.fa -o conserved.fa --entropy-max 1.5

# Remove moderately variable columns
gapclean -i dna.fa -o highly_conserved.fa --entropy-max 1.0

Protein Alignments

Maximum entropy: ~4.32 bits (20 amino acids)

For finding variable regions:

# Keep variable positions
gapclean -i protein.fa -o variable.fa --entropy-min 2.0

# Keep highly variable positions
gapclean -i protein.fa -o hypervariable.fa --entropy-min 3.0

For alignment cleaning (keep conserved):

# Remove noisy columns
gapclean -i protein.fa -o conserved.fa --entropy-max 3.0

# Keep only highly conserved positions
gapclean -i protein.fa -o highly_conserved.fa --entropy-max 2.0

Understanding the Output

[GAPCLEAN] Calculating entropy: 100%|████████| 5000/5000
[GAPCLEAN] Average entropy: 1.75 bits
[GAPCLEAN] Removed 1230 columns with entropy < 1.5
[GAPCLEAN] Final alignment length: 3770 columns (was 5000)

The average entropy tells you about overall alignment diversity.

Common Use Cases

Use Case 1: SNP Detection (Keep Variable)

# DNA: keep positions with >1 bit entropy (variable regions)
gapclean -i population.fa -o snps.fa --entropy-min 1.0

Goal: Find polymorphic sites for diversity analysis

Result: Removes conserved positions, keeps sites with variation

Use Case 2: Alignment Cleaning (Keep Conserved)

# Remove noisy/hypervariable columns for cleaner alignment
gapclean -i alignment.fa -o cleaned.fa --entropy-max 1.5

Goal: Clean alignment for visualization or downstream analysis

Result: Removes highly variable/noisy columns, keeps well-conserved positions

Use Case 3: Functional Region Analysis

# Protein: keep conserved functional regions
gapclean -i protein.fa -o conserved_functional.fa --entropy-max 2.0

Goal: Focus on conserved (likely functional) regions

Result: Removes variable regions, keeps conserved domains

Use Case 4: Hypervariable Region Detection

# Find hypervariable regions (e.g., antibody CDRs)
gapclean -i antibody.fa -o hypervariable.fa --entropy-min 2.5

Goal: Identify highly diverse regions

Result: Keeps only highly variable positions

Decision Guide

What do you want to keep?

I want to keep... Use... Example
Variable regions (SNPs, diversity) --entropy-min --entropy-min 1.0
Conserved regions (functional) --entropy-max --entropy-max 1.5
Moderately variable positions Both flags --entropy-min 0.5 --entropy-max 1.5

Entropy Reference Table

Entropy DNA/RNA Protein Character
0.0 - 0.5 Very conserved Very conserved Almost all same
0.5 - 1.0 Somewhat variable Conserved Some variation
1.0 - 1.5 Moderately variable Somewhat variable Good diversity
1.5 - 2.0 Highly variable Moderately variable High diversity
> 2.0 Maximum (4 bases) Highly variable Maximum diversity
> 4.0 N/A Maximum (20 AAs) Maximum diversity

Quick Examples

Remove Conserved, Keep Variable

# SNP detection (DNA)
gapclean -i population.fa -o snps.fa --entropy-min 1.0

# Antigenic variation (Protein)
gapclean -i antigen.fa -o variable_epitopes.fa --entropy-min 2.5

Remove Variable, Keep Conserved

# Alignment cleaning (DNA)
gapclean -i noisy_alignment.fa -o clean.fa --entropy-max 1.5

# Functional conservation (Protein)
gapclean -i protein.fa -o conserved_domains.fa --entropy-max 2.0

Keep Moderate Range

# Keep moderately informative positions (not too conserved, not too variable)
gapclean -i alignment.fa -o moderate.fa --entropy-min 0.8 --entropy-max 1.8

Note: When using both flags, columns are removed if they fall outside the range.

Tips

  • Check the "Average entropy" in output to understand your alignment
  • Start with middle values (1.0 for DNA, 2.0 for protein)
  • Higher threshold = keep fewer, more variable columns
  • Entropy considers ALL characters (including gaps)
  • Gaps count as a separate character class

Comparison: Entropy vs. Threshold Mode

Feature Entropy Mode Threshold Mode
What it measures Character diversity Gap percentage
Typical goal Find variable/conserved regions Remove gappy columns
Best for SNP analysis, conservation studies Alignment cleaning
Characters considered All characters Only gaps (- or .)
Speed Slower (more computation) Faster
Complexity Two flags (min/max) for flexibility Single threshold

Which Should I Use?

Use Threshold Mode (-t) when: - You want to clean up gappy alignments - Gaps are your primary concern - You need faster processing - Simple gap percentage is sufficient

Use Entropy Mode (--entropy-min/--entropy-max) when: - You're analyzing sequence diversity - You want to find SNPs or conserved regions - Character variation matters (not just gaps) - You need fine control over conservation levels

Performance Note

Entropy calculation is more computationally intensive than threshold mode, but GapClean's 2D chunking keeps it efficient even for large alignments.

Expected performance: ~10-20% slower than threshold mode.