Exclude specified ranges of bases from multiple sequences in a FASTA file
Hi, I am trying to remove ranges of bases from sequences in a FASTA file, at one or more places per sequence, based on the header IDs and positions that I specify.

For example, I have a file, A.fa:

>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA

>ID2
ATGGTCGTCCGTTGAATTGT**TACTCAAAAT**TGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT

>ID3
TCTGCA**TTCT**GTCCA**TTGTC**ATCTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA


I have another file (X.txt) that lists FASTA headers and the positions to remove, and looks like this:

  ID start end

ID2 20...30

ID3  6...10, 15...20


I would like to modify A.fa so that bases 20 to 30 are excluded from ID2, and bases 6 to 10 and 15 to 20 are excluded from ID3 (the bases marked with ** above), to create B.fa, which looks like this:

>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2
ATGGTCGTCCGTTGAATTGTTGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT

>ID3
TCTGCAGTCCAATCTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA


X.txt lists more than 100 IDs, each with different positions to remove from A.fa. Any help would be appreciated. Thank you very much.

8 months ago
nickp60 ▴ 40

I'd probably do something like the following, assuming you can convert the X.txt positions file you describe into a 1-feature-per-line bed file:

1) Index your sequences (taken from

samtools faidx A.fa


2) Make a BED file of the original sequences using the index:

awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS$2}' A.fa.fai > A.bed


3) Remove the bad regions from the original BED file (note the -v):

bedtools intersect -a A.bed -b X.txt -v > A.goodregions.bed


4) Pull out the good regions:

bedtools getfasta  -fi A.fa -bed A.goodregions.bed > A.goodregions.fa
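
The conversion assumed at the top of this answer (X.txt into a one-feature-per-line BED file) could be sketched with awk roughly as follows. This is only a sketch under assumptions: the output name X.bed is made up, X.txt is taken to have one header line followed by rows shaped like the two in the question, and the numbers are copied unchanged, because in the question's example "20...30" removes bases 21 to 30, which is what a BED start of 20 and end of 30 already describe.

```shell
# Recreate the X.txt rows shown in the question (header line, then
# "ID start...end[, start...end]"); a real file would be used instead.
cat > X.txt <<'EOF'
  ID start end
ID2 20...30
ID3  6...10, 15...20
EOF

# Emit one tab-separated BED line per range.
# Assumption: the coordinates are already 0-based start / exclusive end.
awk 'NR > 1 && NF {                          # skip the header and blank lines
    id = $1
    rest = $0
    sub(/^[ \t]*[^ \t]+[ \t]+/, "", rest)    # drop the ID column
    n = split(rest, ranges, /[ \t]*,[ \t]*/) # one entry per "start...end"
    for (i = 1; i <= n; i++) {
        split(ranges[i], se, /\.\.\./)
        printf "%s\t%d\t%d\n", id, se[1], se[2]
    }
}' X.txt > X.bed
```

If the positions in a real X.txt turn out to be 1-based and inclusive instead, the start would need to be reduced by one before writing the BED line.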

Hi, thank you for your response. I created the BED file for X.txt, and it looks like this:

NODE_1138     1535     4521
NODE_11674     1119    2587
NODE_11674     3000    3043
NODE_120      60144   62167


When I run step 3 from your answer, it excludes every node listed in X.txt from A.bed entirely, not just the regions (start to end) given in the file.

Could you please let me know if there is a workaround for this?

I figured out a way to do it. Using 'subtract' instead of 'intersect' in step 3 works for this problem: it removes only the listed regions rather than the whole sequence.
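
One thing to keep in mind with the bedtools route: getfasta writes one record per kept interval, so a sequence with regions cut out of the middle comes back as several pieces that still have to be joined to get a B.fa like the one in the question. As an alternative, here is a minimal awk sketch (not what either poster ran) that deletes the intervals from each sequence directly. It assumes a whitespace-separated BED file with 0-based starts and exclusive ends, that the ID is the first word of each FASTA header, and that sequences never contain '#'. The file name X.bed and the toy data are made up for illustration.

```shell
# The bedtools fix from the comment above would be, in step 3:
#   bedtools subtract -a A.bed -b X.bed > A.goodregions.bed

# Toy input (hypothetical data, not the sequences from the question).
cat > A.fa <<'EOF'
>ID1
ACGTACGTAC
>ID2
AAAAACCCCCGGGGG
>ID3
AACCGGTT
AACCGGTT
EOF
printf 'ID2\t5\t10\nID3\t2\t4\nID3\t6\t8\n' > X.bed

awk '
# Overwrite the interval [start, stop) with "#" so later intervals keep
# their original coordinates, even if intervals overlap.
function mask(seq, start, stop,    pad, i) {
    if (stop > length(seq)) stop = length(seq)
    if (start >= stop) return seq
    pad = ""
    for (i = start; i < stop; i++) pad = pad "#"
    return substr(seq, 1, start) pad substr(seq, stop + 1)
}
# Print the buffered record with every masked base deleted.
function emit(    i) {
    if (id == "") return
    for (i = 1; i <= n[id]; i++) seq = mask(seq, s[id, i], e[id, i])
    gsub(/#/, "", seq)
    print ">" header
    print seq
}
# First file (BED): remember the intervals to remove for each ID.
NR == FNR { if (NF >= 3) { k = ++n[$1]; s[$1, k] = $2 + 0; e[$1, k] = $3 + 0 } next }
# Second file (FASTA): buffer each record, joining wrapped sequence lines.
/^>/ { emit(); header = substr($0, 2); id = header; sub(/[ \t].*/, "", id); seq = ""; next }
{ gsub(/[ \t\r]/, ""); seq = seq $0 }
END { emit() }
' X.bed A.fa > B.fa

cat B.fa
```

Sequences with no entry in X.bed pass through unchanged, and each record is written on a single line, so re-wrap the output afterwards if a fixed line width matters.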