bedtools getfasta duplicated fasta
23 months ago
talbots ▴ 30

Hi all,

I'm having a weird issue and I'm not sure why it is happening. I have a reference genome and a corresponding .gff file of annotated genes for that reference. I'm trying to pull all the genes from the reference using the gff file. I've used bedtools getfasta in the past, with the command:

bedtools getfasta -fi ref.fa -bed annot.gff -fo output.fa

My current output has duplicates: every >header and the nt sequence that follows it appears twice, giving me a doubled fasta file.

I assume there is something wrong with my gff file and that it is in an incorrect format (gff2.5/gff3). Is there a quick fix I can apply to the fasta file? The uniq command doesn't seem to help here. Does anyone have any thoughts on this?

EDIT: I decided to use seqkit rmdup, but this doesn't explain how I got the duplicated file in the first place.
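For anyone wondering why plain uniq does not help: uniq only drops adjacent identical lines, and in a doubled fasta the lines alternate (header, sequence, header, sequence), so no two neighbouring lines match. A minimal awk sketch that deduplicates whole records instead is below. It assumes each record is one >header line followed by a single unwrapped sequence line; the function name dedup_fasta and the output file name are made up for illustration.

```shell
# Keep the first occurrence of each (header, sequence) pair.
# Assumption: two-line records, one ">" header line then one sequence line.
dedup_fasta() {
    awk '
        /^>/ { header = $0; next }      # remember the current header line
        {
            key = header "\n" $0        # record identity = header + sequence
            if (!(key in seen)) {       # print only the first copy of a record
                seen[key] = 1
                print header
                print $0
            }
        }
    ' "$1"
}

# Example (hypothetical output name):
#   dedup_fasta output.fa > output.dedup.fa
```

This only cleans up the fasta after the fact; it does not address why the records were duplicated.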

EDITEDIT: I figured it out. getfasta seems to be extracting two entries per gene from the gff, the CDS and the geneid, which share the same name/coords, so every record comes out twice. The solution is to extract only the GeneID lines from the gff, then run getfasta on that filtered gff. Another possible solution is to obtain a proper bed file, which I'm in the process of doing. It has been recommended to me to use agat, since according to this table: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gff_to_gtf.md , agat conserves the most data. If anyone has any thoughts on this, feel free to add.
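A rough sketch of the filtering step, in case it helps someone else. It keeps only the gff rows whose third column (the feature type) matches a given value, so getfasta sees one row per gene. The function name filter_gff_by_type, the file name genes.gff, and the feature type "gene" are assumptions; check which types your own file uses with something like cut -f3 annot.gff | sort | uniq -c.

```shell
# Print only the gff rows whose 3rd tab-separated column equals the given type.
# Header/comment lines normally have no third column, so they are dropped too.
filter_gff_by_type() {
    awk -F '\t' -v type="$2" '$3 == type' "$1"
}

# Example (not run here; "gene" and genes.gff are assumed names):
#   filter_gff_by_type annot.gff gene > genes.gff
#   bedtools getfasta -fi ref.fa -bed genes.gff -fo output.fa
```

Filtering on the feature type avoids the duplicates at the source, rather than removing them from the fasta afterwards.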

Thanks, Sam

fastaFromBed duplicates getfasta • 748 views
23 months ago
Juke34 8.5k

You can also try extracting the fasta sequences with AGAT; see here for details: https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html
