Filter contig fasta file by GC content
3.0 years ago

Hello!

I have a de novo contig file (contigs.fasta) assembled with SPAdes. From this file, I need to extract only the node sequences whose GC content is below a given threshold (e.g. all node sequences with GC content < 35%).

Do you have any suggestions for how to do this? I am currently using seqkit to report the GC content (%) of each node:

 seqkit fx2tab --name --only-id --gc contigs.fasta > results.txt

The problem is that this only reports the GC% of each node; it does not extract the node sequences I actually need.

Thank you very much in advance!

3.0 years ago
GenoMax 141k

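Try the following. It lists the IDs of contigs with GC below 35%, then pulls each matching record out of the assembly with seqkit grep: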

seqkit fx2tab --name --only-id --gc contigs.fasta | awk -F "\t" '{if ($2 < 35) print $1}' | xargs -n 1 sh -c 'seqkit grep --pattern "$0" contigs.fasta' > results.fa
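
The pipeline above runs seqkit grep once per matching ID, which can be slow on assemblies with many contigs. As an alternative, here is a minimal single-pass sketch in plain awk that does not need seqkit at all. The function name `filter_fasta_by_gc` is hypothetical, not part of any tool. It assumes GC% is computed as (G + C) / sequence length, so values may differ slightly from seqkit's for contigs containing N or other ambiguity codes. Sequences are written unwrapped, one line per record.

```shell
# Hypothetical helper (not a seqkit command): print FASTA records whose
# GC% is strictly below a threshold.
# Usage: filter_fasta_by_gc <max_gc_percent> <input.fasta>
filter_fasta_by_gc() {
    awk -v max_gc="$1" '
        # Emit the buffered record if its GC% is below max_gc.
        # Assumption: GC% = 100 * (G + C) / total sequence length.
        function flush_record(   n, gc, tmp) {
            n = length(seq)
            if (header == "" || n == 0) return
            tmp = seq
            gc = gsub(/[GCgc]/, "", tmp)   # gsub returns the number of matches replaced
            if (100 * gc / n < max_gc) {
                print header
                print seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { flush_record(); header = $0; seq = ""; next }
        # Sequence lines are concatenated (a trailing carriage return is dropped).
        { sub(/\r$/, ""); seq = seq $0 }
        END { flush_record() }
    ' "$2"
}
```

Usage would look like `filter_fasta_by_gc 35 contigs.fasta > results.fa`. It is worth spot-checking a few contigs against the seqkit fx2tab output before relying on the result.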