Question: extract single contig from fasta file based on name?
25 days ago, goatsrunfaster (20) wrote:

I would like to extract a single contig from a fasta file, and I have many fasta files and contigs to do this with. The fasta files have different names, and the contig name differs in each case. I know I can use seqtk with a list, but building a list for each assembly is a pain because there are so many, and I only want to pull one contig from each assembly. Does anyone know of an easy way to do this without having to make a separate one-contig list for each assembly? I just want to name the single contig in the code. Any help is appreciated!

assembly • 138 views
modified 22 days ago by harishk0201 (70) • written 25 days ago by goatsrunfaster (20)

Please provide input and output examples. Probably samtools faidx is the answer.

written 25 days ago by ATpoint (40k)

See this thread for some ideas: C: How do I extract Fasta Sequences based on a list of IDs?

written 25 days ago by genomax (91k)

One way is to linearise all the contigs so that each sequence sits on a single line (in case they are not already). The command below reads the fasta on standard input:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | grep "\S"

Then use grep -A1 with the contig name to grab the header line and the sequence that follows on the next line.
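The same idea can be sketched as a small shell function. This is only a sketch under some assumptions: the function name extract_contig is made up for illustration, and it swaps grep -A1 for an exact match inside awk on the header ID (the text after ">" up to the first whitespace), so that a name like contig1 does not also pull contig10.

```shell
# extract_contig FASTA NAME
# Hypothetical helper: print the record whose header ID (the text after '>'
# up to the first whitespace) equals NAME, with its sequence joined on one line.
extract_contig() {
    awk -v id="$2" '
        /^>/ {
            if (found) exit                  # the wanted record has ended
            split(substr($0, 2), h, " ")     # h[1] is the ID without ">"
            found = (h[1] == id)
            if (found) print                 # keep the full header line
            next
        }
        found { printf "%s", $0 }            # join wrapped sequence lines
        END   { if (found) printf "\n" }
    ' "$1"
}
```

Usage would then look like `extract_contig in1.fq contig00001 > out1.fq`, with the contig named directly on the command line. If the name is not present, nothing is printed.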

written 22 days ago by samuel.a.odonnell (60)

If I were able to extract a single contig by name with seqtk, it would look like this:

seqtk subseq in.fq contig00001 > out.fq

but I cannot do that because it actually requires a list, and must look like this:

seqtk subseq in.fq name.lst > out.fq

Given that I have hundreds of fasta files that each need a single contig extracted, making a list for each is a pain. Assuming subseq worked as in the first example, I would want something like:

seqtk subseq in1.fq contig00001 > out.fq

seqtk subseq in2.fq contig00004 > out.fq

seqtk subseq in3.fq contig00008 > out.fq

etc.

Make sense?

modified 24 days ago • written 24 days ago by goatsrunfaster (20)

No, it makes no sense, because you still have not provided any example data. We have no idea what your contig file looks like.

written 24 days ago by ATpoint (40k)

It's not a file, it's just the name of the contig. I am looking for a way to do this with just the name of a contig rather than a file; that is the whole point of this post.

written 24 days ago by goatsrunfaster (20)
22 days ago, harishk0201 (70) wrote:

The easiest way is shown below. Of course, as ATpoint points out, we don't know what your contig headers look like, so that may be an issue:

printf "contigid\n" | seqtk subseq contigs.fasta - > contigid.fasta
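To avoid retyping that pipeline for hundreds of assemblies, it could be wrapped in two small functions. Treat this as a rough sketch: the names subseq_one and subseq_pairs, the tab-separated table layout, and the output file naming are all assumptions for illustration. Only the `printf ... | seqtk subseq FILE -` call comes from the command above.

```shell
# subseq_one FASTA NAME
# Hypothetical wrapper: feeds a one-line name list to seqtk on stdin,
# so the contig can be named directly on the command line.
subseq_one() {
    printf '%s\n' "$2" | seqtk subseq "$1" -
}

# subseq_pairs TABLE
# TABLE is an assumed tab-separated file: fasta path, then contig name.
# Each contig is written to <fasta file name>.<contig name>.fasta in the
# current directory, so equal contig names from different assemblies
# do not overwrite each other.
subseq_pairs() {
    while IFS="$(printf '\t')" read -r fasta contig; do
        [ -n "$fasta" ] || continue          # skip blank lines
        subseq_one "$fasta" "$contig" > "$(basename "$fasta").$contig.fasta"
    done < "$1"
}
```

A single extraction is then `subseq_one in1.fq contig00001 > out1.fq`, which is the form asked for in the question. For many assemblies, one table covering all of them (one line per assembly) replaces the per-assembly list files.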

modified 22 days ago • written 22 days ago by harishk0201 (70)