Question: processing FASTA file
4 weeks ago by
s_bio10
s_bio10 wrote:

Dear All,

I have a file with many sequences in FASTA format. All of these sequences are around 7.2-7.5 kb long. For each sequence, I want to retain the first 1000 nt and the last 1200 nt and delete everything in between. I would appreciate it if anybody could guide me on how this is done.

Thx :-)

R genome • 321 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by s_bio10
1

Why did you tag R? Others have given some viable options, but you didn't post anything that you had tried using R (i.e. code snippet), or that you need to use R. This would be easy with Biopython.

ADD REPLYlink written 4 weeks ago by st.ph.n1.2k
1

Example code to retain the first two and last two bases of each FASTA sequence in a single file:

My FASTA file (test1.fa):

>test
ACCTGATGT
>test2
TGATAGCTACTAGGGTGTCTATCG

code:

paste <(seqkit subseq -r  1:2 test1.fa | paste - -) <(seqkit subseq -r -2:-1 test1.fa | paste -  -  | cut -f 2) | awk '{print $1"\n"$2 $3}'

output:

>test
ACGT
>test2
TGCG

Download seqkit from here

ADD REPLYlink modified 20 days ago • written 20 days ago by cpad01121.7k

Another solution:

seqkit fx2tab testnt.fa | awk -F '\t' -v OFS='\t' '{print $1, substr($2,1,2) substr($2,length($2)-1)}' | seqkit tab2fx
ADD REPLYlink modified 8 days ago • written 8 days ago by cpad01121.7k

Dear Vinayjrao and Pierre,

Thank you so much guys for sharing the code lines. I was able to make the desired files with the help of your code lines.

Best!

s_bio

ADD REPLYlink written 4 weeks ago by s_bio10
1

If those answers have helped you then consider "upvoting" and "accepting" (use green check mark) to provide closure for this thread.

ADD REPLYlink written 4 weeks ago by genomax32k
4 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum96k wrote:

Linearize the FASTA, then extract the 5' and 3' parts:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{L=length($2);if(L<2200) {printf("%s\n%s\n",$1,$2);} else {printf("%s\n%s%s\n",$1,substr($2,1,1000),substr($2,L-1199,1200));}}'
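The same linearize-then-trim idea can be tried on toy data first. This is only a minimal sketch: the file name toy.fa, the helper name trim_fasta, and the small head/tail values (3 and 4 instead of 1000 and 1200) are made up for illustration. Because awk's substr() is 1-based, the last TAIL bases start at position L - TAIL + 1.

```shell
# Toy input: one record wrapped over two lines, one record too short to trim.
cat > toy.fa <<'EOF'
>seq1 demo
ACGTAC
GTTTGG
>seq2
ACGT
EOF

# trim_fasta FILE HEAD TAIL (hypothetical helper, not part of any tool):
# join the wrapped sequence lines of each record, then keep the first HEAD
# and last TAIL bases. Records shorter than HEAD+TAIL are printed unchanged.
trim_fasta() {
  awk -v head="$2" -v tail="$3" '
    function emit() {
      if (name == "") return
      L = length(seq)
      # substr() is 1-based, so the last "tail" bases start at L - tail + 1
      if (L >= head + tail) seq = substr(seq, 1, head) substr(seq, L - tail + 1)
      print name
      print seq
    }
    /^>/ { emit(); name = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$1"
}

trim_fasta toy.fa 3 4
```

For the real data the call would be along the lines of trim_fasta input.fasta 1000 1200 > trimmed.fasta.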
ADD COMMENTlink written 4 weeks ago by Pierre Lindenbaum96k
4 weeks ago by
vinayjrao60
vinayjrao60 wrote:

If I understand your question correctly, you could try:

grep -v ">" file.fa | head -c 1000 > 1000.txt
tail -c 1200 file.fa > 1200.txt

ADD COMMENTlink written 4 weeks ago by vinayjrao60
1

I assume that header information is important:)

ADD REPLYlink written 4 weeks ago by grant.hovhannisyan120

Grant is right. The following would work better:

head -c 1000 file.fa > 1000.txt
tail -c 1200 file.fa > 1200.txt
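One thing to keep in mind with head -c and tail -c: they count bytes over the whole file, header lines and newlines included, so they return the start of the first record and the end of the last record rather than trimming each sequence. A small sketch on a made-up two-record file (two.fa is an assumed name) shows this:

```shell
# Made-up two-record FASTA file for illustration.
printf '>a\nAAAA\n>b\nCCCC\n' > two.fa

# First 5 bytes of the file: the header ">a", a newline, and two bases.
first="$(head -c 5 two.fa)"

# Last 5 bytes of the file: the four bases of record b plus the final newline
# (the command substitution strips that trailing newline).
last="$(tail -c 5 two.fa)"

printf 'head: %s\n' "$first"
printf 'tail: %s\n' "$last"
```

With more than one sequence in the file, a per-record approach is needed instead.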

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by vinayjrao60
8 days ago by
shenwei3563.2k
China
shenwei3563.2k wrote:

A simple solution using seqkit concate (concatenate sequences with same ID from multiple files) newly added in v0.7.0:

$ seqkit concate <(seqkit subseq -r 1:1000 seqs.fa) <(seqkit subseq -r -1200:-1 seqs.fa)
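Whichever approach is used, it may be worth checking that every trimmed record has the expected length (2200 nt for 1000 + 1200). A minimal sketch with plain awk follows; trimmed.fa and the helper name seq_lengths are assumptions, and the toy records only show the output format.

```shell
# Toy stand-in for the trimmed output (assumed file name).
cat > trimmed.fa <<'EOF'
>seq1
ACGTAC
GT
>seq2
ACG
EOF

# seq_lengths FILE (hypothetical helper): print "name<TAB>length" for each
# record, summing over wrapped sequence lines.
seq_lengths() {
  awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
    { len += length($0) }
    END { if (name != "") print name "\t" len }
  ' "$1"
}

seq_lengths trimmed.fa
```

On real output, piping the result through something like awk -F '\t' '$2 != 2200' would list only the records that do not have the expected length.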
ADD COMMENTlink modified 8 days ago • written 8 days ago by shenwei3563.2k
1

@shenwei356: Consider renaming this option concat.

ADD REPLYlink written 8 days ago by genomax32k

I'll fix it. Thank you, dear @genomax!

ADD REPLYlink written 7 days ago by shenwei3563.2k