Question: processing FASTA file
s_bio10 wrote:

Dear All,

I have a file with many sequences in FASTA format. All of these sequences are around 7.2-7.5 kb long. I want to retain the first 1000 nt and the last 1200 nt of each sequence and delete all the nucleotides in between. I would appreciate it if anybody could guide me on how it's done.

Thx :-)

R genome • 380 views
modified 3 months ago • written 3 months ago by s_bio10

Why did you tag R? Others have given some viable options, but you didn't post anything that you had tried using R (i.e. code snippet), or that you need to use R. This would be easy with Biopython.

written 3 months ago by st.ph.n1.8k

Example code to retain the first two and last two bases of each FASTA sequence in a single file:

My FASTA file (test1.fa):

>test
ACCTGATGT
>test2
TGATAGCTACTAGGGTGTCTATCG

code:

paste <(seqkit subseq -r  1:2 test1.fa | paste - -) <(seqkit subseq -r -2:-1 test1.fa | paste -  -  | cut -f 2) | awk '{print $1"\n"$2 $3}'

output:

>test
ACGT
>test2
TGCG

Download seqkit from here

modified 11 weeks ago • written 12 weeks ago by cpad01122.3k

Another solution:

seqkit fx2tab testnt.fa | awk -F '\t' '{print $1 "\t" substr($2,1,2) substr($2,length($2)-1)}' | seqkit tab2fx
modified 10 weeks ago • written 10 weeks ago by cpad01122.3k

Dear Vinayjrao and Pierre,

Thank you so much guys for sharing the code lines. I was able to make the desired files with the help of your code lines.

Best!

s_bio

written 3 months ago by s_bio10

If those answers have helped you then consider "upvoting" and "accepting" (use green check mark) to provide closure for this thread.

written 3 months ago by genomax34k
3 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum99k wrote:

Linearize the FASTA file (one record per line), then extract the 5' and 3' parts:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{L=length($2);if(L<2200) {printf("%s\n%s\n",$1,$2);} else {printf("%s\n%s%s\n",$1,substr($2,1,1000),substr($2,L-1199,1200));}}'
written 3 months ago by Pierre Lindenbaum99k
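
For wrapped (multi-line) FASTA files, the linearize-and-trim idea can also be written as a single awk program. This is only a sketch under some assumptions: trim_ends is a hypothetical helper name, toy.fa is a made-up file that reuses the toy sequences from the seqkit example in this thread, and records no longer than HEAD + TAIL bases are printed unchanged.

```shell
# Toy input: the two test sequences posted earlier in this thread.
printf '>test\nACCTGATGT\n>test2\nTGATAGCTACTAGGGTGTCTATCG\n' > toy.fa

# trim_ends HEAD TAIL FILE
# Keep the first HEAD and last TAIL bases of every record in FILE.
# (trim_ends is a hypothetical helper, not part of seqkit or any other tool.)
trim_ends() {
  awk -v head="$1" -v tail="$2" '
    # Print the record collected so far, trimmed if it is long enough.
    function flush_record(   len) {
      if (name == "") return
      len = length(seq)
      print name
      if (len <= head + tail) print seq
      else print substr(seq, 1, head) substr(seq, len - tail + 1)
    }
    # Header line: emit the previous record and start a new one.
    /^>/ { flush_record(); name = $0; seq = ""; next }
    # Sequence line: append, so wrapped sequences are joined.
    { seq = seq $0 }
    END { flush_record() }
  ' "$3"
}

trim_ends 2 2 toy.fa
# >test
# ACGT
# >test2
# TGCG
```

For the sequences in the question, the call would be trim_ends 1000 1200 input.fasta > trimmed.fasta (both file names are placeholders).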
3 months ago by
vinayjrao70
vinayjrao70 wrote:

If I understand your question correctly, you could try:

grep -v ">" file.fa | head -c 1000 > 1000.txt
tail -c 1200 file.fa > 1200.txt

written 3 months ago by vinayjrao70

I assume that header information is important:)

written 3 months ago by grant.hovhannisyan260

Grant is right. These would work better:

head -c 1000 file.fa > 1000.txt

tail -c 1200 file.fa > 1200.txt

modified 3 months ago • written 3 months ago by vinayjrao70
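
A quick way to see what head -c and tail -c actually return is to run them on a tiny file. This is a sketch that assumes a made-up file, toy2.fa, containing the toy sequences from the seqkit example in this thread. Both commands count bytes over the whole file, so header characters and newlines are included in the count and nothing is computed per sequence.

```shell
# Toy input: two short records (file name is made up for this illustration).
printf '>test\nACCTGATGT\n>test2\nTGATAGCTACTAGGGTGTCTATCG\n' > toy2.fa

# First 8 bytes of the file: ">test" plus its newline use 6 bytes,
# so only 2 bases of the first record are left.
first=$(head -c 8 toy2.fa)
printf '%s\n' "$first"

# Last 5 bytes of the file: 4 bases of the final record plus the trailing newline
# (the command substitution drops that newline).
last=$(tail -c 5 toy2.fa)
printf '%s\n' "$last"
```

On a file with many records this returns the start of the first record and the end of the last one, so trimming every sequence needs a per-record approach.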
10 weeks ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

A simple solution using seqkit concate (which concatenates sequences with the same ID from multiple files), newly added in v0.7.0:

$ seqkit concate <(seqkit subseq -r 1:1000 seqs.fa) <(seqkit subseq -r -1200:-1 seqs.fa)
modified 10 weeks ago • written 10 weeks ago by shenwei3563.4k

@shenwei356: Consider renaming this option concat.

written 10 weeks ago by genomax34k

I'll fix it. Thank you, @genomax!

written 10 weeks ago by shenwei3563.4k