Question: processing FASTA file
s_bio10 wrote:

Dear All,

I have a file with many sequences in FASTA format, each about 7.2-7.5 kb long. I want to keep the first 1000 nt and the last 1200 nt of each sequence and delete everything in between. I would appreciate it if anybody could guide me on how it's done.

Thx :-)

R genome • 446 views
ADD COMMENTlink modified 6 months ago • written 6 months ago by s_bio10

Why did you tag R? Others have given some viable options, but you didn't post anything you had tried in R (e.g. a code snippet) or say that you need to use R. This would be easy with Biopython.

ADD REPLYlink written 6 months ago by st.ph.n2.0k

Example code to retain the first two and last two bases of each FASTA sequence in a single file:

My FASTA file (test1.fa):

>test
ACCTGATGT
>test2
TGATAGCTACTAGGGTGTCTATCG

code:

paste <(seqkit subseq -r  1:2 test1.fa | paste - -) <(seqkit subseq -r -2:-1 test1.fa | paste -  -  | cut -f 2) | awk '{print $1"\n"$2 $3}'

output:

>test
ACGT
>test2
TGCG

Download seqkit from here

ADD REPLYlink modified 5 months ago • written 5 months ago by cpad01124.1k

Another solution:

seqkit fx2tab testnt.fa | awk -F '\t' -v OFS='\t' '{print $1, substr($2, 1, 2) substr($2, length($2) - 1)}' | seqkit tab2fx
ADD REPLYlink modified 5 months ago • written 5 months ago by cpad01124.1k

Dear Vinayjrao and Pierre,

Thank you so much guys for sharing the code lines. I was able to make the desired files with the help of your code lines.

Best!

s_bio

ADD REPLYlink written 6 months ago by s_bio10

If those answers have helped you then consider "upvoting" and "accepting" (use green check mark) to provide closure for this thread.

ADD REPLYlink written 6 months ago by genomax40k
Pierre Lindenbaum103k wrote:

Linearize each record, then extract the 5' and 3' parts:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{L=length($2);if(L<=2200) {printf("%s\n%s\n",$1,$2);} else {printf("%s\n%s%s\n",$1,substr($2,1,1000),substr($2,L-1199,1200));}}'
ADD COMMENTlink written 6 months ago by Pierre Lindenbaum103k
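The linearize-and-trim idea above can also be wrapped in a small shell function so the two lengths are not hard-coded. This is only a sketch: the function name `trim_fasta` and the awk variables `nhead` and `ntail` are made up for illustration, and records no longer than `nhead + ntail` are assumed to be passed through unchanged. The last `ntail` residues start at position `len - ntail + 1`, because awk's `substr` is 1-indexed.

```shell
# Hypothetical helper: keep the first NHEAD and last NTAIL residues of every record.
# Usage: trim_fasta NHEAD NTAIL input.fasta
trim_fasta() {
  awk -v nhead="$1" -v ntail="$2" '
    # Emit the record collected so far, trimmed if it is longer than nhead + ntail.
    function flush_record(   len) {
      if (name == "") return
      len = length(seq)
      if (len > nhead + ntail)
        seq = substr(seq, 1, nhead) substr(seq, len - ntail + 1)
      print name
      print seq
    }
    /^>/ { flush_record(); name = $0; seq = ""; next }  # header line: emit the previous record
         { seq = seq $0 }                               # sequence line: join wrapped lines
    END  { flush_record() }
  ' "$3"
}

# For the question in this thread:
# trim_fasta 1000 1200 input.fasta > trimmed.fasta
```

Run as `trim_fasta 2 2` on the two-record test file shown earlier in this thread, it should give ACGT and TGCG, the same result as the seqkit example.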
vinayjrao80 wrote:

If I understand your question correctly, you could try:

grep -v ">" file.fa | head -c 1000 > 1000.txt
tail -c 1200 file.fa > 1200.txt

ADD COMMENTlink written 6 months ago by vinayjrao80

I assume that header information is important:)

ADD REPLYlink written 6 months ago by grant.hovhannisyan340

Grant is right. The following would work better:

file.fa | head -c 1000 > 1000.txt
tail -c 1200 file.fa > 1200.txt

ADD REPLYlink modified 5 months ago • written 5 months ago by vinayjrao80
shenwei3563.4k wrote:

A simple solution using seqkit concate (which concatenates sequences with the same ID from multiple files), newly added in v0.7.0:

$ seqkit concate <(seqkit subseq -r 1:1000 seqs.fa) <(seqkit subseq -r -1200:-1 seqs.fa)
ADD COMMENTlink modified 5 months ago • written 5 months ago by shenwei3563.4k

@shenwei356: Consider renaming this option concat.

ADD REPLYlink written 5 months ago by genomax40k

I'll fix it. Thank you, dear @genomax!

ADD REPLYlink written 5 months ago by shenwei3563.4k
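Whichever approach from this thread is used, it may be worth checking that every trimmed sequence has the expected length (1000 + 1200 = 2200 nt). A minimal sketch, assuming plain awk is available; the function name `fasta_lengths` and the file name `trimmed.fasta` are hypothetical:

```shell
# Hypothetical helper: print "header<TAB>length" for every record in a FASTA file.
fasta_lengths() {
  awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
         { len += length($0) }   # add up wrapped sequence lines
    END  { if (name != "") print name "\t" len }
  ' "$1"
}

# List any record whose length is not exactly 2200 nt:
# fasta_lengths trimmed.fasta | awk -F '\t' '$2 != 2200'
```

Records that were shorter than 2200 nt to begin with will show up in that list if the trimming step left them untouched.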