Question

processing FASTA file

0

6.8 years ago

s_bio ▴ 10

Dear All,

I have a file with many sequences in FASTA format. All of these sequences are around 7.2-7.5 kb long. However, I want to retain the first 1000 nt and the last 1200 nt of each sequence and delete everything in between. I would appreciate it if anybody could guide me on how it's done.

Thx :-)

genome R • 2.7k views

ADD COMMENT • link updated 5.8 years ago by amartinez.ull • 0 • written 6.8 years ago by s_bio ▴ 10

1

Why did you tag R? Others have given some viable options, but you didn't post anything that you had tried using R (i.e. code snippet), or that you need to use R. This would be easy with Biopython.

ADD REPLY • link 6.8 years ago by st.ph.n ★ 2.7k

1

Example code to retain the first two and last two bases of each FASTA sequence in a single file:

My FASTA file (test1.fa):

>test
ACCTGATGT
>test2
TGATAGCTACTAGGGTGTCTATCG

code:

paste <(seqkit subseq -r  1:2 test1.fa | paste - -) <(seqkit subseq -r -2:-1 test1.fa | paste -  -  | cut -f 2) | awk '{print $1"\n"$2 $3}'

output:

>test
ACGT
>test2
TGCG

Download seqkit from here

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Another solution:

seqkit fx2tab testnt.fa | awk -F '\t' -v OFS='\t' '{print $1, substr($2,1,2) substr($2,length($2)-1)}' | seqkit tab2fx

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Dear Vinayjrao and Pierre,

Thank you so much guys for sharing the code lines. I was able to make the desired files with the help of your code lines.

Best!

s_bio

ADD REPLY • link 6.8 years ago by s_bio ▴ 10

1

If those answers have helped you then consider "upvoting" and "accepting" (use green check mark) to provide closure for this thread.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

Thanks for the useful answer. I have an extra problem, since some sequences are missing from my files. Starting from:

File1.fas
>seq1    acacaca
>seq2    acacacg
>seq3    acacact
File 2.fas
>seq1    GCGCGCG
>seq3    GCGCGCT

Is it possible to get the following using seqkit concat

File3.fas
>seq1    acacacaGCGCGCG
>seq2    acacacgNNNNNNN
>seq3    acacactGCGCGCT

I am using: $ seqkit concat File*.fas -o File3.fas, and I get a File3 whose sequences have different lengths, depending on the missing data. (Of course, I have more than two FASTA files, and there are missing sequences in all of them.)

Thanks!!!

ADD REPLY • link 5.8 years ago by amartinez.ull • 0

0

What is the source of the Ns in seq2? Padding? @OP

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

I forgot to specify that in my study each file corresponds to a different PCR-amplified genetic marker, and that seq1-3 correspond to different species. The source of the Ns in seq2 is the marker in File 2 that is missing for that species (e.g. we couldn't amplify it, or it is missing from the reference database).

ADD REPLY • link 5.8 years ago by amartinez.ull • 0

0

amartinez.ull : Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. This comment/question should have gone under @shenwei's answer.

ADD REPLY • link 5.8 years ago by GenoMax 141k

0

I will do so next time. It is my first post here, and I am still not very familiar with the rules. I apologize.

ADD REPLY • link 5.8 years ago by amartinez.ull • 0

0

@amartinez.ull: If the length of the sequences is fixed, you can follow the post "What is the fastest way to add 'Ns' to variable length sequences in a .fasta such that they have the same length". See the answer by Petr Ponomarenko and upvote the OP.

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Thanks! This seems like a potential solution, but as far as I can see it still requires that I manually add the names of the missing markers to "File 2" (following the names in my original example) and then use that function. There must be a quicker way to do it...
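
One way to avoid editing the files by hand is to join the records by ID in awk and fill each absent marker with Ns. The following is only a sketch under stated assumptions: `concat_markers` is a hypothetical helper name, each input file is assumed to hold exactly one marker in standard FASTA layout (header line, then sequence lines), the ID is taken to be the first word of the header, and a missing marker is filled to the length of the longest sequence seen in that file.

```shell
# concat_markers FILE...  (hypothetical helper, not part of seqkit)
# Joins records by ID across marker files; a marker that is absent for an ID
# is replaced by Ns as long as the longest sequence in that marker file.
concat_markers() {
  awk '
    FNR == 1 { f++ }                          # first line of a file: next marker
    /^>/ {                                    # header: remember ID and first-seen order
      id = substr($1, 2)
      if (!(id in seen)) { seen[id] = 1; order[++n] = id }
      next
    }
    {                                         # sequence line: join wrapped lines
      s[id, f] = s[id, f] $0
      if (length(s[id, f]) > width[f]) width[f] = length(s[id, f])
    }
    END {
      for (i = 1; i <= n; i++) {
        out = ""
        for (k = 1; k <= f; k++) {
          piece = s[order[i], k]              # empty string if the marker is missing
          while (length(piece) < width[k]) piece = piece "N"
          out = out piece
        }
        print ">" order[i]
        print out
      }
    }
  ' "$@"
}
```

It would be called as concat_markers File1.fas File2.fas > File3.fas. Note that a sequence shorter than the longest one in its own file is also right-padded with Ns, which may or may not be what an alignment-based analysis needs.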

ADD REPLY • link 5.8 years ago by amartinez.ull • 0

0

You don't have to fill in the missing records by hand. Concatenate the files using seqkit and then pipe the result through the script below, which reads from stdin. My concern is the case where the lengths are not fixed: then you would take the maximum sequence length, store it in a variable, and use it to pad the sequence for each ID.

$ seqkit concat a.fa b.fa --quiet | awk '$1~">"{print $0}$1!~">"{tmp="";for(i=1;i<15-length($0);i++){tmp=tmp"N"};print $0""tmp}' 

>seq1
acacacaGCGCGCG
>seq2
acacacgNNNNNNN
>seq3
acacactGCGCGCT

In this example (posted by the OP), every concatenated sequence has a fixed length of 14 bp (hence the bound of 15 in the loop), which makes padding easy. If the padding should instead follow the longest sequence, you need to find the longest sequence, store its length in a variable, and then apply the code above with that value.
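
The "find the longest sequence, then pad" variant can be sketched in plain awk. This is a rough illustration, not a tested pipeline: `pad_fasta` is a hypothetical helper name, it holds all records in memory, and like the command above it only appends Ns at the 3' end, so it does not know which marker was missing.

```shell
# pad_fasta [FILE]  (hypothetical helper; reads stdin when FILE is omitted)
# Right-pads every sequence with N up to the length of the longest sequence.
pad_fasta() {
  awk '
    /^>/ { n++; name[n] = $0; next }          # header: start a new record
         { seq[n] = seq[n] $0 }               # sequence line: join wrapped lines
    END {
      # first pass: length of the longest sequence
      for (i = 1; i <= n; i++)
        if (length(seq[i]) > max) max = length(seq[i])
      # second pass: append Ns until each sequence reaches that length
      for (i = 1; i <= n; i++) {
        pad = ""
        for (j = length(seq[i]); j < max; j++) pad = pad "N"
        print name[i]
        print seq[i] pad
      }
    }
  ' "${1:--}"
}
```

For example, the concatenated output could be piped straight into it: seqkit concat a.fa b.fa --quiet | pad_fasta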

ADD REPLY • link 5.8 years ago by cpad0112 21k

1

6.8 years ago

vinayjrao ▴ 250

If I understand your question correctly, you could try:

grep -v ">" file.fa | head -c 1000 > 1000.txt
tail -c 1200 file.fa > 1200.txt

ADD COMMENT • link 6.8 years ago by vinayjrao ▴ 250

1

I assume that header information is important:)

ADD REPLY • link 6.8 years ago by grant.hovhannisyan ★ 2.6k

0

Grant is right. These would work better:

head -c 1000 file.fa > 1000.txt

tail -c 1200 file.fa > 1200.txt

ADD REPLY • link 6.7 years ago by vinayjrao ▴ 250

0

6.7 years ago

shenwei356 8.4k

A simple solution using seqkit concate (concatenate sequences with same ID from multiple files) newly added in v0.7.0:

$ seqkit concate <(seqkit subseq -r 1:1000 seqs.fa) <(seqkit subseq -r -1200:-1 seqs.fa)

ADD COMMENT • link 6.7 years ago by shenwei356 8.4k

1

@shenwei356: Consider renaming this option concat.

ADD REPLY • link 6.7 years ago by GenoMax 141k

0

I'll fix it. Thank you, dear @genomax!

ADD REPLY • link 6.7 years ago by shenwei356 8.4k

score 1 · Accepted Answer · 2017-07-21

1

6.8 years ago

Pierre Lindenbaum 161k

Linearize, then extract the 5' and 3' parts:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{L=length($2);if(L<2200) {printf("%s\n%s\n",$1,$2);} else {printf("%s\n%s%s\n",$1,substr($2,1,1000),substr($2,L-1199,1200));}}'
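
The same linearize-and-trim idea can be wrapped in a small shell function with adjustable lengths. This is a minimal sketch: `trim_fasta` is a hypothetical helper name, and it assumes that records no longer than HEAD + TAIL should be printed unchanged.

```shell
# trim_fasta HEAD TAIL FILE  (hypothetical helper)
# Keeps the first HEAD and the last TAIL bases of every record in FILE.
trim_fasta() {
  local head_len=$1 tail_len=$2 fasta=$3
  awk -v h="$head_len" -v t="$tail_len" '
    function flush_record() {
      if (name == "") return                  # nothing read yet
      L = length(seq)
      # position L - t + 1 is the first of the last t bases
      if (L > h + t) seq = substr(seq, 1, h) substr(seq, L - t + 1)
      print name
      print seq
    }
    /^>/ { flush_record(); name = $0; seq = ""; next }   # header: emit previous record
         { seq = seq $0 }                                # sequence line: join wrapped lines
    END  { flush_record() }
  ' "$fasta"
}
```

For the original question this would be run as trim_fasta 1000 1200 input.fasta > trimmed.fasta. With HEAD and TAIL both set to 2 it turns ACCTGATGT into ACGT.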

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 161k