Question: Subseq a FASTA file
6 weeks ago, Explorer (Canada) wrote:

I am trying to subset a FASTA file at specific nucleotide positions. For example:

    >random sequence 1
    tatgtgcgag
    >random sequence 2
    agggtgttat
    >random sequence 3
    tatgtgcgag
    >random sequence 4
    gactcgcggt
    >random sequence 5
    tatgtgcgag
    >random sequence 6
    gcagccatcg
    >random sequence 7
    gactcgcggt
    >random sequence 8
    tatgtgcgag
    >random sequence 9
    tatgtgcgag
    >random sequence 10
    tatgtgcgag

I am able to cut the sequences from position 3 to 6, but the IDs are missing from the output. I want to keep the same IDs as in the original file. Can anyone help me modify my code, please? Thanks

cat random.fasta |sed -n 2~2p |cut -c3-6 >out.fasta

    tgtg
    ggtg
    tgtg
    ctcg
    tgtg
    agcc
    ctcg
    tgtg
    tgtg
    tgtg
awk sed • 133 views
modified 6 weeks ago • written 6 weeks ago by Explorer

Extraction of nt bases from sequence

written 6 weeks ago by Pierre Lindenbaum

Thanks for sharing the link. I tried the command for multiline FASTA and it partially worked. I am extending the thread there.

written 6 weeks ago by Explorer

Are you sure you are using a multiline FASTA?

By 'multiline' we mean that the sequence is block-formatted and thus spans several lines under one header. It does not refer to having several entries in a single FASTA file.
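One rough way to tell the two layouts apart is to compare the number of header lines with the number of sequence lines. The sketch below is only an illustration: `check.fa` is a hypothetical demo file, and the check assumes the file contains no blank lines and every record has a sequence.

```shell
# Hypothetical demo file: the first record is wrapped over two lines.
cat > check.fa <<'EOF'
>seq1
tatgt
gcgag
>seq2
agggtgttat
EOF

# Count header lines (starting with ">") and all remaining lines.
headers=$(grep -c '^>' check.fa)
seq_lines=$(grep -vc '^>' check.fa)

# More sequence lines than headers means at least one record is wrapped.
if [ "$seq_lines" -gt "$headers" ]; then
    layout="multiline"
else
    layout="single-line"
fi
echo "$headers records, $seq_lines sequence lines: $layout"
```

If the two counts are equal, every record sits on exactly one sequence line.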

written 6 weeks ago by lieven.sterck

I am not sure about that. When I open the file in a text editor, the sequence appears in blocks of 60 nt each, while in SnapGene and ApE the layout follows the software settings. I got confused because the code suggested for multiline FASTA sort of worked.

In the original thread shared by Pierre Lindenbaum, the following code was suggested.

For single line FASTA file

awk '{if ($1 ~ />/) {print}else{print substr ($0, 0, 12)}}' file.fa

For multiline FASTA file

awk '/^>/{print;getline;print substr ($0, 0, 12)}' file.fa

When I tried the first command, no subsetting happened, while the second one worked up to position 1000.
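For reference, a minimal sketch of one way to keep the IDs and still cut positions 3 to 6, whether or not the sequences are wrapped. It is not the code from the linked thread: the function name `subseq_fasta` and the file `demo.fa` are made up for illustration. Two points it relies on: awk's `substr` counts positions from 1, and a `getline` after the header reads only the first sequence line of a record, so here all sequence lines of a record are joined before cutting.

```shell
# Hypothetical demo file: one wrapped record and one single-line record.
cat > demo.fa <<'EOF'
>random sequence 1
tatgt
gcgag
>random sequence 2
agggtgttat
EOF

# Usage: subseq_fasta START END FILE   (1-based, inclusive positions)
subseq_fasta() {
    awk -v start="$1" -v end="$2" '
        # Print the stored header and the requested slice of the joined sequence.
        function emit() {
            if (header != "") {
                print header
                print substr(seq, start, end - start + 1)
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record: flush the previous one
             { seq = seq $0 }                          # join wrapped sequence lines
        END  { emit() }                                # flush the last record
    ' "$3"
}

subseq_fasta 3 6 demo.fa > out.fasta
cat out.fasta
```

On the demo file this prints each header followed by `tgtg` and `ggtg`, matching the first two lines of the expected output in the question. The whole sequence of each record is held in memory, which may matter for very long sequences.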

modified 6 weeks ago • written 6 weeks ago by Explorer

6 weeks ago, lieven.sterck (VIB, Ghent, Belgium) wrote:

If it does not have to be plain bash scripting, you might be better off using a dedicated tool for this.

For instance, use SeqKit, more precisely its subseq command. Have a look at https://bioinf.shenwei.me/seqkit/usage/#subseq to see how to use it.

written 6 weeks ago by lieven.sterck

It seems SeqKit is more memory efficient. I will try it if bash does not work.

written 6 weeks ago by Explorer