Extraction of first sequences from a big fasta file
6
4
Entering edit mode
7.6 years ago
vahapel ▴ 210

Dear All,

I would like to ask a question regarding extraction of 100000 sequences in a big fasta file. In the forum, there is a bunch of script handling the sequence extraction based on ID number, but I could not find a script for such a purpose. Basically, is there any script or bash command for extraction first and/or last 100000 sequences in a fasta file?

sequence • 13k views
0
Entering edit mode

If not, you could write one easily enough with biopython or bioperl...

Well, the first X records is easier than the last X records, but still.

0
Entering edit mode
0
Entering edit mode

That is not an answer for the question that was originally asked.

6
Entering edit mode
7.6 years ago

This strategy is based on standard nix tools. Get the *first two sequences:

awk -v RS='>' 'NR>1 { gsub("\n", ";", $0); sub(";$", "", $0); print ">"$0 }' seq.fa \
| tr ',' '\n'


It assumes the semicolon ; doesn't occur in the sequence names.

Replace head with tail to get the last sequences.

0
Entering edit mode

Hi, dariober thank you so much for this script. It works well

3
Entering edit mode
7.6 years ago

For the first 100k sequences:

reformat.sh in=data.fasta out=100k.fasta reads=100000


You can also get a random 100k sequences with Reformat, but not the last 100k.

0
Entering edit mode

Hi Brian, thank you for your help and introducing "bbmap" tools for me, it makes my project more easy

0
Entering edit mode

You're welcome!

2
Entering edit mode
7.6 years ago
5heikki 10k

Assuming no linebreaks in sequences, i.e every record is exactly two lines. I believe for fastq files the multiplier would be 4:

head -n 200000 input > output

0
Entering edit mode

thank you, 5heikki for your help, it is very simple and useful command.

0
Entering edit mode

likewise, use tail for the last N sequences.

2
Entering edit mode
3.3 years ago
AK ★ 2.1k

Alternatives: seqkit head and seqkit range:

seqkit head -n 100000 input.fa
seqkit range -r 1:100000 input.fa


(2) Last 100000 records:

seqkit range -r -100000:-1 input.fa


(3) Other ranges:

seqkit range -r 100001:200000 input.fa

0
Entering edit mode
4.9 years ago
REQUESTED_LINES=10
awk "/^>/ {n++} n>\$REQUESTED_LINES {exit} {print}" input.fasta > output.fasta

0
Entering edit mode
3.3 years ago
psschlogl ▴ 50
hdl = gzip.open(file, 'rt')
records = SeqIO.parse(hdl, 'fastq')

1
Entering edit mode

Though your answer shows some potential it does not really answer the question posted. If you can provide a more complete/correct answer we could keep it in the answers thread. If not, we consider removing it.