Parsing header of FASTA File
5.7 years ago
Shelle ▴ 30

I have many bacterial RefSeq FASTA files and want to parse their headers to check whether the word 'chromosome' appears in them. As a side note, there are sequences in the FASTA files that start with '>', so I want to parse all the lines starting with '>'. I know that some of my files do not contain the word 'chromosome' at all. I would like to separate the files that have 'chromosome' in a header from the rest of the files. Is there a way to do so?

Any help would be appreciated.

FASTA sequence header Parse • 1.9k views

If you want to get faster/better/more accurate answers it would really help if you show some examples of your data, and how these have to be "parsed".


I’m not sure what exactly the aim of filtering the genomes by the word 'chromosome' is.

To my knowledge, the word 'chromosome' in the header doesn’t tell you anything about that assembly specifically.


Assuming that the FASTA file is linearized (i.e., each sequence is on a single line after its header):

sed -n '/>/p' test.fa | grep -vw chromosome | grep --no-group-separator -F -f - -A 1 test.fa

should give you all the FASTA sequences without 'chromosome' in the header.

sed -n '/>/p' test.fa | grep -w chromosome | grep --no-group-separator -F -f - -A 1 test.fa

should give you all the FASTA sequences with 'chromosome' in the header.
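
If the file is not linearized, a rough alternative is to let awk decide at each header line which output file the whole record goes to. This is only a sketch: the demo input, the file names with_chr.fa and without_chr.fa, and the working directory are made up for illustration, and the match is a plain substring match on 'chromosome' rather than the whole-word match of grep -w.

```shell
# Work in a scratch directory so nothing existing is overwritten (assumption for the demo).
cd "$(mktemp -d)"

# Hypothetical two-record, multi-line FASTA used only for illustration.
cat > test.fa <<'EOF'
>seq1 Example organism chromosome 1
ACGT
TTGA
>seq2 Example organism plasmid pX
GGCC
CCGG
EOF

# At every header line, remember whether it mentions "chromosome";
# each line (header or sequence) is then written to the matching output file.
awk '
  /^>/ { keep = ($0 ~ /chromosome/) }
  { print > (keep ? "with_chr.fa" : "without_chr.fa") }
' test.fa
```

Because the decision is made once per header and reused for the following sequence lines, this does not depend on each sequence being on a single line.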

5.7 years ago
ATpoint 82k

You can do it with this one-liner:

grep '>chromosome' *.fasta | awk -F ":" '{print $1}' | tee haveChr.txt | diff /dev/stdin <(ls *.fasta) | awk -F "> " '{print $2}' | awk NF > haveNOChr.txt

By the way, fasta headers must start with '>'. Do you have some that do not start with it?


I tried this command and all the files went to haveNOChr.txt, which is not correct, as I have files with headers (first lines) like the following examples:

>NZ_LS483492.1 Serratia rubidaea strain NCTC10848 genome assembly, chromosome: 1
>NC_013791.2 Bacillus pseudofirmus OF4, complete genome
>NZ_CP016324.1 Vibrio cholerae 2740-80 chromosome 1, complete sequence

I have gone through the whole file from the second example and didn't see any line starting with '>' that includes 'chromosome'. I am not sure why this one-liner doesn't put at least this file into haveNOChr.txt.


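Ok, I see. The pattern '>chromosome' only matches headers where the word directly follows '>', whereas in your headers it comes after the accession and organism name. In this case, simply grep for 'chromosome' instead of '>chromosome':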

grep 'chromosome' *.fasta | awk -F ":" '{print $1}' | tee haveChr.txt | diff /dev/stdin <(ls *.fasta) | awk -F "> " '{print $2}' | awk NF > haveNOChr.txt
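
A possibly simpler variant, sketched here under the assumption that your grep supports -L (GNU and BSD grep do, but it is not part of POSIX): -l lists each file with at least one matching line exactly once, and -L lists the files with none. The pattern '^>.*chromosome' restricts the match to header lines. The two demo files and the scratch directory are hypothetical.

```shell
# Scratch directory and made-up demo files, for illustration only.
cd "$(mktemp -d)"
printf '>NZ_X.1 Example species chromosome 1, complete sequence\nACGT\n' > a.fasta
printf '>NC_Y.2 Example species, complete genome\nACGT\n' > b.fasta

# Files with at least one header line containing "chromosome".
grep -l '^>.*chromosome' *.fasta > haveChr.txt

# Files with no such header line.
grep -L '^>.*chromosome' *.fasta > haveNOChr.txt
```

Since each file name appears at most once in either list, no diff step is needed, and the two lists can then be used to move or copy the files into separate directories.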

Did it work for you @Shelle ?

