Question: Parsing header of FASTA File
Shelle0 wrote:

I have many bacterial RefSeq FASTA files and want to parse their headers to see whether any of them contain the word 'chromosome'. As a side note, there are sequences in the FASTA files that start with '>', so I want to parse all the lines starting with '>'. I know I have files that do not contain the word 'chromosome'. I would like to separate the files that have 'chromosome' in a header from the rest of the files. Is there a way to do so?

Any help would be appreciated.

parse header sequence fasta • 246 views
modified 3 months ago by ATpoint11k • written 3 months ago by Shelle0

If you want to get faster/better/more accurate answers it would really help if you show some examples of your data, and how these have to be "parsed".

written 3 months ago by WouterDeCoster35k

I’m not sure what exactly the aim of filtering the genomes by the word 'chromosome' is.

To my knowledge, the word 'chromosome' in the header doesn’t tell you anything about that assembly specifically.

written 3 months ago by jrj.healey9.1k

Assuming that the FASTA file is linearized (i.e., each sequence is on a single line after its header):

sed -n '/>/p' test.fa | grep -vw chromosome | grep --no-group-separator -f - -A 1 test.fa

should give you all the FASTA sequences without 'chromosome' in the header.

sed -n '/>/p' test.fa | grep -w chromosome | grep --no-group-separator -f - -A 1 test.fa

should give you all the FASTA sequences with 'chromosome' in the header.
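
The commands above rely on the file being linearized. If it is not, a minimal sketch like the following could join each record's sequence lines first. The function name linearize_fasta is made up for illustration, and the sketch assumes plain Unix line endings.

```shell
# Hypothetical helper (not from the thread): print a FASTA file with each
# record's sequence joined onto a single line after its header.
linearize_fasta() {
    awk '
        # Header line: flush the previous sequence (if any), then print the header.
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
        # Sequence line: append it to the current record.
        { seq = seq $0 }
        # End of file: flush the last sequence.
        END { if (seq != "") print seq }
    ' "$1"
}
```

It could be used as: linearize_fasta test.fa > test.linear.fa, and the grep commands would then be run on test.linear.fa.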

modified 12 weeks ago • written 12 weeks ago by cpad011210k
ATpoint11k wrote:

You can do it with this one-liner:

grep '>chromosome' *.fasta | awk -F ":" '{print $1}' | tee haveChr.txt | diff /dev/stdin <(ls *.fasta) | awk -F "> " '{print $2}' | awk NF > haveNOChr.txt

By the way, fasta headers must start with '>'. Do you have some that do not start with it?

modified 3 months ago • written 3 months ago by ATpoint11k

I tried this command and all the files go to haveNOChr.txt, which is not correct, as I have files with headers (first line) like those below. A few examples are as follows:

>NZ_LS483492.1 Serratia rubidaea strain NCTC10848 genome assembly, chromosome: 1
>NC_013791.2 Bacillus pseudofirmus OF4, complete genome  
>NZ_CP016324.1 Vibrio cholerae 2740-80 chromosome 1, complete sequence

I have gone through the whole file in the second example and I didn't see any line starting with '>' that includes 'chromosome'. I am not sure why this one-liner doesn't separate at least this file into haveNOChr.txt.

modified 3 months ago • written 3 months ago by Shelle0

OK, I see. In this case, simply grep for 'chromosome' instead of '>chromosome':

grep 'chromosome' *.fasta | awk -F ":" '{print $1}' | tee haveChr.txt | diff /dev/stdin <(ls *.fasta) | awk -F "> " '{print $2}' | awk NF > haveNOChr.txt
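
As a possible alternative to the pipeline above, here is a minimal sketch that lets grep report file names directly. It assumes a grep that supports -L (GNU and BSD grep do; it is not required by POSIX). The function name split_by_header_word is made up for illustration, and the output file names follow the ones used in this thread.

```shell
# Hypothetical helper (not from the thread): write the names of files that
# have the given word in at least one header line to haveChr.txt, and the
# names of all other files to haveNOChr.txt.
split_by_header_word() {
    # $1: word to look for; remaining arguments: FASTA files
    word=$1
    shift
    # -l prints each file with a match once; ^> restricts matches to header lines.
    grep -l "^>.*$word" "$@" > haveChr.txt || true
    # -L prints each file that has no matching line at all.
    grep -L "^>.*$word" "$@" > haveNOChr.txt || true
}
```

It could be called as: split_by_header_word chromosome *.fasta. Because each file name is printed only once, files with several 'chromosome' headers do not produce duplicate entries.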
written 3 months ago by ATpoint11k

Did it work for you, @Shelle?

written 12 weeks ago by ATpoint11k