Question

Remove uncharacterized chromosomes from reference fasta file

0

Entering edit mode

3.1 years ago

williamsbrian5064 ▴ 510

Hi,

I have a fasta file from a species that just had a new reference genome published. I downloaded a version of this new reference genome from UCSC but a piece of downstream software that I use is giving me issues because there are uncharacterized chromosomes in the fasta file. I'm not sure why this is all of the sudden a problem because the old reference genome had uncharacterized chromosomes in the fasta file as well. So I know removing the uncharacterized chromosomes solves my problem but my test was a bit labor-intensive and I was hoping to find a more efficient way of solving this problem in the future. My solution consisted of using sed to remove the header and the first line after the header. The reason for doing this was because I recognized that pattern. So for example, line 101 would be a header and then I removed line 102 because the next header would be at line 103. Here is an example of what some of the uncharacterized chromosome headers look like and some example of normal chromosome headers

>chrM
>chrUn_JAAHUQ010000408v1
>chrUn_JAAHUQ010000409v1
>chrUn_JAAHUQ010000410v1
>chrUn_JAAHUQ010000411v1
....
>chrUn_MU018702v1
>chrUn_MU018703v1
>chrUn_MU018704v1
>chrUn_MU018705v1
>chrUn_MU018706v1
....
>chrX
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10

All the uncharacterized chromosomes have the same pattern of chrUn_JAAHUQ or chrUn_MU

What I was wondering was if there was some way to use grep or some other tool to remove these unwanted chromosomes an easier fashion than what I did

Assembly genome sequence next-gen • 4.0k views

ADD COMMENT • link updated 3.1 years ago by rpolicastro 13k • written 3.1 years ago by williamsbrian5064 ▴ 510

6

Entering edit mode

3.1 years ago

rpolicastro 13k

I'll add a seqkit answer to the mix too.

seqkit grep -vrp "^chrUn" file.fa > cleaned.fa

ADD COMMENT • link 3.1 years ago by rpolicastro 13k

0

Entering edit mode

I will give this a try. I've read about the seqkit but haven't used it yet. Seems like a great tool.

ADD REPLY • link 3.1 years ago by williamsbrian5064 ▴ 510

2

Entering edit mode

3.1 years ago

GenoMax 141k

faSomeRecords utility from Jim Kent (LINK). After downloading the file add execute permissions (chmod u+x faSomeRecords). Use as follows

usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

ADD COMMENT • link 3.1 years ago by GenoMax 141k

score 5 · Accepted Answer · 2021-03-09

5

Entering edit mode

3.1 years ago

ATpoint 81k

samtools+grep:

samtools faidx your.fa
grep '>' your.fa | grep -v 'chrU' | sed 's/^>//' | xargs samtools faidx your.fa > filtered.fa

ADD COMMENT • link 3.1 years ago by ATpoint 81k

3

Entering edit mode

That almost worked perfectly. The grep commands worked great but samtools gave me trouble. Samtools was giving me the following error

[W::fai_get_val] Reference >chr1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in >chr1

I looked at the output file and the only line in the file looked like

>>chr1

I just assumed there was an issue the greater than signed in front of the chromosomes so I tried running the following command and it ended up working fine

grep '>' my.fa | grep -v 'chrU' | sed 's|[>,]||g' | xargs samtools faidx my.fa > my.filtered.fa

ADD REPLY • link 3.1 years ago by williamsbrian5064 ▴ 510

1

Entering edit mode

Ah sorry, stupid me, you have to indeed remove the leading >. Edited it, but you already sorted it out, +1.

ADD REPLY • link 3.1 years ago by ATpoint 81k

0

Entering edit mode

No way! You got me 99% of the way there. I only had to fix the easy part. Thank you for the help!

ADD REPLY • link 3.1 years ago by williamsbrian5064 ▴ 510