Remove uncharacterized chromosomes from reference fasta file
3
0
Entering edit mode
3.3 years ago

Hi,

I have a fasta file from a species that just had a new reference genome published. I downloaded a version of this new reference genome from UCSC but a piece of downstream software that I use is giving me issues because there are uncharacterized chromosomes in the fasta file. I'm not sure why this is all of the sudden a problem because the old reference genome had uncharacterized chromosomes in the fasta file as well. So I know removing the uncharacterized chromosomes solves my problem but my test was a bit labor-intensive and I was hoping to find a more efficient way of solving this problem in the future. My solution consisted of using sed to remove the header and the first line after the header. The reason for doing this was because I recognized that pattern. So for example, line 101 would be a header and then I removed line 102 because the next header would be at line 103. Here is an example of what some of the uncharacterized chromosome headers look like and some example of normal chromosome headers

>chrM
>chrUn_JAAHUQ010000408v1
>chrUn_JAAHUQ010000409v1
>chrUn_JAAHUQ010000410v1
>chrUn_JAAHUQ010000411v1
....
>chrUn_MU018702v1
>chrUn_MU018703v1
>chrUn_MU018704v1
>chrUn_MU018705v1
>chrUn_MU018706v1
....
>chrX
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10

All the uncharacterized chromosomes have the same pattern of chrUn_JAAHUQ or chrUn_MU

What I was wondering was if there was some way to use grep or some other tool to remove these unwanted chromosomes an easier fashion than what I did

Assembly genome sequence next-gen • 4.2k views
ADD COMMENT
5
Entering edit mode
3.3 years ago
ATpoint 83k

samtools+grep:

samtools faidx your.fa
grep '>' your.fa | grep -v 'chrU' | sed 's/^>//' | xargs samtools faidx your.fa > filtered.fa
ADD COMMENT
3
Entering edit mode

That almost worked perfectly. The grep commands worked great but samtools gave me trouble. Samtools was giving me the following error

[W::fai_get_val] Reference >chr1 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in >chr1

I looked at the output file and the only line in the file looked like

>>chr1

I just assumed there was an issue the greater than signed in front of the chromosomes so I tried running the following command and it ended up working fine

grep '>' my.fa | grep -v 'chrU' | sed 's|[>,]||g' | xargs samtools faidx my.fa > my.filtered.fa
ADD REPLY
1
Entering edit mode

Ah sorry, stupid me, you have to indeed remove the leading >. Edited it, but you already sorted it out, +1.

ADD REPLY
0
Entering edit mode

No way! You got me 99% of the way there. I only had to fix the easy part. Thank you for the help!

ADD REPLY
6
Entering edit mode
3.3 years ago

I'll add a seqkit answer to the mix too.

seqkit grep -vrp "^chrUn" file.fa > cleaned.fa
ADD COMMENT
0
Entering edit mode

I will give this a try. I've read about the seqkit but haven't used it yet. Seems like a great tool.

ADD REPLY
2
Entering edit mode
3.3 years ago
GenoMax 143k

faSomeRecords utility from Jim Kent (LINK). After downloading the file add execute permissions (chmod u+x faSomeRecords). Use as follows

usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
ADD COMMENT

Login before adding your answer.

Traffic: 1975 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6