Question: Can't load reference sequence from file 'GRCh37.fa': Unexpected character 'M' found.
8 days ago, Kai_Qi (Chicago, IL) wrote:

Hi all:

I am trying to run an analysis using GRCh37.fa as the reference genome. After running the command

pureclip -i aligned.f.duplRm.pooled.R2.bam -bai aligned.f.duplRm.pooled.R2.bam.bai -g GRCh37.fa -iv "1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;" -nt 10 -o PureCLIP.crosslink_sites.bed

I received an error:

ERROR: Can't load reference sequence from file 'GRCh37.fa': Unexpected character 'M' found.

The developer gave me this advice:

The problem is coming from an external library which is used and which expects the reference sequence to contain only the letters 'A', 'C', 'G', 'T' or 'N'. I know it is not ideal, but if you convert all non-ACGTs to Ns, the problem should be solved.

Can anyone show me how to convert all non-ACGT characters to Ns so that I can give it a try?
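One way to do the conversion the developer describes is a short streaming script. The following is only a minimal sketch in plain Python, not PureCLIP's own tooling: the function names `mask_line` and `mask_fasta` are made up for illustration, and leaving lowercase (soft-masked) a/c/g/t/n untouched is an assumption, since the thread does not say whether the library accepts lowercase bases.

```python
import re

# Assumption: lowercase a/c/g/t/n are acceptable and are left as they are.
# Every other character on a sequence line (M, R, Y, ...) becomes 'N'.
NON_ACGTN = re.compile(r"[^ACGTNacgtn]")


def mask_line(line: str) -> str:
    """Return one FASTA line with non-ACGTN sequence characters set to 'N'."""
    if line.startswith(">"):
        return line  # header lines are copied unchanged
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending
    return NON_ACGTN.sub("N", body) + ending


def mask_fasta(in_path: str, out_path: str) -> None:
    """Stream in_path to out_path one line at a time, masking as we go."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(mask_line(line))
```

Calling `mask_fasta("GRCh37.fa", "GRCh37.masked.fa")` (the output name is hypothetical) would write a masked copy and leave the original file untouched. Because the file is processed line by line, memory use stays small even for a whole genome.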


Tags: sequencing, rna-seq, next-gen

It is indeed strange that your reference contains the letter M. As a first step, I would double-check that the reference contains nucleotides and not amino acids. Once you are sure that this is the case, you can use pyfasta or some other tool, depending on which programming language you are proficient in.
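The nucleotide-versus-amino-acid check suggested here can be sketched by tallying the characters on the sequence lines. This is a rough illustration in plain Python, with hypothetical function names, and it does not use pyfasta. Note that M is itself an IUPAC nucleotide ambiguity code (A or C), so an M alone does not mean the file holds protein; letters such as E, P or Q would.

```python
from collections import Counter
from typing import Iterable

# IUPAC nucleotide codes, including the ambiguity codes (M = A or C).
IUPAC_NUCLEOTIDES = set("ACGTUNRYSWKMBDHV")


def count_sequence_characters(lines: Iterable[str]) -> Counter:
    """Count every character on non-header FASTA lines, ignoring case."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # skip header lines such as ">1" or ">chrM"
        counts.update(line.strip().upper())
    return counts


def non_nucleotide_characters(counts: Counter) -> set:
    """Return the observed characters that are not IUPAC nucleotide codes."""
    return set(counts) - IUPAC_NUCLEOTIDES
```

An open file handle can be passed directly, for example `with open("GRCh37.fa") as fh: counts = count_sequence_characters(fh)`. If `non_nucleotide_characters(counts)` comes back empty, the file is consistent with a nucleotide reference, and `counts` shows how many ambiguity codes would be turned into N.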

written 7 days ago by Fabio Marroni

It is a reference genome. Thank you for your comment. I used a different reference genome and it generated a BED file. However, I could not see any peaks when I loaded the BED file into IGV. I guess I need to ask around.


written 7 days ago by Kai_Qi