Question

NNNNNN in genomic file

0

2.9 years ago

greyman ▴ 190

I downloaded a genome file and a GFF file from the NCBI FTP site. While extracting the promoter sequences upstream of my genes of interest, I found a few sequences that are basically just Ns. The window I set was 1000 bp, and when I tried a bigger range such as 3 kb using bedops, the same stretch of Ns was still there. Is this common? Will it affect downstream analysis using MEME, TOMTOM and enrichment? Should I keep these sequences or remove them? If I should remove them, may I have some suggestions on how? Thank you very much.

bedtools promoter TFBS bedops • 929 views

updated 10 months ago by Ram 43k • written 2.9 years ago by greyman ▴ 190

1

For eukaryotes with complex genomes, these gaps are common. The Ns are inserted by the assembler, scaffolder, or contig-ordering software (or some other tool, e.g., for optical map integration) when the order and orientation of contigs and scaffolds can be inferred but the region between them is undetermined. You will have to read the genome metadata, and possibly check the genome's AGP file, to know whether this is the case.
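If the AGP file is not at hand, one way to see where the gaps sit is to scan the FASTA for runs of N and write them out as BED intervals. Below is a minimal sketch in plain Python, assuming an uncompressed FASTA; the function names `iter_fasta` and `n_runs` and the `min_len` parameter are made up for illustration, and the tiny in-memory genome is only there to make the example runnable.

```python
import io
import re

def iter_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def n_runs(handle, min_len=1):
    """Yield (chrom, start, end) for each run of N/n of at least min_len bases.

    Coordinates are 0-based and half-open, as in BED.
    """
    pattern = re.compile(r"[Nn]+")
    for name, seq in iter_fasta(handle):
        for match in pattern.finditer(seq):
            if match.end() - match.start() >= min_len:
                yield name, match.start(), match.end()

# Toy example (hypothetical data); for a real genome, pass open("genome.fa").
toy_genome = io.StringIO(">chr1 toy\nACGTNN\nNNAC\n>chr2\nnnACGT\n")
for chrom, start, end in n_runs(toy_genome):
    print(f"{chrom}\t{start}\t{end}")
```

The resulting intervals can be saved as a BED file and compared against the promoter windows to see which ones fall inside a gap.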

2.9 years ago by h.mon 35k

score 1 · Accepted Answer · 2021-05-24

MEME will treat N as any base (see their DNA alphabet table). If you can avoid including these Ns in your input regions, you should get cleaner search results. Starting with good MEME matrices will be useful if you then run TOMTOM searches for matches against published TFs and do further enrichment calculations based on those.
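One simple way to keep N-heavy sequences out of the MEME input is to drop any promoter whose fraction of Ns exceeds a cutoff before writing the FASTA. This is only a sketch: the names `n_fraction` and `filter_by_n`, the 10% cutoff, and the toy sequences are assumptions for illustration, not recommended values.

```python
def n_fraction(seq):
    """Return the fraction of positions that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)

def filter_by_n(records, max_n_fraction=0.0):
    """Keep (name, seq) records whose N fraction is at most max_n_fraction."""
    return [(name, seq) for name, seq in records
            if n_fraction(seq) <= max_n_fraction]

# Hypothetical promoter sequences, shortened for the example.
promoters = [
    ("geneA", "ACGTACGTAC"),  # no Ns
    ("geneB", "NNNNNNNNNN"),  # entirely gap
    ("geneC", "ACGTNACGTA"),  # a single N
]

kept = filter_by_n(promoters, max_n_fraction=0.1)
print([name for name, _ in kept])  # prints ['geneA', 'geneC']
```

With the default `max_n_fraction=0.0`, only sequences with no Ns at all are kept, which is the strictest option.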

You might look at filtering out any candidate regions that overlap blacklisted regions, e.g., https://personal.broadinstitute.org/anshul/projects/encode/rawdata/blacklists/hg19-blacklist-README.pdf and the thread "Where to download blacklisted regions?"

$ bedops -n 1 regions.bed <(sort-bed blacklist.bed) > answer.bed

Then you could generate your sequence from filtered regions in answer.bed.
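To make the filtering step concrete, here is a small Python sketch of what I understand the `-n 1` (not-element-of) operation to do: keep every region that does not overlap any blacklist interval by at least 1 bp. It is not a replacement for bedops, which works on sorted files and scales far better; the function names and toy intervals are made up for illustration. The same logic would apply if the "blacklist" were a BED file of N-gap intervals.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def not_element_of(regions, blacklist):
    """Return regions (chrom, start, end) that overlap no blacklist interval."""
    # Group blacklist intervals by chromosome for lookup.
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in regions:
        hits = by_chrom.get(chrom, [])
        # Simple all-against-all check per chromosome; fine for a sketch,
        # too slow for genome-scale inputs.
        if not any(overlaps(start, end, s, e) for s, e in hits):
            kept.append((chrom, start, end))
    return kept

# Hypothetical promoter windows and blacklist intervals.
regions = [("chr1", 0, 1000), ("chr1", 2000, 3000), ("chr2", 500, 1500)]
blacklist = [("chr1", 900, 1200), ("chr2", 1500, 1600)]

print(not_element_of(regions, blacklist))
# prints [('chr1', 2000, 3000), ('chr2', 500, 1500)]
```

Note that the chr2 window ends exactly where the blacklist interval begins, so under half-open BED coordinates the two do not overlap and the window is kept.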

Another potential source of trouble could be repeats. You might look at RepeatMasker datasets and rmsk2bed to do a filtering step on those.