NNNNNN in genomic file
16 months ago
greyman ▴ 170

I downloaded a genome file and a GFF file from the NCBI FTP site. While extracting promoter sequences upstream of my genes of interest, I found a few sequences that consist almost entirely of Ns. The window I set was 1000 bp, and when I tried a larger window such as 3 kb using bedops, the same runs of Ns were still there. Is this common? Will it affect downstream analysis using MEME, TOMTOM and enrichment? Should I keep these sequences or remove them, and if I should remove them, do you have any suggestions? Thank you very much.

TFBS promoter bedtool bedops • 525 views

For eukaryotes with complex genomes, these gaps are common. The Ns are inserted by the assembler, scaffolder, or contig-ordering software (for example, a tool for optical map integration) when the order and orientation of contigs and scaffolds can be inferred but the sequence between them is undetermined. You will have to read the genome metadata, and possibly check the genome AGP file, to know whether this is the case.
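If the assembly ships with an AGP file, the gap rows can be listed directly. The following is only a minimal sketch, assuming the standard tab-delimited AGP layout in which column 5 is the component type (`N` or `U` for gaps) and column 7 is the gap type on gap rows; the function name `agp_gaps_to_bed` and the toy records are made up for illustration.

```python
def agp_gaps_to_bed(agp_lines):
    """Yield (object, start, end, gap_type) for every gap row in an AGP file.

    AGP coordinates are 1-based and inclusive, so the start is shifted by
    one to give BED-style 0-based, half-open intervals.
    """
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a well-formed AGP row
        if fields[4] in ("N", "U"):  # N: gap of known size, U: unknown size
            yield (fields[0], int(fields[1]) - 1, int(fields[2]), fields[6])


# Toy AGP records (hypothetical): two contigs joined by a 100 bp gap.
example_agp = [
    "# toy AGP",
    "chr1\t1\t5000\t1\tW\tcontig1\t1\t5000\t+",
    "chr1\t5001\t5100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chr1\t5101\t9000\t3\tW\tcontig2\t1\t3900\t+",
]

gaps = list(agp_gaps_to_bed(example_agp))
for gap in gaps:
    print(*gap, sep="\t")
```

With a real file you would pass an open file handle instead of the toy list. If your promoter windows overlap these intervals, the Ns are scaffolding gaps rather than a download or extraction problem.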

16 months ago

MEME will treat N as any base (see their DNA alphabet table). If you can avoid including these in your input regions, you should get cleaner search results. Starting with good MEME matrices will be useful if you then run TOMTOM searches for matches with published TFs and base further enrichment calculations on those matches.

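For example, with your sorted promoter intervals in regions.bed and the intervals you want to exclude (such as the N gaps) in blacklist.bed, this keeps only the regions that do not overlap the blacklist by at least 1 bp:

$ bedops -n 1 regions.bed <(sort-bed blacklist.bed) > answer.bed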


You can then extract your sequences from the filtered regions in answer.bed.
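The answer does not say where blacklist.bed comes from. One possible way to build it, sketched here under the assumption that you simply want every run of N in the genome FASTA, is to scan the sequences and write out the runs as BED intervals. The names `n_runs_to_bed` and `min_len` are hypothetical, and the sketch holds one whole sequence in memory at a time.

```python
import re


def _runs(name, seq, min_len):
    """Yield (name, start, end) for each run of N/n at least min_len long."""
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_len:
            yield (name, match.start(), match.end())


def n_runs_to_bed(fasta_lines, min_len=1):
    """Yield 0-based, half-open BED intervals covering runs of N in a FASTA."""
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield from _runs(name, "".join(chunks), min_len)
            # Use the first word of the header as the sequence name.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield from _runs(name, "".join(chunks), min_len)


# Toy FASTA (hypothetical) with a gap that spans a line break.
toy_fasta = [">chr1 toy record", "ACGTNNNN", "NNACGTnn", ">chr2", "ACGT"]

intervals = list(n_runs_to_bed(toy_fasta))
for chrom, start, end in intervals:
    print(chrom, start, end, sep="\t")
```

Writing those tab-separated lines to a file gives something usable as blacklist.bed in the bedops command. Raising `min_len` would keep isolated ambiguous bases out of the blacklist, if you only want to exclude real assembly gaps.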

Another potential source of trouble could be repeats. You might look at RepeatMasker datasets and rmsk2bed to do a filtering step on those.