Question: Position of N nucleotides in reference genome
0
gravatar for QVINTVS_FABIVS_MAXIMVS
3.4 years ago by
USA SoCal
QVINTVS_FABIVS_MAXIMVS2.2k wrote:

Title says it all.

Is there a resource that has the positions for every "N" nucleotide in the reference genome (hg38)?

If not, is there a method to quickly generate these positiosn.

I was thinking using bedtools nuc with a bed file that windows the genome, but that seems too much work.

Thanks!

 

 

reference genome nucleotide • 1.2k views
ADD COMMENTlink modified 3.4 years ago by Alex Reynolds28k • written 3.4 years ago by QVINTVS_FABIVS_MAXIMVS2.2k
4
gravatar for Alex Reynolds
3.4 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

On Linux, you could do the following to get sequences, extract them to fasta, and write output to BED format:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$ for i in `seq 1 22` X Y; do \
    ./twoBitToFa -seq=chr$i hg38.2bit hg38.2bit.chr$i.fa; \
  done
$ for i in `seq 1 22` X Y; do \
    awk -vchr="chr$i" ' \
        BEGIN { \
            p = 0; \
        } \
        { \
            if ($0 !~ /^>/) { \
                n = split($0, chars, ""); \
                for (i = 1; i <= n; i++) { \
                    if (chars[i] == "N") { \
                        printf("%s\t%d\t%d\n", chr, p, p + 1); \
                    } \
                    p++; \
                } \
            } \
        }' hg38.2bit.chr$i.fa \
        > hg38.2bit.chr$i.N.bed; \
  done

The work in the for loops can be parallelized, if you need a speed-up, by splitting out work to per-chromosome tasks.

If you need contiguous regions of Ns in the output BED, you can use BEDOPS bedops --merge:

$ for i in `seq 1 22` X Y; do \
    bedops --merge hg38.2bit.chr$i.N.bed > hg38.2bit.chr$i.N.contiguous.bed; \
  done

Be prepared for large BED files, unless you do a bedops --merge at the end.

Since these are three-column BED files, using the Starch format to archive BED files would be pretty useful here — there's a lot of redundancy to be removed. You could pipe the output of awk to starch to make an archive:

$ for i in `seq 1 22` X Y; do \
    awk -vchr="chr$i" ' \
        BEGIN { \
            p = 0; \
        } \
        { \
            if ($0 !~ /^>/) { \
                n = split($0, chars, ""); \
                for (i = 1; i <= n; i++) { \
                    if (chars[i] == "N") { \
                        printf("%s\t%d\t%d\n", chr, p, p + 1); \
                    } \
                    p++; \
                } \
            } \
        }' hg38.2bit.chr$i.fa \
        | starch - \
        > hg38.2bit.chr$i.N.starch; \
  done
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Alex Reynolds28k

I would vote you up and say it's the solution but I keep getting a "Unable to submit vote" error

ADD REPLYlink written 3.4 years ago by QVINTVS_FABIVS_MAXIMVS2.2k

Don't worry, I get that a lot.

ADD REPLYlink written 3.4 years ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 795 users visited in the last hour