Question

Extract the region and mutant SNP allele

0

Entering edit mode

6.6 years ago

DanielC ▴ 170

Dear All,

I have data related to the SNPs like this:

1) The chromosome region for the 5' and 3' UTRs obtained from Ensembl. For example, in this example below the 5' and 3' UTRs are from chromosome 17 and their respective regions are also shown.

5'-UTR: 17:43124097-43124115

3'-UTR: 17:43044295-43045677

2) The rsIDs of the SNPs within the 5' and 3' UTRs regions mentioned above.

    17  43124097    43124097    rs587781565 five_prime_UTR
    17  43124098    43124098    rs273899693 five_prime_UTR
    17  43124099    43124099    rs273900720 five_prime_UTR
    17  43124106    43124106    rs748057929 five_prime_UTR
    17  43124107    43124107    rs273897654 five_prime_UTR
    17  43044494    43044494    rs931310802 three_prime_UTR
    17  43044518    43044517    rs34214126  three_prime_UTR
    17  43044518    43044518    rs1010615361    three_prime_UTR
    17  43044539    43044539    CR1212932   three_prime_UTR
    17  43044539    43044539    rs746187092 three_prime_UTR
    17  43044560    43044560    rs1021990558    three_prime_UTR
    17  43044561    43044561    rs571007748 three_prime_UTR

Can someone please tell me;

a) How to extract the sequence between the regions in 1) like this:

>NM_007299.3 | 43124097-43124115 | 5' UTR
cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
tggaacagaaagaa

>NM_007299.3 | 43044295-43045677 | 3' UTR
    ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
    gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
    aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
    agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
    gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc

b) Get the information about the mutant alleles corresponding to a respective rsID like this:

   17   43124097    43124097    rs587781565 five_prime_UTR         C(reference allele)       T(mutant alele)
   17   43124098    43124098    rs273899693 five_prime_UTR         A(reference allele)       G(mutant alele)
   17   43044560    43044560    rs1021990558    three_prime_UTR       C(reference allele)       G(mutant alele)
   17   43044561    43044561    rs571007748 three_prime_UTR       T(reference allele)       G(mutant alele)

Thanks much!

SNP • 1.8k views

ADD COMMENT • link updated 6.4 years ago by Alex Reynolds 35k • written 6.6 years ago by DanielC ▴ 170

1

Entering edit mode

Part of answer to question 1 is bedtools getfasta.

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks! I guess, I need to download the chromosome fasta file locally to run bedtools? Can it be done without downloading the chromosome fasta file locally?

ADD REPLY • link 6.6 years ago by DanielC ▴ 170

0

Entering edit mode

an it be done without downloading the chromosome fasta file locally?

yes, but it's slower: How To Get The Sequence Of A Genomic Region From Ucsc?

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

bedtools getfasta needs the file locally. Without downloading you would need another method. It's possible without downloading, but I don't see why you want that.

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks! Can you please tell me a reliable source from where I can download the chromosome fasta files for bedtools?

ADD REPLY • link 6.6 years ago by DanielC ▴ 170

1

Entering edit mode

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2017-11-13

One way to do the first step is use samtools-indexed FASTA files with a wrapper script like bed2fastaidx.pl.

First, grab your reference genome by its chromosomes, recompressing them with bgzip and indexing them with samtools (e.g., hg38):

$ cd /foo/bar/baz
$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/*.fa.gz
$ for fn in `ls *.fa.gz`; do gunzip -c $fn | bgzip -c > $fn.fa.bgz; mv $fn.fa.bgz $fn; done
$ for fn in `ls *.fa.gz`; do samtools faidx $fn; done

Then convert your intervals to BED format and use the wrapper script to get sequence out of the BED file and into FASTA:

$ bed2faidxsta.pl --fastaDir=/foo/bar/baz < intervals.bed > intervals.fa

To do the second step, you could grab the SNPs for your reference genome (e.g., hg38) and convert them to a sorted BED file with BEDOPS sort-bed:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $2, $3, ($3+1), $10" "$11; }' \
   | sort-bed - \
   > hg38.snp147.bed

Then use BEDOPS bedmap on your SNPs from the second file, to map your SNPs to the variant information in the dbGAP table:

First convert your text file to a UCSC-formatted and sorted, zero-indexed and half-open BED file:

$ awk -v OFS="\t" '{ print "chr"$1, $2, ($2+1), $4; }' SNPs.txt | sort-bed - > SNPs.bed

Then map your SNPs to the ID field of the dbGAP records:

$ bedmap --echo --echo-map-id --delim '\t' SNPs.bed hg38.snp147.bed > answer.bed