Question

Reference genome location

0

Entering edit mode

19 months ago

Batel • 0

Hi,

I'm trying to get the alleles of non-SNP locations from the hg19 reference genome in FASTA format.

1) What is the best way to work with FASTA - either python/R/bash 2) How do you navigate in FASTA files, specifically in hg19? 3) How do I find the list of non-SNP bases, that are indeed sequenced in hg19?

Thanks, B.

reference positions FASTA alignment genome • 1.4k views

ADD COMMENT • link updated 19 months ago by barslmn ★ 2.3k • written 19 months ago by Batel • 0

3

Entering edit mode

What is a non-SNP location? How it would be in FASTA format?

Could you make a mock-up of the result you want?

ADD REPLY • link 19 months ago by barslmn ★ 2.3k

0

Entering edit mode

Yes, of course.

I want something that looks like this (positions and alleles are made up)

chr    position   allele_in_ref_genome
1      2457643    A
1      2457644    G
1      2347644    A
1      2347646    T
...

(2347645 is missing because it was a SNP)

A non-SNP is a location that has no variability. for example here, in the UCSC genome browser, I want to get all these bases (marked in yellow) excluding the ones that are SNPs (marked in red), but in a table format as above.

ADD REPLY • link 19 months ago by Batel • 0

3

Entering edit mode

Oh I see, this is certainly an unusual use case. Here is a minimal example you can build upon: https://colab.research.google.com/drive/1-n7smJGy-rdkbD9aT0Lkk5BeID0ix2rv?usp=sharing

Because of the size of the dbSNP it is not a computationally speedy process. I didn't run this all the way through, so there might be errors. However, colab at least have good download speed and enough storage to handle it.

ADD REPLY • link 19 months ago by barslmn ★ 2.3k

0

Entering edit mode

Barslmn, Thank you so much for your reply. I didn't realize it would take that much time to produce the table.

I can actually subset the desired positions to areas in the range of 40 bases around a SNP. so for example, I have these target SNPs:

>            rs4818955    21        0.579215        44829087 C T
>           rs73380301    21        0.579475        44840835 A G
...

I would like to get:

chr         pos    allele_in_reference
21    44829067     A    # 20 bases before the first SNP
21    44829068     T
...
21    44829087     C # the actual SNP
21    44829088     A
...
21    44829107     C #20 bases down the SNP

Then on to the next SNP. I'm okay to keep the SNPs in the table - I can exclude them afterward.

The first problem is I don't know if all these positions are actually sequenced in hg19.

The second is how to produce the output.

I would very much appreciate your help.

Thank you, B.

ADD REPLY • link 19 months ago by Batel • 0

2

Entering edit mode

This is a better approach and easier one to solve. You can use samtools to retrieve your sequences.

# shell
chrom="chr21"
SNP_pos=10000000
start=$((SNP_pos - 20))
end=$((SNP_pos + 20))
samtools faidx https://igv.genepattern.org/genomes/seq/hg19/hg19.fasta $chrom:$start-$end

After this you can check the colab notebook to format it into a table.

ADD REPLY • link 19 months ago by barslmn ★ 2.3k

0

Entering edit mode

Thank you again, barslmn!

I've followed your answer and indeed got the sequence I wanted, but the table does not hold the position, and the order is mixed : for example, for SNP rs2843964:

chrom="21"
SNP_pos=38434625
start=$((SNP_pos - 20))
end=$((SNP_pos + 20))
samtools faidx $ref_genome $chrom:$start-$end

21:38434605-38434645 AGGATGTGTATATCCACTGATACCTCTCCCAGGTTGGGCTC

Then:

grep -v '>' ref.txt | sed 's/./\0\n/g'   | sed '/^$/d'     | nl      | sort -k 1b   > ref.tsv

1 A
10 A
11 T
12 A
13 T
...

If the alleles stay in the right order it will be easy to just concat the base position next to the alleles... Thank you again!

ADD REPLY • link 19 months ago by Batel • 0

1

Entering edit mode

In previous example we need to sort in order the join the two files. Here we don't need to sort anything. However, we need to get the correct position since nl starts from 1. We can add start position to nl index in order to get the real position.

echo "21:38434605-38434645 AGGATGTGTATATCCACTGATACCTCTCCCAGGTTGGGCTC" |
     cut -d " " -f 2 |  # this cut command gets us the sequence
     sed 's/./\0\n/g' | # we split every base to line
     sed '/^$/d'  |     # we remove the emptylines
     nl | # nl index the lines and awk here adds the start position to that index we subtract 1 because nl starts from 1
     awk -v chrom=$chrom -v startpos=$start '{printf "%s\t%s\t%s\n", chrom, $1+startpos-1, $2 }' # Also we can pass chrom to print it with awk

This would output:

21      38434605        A
21      38434606        G
21      38434607        G
21      38434608        A
21      38434609        T
21      38434610        G
21      38434611        T
21      38434612        G
21      38434613        T
21      38434614        A
21      38434615        T
21      38434616        A
21      38434617        T
21      38434618        C
21      38434619        C
21      38434620        A
21      38434621        C
21      38434622        T
21      38434623        G
21      38434624        A
21      38434625        T
21      38434626        A
21      38434627        C
21      38434628        C
21      38434629        T
21      38434630        C
21      38434631        T
21      38434632        C
21      38434633        C
21      38434634        C
21      38434635        A
21      38434636        G
21      38434637        G
21      38434638        T
21      38434639        T
21      38434640        G
21      38434641        G
21      38434642        G
21      38434643        C
21      38434644        T
21      38434645        C

ADD REPLY • link 19 months ago by barslmn ★ 2.3k