Where Can I Download Short DNA Reads That Only Contain SNPs?
10.4 years ago

I need to work on alignments at SNP sites in DNA sequences, and I need real short reads. So far I have been generating my own data: random short reads that cover the SNP locations. But because the data is not real, the results I get look artificial.

I would appreciate it if someone could tell me where to download short reads that cover only SNPs.

short-read dna-seq • 2.5k views

Do you mean "a dataset of short reads which consists only of reads spanning at least one SNP position"?


I actually need to work only on SNPs. I need a data set in which all the homozygous sites have already been filtered out of each read, so that only heterozygous sites are left. I'm not a hundred percent sure that a data set like that already exists, but since I know many groups are working on haplotype assembly, I thought there should be one... thanks for your reply btw!

10.4 years ago
Michael 54k

I think you have to apply a filter to extract such reads from an existing dataset, because current sequencing technology doesn't let you control exactly where the reads come from. Even if only a short PCR fragment was used, the read positions will be random.


That's exactly what I need. I'm going to work on haplotype assembly, so all I need is a sequence of SNPs. So... you think there is no such dataset, and I have to download regular reads, align them myself and then filter them, right?

If you don't mind one more question: where do you recommend I download the data? Is 1000 Genomes a better option than HapMap? I heard that 1000 Genomes has denser SNPs than HapMap. Do you think this one is good: http://www.sph.umich.edu/csg/abecasis/MaCH/download/1000G.2012-03-14.html

Thanks a bunch!


I think 1kG is a good option if you want to evaluate an algorithm; it is huge and well analysed. If you use the BAM files, you can skip the alignment step. In general, Pierre's approach points in the right direction: the FTP URL points to a BAM file. I am not sure, though, whether filtering by CIGAR string fits your specification. I understood your intent to be that you want all reads whose alignments overlap the loci of known SNPs (e.g. from the 1kG variant calling), whatever allele they carry, as opposed to reads having any mismatch or insertion with respect to the reference.
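
To make the distinction concrete, here is a minimal Python sketch of the "overlaps a known SNP locus" filter. It is only an illustration under some assumptions: the function name `alleles_at_snps` is made up for this example, the known SNP positions are assumed to be already loaded as a set of 1-based coordinates on one chromosome, and the read is described by the POS, CIGAR and SEQ columns of a SAM record. It walks the CIGAR and reports the read base at each known site the alignment covers; a read with an empty result overlaps no known SNP and could be dropped.

```python
import re

# One CIGAR element: a length followed by an operator letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def alleles_at_snps(pos, cigar, seq, snp_positions):
    """Return {snp_position: read_base} for known SNP sites covered by a read.

    pos is the 1-based leftmost reference position (SAM column 4),
    cigar is SAM column 6 and seq is SAM column 10.
    snp_positions is a set of 1-based positions (assumed, hypothetical input).
    """
    ref = pos   # current 1-based reference coordinate
    qi = 0      # current 0-based index into seq
    found = {}
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":                      # consumes reference and read
            # Simple scan over all sites; fine for a sketch, slow for many SNPs.
            for site in snp_positions:
                if ref <= site < ref + n:
                    found[site] = seq[qi + (site - ref)]
            ref += n
            qi += n
        elif op in "IS":                     # consumes read only
            qi += n
        elif op in "DN":                     # consumes reference only
            ref += n
        # H and P consume neither reference nor read
    return found

if __name__ == "__main__":
    # Toy read: 3 aligned bases, 1 inserted base, 2 aligned, 1 deleted, 2 aligned.
    print(alleles_at_snps(100, "3M1I2M1D2M", "ACGTTGCA", {101, 104, 105, 107, 200}))
```

Position 105 falls inside the deletion and 200 is outside the alignment, so neither is reported. Keeping only heterozygous sites would then be a matter of passing only those positions in `snp_positions`.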

10.4 years ago

I wrote a program named SamFixCigar (https://github.com/lindenb/jvarkit/wiki/SamFixCigar) that transforms the 'M' of a CIGAR string to either 'X' (mismatch) or '=' (match). So reads containing a mismatch (a candidate SNP) will have the CIGAR operator 'X', and reads containing an indel will have 'I' or 'D'.

Fetch a BAM from the 1000 Genomes Project and get the reads with SNPs:

$ curl "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" |\
  java -jar dist/samfixcigar.jar -r /path/to/human_g1k_v37.fasta |\
  awk -F '\t' '(index($6,"X")!=0 || index($6,"I")!=0 || index($6,"D")!=0)' |\
  head

ERR018420.9475645    163    1    9999    9    2=1X43=30S    =    10403    459    TAAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCAAAACCTAACCCCA    <@AAA<=<A@A<<=A@@;=<A?@;==A?@;=<A?@<2<=>8;<<@###############################    X0:i:2    X1:i:10    XC:i:46    MD:Z:0N0N0T43    RG:Z:ERR018420    AM:i:0    NM:i:3    SM:i:0    XN:i:2    BQ:Z:[_`VM@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    MQ:i:9    XT:A:R
ERR018420.21083488    161    1    10000    0    1=1X48=26S    18    10105    0    AAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAA    <BAB>??AAB>??AAA>??AAA>??AAA=??A?A=??@@A=???6?48>###########################    X0:i:8    X1:i:475    XC:i:50    MD:Z:0N0T48    RG:Z:ERR018420    AM:i:0    NM:i:2    SM:i:0    XN:i:1    BQ:Z:[aVN@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    MQ:i:0    XT:A:R
ERR018418.28989102    145    1    10000    0    41S9=1X11=1X13=    18    63690    0    TGTGCCTAACCCATACCCTAACCATGATCCTATCCCTAAACATAACCCTATCCCTAACCCTATCCCTAACCCTAAC    ############################################################################    X0:i:2    X1:i:0    XC:i:35    MD:Z:0N8A11A13    RG:Z:ERR018418    AM:i:0    NM:i:3    SM:i:0    XN:i:1    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    MQ:i:23    XT:A:R
ERR018418.15408021    161    1    10002    0    64=1X9=2S    5    11692    0    AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCATAACCCTAACC    <B>@@BAB??@BAB>@@BAA>?@CAB>@@B?@=?@BAB=@<AAB?@4BAB=8<@A<4:7AAB=?1>?5:72@@###    X0:i:377    XC:i:74    MD:Z:64C9    RG:Z:ERR018418    AM:i:0    NM:i:1    SM:i:0    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    MQ:i:0    XT:A:R
ERR018420.16616190    81    1    10010    0    11=1X5=1X58=    5    11449    0    CCCTAACCCTACCCCTACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    A>A>A:139-?$0399@'??A?A=@@@AB@@@@AB@@@@AB@@@@AB@@@@AB@@@@AB@@@AAB@@AAABAA?>@    X0:i:371    MD:Z:11A5A58    RG:Z:ERR018420    AM:i:0    NM:i:2    SM:i:0    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    MQ:i:0    XT:A:R
ERR018418.24273883    1161    1    10011    0    31=1X3=41S    =    10011    0    CCTAACCCTAACCCTAACCCTAACCCTAACCATAACCCTAACCCTAACCCTAAAACGAATGCATAGGCTTTATTTT    ############################################################################    X0:i:590    XC:i:35    MD:Z:31C3    RG:Z:ERR018418    AM:i:0    NM:i:1    SM:i:0    BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    XT:A:R
ERR018420.32792295    97    1    10011    0    41=1X24=10S    5    10551    0    CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACCCTAACCCTAACCCTA    @?AAB?@@BAB>@@AAB>@@AAB>@@AAB>3@B@B>@;A>8)0/<AA>@@B?@<@@B??<@;B?@###########    X0:i:393    XC:i:66    MD:Z:41C24    RG:Z:ERR018420    AM:i:0    NM:i:1    SM:i:0    BQ:Z:BA@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    MQ:i:0    XT:A:R
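
For anyone who would rather do the last filtering step in a script than in awk, the same test can be sketched in Python. This is only a rough equivalent, assuming tab-separated SAM text whose CIGAR strings already use '=' and 'X' (for example after SamFixCigar); the names `cigar_ops` and `has_variant` are made up for this example.

```python
import re

# Captures the operator letter of each CIGAR element, e.g. "64=1X9=2S".
CIGAR_OP_RE = re.compile(r"\d+([MIDNSHP=X])")

def cigar_ops(cigar):
    """Return the set of operator letters used in a CIGAR string."""
    return set(CIGAR_OP_RE.findall(cigar))

def has_variant(sam_line, ops=frozenset("XID")):
    """True if the record's CIGAR (SAM column 6) uses any operator in ops."""
    if sam_line.startswith("@"):          # header line, not an alignment
        return False
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:                  # not a complete SAM record
        return False
    return bool(cigar_ops(fields[5]) & set(ops))

if __name__ == "__main__":
    import sys
    # Reads SAM text on stdin and prints only the records with X, I or D.
    for line in sys.stdin:
        if has_variant(line):
            sys.stdout.write(line)
```

Passing `ops={"X"}` keeps only reads with a mismatch and ignores indels, which is closer to "reads with SNPs" in the strict sense.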