Question: 1000 GENOME RETRIEVE DATA
0
gravatar for mohammedabeddin1
4.7 years ago by
mohammedabeddin110 wrote:

Hello Everyone, I am a computer science and working on a bioinformatics project. I want to download SCN9a Gene sequence for all the 1000 individuals from 1000 genomes project. I have been struggling a lot . I tried using SRA toolkit and other stuff but it didn't work out . Requesting you guys to hep me out. The location for the SCN9a gene is Molecular Location on chromosome 2: base pairs 166,195,185 to 166,375,987

rna-seq gene • 2.0k views
ADD COMMENTlink modified 4.7 years ago by Jorge Amigo12k • written 4.7 years ago by mohammedabeddin110

please, define "download SCN9a Gene sequence for all the 1k individuals"

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Pierre Lindenbaum131k

I want each of the 1000 individual unique coding sequences or mRNA sequences for the human SCN9a gene

ADD REPLYlink written 4.7 years ago by mohammedabeddin110
3
gravatar for Jorge Amigo
4.7 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

if only fasta sequences are to be obtained, I would download all variants for that region

bcftools view -Oz -r 2:166195185-166375987 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > SCN9a.1000g.vcf.gz
tabix -p vcf SCN9a.1000g.vcf.gz

then the reference must be downloaded to be indexed locally (pity that 1000g doesn't have that index remotely, because the reference could be queried directly rather than downloaded)

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
samtools faidx human_g1k_v37.fasta.gz

and then build each sample's sequence by changing the reference with those variants

for sample in `bcftools view -h 1000g.SCN9a.vcf.gz | grep "^#CHROM" | cut -f10-`; do
  bcftools view -c1 -Oz -s $sample -o 1000g.$sample.vcf.gz SCN9a.1000g.vcf.gz
  tabix -p vcf 1000g.$sample.vcf.gz
  samtools faidx human_g1k_v37.fasta.gz 2:166195185-166375987 \
  | bcftools consensus 1000g.$sample.vcf.gz -o 1000g.SCN9a.$sample.fa
done

to do all this you will "only" need samtools and bcftools in an unix-like environment connected to internet.

ADD COMMENTlink modified 4.6 years ago • written 4.7 years ago by Jorge Amigo12k

I tried your solution and also changed the 1000g.SCN9a.vcf.gz to SCN9a.1000g.vcf.gz while building it but still I am unable to get the fasta file.

Error:
Failed to open 1000g.HG00096.vcf.gz: could not load index
[download_from_remote] fail to open remote file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
[fai_load] failed to open remote FASTA index ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
Could not load fai index of ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
Failed to open 1000g.HG00097.vcf.gz: could not load index
[download_from_remote] fail to open remote file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
[fai_load] failed to open remote FASTA index ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai

I even tried after downloading the file manually to my system but its giving me this error. After generatiing the fai file as well. My code:

for sample in `./bcftools view -h SCN9a.1000g.vcf.gz | grep "^#CHROM" | cut -f10-`; do
  ./bcftools view -c1 -Oz -s "$sample" -o "vcf/1000g.$sample.vcf.gz" 'SCN9a.1000g.vcf.gz'
  samtools faidx human_g1k_v37.fasta 2:166195185-166375987 | ./bcftools consensus vcf/1000g.$sample.vcf.gz -o output/1000g.SCN9a.$sample.fa
done

Error :Failed to open vcf/1000g.HG00096.vcf.gz: could not load index
Failed to open vcf/1000g.HG00097.vcf.gz: could not load index
Failed to open vcf/1000g.HG00099.vcf.gz: could not load index
Failed to open vcf/1000g.HG00100.vcf.gz: could not load index
Failed to open vcf/1000g.HG00101.vcf.gz: could not load index
Failed to open vcf/1000g.HG00102.vcf.gz: could not load index
Failed to open vcf/1000g.HG00103.vcf.gz: could not load index
Failed to open vcf/1000g.HG00105.vcf.gz: could not load index
Failed to open vcf/1000g.HG00106.vcf.gz: could not load index
Failed to open vcf/1000g.HG00107.vcf.gz: could not load index
Failed to open vcf/1000g.HG00108.vcf.gz: could not load index
Failed to open vcf/1000g.HG00109.vcf.gz: could not load index
Failed to open vcf/1000g.HG00110.vcf.gz: could not load index
Failed to open vcf/1000g.HG00111.vcf.gz: could not load index
Failed to open vcf/1000g.HG00112.vcf.gz: could not load index
Failed to open vcf/1000g.HG00113.vcf.gz: could not load index
ADD REPLYlink modified 4.6 years ago by Jorge Amigo12k • written 4.6 years ago by mohammedabeddin110

it seems like the 1000g reference must indeed be downloaded locally in order to generate the missing fai index. pity they don't have that fai file remotely too, since the reference could be queried as stated rather than downloaded. so you did right.

but regarding the next error, there's a step missing in my code right after the bcftools command, just before the samtools command: the vcf.gz index must be generated. I've updated my answer to reflect this.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Jorge Amigo12k

did you try to use the http protocol instead of ftp ?

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

ADD REPLYlink written 4.6 years ago by Pierre Lindenbaum131k
1

yes Pierre. both http and ftp protocols give the same error, since the .fasta.gz.fai is not there (although there is a .fasta.fai indeed)

$ samtools faidx http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai 2:166195185-166375987
[download_from_remote] fail to open remote file http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai.fai
[fai_load] failed to open remote FASTA index http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai.fai
Could not load fai index of http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai

I've reported it as a bug, since the GRCh38 reference is perfectly described in that folder. either they could index the compressed .fasta.gz reference, or they could uncompress it. I would go for the first one, since the download time would be much lower, but I'm just giving some ideas out.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Jorge Amigo12k

Thanks a lot Jorge !!!! Now that we have the raw DNA sequences for the SCN9A gene from all available individuals, we intend to extract the coding sequence from them using information available on NCBI as a guide. We would ultimately like to align these coding sequences to identify SNPs within the exons. After testing the process by extracting and comparing the CDS from two individuals we found that the base pairs did not line up and were not at all similar to the reference gene available from NCBI and we are not sure why this is the case.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by mohammedabeddin110

file:///home/ahmed/Pictures/Desktop_005.png

ADD REPLYlink written 4.6 years ago by mohammedabeddin110
0
gravatar for archana.bioinfo87
4.7 years ago by
archana.bioinfo87180 wrote:

Dear you can download whole genome using SRA tool kit and and you are saying you are knowing the position; so can extract the sequences from that position. You can try some perl or other script for that. Hope it will help you.

ADD COMMENTlink written 4.7 years ago by archana.bioinfo87180

Can you give me the steos as to how I can do that ??

ADD REPLYlink written 4.6 years ago by mohammedabeddin110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2184 users visited in the last hour