Question: Consensus Fasta Sequence Of A Given Genomic Region For Individuals From 1000G
gravatar for Pierre
4.8 years ago by
Pierre120 wrote:

Hey guys,

I am trying to retrieve consensus fasta sequences of a number of regions for specific individuals from 1000G, and further to make a multiple sequence alignment.

The pipeline I'm using for this task is not working. I would appreciate your help to let me understand what's wrong with it - and preferably how to speed up things.

Say, I have the following reference sequence (reference.fasta) for which I would like to retrieve the corresponding sequences for a number of 1000G individuals.

>hg19_knownGene_uc011dlx.1_7 range=chr6:29695734-29695883 5'pad=0 3'pad=0 strand=+ repeatMasking=none                                     GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG

Use samtools to get the sub-section of interest

  ./samtools view -h 6:29695734-29695883  > subsection.sam

Convert SAM to BAM (SAM header is present: 84 sequences)

./samtools view -bS subection.sam > subsection.bam

Sort and index the bam file

./samtools sort subsection.bam subsection_sorted.bam
./samtools index subsection_sorted.bam

Use the reference fasta sequence and the bam reads to create a consensus sequence

./samtools mpileup -uf reference.fasta subsection_sorted.bam | ./bcftools/bcftools view -cg - | ./bcftools/ vcf2fq > subsection_consensus.fastq

Use seqtk to convert fastq to fasta (optional: while marking sites with <20x coverage)

./seqtk seq subsection_consensus.fastq (-q20) > subsection_consensus.fa

Final output: a huge file (~60 MB) containing a sequence string of "n".

What I would like to have instead after a number of iteration across individuals would hopefully look like this:

>NA00001 ......................A...........................G..
>NA00002 ......................................T..............

Thanks in advance for any help / suggestion.

consensus 1000genomes samtools • 3.7k views
ADD COMMENTlink modified 2.8 years ago by greener10 • written 4.8 years ago by Pierre120
gravatar for Pierre Lindenbaum
4.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum111k wrote:

I suppose you're mixing the chromosomes from hg19 (with a 'chr' prefix) and the 1KGenomes chromosomes without 'chr' prefix; That is why you're getting some 'NNNNNNNNNN': the tool cannot find the chromosomes. Use the reference sequence of 1000genomes instead of hg19.

ADD COMMENTlink written 4.8 years ago by Pierre Lindenbaum111k

Hi Pierre, thank you for your reply. To see whether it is indeed a problem related to reference sequence, I used Grch37 reference and tried to retrieve the first 10 bases of the following transcript chr2:72356367-72356377: >ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTG However, I ended up with an even larger fasta file containing only non-bases. Any idea why this is not working?

ADD REPLYlink written 4.8 years ago by Pierre120

I am very likely pointing to the wrong data file, however, cannot really figure out the mistake. Since I'm only interested in the exome, I thought I might try this instead: samtools view -h 2:72356367-72356377 This did not work.

ADD REPLYlink written 4.8 years ago by Pierre120
gravatar for greener
2.8 years ago by
United States
greener10 wrote:

I am trying to do the same thing. I would like to generate a fasta file on a predefined region chr7:128575994-128592088 from the 1000 genomes project. Ideally I'd like to have the output of that region from each individual. It would be helpful to know if the above approach was every successful (and how they got it to work) or if folks could provide me tips on how to complete this. Any help or suggestions are appreciated.


ADD COMMENTlink written 2.8 years ago by greener10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1844 users visited in the last hour