How to use samtools to assembly a gene?
1
0
Entering edit mode
9.4 years ago
clyumath ▴ 20

Hi,

I'm a new comer for samtools.

Here's the question: we got the raw data (reads) from several people. We are only interested in several genes, say, one of them is FTO. So, we want to assembly a person's FTO gene, not its whole genome. The final goal is to find the SNP of this person on this gene.

By using BWA, we did the mapping the reads to the RefSeq FTO gene, and got a large file called "Sequence.fq.sam", which is 23.7GB. To my understanding, all the aligned reads to this FTO gene should not be so huge.

Anyway, I think the next job is to use samtools.

  1. Could you give me some instruction about how to use samtools to get the assembly FTO gene of this person?
  2. How to call the SNP for the assembly FTO gene?

Thank you very much!

Best regards,
clyumath

Assembly • 5.6k views
ADD COMMENT
0
Entering edit mode

To get the consensus sequence of a gene (or its SNPs given a reference sequence) you can use a combination of samtools, bcftools, vcfutils (or any other software cited here.

samtools mpileup -u -l gene.bed aln.bam | bcftools view -c - | vcfutils.pl vcf2fq > cns.fq

but if your final goal is to get a list of SNPs for more than one gene, my suggestion is to use all of the reads at once.

ADD REPLY
1
Entering edit mode
9.4 years ago

the sam file is so big because it's plain text. if you want to reduce its size, just convert it to binary using samtools view:

samtools view -bS Sequence.fq.sam > Sequence.bam

sorting and indexing the bam file is then required:

samtools sort Sequence.bam Sequence.sort
samtools index Sequence.sort.bam

variant calling can be performed using multiple options. you may use the well known combination of samtools and bcftools:

samtools mpileup -uf ref.fa Sequence.sorted.bam | bcftools call -mv - > Sequence.vcf

if you want to limit the variant calling to be performed on the FTO gene only (chr16:53,735,875-54,150,379), then use the samtools mpileup's '-r' option:

samtools mpileup -uf ref.fa -r chr16:53,735,875-54,150,379 Sequence.sorted.bam | bcftools call -mv - > Sequence.FTO.vcf

you can limit the variant calling a little bit further by using a bed file with the exon coordinates for instance, using samtools mpileup's '-l' option.

samtools mpileup -uf ref.fa -l FTO.exons.bed Sequence.sorted.bam | bcftools call -mv - > Sequence.FTO.exons.vcf

being ref.fa the fasta file of the reference you used to map your reads

ADD COMMENT
0
Entering edit mode

When I tried it, it appears:

[fai_load] build FASTA index.
[mpileup] Set max per-file depth to 8000
[bam_pileup_core] the input is not sorted (reads out of order)
[bam_plp_destroy] memory leak: 2. Continue anyway.

So, I did the sort for my BAM file firstly, as

samtools sort sequence.bam sequence.sort.bam

Then I tried your commend:

samtools mpileup -uf FTO.fa Sequence.bam | bcftools call -mv - > Sequence.vcf

I got the vcf file.

So, what should I do next? Where can I find the SNP information for this sample?

Thank you!

ADD REPLY
0
Entering edit mode

you're right. sorting the bam file is required. I've updated my answer.

ADD REPLY
0
Entering edit mode

Hi Sir,

Thanks! I find you also change "bcftools view" into "bcftools call", right?

But when I change to

samtools mpileup -uf ref.fa Sequence.sorted.bam | bcftools call -mv - > Sequence.vcf

it said

[main] Unrecognized command.
[mpileup] 1 samples om 1 input files
<mpileup> Set max per-file depth to 8000

Furthermore, for

samtools mpileup -uf ref.fa -r chr16:53,735,875-54,150,379 Sequence.sorted.bam | bcftools call -mv - > Sequence.FTO.vcf

Both "bcftools call" and "bcftools view" gave

[bam_index_load] fail to load BAM index 
[mpileup] fail to load index for sequ.sorted.bam

and "bcftools call" gave one more: [main] Unrecognized command

How should I do? Thank you!

ADD REPLY
0
Entering edit mode

you're right again. bam file should be indexed right after being indexed. answer updated.

in previous versions, variant calling was performed with bcftools view, although now bcftools call is prefered as stated here.

ADD REPLY
0
Entering edit mode

Thanks, I did

samtools index Sequence.sort.bam

But when I checked my bcftools, there is no "call" there, so I can still use only "bcftools view".

For

samtools mpileup -uf ref.fa Sequence.sorted.bam | bcftools view -mv - > Sequence.vcf

it works out the VCF file, but for

samtools mpileup -uf ref.fa -r chr16:53,735,875-54,150,379 Sequence.sorted.bam | bcftools view -mv - > Sequence.FTO.vcf

it doesn't work well, it said:

[bam_parse_region] fail to determine the sequence name.
[mpileup] malformatted region or wrong seqname for Sequence.sorted.bam

What's the problem? Thank you!!!!

ADD REPLY
0
Entering edit mode

if you "don't have" the bcftools call command it means you're not using latest bcftools (currently version 1.1)

the samtools error you're getting may be due to the reference you're using (does your reference include the "chr" string at the begining of each chromosome? if that isn't the case, you should remove it too from the region search). take into account thta if the former calling command worked you can always filter the vcf file for the variants inside your region of interest.

ADD REPLY
0
Entering edit mode

Yes, I also think my reference file is not correct. I just used a FASTA file for FTO gene as the reference, like:

>gi|226423916|ref|NM_001080432.2| Homo sapiens fat mass and obesity associated (FTO), mRNA
CTACGCTCTTCCAGCTGTCGGACCTGGGAAATTCTCCTGTGCTAAATCCCGTGGCGCTCGCGGGTGTCGC
CGCGGTGCATCCTGGGAGTTGTAGTTTTTTCTACTCAGAGGGAGAATAGCTCCAGACGGGAGCAGGACGC
TGAGAGAACTACATGCAGGAGGCGGGGTCCAGGGCGAGGGATCTACGCAGCTTGCGGTGGCGAAGGCGGC
TTTAGTGGCAGCATGAAGCGCACCCCGACTGCCGAGGAACGAGAGCGCGAAGCTAAGAAACTGAGGCTTC
TTGAAGAGCTTGAAGACACTTGGCTCCCTTATCTGACCCCCAAAGATGATGAATTCTATCAGCAGTGGCA
GCTGAAATATCCTAAACTAATTCTCCGAGAAGCCAGCAGTGTATCTGAGGAGCTCCATAAAGAGGTTCAA
.....

I think it could not be the reference, so what file should I use since I only focus FTO gene?

Thank you very much!

ADD REPLY
0
Entering edit mode

if you've mapped your reads only to the gene fasta, then all variants would be inside the gene, so no filtering should be performed. but mapping your reads only to a reduced size reference is not desirable, since you may be losing information about where those reads may actually come from (pseudogenes, homology regions,...). unless you know what you're doing, I would suggest you to map your reads against the entire human reference, and then filter the reads that properly map inside the gene limits.

ADD REPLY
0
Entering edit mode

Thanks, in order to save time, can I map the reads against only chr16 rather than the entire human reference? Because FTO gene is on chr16:53,735,875-54,150,379.

Can I?

ADD REPLY
0
Entering edit mode

there are several posts here that discuss this issue. most people prefer to invest time on the alignment, since it's usually performed only once and a good mapping is crucial for the variant calling process. if it's just explanatory I wouldn't mind focusing only on the gene region (genomic coordinates should be edited later) or on the particular chromosome (genomic coordinates would be straight), but if you're planning to take any relevant decisions on the variants detected I would definitely try to use the entire reference.

ADD REPLY

Login before adding your answer.

Traffic: 1375 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6