Question

Creating a Zebrafish reference genome for use by mpileup

0

Entering edit mode

6.0 years ago

dominicdhall ▴ 40

I'm trying to download an appropriate reference genome for use in reference calling a bunch of reads I have for zebrafish. The reads I have are from the 10x platform looking at single cell RNA transcripts. Currently I have a BAM file and would like to create a vcf file of variant sites within my reads.

I have a sorted file sorted_QC.bam of reads coming from cells which have previously passed QC. In order to do variant calling I want to run something of the form:

bcftools mpileup --ignore-RG -f <ref.fa> sorted_QC.bam | bcftools call -mv -O v -o <out.vcf>

For use as a reference I used the GRCz11 genome build from Ensembl and chose to include

cDNA sequences
Transcript Stable ID
Gene Name
Gene Stable ID
Chromosome/Scaffold name

This is a sample of the fasta file:

    >ENSDARG00000033231|ENSDART00000000042|mcm6l|2|50199738|50225495
TTACCTGCTCCAGGTGTGCGTGTGTTGGTCGAGTGATGGAGCCTGGTGTCGAAGCGGCGG
GCGTGCACGTGCGCGATGAACTGTCCGAGAAGTGTCAGAAGTTGTTTCTGGAGTTTTTGG
AAGAGTTTCAGGATAAAAACGGCGACGCTCTATATCTGTCGGACGCGCAGGAGCTGATCC
GACCGGAGAGAAACACGCTGACCGTGAGCTTCAGCAACATCGAGCACTACAACCAGCAGC
TCGCCACCACCATCCAGGAGGAGTACTACAGGGTTTATCCGTTTCTCTGCAGAGCAGTTC
GTCACTTTGCCAGAGATCATGGAAATATTCCACCTTCCAAAGAGTTTTATGTGGCTTTTT
CGGAGTTTCCATCAAGGCAAAAGATTCGTGAGCTCTCCACTGTACGTATTGGCACTTTGT
TGCGCATCAGCGGTCAGGTGGTCCGAACCCACCCTGTGCACCCAGAGCTGGTCAGCGGCA
CCTTCCTCTGCCTGGACTGTCAGTCGGTCATCAAAGATGTGGAGCAGCAGTTCAAATACA

I only included unique results and once I had the fasta file I ran

samtools faidx <ref.fa>

to create <ref.fa.fai>

However, when i run the command I get the following error:

[E::faidx_fetch_seq] The sequence "1" not found

over and over again. I checked the faidx file and this is how it looked:

ENSDARG00000033231|ENSDART00000000042|mcm6l|2|50199738|50225495 2585    65  60  61
ENSDARG00000000370|ENSDART00000000382|triob|16|8751815|8927425  8958    2758    60  61
ENSDARG00000076182|ENSDART00000000280|stat1b|9|41218967|41247197    4113    11932   60  61
ENSDARG00000000183|ENSDART00000000192|ptpn4b|22|11778053|11833334   6443    16181   60  61
ENSDARG00000000151|ENSDART00000000160|thraa|3|34688022|34753605 2292    22797   60  61
ENSDARG00000000241|ENSDART00000000250|slc40a1|9|41103127|41113878   3780    25195   60  61
ENSDARG00000000068|ENSDART00000000069|slc9a3r1a|12|33484458|33537126    2033    29109   60  61
ENSDARG00000000069|ENSDART00000000070|dap|24|22070290|22103585  1859    31240   60  61
ENSDARG00000000002|ENSDART00000000005|ccdc80|9|34089156|34113209    2604    33196   60  61
ENSDARG00000036830|ENSDART00000000221|krt91|19|5378563|5380770  1582    35908   60  61

As a reference, this is a sample of the top of the header of my .bam file:

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:58871917
@SQ SN:10   LN:45574255
@SQ SN:11   LN:45107271
@SQ SN:12   LN:49229541
@SQ SN:13   LN:51780250
@SQ SN:14   LN:51944548
@SQ SN:15   LN:47771147
@SQ SN:16   LN:55381981
@SQ SN:17   LN:53345113

I believe I may have a problem with mpileup or faidx not being able to locate chromosome 1 in either of the files. However, i'm not sure exactly in what way I should edit the files to fix this. Alternatively, should I be using a different file as my reference genome?

Thanks!

RNA-Seq VCF variant calling faidx mpileup • 1.3k views

ADD COMMENT • link 6.0 years ago by dominicdhall ▴ 40

0

Entering edit mode

I will give a try just having the chromosome number - however, how will mpileup know where these variants actually lie if it just has the chromosome and sequence? Does it work out it out?

As for the comment - sorry that is very ambiguous from me. It is essentially just because i couldnt remember exactly what I had called my reference fasta file when i wrote the question! I don't think I have been misplacing symbols all over the place..

Thanks!

ADD REPLY • link 6.0 years ago by dominicdhall ▴ 40

0

Entering edit mode

Please use the ADD COMMENT button when replying to someone's answer.

ADD REPLY • link 6.0 years ago by h.mon 35k

score 2 · Accepted Answer · 2018-05-15

It can't locate the chromosomes because in the bam they are named 1, 2, and so on; and in the fasta reference they are named ENSDARG00000033231|ENSDART00000000042|mcm6l|2|50199738|50225495. You have to use the same reference used to map the files.

Just a comment / question, your example commands wrap file names around <>, like <ref.fa> and <out.vcf>. Do your real commands use <>, or are they used here just to demonstrate an example file name? Because > and < have special for bash, they redirect input and output, and should not be used in your commands.