Question: Creating a Zebrafish reference genome for use by mpileup
0
gravatar for dominicdhall
22 months ago by
dominicdhall40
dominicdhall40 wrote:

I'm trying to download an appropriate reference genome for use in reference calling a bunch of reads I have for zebrafish. The reads I have are from the 10x platform looking at single cell RNA transcripts. Currently I have a BAM file and would like to create a vcf file of variant sites within my reads.

I have a sorted file sorted_QC.bam of reads coming from cells which have previously passed QC. In order to do variant calling I want to run something of the form:

bcftools mpileup --ignore-RG -f <ref.fa> sorted_QC.bam | bcftools call -mv -O v -o <out.vcf>

For use as a reference I used the GRCz11 genome build from Ensembl and chose to include

  • cDNA sequences
  • Transcript Stable ID
  • Gene Name
  • Gene Stable ID
  • Chromosome/Scaffold name

This is a sample of the fasta file:

    >ENSDARG00000033231|ENSDART00000000042|mcm6l|2|50199738|50225495
TTACCTGCTCCAGGTGTGCGTGTGTTGGTCGAGTGATGGAGCCTGGTGTCGAAGCGGCGG
GCGTGCACGTGCGCGATGAACTGTCCGAGAAGTGTCAGAAGTTGTTTCTGGAGTTTTTGG
AAGAGTTTCAGGATAAAAACGGCGACGCTCTATATCTGTCGGACGCGCAGGAGCTGATCC
GACCGGAGAGAAACACGCTGACCGTGAGCTTCAGCAACATCGAGCACTACAACCAGCAGC
TCGCCACCACCATCCAGGAGGAGTACTACAGGGTTTATCCGTTTCTCTGCAGAGCAGTTC
GTCACTTTGCCAGAGATCATGGAAATATTCCACCTTCCAAAGAGTTTTATGTGGCTTTTT
CGGAGTTTCCATCAAGGCAAAAGATTCGTGAGCTCTCCACTGTACGTATTGGCACTTTGT
TGCGCATCAGCGGTCAGGTGGTCCGAACCCACCCTGTGCACCCAGAGCTGGTCAGCGGCA
CCTTCCTCTGCCTGGACTGTCAGTCGGTCATCAAAGATGTGGAGCAGCAGTTCAAATACA

I only included unique results and once I had the fasta file I ran

samtools faidx <ref.fa>

to create <ref.fa.fai>

However, when i run the command I get the following error:

[E::faidx_fetch_seq] The sequence "1" not found

over and over again. I checked the faidx file and this is how it looked:

ENSDARG00000033231|ENSDART00000000042|mcm6l|2|50199738|50225495 2585    65  60  61
ENSDARG00000000370|ENSDART00000000382|triob|16|8751815|8927425  8958    2758    60  61
ENSDARG00000076182|ENSDART00000000280|stat1b|9|41218967|41247197    4113    11932   60  61
ENSDARG00000000183|ENSDART00000000192|ptpn4b|22|11778053|11833334   6443    16181   60  61
ENSDARG00000000151|ENSDART00000000160|thraa|3|34688022|34753605 2292    22797   60  61
ENSDARG00000000241|ENSDART00000000250|slc40a1|9|41103127|41113878   3780    25195   60  61
ENSDARG00000000068|ENSDART00000000069|slc9a3r1a|12|33484458|33537126    2033    29109   60  61
ENSDARG00000000069|ENSDART00000000070|dap|24|22070290|22103585  1859    31240   60  61
ENSDARG00000000002|ENSDART00000000005|ccdc80|9|34089156|34113209    2604    33196   60  61
ENSDARG00000036830|ENSDART00000000221|krt91|19|5378563|5380770  1582    35908   60  61

As a reference, this is a sample of the top of the header of my .bam file:

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:58871917
@SQ SN:10   LN:45574255
@SQ SN:11   LN:45107271
@SQ SN:12   LN:49229541
@SQ SN:13   LN:51780250
@SQ SN:14   LN:51944548
@SQ SN:15   LN:47771147
@SQ SN:16   LN:55381981
@SQ SN:17   LN:53345113

I believe I may have a problem with mpileup or faidx not being able to locate chromosome 1 in either of the files. However, i'm not sure exactly in what way I should edit the files to fix this. Alternatively, should I be using a different file as my reference genome?

Thanks!

ADD COMMENTlink modified 22 months ago • written 22 months ago by dominicdhall40

I will give a try just having the chromosome number - however, how will mpileup know where these variants actually lie if it just has the chromosome and sequence? Does it work out it out?

As for the comment - sorry that is very ambiguous from me. It is essentially just because i couldnt remember exactly what I had called my reference fasta file when i wrote the question! I don't think I have been misplacing symbols all over the place..

Thanks!

ADD REPLYlink written 22 months ago by dominicdhall40

Please use the ADD COMMENT button when replying to someone's answer.

ADD REPLYlink written 22 months ago by h.mon29k
2
gravatar for h.mon
22 months ago by
h.mon29k
Brazil
h.mon29k wrote:

It can't locate the chromosomes because in the bam they are named 1, 2, and so on; and in the fasta reference they are named ENSDARG00000033231|ENSDART00000000042|mcm6l|2|50199738|50225495. You have to use the same reference used to map the files.

Just a comment / question, your example commands wrap file names around <>, like <ref.fa> and <out.vcf>. Do your real commands use <>, or are they used here just to demonstrate an example file name? Because > and < have special for bash, they redirect input and output, and should not be used in your commands.

ADD COMMENTlink modified 22 months ago • written 22 months ago by h.mon29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1231 users visited in the last hour