Question

How to obtain chromosome number from given scaffold number

8.9 years ago

Parimala Devi ▴ 100

Hi,

I am working on the Saccharomyces cerevisiae Y55 strain. I obtained my reference sequence from here, and this is how the Y55_Stanford_2014_JRIF00000000.fsa sequence looks. It doesn't include the chromosome number.

Reference:

>gi|696435221|gb|JRIF01000001.1| Saccharomyces cerevisiae Y55 scaffold-0, whole genome shotgun sequence [length=107844]
TTAAGCCTTCAAAGAAGAAGCTCTTCTCTTTCTGATTTCGGCCTTTTCAGCCTTTCTTTCAGACAATCTCTTAGCCAACA
ATTGAGCGTATTCGGCAGCAGCTTCTCTTTGAGCTTGAGCGTTTCTGACCTTCAAAGCTCTTTGGTGTCTCTTTCTTTGC

There's also a GATK SNV VCF included, which does have the chromosome names.

gatk.vcf:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Y55
chrI    111    .    C    T    155.12    PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=3.331;DP=28;Dels=0.00;FS=0.000;HaplotypeScore=12.7288;MLEAC=2;MLEAF=1.00;MQ=28.02;MQ0=0;MQRankSum=0.666;QD=5.54;ReadPosRankSum=2.596;SB=-1.912e+01    GT:AD:DP:GQ:PL    1/1:14,14:28:5:186,5,0
chrI    136    .    G    A    336    PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=1.625;DP=32;Dels=0.00;FS=7.270;HaplotypeScore=30.3526;MLEAC=2;MLEAF=1.00;MQ=28.13;MQ0=0;MQRankSum=-1.733;QD=10.50;ReadPosRankSum=1.083;SB=-9.901e+01    GT:AD:DP:GQ:PL    1/1:1,31:32:45:369,45,0
chrI    156    .    C    G    18.07    LowQual;SnpCluster;filter    AC=1;AF=0.500;AN=2;BaseQRankSum=1.146;DP=44;Dels=0.00;FS=7.776;HaplotypeScore=20.3077;MLEAC=1;MLEAF=0.500;MQ=30.62;MQ0=0;MQRankSum=2.856;QD=0.41;ReadPosRankSum=0.838;SB=-6.519e-03    GT:AD:DP:GQ:PL    0/1:38,6:44:48:48,0,469

Is there any way I can obtain the chromosome numbers so that I can call variants for my own data? I want to analyse SNPs and INDELs on each chromosome.

Thank you,
Parimala

SNP scaffold chromosome-number reference-genome • 2.8k views

Updated 15 months ago by Ram 43k • written 8.9 years ago by Parimala Devi ▴ 100

Ram · Answer 1 · 2015-05-28

The information about the reference sequences, including the chromosome each one belongs to, is stored in the accompanying .gff file:

http://downloads.yeastgenome.org/sequence/strains/Y55/Y55_Stanford_2014_JRIF00000000/Y55_JRIF00000000.gff.gz

You'll have to find a way to parse it to your liking, but here is a quick example bash command that prints only the reference sequence ID followed by its chromosome:

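# Decompress the download first: gunzip Y55_JRIF00000000.gff.gz
# Turn commas into tabs, print fields 1 and 11, and keep only lines starting with "g" (e.g. the gi|... sequence IDs)
sed 's/,/\t/g' Y55_JRIF00000000.gff | awk '{print $1,$11}' | grep "^g" > RefChromosomes.txt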
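If you would rather do the same parsing in a script, here is a minimal Python sketch that mirrors the pipeline above. It makes no assumption about the .gff beyond what the one-liner already does (commas become separators, field 1 is the sequence ID, field 11 is the chromosome). The function names `scaffold_to_field` and `read_gff` are made up for illustration. Unlike the shell version, it skips lines with fewer than 11 fields and keeps one entry per sequence ID instead of one output line per .gff record.

```python
import gzip


def scaffold_to_field(lines, field_index=10):
    """Map field 1 to the 11th whitespace-separated field of each line.

    Mirrors: sed 's/,/\\t/g' | awk '{print $1,$11}' | grep "^g"
    """
    mapping = {}
    for line in lines:
        # Commas become separators, then split on any run of whitespace (like awk).
        fields = line.replace(",", "\t").split()
        # Keep only records whose first field starts with "g" and that have an 11th field.
        if len(fields) > field_index and fields[0].startswith("g"):
            mapping[fields[0]] = fields[field_index]
    return mapping


def read_gff(path):
    """Read a plain or gzip-compressed .gff file and build the mapping."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return scaffold_to_field(handle)
```

Calling `read_gff("Y55_JRIF00000000.gff.gz")` would then return a dictionary keyed by sequence ID. Check a few entries against the .gff by eye before relying on it, since the field position is taken from the one-liner and not from the file's specification.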

Take a look at the format of the .gff file if you want to parse it differently. You can then link the reference entries to their respective chromosomes in your own code (e.g. add the chromosome as another field in the reference header line), split the FASTA file by chromosome and call variants on each one individually, and so on. I'm not sure whether there is a way to incorporate the .gff directly into the variant-calling pipeline, but there may be.
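As a rough sketch of those two ideas, the Python below tags each FASTA header with its chromosome and groups records by chromosome. It assumes you already have a dictionary from sequence ID to chromosome name (built however you parse the .gff), and that the first whitespace-delimited token of each FASTA header matches the IDs in that dictionary. Both are assumptions to verify against your files. The function names and the `[chromosome=...]` tag are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def relabel_fasta(fasta_lines, mapping):
    """Yield FASTA lines, appending the chromosome to each header found in mapping."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split()[0]  # first token after ">"
            chrom = mapping.get(seq_id)
            if chrom is not None:
                line = f"{line} [chromosome={chrom}]"
        yield line


def group_by_chromosome(fasta_lines, mapping, unplaced="unplaced"):
    """Collect FASTA records into lists of lines keyed by chromosome name."""
    groups = defaultdict(list)
    current = unplaced
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            # Scaffolds missing from the mapping go into a catch-all group.
            current = mapping.get(seq_id, unplaced)
        groups[current].append(line)
    return dict(groups)
```

Each group can be written to its own file and used as a per-chromosome reference. Note that scaffolds are kept as separate records within a group; this does not join them into a single chromosome sequence or change any coordinates.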