Question

Exome Sequencing Alignment Question

0

Entering edit mode

10.6 years ago

ivivek_ngs ★ 5.2k

Can anyone tell me the which qualities should be used for the alignment

bwa aln -t 4 -f input.sai -I hg19 input.fastq

I give the bwa option to use Illumina1.3+ qualities. I am trying to run a test with one of the samples of 1000 genome project. Can anyone give me an idea how will I know which quality I should be using here? Also it would be nice if anyone can let me know that for Illumina 1.8+ qualities how will the command change? or 1.3+ indicates that it takes care of the higher versions as well. Please guide

illumina exome-sequencing exome • 3.5k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 10.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

The command you have mentioned "bwa aln -t 4 -f input.sai -I hg19 input.fastq" doesnt make sense to me. Read the BWA manual from here http://bio-bwa.sourceforge.net/bwa.shtml. This thread (Guessing the quality scale in FASTQ files) will help you to guess the read quality format.

ADD REPLY • link 10.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

No this is a generic command I have put from a manual I got. But I want to understand some stuffs from it. Ok I have some more doubts. I am using the below command to make the hg19 reference genome. cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa

and am getting the error below. /scratch/GT/vdas/test_exome$ chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa -clustershell: chr10.fa: command not found vdas@dpt-n2:/scratch/GT/vdas/test_exome$ chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa -clustershell: chr19.fa: command not found

Can you tell me how to understand why this error is happening and how to debug it? I have checked that the chr19.fa and chr10.fa are there in the directory but still am getting the error

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

You are copy and pasting the command on your command line and it gets split into three lines and the command line assumes that chr10.fa is some kind of command. Paste the entire command in a notepad and then make sure that the whole command is in one line. Then copy it from there and paste it on your command line. It should work then.

ADD REPLY • link 10.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

yes I feel the same.. thanks a lot. I would like to ask that in this chrmFa zip file I see after unzipping a lot other chr files, do we have to concatenate them as well to build the reference genome?

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

I have no idea what that file is. Can you post the link from where you downloaded it.

ADD REPLY • link 10.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I have download it from the below link http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

You better download the human reference sequence from GATK. You may have to use GATK at one point. They provide the readymade hg19. Go to this link and follow the instructions and only download the reference fasta.

See this thread: Where can I download human reference genome in fasta format? hgref.fa file

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 10.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Hi @ashutoshmits Hi, I need some suggestions, I am new to GATK and want to use it for my exome sequencing data analysis. I have been a bit lost reading all the blogs , comments and the technical forums. So here is something I want to say and please correct and guide me through the procedure. I have downloaded the hg19 files from the UCSC browser and created the reference genome but do I need to again use the one which is there in GATK repository and then align my samples for downstream analysis? Also I want to run the GATK in my institute cluster. So if am not wrong I should create the directory of the latest GATK version and transfer all the necessary files via Filezilla in the cluster directory with the same name. Now this I have already done. So next thing is to download the bundle from the repository where I see 2 versions , so which one should I download? 2.5? or less? Also once I download the bundle do I have to download anything else? So here it is which I should be downloading right in my cluster. The jar file and the resource folder with the .java files and then in the main directory of the GATK version I should download the bundle version 2.5 or 2.3? and then unzip all the files that are there in the bundle directory. Right? Please let me know. Then I should be ready to use it right?

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Download the reference from GATK if you wish to use GATK for your analysis. Download latest GATK version then unzip the file in any directory. You can always give full path to the GATK jar file whenever u want to use it. You should be ready to go then.

ADD REPLY • link 10.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thank you for the reply ashutosh.. but I need to ask you that since my reference genome is build on the hg19 I donwloaded from UCSC so I shall have to again redo all the alignment steps from beginning else I will not be able to use the ref genome from the GATK repository right? Otherwise I will not be able to continue the next steps

Identify target regions for realignment (Genome Analysis Toolkit) ->Realign BAM to get better Indel calling (Genome Analysis Toolkit) ->Reindex the realigned BAM (SAM Tools) ->Call Indels (Genome Analysis Toolkit) ->Call SNPs (Genome Analysis Toolkit)->View aligned reads in BAM/BAI (Integrated Genome Viewer)

Please let me know if this looks correct or not. The VCF files from the 1kG and the DBSNP are already there in compressed form in the bundle repository of the GATK website which I am currently downloading and I can use them directly after unzipping them.

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Hey if you used UCSC that is fine too. You can continue with the steps that you mentioned. GATK ask your bam file to have chromosomes sorted out in a particular order. Make sure you follow it. Try searching for the format nd you will get it. Best of Luck.

ADD REPLY • link 10.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Yes I check the order and the order is the chrM should be at the end after the 22 chr and then chr X and chrY and I have build the reference genome like that only. So I guess it should not be an issue to start the analysis using GATK. I am now downloading the bundle from the GATK repository in my cluster and once it is done I will unzip the files and then continue my analysis for the variant calls

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k

score 1 · Answer 1 · 2013-10-03

1

Entering edit mode

10.6 years ago

Irsan ★ 7.8k

You can use this perl script to determine the qual encoding of a fastq file

#!/usr/bin/perl
use strict;
use warnings;

# Usage: gunzip -dc fastq.gz | awk '{if(NR%4==0)print $0}' | getQualEncoding.pl 

my $encoding = "unknown";
my $qual = "";

while(<>){
    if($. > 10000){
        last;
    }    
    my $line = $_;
    chomp($line);
    $qual = $qual . $line;
}

determineEncoding($qual);



###################################################################################
sub determineEncoding{
    $qual = $_[0];
    if($qual =~ /0/){
        if($qual =~ /J/){
            print("Illumina 1.8+ Phred+33 (0,41)\n");
        }
        else {
            print("Sanger Phred+33 (0,40)\n");
        }
    }
    else {
        if($qual =~ /=/){
            print("Solexa Solexa+64 (-5,40)\n");
        }
        elsif($qual =~ /A/){
            print("Illumina 1.3+ Phred+64 (0,40)\n");
        }
        else{
            print("Illumina 1.5+ Phred+64 (3,40)\n");
        }
    }
}

ADD COMMENT • link 10.6 years ago by Irsan ★ 7.8k

0

Entering edit mode

Hi, I need some suggestions, I am new to GATK and want to use it for my exome sequencing data analysis. I have been a bit lost reading all the blogs , comments and the technical forums. So here is something I want to say and please correct and guide me through the procedure. I have downloaded the hg19 files from the UCSC browser and created the reference genome but do I need to again use the one which is there in GATK repository and then align my samples for downstream analysis? Also I want to run the GATK in my institute cluster. So if am not wrong I should create the directory of the latest GATK version and transfer all the necessary files via Filezilla in the cluster directory with the same name. Now this I have already done. So next thing is to download the bundle from the repository where I see 2 versions , so which one should I download? 2.5? or less? Also once I download the bundle do I have to download anything else? So here it is which I should be downloading right in my cluster. The jar file and the resource folder with the .java files and then in the main directory of the GATK version I should download the bundle version 2.5 or 2.3? and then unzip all the files that are there in the bundle directory. Right? Please let me know. Then I should be ready to use it right?

ADD REPLY • link 10.6 years ago by ivivek_ngs ★ 5.2k