Question: How can I work with GATK and cutadapt generating diferent sizes of input?
0
gravatar for whiteeagle115
3.6 years ago by
Brazil
whiteeagle11510 wrote:

Hi!

I am having some problens in work with GATK. The first thing is my sequences are paired-end and I nedd to pass in cutadapt to cut the parts that are not good, but the sequence_1 and sequence_2 stayed with differents sizes.

After that I just passed in bowtie2 to map and after I passed in samtools to transform in .sam, .bam and sort the sequences.

In the end, when I put GATK to run, they run the firste part: AddOrReplaceReadGroups, then give a error that saus that one or more argumments or inputs im my command are wrong. But I always use the same script to run java (if end-to-end and paired-end) and work.

The only thing I do this time that changes is that I use cutadapt and generated a different two input sizes (but I need to cut, because the sequences i am using are not good).

Someone knows something about that?

Thanks a lot

ADD COMMENTlink modified 3.6 years ago by Devon Ryan84k • written 3.6 years ago by whiteeagle11510

The error isn't caused by the difference in read sizes. Post the error message for more useful help.

ADD REPLYlink written 3.6 years ago by Devon Ryan84k

Devon, sorry.

Now the error :

"Chromosome_1.14:2.120.543
INFO    2015-03-07 19:46:52    AddOrReplaceReadGroups    Processed    87.000.000 records.  Elapsed time: 00:16:19s.  Time for last 1.000.000:   11s.  Last read position: Chromosome_1.14:2.407.506
INFO    2015-03-07 19:47:04    AddOrReplaceReadGroups    Processed    88.000.000 records.  Elapsed time: 00:16:30s.  Time for last 1.000.000:   11s.  Last read position: Chromosome_1.14:2.653.997
INFO    2015-03-07 19:47:15    AddOrReplaceReadGroups    Processed    89.000.000 records.  Elapsed time: 00:16:42s.  Time for last 1.000.000:   11s.  Last read position: Chromosome_1.14:2.912.133
[Sat Mar 07 19:47:25 BRT 2015] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 16,87 minutes.
Runtime.totalMemory()=211812352
INFO  19:48:48,620 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:48:48,622 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  19:48:48,622 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  19:48:48,622 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  19:48:48,626 HelpFormatter - Program Args: -T RealignerTargetCreator -R plasmodium_vivax_salvador_i_1_supercontigs.fasta -I M08.q30.nodup.bam.group.bam -o M08.q30.nodup.bam.realigner.intervals 
INFO  19:48:48,629 HelpFormatter - Executing as Crippa@MacBook-Pro-de-Thais.local on Mac OS X 10.10.2 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26. 
INFO  19:48:48,630 HelpFormatter - Date/Time: 2015/03/07 19:48:48 
INFO  19:48:48,630 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:48:48,630 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:48:48,940 GenomeAnalysisEngine - Strictness is SILENT 
INFO  19:48:48,998 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  19:48:49,004 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  19:48:49,015 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  19:48:49,118 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  19:48:49,133 GenomeAnalysisEngine - Done preparing for traversal 
INFO  19:48:49,133 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  19:48:49,134 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  19:48:49,134 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  19:48:53,836 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/Users/Crippa/Desktop/Test_gatk/M08.q30.nodup.bam.group.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66; please see the GATK --help documentation for options related to this error
##### ERROR ------------------------------------------------------------------------------------------
INFO  19:48:55,308 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:48:55,309 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  19:48:55,310 HelpFormatter - Copyright (c) 2010 The Broad Institute "

 

 

 

ADD REPLYlink written 3.6 years ago by whiteeagle11510
1

The error message says that one of the quality scores in an alignment is aberrantly high. This would normally only happen if (1) you either incorrectly specified the quality encoding when performing alignments, (2) the SAM/BAM file is corrupt or (3) one of the fastq files is corrupt or otherwise has an error. The a combination of subsetting the file and grep/awk should allow you to find the problematic alignment.

ADD REPLYlink written 3.6 years ago by Devon Ryan84k

The error was thrown right in the beginning as indicated by the ProgressMeter so I would go with reason number (1) as mentioned by Devon above. This is just my guess :-)

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Ashutosh Pandey11k

When you said that the problem is in the quality, I just figure that sequences I'm working can be in anouther format. And this is the case: sequences were sequenced in Illumina 1.3+/1.5+, then the codification is different (is up to 100). So, do you know some program or script that converter to the codification for Illumina 1.8+. Thanks, guys

 

 

 

 

 

ADD REPLYlink written 3.6 years ago by whiteeagle11510

I feel like I wrote a script to do this at some point but don't know where I posted it online. In any case, the conversion is a simple decrement (subtract 31 from each value), so you should be able to write a converter easily enough. If not, you can always redo the alignments with the proper settings.

ADD REPLYlink written 3.6 years ago by Devon Ryan84k

http://www-huber.embl.de/users/anders/HTSeq/doc/sequences.html#sequences (Check write_to_fastq_file) 

https://github.com/lh3/seqtk

 

ADD REPLYlink written 3.6 years ago by Ashutosh Pandey11k
0
gravatar for Devon Ryan
3.6 years ago by
Devon Ryan84k
Freiburg, Germany
Devon Ryan84k wrote:

Since people have run into problems like this before (and I can't find the script that I think I wrote before), I wrote a program to do this. It will take a SAM/BAM/CRAM file with phred+64 qualities (i.e., illumina 1.3+ or 1.5+) and convert the scores to be phred+33, which is required by the BAM format. You can convert SAM/BAM/CRAM to either BAM or CRAM in the process, should that be useful for you. This will save the hassle of realignment correctly, though that might be a better idea given that some alignments could change.

ADD COMMENTlink written 3.6 years ago by Devon Ryan84k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1422 users visited in the last hour