Question

SAM file wrong? help with validatesamfile

0

Entering edit mode

7.8 years ago

cristina_sabiers ▴ 110

First at all thanks for all the support and help...to a newbee

Following a tutorial I did manage to generate my sam file, the problem I get when I want to convert to bam, I cant, so I have been reading the best is to run validatesamfile, so I did

But at this point I get really lost ;_; whats the main reason my sam file is bad? Im biology not a programmer and is the first time I do this and learn on my own, so whatever help or link to tutorial will be grateful for me...

Thanks!!!

java -jar picard.jar ValidateSamFile \
>      I=out.sam \
>      MODE=SUMMARY
[Thu Jun 23 18:17:20 CEST 2016] picard.sam.ValidateSamFile INPUT=out.sam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Jun 23 18:17:20 CEST 2016] Executing as cri@cri-To-be-filled-by-O-E-M on Linux 4.4.0-21-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14; Picard version: 2.4.1(7c4d36e011df1aec4689b51efcada44e92d1817f_1464389670) JdkDeflater
[Thu Jun 23 18:17:20 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=494403584
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Invalid fastq character: 
    at htsjdk.samtools.SAMUtils.fastqToPhred(SAMUtils.java:407)
    at htsjdk.samtools.SAMUtils.fastqToPhred(SAMUtils.java:385)
    at htsjdk.samtools.SAMRecord.setBaseQualityString(SAMRecord.java:277)
    at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:326)
    at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:248)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:236)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:212)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:269)
    at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:200)
    at htsjdk.samtools.SamFileValidator.validateSamFileSummary(SamFileValidator.java:128)
    at picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:187)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

EXOME • 7.7k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Can you backtrack a bit and include commands that you used for the alignment and conversion to BAM? What software were you using for that part?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Thanks genomax!

I did this

cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa \
chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa \
chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa

bwa index -a bwtsw -p hg19 hg19.fa

bwa aln -t 4 -f input.sai -I hg19 input.fastq

single paired ION_EX..:

bwa samse -f out.sam -r "@RG\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:PL" \
hg19 input.sai input.fastq

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Did you replace <ID> <LIBRARY_NAME> and <SAMPLE_NAME> with real names (my guess is you did not)? <OPTION> is used in unix to indicate that a real name/string should be used in that location.

What does the single paired ION_EX mean?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Did you replace <library_name> and <sample_name> with real names (my guess is you did not)? No I didnt ;_;

What does the single paired ION_EX mean? Exome Ion Torrent, just a note for myself :)

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

That may be one of the issues. Can you show us the output of head -30 out.sam ?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

ok, I will try to do that step right, thanks a lot

mmm by the output of head -30 you mean like this? with this code?Im sorry... Im really newbee just 2 weeks working on this (before crashing my head to learn linux and install all the tools)

head -n 30 out.sam > head.csv

@SQ SN:chr1 LN:249250621                            
@SQ SN:chr2 LN:243199373                            
@SQ SN:chr3 LN:198022430                            
@SQ SN:chr4 LN:191154276                            
@SQ SN:chr5 LN:180915260                            
@SQ SN:chr6 LN:171115067                            
@SQ SN:chr7 LN:159138663                            
@SQ SN:chr8 LN:146364022                            
@SQ SN:chr9 LN:141213431                            
@SQ SN:chr10    LN:135534747                            
@SQ SN:chr11    LN:135006516                            
@SQ SN:chr12    LN:133851895                            
@SQ SN:chr13    LN:115169878                            
@SQ SN:chr14    LN:107349540                            
@SQ SN:chr15    LN:102531392                            
@SQ SN:chr16    LN:90354753                         
@SQ SN:chr17    LN:81195210                         
@SQ SN:chr18    LN:78077248                         
@SQ SN:chr19    LN:59128983                         
@SQ SN:chr20    LN:63025520                         
@SQ SN:chr21    LN:48129895                         
@SQ SN:chr22    LN:51304566                         
@SQ SN:chrX LN:155270560                            
@SQ SN:chrY LN:59373566                         
@SQ SN:chrM LN:16571                            
@RG ID:<ID> LB:<LIBRARY_NAME>   SM:<SAMPLE_NAME>    PL:PL                   
@PG ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa samse -f out.sam -r @RG\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:PL h   19 input.sai input.fastq                
CY4KN:00822:08321   16  chr1    10174   15  53M *   0   0   CTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
                                   
!    RG:Z:<ID>   XT:A:U  NM:i:0  X0:i:1  X1:i:7  XM:i:0  XO:i:0  XG:i:0  MD:Z:53

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

That looks good except I don't see any quality values. Can we also look at head -6 input.fastq output?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

head -n 6 input.fastq > head2.csv

@CY4KN:00822:08321                      
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGTTAG                       
+                       
<7=48<<4:5888*55888284:<>2:547<:38=<@2:3/-)-)--
@CY4KN:03979:12865                      
AGGGTTAGGGTTAGGGTTAGGGGTTAGGGATAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTTAG

Hope this is right....I know...I still need to learn a lot

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

There is something strange with the first record (if the copy/paste has preserved the content). There appear to be more nucleotides than Q-scores. Do you see that for later records as well?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

I don't know. Are you allowed to have RG tags with "<"?

ADD REPLY • link 7.8 years ago by igor 13k

0

Entering edit mode

Hopefully @cristina is re-running the samse step with real names (since this is Ion data should not be very large files) so that should address the "<" issue. The original error was referring to invalid fastq characters (which was what I was going after first).

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Thanks! Sadly bc Im new I need to wait 5 hours to next post ;_;

As genomax told I re-running the samse step, I tried again to export to bam, but doesnt work.

samtools import hg19.fa out.sam out.bam

[samopen] SAM header is present: 25 sequences.
Parse error at line 28: sequence and quality are inconsistent
Aborted (core dumped)

head -n 30 out.sam > head.csv

@SQ SN:chr1 LN:249250621    
@SQ SN:chr2 LN:243199373    
@SQ SN:chr3 LN:198022430    
@SQ SN:chr4 LN:191154276    
@SQ SN:chr5 LN:180915260    
@SQ SN:chr6 LN:171115067    
@SQ SN:chr7 LN:159138663    
@SQ SN:chr8 LN:146364022    
@SQ SN:chr9 LN:141213431    
@SQ SN:chr10    LN:135534747    
@SQ SN:chr11    LN:135006516    
@SQ SN:chr12    LN:133851895    
@SQ SN:chr13    LN:115169878    
@SQ SN:chr14    LN:107349540    
@SQ SN:chr15    LN:102531392    
@SQ SN:chr16    LN:90354753 
@SQ SN:chr17    LN:81195210 
@SQ SN:chr18    LN:78077248
@SQ SN:chr19    LN:59128983
@SQ SN:chr20    LN:63025520
@SQ SN:chr21    LN:48129895
@SQ SN:chr22    LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571    
@RG ID:FLOWCELL2    LB:LIB-KID-1    SM:KID1 PL:ION
@PG ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa samse -f out.sam -r @RG\tID:FLOWCELL2\tLB:LIB-KID-1\tSM:KID1\tPL:ION hg19 input.sai input.fastq
CY4KN:00822:08321   16  chr1    10174   15  53M *   0   0   CTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
!   RG:Z:FLOWCELL2  XT:A:U  NM:i:0  X0:i:1  X1:i:7  XM:i:0  XO:i:0  XG:i:0  MD:Z:53

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

1

Entering edit mode

It appears that your original data file is malformed (number of bases =/= number of Q-scores). Can you run fastqValidator and see how bad the problem is? Can you get another copy of the original data file in case the one you have is corrupt?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Finally I managed to run fastqvalidator..

I think the report looks not so good, is any way to fix all of that?

Thanks!!!

cri@cri-To-be-filled-by-O-E-M:~/Desktop/fastQValidator/bin$ ./fastQValidator --file /home/cri/Desktop/hg19/input.fastq

ERROR on Line 11: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 1731: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 4743: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 4747: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 4751: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 4755: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 4795: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 7531: Raw Sequence is shorter than the min read length: 8 < 10

ERROR on Line 7543: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 7611: Raw Sequence is shorter than the min read length: 8 < 10

ERROR on Line 7711: Raw Sequence is shorter than the min read length: 8 < 10

ERROR on Line 7763: Raw Sequence is shorter than the min read length: 8 < 10

ERROR on Line 8367: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 8523: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 8559: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 8603: Raw Sequence is shorter than the min read length: 8 < 10

ERROR on Line 8663: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 8711: Raw Sequence is shorter than the min read length: 8 < 10

ERROR on Line 8767: Raw Sequence is shorter than the min read length: 9 < 10

ERROR on Line 8783: Raw Sequence is shorter than the min read length: 8 < 10

Finished processing /home/cri/Desktop/hg19/input.fastq with 181246044 lines containing 45311511 sequences. There were a total of 112895 errors. Returning: 1 : FASTQ_INVALID

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Those are just short reads than the default (because by default fastqvalidator seems to be looking for a minimum of 10 bases, you can probably change that on the command line, take a look at fastqvalidator -h).

I am surprised that there was no flag for the example you had posted above where there were fewer Q-scores than bases in that particular read.

Returning: 1 : FASTQ_INVALID

Not sure if that is saying that one example you posted is the only invalid sequence. Can you delete that and see if

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Thanks Geomax, i was checking in tutorial how and I checked twice the file with different commands

I have been reading when you get the file from illumina, Ion ...etc there exist different programs to check the quality and filter...like PRINSEQ , I always need to do that process before I aligning the sequences to the human genome if I get like in this case FASTQ_INVALID??

First

./fastQValidator --file /home/cri/Desktop/hg19/input.fastq --minReadLen 40 --baseComposition --avgQual --disableSeqIDCheck --quiet --params ON --maxErrors -1

Overall Average Phred Quality = 21.74 (I only put the result....this forum dont allow me write so many characters)

Second ./fastQValidator --file /home/cri/Desktop/hg19/input.fastq --minReadLen 40

ERROR on Line 11: Raw Sequence is shorter than the min read length: 9 < 40

ERROR on Line 15: Raw Sequence is shorter than the min read length: 13 < 40

ERROR on Line 203: Raw Sequence is shorter than the min read length: 31 < 40

ERROR on Line 1707: Raw Sequence is shorter than the min read length: 38 < 40

ERROR on Line 1711: Raw Sequence is shorter than the min read length: 39 < 40

ERROR on Line 1715: Raw Sequence is shorter than the min read length: 29 < 40

ERROR on Line 1719: Raw Sequence is shorter than the min read length: 22 < 40

ERROR on Line 1731: Raw Sequence is shorter than the min read length: 9 < 40

ERROR on Line 3459: Raw Sequence is shorter than the min read length: 38 < 40

ERROR on Line 3475: Raw Sequence is shorter than the min read length: 35 < 40

ERROR on Line 3483: Raw Sequence is shorter than the min read length: 33 < 40

ERROR on Line 3607: Raw Sequence is shorter than the min read length: 16 < 40

ERROR on Line 3611: Raw Sequence is shorter than the min read length: 10 < 40

ERROR on Line 3615: Raw Sequence is shorter than the min read length: 10 < 40

ERROR on Line 3627: Raw Sequence is shorter than the min read length: 23 < 40

ERROR on Line 3959: Raw Sequence is shorter than the min read length: 37 < 40

ERROR on Line 3979: Raw Sequence is shorter than the min read length: 12 < 40

ERROR on Line 4123: Raw Sequence is shorter than the min read length: 34 < 40

ERROR on Line 4231: Raw Sequence is shorter than the min read length: 32 < 40

ERROR on Line 4647: Raw Sequence is shorter than the min read length: 12 < 40

Finished processing /home/cri/Desktop/hg19/input.fastq with 181246044 lines containing 45311511 sequences. There were a total of 2340507 errors. Returning: 1 : FASTQ_INVALID

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

It's throwing an error because it's getting an ascii value in the quality column either < 33 or > 126. How was this SAM file made? Does it look reasonable if you scroll through it a bit?

ADD REPLY • link 7.8 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks Devon, due Im in spain I write with bit delay :P

I followed this tutorial

http://seqanswers.com/wiki/How-to/exome_analysis

My steps were these to get my sam file

cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa
chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa
chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa

bwa index -a bwtsw -p hg19 hg19.fa

bwa aln -t 4 -f input.sai -I hg19 input.fastq  (take 5 hours)

bwa samse -f out.sam -r "@RG\tID:FLOWCELL2\tLB:LIB-KID-1\tSM:KID1\tPL:ION" hg19 input.sai input.fastq

I have few fastq files to use (mostly are 16-20 gb), yesterday night I tried the same steps with another fastq file in this case I get like same results, probably I do something wrong :(

I tried to convert my sam file to bam with two diferents paths...so i got

FIRST this shorter way

samtools import hg19.fa out.sam out.bam

[samopen] SAM header is present: 25 sequences.
Parse error at line 28: sequence and quality are inconsistent
Aborted (core dumped)

SECOND WAY. from the tutorial..seems even worst, think I need to fix more things than just the sam file O_o

java -Xmx4g -Djava.io.tmpdir=/tmp \
> -jar /home/cri/Desktop/picard1/SortSam.jar \
> SO=coordinate \
> INPUT=out.sam \
> OUTPUT=output.bam \
> VALIDATION_STRINGENCY=LENIENT \
> CREATE_INDEX=true
[Sat Jun 25 12:13:34 CEST 2016] picard.sam.SortSam INPUT=out.sam OUTPUT=output.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true    VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false
OpenJDK 64-Bit Server VM warning: You have loaded library /home/cri/Desktop/picard1/libIntelDeflater.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
[Sat Jun 25 12:13:34 CEST 2016] Executing as cri@cri-To-be-filled-by-O-E-M on Linux 4.4.0-24-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14; Picard version: 1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805) IntelDeflater
Ignoring SAM validation error due to lenient parsing:
Error parsing text SAM file. length(QUAL) != length(SEQ); Line 28
Line: AHB1V:06972:11426 4   *   0   0   *   *   0   0GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGTTAGGGTTAGGTTAGGTAGG    
                                    
                                         
                                              
                                                        
[Sat Jun 25 12:13:34 CEST 2016] picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=494403584
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Invalid fastq character: 
    at htsjdk.samtools.SAMUtils.fastqToPhred(SAMUtils.java:419)
    at htsjdk.samtools.SAMUtils.fastqToPhred(SAMUtils.java:398)
    at htsjdk.samtools.SAMRecord.setBaseQualityString(SAMRecord.java:258)
    at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:319)
    at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:247)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:235)
    at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:211)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488)
    at picard.sam.SortSam.doWork(SortSam.java:74)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
    at picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:124)
    at picard.sam.SortSam.main(SortSam.java:61)

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Can you post the first few alignments in the SAM file (i.e., skip the lines starting with @)?

BTW, samtools view -Sbo output.bam input.sam is the modern version of the samtools command you wrote. I'm not entirely sure that samtools import works anymore, the command was renamed years ago.

ADD REPLY • link 7.8 years ago by Devon Ryan 104k

2

Entering edit mode

@Devon: @cristina has a fastq file with at least one sequence where the Q-scores are less than the bases. That example is in this thread if you go up.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Ah, there's the problem!

ADD REPLY • link 7.8 years ago by Devon Ryan 104k

0

Entering edit mode

ok...so If I understand right my main problem is I got a fastq file with less quality scores than nucleotides

I have been checking how delete that part, and found can use fastx-toolkit, I don't know if I do right at all

fastq_quality_trimmer -t N -i input.fastq -o out.fastq

fastq_quality_trimmer: Missing minimum quality threshold value (-t)

mmm which value I should use 48?

I can use something else where I copy and select the part I want to delete?

Thanks a lot for the help!

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Best option would be to use reformat.sh from BBMap suite to remove the broken reads. You will use it like (no spaces between the = other words)

reformat.sh in=/path_to/your.fastq out=/path_to/fixed.fastq tossbrokenreads=t

You will then need to redo the alignments.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

thanks a lot!!! :)))) I own you a big coffe!!!

Afther doing reformat have same quality scores than nucleotides :)

Now I do aligment once more...in my pc will take around 5 hours, so I m looking forward for the results and hope I can continue and learn how to call snps and generate an excel with predictors. One of my main goals is to generate a report with de novo mutations... well... step by step :P

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Since you have BBMap installed on your machine you can look into using bbmap.sh an alternative for bwa for alignments. You can go from your fastq file directly to an aligned BAM (as long as you have samtools available on your machine).

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

I tried with

reformat.sh in=/path_to/your.fastq out=/path_to/fixed.fastq tossbrokenreads=t

Seems didn't work, after I did alignment and got same error, I should have check twice the fastq before doing it. At first sight look like had same quality scores than nucleotids... but it isnt. :(

head -n 6 fixed.fastq > head.csv


@CY4KN:00822:08321
GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGTTAG
+
<7=48<<4:5888*55888284:<>2:5;;;4;7<:;3;8=<@2:3/-)-)--

@CY4KN:03979:12865
AGGGTTAGGGTTAGGGTTAGGGGTTAGGGATAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTTAG

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

That fastq record looks ok to me.

Are there extra blank lines in your fastq file after each line (I removed them when I reformatted the post above)?

Can you try bbmap.sh to do an alignment?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

I added this blank spaces on pour pose to see better, when i copy and paste the file appear all in one line

you are right are same quality scores, I count them now....stupid of me, at my excell looks like much more less...(quality fonts are smaller)

I will try with bbmap.sh ! Thanks

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Did you try @Devon's suggestion for current samtools command to covert sam to bam? Did the following produce same error?

samtools view -Sbo output.bam input.sam

If you do decide to map with bbmap.sh then as long as samtools is available in your $PATH then BBMap can directly create bam files (saving you a step). Be sure to use:

bbmap.sh in=your.fastq out=your.bam options_for_bbmap_as_needed

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

yes, I tried devon advice, didnt work.

mmm nice to know the shortcut...Im too late, I have used :

./bbmap.sh in=/home/cri/Desktop/hg19/copia/fixed.fastq ref=/home/cri/Desktop/hg19/copia/hg19.fa out=/home/cri/Desktop/hg19/copia/mapped.sam nodisk

my cpu fan run fast XD need buy one that make less noisy, hope will not burn out o_O

THANKS!

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Is the mapping running now? Once done, try the sam conversion to bam on the file produced by BBMap.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

yes...still running I asume will take 6 hours like did with the other tools, hope this time will be fine!....we will see tomorrow, time for me to go to sleep...zzz...

thanks! :)

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Perhaps BBMap is running with a single thread (I don't remember if it uses all cores by default or just one). Next time add threads=N (N cores in your processor) option to your command to speed things up a bit.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

gosh need do all again, my father did shut down the pc..... ;;;;;____;;;;;;

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Can't help you with that one :-D

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

XD sadly no...

Now I put this code, so I hope this time I get the file !!!

./bbmap.sh ref=/media/cri/CRIS_DATA/hg19.fa in=/media/cri/CRIS_DATA/fixed.fastq out=/media/cri/CRIS_DATA/out.bam nodisk threads=4

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Just by curisosity...

how long time takes bbmap to give my bam file? Is still working with 100% my 4 cores, and use 23 gb ram...and until now, 6 hours has process a part of bam, 2.3 gb....thats normal?

How big is going to suposse to be my bam file if the fastq is 15 gb?

thanks

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

As long as the job is running (and the output file is increasing in size) leave it alone. Hard to tell what sized BAM you will end up with at the end.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

ok, thanks :)

thought was some kind of average...is bit bad I cant see % of running on this kind of jobbs...

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

You could figure out how many reads have been aligned by looking in the BAM but since all 4 cores on your machine are busy aligning the reads let us not try that now.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

I GOT MY FIRST BAM FILE!!!! YES YES YES !!!! ^^

Think took around 10 hours...Hope my file is fine and I can use it !!

Now I sort and index it with this code

samtools sort out.bam out.sorted

samtools index out.sorted.bam out.sorted.bai

Bbmap do it too?

If everything go fines I hope I can visualize it at IGV :)

Thanks a lot genomax!

"BAM�@HD  VN:1.4  SO:unsorted"

"@SQ    SN:chr1 LN:249250621"

"@SQ    SN:chr2 LN:243199373"

"@SQ    SN:chr3 LN:198022430"

"@SQ    SN:chr4 LN:191154276"

"@SQ    SN:chr5 LN:180915260"

"@SQ    SN:chr6 LN:171115067"

"@SQ    SN:chr7 LN:159138663"

"@SQ    SN:chr8 LN:146364022"

"@SQ    SN:chr9 LN:141213431"

"@SQ    SN:chr10    LN:135534747"

"@SQ    SN:chr11    LN:135006516"

"@SQ    SN:chr12    LN:133851895"

"@SQ    SN:chr13    LN:115169878"

"@SQ    SN:chr14    LN:107349540"

"@SQ    SN:chr15    LN:102531392"

"@SQ    SN:chr16    LN:90354753"

"@SQ    SN:chr17    LN:81195210"

"@SQ    SN:chr18    LN:78077248"

"@SQ    SN:chr19    LN:59128983"

"@SQ    SN:chr20    LN:63025520"

"@SQ    SN:chr21    LN:48129895"

"@SQ    SN:chr22    LN:51304566"

"@SQ    SN:chrX LN:155270560"

"@SQ    SN:chrY LN:59373566"

"@SQ    SN:chrM LN:16571"

"@PG    ID:BBMap    PN:BBMap    

VN:36.11    CL:java -Djava.library.path=/home/cri/Desktop/bbmap/jni/ -ea -Xmx25019m align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/media/cri/CRIS_DATA/hg19.fa in=/media/cri/CRIS_DATA/fixed.fastq out=/media/cri/CRIS_DATA/out.bam nodisk threads=4"

chr1=C�chr2��~chr3��chr4d�dchr5<��
chr6;3
"chr7gC|   chr8vV�chr9�jchr10�chr114  chr12�j�chr13VZ�chr14$fchr15@�chr16A�bchr17���chr18@]�chr19�<�chr20p��chr21gg�chr22v�chrX�=A    chrY��chrM�@"

ADD REPLY • link updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Assuming you are using the latest samtools the commands would be slightly different

samtools sort -o sorted.bam in.bam

samtools index sorted.bam

With a large BAM file it may take a couple of hours to sort. So be patient.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

oh thanks... I note it..for me worked with the other code...now I see my bam file at IGV ^^ next step will be call snps....probably I will end crying XD

I asked to the person who provided me my fastq file and the bam as well, why my bam file is just 5 gb and their bam file is 17-20 gb ....he says depends all on number of alignments...

thanks!

ADD REPLY • link 7.8 years ago by cristina_sabiers ▴ 110

0

Entering edit mode

Sorted BAM files compress better so if your BAM is sorted them it may look smaller than the other one. Can you post the lines from your BBMap alignment stats to see how well it did? This is ion data and as such more prone to errors.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Those strange characters look like an artifact of copy/paste so let us ignore those for the moment.

ADD REPLY • link 7.8 years ago by GenoMax 141k