Question

Sorting my .bam with samtools produces differences in the fastq files generated afterwards

0

Entering edit mode

7.3 years ago

guillepalou4 ▴ 20

Hello everybody,

Let me give you some background so you can understand my problem.

1- I produced a normal .BAM file.

2- I extracted ONLY the unmapped reads, which are 29M reads. It looks like this (first line):

HX6_24184:8:1205:14783:73264    141     *       0       0       55S22M74S       *       0       0       TCAAGAAGTTTTAGCAGAAGAAATTCCAATGCTTTTATTATATGGAGAAATTGAAAATACAGTTTATAGACCAGAAAAATATGATTATTGGACAACTAGATATGACCATACTAAACTAGATCATCCTAAATTATCATATGTAATAAGACCA AAAAFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJFFFFJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJFJJFJJJFFJJJJJJJJJFFJAJFJJJJJJJ<JJJJFJJJJJJFFJAAFAJFJJA RX:Z:AAACACCCAATGAAAC   QX:Z:-AFFFJJJJFJJJJJJ   BC:Z:CGTATCGG   QT:Z:AAF<AFJJ   XS:f:-138       XC:Z:   AC:Z:   AS:f:-137       XM:A:0  AM:A:0  XT:i:0  BX:Z:AAACACCCAATGAAAC-1 RG:Z:HAP6977146:LibraryNotSpecified:1:unknown_fc:0

3- I wanted to convert the unmapped.bam to paired-end fastq files (R1, R2 and singletons).

To convert the .BAM to .fastq I used samtools fastq:

samtools fastq -T BX -s ./singletons.fastq ./phased_possorted_unmapped_bf0x4_bam.bam -1 phased_possorted_unmapped_bf0x4_R1.fastq -2 phased_possorted_unmapped_bf0x4_R2.fastq

From the output I have the following sizes and number of reads:

phased_possorted_unmapped_bf0x4_R1.fastq --> 2.1GB and 7M reads
phased_possorted_unmapped_bf0x4_R1.fastq --> 2.4GB and 7M reads
singletons.fastq --> 4.9GB and 15M reads

Then I wanted to do exactly the same, but with my .BAM sorted by read name before doing this step. I used samtools sort -n of my .bam:

samtools sort -n ./phased_possorted_unmapped_bf0x4_bam.bam -o ./phased_possorted_unmapped_bf0x4_sorted_bam.bam

Then I used samtools fasq again and these are the results:

phased_possorted_unmapped_bf0x4_sorted_R1.fastq --> 3.7GB and 12.5M reads
phased_possorted_unmapped_bf0x4_sorted_R1.fastq --> 4.2GB and 12.5M reads
sorted_singletons.fastq --> 1.4GB and 4M reads

I don't understand why I have these differences in the number of reads for the fastq files. Sorting should not modify anything and I should have the same number of reads regardless the sorting.

I would appreciate if someone knows what is happening.

next-gen samtools bam sort fastq • 4.4k views

ADD COMMENT • link updated 7.3 years ago by Devon Ryan 105k • written 7.3 years ago by guillepalou4 ▴ 20

0

Entering edit mode

guillepalou4 : Please don't change the tag on this question back to tool. tool tag is generally reserved for posts that are about new tools (not for questions, like this one, about existing tools).

ADD REPLY • link 7.3 years ago by GenoMax 152k

0

Entering edit mode

Hey, sorry I didn't put it back, I just updated my post because I had an error in it. Thanks for the advice though.

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20

0

Entering edit mode

Gah, you rebroke the formatting :(

ADD REPLY • link 7.3 years ago by Devon Ryan 105k

score 5 · Accepted Answer · 2018-03-12

5

Entering edit mode

7.3 years ago

Devon Ryan 105k

It's not actually documented as far as I can tell, but samtools fastq expects the BAM file to be name sorted. You'll otherwise get odd results, since part of what the tool does is compare read names between successive entries.

ADD COMMENT • link 7.3 years ago by Devon Ryan 105k

0

Entering edit mode

Oh, I didn't know that, thanks! I also checked the sorted vs unsorted bam and they have different number of lines. I wouldn't expect that.

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20

1

Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY • link 7.3 years ago by WouterDeCoster 48k

0

Entering edit mode

Do you also know, why do I have different number of characters in the same R1 file for example, after using samtools fastq?

I have 1495694956 number of nucleotides and 1494905216 number of quality scores.

It doesn't make sense.

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20

1

Entering edit mode

That is strange indeed. Can you try using reformat.sh in=your.bam out1=new_R1.fq out2=new_R2.fq (from BBMap Suite) to see if that fares better?

If you know that there are singletons (missing corresponding pair) in the file then you may need to run repair.sh in1=new_R1.fq in2=out1=new_R2.fq out1=clean_R1.fq out2=clean_R2.fq outs=singleton.fq to re-sync the data.

ADD REPLY • link 7.3 years ago by GenoMax 152k

0

Entering edit mode

Thanks for the answer. I would prefer using samtools fastq to generate my fastq as It's the only tool I've found I can retrieve an specific TAG from the bam (I need to retrieve a BX: tag which is a barcode).

Also I found this on BBMap:

To discard reads that have mismatching lengths of bases and qualities: reformat.sh in=reads.fq out=fixed.fq tossbrokenreads

Although it may be the solution I would prefer to know why do I have this fastq files corrupted after being generated by samtools fastq...

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20

0

Entering edit mode

You're likely miscounting.

ADD REPLY • link 7.3 years ago by Devon Ryan 105k

0

Entering edit mode

It's probably... I used this command:

cat R1.fastq | paste - - - - | cut -f 3 | tr -d '\n' | wc -m

1495694956

cat R1.fastq | paste - - - - | cut -f 5 | tr -d '\n' | wc -m

1494905216

Do you know another command that compares the length of each read to the length of the quality scores? Taking into account that the header of my fastq files are like this:

@HX6_24184:8:2213:9242:58356 BX:Z:ACGCTCTAGGCTGCAA-1

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20

1

Entering edit mode

zcat foo_R1.fastq.gz | awk '{if(NR%4==1) {name=$1} else if (NR%4==2) {s = length($1)} else if(NR%4==0) { if(s != length($1)) print name}}'

That will print out the names of any reads with mismatches in their sequence and quality lengths.

ADD REPLY • link 7.3 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks a lot Devon. It showed me no results so I guess my files are correct. But I wonder if my command was wrong. I suppose it is!

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20

0

Entering edit mode

You command will work properly if and only if cut -f 3 and cut -f 5 can be guaranteed to give you the sequence and quality lines. What I expect is happening is that you sometimes have an extra tab in the header line or something like that, which will cause the field numbers to change and the results to get thrown off.

ADD REPLY • link 7.3 years ago by Devon Ryan 105k

0

Entering edit mode

I see... Anyway I checked and both gives me the sequence and the quality lines respectively. It's strange, I will try again similarly today. Thanks for the help :).

ADD REPLY • link 7.3 years ago by guillepalou4 ▴ 20