Question: Read name not stored in CRAM file. Is it possible to create new read names for remapping?
0
gravatar for cl
18 days ago by
cl0
cl0 wrote:

I have received some CRAM files with paired-end WGS data. I would like to remap the reads in a specific region to another reference.

However, most of the template/read names have the format filename.cram:21937611, so I suspect that the CRAM files have been stored with samtools ... lossy_names=1.

When converting the CRAM file into FASTQ files (samtools fastq), I loose most of the reads, and from the CRAM file, I can see that <10% of the read names are unique.

Is it possible to generate new read names based on the information present in the CRAM file? Of course it would not be possible to recreate the old names, but is there a solution that will allow me to generate fastq files and perform realignment?

I've inserted one example for a read read name used for multiple reads below. Note that the same sample/library has been analysed in 8 lanes.

From samtools documentation:
lossy_names=0|1 CRAM output only; defaults to 0 (off). If 1, templates with all members within the same CRAM slice will have their read names removed. New names will be automatically generated during decoding. Also see the name_prefix option.


An example where filename.cram:21937611 is found as read name for multiple reads (note that the the reads may come from different read groups: RG:Z:AHH7F7CCXX.sample_name.#, but RG:Z:AHH7F7CCXX.sample_name.6 is found 4 times)

filename.cram:21937611  99  6   28999893    60  150M    =   29000058    315 AACATACCCATGTTCCCATTATGACGTTTCTTCATTTAATTAATAAATTAGAAAAAAATTCTTGTTGGATGAGTTACTAATGCCCTGAAGAATTGGATTAACCACTGGTCATACTGACACTACAGTGCCATTCACACTTAAATGCAACAG  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.3   NM:i:0  MQ:i:60 AS:i:150    XS:i:0
filename.cram:21937611  147 6   29000058    60  150M    =   28999893    -315    ATAGAAGTCTCTATTTAATGTGGATATTGGAAGTAAACTAAATGTGGGACTGGTGAAAATCCTTAATTAGGTTTGGTTAAATATATTTTCGGTTGGCTATTTGATGTCTTTTTAATGTATACTCTTGTTATTCATATTTACAGCTAGATT  KKKKKFA<FFKKKFFFKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.3   NM:i:0  MQ:i:60 AS:i:150    XS:i:20
filename.cram:21937611  1187    6   29028730    60  150M    =   29028863    185 ACATCCCAGGGATGAAGCTGACTAGCTCATGGTGAATAAGCTTTTTGATGTGCTGTTGAATTGTTTGCTAGTATTTTATTGAAGATTTTCACATCAATGTCCATCAGGGATATTGGCCTGAAATTTTCTTTTTTGTTGTTGTTGTGTCTC  AAA<FKKFKFK,,AAKKKFK<FFKKFKKKAFKKFKKKFAF,,77A,77,F7<KKK7FFA,K7FAAAFA7AFKFK<F7<FKKKKKF7FFKKKKKAKK,,AFAA<FFKKK,7,7F,A,,,A7AA,,<<FKKF,7,,<AF,,AFA,,,,<,7<  MC:Z:18S52M80S  MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.4   NM:i:0  MQ:i:40 AS:i:150    XS:i:66
filename.cram:21937611  1107    6   29028863    40  18S52M80S   =   29028730    -185    GTCTTTTATTTGTCTTTGTGTTGTTGTTGTGTCTCTGCCCTGTTTTGTTAACAATATGATGCTGTCTTCAAACTTAGAGTTATGTAGGAAATCCCCTCTTACTTTATTGTGGATTAGTTTCTGAAGGAAGGATACCAGTTCATCTTAGTA  ,,A,,,,,,,,,,,<,,,A<A,,K<<,,7,AA7A,,(AA<,,,7,7,,,,,7<,7,,<,7KFA77,KF<7,7,,,,,7,FFA7,,,,,7,,,(,,,,,,,,KK7F,KF,7<FA,7,AF,,A,<<,,7,,,FF7KFA<,,A,F,A<FAA,,  XA:Z:2,-117731944,14S36M100S,1;4,+75904852,111S26M13S,0;    MC:Z:150M   MD:Z:29G2T3G9A5 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.4   NM:i:4  MQ:i:60 AS:i:32 XS:i:31
filename.cram:21937611  99  6   29086353    60  119M31S =   29086353    119 AGACATATATGTTGTCTTGGGTGGCCACAACATAGTTAGATTTTGCACTGCAGTCCTCCAGATCTAGGGCTTATCTTTTGAGGAAAGTATACTTGCTCTGAAAGAAACATGTAGGTAGCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGT  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKAKAFKKKKKKKKKA  MC:Z:31S119M    MD:Z:119    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.7   NM:i:0  MQ:i:60 AS:i:119    XS:i:20
filename.cram:21937611  147 6   29086353    60  31S119M =   29086353    -119    ACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGACATATATGTTGTCTTGGGTGGCCACAACATAGTTAGATTTTGCACTGCAGTCCTCCAGATCTAGGGCTTATCTTTTGAGGAAAGTATACTTGCTCTGAAAGAAACATGTAGGTAGC  KKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MC:Z:119M31S    MD:Z:119    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.7   NM:i:0  MQ:i:60 AS:i:119    XS:i:20
filename.cram:21937611  163 6   29115837    60  150M    =   29116033    346 GTAACCCCTTTTACTACCTACCCCAGAGTTTCTTTTATTATTAACAACTTGCATTCATGTGGTACATTTGTTATAATTGATAAGCCAATATTAATACATTATTAGTAACCAAATTCCATAGTTTACATTAGGGTTGATGGTGTGTGTTTT  AAFFFAKKKKKFKKA,F<F,A,AFKKKKAFKAKKKK7KKKKKKFKF<<AFK<AFK,AAKKKKKKKKKKKKKFKKKKFAAKK<FFF,7<FKKKFKKF,AFAKKK7AKKF<<,A7,<A<FA7<,AK77,<FAFAF<AFKKKKKKKKKKAKKF  MC:Z:150M   MD:Z:15C3C130   PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.8   NM:i:2  MQ:i:60 AS:i:140    XS:i:34
filename.cram:21937611  83  6   29116033    60  150M    =   29115837    -346    TTCAGTACCATAAAGAATGGTTTCACTGCCTTAAAAATTCCCTGTTCTCCTTCCATTCATCATTTCCTCCCCTCCTCCCCGGGAGCCCCTGACAACCACTGATTTTTTATTGTTTCCATAACTGTGCTTTTTCCAGAATATCATACAATT  <,,,7<,,FKAA,,,KFF<,AFA7FF7KKKAFF,7KAF,AKKFFF,KKF,FFKF7F,KKKF,KAF,FFFA,FA(F(KFA77A7<(FFKKFAFFF<FKAKK,AKKKKKKKKKKKKFK<,AFFFKKKKKKKFF,KKKKKKAKFKKFKFFAAA  MC:Z:150M   MD:Z:7T142  PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.8   NM:i:1  MQ:i:60 AS:i:145    XS:i:23
filename.cram:21937611  99  6   29144947    40  150M    =   29145234    437 NACACTGACTTCCACAATGGTTGAACTAGTTTACAGTCACACCAACAGTGTAAAAGTGTTCCTATTTCTCCACATCCTGTCCAGCACCTGTTGTTTCGTGACTTTTTAATGATTGCCATTCTAACTGGTGTGAGGTGATATCTCCTTGTG  #AAAFF<AFKKKKAKKF7FKAKFK<FK,FKKKKKKFFA,7F,FKKKFKKF,,<F77FKKKKKKFFFKKKK,AFFKKKAKKK7FKKKKFF<KFKK7<<7<<7AFKKKKFF,,7F,<,A<,,,7,,,7,,,,,,,,A7A,,,,,,,,,,,,,  MC:Z:150M   MD:Z:0C37C58C39G6A5PG:Z:MarkDuplicates  RG:Z:AHH7F7CCXX.sample_name.5   NM:i:5  MQ:i:60 AS:i:129    XS:i:128
filename.cram:21937611  147 6   29145234    60  150M    =   29144947    -437    CTTGTAAATTTGTTTGAGTTCATTGTAGATTCTGGATATTAGCCCTTTGTAAGATGAGTAGCTTGCAAAAATTTTCTCCCATTCTGTAGGTTGCCTATTCACTCTGATGGTAGTTTCTTTTGCTGTGCAGAAACTCTTTAGTTTAATTAG  FF7<<777FKKKFKKA,7A7,,,<A7A,FFKKKKKKFF<AFKKKFFAF7F7A,F,,KKFA7AFF,,FF7FF,,FKKFKKKKKKKKKKKKKFFFAKKKKKAA7<AKKKAFFKFFKKKKKKFKAKKKKAFAAKKKKKKKKKFKKKKKF<F<<  MC:Z:150M   MD:Z:50C99  PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5   NM:i:1  MQ:i:40 AS:i:145    XS:i:135
filename.cram:21937611  163 6   29174463    60  150M    =   29174749    436 TGTATTGAGATATAATTTATTTACAATACAGTTCACCCATTTAAAGTGTACAAGTCAATGATTTTTGGTATATTTCATTACTAATTTATAACATTGTGGTAATATATAACATAAAATTTGCCATTTTAACTATTTTTAAGTGTACAATTT  AAFFFKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFAAFKKKKKKKKKKKKKKFKAFFKKAAFFAFKKKKFKKK7,A,AA,AAFK<  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:0  MQ:i:60 AS:i:150    XS:i:26
filename.cram:21937611  83  6   29174749    60  150M    =   29174463    -436    CTGCCTTATGTCTGTATGAATTTGCATGTTCCAGATATTTCATATTATTGGAAACATACAATTTTTGCCCTTTTAGGTCTGGCTTATTCATTTAGCATAAGATGTACATATTGCTTTATAGCCTGCTTTTTCACTTAGCAGCATATTGTG  ,,,,,<A,<<7<A,AFFA<,<FA,,<A<AFFFA<,F<KKKAAKKKKKKFFKKKKKKKKKKFKKKKA7AFAAA<F,FFKKF<7FFFKKKKFF7KKKKKKKKKKKKKKKFKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFFFAA  MC:Z:150M   MD:Z:2A147  PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:1  MQ:i:60 AS:i:147    XS:i:24
filename.cram:21937611  99  6   29202989    60  150M    =   29203110    271 CATGAAATATTTAAAGAGTTTTATTCTGAGCCAAATGTGAGTAATTGACGGCCTGAGGCGCAGTCTCAAGACATCCTGAGAACAAGTGCCCAAGGTGATTGAATTACAACTTGATTTACATTTTAGGAGCATGTAAAAAATCAGTCAATA  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKA<FKKKKKKKKKKKKKKKKKKKAFAFKKKKKKKKKFFKKFKFKFKKFKKKAK  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:0  MQ:i:60 AS:i:150    XS:i:30
filename.cram:21937611  147 6   29203110    60  150M    =   29202989    -271    TTTAGGAGCATGTAAAAAATCAGTCAATACATGTGGGGTCTGTGTTGGTTCAGTCATGAAAGGTGGAACAACTCAAAGTGGGGCCTTCCACATCAAAGGTAAATTCAAAGATTTTCTTATTGGCAATTGGTTGAAAAACTTGTTACTATT  FKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:0  MQ:i:60 AS:i:150    XS:i:26
filename.cram:21937611  163 6   29292208    60  150M    =   29292473    415 AAGAAAAAATATTAATAGAGCTATTTAACCTGGTTTGTTGAATGTGATAATGATTACAGAAAACATTGACTAATGCTGCTTATAATAATGAAAAACTGGAAAAGTGACTGGTATACAATAGGCAATCAATAAATATTTGTTAAATAAATA  AAAFAKAKFFKKKKKKKKFKKKFFAAKAKKFKK7FKFFKKKKAKAKKKK<FKKAKFFKFFFFFKKKK7KKKFKFFFKKKAAFFFF<KAKFK7FA<FA7FAAFAKKFFKKFKKKKKAK7<FFKKKK<AAFAAFFFFF7,,A7,<AA<,AF7  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5   NM:i:0  MQ:i:60 AS:i:150    XS:i:20
filename.cram:21937611  83  6   29292473    60  150M    =   29292208    -415    TATTAAAAATACTAAAATGTGTATACAAGATGGTGAGACAACTCCAAAATATAAAACAACATTATGGACTTACCTTTATTTCTAAGTACTTACAAAGTTGTTGCAAAAATTTTAATATACATATTAATGGATTTAGCTATCATATGAGAT  FA7A,77<A,<<AA,A<A<F<<7<FA,7A,,,A<7AF7<,7KKKKKKKF7FAFKA7FKKKKFFKFF7AKFF,<FFFKKKKKKKKKKKKFKKKKKFKKKFKFFKKKKFAKKFFAK<7AFAKKKKKKKKKKKKKKKFAAFAKKKKKKFFFA,  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5   NM:i:0  MQ:i:60 AS:i:150    XS:i:20

And read group information (one sample, one library, 8 flow cells):

@RG ID:AHH7F7CCXX.sample_name.1 PU:AHH7F7CCXX.sample_name.1 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.2 PU:AHH7F7CCXX.sample_name.2 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.3 PU:AHH7F7CCXX.sample_name.3 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.4 PU:AHH7F7CCXX.sample_name.4 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.5 PU:AHH7F7CCXX.sample_name.5 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.6 PU:AHH7F7CCXX.sample_name.6 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.7 PU:AHH7F7CCXX.sample_name.7 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.8 PU:AHH7F7CCXX.sample_name.8 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
sequencing samtools bam cram • 189 views
ADD COMMENTlink modified 14 days ago by jkbonfield440 • written 18 days ago by cl0

New names will be automatically generated during decoding.

Is this part not working? Or the names you show above turn out to be identical?

ADD REPLYlink modified 18 days ago • written 18 days ago by GenoMax92k

New names have been automatically generated during decoding, but the names are identical for many reads. To give one example: After converting a 5 mb region from a CRAM file into a BAM file, I have 2,797,216 reads in the BAM file. However, in this BAM file, I only have 177,983 unique read names. Most of the read names have been converted into a format such as filename.cram:21937611, and multiple read pairs are assigned identical read names.

ADD REPLYlink written 18 days ago by cl0

Sounds like only way would be to write some code to rename these so they become unique. Scan this file and each time you see a new filename.cram:NNNNNN start numbering them filename.cram:NNNNNN_1, filename.cram:NNNNNN_2 until you hit next number.

What is Also see the name_prefix option.? Does that allow you to make the read names unique?

ADD REPLYlink modified 18 days ago • written 18 days ago by GenoMax92k

That would of course be a possibility, but then each read would become a single-end read, and I would of course like to retain the paired-end read (i.e. one uniqe read name for a pair of reads). I wonder if it is possible to match paired reads from the other information in the alignment section of the BAM/CRAM file? Column 4 has information about a the current read and column 7-8 the position of the paired read. On top of that, the RG information in column ~15 will give information about the lane, so to some extent I should be able to identify paired reads and give them the same name. But for reads mapping to multiple positions, split reads etc., this may become very complex.

ADD REPLYlink written 18 days ago by cl0

this may become very complex

Absolutely. This is going to be a big task and there is nothing off-the-shelf you can use.

I am surprised that CRAM group did not think of someone like you wanting to convert the data back to fastq.

ADD REPLYlink modified 14 days ago by John Marshall2.1k • written 18 days ago by GenoMax92k
1
gravatar for jkbonfield
14 days ago by
jkbonfield440
jkbonfield440 wrote:

See my reply to your question on github too.

However I suspect this is a processing error somewhere else which has "baked in" the pregenerated names at some point, and then a merge or something has caused the collision.

CRAM does track read-pairs even when the names are removed. However it's possible to cause collisions if you merge from multiple files all with the same filename (but in different dirs). I'm thinking with hindsight we should have used a per-file random hash as part of the name along with read group so the chance of errors is reduced.

ADD COMMENTlink written 14 days ago by jkbonfield440

GitHub thread referenced above: https://github.com/samtools/samtools/issues/1341

ADD REPLYlink modified 14 days ago • written 14 days ago by GenoMax92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2141 users visited in the last hour