Question

Formatting RNA-seq with UMIs in unmapped BAM files to UMI-tools-compatible FASTQ files

0

5.1 years ago

graeme.thorn ▴ 100

I'm about to take delivery of large numbers of unmapped (demultiplexed) BAM files from a BGI sequencer with pairs of reads in the following form:

A1  77  *   0   0   *   *   0   0   <sequence>  <quality>   RG:Z:rL1    RX:Z:GCGCCCC    QX:Z:B(,:.)'    BC:Z:GTCTAAACAG QT:Z:.',(/-+*;&
A1  141 *   0   0   *   *   0   0   <sequence>  <quality>   RG:Z:rL1    RX:Z:GCGCCCC    QX:Z:B(,:.)'    BC:Z:GTCTAAACAG QT:Z:.',(/-+*;&

The barcodes (BC:/QT:) and the unique molecular indices (RX:/QX:) have already been removed from the reads and deposited in the relevant fields of the BAM file.

I need to process this into FASTQ format for the paired ends (flags 77 and 141 for a mate pair) to eventually use UMI-tools on the mapped data.

Is there a script anywhere that will take an unmapped BAM file in this format and turn it into a FASTQ of the format required to map before deduplicating through UMI-tools? I can brew my own, but if there's an off-the-shelf solution I could use, then I'd be grateful.
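For anyone who does end up brewing their own, here is a minimal Python sketch of the conversion, not an off-the-shelf tool. It assumes the records arrive as SAM text (for example piped from `samtools view`), that the UMI sits in the RX tag as in the example above, and that the UMI should be appended to the read name after an underscore, which is where UMI-tools looks by default when reading UMIs from read names. The function names `parse_tags`, `sam_to_fastq` and `write_fastq_pair` are made up for illustration.

```python
def parse_tags(fields):
    """Return optional SAM fields as {tag: value}, e.g. {'RX': 'GCGCCCC'}."""
    tags = {}
    for field in fields:
        # TAG:TYPE:VALUE; the value may itself contain ':' (see QX above),
        # so split at most twice.
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def sam_to_fastq(lines, umi_tag="RX", sep="_"):
    """Yield (mate, fastq_record) for each SAM text line.

    mate is 1 if the 'first in pair' flag bit (0x40) is set, otherwise 2.
    The UMI is appended to the read name as <name><sep><UMI>.
    """
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        cols = line.rstrip("\n").split("\t")
        name, flag, seq, qual = cols[0], int(cols[1]), cols[9], cols[10]
        umi = parse_tags(cols[11:]).get(umi_tag)
        if umi is None:
            raise ValueError(f"read {name} has no {umi_tag} tag")
        mate = 1 if flag & 0x40 else 2
        yield mate, f"@{name}{sep}{umi}\n{seq}\n+\n{qual}\n"


def write_fastq_pair(lines, prefix):
    """Write <prefix>_R1.fastq and <prefix>_R2.fastq (hypothetical naming)."""
    with open(f"{prefix}_R1.fastq", "w") as r1, open(f"{prefix}_R2.fastq", "w") as r2:
        for mate, record in sam_to_fastq(lines):
            (r1 if mate == 1 else r2).write(record)
```

To use it, call `write_fastq_pair(sys.stdin, "sample")` from a small wrapper script and pipe the output of `samtools view` on the unmapped BAM into it. The sketch keeps records in input order, so the two FASTQ files stay in sync only if mates are adjacent in the BAM, as they are in the example.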

RNA-Seq umi pre-processing • 2.2k views

ADD COMMENT • link 5.1 years ago by graeme.thorn ▴ 100

2

If you can find a mapper that will take the unmapped BAM as input, UMI-tools is more than happy to read the library barcodes and UMI sequences from a BAM tag rather than from the read name.

ADD REPLY • link 5.1 years ago by i.sudbery 19k

0

Thanks, just spotted the options --umi-tag and --extract-umi-method=tag. STAR can take unmapped BAM files as input and (according to its docs) retains all tags when mapping, so it can be used for the alignment step before UMI-tools. Now just to find a solution to clip/trim the unmapped BAM before trying to align.
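As a rough sketch of the tag-based route, the snippet below assembles a `umi_tools dedup` call that reads the UMI from the RX tag instead of the read name. It assumes the aligned BAM is coordinate-sorted and indexed and that the aligner has carried the RX tag through. The file names and the helper `dedup_command` are placeholders, not part of UMI-tools.

```python
import shlex


def dedup_command(in_bam, out_bam, umi_tag="RX", paired=True):
    """Build the argument list for a tag-based umi_tools dedup run."""
    cmd = [
        "umi_tools", "dedup",
        "--extract-umi-method=tag",  # take the UMI from a BAM tag
        f"--umi-tag={umi_tag}",      # RX holds the UMI in the example records
        "-I", in_bam,                # input: sorted, indexed, aligned BAM
        "-S", out_bam,               # output: deduplicated BAM
    ]
    if paired:
        cmd.append("--paired")       # paired-end data (flags 77/141 above)
    return cmd


# Print the command for inspection; pass the list to subprocess.run to execute it.
print(shlex.join(dedup_command("aligned.sorted.bam", "dedup.bam")))
```

Building the command as a list keeps the quoting safe if it is later handed to `subprocess.run`.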

ADD REPLY • link 5.1 years ago by graeme.thorn ▴ 100

1

"Now just to find a solution to clip/trim the unmapped BAM before trying to align"

STAR should soft-clip during alignment.

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Is this 10x data?

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Not as far as I'm aware. All I know is that the data will be in the above form. The example is taken from an initial run with some other known samples, so that I can familiarise myself with the format and develop pipelines before the real data arrives.

ADD REPLY • link 5.1 years ago by graeme.thorn ▴ 100