Question

Parsing Fastq Files

1

13.5 years ago

deepthitheresa ▴ 20

Hi all,

I have Fastq reads something like

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT
NACCCTAGAAATTATAAATCTCTTCAAGTGAGATTGTAAGGAGAAGGAGAAACTTGGTCTGGAATTTGTTATAAAAGCACTT
+
#1=DDFFFHHGGHIJJJJJIJJJJJJJJCHGHIIJJEFHIJIJJIIJIIIIJHHIJJFHIIJJJJJJJIJIJIJIIJHEHHHHFFFFFFEEEDEEEDCDDC

I aligned this FASTQ file against a reference genome using Bowtie. How can I identify the sample name from this record?

I have demultiplexed FASTQ files for each sample, and I also have a barcode information file in the format

sample name    Index sequence
BC1                  CGATGT
BC2                  CGATGA

When I retrieve the alignment information using $sam->features(), the seqID is returned as

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918

How can I get the 1:N:0:CGATGT part from the alignment information?

Thanks, Deeps

fastq parsing • 5.0k views

ADD COMMENT • link updated 13.5 years ago by jingtao09 ▴ 110 • written 13.5 years ago by deepthitheresa ▴ 20

score 2 · Answer 1 · 2012-05-08

2

13.5 years ago

Sean Davis 27k

I'd suggest that you use SAM Read Groups to track samples. This would be done at the alignment stage....
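As a rough illustration of the read-group idea, the sketch below builds one SAM `@RG` header line per sample from the barcode table in the question. The function name `read_group_line` is hypothetical, and using the sample name as the read group ID is an assumption; how the line is passed to the aligner depends on the tool, so check its documentation for the read-group option.

```python
def read_group_line(sample, barcode):
    """Build a SAM @RG header line for one demultiplexed sample."""
    # ID = read group identifier, SM = sample name, BC = barcode sequence.
    # Reusing the sample name as the ID is an assumption made for this sketch.
    fields = ["@RG", f"ID:{sample}", f"SM:{sample}", f"BC:{barcode}"]
    return "\t".join(fields)


# Barcode table taken from the question.
barcodes = {"BC1": "CGATGT", "BC2": "CGATGA"}
rg_lines = [read_group_line(sample, bc) for sample, bc in barcodes.items()]
```

Each alignment from a given sample's FASTQ file would then carry an `RG:Z:<ID>` tag pointing back to its `@RG` line, so the sample can be recovered from the SAM file alone.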

0

Good suggestion. It helped me a lot.

ADD REPLY • link 13.2 years ago by deepthitheresa ▴ 20

Istvan Albert · Answer 2 · 2012-05-09

If you want to keep the barcode in the SAM file, you can replace the space between the main header and the barcode section with a non-space character. That is, change

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT

to be

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT

Here I used a colon ":", so when you parse this header you can use a split function to get the barcode. In Python:

header = "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
# The barcode is the last colon-separated field of the joined header.
barcode = header.rstrip("\n").split(":")[-1]
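The preprocessing step itself might look something like the sketch below. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function names `join_header` and `rewrite_fastq` are hypothetical.

```python
def join_header(header, sep=":"):
    """Replace the first space in a FASTQ header with sep."""
    return header.rstrip("\n").replace(" ", sep, 1)


def rewrite_fastq(lines, sep=":"):
    """Yield FASTQ lines, rewriting only the header (1st line of every 4)."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Assumes strict 4-line records: header, sequence, '+', quality.
        yield join_header(line, sep) if i % 4 == 0 else line
```

To use it, open the input and output files, pass the input handle to `rewrite_fastq`, and write each yielded line followed by a newline.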

Normally, most mappers, e.g. BWA or Bowtie, truncate the read name at the first space, so if you preprocess your FASTQ file into this new format you will save a lot of time. Otherwise, if you cannot modify the FASTQ reads, you can open the original FASTQ file and the SAM file at the same time, line up the records in the two files, and parse out the barcode.
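For the second option, one way to pair the two files is to look reads up by name rather than by line number, which avoids relying on the aligner preserving read order. This is a sketch under a few assumptions: four-line FASTQ records, a read-name table that fits in memory, and hypothetical function names.

```python
def barcodes_by_read(fastq_lines):
    """Map each read name (without the leading '@') to its barcode."""
    table = {}
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # only the header line of each 4-line record
            continue
        name, _, comment = line.rstrip("\n").partition(" ")
        # The barcode is the last colon-separated field, e.g. 1:N:0:CGATGT.
        table[name[1:]] = comment.split(":")[-1]
    return table


def barcode_for_sam_line(sam_line, table):
    """Return the barcode for one SAM alignment line, or None if unknown."""
    if sam_line.startswith("@"):  # SAM header lines carry no read name
        return None
    qname = sam_line.split("\t", 1)[0]
    return table.get(qname)
```

The barcode returned for each alignment can then be matched against the barcode information file to get the sample name.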