Question: Parsing Fastq Files
deepthitheresa (20), Canada, wrote 7.4 years ago:

Hi all,

I have FASTQ reads that look like this:

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT
NACCCTAGAAATTATAAATCTCTTCAAGTGAGATTGTAAGGAGAAGGAGAAACTTGGTCTGGAATTTGTTATAAAAGCACTT
+
#1=DDFFFHHGGHIJJJJJIJJJJJJJJCHGHIIJJEFHIJIJJIIJIIIIJHHIJJFHIIJJJJJJJIJIJIJIIJHEHHHHFFFFFFEEEDEEEDCDDC

I aligned this fastq file with a reference genome using bowtie. How can I identify the sample name from this record?

I have demultiplexed FASTQ files for each sample, and I also have a barcode information file in this format:

sample name    Index sequence
BC1                  CGATGT
BC2                  CGATGA

When I retrieve the alignment information using $sam->features(), the seqID is returned as:

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918

How can I get the 1:N:0:CGATGT part from the alignment information?

Thanks, Deeps

modified 7.4 years ago by jingtao09 (110) • written 7.4 years ago by deepthitheresa (20)
Sean Davis (25k), National Institutes of Health, Bethesda, MD, wrote 7.4 years ago:

I'd suggest that you use SAM Read Groups to track samples. This would be done at the alignment stage....
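
To make the read-group idea concrete, here is a minimal sketch of what a read group looks like in SAM text: an @RG header line with an ID and a sample (SM) field, plus an RG:Z tag on each record. The function name add_read_group and the example record are made up for illustration. In practice most aligners can write read groups themselves at the alignment stage, so check your aligner's manual before post-processing like this.

```python
def add_read_group(sam_lines, sample):
    """Yield SAM lines with an @RG header line and an RG:Z tag on every record.

    Assumes one demultiplexed SAM file per sample, so a single read group
    named after the sample is enough.
    """
    rg_header = "@RG\tID:{0}\tSM:{0}".format(sample)
    header_done = False
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            yield line  # existing header line, passed through unchanged
            continue
        if not header_done:
            # First alignment record: emit the @RG line just before it.
            yield rg_header
            header_done = True
        yield line + "\tRG:Z:" + sample
    if not header_done:
        # Header-only input: still record the read group.
        yield rg_header


# Illustrative input: one header line and one unmapped record
# (sequence and qualities shortened).
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918\t4\t*\t0\t0\t*\t*\t0\t0\tNACC\t#1=D",
]
tagged = list(add_read_group(sam, "BC1"))
```

With the tag in place, the sample name travels with every alignment and no longer depends on the read name.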

written 7.4 years ago by Sean Davis (25k)

Good suggestion. It helped me a lot.

written 7.1 years ago by deepthitheresa (20)
jingtao09 (110) wrote 7.4 years ago:

If you want to keep the barcode in the SAM file, you can replace the space between the main header and the barcode section with a non-space character. That changes

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT

to be

@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT

Here I used a colon (":"), so when you parse this header you can use a split function to get the barcode. In Python:

header = "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
# The barcode is the last colon-separated field of the joined header.
barcode = header.rstrip("\n").split(":")[-1]
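
Once you have the barcode, the sample name can be looked up from the barcode file shown in the question. This is only a sketch: it assumes the file has one header row followed by whitespace-separated "sample, index" rows, and the name load_barcodes is made up for illustration.

```python
# Barcode table copied from the question; in practice read it from your file.
barcode_table = """\
sample name    Index sequence
BC1                  CGATGT
BC2                  CGATGA
"""


def load_barcodes(text):
    """Map index sequence -> sample name, skipping the header row."""
    samples = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) == 2:
            sample, index = fields
            samples[index] = sample
    return samples


samples = load_barcodes(barcode_table)

header = "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
barcode = header.rstrip("\n").split(":")[-1]
print(samples.get(barcode, "unknown"))  # prints BC1
```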

Most mappers, e.g. BWA or Bowtie, truncate the read name at the first space, so preprocessing your FASTQ file into this new format will save you a lot of time. Otherwise, if you cannot modify the FASTQ reads, you can open the original FASTQ file and the SAM file at the same time, line up the records, and parse out the barcode.
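
Both options can be sketched in a few lines of Python. This assumes plain four-line FASTQ records (no wrapped sequence lines); the function names are made up for illustration, and the example record uses the header from the question with a shortened sequence and quality string. Header lines are picked by position rather than by a leading "@", because a quality line can also start with "@".

```python
def join_fastq_headers(fastq_lines, sep=":"):
    """Yield FASTQ lines with the first space in each header replaced by sep."""
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:  # every fourth line, starting with the first, is a header
            line = line.replace(" ", sep, 1)
        yield line


def barcodes_by_read_name(fastq_lines):
    """Map read name (without the leading '@') -> barcode, from unmodified FASTQ."""
    barcodes = {}
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            name, _, comment = line.rstrip("\n").partition(" ")
            if name.startswith("@"):
                name = name[1:]
            # The barcode is the last colon-separated field of the comment.
            barcodes[name] = comment.split(":")[-1]
    return barcodes


record = [
    "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT",
    "NACC",
    "+",
    "#1=D",
]
joined = list(join_fastq_headers(record))
lookup = barcodes_by_read_name(record)
```

The first function is the preprocessing step, run before alignment. The second is the fallback for an untouched FASTQ file: look up each SAM read name in the returned dictionary. Note that it holds every read name in memory, which may be too much for a large run.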

modified 7.4 years ago by Istvan Albert ♦♦ 81k • written 7.4 years ago by jingtao09 (110)