Parsing FASTQ Files
Hi all,
I have FASTQ reads that look like this:
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT
NACCCTAGAAATTATAAATCTCTTCAAGTGAGATTGTAAGGAGAAGGAGAAACTTGGTCTGGAATTTGTTATAAAAGCACTT
+
#1=DDFFFHHGGHIJJJJJIJJJJJJJJCHGHIIJJEFHIJIJJIIJIIIIJHHIJJFHIIJJJJJJJIJIJIJIIJHEHHHHFFFFFFEEEDEEEDCDDC
 
I aligned this FASTQ file to a reference genome using Bowtie. How can I identify the sample name from this record?
I have demultiplexed FASTQ files for each sample, and I also have a barcode information file in this format:
Sample name    Index sequence
BC1            CGATGT
BC2            CGATGA
 
When I try to retrieve the alignment information using $sam->features(), the seqID is returned as
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918
 
How can I get the 1:N:0:CGATGT part from the alignment information?
Thanks,
Deeps
                    
I'd suggest using SAM read groups to track samples. This would be done at the alignment stage.
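As a rough illustration of the read-group idea, the sketch below builds one SAM @RG header line per sample from the barcode table in the question. The helper name `read_group_line` and the choice to reuse the sample name as the read group ID are assumptions made for this example; `ID`, `SM`, and `BC` are standard @RG fields in the SAM specification. How the read group is handed to the aligner depends on the aligner and its version, so check its manual for the relevant option.

```python
def read_group_line(sample, barcode):
    """Return a SAM @RG header line for one sample (hypothetical helper)."""
    # ID: read group identifier, SM: sample name, BC: barcode sequence.
    fields = ["@RG", f"ID:{sample}", f"SM:{sample}", f"BC:{barcode}"]
    return "\t".join(fields)


# Barcode table taken from the question: sample name -> index sequence.
barcodes = {"BC1": "CGATGT", "BC2": "CGATGA"}

rg_lines = [read_group_line(sample, seq) for sample, seq in barcodes.items()]
for line in rg_lines:
    print(line)
```

Once every alignment carries an RG tag, the sample can be read from the alignment record itself, without going back to the FASTQ header.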
                    
If you want to keep the barcode in the SAM file, you can replace the space between the main header and the barcode section with a non-space character. For example, change
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT
 
to
@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT
 
Here I used a colon ":", so when you parse this header you can use a split function to get the barcode. In Python:
header = "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
# The barcode is the last colon-separated field of the joined header.
barcode = header.rstrip("\n").split(":")[-1]
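To get from the barcode to the sample name, the split can be combined with the barcode information file from the question. This is only a sketch: `load_barcodes` and `sample_for_read` are hypothetical helper names, and the parser assumes each data row has exactly two whitespace-separated columns (sample name, then index sequence), so the header row is skipped because it has more than two words.

```python
def load_barcodes(lines):
    """Map index sequence -> sample name from whitespace-separated rows."""
    table = {}
    for line in lines:
        parts = line.split()
        # Data rows have exactly two columns; anything else is skipped.
        if len(parts) == 2:
            sample, index_seq = parts
            table[index_seq] = sample
    return table


def sample_for_read(read_name, table):
    """Return the sample for a read whose name ends in ':<barcode>'."""
    barcode = read_name.rstrip("\n").split(":")[-1]
    return table.get(barcode)  # None if the barcode is not in the table


rows = [
    "sample name    Index sequence",
    "BC1                  CGATGT",
    "BC2                  CGATGA",
]
table = load_barcodes(rows)
name = "HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918:1:N:0:CGATGT"
print(sample_for_read(name, table))
```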
 
Normally, most mappers (e.g. BWA or Bowtie) truncate the read name at the first space.
So if you preprocess your FASTQ file into this new format, you will save a lot of time. Otherwise, if you cannot modify the FASTQ reads, you can read the original FASTQ file and the SAM file side by side, line up the corresponding records, and parse out the barcode from the FASTQ header.
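Both options can be sketched in a few lines of Python. The function names below are made up for illustration, and the code assumes strict four-line FASTQ records (header, sequence, "+", quality) with no wrapped sequences. `rewrite_fastq` does the preprocessing; `barcodes_by_read_name` is the fallback, and it matches on read name rather than line position, since the SAM file may not list reads in the same order or number as the FASTQ file.

```python
import io


def join_header(header):
    """Replace the first space in a FASTQ header with a colon."""
    return header.replace(" ", ":", 1)


def rewrite_fastq(src, dst):
    """Copy FASTQ records from src to dst, joining each header line."""
    for i, line in enumerate(src):
        # Every fourth line, starting at line 0, is a header line.
        dst.write(join_header(line) if i % 4 == 0 else line)


def barcodes_by_read_name(src):
    """Fallback: map read name (without '@') -> barcode from the FASTQ."""
    lookup = {}
    for i, line in enumerate(src):
        if i % 4 == 0:
            name, _, comment = line.rstrip("\n").partition(" ")
            if name.startswith("@"):
                name = name[1:]
            lookup[name] = comment.split(":")[-1]
    return lookup


# Shortened record based on the read in the question.
record = (
    "@HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918 1:N:0:CGATGT\n"
    "NACC\n"
    "+\n"
    "#1=D\n"
)
out = io.StringIO()
rewrite_fastq(io.StringIO(record), out)
print(out.getvalue(), end="")

lookup = barcodes_by_read_name(io.StringIO(record))
print(lookup["HWI-ST1162:73:C0KEFACXX:6:1101:1816:1918"])
```

For real files, pass open file handles instead of `io.StringIO` objects. The lookup keeps every read name in memory, so it suits moderate file sizes; for very large runs the preprocessing route is likely the better fit.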
 
Good suggestion. It helped me a lot.