Question: Extract the DNA sequence header!
Written 4.8 years ago by fufuyou (United States):

I have a question about our DNA sequence data.
If the sequence data look like this:
@HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA
TATGGGTTTCCACGGAGCACAGTGCCTAGTGCTCACTCCCCAGTTGTATCTTATTTTTCAGGTCAGCAGGTCGGGCCGGGAGTGTGACATGACGGAGCAGA
+
CCCFFFDDHHHHHJJGIJJJJJHIJJIJJHIJIJIJJJJJJIIIIJGIIBFHHFHJJJG>FHIJIGIIEHAHBBAB@BDBDD<?ACA>CDDDDDD5<BBD?
I can extract the sequence identifier, @HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA, using the code.
But if the sequence data look like this:
@HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA
AATACTTGTACGAGGGTGTTTTGCCACACCATATCTCATAAGGTGTGTTGGGTACATCTTTACTTGTCATTCTATTCAAAATATGTGTTGTTGTTTC
+
@@@ADD?DH8FH1CGG2A<F@FH?@?FC1DFGEDB9?BFHHIF?8?DBC=FB5@CDA;@)=.))..).;;B@B?@>>BDCCCCCD>B;?=5??<?CC
I cannot extract the sequence identifier.

So I think the problem lies in the sequence data. In the second record, the quality line begins with @, which is also the first character of the identifier line, so the code cannot pick out the correct sequence identifier from our data.
I want to extract the sequence identifiers from my sequence data. The identifier format is @HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA. Could you help me do this?
Thanks,
Fuyou
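
The ambiguity described above can be reproduced with a small sketch. It assumes the original code (not shown in the post) selected lines by their leading @; the file name sample.fastq is made up for the demonstration, and the two records are the ones quoted above.

```shell
# Build a two-record FASTQ file from the reads quoted in the question.
# sample.fastq is a hypothetical file name used only for this demo.
cat > sample.fastq <<'EOF'
@HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA
TATGGGTTTCCACGGAGCACAGTGCCTAGTGCTCACTCCCCAGTTGTATCTTATTTTTCAGGTCAGCAGGTCGGGCCGGGAGTGTGACATGACGGAGCAGA
+
CCCFFFDDHHHHHJJGIJJJJJHIJJIJJHIJIJIJJJJJJIIIIJGIIBFHHFHJJJG>FHIJIGIIEHAHBBAB@BDBDD<?ACA>CDDDDDD5<BBD?
@HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA
AATACTTGTACGAGGGTGTTTTGCCACACCATATCTCATAAGGTGTGTTGGGTACATCTTTACTTGTCATTCTATTCAAAATATGTGTTGTTGTTTC
+
@@@ADD?DH8FH1CGG2A<F@FH?@?FC1DFGEDB9?BFHHIF?8?DBC=FB5@CDA;@)=.))..).;;B@B?@>>BDCCCCCD>B;?=5??<?CC
EOF

# Number of records: total line count divided by 4 (one record = 4 lines).
n_records=$(awk 'END { print NR / 4 }' sample.fastq)

# Number of lines that begin with @ (headers plus any quality line starting with @).
n_at_lines=$(grep -c '^@' sample.fastq)

echo "records: $n_records, lines starting with @: $n_at_lines"
# prints: records: 2, lines starting with @: 3
```

The extra match is the quality line of the second read, because @ is a valid quality character in this encoding. Matching on the leading @ alone is therefore not a reliable way to find headers.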

cat Input.fastq | paste - - - - | cut -f1 > ReadIDs.txt

Goutham's solution, which uses only awk, should be much faster.

Written 4.8 years ago by Ashutosh Pandey; modified 6 months ago by RamRS

Goutham and Ashutosh,

Thanks, it is working.

Written 4.8 years ago by fufuyou; modified 6 months ago by RamRS
Written 4.8 years ago by geek_y (Barcelona):

You just need to print the read name, which is the first line of every four-line record in FASTQ format. Something like:

awk '{if (NR%4==1) print}'

or just

awk 'NR%4==1'
Written 4.8 years ago by Pierre Lindenbaum
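
For anyone who wants to try the line-position approach end to end, here is a minimal sketch using the two reads from the question. The file names sample.fastq and ReadIDs.txt are placeholders, and the sketch assumes each record occupies exactly four lines (sequence and quality not wrapped), as in the example data.

```shell
# Two-record FASTQ built from the reads in the question (hypothetical file name).
cat > sample.fastq <<'EOF'
@HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA
TATGGGTTTCCACGGAGCACAGTGCCTAGTGCTCACTCCCCAGTTGTATCTTATTTTTCAGGTCAGCAGGTCGGGCCGGGAGTGTGACATGACGGAGCAGA
+
CCCFFFDDHHHHHJJGIJJJJJHIJJIJJHIJIJIJJJJJJIIIIJGIIBFHHFHJJJG>FHIJIGIIEHAHBBAB@BDBDD<?ACA>CDDDDDD5<BBD?
@HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA
AATACTTGTACGAGGGTGTTTTGCCACACCATATCTCATAAGGTGTGTTGGGTACATCTTTACTTGTCATTCTATTCAAAATATGTGTTGTTGTTTC
+
@@@ADD?DH8FH1CGG2A<F@FH?@?FC1DFGEDB9?BFHHIF?8?DBC=FB5@CDA;@)=.))..).;;B@B?@>>BDCCCCCD>B;?=5??<?CC
EOF

# Keep line 1, 5, 9, ... : the header of each four-line record.
awk 'NR % 4 == 1' sample.fastq > ReadIDs.txt

# Sanity checks: one ID per record, and every extracted line starts with @.
n_ids=$(awk 'END { print NR }' ReadIDs.txt)
n_bad=$(grep -vc '^@' ReadIDs.txt || true)   # grep exits 1 when the count is 0

echo "extracted $n_ids IDs, $n_bad without a leading @"
cat ReadIDs.txt
```

Because the selection depends only on line position, a quality string that happens to start with @ is never mistaken for a header. If the second check reports a nonzero count on real data, the file probably has wrapped or missing lines, and the four-line assumption does not hold.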