Extract the DNA sequence header!
8.6 years ago
fufuyou ▴ 110

I have a question about our DNA sequence data. If a record in the sequence data looks like this:

@HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA
TATGGGTTTCCACGGAGCACAGTGCCTAGTGCTCACTCCCCAGTTGTATCTTATTTTTCAGGTCAGCAGGTCGGGCCGGGAGTGTGACATGACGGAGCAGA
+
CCCFFFDDHHHHHJJGIJJJJJHIJJIJJHIJIJIJJJJJJIIIIJGIIBFHHFHJJJG>FHIJIGIIEHAHBBAB@BDBDD<?ACA>CDDDDDD5<BBD?.

With my code, I can extract the sequence identifier, @HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA.

But if a record looks like this:

@HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA
AATACTTGTACGAGGGTGTTTTGCCACACCATATCTCATAAGGTGTGTTGGGTACATCTTTACTTGTCATTCTATTCAAAATATGTGTTGTTGTTTC
+
@@@ADD?DH8FH1CGG2A<F@FH?@?FC1DFGEDB9?BFHHIF?8?DBC=FB5@CDA;@)=.))..).;;B@B?@>>BDCCCCCD>B;?=5??<?CC

I cannot extract the sequence identifier.

So I think the problem is in the sequence data itself. In the second record, the quality line starts with @, and the identifier line also starts with @. The code therefore cannot tell the two apart and fails to extract the correct sequence identifier from our data.
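
To make the problem concrete, here is a minimal sketch. It assumes the code picks out identifier lines by their leading @ (that is only a guess at how such code might work), and it uses a made-up two-record file with shortened reads:

```shell
# Hypothetical two-record FASTQ; the quality line of read2 starts with "@".
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1 1:N:0:ACTTGA
ACGT
+
CCCF
@read2 1:N:0:ACTTGA
TTGA
+
@@@A
EOF

# Count lines that begin with "@": this also catches read2's quality line.
at_lines=$(grep -c '^@' "$fq")
echo "lines starting with @: $at_lines"   # 3, although the file holds only 2 reads
```

So any approach that relies only on the first character will report one identifier too many for this file.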

I want to extract the sequence identifiers from my sequence data. The identifier format is @HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA. Could you help me do this?

Thanks,
Fuyou

genome fastq • 1.9k views

Goutham and Ashutosh,

Thanks, it is working.

8.6 years ago

You just need to print the read name, which is the first line of every four-line record in FASTQ format. Something like:

awk '{if (NR%4==1) print}'

Or just

awk 'NR%4==1'
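
For anyone who wants to try this on a small file first, here is a rough sketch. The temporary file and the shortened reads are made up for illustration, and it assumes strict four-line records (no sequence or quality strings wrapped over several lines):

```shell
# Hypothetical input with shortened reads; the second quality line begins with "@".
fq=$(mktemp)
cat > "$fq" <<'EOF'
@HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA
TATGGGTTTC
+
CCCFFFDDHH
@HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA
AATACTTGTA
+
@@@ADD?DH8
EOF

# Keep line 1 of every 4-line record, whatever the other lines start with.
ids=$(awk 'NR%4==1' "$fq")
printf '%s\n' "$ids"

# Sanity check: count selected lines that do NOT start with "@" (should be 0).
bad=$(printf '%s\n' "$ids" | awk '!/^@/ {n++} END {print n+0}')
echo "header lines not starting with @: $bad"
```

Because the selection depends only on the line number, a quality line that happens to start with @ is never picked up. If the sanity check reports anything other than 0, the file probably does not follow the four-line layout.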
8.6 years ago
cat Input.fastq | paste - - - - | cut -f1 > ReadIDs.txt

Goutham's solution, which uses only awk, should be much faster.
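
As a quick check that both approaches give the same read IDs, here is a small sketch on a made-up file with shortened reads. It assumes four-line records and headers that contain no tab characters, since cut -f1 would truncate a header at the first tab:

```shell
# Hypothetical input; same layout as the records in the question, reads shortened.
fq=$(mktemp)
cat > "$fq" <<'EOF'
@HWI-ST1234:136:C5F6VACXX:6:1101:4121:2231 1:N:0:ACTTGA
TATGGGTTTC
+
CCCFFFDDHH
@HWI-ST1234:136:C5F6VACXX:6:1101:4295:2242 1:N:0:ACTTGA
AATACTTGTA
+
@@@ADD?DH8
EOF

# paste joins each group of four lines with tabs; cut keeps the first field.
via_paste=$(paste - - - - < "$fq" | cut -f1)

# awk keeps every line whose number leaves remainder 1 when divided by 4.
via_awk=$(awk 'NR%4==1' "$fq")

if [ "$via_paste" = "$via_awk" ]; then
    echo "same read IDs"
else
    echo "outputs differ"
fi
```

Reading the file with a redirect (< file) also avoids the extra cat process.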
