Counting specific lines that don't contain a specific word
4.0 years ago
Bioinfo ▴ 20

I have a question. I have a file like this:

@HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;@?++8A?;C;F92+2A@19:1*1?DDDECDE?B4:BDEEI
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
@HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
@C@FF?EDGFDHH@HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
@HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ

I'm only interested in the header lines, which normally start with '@HWI', and I want to count the header lines that do not start with '@HWI'. In the example shown, the result would be 1 because there is one header line that starts with '@BBBB'.

To be clearer: the file consists of repeating four-line records, and I just want to know how many of those records have a first line that does not start with '@HWI'. I hope that is clear enough. Please tell me if you need more clarification.

Assembly alignment sequencing • 886 views

Counting specefic lines that dosen't conaint specefic word → Counting specific lines that don't contain specific word

Please take another look at your title.


What have you tried?

You have already asked several questions on this site. Please review the correct answers and validate them (green mark on the left).

4.0 years ago

Given test input:

$ cat test.fq
@HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;@?++8A?;C;F92+2A@19:1*1?DDDECDE?B4:BDEEI
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
@HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
@C@FF?EDGFDHH@HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
@HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ

Then you can use awk to select the first line of each FASTQ record (NR % 4 == 1), keep it only if it does not start with @HWI, and count the lines that remain with wc -l:

$ awk '((NR % 4 == 1) && ($1 !~ /^@HWI/))' test.fq | wc -l
       1

If you want to see what those lines are, remove wc -l:

$ awk '((NR % 4 == 1) && ($1 !~ /^@HWI/))' test.fq
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
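
If a real file turns up many different prefixes, a per-prefix tally may be easier to read than a long list of headers. Here is a minimal sketch, assuming the part of the header before the first colon is the prefix you care about. The function name tally_prefixes is made up for illustration and is not part of any tool:

```shell
# Hypothetical helper: count FASTQ header lines by the text before the first ':'.
# Assumes standard four-line records, like the awk one-liners above.
tally_prefixes() {
    awk 'NR % 4 == 1 { split($1, parts, ":"); count[parts[1]]++ }
         END { for (p in count) print count[p], p }' "$1" |
        LC_ALL=C sort -k1,1nr -k2,2   # largest count first, then by prefix
}
```

Running tally_prefixes test.fq on the test input above should print "3 @HWI-ST273" followed by "1 @BBBB-ST273".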

This assumes all your FASTQ records are in standard, four-line blocks.
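
If you are not sure that assumption holds, a quick sanity check along these lines may help before trusting the count. This is only a sketch: check_four_line_fastq is a hypothetical name, and it only checks that every first line starts with @, every third line starts with +, and the total line count is a multiple of 4.

```shell
# Hypothetical helper: exit non-zero if the file does not look like
# strict four-line FASTQ (header '@', sequence, '+' separator, quality).
check_four_line_fastq() {
    awk 'NR % 4 == 1 && !/^@/  { print "line " NR ": expected @ header"; bad = 1 }
         NR % 4 == 3 && !/^\+/ { print "line " NR ": expected + separator"; bad = 1 }
         END {
             if (NR % 4 != 0) { print "line count " NR " is not a multiple of 4"; bad = 1 }
             exit bad + 0
         }' "$1"
}
```

For example, check_four_line_fastq test.fq && echo OK prints OK only when no problem is reported. A file with sequences wrapped over several lines would fail this check, and the NR % 4 approach would then give wrong counts.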


Would it be better to anchor the regex using ^?


Ah yes, will fix that now.
