Question: How can I keep the changes in my FASTQ file when I use grep and sed to edit it?
7 months ago, Zeason (0) wrote:

I just want to modify the read IDs in my FASTQ file. I use grep to select the IDs I want to edit, then sed to make the change, but my original FASTQ file is not changed. Here is my command:

cat test.fastq |grep '^@.*/1'| sed 's/@/@ILUMINA/g'

How can I solve this? Thanks a lot!

modified 7 months ago by Malcolm.Cook (1.0k) • written 7 months ago by Zeason (0)

Never use '@' as the signal that a line is a header, because '@' is also a valid character in the FASTQ quality string.

written 7 months ago by Pierre Lindenbaum (121k)
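To see why '@' alone is not a safe marker, here is a small sketch using a single made-up record (the read name, bases, and qualities are invented for illustration) whose quality line happens to begin with '@' and to contain '/1':

```shell
# Build a one-record FASTQ file; the data is invented for illustration.
tmpdir=$(mktemp -d)
printf '%s\n' '@307/1' 'ACGTACGTA' '+' '@IIIIII/1' > "$tmpdir/demo.fastq"

# Count the lines that the grep pattern from the question treats as headers.
matches=$(grep -c '^@.*/1' "$tmpdir/demo.fastq")
echo "lines matched: $matches"
```

The file holds one read, yet the pattern matches two lines: the header and the quality line.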

Simply run the sed command on your original file to modify it, omitting the grep part.

Keep in mind, though, that the original will then no longer exist (as you will have changed it). A better approach might be to redirect the output to a new file:

cat test.fastq | sed 's/@/@ILUMINA/g' > some-new_file

This might not be restrictive enough, though, as it will also change all other occurrences of '@'.

modified 7 months ago • written 7 months ago by lieven.sterck (5.5k)
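A minimal sketch of both points, using an invented one-record file: sed writes its result to standard output, so the input file stays as it was unless you redirect the output somewhere, and an unrestricted s/@/.../g also rewrites a quality line that contains '@'.

```shell
# Invented one-record file whose quality line starts with '@'.
tmpdir=$(mktemp -d)
printf '%s\n' '@307/1' 'ACGTACGTA' '+' '@IIIIII/1' > "$tmpdir/test.fastq"
before=$(cksum < "$tmpdir/test.fastq")

# sed prints the edited text; the redirect stores it in a new file.
sed 's/@/@ILLUMINA/g' "$tmpdir/test.fastq" > "$tmpdir/renamed.fastq"

# The input file has the same checksum as before the edit.
after=$(cksum < "$tmpdir/test.fastq")
[ "$before" = "$after" ] && echo "original unchanged"

# Line 4 (the quality line) was edited too, which corrupts the record.
quality=$(sed -n '4p' "$tmpdir/renamed.fastq")
echo "$quality"
```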

Re: "simply run the sed command": note that you must pass -i to modify the file in place (assuming GNU sed).

modified 7 months ago • written 7 months ago by Malcolm.Cook (1.0k)

That's not really something I would advise for novice users. It's a great way to lose your input data.

written 7 months ago by WouterDeCoster (40k)

Agreed. Don't use the -i switch unless you're really sure what the sed command does and that you won't need the unmodified content later.

written 7 months ago by RamRS (22k)
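If you do decide to edit in place, GNU sed can keep a backup copy: a suffix attached directly to -i (here .bak, an arbitrary choice) makes sed save the original under that name. A small sketch on an invented record; the 1~4 address (GNU sed) restricts the edit to lines 1, 5, 9, and so on, and the ILLUMINA: prefix is only an example.

```shell
# Invented one-record file whose quality line starts with '@'.
tmpdir=$(mktemp -d)
printf '%s\n' '@307/1' 'ACGTACGTA' '+' '@IIIIII/1' > "$tmpdir/test.fastq"

# -i.bak edits test.fastq in place and keeps the original as test.fastq.bak.
sed -i.bak '1~4s/^@/@ILLUMINA:/' "$tmpdir/test.fastq"

head -n 1 "$tmpdir/test.fastq"      # edited header
head -n 1 "$tmpdir/test.fastq.bak"  # header from the untouched backup
```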

I just want to change the ID of each of my reads. I think the way you recommend will change the quality lines as well. The grep '^@.*/1' in my command just restricts the lines I change to the ID lines in my FASTQ file. Anyway, thanks a lot.

written 7 months ago by Zeason (0)

Your grep command wouldn't have solved that issue anyway, as it would still match a quality line that begins with @ and happens to contain /1 (both '/' and '1' are valid quality characters).

written 7 months ago by jrj.healey (13k)

Out of curiosity: why do you want to add "ILLUMINA" to every header?

written 7 months ago by h.mon (26k)

It's just an example; I just want to prefix the ID, because the stupid sequencing company gave me paired-end FASTQ files whose IDs look like this: @307/1. With those IDs I can't run MarkDuplicates in GATK, which really makes me mad :(

written 7 months ago by Zeason (0)

And adding "ILLUMINA" to the headers will make markduplicate work? Are you referring to Picard MarkDuplicates? I thought it was supposed to work on bam files, not on fastq files.

Did you ask the sequencing company why the headers are like this? Illumina headers follow a different naming convention.

modified 7 months ago • written 7 months ago by h.mon (26k)

Because Picard just told me "Value was put into PairInfoMap more than once", and when I looked for a solution on the net, I found someone saying that this error occurs when some lane IDs in the FASTQ file are repeated. So I just want to edit the read IDs to solve it. This really does solve the problem, at least for now. Maybe the way you suggest works well, but I don't know how to do it. :(

modified 7 months ago • written 7 months ago by Zeason (0)
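If repeated read names are indeed the cause, one way to check is to list the names that occur more than once. This is only a sketch, assuming a well-formed file with four lines per record; the three records below are invented, and the check is not something Picard itself provides.

```shell
# Invented three-record file in which the name @307/1 appears twice.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@307/1' 'ACGT' '+' 'IIII' \
  '@308/1' 'TTGA' '+' '@III' \
  '@307/1' 'GGCA' '+' 'IIII' > "$tmpdir/test.fastq"

# Keep lines 1, 5, 9, ... (the headers), then report names seen more than once.
dups=$(awk 'NR % 4 == 1' "$tmpdir/test.fastq" | sort | uniq -d)
echo "$dups"
```

Selecting headers by line position rather than by '@' means the quality line '@III' in the second record is not mistaken for a read name.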

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

written 7 months ago by WouterDeCoster (40k)

7 months ago, jrj.healey (13k), United Kingdom, wrote:

To avoid issues with @ in the quality line, as Pierre points out:

 sed '1~4s/^@/@ILUMINA/' file.fastq > edited_file.fastq

And lieven's advice about leaving your original file unmodified is good, so redirect to a new file.

modified 7 months ago • written 7 months ago by jrj.healey (13k)
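As a quick check of that command, here is a sketch on two invented records, one of which has a quality line starting with '@'. The file names follow the command above; the ILLUMINA: prefix is only an example. The 1~4 address requires GNU sed.

```shell
# Two invented records; the first has a quality line that begins with '@'.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@307/1' 'ACGTACGTA' '+' '@IIIIII/1' \
  '@308/1' 'TTGACCGTA' '+' 'IIIIIIIII' > "$tmpdir/file.fastq"

# Edit only lines 1, 5, 9, ... and write the result to a new file.
sed '1~4s/^@/@ILLUMINA:/' "$tmpdir/file.fastq" > "$tmpdir/edited_file.fastq"

cat "$tmpdir/edited_file.fastq"
```

Only the two header lines gain the prefix; the quality line '@IIIIII/1' passes through unchanged.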

I don't know if this is of any particular consequence for what you want to do, but you've missed an L out of ILLUMINA. You may also want to consider changing the substitution to:

s/^@/@ILUMINA:/ since the fields in Illumina header lines are ':'-delimited, and this might make it easier to separate out the string later on.

Use at your own risk, though, as messing with the FASTQ headers is liable to break other programs.

written 7 months ago by jrj.healey (13k)

Can you explain the meaning of "1~4" to me? Thanks a lot.

written 7 months ago by Zeason (0)

1~4 is what sed calls an 'address'. The first~step form (a GNU sed extension) says: starting on the 1st line, and on every 4th line thereafter (~4), make the substitution defined in s/.../.../. This way sed ignores the quality line even if it has an @ at the start.

written 7 months ago by jrj.healey (13k)
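A small way to see which lines a first~step address selects is to print them from a numbered stream. This sketch assumes GNU sed and uses seq purely as a stand-in for a 12-line (three-record) file:

```shell
# Lines selected by 1~4: the positions of FASTQ header lines.
headers=$(seq 1 12 | sed -n '1~4p' | paste -sd' ' -)
echo "$headers"

# Lines selected by 2~4: the positions of the sequence lines.
sequences=$(seq 1 12 | sed -n '2~4p' | paste -sd' ' -)
echo "$sequences"
```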

You are such a nice person, thank you very much! I think I should buy a more advanced book rather than a basic one to study Linux commands. Thanks a lot again!

written 7 months ago by Zeason (0)

You don't even really need a book. All you need is Google and:

a well-formulated question.

For example, take this question. Once you really think about what needs to happen, you need to process all lines starting with "@", right? Well, no: as Pierre and others mention, we can't rely on @. So we need to think about the problem another way.

What else do we know about the FASTQ format? Well, every entry is always 4 lines (assuming the file isn't malformed, but if it is, you have other, bigger problems). So all we really need to do is "edit every nth line of a file (with sed)". And this right here is your Google search phrase.

The first result that search returns is:

https://superuser.com/questions/396536/how-to-keep-only-every-nth-line-of-a-file

Now the title of that thread might not seem immediately relevant, but it is. You’ve just found out the magic of how to edit every nth line, now you need only combine that with what you already know about how sed works (i.e. the substitution part) and you’re done!

modified 7 months ago • written 7 months ago by jrj.healey (13k)
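The same "every nth line" idea can also be written with awk, which may help where GNU sed's 1~4 address is not available. This is an alternative sketch, not the approach from the answers above; the record and the ILLUMINA: prefix are invented for illustration.

```shell
# Invented one-record file whose quality line starts with '@'.
tmpdir=$(mktemp -d)
printf '%s\n' '@307/1' 'ACGTACGTA' '+' '@IIIIII/1' > "$tmpdir/test.fastq"

# NR % 4 == 1 is true on lines 1, 5, 9, ...; sub() edits only those lines,
# and the trailing 1 is an always-true pattern that prints every line.
awk 'NR % 4 == 1 { sub(/^@/, "@ILLUMINA:") } 1' "$tmpdir/test.fastq" > "$tmpdir/edited.fastq"

cat "$tmpdir/edited.fastq"
```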

OK, I got it :) I will try the way you recommend. Thanks a lot.

written 7 months ago by Zeason (0)

Always start with the basics... there is a reason why they call it 'basic' ;) Once you've got the hang of that, you can move on to the 'advanced' stuff.

written 7 months ago by lieven.sterck (5.5k)

Thanks, I will take it step by step.

written 7 months ago by Zeason (0)

I will try it, and thanks a lot :) Maybe my question is really stupid, but I really struggled with it, because I am new to Linux.

written 7 months ago by Zeason (0)

7 months ago, Malcolm.Cook (1.0k), Kansas, USA, wrote:

I understand "keep the change in my fastq file" to mean precisely the opposite of "leaving your original file unmodified", to wit:

GNU sed provides the -i option to apply the edit in place:

sed -i '1~4s/^@/@ILLUMINA/' test.fastq

Perl does too, allowing:

perl -p -i -e  's/^@/@ILLUMINA/ unless $i++%4 ' test.fastq

TIP:

Useful to know, but not needed here, is the sponge command from moreutils, which can be used to perform in-place edits with any command, even one that does not support -i. Example:

anyCommand test.fastq | sponge test.fastq

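Here sponge soaks up all of its input before it opens test.fastq for writing, so the file is not truncated while anyCommand is still reading it. Note that sponge does not check anyCommand's exit status, so make sure the command works before overwriting your only copy.

modified 7 months ago • written 7 months ago by Malcolm.Cook (1.0k)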
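Where sponge is not installed, a common substitute is to write to a temporary file and move it over the original only if the command succeeded. This is a sketch of that pattern, not part of moreutils; the .tmp file name and the record are invented for illustration, and the 1~4 address assumes GNU sed.

```shell
# Invented one-record file whose quality line starts with '@'.
tmpdir=$(mktemp -d)
printf '%s\n' '@307/1' 'ACGTACGTA' '+' '@IIIIII/1' > "$tmpdir/test.fastq"

# Write the result to a temporary file first; the && ensures the original
# is replaced only when sed exits without error.
sed '1~4s/^@/@ILLUMINA:/' "$tmpdir/test.fastq" > "$tmpdir/test.fastq.tmp" \
  && mv "$tmpdir/test.fastq.tmp" "$tmpdir/test.fastq"

head -n 1 "$tmpdir/test.fastq"
```

If the command fails, test.fastq is left as it was and the partial .tmp file can be inspected or deleted.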

That is what OP asked, but I specifically didn’t offer up the -i flag because I think OP should be told that it is a bad idea (generally).

written 7 months ago by jrj.healey (13k)

Thank you very much, I will try it.

written 7 months ago by Zeason (0)