Question

How to concatenate four FASTQ files from a NextSeq run in Linux?

0

4.3 years ago

harshraje19 ▴ 40

Hi everyone, I am new to bioinformatics. I have generated RNA-Seq data; the sequencing was run on a NextSeq with single-end reads. I have 4 FASTQ files per sample and therefore need to combine them into one file. I used the cat command in Linux to combine the four files, but when I use the combined file for mapping, the mapper throws an error saying that the file is not in proper FASTQ format. Does anyone have experience combining these files and using them for mapping? Or can I map these files separately?

Thank you.
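
When a mapper rejects a concatenated FASTQ file, a quick structural check of each input and of the merged file can narrow the problem down. The following is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record; the function name check_fastq is made up for illustration and is not part of any tool mentioned in this thread. One thing it would catch is an input file that lacks a trailing newline, which makes cat glue its last quality line onto the next file's first header.

```shell
#!/bin/sh
# check_fastq FILE
# Rough structural check for an uncompressed, 4-line-per-record FASTQ file.
# Prints the first problem found and returns non-zero; returns 0 if none is found.
# (check_fastq is a hypothetical helper name, not an existing tool.)
check_fastq() {
    awk '
        # Line 1 of each record must start with "@", line 3 with "+".
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR; bad = 1; exit
        }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR; bad = 1; exit
        }
        # Sequence and quality strings must have the same length.
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
        }
        END {
            if (!bad && NR % 4 != 0) {
                printf "truncated record: %d lines is not a multiple of 4\n", NR
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Example call on the lane files named in this thread (not executed here):
#   for f in P3_HYMHJBGX2_TTACCGAC_L00[1-4]_R1.fastq; do
#       check_fastq "$f" || echo "problem in $f"
#   done
```

The check goes by line position rather than by searching for "@", because a quality string may itself begin with "@".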

RNA-Seq software error alignment assembly genome • 3.0k views

ADD COMMENT • link 4.2 years ago by harshraje19 ▴ 40

1

Is this from a single run (I mean a single = 1 run, not single-end) or was it four different runs?

Please show the output of head -n 4 file.fastq for every file. Please also say how the files were labelled when you received them.

ADD REPLY • link 4.3 years ago by ATpoint 82k

0

All four files are from a single run.

The following is the head -5 output for all four files:

harshraj@harshraj-XPS-8930:~/Propriety_Carlos_GrapeVineRNASeq/AGRF_CAGRF15892_HYMHJBGX2_20170821$ head -5 P3_HYMHJBGX2_TTACCGAC_L001_R1.fastq 
@NS500468:254:HYMHJBGX2:1:11101:14046:1054 1:N:0:TTACCGAC
GGATCNTGGCAGCAAGGCCACTCTGCCACTTACAATACCCCGTCGCGTAATTAAGTCGTCGGCAAAGGATTCTAA
+
AAAAA#/EEEEEE6/AEEAEEEEE/EEAE<EEEE/EA/EE/EEEEE/A//EEE66</EA<///E/E/A//EA/E/
@NS500468:254:HYMHJBGX2:1:11101:3000:1055 1:N:0:TTACCGAC
harshraj@harshraj-XPS-8930:~/Propriety_Carlos_GrapeVineRNASeq/AGRF_CAGRF15892_HYMHJBGX2_20170821$ head -5 P3_HYMHJBGX2_TTACCGAC_L002_R1.fastq 
@NS500468:254:HYMHJBGX2:2:11101:10808:1046 1:N:0:TTACCGAC
CCCAANTCTGCATTGTTGATGCTTTTAGCACATGTAACTGCAGCATCATGAGTGTCAAAAGTGACAAAACCAAAA
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEE/EEEEEE
@NS500468:254:HYMHJBGX2:2:11101:18260:1048 1:N:0:TTACCGAC
harshraj@harshraj-XPS-8930:~/Propriety_Carlos_GrapeVineRNASeq/AGRF_CAGRF15892_HYMHJBGX2_20170821$ head -5 P3_HYMHJBGX2_TTACCGAC_L003_R1.fastq 
@NS500468:254:HYMHJBGX2:3:11401:17885:1021 1:N:0:TTACCGAC
NTTAATCGACCAACACCCTTTGTGGGTTCTAGGTTAGCGCGCAGTTGGGCACCGTAACCCGGCTTCCGGTTCCTC
+
#AAAAEEEEEAEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEAEEE//EE
@NS500468:254:HYMHJBGX2:3:11401:18904:1021 1:N:0:TTACCGAC
harshraj@harshraj-XPS-8930:~/Propriety_Carlos_GrapeVineRNASeq/AGRF_CAGRF15892_HYMHJBGX2_20170821$ head -5 P3_HYMHJBGX2_TTACCGAC_L004_R1.fastq 
@NS500468:254:HYMHJBGX2:4:11401:18851:1025 1:N:0:TTACCGAC
NAGACATTGATAGACAAGAAGGCTTGGCCATATGTCCAGATGGATCTCCGTCTCAAGGCAGAATATCTCCGGTAC
+
#AA/6EEEEEEEEEEAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE<EEEEE

ADD REPLY • link updated 4.2 years ago by GenoMax 141k • written 4.2 years ago by harshraje19 ▴ 40

1

If this is paired end data then you need to cat the files in exactly the same order for both R1 and R2 reads.
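
The point about ordering can be sketched as follows. This is only an illustration, assuming uncompressed files that follow the PREFIX_L00N_READ.fastq naming shown in this thread; merge_lanes is a made-up helper name, and the R2 files are hypothetical here because the original data are single-end. Shell globs expand in sorted order, so L001 to L004 are concatenated in the same order for both reads.

```shell
#!/bin/sh
# merge_lanes PREFIX READ OUT
# Concatenate PREFIX_L001_READ.fastq ... PREFIX_L004_READ.fastq into OUT.
# The glob expands in sorted order, so R1 and R2 get the same lane order.
# (merge_lanes is a hypothetical helper name, not an existing tool.)
merge_lanes() {
    prefix=$1
    read_tag=$2
    out=$3
    # Put the matching lane files into the positional parameters.
    set -- "${prefix}"_L00[1-4]_"${read_tag}".fastq
    # An unmatched glob stays literal, so check that the first entry exists.
    [ -e "$1" ] || { echo "no lane files found for ${prefix} ${read_tag}" >&2; return 1; }
    printf '%s\n' "$@" >&2    # report the order that will be used
    cat "$@" > "$out"
}

# Example calls (not executed here; the R2 files are hypothetical):
#   merge_lanes P3_HYMHJBGX2_TTACCGAC R1 P3_R1.merged.fastq
#   merge_lanes P3_HYMHJBGX2_TTACCGAC R2 P3_R2.merged.fastq
```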

ADD REPLY • link 4.2 years ago by GenoMax 141k

0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLY • link 4.2 years ago by GenoMax 141k

0

If you are a cat person, use cat in *nix; if you are a reptile person, use the merge_fastq library.

ADD REPLY • link updated 4.2 years ago by Ram 43k • written 4.2 years ago by cpad0112 21k

0

Dear friends, thank you for your suggestions.

I mapped the FASTQ files of all four lanes to the genome separately, and then used the four resulting SAM files together as input to htseq-count. htseq-count produced the four per-lane gene counts in one text file, and in Excel I calculated: gene count of lane 1 + gene count of lane 2 + gene count of lane 3 + gene count of lane 4 = total gene count.

Any thoughts on this?

Thank you everyone.
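
The per-lane summation described above can also be done on the command line instead of in Excel. This is a minimal sketch, assuming a tab-separated file with the gene ID in the first column and one count column per lane; the function name sum_lane_counts and the file name counts.txt are made up for illustration.

```shell
#!/bin/sh
# sum_lane_counts FILE
# For each row, add up every column after the first and print "ID<TAB>total".
# (sum_lane_counts is a hypothetical helper name, not an existing tool.)
sum_lane_counts() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             total = 0
             for (i = 2; i <= NF; i++) total += $i
             print $1, total
         }' "$1"
}

# Example call (not executed here; counts.txt is a hypothetical file name):
#   sum_lane_counts counts.txt > counts.total.txt
```

Rows whose ID starts with "__" (for example __no_feature) are htseq-count's summary counters; they are summed like any other row here and may need to be filtered out before downstream analysis.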

ADD REPLY • link 4.2 years ago by harshraje19 ▴ 40

0

Please don't post unrelated questions as answers in the original thread.

ADD REPLY • link 4.2 years ago by GenoMax 141k

0

Please read my question and response carefully.

ADD REPLY • link 4.2 years ago by harshraje19 ▴ 40

1

Your new post is a question that is an offshoot of your original question, and hence needs to be its own post. You also posted it as an answer, which is the wrong place for anything that is not an answer to the top-level question. genomax is right to point out both.

It looks like your question and response were indeed carefully read.

ADD REPLY • link 4.2 years ago by Ram 43k

0

@harshraje19 I am not sure that is how I would handle it. For a given sample, most aligners allow the user to supply multiple read files and align them together to produce a single alignment file (SAM/BAM), followed by quantification per sample. But everyone has their own way of doing the analysis.
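
As an illustration of supplying several read files at once: some aligners take the files as one comma-separated list, so a small helper can build that list from the lane files. This is a sketch only. join_by_comma is a made-up name, and the HISAT2 call in the comment is an assumption about which aligner is in use (the thread does not say), with grape_index as a hypothetical index name; check your aligner's manual for how it expects multiple inputs.

```shell
#!/bin/sh
# join_by_comma FILE...
# Print the arguments joined by commas, e.g. "a.fastq,b.fastq".
# (join_by_comma is a hypothetical helper name, not an existing tool.)
join_by_comma() {
    list=""
    for f in "$@"; do
        # Prepend a comma only when the list already has an entry.
        list="${list:+${list},}${f}"
    done
    printf '%s\n' "$list"
}

# Example (not executed here; assumes HISAT2 and a hypothetical index "grape_index"):
#   hisat2 -x grape_index \
#          -U "$(join_by_comma P3_HYMHJBGX2_TTACCGAC_L00[1-4]_R1.fastq)" \
#          -S P3.sam
```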

ADD REPLY • link 4.2 years ago by cpad0112 21k

0

Thank you everyone for taking time to answer the question.

ADD REPLY • link 4.2 years ago by harshraje19 ▴ 40

0

Please stop adding answers. This is the third time in this post that you're being asked to stop. If you continue adding answers, your account might be suspended for a period of time.

ADD REPLY • link 4.2 years ago by Ram 43k

score 1 · Answer 1 · 2020-01-28

1

4.2 years ago

GokalpC ▴ 100

It is better not to merge/cat FASTQ data from different lanes. Process the lanes as individual files up to the duplicate-marking stage (or another stage at which BAM files can be merged), and then merge the BAM files, giving each lane its own read group name.
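
One way to get per-lane read group names is to derive them from the FASTQ headers, which in the files shown in this thread have the form @instrument:run:flowcell:lane:... The sketch below builds a tab-separated @RG header line from the first read of a file. The function name read_group_line is made up, and the choice of flowcell.lane for both ID and PU is an assumption for illustration, not a requirement of any tool.

```shell
#!/bin/sh
# read_group_line FASTQ SAMPLE
# Print a tab-separated SAM @RG header line built from the first read header.
# Header fields split on ":" are: $1 = @instrument, $2 = run, $3 = flowcell, $4 = lane.
# (read_group_line is a hypothetical helper name; ID/PU = flowcell.lane is an assumption.)
read_group_line() {
    fastq=$1
    sample=$2
    head -n 1 "$fastq" | awk -F: -v sm="$sample" '{
        printf "@RG\tID:%s.%s\tPU:%s.%s\tSM:%s\tPL:ILLUMINA\n", $3, $4, $3, $4, sm
    }'
}

# Example call (not executed here):
#   read_group_line P3_HYMHJBGX2_TTACCGAC_L001_R1.fastq P3
```

How the read group is passed to the aligner differs between tools, so check the relevant manual. Once each lane's BAM file carries its own read group, the files can be combined with a tool such as samtools merge.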

ADD COMMENT • link 4.2 years ago by GokalpC ▴ 100

0

From what I have heard from people in core facilities who tested this on their own machines, lane replicates typically show little to no batch variation. Processing the lanes independently is indeed safer, but I think it simply increases the workload and is unnecessary, IMHO.

ADD REPLY • link 4.2 years ago by ATpoint 82k

0

I agree. I usually run lane-specific FastQC and adapter trimming, then combine the FASTQ files. Our sequencing facility splits samples across lanes, which counters any lane-specific batch effect as well.

ADD REPLY • link 4.2 years ago by Ram 43k

score 0 · Answer 2 · 2020-01-27

0

4.2 years ago

ATpoint 82k

So it seems these are simply different lanes, therefore please see:

How do concatenate different fasta file

ADD COMMENT • link 4.2 years ago by ATpoint 82k