Question: How to extract the first and last N bases from a read in a fastq file?
0
gravatar for alhamidi.reem
13 months ago by
alhamidi.reem10 wrote:

how can I extract the first and last N bases from a read in a fastq file?

I have used the following command to extract the last 1000 bases of a read from a fastq file but I'd also like to incorportate the first 1000 bases to the command as well:

$$  grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | grep -o '.{1000}$'

Also, how can I use the new command for the first and last N bases on a perl script as I have >450 reads in a fastq file?

Many thanks,

Any help will be appreciated.

bioinformatics • 505 views
ADD COMMENTlink modified 13 months ago by genomax83k • written 13 months ago by alhamidi.reem10

If you want to use perl (or python), I would suggest parsing the file 'properly' with the Bio module. Extracting this information will be fairly trivial, and much more robust.

ADD REPLYlink written 13 months ago by Joe16k
1
gravatar for genomax
13 months ago by
genomax83k
United States
genomax83k wrote:

Here is one way (using part of your own solution) :

grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | cut -c 1-1000

OR

grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | sed 's/.//1001g'

Not sure why you are referring to perl in your question since there is no perl involved.

Hint: For things like this search stackoverflow for solutions.

ADD COMMENTlink modified 13 months ago • written 13 months ago by genomax83k

Thank you @genomax.

I was wondering if there's a syntax I could use to extract both the 1st and last 1kb of sequence from each read?

ADD REPLYlink written 5 months ago by alhamidi.reem10

How about this?

zcat MyReads.fastq.gz | head
@1:NM_014620:16:182
GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@2:NM_014620:1094:172
ATGAAAAAAATTCACGTTAGCACGGTGAACCCCAATTATAACGGAGGGGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@3:NM_022658:294:172
TGTACGGGCCCGGCGGCTCGGCGCCCGGCTTCCAGCACGCTTCGCACCAC

bases=10 ;
zcat MyReads.fastq.gz | awk -v len="$bases" -F "" '{if (NR % 4 == 2) {for (i=1; i<=NF; i++) {if (i <= len || (NF-i) < len) {printf $(i)}}; printf "\n"}}' | head
GTTCCGAGCGCTAACCATTGC
ATGAAAAAAAAACGGAGGGGA
TGTACGGGCCCTTCGCACCAC
ACTAGATGTAAATGAAGAAAG
GCGTGAGGAGGGCAGCGCTGC
GGAAGGACTCTGGGGCAAGGA
CTACCTATAGATGCGGTTTTG
AAGGGACTGTGGAGGCGTCTC
ATAACATCCATCTGTTTTGTT
CCAACCTGCCCCCGGGGAGCC

In this example, I am just taking the first and last 10 bases. To retrieve 1000bp, change the value of bases before the awk command

ADD REPLYlink written 5 months ago by Kevin Blighe59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2126 users visited in the last hour