Hi there, I'm new to bioinformatics tools and I need some help. Here is what I have done so far :)
I have a Trinity assembly and I wanted to know the GC content of each contig, so I generated a file with the base counts per contig and imported it into R.
Then I calculated the GC content of all my contigs in R (stored in a column called CG) like this:
## Load the data
data <- read.table("CG_content_contig.txt", header = TRUE)
## Create a new variable called CG (fraction of C + G bases per contig)
data$CG <- (data$C + data$G) / (data$A + data$C + data$G + data$T)
## Keep the contigs with a CG content > 23% (compare against the number 0.23, not the string '0.23')
CG_content_more_than_23 <- data[which(data$CG > 0.23), ]
## Keep only the names of the contigs
name_contigs <- CG_content_more_than_23$chr
## Write the names to a .txt file, one per line, without quotes or a header line
write.table(name_contigs, file = "name_contigs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = '\t')
As you can see, I have created a file listing the contigs with a GC content higher than 23%. I have uploaded this file (name_contigs.txt) to my server so that I can work with it there. It looks like this:
TRINITY_DN134693_c0_g1_i1
TRINITY_DN109669_c0_g1_i1
TRINITY_DN109679_c0_g1_i1
TRINITY_DN114999_c0_g1_i1
TRINITY_DN114910_c0_g1_i1
I also have a .fq file (the records inside are in FASTA format, with ">" header lines) that looks like this:
>TRINITY_DN134617_c0_g1_i1
AATAAAAATAAATAAAAATCAATAAAAATATTATAATACAATATAATATAAAATAATATAAAAATTCTACAATAAGAATAAAGTATAATTTTTTAGATTATAAGAGGATATGTTAATACATAGTATTCTGTTTGTTATTGTAGAAAAAACATACAGAAACTTTTTGTATATATAGTCTCATTTTATATATATAAATAAAAATGAACATTAATGAAATGAAATTAAGAGTCGTTTTATTAAAAATAGCTATAAAAAATAACAACA
>TRINITY_DN134643_c0_g1_i1
GCATGGTAGTAAAGTATAATGACATAGCAAAAATATTTAAAATAAAAAAAAATTACTATTATAATTTTTTCTGTATAACATAAACGTTTTTAATGATATTATATTAATTACATATAAAAATAGCATAATAAAAATATTTAGTTATAAAATTTATTATTTTATTTTTTTTTTTTTGTTATATACTTTCTCAGAACATTAATTTGTCATCAGTTCTATTATATTGATAAACTATTCAATTGCTTTAATA
What I want to do is keep only the contigs from the .fq file that are NOT listed in the .txt file.
Should I build a pipeline around the grep command for this?
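For reference, here is a minimal sketch of one way this might be done on the server. It uses awk rather than grep, because grep -v works line by line and would only drop the matching header lines while leaving their sequences behind. The function name filter_fasta and the file names in the usage note are hypothetical, and the sketch assumes the contig name is the first word after ">" on each header line.

```shell
#!/bin/sh
# filter_fasta EXCLUDE_LIST FASTA
# Prints every FASTA record whose name is NOT listed in EXCLUDE_LIST.
# (filter_fasta is a hypothetical helper name, not an existing tool.)
filter_fasta() {
    awk '
        # First file: remember each contig name to exclude.
        # Carriage returns and double quotes are stripped in case the
        # list has Windows line endings or was written with quoting.
        FILENAME == ARGV[1] {
            gsub(/[\r"]/, "", $0)
            if ($1 != "") exclude[$1] = 1
            next
        }
        # Second file, header line: the name is the first word after ">".
        /^>/ {
            name = substr($1, 2)
            keep = !(name in exclude)
        }
        # Print header and sequence lines of the records being kept.
        keep
    ' "$1" "$2"
}
```

It could then be called as `filter_fasta name_contigs.txt assembly.fq > contigs_gc_23_or_less.fasta`, where assembly.fq and the output name are placeholders. Because whole names are looked up in a table, TRINITY_DN1_c0_g1_i1 would not accidentally match TRINITY_DN10_c0_g1_i1, which is a risk with substring matching.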