Trimming and tabulating fastq reads

Anna has been testing her transposon sequencing pipeline and needed some help processing some of her Illumina reads. In short, she needed to remove the sequenced invariant transposon region (essentially a 5′ adapter sequence), trim the remaining (hopefully genomic) sequence to a reasonable 40 nt, and then tabulate the reads, since there were likely going to be duplicates in there that don’t need to be considered independently. Here is what I did.

# For removing the adapter and trimming the reads down, I used a program called cutadapt. The links below have information on it, followed by how I installed and used it.
# https://cutadapt.readthedocs.io/en/stable/installation.html
# https://bioconda.github.io/

## Run the commands below in Bash (they tell conda where else to look for the program)
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict

## Since my laptop uses an M1 processor, I set CONDA_SUBDIR=osx-64 so conda installs the Intel (osx-64) builds of the packages
$ CONDA_SUBDIR=osx-64 conda create -n cutadaptenv cutadapt

## Activate the conda environment
$ conda activate cutadaptenv

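## Now running this on the actual transposon sequencing file. "-g" gives the 5′ adapter sequence to remove, "-l 40" shortens each remaining read to at most 40 nt, and "-o" names the output file.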
$ cutadapt -g AGAATGCATGCGTCAATTTTACGCAGACTATCTTTGTAGGGTTAA -l 40 -o sample1_trimmed.fastq sample1.assembled.fastq
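One quick way to sanity-check the trimming is to look at the distribution of read lengths in the output: after "-l 40", nothing should be longer than 40 nt. Below is a minimal sketch using only standard shell tools. The function name read_length_counts is my own (it is not part of cutadapt), and it assumes the usual 4-lines-per-record FASTQ layout with unwrapped sequences.

```shell
# Print "length<TAB>number_of_reads" for every read length in a FASTQ file.
# Assumes each record is exactly 4 lines, so the sequence is line 2 of every 4.
# read_length_counts is a hypothetical helper name, not a cutadapt command.
read_length_counts() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" \
        | sort -n \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# Usage, with the file name from the cutadapt step:
# read_length_counts sample1_trimmed.fastq
```

Reads much shorter than 40 nt are worth a look, since they may be cases where little sequence was left after the transposon region was removed.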

This should have created a file called “sample1_trimmed.fastq”. OK, next is tabulating the reads that are there. I used a program called 2FAST2Q for this (the pip package is named fast2q).

## I like to do this in the same cutadaptenv environment, so in case it was deactivated, here I am activating it again.
$ conda activate cutadaptenv

## Installing with pip, which is easy.
$ pip install fast2q

## Now running it on the actual file. I think you have to already be in the directory with the file you want (since you don't specify the file in the command).
$ python -m fast2q -c --mo EC --m 2

## Note: "python -m fast2q -c" runs it on the command line rather than through the graphical interface. "--mo EC" runs it in Extract and Count mode. "--m 2" allows up to 2 mismatched nucleotides.
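
As a rough cross-check on the tabulation, the trimmed reads can also be counted with standard shell tools. This is only a sketch, and it is not what 2FAST2Q does internally: it counts exact duplicates only, with no mismatch tolerance, so the numbers will not necessarily match a "--m 2" run. The function name count_unique_reads is my own, and it again assumes 4-lines-per-record FASTQ.

```shell
# Print "sequence<TAB>count" for each distinct read sequence, most abundant first.
# Exact-match counting only; ties in count are ordered alphabetically by sequence.
# count_unique_reads is a hypothetical helper name, not part of 2FAST2Q.
count_unique_reads() {
    awk 'NR % 4 == 2' "$1" \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $2 "\t" $1 }'
}

# Usage, with the file name from the cutadapt step:
# count_unique_reads sample1_trimmed.fastq > sample1_counts.tsv
```

If the top sequences here broadly agree with the top features in the 2FAST2Q output, that is a decent sign both steps ran on the file you intended.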