Common Plasmidsaurus Errors

OK, so we all know that Plasmidsaurus nanopore sequencing isn’t perfect. Every time I see the mistake at the 5′ end of the IRES sequence, I know to ignore it, but there are a bunch of other ones that I still repeatedly run into (though not quite as frequently), such that I haven’t memorized them and am not sure whether I should be ignoring them right off the bat. Thus, I’m going to keep a list of repeated erroneous calls here on this page so I’m reminded to ignore them in the future.

Visual evidence of individual examples is listed below. But here’s a summary of Plasmidsaurus errors to ignore:

  1. IRES – deletions near the 5′ end
  2. mCherry – W63R or Q114R
  3. mScarlet – errors at R71 (sometimes R71G) and S113/L114 (including L114P)
  4. mKG – L96P
  5. Puromycin resistance gene PAC – R18G or L125P
  6. shBleR – Q56R
  7. Silent or frameshift mutations at the NPGP motif at the 3′ end of the P2A sequence

In fact, be very suspicious of any unexpected L -> P mutation called by Plasmidsaurus sequencing. And maybe Q -> R mutations too.
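Since I apparently need reminding, here’s a minimal Python sketch of that triage logic (not a validated tool; the gene / mutation inputs are hypothetical, the artifact set is just the list above, and the IRES 5′ deletions are left out since they aren’t simple point mutations):

# Minimal sketch of the triage logic above. Artifact set is just the
# list from this post; gene / mutation names are hypothetical inputs.

KNOWN_ARTIFACTS = {
    ("mCherry", "W63R"), ("mCherry", "Q114R"),
    ("mScarlet", "R71G"), ("mScarlet", "L114P"),
    ("mKG", "L96P"),
    ("PAC", "R18G"), ("PAC", "L125P"),
    ("shBleR", "Q56R"),
}

def triage(gene, mutation):
    """Return a note on how to treat a called amino acid change."""
    if (gene, mutation) in KNOWN_ARTIFACTS:
        return "known Plasmidsaurus artifact -- probably ignore"
    # Per the note above, unexpected L->P and Q->R calls are suspicious.
    if mutation[0] == "L" and mutation[-1] == "P":
        return "suspicious L->P call -- verify by Sanger"
    if mutation[0] == "Q" and mutation[-1] == "R":
        return "suspicious Q->R call -- verify by Sanger"
    return "no known artifact -- treat as a real candidate mutation"

print(triage("PAC", "R18G"))      # known artifact
print(triage("mCherry", "K70N"))  # real candidate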

Since almost all of my plasmids have this IRES sequence in them, I almost always run across this error (although it’s usually a 1 nt miscall rather than 2 nt like this example).
This Puromycin R18G error is annoying b/c it looks like it could be really problematic.
I don’t use mScarlet-I all that often, but when I do, Plasmidsaurus sometimes gives me this erroneous L114P call.
Here it gets screwed up in the same area, but mysteriously calls an A insertion, making it an S113fs.
It also has issues with mScarlet-I R71. Sometimes it calls it a silent mutation, but other times it calls it as R71G.
An insertion (which, if true, would make a frameshift) in the NPGP motif toward the 3′ end of P2A.
Puro L125P
mCherry W63R
mCherry Q114R
mKG L96P
shBleR Q56R.

Edit 2/9/24: Here’s another one. A single-nucleotide insertion in the Asp codon at around position 4 or so of the histone H2A protein.

Ordering oligos at CWRU

Here’s a price comparison I did back in 2019 (presumably still correct?). In short, the per-nt price was cheapest through ThermoFisher.

Thus, we’ve been almost exclusively buying oligos from them, with $7,220 spent (as of June 2022) since our first orders in December 2019.

Here’s how the histogram of oligo costs has shaped up.

But, well, don’t order degenerate-nucleotide oligos from them, as they’ll likely be T-biased.

If anyone sees anything better on campus, let me know!

Consistent Plasmidsaurus sequencing miscalls

As I noted in this Twitter exchange, plasmid nanopore sequencing via Plasmidsaurus is great, but not perfect. For example, there seem to be some “Achilles heel” sequences, where nanopore reproducibly (like 100% of the time, across different plasmid submissions) miscalls certain parts of our plasmids. How do we know they’re miscalls? B/c the Sanger traces of the same exact plasmids show the expected sequence very clearly. Here are two that we commonly see:

A single deleted C nucleotide in the beginning of our IRES sequence:

A phantom T>C base miscall that incorrectly tells us we have a W566R nonsynonymous change in every single one of our human ACE2 constructs.

Both are related to C repeats, but there are plenty of other C repeats in the plasmids we submit, and it’s ALWAYS these sequences that give Plasmidsaurus problems. Once I figured this out, it was really NBD, since I now know to ignore these changes, although it did inform our current molecular biology workflow in the lab: 1) Screen colony minipreps via Sanger -> 2) Sequence candidate good constructs with Plasmidsaurus / nanopore -> 3) Sanger to resolve unexpected discrepancies between the expected / intended and Plasmidsaurus sequences.

Command line BLAST

One of the pseudo-projects in the lab requires looking for a particular peptide motif in genomic data. While small scale searches can be done using the web interface, the idea is to do this in a pretty comprehensive / high throughput manner, so shifting to the command line makes sense for this work. I last did this back in 2018 for some preliminary studies, so I’m going to have to re-install the software on my new computer and re-run some of those analyses. I figure I’ll write down my notes as I re-do this, so that I (and others) can use this post as a reference.

Installing BLAST+

The instructions on how to download the program can be found here. I’m on a mac, so I downloaded “ncbi-blast-2.13.0+.dmg” and double clicked and ran the package installer.

Assuming it’s been correctly installed, writing the command …

blastp -task blastp-short -query <(echo -e ">Name\nAAWLIEKGVASAEE") -db nr -remote -outfmt 1

… into the terminal should actually reveal some BLAST-specific output, rather than throw an error.

Running protein motif-specific blast searches

Type the following into your terminal:

psiblast -phi_pattern PHI-Blast_2A_pattern.txt -db nr -remote -query <(echo -e ">Name\nGATNFSLLKQAGDVEENPGP") -max_hsps 1 -max_target_seqs 10000 -out phi_blast_output.csv -outfmt 10

Note: The above command requires a text file specifying the pattern constraint (“PHI-Blast_2A_pattern.txt” above), which can be found here. This should yield a ~25 KB csv output file, like so.
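I won’t reproduce the linked file here, but for reference, PHI-BLAST pattern files use a PROSITE-style syntax with ID and PA lines. A hypothetical pattern file (a stand-in example, not necessarily the contents of the linked one) would look something like:

ID  2A_NPGP_motif
PA  D-[VI]-E-x-N-P-G-P

Here x means any residue and the brackets mean “either of these residues”, so this hypothetical pattern would match the DVEENPGP stretch at the end of the query above.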

Extracting just the accession numbers

I don’t remember if there are other BLAST+ output formats that give you the full hit sequence. If so, the method I ended up taking back in 2018 would be unnecessarily roundabout. But, until I figure that out, I’ll follow the old method. As you can see in the aforementioned output format, it doesn’t output the hit protein sequence, and instead just gives the accession number. Thus, the next step is using the accession number to actually figure out the protein sequence. To do this, we’ll use Entrez Direct. To install Entrez Direct, follow the instructions here. Briefly, type the following into the terminal:

sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

In order to complete the configuration process, execute the following:

echo "source ~/.bash_profile" >> $HOME/.bashrc
echo "export PATH=\${PATH}:/Users/kmatreyek/edirect" >> $HOME/.bash_profile

OK, now that it’s installed, here’s how I’ve used it:

First, the output file above has more info than just the accession numbers. To pare it down to only the accession numbers, I used this script, which can be run by entering the following into the terminal, assuming you have the previous output csv file somewhere in the directory with the script (it can even be in other folders within that directory):

python3 3_Blast_to_accession.py

This will create a file called “3A_prot_accession_list_complete.txt” (example output file here), which will be the uniquified list of accession numbers to give to Entrez Direct. (Uniquifying is important if you have multiple .csv outputs you want to compile into a single master list.)
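For the curious, the gist of that step is something like the following sketch (my rough paraphrase, not the actual 3_Blast_to_accession.py; it assumes the default -outfmt 10 column order, in which the second field is the subject accession):

# Rough paraphrase of the accession-extraction step (not the actual
# 3_Blast_to_accession.py). Assumes default -outfmt 10 columns, where
# field 2 is the subject accession (saccver).
import csv
import glob

accessions = []
for path in glob.glob("**/*.csv", recursive=True):
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) > 1:
                accessions.append(row[1])

# Uniquify while preserving order, in case multiple csv outputs are
# being compiled into a single master list.
unique = list(dict.fromkeys(accessions))

with open("3A_prot_accession_list_complete.txt", "w") as out:
    out.write("\n".join(unique) + "\n")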

This can be fed into Entrez Direct using this shell script, which you can run by typing in:

sh 4_Accession_to_fasta.sh

You should now have an output file called “4A_prot_fasta.txt” with the resulting protein sequences in fasta format, like so.
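As a hedged sketch of what that step boils down to (not the actual 4_Accession_to_fasta.sh, and assuming EDirect’s efetch is installed and on your PATH):

# Sketch of the fetch step (not the actual 4_Accession_to_fasta.sh).
# Assumes EDirect is installed and efetch is on your PATH.
import subprocess

with open("3A_prot_accession_list_complete.txt") as handle:
    accessions = [line.strip() for line in handle if line.strip()]

with open("4A_prot_fasta.txt", "w") as out:
    # Chunk the accession list so each efetch call stays a reasonable size.
    for i in range(0, len(accessions), 100):
        chunk = ",".join(accessions[i:i + 100])
        result = subprocess.run(
            ["efetch", "-db", "protein", "-id", chunk, "-format", "fasta"],
            capture_output=True, text=True, check=True,
        )
        out.write(result.stdout)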

Now you can search for your desired sequence (in its full protein context) within the resulting file.
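Even a bare-bones fasta parser will do for that; for example:

# Tiny fasta scan: print headers of records containing the motif.
MOTIF = "NPGP"  # the 2A-terminal motif from the query above

def read_fasta(path):
    """Yield (header, sequence) tuples from a fasta file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

for header, seq in read_fasta("4A_prot_fasta.txt"):
    if MOTIF in seq:
        print(header)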

To be continued…

Are there other steps in this process related to this project? Sure. Like what do you do with all of these full sequences containing the hits? Well, that’s beyond the scope of this post.

ODs on the spec and nanodrop

So there are two ways to measure bacterial culture ODs in the lab. The first is to use the nearby ~$10,000 ThermoFisher Nanodrop One (no cuvette option). The second is to use a relatively cheaply made cuvette-based spectrophotometer I bought off of Amazon for ~$100. To be clear, this comparison is not a statement about the value of a Nanodrop (though I will say that having an instrument like a Nanodrop is essentially a must in a mol biol lab). This is more about whether, if the Nanodrop is already being used by someone and waiting would get in the way of some bacterial speccing timepoints, I can purchase a $100 piece of equipment to relieve such a conflict. Especially for bacterial cultures, where volume isn’t really an issue and the measurement is simply the reading at 600 nm, not even requiring some algebra to make a conversion to more practical units (like ng/uL for DNA).

So to do this comparison, over a number of independent instances, I took the same bacterial culture, ran 1 mL of it in a cuvette on the old spec, and measured 2 uL of it on the Nanodrop pedestal. I made a table of the results, and graphed it in the plot below.

So the readings on the two instruments certainly correlate (that’s good), although it’s not an exact 1:1 relationship. In fact, the Nanodrop gave numbers roughly 1.5 times higher than the spec. But if the two instruments give two different readings, then the question becomes “which is right?”

And to that, I essentially say there is no right answer. Each is a proxy for bacterial cell density (ie. billions of bacteria / mL), but there’s no “absolute” information encoded in the OD number that tells us that specifically for our bacteria. We’d still have to come up with a conversion factor either way (ie. by doing limiting dilutions of specc’d cultures and counting colonies), and once we have that, both will be right within that context. Sure, it would be nice if we had the method most in line with whatever ODs were being described by various papers in the literature, but who knows what they used (recent papers may be using ODs from the Nanodrop [with some perhaps using the cuvette option but many others not], while the older publications certainly didn’t have one and instead likely used some old-school form of spec). But even that’s going to be heterogeneous, and will only give limited information anyway.
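To make the conversion-factor idea concrete, here’s what that calculation could look like, with entirely made-up numbers:

# Illustrative only (all numbers made up): deriving a CFU/mL-per-OD
# conversion factor from a limiting dilution of a specc'd culture.
od600 = 0.4               # culture reading on one particular instrument
dilution_factor = 1e6     # e.g. six serial 10-fold dilutions
plated_volume_ml = 0.1    # volume spread on the plate
colonies = 40             # colonies counted at that dilution

cfu_per_ml = colonies * dilution_factor / plated_volume_ml
conversion = cfu_per_ml / od600  # CFU/mL per OD unit, for THIS instrument

print(f"{cfu_per_ml:.2e} CFU/mL, or {conversion:.2e} CFU/mL per OD600 unit")

The point being that the factor is instrument-specific: do the same plating off the spec and off the Nanodrop and each gets its own conversion, and then both are “right.”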

Well, good record-keeping to the rescue. We’ve transformed the positive control plasmid enough times to sample a range of various ODs just by chance, to see if certain bacterial ODs correlate with transformation efficiency. And boy, there’s been a whole lot of nothing there so far (which is actually quite notable; see below).

(FYI: I don’t remember which instrument I used to measure the OD600 readings. Probably mostly the old spec, tho.)

So yea, I’ve generally used cultures with ODs at the time of collection between 0.1 and 0.45, and they’ve collectively given me transformation rates of ~ 20,000 using our standard “positive control” plasmid. So there seems to be a pretty wide window of workable ODs. But generally speaking, I see no issue with having a culture of 0.1 to 0.4 OD as measured with either machine for use with chemical transformation.

Setting up a hybrid lab meeting

Due to both child-care and pandemic reasons, our originally 100% in-person lab meeting was for a time 100% remote, and for the last few months it has been 100% hybrid. For overall accessibility reasons, I’ll likely have hybrid remain the default option, and only not bother to set up the Zoom when it’s clear absolutely everybody is going to be in attendance in person. Over time, I think I’ve better learned how I should be setting up the hybrid lab meeting, and I figure I’d write down the steps here so I can remember (and anybody else can do so if they’re setting things up).

Standard flexible format (ie. Nobody needs presenter mode)
Here, the laptop plugged into the projector is providing the sights and sounds of the conference room, but is simply serving as a “viewer” of the slides in Zoom. I’ll assume this is being done with the common lab laptop (Kenny’s old laptop from 2019), although anyone’s laptop should work.

  1. Plug in the 360 degree camera into the common lab laptop. If you also want to use a different external microphone (ie. if you don’t want to use the microphone associated with the 360 degree camera), then plug that in now too.
  2. Log into the lab meeting on Zoom. Confirm that the right camera and microphone are selected. Make sure the sound is up to the maximum, and that this computer remains unmuted.
  3. Using the USB-C adapter, hook up the laptop to either the projector or the wheeled TV. The adapter will allow for connecting to the projector with the existing VGA cord, or the TV with an HDMI connection.
  4. Make sure the Zoom screen is showing on the projector / TV screen.
  5. To actually use this setup, the idea will be that 1) anybody with their own computer can log into the lab meeting Zoom and share their desktop or window with the presentation (powerpoint file or google slides, for example), or 2) if it’s someone without their own computer, they can use Kenny’s or Anna’s computer to screen share (assuming the presentation file is somewhere easily accessible, like the “Lab_meetings” directory of the lab Google Drive).
    Note: While these computers are being used to share the slides, all sound (input and output) should be happening from the common lab computer connected to the projection device.

One speaker / Longer-form talk where someone needs presenter mode
The main difference here is that the laptop presenting the slides has to be plugged into the projector / monitor, and is thus not simply a “viewer” in the Zoom call. Here, I’ll assume this is being done with the common lab laptop, although anyone’s laptop should work.

  1. Plug in the 360 degree camera into the common lab laptop. If you also want to use a different external microphone (ie. if you don’t want to use the microphone associated with the 360 degree camera), then plug that in now too.
  2. Log into the lab meeting on Zoom. Confirm that the right camera and microphone are selected. Make sure the sound is up to the maximum, and that this computer remains unmuted.
  3. Open the presentation file. Windows may get much harder to navigate once connected to the second screen, so you may as well get everything set up beforehand.
  4. Using the USB-C adapter, hook up the laptop to either the projector or the wheeled TV. The adapter will allow for connecting to the projector with the existing VGA cord, or the TV with an HDMI connection.
  5. Now, go to Zoom and hit “Share screen”. Probably makes sense to choose the screen with the presentation on it, though it doesn’t really matter at this point since you can adjust it later.
  6. Once the screen is sharing, go to the slide / presentation software you’re using. Assuming it’s Powerpoint, then hit “presenter view”.
  7. If the wrong screen is showing on the projector / TV monitor, then hit swap displays in the Zoom panel until the right one shows.
  8. Now, if the wrong screen is being shared on Zoom (ie. people in the Zoom call are saying they’re seeing your presenter view), then hit the “Share screen” button again and choose the correct screen to cast.

That should do it!

Designing Amplicon-EZ primers

Amplicon-EZ is a pretty convenient service from Genewiz. In short, they’ll perform Illumina sequencing on a 150-500 nt DNA fragment you send them (they’ll perform 2 x 250 cycles of sequencing, so fragments smaller than 500 nt will have paired read regions). For $50 per sample, they’ll return ~50,000 reads (although in our experience, they tend to return more than this). Turnaround times can be kind of slow (while one can minimize the delay by timing things perfectly, it’s taken between 14 and 19 days to get data back following submission). That said, since we’re still only running full kits of our own a couple of times a year, Amp-EZ still gets you data a lot faster than waiting for one of those. Thus, it’s definitely good for getting an initial look into something you may want to sequence more deeply later. My general policy for the lab is that if you make any library, it’s worth submitting it for Illumina sequencing via Amp-EZ pretty early on, so you can be confident that the library is good and worthy of further experiments.

Designing primers

Primers are pretty simple to design. Essentially, you’ll want to make a pair of PCR primers with Amp-EZ adapters on the 5′ ends (and of course, DNA hybridizing sequences on the 3′ ends). As shown in the above link, the adapter sequences are:

For the forward sequencing read: 5’-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’

For the reverse sequencing read: 5’-GACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3’ (this is the reverse strand on a plasmid map)
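Assembling the full oligos is then just concatenation; here’s a trivial sketch (the gene-specific hybridizing sequences below are hypothetical placeholders, not real primers):

# Trivial sketch: each Amplicon-EZ primer is adapter + gene-specific
# hybridizing sequence. The hybridizing sequences below are
# hypothetical placeholders, not real primers.
FWD_ADAPTER = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
REV_ADAPTER = "GACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

fwd_hyb = "ATGGCAAGCTCAGAGGAC"  # hypothetical target-specific sequence
rev_hyb = "TCAGCTGGTCACGTCCTT"  # hypothetical target-specific sequence

print("Fwd:", FWD_ADAPTER + fwd_hyb)
print("Rev:", REV_ADAPTER + rev_hyb)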

Here’s an example of a map and corresponding annotated primers used to sequence the ACE2 Kozak library plasmid, and a similar map for sequencing the library after it’s been integrated into landing pad cells.

Amplifying your fragment

For the actual protocols to do this, it’s probably worth asking Sarah or Nidhi how they do it. The basic steps are going to be PCR, gel extraction of the band, and Qubit quantitation of the extracted DNA. The main thing to keep in mind is that they’ll want a fair amount of DNA (500 ng), so you’ll either want to make sure you do a lot of cycles and extract a pretty hefty band, or you’ll need to do a second amplification from the initially extracted DNA.

Illumina Library Mixing

It’s taken us a while, but at this point we’ve now submitted two multiplex NextSeq Illumina kits through CWRU genomics. We’ve performed the multiplexing by appending dual indices during targeted amplification of the barcodes. In order to mix everything together, we’ve taken a rather simple approach of gel extracting the amplified bands and quantitating the amount of extracted DNA using Qubit, and mixing every extracted amplicon to equimolar amounts. This mixture is then submitted to the CWRU genomics core for qPCR-based quantification, and loaded onto the kit for sequencing.
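For reference, the equimolar-mixing arithmetic is simple. Here’s a sketch with hypothetical Qubit readings and amplicon lengths, using the standard ~660 g/mol per bp figure for dsDNA:

# Sketch of the equimolar pooling math, with hypothetical Qubit
# readings. nM = (ng/uL * 1e6) / (660 g/mol/bp * length in bp).
samples = {            # name: (Qubit ng/uL, amplicon length in bp)
    "sample_1": (12.0, 350),
    "sample_2": (8.5, 420),
    "sample_3": (15.2, 350),
}

molarities = {
    name: ng_per_ul * 1e6 / (660 * length_bp)
    for name, (ng_per_ul, length_bp) in samples.items()
}

# Volume of each sample needed to contribute equal moles, scaling so
# the least concentrated sample contributes 10 uL.
limiting = min(molarities.values())
for name, nm in molarities.items():
    print(f"{name}: {nm:.1f} nM, add {10 * limiting / nm:.2f} uL")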

Since everything has been mixed to equimolar amounts so far, it’s quite simple to see how well our samples were quantitated and mixed together using the Qubit readings. This is done by taking the number of paired reads associated with each pair of indices, and seeing how their counts / frequencies compare to each other. The distributions tended to be log-normal, as seen in the below plots:

The second kit seemed to do a little better than the first one. Each kit ended up having a single poorly sequenced outlier, with the sample in the first kit lower than the median by ~300-fold, and the sample in the second kit lower by ~10-fold. The first kit also had about 4 samples that were below the median by about 10-fold.

Regardless, this log-normal distribution shows what kind of mixing precision we can expect to consistently get with this scheme, even when it’s working well. Both distributions had a sd(log) of ~0.5. The coefficients of variation were ~0.04.

What are the red lines you ask? Well, those are the idealized / hypothetical number of reads we should have gotten if we got all of the reads from the kit (130 million for the mid kit used in kit 1, and 400 million for the high kit used in kit 2) divided by the total number of samples. Clearly, the real life read numbers aren’t hitting that idealized number, which is as expected.

To frame it another way though: sure, having read a sample with too much read-depth is inefficient, but what is arguably more annoying is if we end up under-reading a sample because we were off in our estimates. What seems like a reasonable approach is to choose a target read-count that will ensure that ~95% of our log-normal distribution hits the minimum read-number needed to get interpretable data. Looking at these n = 2 results, it seems reasonable that we would want to give each sample ~5 to 10-fold more reads than one would expect based on idealized numbers.
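As a back-of-the-envelope check (assuming the sd(log) above is in log10 units, which is my assumption, not something I went back and verified):

# Back-of-the-envelope: if per-sample read counts are log-normal with
# sd(log10) ~ 0.5, how much should we oversample so ~95% of samples
# clear the minimum read count?
sd_log10 = 0.5
z = 1.645  # one-sided 5% tail of a normal distribution

oversample = 10 ** (z * sd_log10)
print(f"target ~{oversample:.1f}x the naive per-sample read number")
# ~6.6x, which lines up with the ~5 to 10-fold buffer suggested above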



Generating new indices for multiplex Illumina sequencing

Nidhi recently needed to design new reverse primers for making dual-indexed amplicons for Illumina sequencing, which involved generating novel indices that could be used with existing primer sets. Luckily, I had already asked Anh to create a script for such a purpose, so once I remembered he had done that, all I had to do was find it and implement it. It’s pretty neat, since it goes to a Google Sheet in the cloud (so not a local file) and extracts the existing indices, so we know to avoid anything similar to them. Since it was a pretty easy script to understand, I ended up making some tweaks to it: 1) adding a user input function, rather than hard-coding the number of new indices needed into the script itself, 2) having the script consider the forward and reverse sequences of the existing indices, to avoid both, and 3) spitting out an identity matrix of the pairwise comparisons of the existing 10-nucleotide indices, so I can visually verify that none are really close in sequence.

The script lives here in this GitHub repo, with the newest version I modified called “generateIndex_user_input.py”.
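The core idea (my minimal paraphrase, not Anh’s actual script) is just rejection sampling on Hamming distance; the min_dist threshold below is a hypothetical parameter, not necessarily what the real script uses:

# Minimal paraphrase of the index-generation idea (not the actual
# generateIndex_user_input.py): randomly propose 10-nt indices and keep
# them only if sufficiently distant, by Hamming distance, from all
# existing indices and their reverse complements.
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def new_indices(existing, n_wanted, length=10, min_dist=3):
    avoid = set(existing) | {revcomp(i) for i in existing}
    found = []
    while len(found) < n_wanted:
        candidate = "".join(random.choices("ACGT", k=length))
        if all(hamming(candidate, other) >= min_dist
               for other in avoid | set(found)):
            found.append(candidate)
            avoid.add(revcomp(candidate))
    return found

print(new_indices(["ACGTACGTAC", "TTGGCCAATT"], n_wanted=3))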

To be specific, rather than an identity matrix, it’s actually a distance matrix (how many positions within the 10 nucleotides are NOT identical), with the white diagonal in the above plot essentially a positive control, since it shows that in the pairwise comparisons, the same sequence compared to itself gives 0 nucleotides of difference. Notably, there are no other purely white blocks in that matrix, partially b/c 1) the chance that two randomly generated sequences are identical is ~1/(4^10), which is a very small number, and 2) Anh’s script is likely working, keeping the number of identical matches between any newly generated indices and existing indices to a minimum. Actually, it looks like most differ by 7 or 8 nucleotides, and only a tiny fraction differ by just 3 nucleotides or so (and thus match at 7 positions). If those are the combinations with the highest similarity, it really isn’t bad at all.

So ya, I think we’re doing a decent job of managing our indices to make sure they are not overlapping!

Posters

So a couple of the trainees in the lab are preparing posters for the Dept retreat. In helping them (and other future trainees) get started, I dug some digital versions of my own posters out of the digital storage closet, and put them in a “Poster_examples” directory on the lab Google Drive. But along with those examples, here are a series of tips that I’ve learned over the years, which I’m remembering and committing to writing in this process.

  1. Be cognizant of the size of your poster. I think the most common dimensions are 36″ high by 48″ wide. Some conferences will allow you to make larger posters (probably depending on what type of poster boards they have). But also, different poster printers may have different capabilities. I’m sure most do 36″ or 48″ as one of their standard sizes, but it’s worth confirming what your particular printer offers as standard dimensions.
  2. Know how long it takes to print the poster. There are logistics to everything. Where are you going to get the poster printed? How far ahead of the guaranteed printing deadline do they need the submission? Again, it’s worth knowing this up front so you’re not scrambling at the end. Also, getting started a little bit early can definitely save you some misery later on.
  3. Plan for column-based viewing rather than rows. IMO, it makes much more sense to tell the story in a series of columns, where the viewer starts on the left of the poster, reads down the column, takes a step to their right, reads down the column, and so forth 3 or so times until they’ve seen the whole poster. Probably generally easier to read this way rather than horizontally through rows, and probably the ONLY way to see / go through a poster when it’s really crowded.
  4. Plan out how you’re going to talk about your poster as you’re making it. What normally happens in a poster session is that you stand around, all awkward and lonely, in front of your poster until someone comes by who seems interested; then you’ll introduce yourself and offer to explain it (and you can ask who they are and what their background is, so you can figure out what parts they may be particularly interested in or may struggle to understand). Then you’ll explain the story, starting at the top left of your poster and winding down to the bottom right. Now, what would really help in telling this story is if the things on your poster actually illustrate what you’re trying to say. So, think about what exactly the story is as you’re making the poster, so it’s all there when you need it.
  5. Always start by presenting the problem. It’s easy to get sucked into just saying what you did. But nobody is going to understand what you did until you tell them WHY you did it. Which means there must be a problem presented, and your data is part of the solution in answering it. If you want people to be on board, you definitely want to spell out the problem, and why it’s so important, up front.
  6. Do a light (white) background. The backing paper is going to be white, so why force the printer to cover the entire surface of the paper with ink?
  7. Consistency in fonts / font sizes. This is one of the hard ones. If you’re essentially making a scrapbook of different figures from different sources, it may be hard to get all of the fonts to match. In those cases, you may need to crop out the existing labels / text and add in your own of known / controllable size. Also, this is a good reason to always use the same fonts (eg. Arial) in all of your figures.
  8. Text size hierarchy. I hadn’t really thought about this before, but I think this makes sense: for the main text of the poster, it probably makes sense to have about 2 different text sizes: one for the main narrative text you want the reader to read, and a smaller size for text that holds some detailed information (like a figure legend). The larger one is always meant to be read, while the smaller one is only to be read when necessary. The larger one should be readable from a good couple feet away, so it’s definitely gotta be >= size 20 or something. Headers / section titles may make sense too, and you can denote those using bold or a slightly larger text size. Of course, this is all not counting the poster title, which should be large and visible from across the room. That said, you don’t want to overcomplicate things by having like 5 different font sizes.
  9. Not stretching / squishing figures. Nobody likes seeing images in weird proportions; sure, maybe it works with mirrors in a circus funhouse, but it definitely doesn’t work for conveying scientific information in a professional setting. So ya, make sure that if you’re adjusting a figure’s size, that you have the “height and width proportionality” box checked as you do it.
  10. Avoid having too much text. Nobody is at the poster to read the next great American novel. You literally want the bare-bones amount of text needed to convey just about all of the big structural information. Everything else, like all of the details, you can convey in person.
  11. Don’t make it too crowded. Again, you want it to be inviting and easy to follow, so you want to give figures ample space to breathe. I’d also say don’t make it too sparse, but while that’s technically possible, I can’t imagine any serious poster has ever had that be an issue (ie. too little content to fill the space).
  12. Consider expanding your illustrative palette. If you’re on your first few posters, it totally makes sense to use Powerpoint (or equivalent) to create your poster. That said, eventually you may want to explore other options. I’ve slowly shifted to using Inkscape, partially since I’m now much better / faster in Inkscape than I am in Powerpoint. Also, BioRender is a great option for making some really aesthetically pleasing images / graphics relatively quickly with an easy-to-use interface. If nothing else, I suggest you generate some graphics there and just take screenshots that you can insert into your posters made in Powerpoint.

Also, I realize some of my own posters probably break some of these rules (eg. the “Avoid having too much text” rule). Nobody’s perfect!

Addendum:

Nisha just did the legwork on figuring out how we print posters here, and this is what she found out:
– FedEx office in Thwing (9am-5pm Monday thru Friday).
– General size that they work with is 36″x48″, so that might be what you want your poster size to be.
– You can have it on a usb drive, or email it to them directly at [email protected]
– If submitted early enough in the day, they may be able to do same day turnaround, but better to submit 2 to 3 days before you need it just in case.
– Will update this once we know how payment works exactly, but I suggest showing up knowing one of the lab speedtypes to charge directly to one of them (R35 is a good candidate, but can always use the startup speedtype if needed).