PyMOL Basic Tutorial

Alright CWRU students, today we will talk about installing PyMOL and using it for some basic analysis of protein structure.

1) INSTALLATION

Firstly, download PyMol. CWRU has a subscription, so go to this link and read the license agreement. If you agree, press “I agree”. Then, enter your CWRU SSO info to be logged into the institutional software page. Hit any of the logos in the “PyMOL AxPyMOL v2.4” section (they all redirect to the same place).

As of 10/20/20, they do some weird web-store thing (never seen this before). You’ll have to add the version for whatever platform / OS you’re on, and hit add to cart, and then on the next pop up, hit “Check Out”. It’s confusing, but it’s all free, so this seems to just be a formality. You’ll get an email confirmation, but this email confirmation really doesn’t do much.

You will get to a screen that looks like this though:

First, while you’re still on the above screen, make sure you hit the “Important Notice” link. This will tell you the information you’ll need later for registering. The important notice screen will look like this (just without the blacked out parts).

Now that you know that info, go back to the previous screen (with the big “Download” button), add the version for the platform / OS you’re on, and hit download. You’ll then get to another screen where you’ll have to hit download again.

Now you’re on the registration screen. Type in the info from earlier, and type in your case email address. You’ll get an email to your school email address bringing you to the download link.

I whited out some of the token link, so you’ll have a link that looks much longer than above.

Follow that link and you’ll get to another page that looks like this:

Once you type in your case email address again, you should FINALLY get to the download page, which will look like this:

There are a lot of links there, but the most important one is the very first link, under “Download PyMOL”. Hitting that should bring you to this page:

You’ll need to do two things here. First, before you forget, download the license file. This is the button on the bottom right. Following the instructions will allow you to download a “pymol-license.lic” file. I like to put it into the same Application folder, so it’s easy to find. As shown in the picture provided by the website, you’ll need this file when you first open up PyMol on your computer.

Next, download the program itself, based on your platform (again, I’m on a Mac). Once downloaded, follow the steps for installing PyMOL (for Macs, it’s opening up the disk image, and then dragging the PyMOL application into your computer’s Applications folder). Double click to open the program, and it will ask about activation. Click the “Browse for License File” button and select that license file you recently downloaded. Voilà! You have successfully navigated the gauntlet of website complexity to achieve your goal. A world of exploring protein structures awaits.

2) USAGE

Now that it’s downloaded and properly linked to the license, you should get a nice blank screen like this.

I could explain what everything is, but it’s easier to just get started. If you’re on an internet connection, the easiest way to load protein structures is to use the “fetch” command. Essentially 99% of the protein structures you’ll ever want to see will be deposited at the RCSB Protein Data Bank. Go to that website, type in your favorite protein, and see what hits you have. For the purposes of today, I’ll just use a structure I’ve stared at many times over the last 5 years: that of the tumor suppressor protein PTEN. I have the code memorized, and it’s “1d5r”. So, type “fetch 1d5r” into either the bottom “PyMOL>_” prompt area, or the top “PyMOL>” prompt area. Both spaces work and are largely redundant, although the top area actually records what your previous commands were, so I have a slight preference for this area. You should now see the below protein pop up in your PyMOL window.

From here, it’s easiest if you have a two-button mouse connected. If you do, you can left-click, hold, and drag to rotate the protein. You can also right-click, hold, and drag to zoom in and out. And finally, you can hold down either option or command (if you’re on a Mac) and left-click to actually move the entire molecule around on the screen. If you have a scroll wheel, you may notice that turning it makes part of the protein appear or disappear. This is called “clipping” and will be useful in certain cases where you need just the right picture of part of the protein, but we won’t be getting into this for today.

On the other hand, you may be using a laptop without a two-button mouse. While slightly less ideal, you can still do everything pretty easily as long as you go to the top menu-bar, click and hover over “Mouse”, and click down on “1 Button Viewing Mode”. Now, if you go back to the PyMOL window, you can still turn the object by clicking and dragging on your trackpad, but you zoom by holding down control while clicking and dragging, and you move the entire molecule around (relative to your frame of view) by first holding down option before clicking and dragging.

The default is a “cartoon” view, which shows the backbone of the peptide, as well as secondary structure with alpha helices shown as helices, and beta strands shown as those flat arrows. I find cartoon views to be the most generally useful, so we’ll keep it like that for now. The small red plusses are water molecules the authors have in their model. I find these to be kind of annoying, so I like to hide them in the beginning, only bringing them back in much later when we’re considering how various side-chains may be interacting with water in the medium. To do this, I just type in “hide all”, which makes everything disappear, and then “show cartoon”. Now all of the extra molecules aside from the peptide backbone should be gone. In the case of PTEN, there’s a small tartrate molecule that was bound to the active site. It’s worth making this visible. The easiest way to select it is to click the “S” button in the bottom right menubar…

This will call up the sequence at the top of the protein viewer window, below the command-line area. The tartrate molecule is after the entire PTEN sequence, but before the waters. Scroll there and click on “TLA” to select it.

The sequence pops up, and defaults to the very beginning (starting with chain A).
After scrolling to the right, you can see the TLA substrate mimic in the sequence list.

This will make a new selection called “sele”. Now you can tell the program to make a visual representation of the selected “TLA” molecule. One option is to type in “show sticks, sele”. This will be pretty subtle. On the other hand, if you want to be able to see the atoms of the TLA molecule from pretty far, you can instead type “show spheres, sele”. The molecule will then look like this:

Nice. From here, feel free to explore. If you want to select particular residues, it’s helpful to use the “resi” denotation. For example, if we wanted to select residues 124, 129, 130, and 38, you can type in “select important_residues, resi 124+129+130+38”. There will now be a new selection called “important_residues”. To show where these are, you can choose to show these as spheres also (“show spheres, important_residues”), and color them a different color to make them easier to see, such as by typing “color yellow, important_residues”. You’ll have something that looks like this.

3) OUTPUTS

So you’ve done all of your exploratory analyses, and feel like making an image to put into a presentation. First thing to consider is the background. It defaults to black, but white is usually better for most presentations / paper figures. You can change this either by selecting Display/Background/White from the top menu, or by simply typing in “set bg_rgb, white”.

While you can get a decent picture even with this, you can get much nicer images if you tell the program to render the image first. You can do this by clicking the “Draw/Ray” button at the top right corner of the screen, and filling out the values you want, or you can tell it to do its ray-tracing / rendering from the command line by typing something like “ray 900”. After a little bit of waiting, you’ll get a rendered image.

Now you can go to “File/Export image as/PNG…” and export your image file to one of your computer directories. You’ve made a simple, high quality figure. Hurrah! You may want to save your Pymol analysis file by going to “File/Save Session As…”. Thus, if you want to get back to this step without starting from scratch, all you need to do is open this session file again and you’re back to the same spot.
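If you find yourself retyping the same commands, here is a minimal Python sketch that just assembles the command lines used in this tutorial into a list you can paste into the PyMOL prompt. The function name and arguments are made up for illustration. One assumption to flag: it selects the tartrate with “resn TLA” rather than by clicking it in the sequence viewer to create “sele”.

```python
def build_pymol_commands(pdb_id, ligand_resn, residues,
                         selection_name="important_residues", image_width=900):
    """Return this tutorial's PyMOL commands as a list of text lines.

    Nothing here talks to PyMOL; it only builds the strings.
    """
    resi = "+".join(str(r) for r in residues)
    return [
        f"fetch {pdb_id}",
        "hide all",
        "show cartoon",
        # Assumption: pick the ligand by residue name instead of clicking it.
        f"show spheres, resn {ligand_resn}",
        f"select {selection_name}, resi {resi}",
        f"show spheres, {selection_name}",
        f"color yellow, {selection_name}",
        "set bg_rgb, white",
        f"ray {image_width}",
    ]


commands = build_pymol_commands("1d5r", "TLA", [124, 129, 130, 38])
print("\n".join(commands))
```

Keeping the residue list as plain numbers makes it easy to swap in a different set of positions without hand-editing the “resi” string.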

Hopefully that was a useful introduction to some basic operations in Pymol. There’s a ton more you can do with the program, but I’ll leave it here for now.

4) ADDENDUMS

So the above covers the most basic things to get you started, but as I interact with trainees here (and see what we need to do for their work), I’ll amend this page with other more specific operations.

A) Show the protein surface.
Cartoon diagrams are great for understanding how the protein is structured, but they don’t give you a sense of what the surface of the protein looks like. That’s where showing the protein surface comes in handy. You can do this quite simply by typing in “show surface, all” (replacing “all” with whatever object name there is). This will create something like this:

The reason for the blue, red, and yellow is that the default coloring scheme in PyMOL is to color carbon atoms green, nitrogen atoms blue, oxygen atoms red, and sulfur atoms yellow. If that’s too distracting, you can just color everything green by typing in “color green, all”.

But what if you want to see both the protein surface as well as the underlying secondary structure? One approach is to make the surface representation semi-transparent. You can do this by typing something like “set transparency, 0.75”.

B) Selecting atoms near another selection of atoms
This can be a pretty useful feature. For example, you may be curious which residues of a protein are near a substrate mimic molecule. Or maybe the structure shows a protein-protein interaction, and you’re trying to figure out which residues in protein A are in close contact with protein B. Below is a third example, which is figuring out which atoms of a protein are pretty close to water molecules on the surface of the protein.

As described on this page, you can use the “around” selection operator for this. Here’s a series of commands to show the atoms of the protein as spheres, color them all green, and then select only the atoms near water and color them blue:
“show spheres, 1d5r and not resn HOH”
“color green, 1d5r and not resn HOH”
“select near_water, (resn HOH) around 3.5”
“color blue, near_water”
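To make it clearer what a distance-based selection like this is doing, here is a small Python sketch of the underlying idea: keep any atom that sits within a cutoff distance of at least one reference atom. The coordinates and atom labels are toy values made up for illustration (they are not from 1d5r), and whether PyMOL treats the cutoff as inclusive is an assumption on my part.

```python
import math


def atoms_around(candidates, reference, cutoff=3.5):
    """Return names of candidate atoms within `cutoff` angstroms of any reference atom."""
    near = []
    for name, xyz in candidates.items():
        if any(math.dist(xyz, ref_xyz) <= cutoff for ref_xyz in reference.values()):
            near.append(name)
    return near


# Hypothetical toy coordinates in angstroms, NOT taken from a real structure.
protein_atoms = {
    "LYS13_NZ": (0.0, 0.0, 0.0),
    "ASP24_OD1": (10.0, 0.0, 0.0),
    "SER59_OG": (0.0, 6.0, 0.0),
}
water_atoms = {
    "HOH501_O": (2.8, 0.0, 0.0),
    "HOH502_O": (0.0, 9.0, 0.0),
}

near_water = atoms_around(protein_atoms, water_atoms)
print(near_water)
```

With these toy values, the lysine and serine atoms each sit within 3.5 angstroms of a water and are kept, while the aspartate atom is too far from both and is dropped.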

Ordering primers

This is going to be a very lab-specific instructional post, since it’s literally only going to talk about how we normally order primers here in the lab.

So as I mentioned in this previous post, at least for us here at CWRU, primers are cheapest when ordered from ThermoFisher. There’s an unfortunate delay on how long you have to wait until you can receive the primers, but for the most part it isn’t a big problem. Regardless, I’ve essentially streamlined how we order and organize primers as much as possible, so here are the steps.

Any permanent member of the lab should have their own tab (titled with their initials) in the “MatreyekLab_Primer_Inventory” sheet. I also have my own tab, which is called “KAM”. Every primer you order should have its own unique identifier (like KAM2199). Since you’re the only one adding to your own tab, there is no reason you should ever have duplicate identifiers. You want to do this step first, so you know what identifier to give the primer you intend on ordering now (The most recent primer I ordered was KAM3495, so the primer I’ll order today is KAM3496).

I usually keep a separate “scratch” google sheet where I write down things I’m in the process of doing. Here, I like to make a set of fields like this (I usually just copy-paste from the last time I ordered primers, actually). But, if you don’t already have one of these sheets, I just made a “Scratch_area” tab in the google sheet.

Here, I like to list a few things. There’s an area I intend to later copy-paste into the primer inventory, which lists the date, the length of the oligo (using the =LEN() function of the actual primer sequence), the primer identifier, the actual primer sequence, and then a short description for what the primer is for.

Next, I like to have another section which repeats a subset of the information, but in a slightly different order. The reason for this is that the ThermoFisher website requires the primer ordering sample sheets to have the data entered in a specific order (sequence, identifier, Name of person ordering). Since this is the same information as in some of the previous columns, I just have those cells point to each of the relevant cells so that I don’t need to fill in that same info again.
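The bookkeeping above can be sketched in a few lines of Python, in case it helps to see the two column orders side by side. This is only an illustration: the function names, the example sequence, the description, and the date are all made up, and the real work still happens in the google sheet.

```python
import re
from datetime import date


def next_identifier(last_id):
    """Increment the numeric part of an identifier such as 'KAM3495'."""
    prefix, number = re.fullmatch(r"([A-Za-z]+)(\d+)", last_id).groups()
    return f"{prefix}{int(number) + 1:0{len(number)}d}"


def inventory_row(order_date, identifier, sequence, description):
    """Inventory order: date, oligo length, identifier, sequence, description."""
    return [order_date.isoformat(), len(sequence), identifier, sequence, description]


def order_row(inv_row, orderer):
    """Ordering-sheet order: sequence, identifier, name of person ordering."""
    _, _, identifier, sequence, _ = inv_row
    return [sequence, identifier, orderer]


new_id = next_identifier("KAM3495")
# Hypothetical sequence, description, and date, for illustration only.
inv = inventory_row(date(2020, 1, 1), new_id, "ACGTACGTACGTACGTACGT", "example primer")
order = order_row(inv, "KAM")
print(inv)
print(order)
```

The `len(sequence)` call plays the same role as the =LEN() cell, and `order_row` mirrors having the second set of cells point back at the first.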

If you’re ordering primers for sequencing, then you’re all set. But if you’re ordering primers for some cloning, I like to also write a short section that writes down which primers are supposed to be paired with which other primers to make a particular plasmid, also listing the template plasmid that will be used for the reaction. It’s nice to write this info now, so you don’t have to try to remember everything / redo the effort you just did to remember how you need to pair things to actually perform the PCR when you receive the primer. I usually copy-paste these cells into the “MatreyekLab_Hifi_Reactions” google sheet, into an area below where I have written “** denotes primers we may not have yet”.

So next is actually ordering the primers. There is a folder in the lab google drive where all of the primer order sheets are kept (MatreyekLab_GoogleDrive/Primer_order_templates). I usually just find a recent file, duplicate it, and rename it for today’s date and whoever is ordering (eg. KAM).

Once I copy-paste the new info in, it should look like this in the end.

K, great. Now to actually order it. Log into our thermofisher account by going to www.thermofisher.com, hitting “Sign In” at the top right. Lab account email is “[email protected]” and the password is the lab password (which you should already have memorized!). Then I hover over “Popular” at the top left, and click on “Oligos, Primers, & Probes”.

Then hit “Order now” under “Custom made to order”

Then I always do the “Bulk Upload” button:

Next hit “choose file”:

Then click on the samplesheet you want to upload. Your primer should now be entered.

Cool, so next is actually checking out. Hover over the cart and then click “View cart & check out”.

Next is a series of somewhat annoying confirmation screens. Scroll down and hit “Begin Checkout”

Default settings are fine, so scroll down and hit “Continue to Payment”

Default settings (including the speed type) are still fine, so scroll down and hit “Continue to Review Order”

You’re almost done! Now scroll down, click on the “You agree to the thermofisher.com terms and conditions…” and then click “Submit Order”.

You should now see a screen that says “Thank you for your order!” at the top. The “[email protected]” account will also have received a confirmation email (but you don’t need to check this or anything). You are FINALLY done. Log out of the account and go on with your day.

Setting up a shared Github repo for an RStudio project

4/29/23 edit: At some point, this stopped working well for me, and I switched over to just using the GitHub desktop app, and that’s been working fine.

As I work with more trainees as they join the lab, it could be helpful to have a single RStudio GitHub repo that we share so we can work on the same project / analysis script. Here are a series of steps we can do to set this up.

Someone first has to initiate the repo. Log into your account on GitHub (make an account if you don’t already have one). Make a new Repo. If you’ve been added to the lab, then make a new repo within the lab. Add some kind of description, click “Add a README file”, and then hit “Create Repository”.

Here’s the blank repo that should now exist:

Now go into RStudio (which you should have already installed), and go to File -> New Project.

Click on Version Control:

And then click “Git”:

Next you’ll need to enter in some info. Add in the web address of the repo you just made and want to link with your RStudio, make sure you’re happy with the project name, and then enter in where on your computer you want the project to be located.

Cool, now you have a very blank looking project.

Now let’s start adding things to the project. For example, let’s make an R Markdown file where you will start making the analysis script. To do this, go to File -> New File -> R Markdown…

A new screen will pop up. Add whatever details you want, and hit “OK”.

The file isn’t saved, so save it in the same directory as your Rproj and Readme file, giving it whatever name you want to give it (probably the same name as the repo).

If you use the default settings, you’ll get a generically prepopulated RMarkdown file. Get rid of everything but the first chunk, since you’ll just reuse the first chunk. Add in some generally useful initial code as I have in the picture below:

So the repository has already changed from when you first created it, since you 1) made an Rproj file and 2) made an R Markdown file. We can now send these changes back to the repo. Go to the Git tab on the top right panel…

Click on the two things you want to update (in this case, the Rproj and Rmd files), and hit Commit.

The next screen should be the commit screen. If you click on the items, it should tell you what’s different from the previous version. New things should be green (since these files are new, the whole thing is green). Write a message to generally describe what’s different with this new set of committed files, and then hit “Commit”.

You’ll get an output screen, like this:

This essentially means you approve of the changes you made, and you’re now ready to send it back to the repo (cloud). The commit screen is now different, and this time hit “Push”.

You should now get a different commit output screen, which says something like this, which means you sent it back to the master branch on the repo:

You can check to see if the repo was indeed changed by going back into your web-browser where you had logged into GitHub and had made the new repo. If you look, you’ll see that the new files are there.

Great. So the idea now is that you can now make changes to the contents of this repo, and send it back into the Cloud so everyone can now see your updated versions. For best performance, you’ll want to hit “pull” before you start working on something (to make sure you have the most up-to-date version of the repo), and once you’ve done your work, hit “push”, so other people can see and work with what you’ve done.
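For anyone curious what those buttons correspond to, here is a rough Python sketch that lists the equivalent git commands for the start and end of a work session. It only builds and prints the commands; the file names and commit message are placeholders, and the exact commands RStudio runs behind its buttons are an assumption on my part.

```python
import shlex


def start_of_session_commands():
    """Before working: get the most up-to-date version of the repo."""
    return [["git", "pull"]]


def end_of_session_commands(files, message):
    """After working: stage the chosen files, commit them, and push."""
    return [
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


# Hypothetical file names and message, for illustration only.
session = start_of_session_commands() + end_of_session_commands(
    ["analysis.Rmd", "analysis.Rproj"], "Add initial analysis script"
)
printable = [shlex.join(command) for command in session]
print("\n".join(printable))
```

If you did want to run these outside RStudio, each list could be handed to `subprocess.run(command, cwd=repo_dir, check=True)`, where `repo_dir` is wherever you put the project.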

If you’re not the person who first initialized this repo, then you’ll have to follow a similar but slightly different series of steps (You won’t need to make the repo yourself). This resource was super helpful in helping me figure out how to do this, and will likely be quite helpful for you too!

8/26/2021 Update: So Github doesn’t allow you to do this with a password anymore, and you need something like a personal access token. Thus, follow the instructions listed here on how to create such a token. I then use gitcreds to actually handle the token, as described here.

Pseudovirus Infection Assays

TL;DR -> If you’re doing infection assays, it’s worth sticking to < 20% EGFP positive, and depending on how many cells you count, > 0.03% EGFP positive cells.

Can’t say I expected it, but the SARS-CoV-2 pandemic has brought virology back into my sights. This includes some familiar tools, such as lentiviral pseudovirus infection assays, which I had used a lot in graduate school. It’s the same basic process as using lentiviral vectors for cell engineering, except you’re changing out the transgene of interest for a reporter protein, like EGFP. If the pseudovirus is able to (more-or-less) accurately model an aspect of the infection process, then you now have yourself a pretty nice molecular / cell biology surrogate to learn some biology with. Essentially, it tells you how efficiently the virus is getting into cells. Since all of the “guts” can look like HIV, you can use it to study parts of the HIV life cycle after it’s entered the cell but before it’s integrated into the host cell genome (this is what my PhD was in). Its arguably more common use is instead as a vehicle for studying the protein interactions that various viruses make to enter the cell cytoplasm.

The loveliness of the assay (especially the fluorescent protein version) is that it is inherently a binary readout (infected or uninfected) at a per-cell level. But, since there are *MANY* cells in the well, it becomes a probability distribution, where the number of infected cells captures how efficient the conditions of entry / infection were.

But of course there is also some nuance to the stats here. For example, if you mixed a volume containing 100,000 cells with another volume containing 100,000 viral particles (and thus a multiplicity of infection (MOI) of 1), you would not get 100% of the cells infected. In actuality, you would only get about 63% of cells getting infected. That’s because each virus isn’t a heat-seeking missile capable of infecting only the remaining uninfected cells; instead, each infection event is (more or less) an independent event. Thus, cells can be multiply infected. Notably, based on an EGFP readout, you can’t really figure that out (i.e. cells that are doubly infected are essentially indistinguishable from singly infected cells based on their fluorescence levels). But, since we know how the system should function, we know that it should largely follow a Poisson distribution.
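Here is a short Python sketch of where that ~63% comes from, assuming infections per cell follow a Poisson distribution with a mean equal to the MOI. This is just the textbook formula written out (the function names are mine), not the script linked in this post.

```python
import math


def fraction_infected(moi):
    """P(cell has at least one infection) = 1 - P(zero infections) = 1 - exp(-MOI)."""
    return 1.0 - math.exp(-moi)


def fraction_with_k_infections(moi, k):
    """Poisson probability that a cell receives exactly k infections."""
    return math.exp(-moi) * moi**k / math.factorial(k)


moi = 1.0
infected = fraction_infected(moi)
single = fraction_with_k_infections(moi, 1)
multiple = infected - single

print(f"MOI {moi}: infected {infected:.1%}, singly {single:.1%}, multiply {multiple:.1%}")
```

At an MOI of 1, about 37% of cells get exactly one infection and about 26% get two or more, which together make up the ~63% that would read out as green.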

Here’s a plot I made with this script to demonstrate what that relationship is between the MOI and the % of green cells you should expect to see:

Things look nice and linear at low MOIs, but as soon as you start getting close to an MOI of 1, the number of multiply infected cells starts taking on a bigger role, until eventually every cell becomes multiply infected as the last few “lucky” uninfected cells get so overrun that there is no escaping infection. In terms of trying to get nice quantitative data, that saturation of the assay is kind of problematic. Generally speaking, I find that somewhere around 20-30% GFP positive cells is where you really start losing that linearity (shown by the dotted line).
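One way to put a number on that loss of linearity is to invert the Poisson relationship and ask how much the observed fraction of green cells undershoots the MOI. The sketch below assumes the same idealized Poisson model; the function names are made up for illustration.

```python
import math


def moi_from_fraction(fraction_positive):
    """Invert fraction = 1 - exp(-MOI) to recover the MOI."""
    return -math.log(1.0 - fraction_positive)


def linear_shortfall(fraction_positive):
    """Relative amount by which the observed fraction underestimates the MOI."""
    return 1.0 - fraction_positive / moi_from_fraction(fraction_positive)


for fraction in (0.01, 0.05, 0.20, 0.30, 0.63):
    print(
        f"{fraction:.0%} positive -> MOI {moi_from_fraction(fraction):.3f}, "
        f"undershoot {linear_shortfall(fraction):.1%}"
    )
```

Under this model the undershoot is well under 1% at 1% positive cells, roughly 10% at 20% positive, and roughly 16% at 30% positive.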

Why not just always do small numbers, like 1% or even 0.1%? Well, because you’re kind of caught between a rock and a hard place: I guess the rock is the unbreakable boundary of 100% infected cells, while if you go too low it gets hard to sample enough events to accurately quantitate something that’s really rare.

I do think there’s still a lot of meat in that assay though, as long as you count sufficient cells. I generally quantitate the number of infected + uninfected cells using flow cytometry. Being able to count 10,000+ events a second means you can run through all of the cells of a 24-well in under a minute. That means you can pretty reasonably sample 100,000 cells per condition. Assuming that’s the case, how does our ability to accurately quantitate our sample drop as the number of infected cells falls?

IMO, it begins to become unusable when you get to roughly 0.03 percent GFP positive cells. That’s because past that, the amount of variability you get in trying to measure the mean is roughly about the size of the mean itself. Furthermore, your chances of encountering replicate experiments where you get *zero* GFP positive cells really start to increase. And those zeros are pretty annoying, since they are quite uninformative.
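To get a feel for how this floor depends on how many cells you count, here is a rough Python sketch that treats the number of GFP positive cells as a simple binomial count. It only captures counting noise (real replicates have other sources of variability too), and the cell counts below are assumptions for illustration that may not match what the linked script used.

```python
import math


def expected_positives(n_cells, fraction_positive):
    """Mean number of positive cells among n_cells counted."""
    return n_cells * fraction_positive


def coefficient_of_variation(n_cells, fraction_positive):
    """SD of the binomial count divided by its mean: sqrt((1 - p) / (n * p))."""
    p = fraction_positive
    return math.sqrt((1.0 - p) / (n_cells * p))


def prob_zero_positives(n_cells, fraction_positive):
    """Chance that a replicate contains no positive cells at all."""
    return (1.0 - fraction_positive) ** n_cells


fraction = 0.0003  # 0.03 percent
for n_cells in (3_000, 10_000, 100_000):  # hypothetical numbers of cells counted
    print(
        f"{n_cells} cells: expect {expected_positives(n_cells, fraction):.1f} positives, "
        f"CV {coefficient_of_variation(n_cells, fraction):.2f}, "
        f"P(zero) {prob_zero_positives(n_cells, fraction):.3g}"
    )
```

For example, at 0.03 percent positive, counting 10,000 cells gives only about 3 expected positive events, so both the relative noise and the chance of an all-zero replicate are much larger than when counting 100,000.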

So just understanding the basic stats, it looks like my informative range is going to be ~ 0.03 percent positive to ~ 30 percent positive. So roughly 3 orders of magnitude. Ironically, that’s just about exactly what I saw when I ran serial dilutions of a SARS CoV pseudovirus on ACE2 overexpressing cells.

So there you have it. The theoretical range of the assay, backed experimentally by what I saw in real life. The wonderfulness of science. You can find the script I used to make these estimations here.

Sanger seq analysis – finding correct clones

We do a lot of molecular cloning in the lab. Sarah has been working on a great protocols.io page dedicated to writing up the entire process, which I’ll link to once it’s completely set. But once you’ve successfully extracted DNA from individual transformants, a key tool in the molecular cloning pipeline is to find the clonal DNA prep that has the intended recombinant DNA in it. We use Sanger sequencing to identify those clones. That is what this instructional post will be about.

First, gather all of your data. I have a plasmid map folder on the lab google drive, where all of the physical .gb files for each unique construct are stored. Create a new folder (named after your .gb file), and *COPY* all of the relevant Sanger sequencing traces (these are .ab1 files) into that folder. That will just help organize everything down the road. *PREFIX* the Sanger sequencing files with the clone id (usually “A”, “B”, or “C”, etc). This will make things easier to interpret down the road.
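If you have a lot of traces, the copy-and-prefix step can be scripted. Here is a minimal Python sketch, assuming you already know which trace belongs to which clone; the folder and file names are made up, and the demo runs in a temporary directory with empty stand-in files so it touches nothing real.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_clone_prefix(trace_to_clone, destination):
    """Copy each .ab1 trace into `destination`, prefixing its name with the clone id."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for trace, clone_id in trace_to_clone.items():
        trace = Path(trace)
        target = destination / f"{clone_id}_{trace.name}"
        shutil.copy2(trace, target)  # copy, so the original file stays put
        copied.append(target)
    return copied


# Demo with hypothetical, empty stand-in files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_traces"
    raw.mkdir()
    for name in ("sample1_KAM1042.ab1", "sample2_KAM1042.ab1"):
        (raw / name).write_bytes(b"")
    mapping = {raw / "sample1_KAM1042.ab1": "A", raw / "sample2_KAM1042.ab1": "B"}
    copied = copy_with_clone_prefix(mapping, Path(tmp) / "example_construct")
    copied_names = sorted(path.name for path in copied)
    originals_kept = all(path.exists() for path in mapping)

print(copied_names)
```

Using a copy rather than a move keeps the original sequencing download intact, which matches the *COPY* emphasis above.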

You presumably already have this plasmid map on Benchling, since that’s likely where you designed the plasmid, so open that up. If you don’t already have it on there, then import it. Once open, click on the “alignment” button on the right hand side.

If this is your first time trying to align things to this plasmid, then your only option will be to “Create new alignment”; press this button.

Next you’ll get to another screen which will allow you to add in your ab1 files. Click the choose files button, go into the new folder you had made in the “Plasmid” directory of the lab google drive, and import the selected ab1 files.

If you’ve successfully added the ab1 files, then the screen should look like this:

The default settings are fine for most things, so you can go ahead and hit the “create alignment” button. It will take a few seconds, but at the end you should get a new screen that looks like this.

The above screen is showing you your template at the top, as well as the Sanger seq peaks for each of your Sanger runs. If you’re trying to screen miniprep clones to see which prep might have the right construct, then go to the part of the map / alignment that was at the junction. For example, in this above construct, I had shuttled in the iRFP670 in place of mCherry in the parental construct, so anything that now has iRFP670 in there in place of mCherry is an intended construct. Looks like I went 3/3 this time.

This is a good chance to look for any discrepancies, which are signified by red tick marks in the bottom visualization. I’ve now moved the zoomed portion to that area, and as you can see, the top two clones seem to have an extra G in the sequence triggering these red lines. This is where a bit of experience / intuition is important. Since this is toward the end of the sequencing reaction and the peaks are getting really broad (compare to the crisp peaks in the prior image), the peak-calling program was having a hard time with this stretch of multiple Gs, thus calling it 4 Gs instead of 3. I’m not concerned about this at all, since it’s more of an artifact of the sequencing rather than a legit mutation in the construct. If we were to sequence this area again with a primer that was closer, I’m almost sure all of the constructs will show they only have 3 Gs.

Since all three constructs seem to have the right insertion and are without any seemingly legit errors (yet), I typically choose one clone and move forward with fully confirming it. All things being equal, I make my life simpler by choosing the clone that is earliest in the alphabet (so A in this case). That said, if all three looked fine but Clone C had by far the highest quality sequencing (say, the Clone A trace only went 400nt while the Clone C trace went 800+), I’d instead go with Clone C.

Next is sequencing the rest of the open reading frame. If you already chose a sequencing primer, then you’ve probably already attached a bunch of primers onto this map. But let’s pretend you haven’t already done this. To do this, go to the “primer” button and hit “attach new primers”, as below.

You may already have a primer folder selected. In that case, you’re all set to go. Otherwise, you’ll have to select the most recent primer folder. Click the “Add locations” area, open the triangle for the folder that says “O_Primers” and select the most recent folder (since this is currently August, this would be “20200811”). If selected, it should show up in the window, like the below picture.

Hit “Find binding sites” and it will come up with a long list of primers *that we already have* that are located in your construct.

This part gets kind of confusing. To actually use these primers, you’ll first have to hit the top left box, located on the header row of the big table of primers. You should then see a check mark next to every primer.

Once you do that, hit the “Attach selected primers” button that’s in green at the top right of the screen.

Once you do that, all of the primers that were previously listed but colored white should now be colored green. NOW you’re free to switch back to your alignment.

Once you switch back to your alignment, the top should have a big yellow box that says “Out of Sync” and have a blue button that says “Realign”. Hit the realign button, which will call up one of your earlier screens, and you just have to hit realign once more.

Now all of the attached primers should show up in the top row of the alignment window. Since I have so many overlapping primers, this eats up a bunch of the space on the screen (you can turn off the primers if needed by clicking on the arrow next to “Template” and clicking off the box next to Primers). Still, now you can move around the map and find primers that will be able to sequence different parts of the plasmid you have not sequenced yet. In this case, I’ll likely just move forward with one clone (such as Clone A), and I’ll start sequencing other parts of the ORF, such as the remaining parts of mCherry which I can likely get with the primer KAM1042.

Congrats, you’re now a Sanger sequencing analysis master.

FlowJo Analysis of GFP positive cells

We do a lot of flow cytometry in the lab. Inevitably, what ends up being the most practical tool for analysis of flow cytometry data is FlowJo. While I’ve been using FlowJo for a long time, I realize it isn’t super intuitive and people new to the lab may struggle at first in using it. Thus, here’s a short set of instructions for using it to do a basic process, such as determining what percentage of live cells are also GFP positive.

Obviously, if you don’t have FlowJo yet, then download it from the website. Next, log into FlowJo Portal. I’m obviously not going to share my login and password here; ask someone in the lab or consult the lab google docs.

Once logged in, you’ll be starting with a blank analysis workspace, as below.

Before I forget, an annoying default setting of FlowJo is that it only lists two decimal places in most of its values. This can be prohibitively uninformative if you have very low percentages that you’re trying to accurately quantitate. Thus, click on the “Preferences” button:

Then click on the “Workspaces” button

And finally once in that final window, change the “Decimal Precision” value to something like 8.

With that out of the way, now you can perform your analysis. Before you start dragging in samples, I find it useful to make a group for the specific set of samples you may want to analyze. Thus, I hit the “Create Group” button and type in the name of the group I’ll be analyzing.

Now that the group is made, I select it, and then drag the new sample files into it, like below:

Now to actually start analyzing the flow data. Start by choosing a representative sample (e.g. the first sample), and double clicking on it. By default, a scatterplot should show up. Set it so forward scatter (FSC-A) is on the X-axis, and side scatter (SSC-A) is on the Y-axis. Since we’re mostly using HEK cells, the main thing we will be doing in this screen is gating for the population of cells while excluding debris (small FSC-A but high SSC-A). Thus, make a gate like this:

Once you have made that gate, you’ll want to keep it constant between samples. Thus, right click on the “Live” population in the workspace and hit “Copy to Group”. Once you do that, the population should now be in bold, with the same text color as the group name.

Next is doublet gating. So the live cell population will already be enriched for singlets, but having a second “doublet gating” step will make it that much more pure. Here is the best description of doublet gating I’ve seen to date. To do this, make a scatterplot where FSC-A is on the X-axis, and FSC-H is on the Y-axis. Then only gate the cells directly on the diagonal, thus excluding those that have more FSC-A relative to FSC-H. Name these “Singlets”.

And like before, copy this to the group.

Next is actually setting up the analysis for the response variable we were looking to measure. In this case, it’s GFP positivity, captured by the BL1-A detector. While this can be done in histogram format, I generally also do this with a scatterplot, since it allows me to see even small numbers of events (which would be smashed against the bottom of the plot if it were a smoothed histogram). Of course, a scatterplot needs a second axis, so I just used mCherry fluorescence (or the lack of it, since these were just normal 293T cells), captured by the YL2-A detector.

And of course copy that to the group as well (you should know how to do this by now). Lastly, the easiest way to output this data is to hit the Table Editor button near the top of the screen to open up a new window. Once in this window, select the populations / statistics you want to include from the main workspace, and drag it into the table editor, so you have something that looks like this.

Some of those statistics aren’t what we’re looking for. For example, I find it much more informative to have the singlets show total count, rather than Freq of parent. To do this, double click on that row, and select the statistic you want to include.

And you should now have something that looks like this:

With the settings fixed, you can hit the “Create Table” button at the top of the main workspace. This will make a new window, holding the table you wanted. To actually use this data elsewhere (such as with R), export it into a csv format which can be easily imported by other programs.
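As a rough sketch of what that downstream import can look like, here is the same idea in Python rather than R, using only the built-in csv module. The column names and numbers are made up for illustration (your exported table will have whatever headers you set up in the Table Editor), and the sketch assumes the GFP percentage is a frequency of the Singlets parent population.

```python
import csv
import io

# Hypothetical stand-in for an exported FlowJo table; real column headers will differ.
example_csv = """Sample,Singlets_Count,GFP_Positive_Percent
Sample_A.fcs,10000,0.01234567
Sample_B.fcs,20000,45.5
"""

def read_table(handle):
    """Return one dict per sample, with the numeric columns converted to numbers."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "sample": row["Sample"],
            "singlets": int(row["Singlets_Count"]),
            "gfp_percent": float(row["GFP_Positive_Percent"]),
        })
    return rows

table = read_table(io.StringIO(example_csv))
for entry in table:
    # Estimate the number of GFP positive events from the percentage and singlet count.
    events = entry["singlets"] * entry["gfp_percent"] / 100
    print(entry["sample"], round(events, 2))
```

For a real export, replace the io.StringIO(...) stand-in with an opened file, e.g. open("my_table.csv", newline="").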

FYI, if you followed everything exactly up to here, you should only have 2 data columns and not 3. I had simplified some things, but forgot to update this last image so it’s now no longer 100% right (though the general idea is still correct).

Congratulations. You are now a FlowJo master.

Optimal Laser and Detector Filter Combinations for Fluorescent Proteins

The people at the CWRU flow cytometry core recently did a clean reinstall of one of their instruments, which meant that we had to re-set up our acquisition template. I still ended up eyeballing what would be the best laser / filter sets based on the pages over at FPbase.org, but I had a little bit of free time today, so I decided to work on a project I had been meaning to do for a while.

In short, between the downloadable fluorescent spectra at FPbase, as well as known instrument lasers and detector bandpass filters, I figured I could just write a script that takes in whatever fluorescent protein spectra you have downloaded, and makes a table showing you which laser + detector filter combinations give you the highest amount of fluorescence.
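To give a feel for the general idea, here is a minimal Python sketch. It is not the R script itself, and the scoring rule is an assumption: each laser + filter combination is scored as the relative excitation at the laser wavelength multiplied by the fraction of the total emission that falls inside the bandpass. The spectra are invented Gaussians roughly shaped like a red fluorescent protein (not FPbase data), and the lasers and filters are generic placeholders rather than any specific instrument configuration.

```python
import math

def gaussian_spectrum(peak_nm, width_nm, start=350, stop=800):
    """Made-up spectrum: wavelength (nm) -> relative intensity, peaking at 1.0."""
    return {wl: math.exp(-((wl - peak_nm) ** 2) / (2 * width_nm ** 2))
            for wl in range(start, stop + 1)}

def score(excitation, emission, laser_nm, filter_center, filter_width):
    """Relative excitation at the laser line times the fraction of emission passed."""
    low = filter_center - filter_width / 2
    high = filter_center + filter_width / 2
    passed = sum(v for wl, v in emission.items() if low <= wl <= high)
    return excitation[laser_nm] * passed / sum(emission.values())

# Placeholder values only: an mCherry-like protein, and generic lasers / filters.
excitation = gaussian_spectrum(587, 25)
emission = gaussian_spectrum(610, 25)
lasers = [488, 561, 640]
filters = [(530, 30), (610, 20), (670, 30)]  # (center, width) in nm

# Score every laser + filter pair and sort from brightest to dimmest.
ranked = sorted(
    ((score(excitation, emission, laser, c, w), laser, f"{c}/{w}")
     for laser in lasers for c, w in filters),
    reverse=True,
)
for value, laser, band in ranked:
    print(f"{laser} nm laser, {band} filter: {value:.4f}")
```

With real spectra, the two Gaussian stand-ins would be replaced by the excitation and emission columns downloaded from FPbase.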

Here’s the R script on the lab GitHub page. I made it for the two flow cytometers and two sorters I use at the CWRU cytometry core, although it would presumably be pretty easy to change the script to make it applicable for whatever instruments are at your place of work. So here’s a screenshot of a compendium of the results for these instruments:

Nothing too surprising here, although it’s still nice / interesting to see the actual results. While it’s somewhat obvious since the standard Aria has no Green or Yellow-Green laser, we should not do any sorting with mCherry on it. Instead, we should use the Aria-SORP, which has the full complement of lasers we need.

Designing Primers for Targeted Mutagenesis

Now that my lab is fully equipped, I’m taking on rotation students. Unfortunately, with the pandemic, it’s harder to have one-on-one meetings where I can sit down and walk the new students through every method. Furthermore, why repeat teaching the same thing to multiple students when I can just make an initial written record that everyone can reference and just ask me questions about? Thus, here’s my instructional tutorial on how I design primers in the lab.

First, it’s good to start out by making a new benchling file for whatever you’re trying to engineer. If you’re just making a missense mutation, then you can start out by copying the map for the plasmid you’re going to use as a template. Today, we’ll be mutating a plasmid called “G619C_AttB_hTrim-hCPSF6(301-358)-IRES-mCherry-P2A-PuroR” to encode the F321N mutation in the CPSF6 region. This should abrogate the binding of this peptide to the HIV capsid protein. Eventually every plasmid in the lab gets a unique identifier based on the order it gets created (this is the GXXXX name). Since we haven’t actually started making this plasmid yet, I usually just stick an “X” in front of the name of the new file, to signify that it’s *planned* to be a new plasmid, with G619C being used as the template. Furthermore, I write in the mutation that I’m planning to make in it. Thus, this new plasmid map is now temporarily being called “XG619C_AttB_hTrim-hCPSF6(301-358)-F321N-IRES-mCherry-P2A-PuroR”.

That’s what the overall plasmid looks like. We’ll be mutating a few nucleotides in the 4,000 nt area of the plasmid.

I’ve now zoomed into the part of the plasmid we actually want to mutate. The residue is Phe321 in the full length CPSF6 protein, but in the case of this Trim-fusion, it’s actually residue 344.

I next like to “write in” the mutation I want to make, as this 1) makes everything easier, and 2) is part of the goal of making a new map that now incorporates that mutation. Thus, I’ve now replaced the first two T’s of the Phe codon “TTT” with two A’s, making the “AAT” codon which encodes Asn (see the image above).
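The same "write in the mutation" step can be sketched in a few lines of Python. The sequence below is a made-up toy ORF, not the actual plasmid, and the residue numbering is only for illustration; the codon table contains just the codons needed for this example.

```python
# Only the Phe and Asn codons needed for this illustration.
CODON_TABLE_SUBSET = {"TTT": "Phe", "TTC": "Phe", "AAT": "Asn", "AAC": "Asn"}

def mutate_codon(orf, residue_number, new_codon):
    """Replace the codon for a 1-based residue number in an in-frame ORF."""
    start = (residue_number - 1) * 3
    return orf[:start] + new_codon + orf[start + 3:]

# Made-up in-frame sequence; the Phe codon (TTT) is residue 3 here.
orf = "ATGGCCTTTGGATAA"
mutant = mutate_codon(orf, 3, "AAT")
print(orf[6:9], CODON_TABLE_SUBSET[orf[6:9]], "->",
      mutant[6:9], CODON_TABLE_SUBSET[mutant[6:9]])
```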

Next is planning the primers. So there are a few ways one could design primers to make the mutation. I like to create a pair of overlapping (~ 17 nt), inverse primers, where one of the primers encodes the new mutation in it. PCR amplification with these primers should result in a single “around-the-circle” amplicon, where there is ~ 17 nt of homology on the terminal ends. These ends can then be brought together and closed using Gibson assembly.

So first to design the forward primer. This is the primer that will go [5’end] –[17 nt homology] — [mutated codon] — [primer binding region] — [3’end]. So the first step is to figure out the primer binding region.

In a cloning scheme like this, I like to start selecting the nucleotides directly 3′ of the codon to be mutated, and select enough nucleotides such that the melting temperature is ~ 55°C. In actuality, the melting temperature will be slightly higher, since 1) we will end up having 17 nt of matching sequence 5′ of the mutated codon, and 2) the 3rd nt in the codon, T, will actually be matching as well.

Now that I’ve determined how long I need that 3′ binding region to be, I select the entire set of nucleotides I want in my full primer. In this case, this ended up being a primer 36 nt in length (see below).

Since this is the forward primer, I can just copy the “sense” version of this sequence of nucleotides.

OK, so next to design the reverse primer. This is simpler, since it’s literally just a series of nucleotides going in the antisense orientation directly 5′ of the codon (as it’s shown in the sense-stranded, plasmid map). I shoot for ~ 55°C to 60°C, usually just doing a little bit under 60°C.

Since this is the reverse primer, we want the REVERSE COMPLEMENT of what we see on the plasmid map.
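Putting the forward and reverse primer logic together, here is a rough Python sketch. The template is an invented sequence, and the binding-region lengths are fixed placeholders; in practice they are picked by melting temperature in Benchling, which this sketch does not attempt to calculate. The function and parameter names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def design_primers(mutant_template, codon_start, homology=17, fwd_binding=16, rev_binding=20):
    """Sketch of the overlapping inverse primer pair on a linear view of the mutated map.

    codon_start is the 0-based index of the mutated codon. The binding lengths
    are placeholders; in practice they are chosen by melting temperature.
    """
    codon_end = codon_start + 3
    # Forward primer: [homology] - [mutated codon] - [binding region], sense strand.
    forward = mutant_template[codon_start - homology:codon_end + fwd_binding]
    # Reverse primer: reverse complement of the nucleotides directly 5' of the codon.
    reverse = reverse_complement(mutant_template[codon_start - rev_binding:codon_start])
    return forward, reverse

# Invented sequence with the already written-in AAT codon starting at index 24.
template = "GATTACAGGCTCAAGTCCGATGCA" + "AAT" + "CCTGGAGTTCAAGCTGACGT"
forward, reverse = design_primers(template, codon_start=24)
print("Forward:", forward, len(forward), "nt")
print("Reverse:", reverse, len(reverse), "nt")
```

With these placeholder lengths the forward primer is 17 + 3 + 16 = 36 nt, and its first 17 nt are the same stretch of the map that the reverse primer anneals to, which is what gives the amplicon its homologous ends.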

Voila, we now have the two primers we need. We just now need to order these oligos (we order from ThermoFisher, since it’s the cheapest option at CWRU), and can then perform the standard MatreyekLab cloning workflow.

Using prior data to optimize the future

As of this posting, we’ve cloned 176 constructs in the lab. I’ve kept pretty meticulous notes about what standard protocol we’ve used each time, how many clones we’ve screened, and how many clones had DNA where the intended insertions / deletions / mutations were present. With this data, I wondered whether I could take a quick retrospective look at my observed success / failure rates to see if my basic workflow / pipeline was optimized to maximize benefit (i.e. getting the recombinant DNA we want) while limiting cost (i.e. time, effort, $$$ for reagents and services). I particularly focused on 2-part Gibsons, since that’s the workhorse approach utilized for most molecular cloning in the lab.

First, here’s a density distribution reflecting reaction-based success rates (X number of correct clones in Y number of total screened clones, or X / Y = success rate).

I then randomly and repeatedly sampled N times from that distribution, with N ranging from 1 through 5, effectively pretending that I was screening 1 clone, 2 clones … up to 5 clones for each PCR + Gibson reaction we were performing. Since 1 good clone is really all you need, for each sampling of N clones, I checked whether any of them were a success (giving that reaction a value of “1”) or whether all of them failed (giving that reaction a value of “0”). I repeated this process 100 times, counted the sum of “1” and “0” values, and divided by 100 to get an overall success rate. I repeated this process 50 overall times to get a sense of the variability of outcome with each condition. Here are the results:
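For anyone who wants to play with the idea, here is one possible reading of that procedure as a Python sketch. It is an assumption about the details rather than the actual analysis script: each simulated reaction draws a success rate from the observed per-reaction rates, and each of the N screened clones is then correct with that probability. The example rates are invented, not the lab's data.

```python
import random

def simulate(rates, n_clones, n_reactions=100, n_repeats=50, seed=0):
    """Return n_repeats estimates of the chance that at least 1 of n_clones is correct."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        wins = 0
        for _ in range(n_reactions):
            # Pick one reaction's observed success rate, then "screen" n_clones from it.
            rate = rng.choice(rates)
            if any(rng.random() < rate for _ in range(n_clones)):
                wins += 1  # this reaction scores "1"; otherwise it scores "0"
        results.append(wins / n_reactions)
    return results

# Invented per-reaction success rates (correct clones / screened clones).
example_rates = [0.0, 0.25, 0.33, 0.5, 0.5, 0.67, 0.75, 1.0, 1.0, 1.0]
for n in range(1, 6):
    outcomes = simulate(example_rates, n)
    print(n, "clones screened: mean success rate", round(sum(outcomes) / len(outcomes), 3))
```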

We screen 3 clones per reaction in our standard protocol, and I think that’s a pretty good number. We capture at least 1 successful clone 3/4 of the time. Sure, maybe we increase how often we get the correct clone on the first pass if we instead screen 4 or 5 clones at a time, but the extra effort / time / cost doesn’t really seem worth it, especially since it’s totally possible to screen a larger number on a second pass for those tough-but-worth-it clones. Some of those reactions are also going to be ones that are just bad, period, and need to be re-started from the beginning (perhaps even by designing new primers), which is a screening hill that certainly isn’t worth dying on.

9/10/20 edit: In my effort to make it easier for trainees to learn / recreate what I’m doing, I posted the data and analysis script to the lab GitHub.

Modeling bacterial growth

I do a lot of molecular cloning, which means a lot of transformations of chemically competent E. coli. Using 50 uL of purchased competent bacteria would cost about $10 per transformation, which would be an AWFUL waste of money, especially with this being a highly recurring expense in the lab. I had never made my own competent cells before, so I had to figure this out shortly after starting my lab. It took a couple of days of dedicated effort, but it ended up being quite simple (I’ll link to my protocol a bit later on). Though my frozen stocks ended up working fine, I became quite used to creating fresh cells every time I need to do a transformation. The critical step here is taking a saturated overnight starter culture, and diluting it so you can harvest a larger volume of log-phase bacteria some short time later. A range of ODs [optical density here defined as absorbance at 600 nm] work, though I like to use bacteria at an OD around 0.2. I had gotten pretty good at being able to eyeball when a culture was ready for harvesting (for LB in a 250 mL flask, I found this was right when I started seeing turbidity), but I figured there was a better way to know when it’s worth sampling and harvesting.

I started keeping good notes about 1) the starting density of my prep culture (OD of the overnight culture divided by the dilution factor), 2) the amount of time I left the prep culture growing, and 3) the final OD of the prep culture. I converted everything into cell density, which is a bit more intuitive than OD (I found 1 OD[A600] of my bacteria roughly corresponded to 5e8 bacteria per mL), and worked in those units from there on out. Knowing bacteria exhibit exponential growth, I log base-10 transformed the counts. Much like the increasing number of COVID-19 deaths experienced by the US from early March through early April, exponential growth becomes linear in log-transformed space. I figured I could thus estimate the growth of my prep culture of competent cells by making a multi-variate linear model, where the final density of the bacteria was dependent on the starting bacterial density and how long I left it growing. I figured the lag-phase from taking the saturated culture and sticking it into cold-LB would end up being a constant in the model. Here’s my dataset, and here’s my R Markdown analysis script. My linear model seemed to perform pretty well, as you can see in the below plot. As of writing this, the Pearson’s r was 0.98.
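The actual analysis lives in the R Markdown script, but the shape of the model can be sketched in plain Python. This is a sketch under assumptions: the model form log10(final) = a + b*log10(start) + c*minutes is one way to write "final density depends on starting density and time, with lag as a constant", and the observations below are generated from made-up coefficients, not taken from the lab dataset. Only the 5e8 bacteria per mL per OD conversion comes from the text.

```python
import math

def solve_linear_system(matrix, vector):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_growth_model(start_densities, minutes, final_densities):
    """Least-squares fit of log10(final) = a + b*log10(start) + c*minutes."""
    design = [[1.0, math.log10(s), t] for s, t in zip(start_densities, minutes)]
    target = [math.log10(f) for f in final_densities]
    # Normal equations: (X^T X) coefficients = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, target)) for i in range(3)]
    return solve_linear_system(xtx, xty)

OD_TO_CELLS_PER_ML = 5e8  # rough conversion quoted in the text

# Invented observations generated from made-up coefficients (not the lab's data).
true_a, true_b, true_c = -0.3, 1.0, 0.01
starts = [od * OD_TO_CELLS_PER_ML / 1000 for od in (2.0, 3.0, 4.0, 2.5, 3.5)]
times = [150, 180, 120, 200, 160]
finals = [10 ** (true_a + true_b * math.log10(s) + true_c * t)
          for s, t in zip(starts, times)]

a, b, c = fit_growth_model(starts, times, finals)
print("intercept:", round(a, 3), "start coefficient:", round(b, 3), "per-minute slope:", round(c, 4))
```

In this form the intercept absorbs the lag phase, and the per-minute slope reflects the doubling time once the culture is growing.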

The aforementioned analysis script has a final chunk that allows you to input the starting OD of your starter culture, and assuming a 1000-fold dilution, tells you how long you likely need to wait to hit the right OD of your prep culture. Then again, I don’t think anyone really wants to enter this info into a computer every time they want to set up a culture, so I made a handy little “look-up plot”, shown below, where a lab member could just look at their starter culture OD on the x-axis, choose the dilution they want to do (staying within 2x of 1000-fold, since I don’t know if smaller dilutions can affect bacterial competency), and figure out when they need to be back to harvest (or at least stick the culture on ice). I’ve now printed this plot out and left it by my bacterial shaker-incubator.
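The look-up itself is just the linear model solved for time. Here is a minimal Python sketch of that rearrangement, assuming a model of the form log10(final) = intercept + start_coef*log10(start) + time_coef*minutes. The coefficients below are placeholders for illustration only and will not reproduce the real plot; the fitted values from the actual analysis script should be used instead. The function and parameter names are hypothetical.

```python
import math

OD_TO_CELLS_PER_ML = 5e8  # rough conversion quoted earlier in the post

def minutes_to_target(starter_od, dilution, target_od, intercept, start_coef, time_coef):
    """Solve log10(final) = intercept + start_coef*log10(start) + time_coef*minutes for minutes."""
    start = starter_od * OD_TO_CELLS_PER_ML / dilution
    target = target_od * OD_TO_CELLS_PER_ML
    return (math.log10(target) - intercept - start_coef * math.log10(start)) / time_coef

# Placeholder coefficients for illustration only; use the fitted values from the real model.
coefs = {"intercept": -0.3, "start_coef": 1.0, "time_coef": 0.01}
for dilution in (500, 1000, 2000):
    wait = minutes_to_target(3.0, dilution, 0.2, **coefs)
    print(f"1:{dilution} dilution: about {wait:.0f} minutes")
```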

Note: The above data was collected when diluting starter culture bacteria into *COLD* LB that was stored in the fridge. We’ve since shifted to diluting the bacteria into room-temp LB (~ 25°C), which has somewhat expectedly resulted in slightly faster times to reach the desired OD. If you’re doing that too, I would suggest subtracting ~ 30 min of incubation time from the above times to make sure you don’t overshoot your desired OD.

I’m still much more of a wet-lab scientist than a computational one. That said, god damn do I still think the moderate amount of computational work I can do is still empowering.