31 May 2020
I’ve been working on single-cell sequencing pipelines at work. My favorite collaboration was with Rob Amezquita, a brilliant postdoc. I handled the data engineering and infrastructure, and he handled the science.
Three weeks ago, I found out he would be leaving. I had to do a knowledge transfer, from a postdoc, on an area of research that wasn’t my specialty, from home, in a week and a half.
Here’s the guide I wish I had when I started…
I’ve worked with the results from next-generation sequencers for a while, since it’s crucial to my work.
There’s a lot to learn about the topic and its applications; I’ve found these resources the best place to get started:
From the various books and articles I’ve read, there appear to be 4 key steps to single-cell sequencing:
I won’t go into much depth about the sequencing methods themselves. The HBC page on it goes into far better detail than I could myself.
Sequencing is how physical objects (cells, RNA, proteins) become digital. It’s critical.
This isn’t often seen as a critical step in single-cell work, but I have seen many projects struggle without it.
I designed my workflows/pipelines to be simple. Very, very simple. There’s no sense adding complicated code on top of a complicated subject. It helps that there are a lot of tools; I can stand on the shoulders of giants.
There are key things to set up:
I won’t go into too many details here; it’s a post all on its own.
Alignment is the process of figuring out where in a genome each read fragment came from. We can then aggregate the results into a count matrix. Once again, the HBC deck is one of the best resources on this.
After days waiting for cellranger pipelines to finish, I’ve switched over to STAR. On occasion I will use kallisto or [kallisto | bustools (‘kb’)](https://www.kallistobus.tools/tutorials).
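To make the "aggregate into a count matrix" step concrete, here’s a minimal sketch in plain Python. It assumes the aligner has already assigned each read to a cell barcode and a gene; the `assignments` list and the `build_count_matrix` helper are made up for illustration and are not the output format or API of any particular tool. Real pipelines also handle things this ignores, such as UMI deduplication and sparse storage.

```python
from collections import Counter

# Hypothetical aligner output: one (cell_barcode, gene) pair per assigned read.
assignments = [
    ("AAAC", "GeneA"),
    ("AAAC", "GeneA"),
    ("AAAC", "GeneB"),
    ("TTTG", "GeneB"),
    ("TTTG", "GeneC"),
]


def build_count_matrix(pairs):
    """Aggregate (cell, gene) pairs into a dense cell-by-gene count matrix."""
    counts = Counter(pairs)
    cells = sorted({cell for cell, _ in pairs})  # row labels
    genes = sorted({gene for _, gene in pairs})  # column labels
    # Counter returns 0 for (cell, gene) pairs that were never observed.
    matrix = [[counts[(cell, gene)] for gene in genes] for cell in cells]
    return cells, genes, matrix


cells, genes, matrix = build_count_matrix(assignments)
for cell, row in zip(cells, matrix):
    print(cell, dict(zip(genes, row)))
```

Rows are cells and columns are genes, which is the shape the downstream QC and analysis steps expect.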
The Garbage In, Garbage Out concept applies to both bioinformatics and software engineering. Before going any further, we need to run quality checks.
There are different kinds of alignment (feature barcoding, velocity). The basic version creates expression counts in a cell-by-gene matrix.
Information about sequencing quality is so important that sequencers have it built in. Sequencing output often has a quality score for each base read, using Phred encoding.
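Here’s a small sketch of what decoding those scores looks like. A Phred score Q relates to the probability P that a base call is wrong by Q = -10 * log10(P). In FASTQ files each score is stored as a single ASCII character; I’m assuming the common offset of 33 here, and the `quality` string is a made-up example rather than real sequencer output.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding by default (ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score to the probability the base call is wrong."""
    return 10 ** (-q / 10)


quality = "II5+!"  # hypothetical quality string for a 5-base read
scores = phred_scores(quality)
for ch, q in zip(quality, scores):
    print(ch, q, error_probability(q))
```

So a score of 20 means roughly a 1 in 100 chance the base is wrong, and 40 means roughly 1 in 10,000.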
When we talk about quality checks, we are creating reports/summaries. Researchers look at the stats and decide if the data is good enough to proceed.
Some useful metrics and graphs are:
There are far, far more. This is an area I’m still learning about. If I’m looking at cells and genes, I’d look at cell count, gene count, and mean/median genes detected per cell.
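As a rough illustration of those cell and gene numbers, here’s a sketch that computes them from a tiny cell-by-gene count matrix. The `counts` data and the `qc_summary` helper are hypothetical, and I’m assuming "gene count" means genes detected in at least one cell; real QC reports have many more metrics than this.

```python
from statistics import mean, median

# Hypothetical cell-by-gene counts: rows are cells, columns are genes.
counts = [
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [5, 1, 3, 0],
]


def qc_summary(matrix):
    """Summarize basic per-cell and per-gene detection stats."""
    n_genes = len(matrix[0]) if matrix else 0
    # Number of genes with a nonzero count in each cell.
    genes_per_cell = [sum(1 for c in row if c > 0) for row in matrix]
    # Genes with a nonzero count in at least one cell (an assumed definition).
    genes_detected = sum(
        1 for j in range(n_genes) if any(row[j] > 0 for row in matrix)
    )
    return {
        "cell_count": len(matrix),
        "gene_count": genes_detected,
        "mean_genes_per_cell": mean(genes_per_cell),
        "median_genes_per_cell": median(genes_per_cell),
    }


summary = qc_summary(counts)
print(summary)
```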
Finally, after having aligned data, it’s time for analysis.
The first step is usually some kind of clustering / dimension reduction approach. Some common algorithms are t-SNE (but it’s slow), PCA, and UMAP. I hear about UMAP the most nowadays.
Then there’s the visualization component. I won’t go into this much; it seems to be different every time.
Every scientist I’ve spoken to does iterative analysis. The code for dimension reduction, visualization, and stats should be fast and repeatable.
In many ways this seems familiar. Data generation, processing, and analysis. Sounds like an ETL pipeline, doesn’t it? A good data engineer recognizes the patterns. The details, though, are very different.