An Introduction to Single-Cell Sequencing

31 May 2020

I’ve been working on single-cell sequencing pipelines at work. My favorite collaboration was with Rob Amezquita, a brilliant postdoc. I handled the data engineering and infrastructure, and he handled the science.

Three weeks ago, I found out he would be leaving. I had to do a knowledge transfer, from a postdoc, on an area of research that wasn’t my specialty, from home, in a week and a half.

Head exploding

Here’s the guide I wish I had when I started…

Say what?

I’ve worked with the results from next-generation sequencers for a while, since it’s crucial to my work.

Single-cell sequencing is a different beast. It’s the process of sequencing cells individually, rather than sequencing the whole sample in bulk. In particular, I had to learn about single-cell RNA sequencing (scRNAseq).

It’s great for analyzing the gene expression of rare cell types, or the phylogeny of cell development.

There’s a lot to learn about the topic and its applications; I’ve found these resources the best place to get started:

From the various books and articles I’ve read, there appear to be 4 key steps to single-cell sequencing:

  1. Sequencing
  2. Computational setup and reproducibility
  3. Alignment
  4. Analysis: dimension reduction (e.g. UMAP) and visualization

1. Sequencing

I won’t go into much depth about the sequencing methods themselves. The HBC page on it goes into far better detail than I could myself.

Sequencing is how physical objects (cells, RNA, proteins) become digital. It’s critical.

2. Computational Setup

This isn’t often seen as a critical step in single-cell work, but I have seen many projects struggle without it.

I designed my workflows/pipelines to be simple. Very, very simple. There’s no sense adding complicated code on top of a complicated subject. It helps that there are a lot of tools; I can stand on the shoulders of giants.

There are key things to set up:

  • How data is organized across sequencing, alignment, and analysis
  • The infrastructure for a compute pipeline. Save yourself time and brainpower. Use a workflow tool, containerized software, and immutable data locations.
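One way to read "immutable data locations" is that every run writes to a path derived from its inputs, so a rerun with different settings can never overwrite earlier results. Here is a minimal sketch of that idea. The function name `run_directory`, the `results` base folder, and the parameter names are all made up for illustration; this is not the layout I actually use.

```python
import hashlib
import json

def run_directory(sample_id, params, base="results"):
    """Build an output path that changes whenever the sample or parameters change."""
    # sort_keys makes the hash independent of dictionary ordering
    payload = json.dumps({"sample": sample_id, "params": params}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{base}/{sample_id}/{digest}"

# Hypothetical parameters for an alignment step
path_a = run_directory("sample01", {"aligner": "STAR", "threads": 8})
path_b = run_directory("sample01", {"threads": 8, "aligner": "STAR"})
path_c = run_directory("sample01", {"aligner": "kallisto", "threads": 8})
```

The same inputs always map to the same folder, and any change in parameters lands somewhere new, which keeps old results intact.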

I won’t go into too many details here; it’s a post all on its own.

3. Alignment

Alignment is the process of figuring out where in a genome each read fragment came from. We can take the result, aggregate it, and create a count matrix. Once again, the HBC deck is one of the best.
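To make the "aggregate and create a count matrix" step concrete, here is a toy sketch. It assumes the aligner has already given us one (cell barcode, gene) pair per aligned read; the barcodes and gene names are invented. Real tools also collapse duplicate reads using UMIs, which this sketch ignores.

```python
from collections import Counter

def build_count_matrix(assignments):
    """Aggregate (cell_barcode, gene) pairs into a cell-by-gene count matrix."""
    counts = Counter(assignments)
    cells = sorted({cell for cell, _ in counts})
    genes = sorted({gene for _, gene in counts})
    # Counter returns 0 for pairs that were never observed
    matrix = [[counts[(cell, gene)] for gene in genes] for cell in cells]
    return cells, genes, matrix

# Hypothetical aligner output: one pair per aligned read
assignments = [
    ("AAAC", "GeneA"), ("AAAC", "GeneA"), ("AAAC", "GeneB"),
    ("TTTG", "GeneB"), ("TTTG", "GeneC"),
]
cells, genes, matrix = build_count_matrix(assignments)
```

Rows are cells and columns are genes, so `matrix[0]` holds the counts for barcode `AAAC`.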

After days waiting for cellranger pipelines to finish, I’ve switched over to STAR. On occasion I will use kallisto or kallisto bustools (‘kb’).

The Garbage In, Garbage Out concept applies to both bioinformatics and software engineering. Before going any further, we need to run quality checks.

There are different kinds of alignment (feature barcoding, velocity). The basic version creates expression counts in a cell-by-gene matrix.

Quality Checks

Information about sequencing quality is so important that sequencers have it built in. Sequencing output often has a quality score for each base read, using Phred encoding.
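As a small illustration of Phred encoding: each character in a FASTQ quality string stands for a score Q, and Q maps to an error probability of 10^(-Q/10). The sketch below assumes the common offset of 33 (Phred+33); older data may use a different offset, so check your files. The quality string is made up.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# Hypothetical quality string for a five-base read
quals = phred_scores("II5+!")
probs = [error_probability(q) for q in quals]
```

A score of 20 means a 1 in 100 chance the base is wrong, and a score of 40 means 1 in 10,000.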

When we talk about quality checks, we are creating reports/summaries. Researchers look at the stats and decide if the data is good enough to proceed.

Some useful metrics and graphs are:

  • Knee plots
  • Median-reads-per-droplet
  • Number of reads detected
  • Number of unique reads
  • Number of aligned reads
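A knee plot ranks barcodes (droplets) by their total counts and plots rank against count, usually on log-log axes. The sharp drop, the "knee", is commonly used to separate real cells from empty droplets. Here is a rough sketch of how the points could be computed; the barcodes and totals are invented, and the plotting itself is left out.

```python
def knee_plot_points(counts_per_barcode):
    """Return (rank, total_count) pairs, highest-count barcode first."""
    totals = sorted(counts_per_barcode.values(), reverse=True)
    return [(rank, total) for rank, total in enumerate(totals, start=1)]

# Hypothetical totals: two likely cells and two likely empty droplets
counts_per_barcode = {"AAAC": 5200, "TTTG": 4800, "GGCA": 35, "CCAT": 12}
points = knee_plot_points(counts_per_barcode)
```

In this toy data the drop between rank 2 and rank 3 is where the knee would show up.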

There are far, far more. This is an area I’m still learning about. If I’m looking at cells and genes, I’d look at cell count, gene count, and mean/median genes detected per cell.
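Those cell and gene summaries are simple to compute from a cell-by-gene count matrix. This is a minimal sketch, assuming the matrix is a plain list of rows (one per cell) and that a gene counts as "detected" when its count is above zero. The numbers are invented.

```python
from statistics import mean, median

def qc_summary(matrix):
    """Summarize a cell-by-gene count matrix (rows are cells, columns are genes)."""
    genes_per_cell = [sum(1 for count in row if count > 0) for row in matrix]
    # A gene is counted if at least one cell has a nonzero count for it
    genes_detected = sum(1 for col in zip(*matrix) if any(c > 0 for c in col))
    return {
        "cell_count": len(matrix),
        "gene_count": genes_detected,
        "mean_genes_per_cell": mean(genes_per_cell),
        "median_genes_per_cell": median(genes_per_cell),
    }

# Hypothetical matrix: 3 cells by 4 genes
summary = qc_summary([[2, 1, 0, 0], [0, 1, 1, 0], [3, 0, 0, 0]])
```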

4. Analysis

Finally, after having aligned data, it’s time for analysis.

The first step is usually some kind of clustering or dimension reduction approach. Some common algorithms are PCA, t-SNE (but it’s slow), and UMAP. I hear about UMAP the most nowadays.
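To give a feel for what dimension reduction does, here is a toy PCA sketch in plain Python. It finds only the first principal component, using power iteration on the covariance matrix, and projects each cell onto it. This is for intuition only: in practice you would use a library implementation, and the function name and data here are made up. It assumes the data has at least two cells and some variance.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("data has no variance along the starting direction")
        vec = [x / norm for x in nxt]
    # Project each centered row onto the component
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores

# Hypothetical expression values: 4 cells by 2 genes, gene 2 is twice gene 1
direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
```

Each cell goes from two numbers to one score, and the cells keep their order along the main axis of variation.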

Then there’s a visualization component. I won’t go into this much; it seems very different every time.

Every scientist I’ve spoken to does iterative analysis. The code for dimension reduction, visualization, and stats should be fast and repeatable.


In many ways this seems familiar. Data generation, processing, and analysis. Sounds like an ETL pipeline, doesn’t it? A good data engineer recognizes the patterns. The details, though, are very different.

Happy coding!


Chaos and Order

05 February 2020

I have too much going on. In the last 15 minutes before writing this I thought about 3 work projects, legal paperwork, trip planning, house repairs, what I’m going to cook, and helping a friend move.

My life is a careful balance of chaos and order.

Creativity is chaotic. My thoughts are scattered and rapid when I’m puzzling through a work problem or making notes for music. My best ideas come in the shower, or when I’m daydreaming.

When I have a new idea, I need to retain everything. It’s not organized, so I need to capture unstructured information.

However, incoherent information isn’t useful on its own. It’s only after I process it that my ideas are effective. Creating order from chaos is an essential second step.

Requirements for Chaos

Chaotic systems have structure.

Good ideas show up all the time. They’re transient, and unpredictably structured. A system to collect chaotic information must:

  • Support unstructured information
  • Be accessible all the time
  • Be very quick
  • Be simple

There are things that are not required:

  • Easy to share
  • Permanent
  • Searchable

I found a great tool for this, the ‘World’s Oldest Data Structure’.

Paper. You read that right. Paper.

I always carry paper and a pen. It’s highly accessible, practically free, secure, quick, and flexible. It supports text, diagrams, music, even mind maps. My friends laugh and smile, presumably in admiration.

Order from Chaos

Random notes must be turned into something, like grist for the mill. I have a simple method:

I’m Already Doing It

The most common note is for an existing project: a work effort, piece of music, or discussion with a friend. I’ll have a Todoist project, wiki page, or email thread going, so I add the note into it.

Something New to Do

Let’s say I have a note for something new. I’ll create a new entry in an existing system, perhaps a single Todoist task, a recurring task, or a wiki page. Then I’ll go through my memory palace and copy over anything relevant.

Maybe Someday

This is where I put everything else. If I’m not going to use it in the near future, I add the note to a wiki page. This system makes assumptions:

  • Don’t discard anything. I’m not phlegmatic about losing data.
  • It is easy to adjust.
  • It is quick. I process notes several times a day.

I’m accreting knowledge. Ensembling is powerful; just look at the wisdom of crowds.

Why all the fuss? Because people are chaos and order. Life is order and chaos. I try to embrace that, one step at a time, using paper.

What do you do?