Data Engineering in the Field

16 June 2020

Tomorrow, I’m giving a presentation to students in the University of Washington’s Big Data Technologies certificate program.

The topic? The practicalities of data engineering. Or, as I think of it, Data Engineering in the field.


An Introduction to Single-Cell Sequencing

31 May 2020

I’ve been working on single-cell sequencing pipelines at work. My favorite collaboration was with Rob Amezquita, a brilliant postdoc. I handled the data engineering and infrastructure, and he handled the science.

Three weeks ago, I found out he would be leaving. I had to do a knowledge transfer, from a postdoc, on an area of research that wasn’t my specialty, from home, in a week and a half.

🤯

Here’s the guide I wish I had when I started…

Say what?

I’ve worked with the results from next-generation sequencers for a while, since it’s crucial to my work.

Single-cell sequencing is a different beast. It’s the process of sequencing every single cell in a sample. In particular, I had to learn about single-cell RNA sequencing (scRNAseq).

It’s great for analyzing the gene expression of rare cell types, or the phylogeny of cell development.

There’s a lot to learn about the topic and its applications; I’ve found the HBC training materials, referenced throughout this post, the best place to get started.

From the various books and articles I’ve read, there appear to be four key steps to single-cell sequencing:

  1. Sequencing
  2. Computational setup and reproducibility
  3. Alignment
  4. Analysis: dimension reduction and visualization (UMAPs and friends)

1. Sequencing

I won’t go into much depth about the sequencing methods themselves. The HBC page on them goes into far better detail than I could.

Sequencing is how physical objects (cells, RNA, proteins) become digital. It’s critical.

2. Computational Setup

This isn’t often seen as a critical step in single-cell work, but I have seen many projects struggle without it.

I designed my workflows/pipelines to be simple. Very, very simple. There’s no sense adding complicated code on top of a complicated subject. It helps that there are a lot of tools; I can stand on the shoulders of giants.

There are key things to set up:

  • How data is organized across sequencing, alignment, and analysis
  • The infrastructure for a compute pipeline. Save yourself time and brainpower. Use a workflow tool, containerized software, and immutable data locations.

I won’t go into too many details here; it’s a post all on its own.
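Still, as a taste, here’s a minimal sketch of the kind of layout I mean. The paths and names are hypothetical; the point is that raw data is written once and never modified, and each stage gets its own output location:

```python
from pathlib import Path

# Hypothetical project layout: raw data is immutable, and each pipeline
# stage reads from the previous stage's output and writes to its own dir.
PROJECT = Path("/data/scrnaseq/project-001")

RAW = PROJECT / "raw"            # FASTQs straight off the sequencer; never touched
ALIGNED = PROJECT / "aligned"    # count matrices from the aligner
ANALYSIS = PROJECT / "analysis"  # clustering results, figures, reports

def stage_output(stage: Path, run_id: str) -> Path:
    """Give each run its own output directory so reruns never clobber old results."""
    out = stage / run_id
    out.mkdir(parents=True, exist_ok=True)
    return out
```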

3. Alignment

Alignment is the process of figuring out where in a genome each read fragment came from. We can take the result, aggregate it, and create a count matrix. Once again, the HBC deck is one of the best resources.

After days waiting for cellranger pipelines to finish, I’ve switched over to STAR. On occasion I will use kallisto or [kallisto bustools (‘kb’)](https://www.kallistobus.tools/tutorials).

The Garbage In, Garbage Out principle applies to bioinformatics just as much as software engineering. Before going any further, we need to run quality checks and review the results.

There are different kinds of alignment (feature barcoding, RNA velocity). The basic version creates expression counts in a cell-by-gene matrix.
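To make “cell-by-gene matrix” concrete, here’s a sketch of loading one in Python with scanpy. The path is hypothetical, and the exact reader depends on which aligner produced the output:

```python
import scanpy as sc

# Load a filtered cell-by-gene count matrix (Cell Ranger's MTX layout;
# STARsolo and kb produce similar MTX directories). Path is hypothetical.
adata = sc.read_10x_mtx("aligned/sample1/filtered_feature_bc_matrix")

# Rows are cells (barcodes), columns are genes; values are UMI counts.
print(adata)            # AnnData object with n_obs × n_vars
print(adata.X[:5, :5])  # a small sparse slice of the raw counts
```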

Quality Checks

Information about sequencing quality is so important that sequencers have it built in. Sequencing output often has a quality score for each base read, using Phred encoding.
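For intuition: a Phred score Q encodes a base-call error probability of 10^(-Q/10), and FASTQ files store it as a printable character offset by 33 (Phred+33). A quick sketch:

```python
# Decode a Phred+33 quality string (the standard FASTQ encoding).
# Q = ord(char) - 33, and the base-call error probability is 10^(-Q/10).
def phred_to_error(qual: str) -> list[float]:
    return [10 ** (-(ord(c) - 33) / 10) for c in qual]

quals = "IIIIFFF#"            # 'I' = Q40, 'F' = Q37, '#' = Q2
print(phred_to_error(quals))  # ~[0.0001, ..., 0.0002, ..., 0.63]
```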

When we talk about quality checks, we mean creating reports and summaries. Researchers look at the stats and decide whether the data is good enough to proceed.

Some useful metrics and graphs are:

  • Knee plots
  • Median reads per droplet
  • Number of reads detected
  • Number of unique reads
  • Number of aligned reads

There are far, far more. This is an area I’m still learning about. If I’m looking at cells and genes, I’d look at cell count, gene count, and mean/median genes detected per cell.
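As a sketch of how those basic per-cell metrics fall straight out of the count matrix (reusing the hypothetical AnnData from above):

```python
import numpy as np
import scanpy as sc

adata = sc.read_10x_mtx("aligned/sample1/filtered_feature_bc_matrix")  # hypothetical path

# Per-cell totals and genes detected, computed on the sparse matrix.
counts_per_cell = np.asarray(adata.X.sum(axis=1)).ravel()
genes_per_cell = np.asarray((adata.X > 0).sum(axis=1)).ravel()

print(f"cells: {adata.n_obs}, genes: {adata.n_vars}")
print(f"median genes per cell: {np.median(genes_per_cell):.0f}")

# Knee plot input: barcodes ranked by total counts. The "knee" in this
# curve is what separates real cells from empty droplets.
ranked_counts = np.sort(counts_per_cell)[::-1]
```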

4. Analysis

Finally, after having aligned data, it’s time for analysis.

The first step is usually some kind of clustering or dimension-reduction approach. Common algorithms are PCA, t-SNE (though it’s slow), and UMAP; I hear about UMAP the most nowadays.

Finally there’s a visualization component. I won’t go into this much; it seems very different every time.

Every scientist I’ve spoken to does iterative analysis. The code for dimension reduction, visualization, and stats should be fast and repeatable.
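A minimal, repeatable pass with scanpy might look like the sketch below; the parameter values are illustrative, not recommendations:

```python
import scanpy as sc

adata = sc.read_10x_mtx("aligned/sample1/filtered_feature_bc_matrix")  # hypothetical path

# Standard preprocessing: library-size normalization and log transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Dimension reduction: PCA first, then UMAP on a neighbor graph.
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)

# Clustering (Leiden requires the leidenalg package) and a quick plot.
sc.tl.leiden(adata)
sc.pl.umap(adata, color="leiden")
```

Because each step writes its results back onto the AnnData object, rerunning from any intermediate step is cheap, which is what makes the iteration fast.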

Recognition

In many ways this seems familiar. Data generation, processing, and analysis. Sounds like an ETL pipeline, doesn’t it? A good data engineer recognizes the patterns. The details, though, are very different.

Happy coding!
