I’ve been working on single-cell sequencing pipelines at work. My favorite collaboration was with Rob Amezquita, a brilliant postdoc. I handled the data engineering and infrastructure, and he handled the science.
Three weeks ago, I found out he would be leaving. I had to do a knowledge transfer, from a postdoc, on an area of research that wasn’t my specialty, from home, in a week and a half.
Here’s the guide I wish I had when I started…
I’ve worked with the results from next-generation sequencers for a while, since it’s crucial to my work.
There’s a lot to learn about the topic and its applications; I’ve found these resources the best place to get started:
From the various books and articles I’ve read, there appear to be 4 key steps to single-cell sequencing:
I won’t go into much depth about the sequencing methods themselves. The HBC page on it goes into far better detail than I could myself.
Sequencing is how physical objects (cells, RNA, proteins) become digital. It’s critical.
This isn’t often seen as a critical step in single-cell work, but I have seen many projects struggle without it.
I designed my workflows/pipelines to be simple. Very, very simple. There’s no sense adding complicated code on top of a complicated subject. It helps that there are a lot of tools; I can stand on the shoulders of giants.
There are key things to set up:
I won’t go into too many details here; it’s a post all on its own.
Alignment is the process of figuring out where in a genome each read fragment came from. We can take the result, aggregate it, and create a count matrix. Once again, the HBC deck is one of the best.
After days waiting for cellranger pipelines to finish, I’ve switched over to STAR. On occasion I will use kallisto or [kallisto||bustools (‘kb’)](https://www.kallistobus.tools/tutorials).
The Garbage In, Garbage Out concept applies to both bioinformatics and software engineering. Before going any further, we need to run quality checks.
There are different kinds of alignment (feature barcoding, velocity). The basic version creates expression counts in a cell-by-gene matrix.
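To make the cell-by-gene idea concrete, here is a minimal sketch of the aggregation step. It assumes alignment has already produced one (cell barcode, gene) pair per read; the barcodes, gene names, and the `build_count_matrix` helper are all invented for illustration. Real tools such as STAR or kb also handle details this toy version ignores, such as collapsing duplicate molecules.

```python
from collections import Counter

# Hypothetical aligned reads: one (cell barcode, gene) pair per read.
# These values are made up for illustration only.
aligned_reads = [
    ("AAAC", "GeneA"),
    ("AAAC", "GeneA"),
    ("AAAC", "GeneB"),
    ("TTTG", "GeneB"),
    ("TTTG", "GeneC"),
]


def build_count_matrix(reads):
    """Aggregate (cell, gene) pairs into a dense cell-by-gene count matrix."""
    counts = Counter(reads)
    cells = sorted({cell for cell, _ in reads})
    genes = sorted({gene for _, gene in reads})
    # Counter returns 0 for pairs that were never observed.
    matrix = [[counts[(cell, gene)] for gene in genes] for cell in cells]
    return cells, genes, matrix


cells, genes, matrix = build_count_matrix(aligned_reads)
for cell, row in zip(cells, matrix):
    print(cell, dict(zip(genes, row)))
```

Rows are cells and columns are genes. Most entries in a real matrix are zero, so production code would likely use a sparse format rather than nested lists.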
Information about sequencing quality is so important that sequencers have it built in. Sequencing output often has a quality score for each base read, using Phred encoding.
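As a rough illustration of Phred encoding, the sketch below decodes a quality string into per-base scores and error probabilities. It assumes the common Phred+33 ASCII offset used in modern FASTQ files; older files may use a different offset. The quality string and function names are made up for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a five-base read.
scores = phred_scores("II5+!")
for q in scores:
    print(q, error_probability(q))
```

A score of 20 means a 1 in 100 chance the base call is wrong, and a score of 40 means 1 in 10,000.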
When we talk about quality checks, we are creating reports/summaries. Researchers look at the stats and decide if the data is good enough to proceed.
Some useful metrics and graphs are:
There are far, far more. This is an area I’m still learning about. If I’m looking at cells and genes, I’d look at cell count, gene count, and mean/median genes detected per cell.
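The cell and gene metrics above can be sketched from a count matrix like this. It is a toy example under a few assumptions: the matrix is a small list of lists with cells as rows, a gene counts as "detected" when its count is above zero, and "gene count" means genes seen in at least one cell. The `qc_summary` name and the data are invented for illustration.

```python
from statistics import mean, median


def qc_summary(matrix):
    """Summarize a cell-by-gene count matrix (rows are cells, columns are genes)."""
    # Genes detected per cell: how many genes have a nonzero count in each row.
    genes_per_cell = [sum(1 for count in row if count > 0) for row in matrix]
    n_columns = len(matrix[0]) if matrix else 0
    # Gene count here means genes detected in at least one cell (an assumption).
    genes_detected = sum(
        1 for j in range(n_columns) if any(row[j] > 0 for row in matrix)
    )
    return {
        "cell_count": len(matrix),
        "gene_count": genes_detected,
        "mean_genes_per_cell": mean(genes_per_cell),
        "median_genes_per_cell": median(genes_per_cell),
    }


# Hypothetical counts for three cells and four genes.
toy_matrix = [
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [5, 0, 0, 0],
]
summary = qc_summary(toy_matrix)
print(summary)
```

In practice these numbers usually come out of the aligner's own report or an analysis package. The point of the sketch is only to show what is being counted.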
Finally, after having aligned data, it’s time for analysis.
The first step is usually some kind of clustering / dimension reduction approach. Some common algorithms are t-SNE (but it’s slow), PCA, and UMAP. I hear about UMAP the most nowadays.
Finally there’s a visualization component. I won’t go into this much; it seems very different every time.
Every scientist I’ve spoken to does iterative analysis. The code for dimension reduction, visualization, and stats should be fast and repeatable.
In many ways this seems familiar. Data generation, processing, and analysis. Sounds like an ETL pipeline, doesn’t it? A good data engineer recognizes the patterns. The details, though, are very different.
I have too much going on. In the last 15 minutes before writing this I thought about 3 work projects, legal paperwork, trip planning, house repairs, what I’m going to cook, and helping a friend move.
My life is a careful balance of chaos and order.
Creativity is chaotic. My thoughts are scattered and rapid when I’m puzzling through a work problem or making notes for music. My best ideas come in the shower, or when I’m daydreaming.
When I have a new idea, I need to retain everything. It’s not organized, so I need to capture unstructured information.
However, incoherent information isn’t useful on its own. It’s only after I process it that my ideas are effective. Creating order from chaos is an essential second step.
Chaotic systems have structure.
Good ideas show up all the time. They’re transient, and unpredictably structured. A system to collect chaotic information must:
There are things that are not required:
I found a great tool for this, the ‘World’s Oldest Data Structure’.
Paper. You read that right. Paper.
I always carry paper and a pen. It’s highly accessible, practically free, secure, quick, and flexible. It supports text, diagrams, music, even mind maps. My friends laugh and smile, presumably in admiration.
Random notes must be turned into something, like grist for the mill. I have a simple method:
I’m Already Doing It
Something New to Do
Let’s say I have a note for something new. I’ll create a new entry in an existing system, perhaps a single Todoist task, a recurring task, or a wiki page. Then I’ll go through my memory palace and copy over anything relevant.
This is where I put everything else. If I’m not going to use it in the near future, I add the note to a wiki page. This system makes assumptions:
Why all the fuss? Because people are chaos and order. Life is order and chaos. I try to embrace that, one step at a time, using paper.
What do you do?