Written By: Tim Bushnell, Ph.D.
As a follow-up to our post on tSNE where we compared the speed of calculation in leading software packages, let’s consider the case of SPADE (Spanning-tree Progression Analysis of Density-normalized Events). A favored algorithm in the flow cytometry community, SPADE is used when dealing with highly multidimensional or otherwise complex datasets. Like tSNE, SPADE extracts information across events in your data unsupervised and presents the result in a unique visual format.
Unlike tSNE, which is a dimensionality-reduction algorithm that presents a multidimensional dataset in 2 dimensions (tSNE-1 and tSNE-2), SPADE is a clustering and graph-layout algorithm. The result is quite different from the two-dimensional plot of tSNE, and rather resembles a phylogenetic tree in its branching structure (Fig. 1). As with a phylogenetic tree, similar clusters are grouped closer together, and dissimilar clusters are located more distally on the tree. If you’d like to read more on the theory and development of the SPADE algorithm, see the original literature [Qiu P., et al., Nat. Biotech. (2011); Qiu, P., PLoS One (2012)].
Figure 1: SPADE trees with 100 clusters, colored on the same parameter (TCR-beta), in five different software packages.
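To make the tree idea a little more concrete, here is a minimal Python sketch that links cluster centers into a minimum spanning tree using Prim's algorithm, so that each cluster ends up connected to a nearby (similar) cluster. This is only an illustration under our own assumptions: the function name `minimum_spanning_tree` and the centroid values are made up, and this is not the code of any software package discussed in this post.

```python
import math

def minimum_spanning_tree(centroids):
    """Return MST edges (i, j, distance) over cluster centroids (Prim's algorithm)."""
    n = len(centroids)
    if n == 0:
        return []
    in_tree = {0}          # start the tree from the first cluster
    edges = []
    while len(in_tree) < n:
        best = None
        # Find the shortest edge joining a cluster in the tree to one outside it.
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = math.dist(centroids[i], centroids[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Hypothetical cluster centers in a 2-marker space (illustrative values only).
centroids = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
tree = minimum_spanning_tree(centroids)
for i, j, d in tree:
    print(f"cluster {i} -- cluster {j}: distance {d:.2f}")
```

In this toy example the three nearby clusters are chained together, and the distant cluster hangs off the end of the tree by a single long edge.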
Having already argued for the need for speed, and given the growing popularity of these kinds of algorithms for dealing with complex datasets, we figured it would be of interest to the flow cytometry community to undertake a similar test across popular software packages. We assumed this would be easy enough to do – we were wrong. Of the five software packages we tested (Cytobank, FCS Express, FlowJo, R, and the original, free software made available by the author of SPADE), only Cytobank and FCS Express were able to reliably return results from various FCS files – which is to say, they reliably returned some results while the others returned nothing at all!
Peng Qiu has generously made his original program, SPADE3, freely available to the flow community. However, he notes outright that it cannot necessarily handle the plethora of different formats produced by all available cytometers. Hiccups like these are to be expected of non-commercial software, and in any case, it’s hard to argue with free software of this caliber. Plus, Peng invites you to contact him for help if your data isn’t recognized by his software, which might be worth a try considering that SPADE3 is surprisingly attractive and easy to use – albeit with limited functionality compared to paid-for, full-suite cytometry data analysis packages.
Commercial flow cytometry software is naturally held to a higher standard, especially if it touts the ability to run SPADE out of the box on your files. In both Cytobank and FCS Express, SPADE is integrated into the general user interface and requires no plugins or separate installations.
Before we talk results, let’s start by covering certain challenges regarding some of the aforementioned software and how they factored into the speed testing process.
Despite its commercial status, FlowJo was one headache after another, starting with getting the SPADE implementation working. For SPADE, FlowJo requires the installation of “R” on your workstation, and must be told via its Preferences where this R installation is on your computer – in our case, this was tricky. Also required is the SPADE package for R developed by the Nolan lab, which the lab no longer distributes via the Comprehensive R Archive Network (CRAN) or Bioconductor. Perhaps for these reasons, although links to R and the SPADE package are provided on the FlowJo website, some troubleshooting of the installation may be necessary.
Unfortunately, once those hurdles were cleared, it still wasn’t smooth sailing. FlowJo failed to run the SPADE algorithm on:
- 3 of 5 Becton Dickinson (BD) FACSDiva files,
- 1 of 1 Beckman Coulter (BC) Cytoflex file,
- 1 of 2 Fluidigm CYTOF files,
- 1 of 1 BC Gallios file (the same file we successfully ran tSNE on in our recent tests)
- 1 of 1 BC Summit file
In each case, FlowJo returned cryptic error messages e.g., “Could not create Gating ML elements” and “the algorithm did not generate a CSV result file.” And this despite the fact that we were attempting to calculate SPADE on ungated data. None of the datafiles we tested were more than a couple of years old – many were from 2018 and verified as exports from updated acquisition software packages, so it’s hard to imagine that defunct formats were an issue. To be fair, we ran a few of these files in R, which also resulted in error messages – so the fault may originate there.
Ultimately, with regard to whether SPADE will actually work for your data in FlowJo, time and effort spent to get the system working will essentially result in a crapshoot. Nonetheless, the speed tests had to go on, as some of you may have files that do work in all five software packages without issue. We finally lucked out and found such a datafile, a 27-MB, 500,000-event, 14-parameter anonymized file provided on Peng Qiu’s SPADE website; full methods available here.
Before we reveal the results, we have to preface them with another discussion of downsampling. As in the case of tSNE, the type and degree of downsampling affects your results (as well as your ability to get those results with realistically attainable computing power) and even accounts for much of the calculation time. So let’s discuss the downsampling choices provided by each software package in alphabetical order.
Cytobank’s downsampling is stated to be density-dependent, and weighted to preferentially include rare events. We could not find out how strong this weight is, numerically speaking, nor did it appear to be user-variable. Downsampling is also performed to a target number of events or percentage of the original file(s), and these values are controllable by the user.
To put it bluntly, the nature of downsampling in FlowJo is both opaque and inflexible, unless perhaps you are something of an R maven; we are not. Moreover, it took some detective work well outside of the FlowJo documentation (and within the exported files from FlowJo) to verify that the defaults of the Nolan SPADE package were being employed, i.e. downsampling to 10% of the original number of cells and throwing out 1% of local low-density outliers. In our experience, changing these default percentages would require a level of R competency not possessed by the typical flow cytometrist.
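For readers wondering what "downsampling to 10% of the cells and throwing out 1% of low-density outliers" can look like in practice, here is a minimal Python sketch of density-dependent downsampling. It is an illustration under our own assumptions, not the code of the Nolan lab package or of any software tested here: the function names, the brute-force neighbor count, the `radius` value, and the synthetic two-population data are all hypothetical.

```python
import random

def local_density(points, radius):
    """For each point, count the points within `radius` (brute force, self included)."""
    r2 = radius * radius
    return [
        sum(1 for q in points
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= r2)
        for p in points
    ]

def density_downsample(points, radius, target_frac=0.10, exclude_frac=0.01, seed=0):
    """Return sorted indices of kept points.

    The lowest-density `exclude_frac` of points are dropped as outliers. Each
    remaining point is kept with probability min(1, T / density), where the
    target density T is tuned so that about `target_frac` of all points survive.
    """
    n = len(points)
    dens = local_density(points, radius)
    order = sorted(range(n), key=lambda i: dens[i])
    candidates = order[int(exclude_frac * n):]   # drop the sparsest outliers
    goal = target_frac * n
    lo, hi = 0.0, float(max(dens))
    for _ in range(50):                          # bisection on the target density
        mid = (lo + hi) / 2
        expected = sum(min(1.0, mid / dens[i]) for i in candidates)
        if expected < goal:
            lo = mid
        else:
            hi = mid
    rng = random.Random(seed)
    kept = [i for i in candidates if rng.random() < min(1.0, hi / dens[i])]
    return sorted(kept)

# Synthetic data (made up): 450 cells in a dense population, 50 in a rare one.
data_rng = random.Random(1)
dense = [(data_rng.random(), data_rng.random()) for _ in range(450)]
rare = [(10 + data_rng.random(), 10 + data_rng.random()) for _ in range(50)]
cells = dense + rare

kept = density_downsample(cells, radius=0.2)
rare_kept = sum(1 for i in kept if i >= 450)
print(f"kept {len(kept)} of {len(cells)} cells; {rare_kept} are from the rare population")
```

Because cells in crowded regions are kept with lower probability, the rare population makes up a larger share of the downsampled set than of the original file, which is the whole point of weighting by density rather than sampling at random.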
FCS Express provides several downsampling options, including:
All of these options are fully defined in the documentation for FCS Express, and these and other related attributes can be numerically specified by the user. It is worth noting that FCS Express allows for the exclusion of low or high-local density events (or both or neither); it does not presuppose an interest in rare or abundant populations, but lets the user decide “if” and “how much.” The user may also specify a minimum percent of cells defining a cluster so that essentially bogus clusters (in terms of representing so few cells as to be of neither biological nor statistical importance) do not confound data interpretation.
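The idea of a minimum cluster size can be sketched in a few lines of Python. This is a hedged illustration only: the function name `small_clusters`, the threshold, and the labels are hypothetical, and we make no claim about how FCS Express implements this setting internally.

```python
from collections import Counter

def small_clusters(cluster_ids, min_percent):
    """Return {cluster: percent of events} for clusters below `min_percent` of all events."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return {
        cluster: 100.0 * count / total
        for cluster, count in counts.items()
        if 100.0 * count / total < min_percent
    }

# Hypothetical per-event cluster labels (illustrative values only).
labels = ["A"] * 600 + ["B"] * 395 + ["C"] * 5
flagged = small_clusters(labels, min_percent=1.0)
print(flagged)  # {'C': 0.5}
```

Here cluster C holds only 0.5% of the events, so it falls under the 1% threshold and is flagged as too small to interpret.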
For those R mavens who enjoy running SPADE directly within R, it may be possible to change the nature of the downsampling (e.g., to choose to exclude or not exclude both low- and high-density outliers) in the SPADE package within R, as these settings were well documented by the Nolan lab when the SPADE package was created. That is something we did not attempt, as we are not experts in R. If you are reasonably confident in your R skills, it should be possible to at least change the default percentages for the downsampling mentioned with respect to FlowJo above. Again, we did not attempt to stray from the defaults mentioned above, so the forthcoming comparison between R and FlowJo is strictly “apples-to-apples.”
Peng Qiu’s SPADE3 allows for a user-defined local density of events to be excluded as noise, along with a user-defined local density of cells or number of events corresponding to the rare populations intended to be captured.
Table 1 lists the different options tested in FCS Express. FCSE-A and FCSE-C are most similar to the downsampling parameters for FlowJo and Cytobank.
Table 1: FCSE SPADE Options
| Downsample (sample size) | Downsampling Method | Sample Size | Local Density Method |
Without further ado, here are the results of the speed tests (Fig. 2), performed in triplicate.
Figure 2: Results of SPADE speed tests using five different software packages. Test file from Peng Qiu’s SPADE website (27-MB, 500,000-event, 14-parameter anonymized file).
In discussing these results, let’s first note the obvious: The SPADE calculation took almost exactly the same amount of time in R as it did in FlowJo (~12 minutes (min)). This was expected, considering that FlowJo makes use of R. Interestingly, at least at this modest sample size, Cytobank took nearly the same amount of time (~12 min) as FlowJo and R. The time required to upload samples to the cloud, usually a few minutes, was included as in our previous tests for tSNE; the higher deviation between replicates is due to the queue time for the task, which presumably varies according to server load. Peng Qiu’s SPADE3 software took the longest (~25 min), but again, considering that this is freely available, non-commercial software, the fact that it took only about twice as long as the aforementioned software is actually quite respectable. It is included in this speed test primarily as a benchmark, as it represents the original implementation of the SPADE algorithm.
As we have seen with tSNE, FCS Express handily outperformed the competition. If we consider the Exact method for calculating local density (see the FCS Express manual for details), FCS Express was over 900% faster (at 1.3 min) than FlowJo or Cytobank; using the Kernel method, it was over 4600% faster (at 0.26 min). It is difficult to determine which is the more precise comparator because the documentation for the methods employed by FlowJo and Cytobank does not specify, but our money is on Kernel. It’s entirely possible that the full details reside in cited literature, but we only consulted the product manuals for each software package.
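If you’d like to redo this kind of comparison on your own timings, the arithmetic is simple. The short Python sketch below uses the rounded run times quoted above (~12 min, 1.3 min, and 0.26 min); the function names are ours. Because the 12-minute figure is approximate, the output comes out somewhat lower than the percentages quoted in the text, which presumably reflect unrounded timings.

```python
def fold_faster(reference_min, candidate_min):
    """How many times faster the candidate ran than the reference."""
    return reference_min / candidate_min

def percent_faster(reference_min, candidate_min):
    """Speed gain as a percentage (2x as fast equals 100% faster)."""
    return (fold_faster(reference_min, candidate_min) - 1.0) * 100.0

reference = 12.0  # approximate FlowJo / R / Cytobank run time in minutes
for method, minutes in [("Exact", 1.3), ("Kernel", 0.26)]:
    print(f"{method}: {fold_faster(reference, minutes):.1f}x as fast, "
          f"{percent_faster(reference, minutes):.0f}% faster")
```

With these rounded inputs, the Exact method works out to roughly 9 times as fast and the Kernel method to roughly 46 times as fast.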
What about those other parameters that were tested? The results are shown below in Figure 3.
Figure 3: Results of different parameters tested in FCS Express. Notice the scale is 0 to 4 minutes compared to the 0 to 30 minute scale in Figure 2.
The results speak for themselves. How would these figures hold up for larger sample sizes? Due to the inability to run most datafiles in FlowJo or R, we only performed further tests between Cytobank and FCS Express. We used a 100-MB, 1.3-million-event, 20-parameter anonymized file produced by BD FACS Diva ver 8.0.1, and downsampled to 1 million events. Again, the full methods are available in the download.
Figure 4: 1.3-million-event file downsampled to 1 million events in FCS Express vs. Cytobank.
Again, FCS Express was faster than Cytobank, but this time, Cytobank was only 23% slower than FCS Express (60 vs. 49 min, respectively), in spite of the time required for uploading the file to the Cytobank servers and waiting in the queue for the task to actually begin (~15 min combined). While this suggests that Cytobank is handling large sample sizes efficiently, it also gives rise to the question: Why has the speed difference between the two software packages, formerly an order of magnitude or more, diminished so?
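To separate calculation time from upload and queue time, you can run the numbers as in the small Python sketch below. It uses the rounded figures quoted above (60 min, 49 min, and ~15 min of overhead); the variable and function names are ours, and since the overhead is only approximate, so is the result.

```python
def percent_slower(slow_min, fast_min):
    """How much longer the slower run took, as a percentage of the faster run."""
    return (slow_min - fast_min) / fast_min * 100.0

cytobank_total = 60.0      # minutes, including upload and queue time
fcs_express_total = 49.0   # minutes
cytobank_overhead = 15.0   # approximate upload + queue time, minutes

overall_gap = percent_slower(cytobank_total, fcs_express_total)
cytobank_compute = cytobank_total - cytobank_overhead

print(f"Cytobank overall: {overall_gap:.1f}% slower than FCS Express")
print(f"Cytobank excluding overhead: about {cytobank_compute:.0f} min")
```

On these rounded numbers, the overall gap is about 22%, and the Cytobank calculation itself comes to roughly 45 min once the upload and queue time is set aside.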
It’s hard to say, but it may have something to do with differences in upsampling. You may recall from our post about tSNE that downsampling, though necessary for obtaining results with a realistic amount of computing power, can result in the loss of rare populations. While both Cytobank and FCS Express can report statistics for the entire file (including unsampled events), Cytobank does not actually graph all events in the SPADE tree. The settings explicitly state that only 50,000 cells are randomly chosen for the graph.
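For anyone unfamiliar with upsampling, the gist is that every event in the original file, sampled or not, gets assigned to one of the clusters found on the downsampled data. Here is a minimal Python sketch that uses nearest-centroid assignment as a simplification. The function name `upsample` and all values are hypothetical, and the packages discussed here may well assign events differently.

```python
import math
from collections import Counter

def upsample(events, centroids):
    """Assign every event to the index of its nearest cluster centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(event, centroids[k]))
        for event in events
    ]

# Hypothetical cluster centers and events in a 2-marker space (illustrative only).
centroids = [(0.0, 0.0), (10.0, 10.0)]
events = [(0.5, -0.2), (9.0, 11.0), (0.1, 0.3), (4.0, 4.0)]

assignments = upsample(events, centroids)
counts = Counter(assignments)
print(assignments)  # [0, 1, 0, 0]
print(dict(counts))
```

Once every event carries a cluster label, per-cluster statistics can be computed on the whole file rather than on the downsampled subset alone.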
Spade Plot Integration
Before we detail how all of this can affect you, let’s put it into the proper context by addressing the integration of SPADE plots with the rest of your analysis in the various software. Rather than going in alphabetical order, let’s start with the least integrated and end with the most integrated.
FlowJo and R
FlowJo and R are by far the least integrated. The results of a SPADE analysis in FlowJo and R consist of plots for all parameters exported as HTML and/or PDF, on which nothing (including the color scales in the legend, which are specified within the SPADE package in R) can be adjusted within the FlowJo workspace. That’s because these plots are not part of the live FlowJo workspace.
Speaking now only of FlowJo, neither the original FCS file nor the SPADE plot interacts with the rest of the workspace. That is to say, you cannot apply a 2D gate from a regular plot to the SPADE plot, nor can you create a gate on the SPADE plot and backgate it onto other 2D plots or histograms. The workspace, where the list of files is found, does not even show per-cluster event counts. It’s possible that this is a transient issue, but to obtain any per-cluster event counts or parametric statistics from the SPADE plot, you have two options:
- Import the “clusters.fcs” file exported by FlowJo (which has added an additional parameter that effectively sorts the cells by cluster) back into FlowJo, and draw gates on those populations on a 2D plot around the clusters of interest.
- Look at the statistic-containing .csv files that are exported by the software in a zipped file.
It should be noted that in the second case, these CSV files are of limited utility due to the complete lack of live interplay between the statistics and the workspace. We should also note that FlowJo suggests viewing the exported GML files in the Java-dependent, free, open-source Cytoscape software for greater interplay between 2D plots and the SPADE trees that you can generate and edit there. However, current versions of Cytoscape no longer support the SPADE plugin. I suppose you could try installing the older, recommended version from 2011, but this version is no longer supported, and we are not certain whether it is compatible with updated versions of Java.
Peng Qiu’s SPADE3 software does a little better. The SPADE plots update live when you change the coloring parameter or the scale. The latter has formattable endpoints, so you can make them uniform across plots to facilitate visual comparison between files or parameters. Below the SPADE tree are smaller 2D plots that display the downsample-enriched population in any desired 2D-parametric space. However, as is the case with FlowJo, there is no backgating between 2D plots and the SPADE plot. This is not surprising, as SPADE3 does not claim to be a full-service cytometry package; in fact, the SPADE3 manual explains that any preliminary gating must be done in another program, and the gated events exported for use in SPADE3. Nonetheless, the SPADE3 program is worth a look if it recognizes your files. There are some nice creative features such as “autosuggest annotation”, which can simplify the process of breaking your SPADE tree down into related groups (similar to clades in phylogenetics).
Not surprisingly, Cytobank’s integration of SPADE with the rest of your analysis is pretty good, given that SPADE is a major draw. The SPADE plots update quickly in response to changing the coloring parameter or scale. Surprisingly, you cannot specify the endpoints of the scale as you can in SPADE3 or FCS Express, but there is a way to standardize the scale across all files in an experiment. You can also choose between “symmetric” and “asymmetric” color scaling. The parametric statistics on clusters or pooled clusters are easy to obtain within the live interface or via export. However, the interplay between the SPADE plot and the regular 2D plot at the side is not bidirectional – although you can use the “bubbles” functionality to backgate/apply a SPADE gate onto the 2D plot, you cannot create a gate on the 2D plot and apply that gate to the SPADE plot.
It’s thus impossible to, say, manually draw a gate on your FoxP3+ T cells, or on a more mysterious subset within your 2D plots, and see which cluster(s) they fall into on your SPADE plot. For software this popular for SPADE, this is a surprising deficiency. And now we’ll come back to our point from a few paragraphs ago, i.e., that only 50,000 cells are randomly chosen to graph in the SPADE plot.
Aside from the fact that we cannot even draw a gate on rare events in the 2D plot and verify that they are included within the SPADE tree, how do we even know that this randomly selected subset of cells is going to provide the most faithful SPADE tree representation to begin with? I can’t help but think that this method of plotting, though expedient, at least partially defeats the purpose of using large sample sizes (i.e. downsampling within reason).
Lastly, we come to FCS Express. Here, the integration of the SPADE plots with the rest of your plots and statistics is efficient and seamless. Like any other plot type, statistics on a per-cluster or per-cluster gate (using the “well gates” feature in FCS Express) basis are easily obtained within the layout or via export.
As with SPADE3 and Cytobank, you can change the coloring parameter or scale, and the SPADE plots update live. Notably, as in SPADE3, the color scaling can not only be made uniform across plots showing different parameters or files, but the endpoints can be specified. There are additional formatting options including percentile, mean +/- SD, and fixed range log/linear. These options are the most comprehensive among all software tested, so you should be able to obtain the SPADE trees you need to make your point. Interplay between regular 1D- or 2D-plots and the SPADE tree is fully bidirectional (gates on either can be applied or backgated onto the other like any other gate). In our opinion, this should be standard.
Consider at the very least how, despite this being an “unsupervised” method, at the early stages of implementing a SPADE analysis for a given experiment type, correlation with at least a few high-level manual gates is helpful in evaluating tree architecture. Finally, because 100% of the cells within your file are upsampled for plotting in the SPADE tree, you can be assured that all events, no matter how rare, are represented there.
In summary, the fastest software (FCS Express) was also the most replete with features and integration, but the slowest software (Peng Qiu’s original implementation in the freely available SPADE3) was not the most lacking in those regards – an interesting result to say the least. Thanks for staying with us through the end. What started out as a supposedly simple speed comparison became a bit more involved, due to unexpected differences among the software in terms of datafile compatibility and user interface. We hope we’ve inspired you to check out these and other differences on your own – as usual, we’d love to hear back from you with what you discover.