49 Comments

The problem is simple: the provenance of the genetic material used to sequence the so-called SARS-CoV-2 genome has not been determined, i.e. show me the virus (isolated, purified, separated from other biological and genetic material).

Excellent, thank you very much for this.

"I could not find any compelling evidence, that the SARS-CoV-2 reference genome is valid"

"Valid" means what exactly? ;-)

1/ Hi Ben, it has been a while since your first post on assembly, and my comments there. At that time I had not yet read the Wu paper, nor did I know the details of the assembly program Megahit. I imagined how an assembly program might work and assumed it would take a very long time to run. I see that Megahit extracts k-mers of odd length (e.g. 27 nt). A 27 nt segment of the viral genome would be covered by many reads, but in the ideal world, the very same k-mer would be extracted from many reads, reducing redundancy and making the program fast.
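The redundancy point can be sketched in a few lines of Python (a toy illustration, not MEGAHIT's actual implementation; the read sequences below are made up):

```python
# Toy k-mer extraction: two reads that overlap on the same genomic
# region yield the same k-mer, which an assembler stores only once.
def extract_kmers(read, k=27):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

read_a = "ATTAAAGGTTTATACCTTCCCAGGTAACAAACC"
read_b = "GGTTTATACCTTCCCAGGTAACAAACCAACGTT"  # overlaps read_a from its 7th base
kmers_a = set(extract_kmers(read_a))
kmers_b = set(extract_kmers(read_b))
shared = kmers_a & kmers_b
print(len(shared))  # 1: the single 27-mer spanning the 27-base overlap
```

With deep coverage, most k-mers of the genome are seen this way many times over, which is what makes the k-mer table compact and the program fast.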

RACE 3/ Back to the 5' end. I got all excited that you have many reads with the pattern "GGGG-bluebox", where bluebox is an insert of 4 or 5 nt. Are the 4s and the 5s all the same sequence? I want to see if "bluebox" extends the stem of stem-loop 1 of the 3' UTR (for reference, https://doi.org/10.1016/j.cell.2021.02.008, figure 2c). Once we have the RACE data, do any of the reads that contain GGGG-bluebox remain, or do they vanish because they are just sequencing artifacts?

7/ Considering all potential sources of mismatches, some k-mers will be "problematic" and generate branches in the assembly graph. This is where my current understanding of assembly programs stops: how do different assembly programs trim branches? Apart from the problem of the ends, the problem of alternative branches is the very crux of the matter. Why did Trinity stumble and output a shorter contig than Megahit? And perhaps why you do not achieve a perfect replication of the genome length.
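How a single mismatch creates a branch can be shown with a toy de Bruijn-style graph (a sketch only; it says nothing about how Megahit or Trinity actually trim tips and bubbles, and the sequences are invented):

```python
# Toy de Bruijn graph: each (k-1)-mer prefix maps to the set of
# (k-1)-mer suffixes that follow it. A one-base mismatch between
# two reads makes one node fork into two branches.
from collections import defaultdict

def build_graph(reads, k=5):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

true_read  = "ACGTACGGATCCA"
error_read = "ACGTACGCATCCA"   # one mismatch: G -> C
graph = build_graph([true_read, error_read])
branches = {node for node, nexts in graph.items() if len(nexts) > 1}
print(branches)  # the single node where the graph forks
```

An assembler must decide which branch is real biology and which is an error (or a repeat), and different programs make that call differently, which is plausibly why Trinity and Megahit can emit different contig lengths.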

It may also explain why the Wuhan team did not rely on Megahit alone, but refined the assembly with Bowtie afterwards.

RACE 2/ I looked at your figure 2 again, of sequences just upstream of the polyA tail. I got all excited that you had 3 reads with the variant pattern "CnnCnnA". But I couldn't find them in your trimmed reads. Again, I guess that your dataset doesn't contain the RACE reads for the 3' end.

RACE 1/ I looked at figure 1 again. Roughly 50 reads claim to be within 10 nt downstream of the 5' cap. I guess that if your dataset contained the RACE reads, you would see hundreds or thousands of reads at that end?

9/ "Just one more thing" - TV detective Columbo.

If a next-generation sequencing technology has an error rate of 1%, and the reads are a bit over 100 nt in length, then on average each read is expected to contain one error (some 0, some 2, etc.). So don't ask Kevin or anybody else to assemble a perfect match.

If we have a many-fold coverage, many of these errors should cancel out in the consensus.
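That cancellation is easy to demonstrate numerically. The sketch below uses a made-up 100 nt genome, a 1% per-base error rate, and 30x coverage, with a plain majority vote per position (a toy, not a real consensus caller):

```python
# At 1% error and 30x coverage, independent errors are scattered
# across positions and bases, so a per-position majority vote
# almost always recovers the true base.
import random
from collections import Counter

random.seed(0)
error_rate = 0.01
coverage = 30
genome = "ACGT" * 25  # 100 nt toy genome

def noisy_base(base):
    """Return the true base, or a random wrong base with prob error_rate."""
    if random.random() < error_rate:
        return random.choice([b for b in "ACGT" if b != base])
    return base

consensus = ""
for base in genome:
    pile = [noisy_base(base) for _ in range(coverage)]     # pileup column
    consensus += Counter(pile).most_common(1)[0][0]        # majority vote

print(consensus == genome)  # True
```

For a wrong base to win a column it would have to recur ~15 times out of 30 at the same position, which at a 1% error rate is vanishingly unlikely.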

There is one example of the "entire" genome of a single individual coronavirus being sequenced by nanopore (typical error rate of 10%). 90% accuracy would be good enough to distinguish SARS-CoV-1 from SARS-CoV-2, but not good enough to track Covid19 "variants", sub-variants, sub-sub-variants, etc.

When constructing evolutionary trees, matching up long stretches of "only" 50% identity wouldn't be unusual. Matching 90% identity with >100 nt reads isn't anything to complain about.
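A back-of-envelope calculation shows why a 10% read error rate separates the two species but not the variants. The ~79% SARS-CoV-1/SARS-CoV-2 identity and ~99.9% between-variant identity used here are rough published figures, plugged into a crude independence model purely for illustration:

```python
# Expected per-read identity against a reference, assuming errors
# and genome differences are independent and uniform.
read_accuracy = 0.90  # nanopore-style, per the comment above

def expected_identity(genome_identity, accuracy=read_accuracy):
    # Where the genomes agree, a match requires a correct base call;
    # where they differ, an erroneous call hits the wrong reference's
    # base only ~1/3 of the time.
    return genome_identity * accuracy + (1 - genome_identity) * (1 - accuracy) / 3

sars1_vs_2 = 0.79    # assumed cross-species nucleotide identity
variants   = 0.999   # assumed identity between Covid19 variants

print(round(expected_identity(1.0), 3))        # 0.9   vs the correct reference
print(round(expected_identity(sars1_vs_2), 3)) # 0.718 vs the other species
print(round(expected_identity(variants), 3))   # 0.899 vs another variant
```

A ~0.9 vs ~0.72 gap is easy to call per read; a ~0.900 vs ~0.899 gap is buried in the noise, which is why variant tracking needs higher-accuracy reads or deep consensus.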

8/ Apparently your results within the center of the genome mostly agree with the Wuhan assembly. So why freak out? It is a matter of perspective. Is the glass 99% full or 1% empty? A glass that is 1% empty is no reason to doubt the validity of the existence of viruses.

But by all means, please continue to hunt down the remaining 1% discrepancy. I am hunting also, but I took a little break. Time to get back into the game. I want to know too.

/end (for now)

PS: in part 4, I compared chimp vs human. Note that, as far as I know, the distance between Covid19 variants is smaller than the distance between chimp and human, and more like the distance among humans.

6/ For mismatches resulting from real biology we have mutations.

There is also the problem of "subgenomic" RNAs that generate "junctions".

5/ A mismatch at a specific segment would generate at least two different k-mers that cover that segment. Megahit filters out some sequencing error mismatches because its default setting requires that a given k-mer must exist in at least two reads. For those types of sequencing errors that are random, it is unlikely that two reads would contain the same error, although a few will exist and pass through this filter.
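The minimum-multiplicity filter described above can be sketched as follows (a toy with made-up 10 nt reads and k=5; MEGAHIT's corresponding default, a minimum k-mer count of 2, is the setting this comment refers to):

```python
# k-mers seen in only one read are treated as likely sequencing
# errors and dropped before graph construction.
from collections import Counter

def solid_kmers(reads, k=5, min_count=2):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}

reads = [
    "ACGTACGGAT",
    "ACGTACGGAT",   # same locus, no error
    "ACGTACCGAT",   # same locus, one random error (G -> C)
]
solid = solid_kmers(reads)
print("TACGG" in solid, "TACCG" in solid)  # True False
```

The error-containing k-mers each occur once and are filtered out, while the true k-mers survive; but as the comment says, two reads that happen to share the same random error would slip through.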

4/ In my comments from 6 months ago, I kept harping on the concept of "mismatch tolerance". Mismatches arise from two sources (1) sequencing errors, and (2) real biology.

Note well that any genome represents a statistical average where mismatches mostly cancel out.

If you were to sequence your genome, probably every cell contains at least one mutation that would generate a mismatch, but these should cancel out within a consensus sequence. The same goes for all humans on the planet, all 8 billion of them, each one with a unique genome. Same for chimps, but human sequences would cluster with human and chimp with chimp, even if all human/ape sequences are 97% identical. So we expect the same for infections: the population of SARS-CoV-2 virions infecting a single person would exist as a swarm of related mutants.
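The clustering point can be made concrete with a toy example: even when every sequence is highly similar to every other, a sample still assigns cleanly to the nearest reference by percent identity (the 20 nt "references" and the sample's private mutation below are invented for illustration):

```python
# Assign a sample to the reference it matches best, even though a
# private mutation makes it a perfect match to neither.
def identity(a, b):
    """Fraction of matching positions between equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

human_ref = "ACGTACGTACGTACGTACGT"
chimp_ref = "ACGTACGTACGAACGTACCT"   # differs from human at 2 of 20 positions
sample    = "ACGTACGTACGTACGTACGA"   # 'human' with one private mutation

refs = {"human": human_ref, "chimp": chimp_ref}
closest = max(refs, key=lambda name: identity(sample, refs[name]))
print(closest)  # human
```

The sample is only 95% identical to its own reference, yet still clusters correctly; the same logic is why a swarm of SARS-CoV-2 mutants within one patient still yields one consensus genome.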

3/ In my comments from 6 months ago, I kept repeating that the problem would be at the ends. Note that I usually read papers on bacterial DNA sequencing, and sometimes I slip into the DNA mindset even though I know that we are dealing with an RNA virus. I forgot about the 5' cap and the 3' polyA tail. My concern about the ends was one of read abundance, not cap or tail. Looking at the PacBio Covid Kit, one can see the problem of end abundance. In an ideal world, extraction of k-mers from the Wuhan dataset should overcome the problem of end abundance, because I think they had enough depth of coverage.

If you read the Wuhan paper, you may notice that they sequenced their sample three times: first to get a de novo template, second using PCR, and third with a special technique for both ends. I don't know whether your dataset includes only the first, or also the latter two.

In regards to the polyA tail, other papers claim that each individual virion differs in the length of the polyA tail. Maybe the answer to some of your questions lies there.

2/ In my comments from 6 months ago, I suggested that you communicate with Kevin McKernan. Today I see that the two of you are now engaging on Twitter. Good. Kevin can be rough at times (at least impatient). Ouch! Ouch! Ouch! Please look past that and keep the dialog up. I don't post on Twitter, just watch, otherwise I would have jumped in.

Kevin McKernan posted the following comments about this blog post (https://twitter.com/Kevin_McKernan/status/1674047174357680131):

"Reads don't map to the margin of a reference because the given insert size you have selected in library construction won't exist. If you have 200bp inserts, you won't get the 1st ~50bases of the reference. This is known by all in space except Ben who thinks it's a discovery."

"Read mapping 101. If you make 200bp insert libraries, the likelihood of you capturing the 1st 50bp of the reference is reduced. This is just poisson sampling. https://irepertoire.com/ngs-considerations-coverage-read-length-multiplexing/"

"Secondly, there are biological reasons why those pieces of DNA aren't captured. The 5' end has a 5' cap which can't be cloned unless it's decapped. The 3' end is a variable poly A tail which won't sequence or amplify without polymerase slippage. Reduced end seq is expected."
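McKernan's Poisson-sampling point about reference ends can be checked with a small simulation, assuming fragments of a fixed 200 bp insert length placed uniformly along a 1 kb toy reference (all numbers here are illustrative, not from the actual library):

```python
# With fixed-length inserts on a linear reference, a position near
# an end can only be covered by the few fragments that start close
# enough to it, so coverage ramps up over roughly one insert length.
import random

random.seed(1)
genome_len = 1000
insert_len = 200
n_fragments = 5000

coverage = [0] * genome_len
for _ in range(n_fragments):
    start = random.randrange(genome_len - insert_len + 1)  # uniform placement
    for pos in range(start, start + insert_len):
        coverage[pos] += 1

print(coverage[0] < coverage[genome_len // 2])  # True: the ends are thin
```

Base 0 is covered only by fragments starting at exactly position 0, while a mid-genome base is covered by any of ~200 possible start positions, so end coverage is expected to be orders of magnitude lower even before the cap and polyA-tail biology is considered.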
