USMortality

Why the Ends of the Genome are Important

Discovering what we can learn from simulating shotgun sequencing

Ben
Jul 14, 2023

Introduction

Two weeks ago, after I posted another installment of my genomic series, I got into an argument with people on Twitter. Before I post a detailed response, I first want to explain why, in my opinion, looking at the ends of the genome is so important.

As outlined before, the original patient sample contained many different sources of RNA (human, other species, viral). Hence, if the authors of Wu et al. 2020 assumed that there was a new virus (or rather, multiple virions) in the sample, then the full genome should be in the sample.


Overview of genomics series

  1. SARS-CoV-2 Genome Assembly (October 3, 2022)

  2. SARS-CoV-2 Genome Assembly (Part 2) (Mar 27)

  3. Is the SARS-CoV-2 Genome Valid? (Jun 27)

Fragmentation of RNA

One step in the RNA extraction process outlined by Wu et al. 2020 is the use of the “SMARTer Stranded Total RNA-Seq kit v.2” by Takara, Inc.

One step of this protocol breaks the total RNA into small pieces for sequencing, which can be done via heat or enzymes.

I contacted Takara Technical Support to ask “if the position of the cleavage of the RNA in the fragmentation step is random or predetermined?” Here’s their response:

So according to this, the process results in random RNA pieces. Based on this, we can now simulate the process:

  1. Use or generate a reference genome

  2. Fragment into random pieces/reads of ~100bp length

  3. Keep only reads of 50-150bp length

  4. Align (or de-novo assemble) the resulting reads back to the reference genome.
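Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the results in this post: the cut model (every position between two bases is a cut site with probability 1/100, which gives a mean piece length of roughly 100bp) and the names random_genome, fragment and simulate_reads are assumptions made for the example, and a random toy sequence stands in for the reference genome.

```python
import random

def random_genome(length, seed=0):
    """Toy stand-in for a reference genome (assumption: uniform random bases)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def fragment(genome, mean_len=100, rng=None):
    """Cut one genome copy at random positions; return (start, piece) pairs."""
    if rng is None:
        rng = random.Random()
    # Each position between two bases is a cut site with probability 1/mean_len.
    cuts = [i for i in range(1, len(genome)) if rng.random() < 1.0 / mean_len]
    bounds = [0] + cuts + [len(genome)]
    return [(a, genome[a:b]) for a, b in zip(bounds, bounds[1:])]

def simulate_reads(genome, copies=100, min_len=50, max_len=150, seed=0):
    """Fragment `copies` intact genomes and keep pieces within the length window."""
    rng = random.Random(seed)
    reads = []
    for _ in range(copies):
        for start, piece in fragment(genome, rng=rng):
            if min_len <= len(piece) <= max_len:
                reads.append((start, piece))
    return reads

genome = random_genome(3000)
reads = simulate_reads(genome, copies=100)
print(len(reads), "reads kept")
```

Each read keeps its start position, so it can be checked against the reference directly or written out for an aligner in step 4.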

Simulated Results

I have written a script to simulate the above process using the SARS-CoV-2 (MN908947.3) reference genome.

My script was configured to start with 100 intact genomes and a read error rate of 1%. Here are the alignment results of these randomly generated reads (with bwa mem):
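A read error rate of 1% can be modelled in several ways. The sketch below assumes the simplest one: each base is independently replaced by a different base with probability 0.01, with no insertions or deletions. Whether the original script uses exactly this model is an assumption, and add_read_errors is a name chosen for the example.

```python
import random

BASES = "ACGT"

def add_read_errors(read, error_rate=0.01, rng=None):
    """Return a copy of `read` with independent per-base substitution errors."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in read:
        if rng.random() < error_rate:
            # Substitute with one of the three other bases (assumed uniform).
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

read = "ACGT" * 25  # a 100bp toy read
noisy = add_read_errors(read, error_rate=0.01, rng=random.Random(42))
mismatches = sum(a != b for a, b in zip(read, noisy))
print(mismatches, "mismatches in", len(read), "bases")
```

Because only substitutions are modelled, read lengths and alignment positions stay unchanged.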

Entire Genome

Figure: SARS-CoV-2 reference genome aligned with simulated reads.

Note how the reads stack up towards the ends.

Head

Zooming into the head region of the genome, we can see that there are 16 reads which align perfectly against the head. There are no random inserts or deletions, and no massive read errors either.

Figure: Head of the SARS-CoV-2 simulated read alignment

Tail

Figure: Tail of the SARS-CoV-2 simulated read alignment

Same picture here: we can see (at least) 14 reads that align against the tail perfectly.

I’ve repeated this step 20 times. With 100 virions and a 1% read error rate, the average depth was 24 reads, with a standard deviation of 6.
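For reference, here is one way depth per position and the mean and standard deviation across repeated runs could be computed. It is a sketch under assumptions: reads are given as (start, length) pairs with 0-based starts, and depth_profile and summarize_runs are illustrative names, not part of the original script. The toy input only shows the mechanics and is not simulation output.

```python
from statistics import mean, stdev

def depth_profile(reads, genome_len):
    """Depth at every position, given reads as (start, length) pairs (0-based)."""
    diff = [0] * (genome_len + 1)
    for start, length in reads:
        diff[start] += 1                             # read begins covering here
        diff[min(start + length, genome_len)] -= 1   # read stops covering here
    depth, running = [], 0
    for change in diff[:genome_len]:
        running += change
        depth.append(running)
    return depth

def summarize_runs(run_depths):
    """Mean and sample standard deviation of one depth value per run."""
    return mean(run_depths), stdev(run_depths)

# Toy example: three reads on a 10bp sequence.
print(depth_profile([(0, 5), (3, 4), (8, 2)], genome_len=10))
# → [1, 1, 1, 2, 2, 1, 1, 0, 1, 1]
```

Passing the average depth of each of the 20 repetitions to summarize_runs would give a mean and standard deviation of the kind quoted above.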

De-Novo Assembly

Using Megahit to de-novo assemble these reads consistently yields (almost) the entire genome, if the depth (how many reference genomes are used) is sufficient. In this case, with n=100, we get an almost perfect assembly:

1 contigs, total 29890 bp, min 29890 bp, max 29890 bp, avg 29890 bp, N50 29890 bp

In this case 29890/29903 = 99.96% of the genome was assembled, and only one contig was found. That’s mainly because no other RNA was present in the sample.

The 13 missing bases were at the polyA tail, while the head was perfectly reconstructed. Repeating the process with n=1,000 resulted in 6 missing bases at the end, but even with n=10,000 Megahit was not able to assemble the end entirely.

I’ll dive into this in more detail in the next installment of the series.
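The assembled fraction can be checked with a short calculation. The function name assembled_fraction is only for illustration; the two lengths are the contig length reported by Megahit and the length of the reference genome given above.

```python
def assembled_fraction(contig_bp, reference_bp):
    """Share of the reference covered by the contig, and the number of missing bases."""
    return contig_bp / reference_bp, reference_bp - contig_bp

fraction, missing = assembled_fraction(29890, 29903)
print(f"{fraction:.2%} assembled, {missing} bases missing")
# → 99.96% assembled, 13 bases missing
```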

Summary

In summary, using 100 reference genomes and an error rate of 1%, we end up with about 15 reads that align perfectly against the head and the tail (and a general coverage of 24). This means that if an entire genome was present in the sample, we should be able to find reads that align perfectly against that novel genome.

Megahit assembled the head and the ‘body’ of the genome entirely, but struggled with the (polyA) tail. This could be due to the average read length of 100bp, but the tail has just 34 A’s, so it remains unclear why Megahit is not able to assemble the genome entirely.

20 Comments
Martin Neil
Writes Where are the numbers? by Norma…
Jul 14

I have had deep concerns about PCR since the start. Unfortunately, when I have presented the available evidence, in the form of the Instand report, which showed that positives for SARS-CoV-2 were being triggered from common cold samples, few would listen. Now we are in 2023 and not much has changed: they still deny the results from this report and defend PCR. Same when I show that much of the UK covid PCR testing was testing for one gene - I get barely a shrug. The whole edifice rests on sand.

Denis Rancourt
Writes Denis’s Substack
Jul 14

This work seems very important to me... I hope to have time to study it soon. Bravo for going in this direction! Genomic technology and its limitations must be unpacked. It has captured too much territory.

© 2023 USMortality.com