75 Comments

No virus has ever been proven to exist directly in the fluids of a human. Are you just learning this now?

Thanks, Ben, I listened to the Twitter Space months ago. I'm not very familiar with the details, but what I remember is that ~30,000 contigs were built out of ~350,000 possible ones; the PC generated many, and they took the longest one. Is that correct? Shame on the commenters in that space. Please keep going!

I've sent it over, my friend.

https://github.com/trinityrnaseq/trinityrnaseq/releases/tag/Trinity-v2.5.1

The makefiles in this version of Trinity point to an old compiler, so it often fails to build from source, but I've put up instructions on how to modify them so it will build just fine.

You will just need to check your

$ g++ --version

and make sure you edit the correct one
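For reference, one quick way to pull the major version number out of a `g++ --version` banner, shown here on a canned example string so it runs anywhere (the Homebrew GCC 11.3.0 line is hypothetical; substitute your real output):

```shell
# Extract the g++ major version so you know which compiler entry
# in the Trinity makefiles to edit (the version line is made up).
ver_line="g++ (Homebrew GCC 11.3.0) 11.3.0"
echo "$ver_line" | grep -oE '[0-9]+' | head -n1
```

This prints `11` for the example line; on your machine, pipe the real `g++ --version` output through the same `grep`/`head` filter.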

This version of Trinity was the one used in the Fan Wu paper.

If you download the package, it will probably be 2.13.2, which gives completely different results,

but I am running it with

./Trinity --seqType fq --max_memory 40G --left SRR10971381_1.fastq --right SRR10971381_2.fastq --CPU 8 --output Trinity2-5.1

from within the Trinity directory (hence the ./).

Without that, it will look at $PATH and use the latest packaged version.

The older version of MEGAHIT also needs a few tweaks to get running, which I will add to the docs as well.

It seems to be common for viral genomes to be missing a short piece from the beginning or end of the genome. When you're calculating a distance matrix between a set of aligned sequences, one strategy to avoid overestimating the distance between one sequence that is missing a part from the end and another sequence that is not is to simply remove the last few bases from the aligned set of sequences. For example in a paper where they produced an aligned set of genomes of SARS 1, they wrote: "For network analysis, an 81-bp block at the 5'-end including gaps and a 77-bp block at the 3'-end including gaps and the poly-A tails were trimmed out of the alignment, and the final alignment contains 30,169 nucleotides." (https://www.sciencedirect.com/science/article/pii/S2590053621001075)

I tried aligning ten random genomes of SARS 1:

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'

brew install mafft seqkit coreutils;seqkit seq -w0 sarslike.fa|paste - -|grep SARS\ coronavirus|gshuf --random-source=<(seq 1000)|head -n10|tr \\t \\n|mafft ->random.fa;mafft --clustalout random.fa # --clustalout uses a human-readable output format

Only three out of ten sequences included the poly(A) tail, but its length was really short in all of them: 4 bases in the first sequence and 2 bases in the second and third sequences. The sequence with the 4-base poly-A tail was the only sequence which didn't have any gaps at the end. Two sequences had 2 gaps at the end, three sequences had 4 gaps, two sequences had 24 gaps, one sequence had 45 gaps, and one sequence had 47 gaps. At the beginning of the aligned set of sequences, six sequences had no gaps, one sequence had 20 gaps, one sequence had 32 gaps, and two sequences had 40 gaps.
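Counting end gaps like this can be done with plain shell tools; a toy sketch on a made-up aligned sequence (not one of the real SARS 1 genomes):

```shell
# Count the run of trailing gap characters ('-') in an aligned sequence
# (toy data: this sequence has 4 gaps at the end).
aln="ACGTACGT----"
echo "$aln" | grep -oE '[-]+$' | awk '{print length}'
```

The same idea with the anchor changed to `^[-]+` counts leading gaps.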

When I tried calculating a Hamming distance matrix between the aligned set of ten sequences, the maximum distance was 36 when I removed the last 47 and the first 40 bases of the alignment, but the maximum distance was 120 when I didn't remove any bases:

seqkit subseq -r 41:-48 random.fa>random.trim;Rscript -e 'library(Biostrings);max(stringDist(readAAStringSet("random.trim"),"hamming"))'

In order to not overestimate the distance between one sequence that is missing a piece at the end and another one that is not, another method is to only count nucleotide changes but to not count insertions or deletions (i.e. positions where one sequence has a gap and another sequence doesn't), like what's done by the snp-dists utility: `brew install brewsci/bio/snp-dists;snp-dists random.fa`.
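The gap-skipping distance that snp-dists computes can be sketched in plain awk for a single pair of aligned sequences (toy sequences, not real genomes): any position where either sequence has a gap is ignored, and only substitutions are counted.

```shell
# Hamming-style distance that skips gap positions, as snp-dists does.
# Toy pair: only the G/C substitution at position 3 counts; the two
# gap positions (5 and 8) are ignored, so the distance is 1.
a="ACGT-ACGTA"
b="ACCTAAC-TA"
awk -v a="$a" -v b="$b" 'BEGIN {
  d = 0
  for (i = 1; i <= length(a); i++) {
    x = substr(a, i, 1); y = substr(b, i, 1)
    if (x != "-" && y != "-" && x != y) d++
  }
  print d
}'
```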

I made the sarslike.fa file by doing a BLAST search for several SARS-like viruses, downloading a FASTA file for the 100 best matches of each virus from BLAST, and then concatenating the FASTA files and aligning the concatenated file with MAFFT: https://anthrogenica.com/showthread.php?27891-Shell-and-R-scripts-for-genetic-analysis-of-SARS-like-viruses.

Mar 16, 2023 · edited Mar 17, 2023 · Liked by Ben

I asked Kevin McKernan why your MEGAHIT script didn't reproduce the genome of Wuhan-Hu-1, but he replied: "He didn’t trim the reads and as a result had a different length poly A tail." (https://anandamide.substack.com/p/failure-of-the-linearization-reaction/comment/13626862) However when I tried aligning the reverse complement of your longest contiguous sequence with the third version of Wuhan-Hu-1, your sequence also had 69 missing bases before the poly(A) tail:

wget https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MN908947.3&rettype=fasta'>sars2.fa

cat sars2.fa <(echo \>longest_contig_reverse_complement;awk 'length>max{max=length;o=$0}END{print o}' final.contigs.fa|sed y/ATCG/TAGC/|rev)>temp.fa

brew install clustal-w;clustalw temp.fa
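The `sed y/ATCG/TAGC/|rev` trick used above complements each base and then reverses the string; a quick sanity check on a short made-up sequence:

```shell
# Reverse complement via sed (base-for-base complement) + rev (reverse).
# ATTAAAGG complements to TAATTTCC, which reversed is CCTTTAAT.
echo "ATTAAAGG" | sed y/ATCG/TAGC/ | rev
```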

Anyway, I now tried reproducing your pipeline but I also trimmed the reads using Trimmomatic:

brew install megahit -s # -s compiles from source

brew install trimmomatic sratoolkit

vdb-config

fasterq-dump SRR10971381

trimmomatic PE SRR10971381_{1,2}.fastq {1,2}{,un}paired.fastq.gz AVGQUAL:20 HEADCROP:12 LEADING:3 TRAILING:3 MINLEN:75 -threads 4

megahit -1 1paired.fastq.gz -2 2paired.fastq.gz -o out

# megahit -1 SRR10971381_1.fastq -2 SRR10971381_2.fastq -o out2 # try running MEGAHIT with no trimming
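The four Trimmomatic output filenames in the command above come from shell brace expansion; you can preview exactly what the shell passes to Trimmomatic like this:

```shell
# Brace expansion yields the four paired-end outputs in the order
# Trimmomatic PE expects: paired/unpaired for read 1, then for read 2.
echo {1,2}{,un}paired.fastq.gz
```

This prints `1paired.fastq.gz 1unpaired.fastq.gz 2paired.fastq.gz 2unpaired.fastq.gz` (brace expansion requires bash or zsh, not plain sh).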

The parameters I used with Trimmomatic are probably not optimal, but I just copied them from this article: https://research.csc.fi/metagenomics_quality.

At first I tried to use fastq-dump instead of fasterq-dump, but it only created a single file called SRR10971381.fastq instead of splitting the reads into two files called SRR10971381_1.fastq and SRR10971381_2.fastq.

I used `brew install megahit -s` to compile MEGAHIT from source, because without the `-s` option I got the error `dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib` (because I had GCC 11 and not 9).

When I didn't use Trimmomatic, my longest contiguous sequence was identical to your results, so it was 29802 bases long and it was missing the last 102 bases of the third version of Wuhan-Hu-1 and it had one extra base at the beginning. When I did use Trimmomatic, my longest contiguous sequence was now 29875 bases long, it was missing the last 30 bases from Wuhan-Hu-1 (so the poly(A) tail was only 3 instead of 33 bases long), and it still had one extra base at the beginning. I probably used the wrong parameters with Trimmomatic though, or I'm still missing some steps in the pipeline.

BTW you wrote: "I have run Megahit 1.1.3 and 1.2.9, with the latter producing a significant shorter contig with only 28,802bp". However, based on the final.contigs.fa and output.txt files of your MEGAHIT 1.2.9 run, your longest contig was actually 29,802 bases long: https://github.com/USMortality/Megahit-SARS-CoV-2/tree/master/out.

Oct 4, 2022 · Liked by Ben

Interesting work. Many people have called out the lack of “existence” for a long time now. Seems this reverse engineering may be another proof point?

Feb 7, 2023 · edited Feb 7, 2023

Zika/ I am curious as to the length of Zika or AIDS sequences found in the Wuhan dataset. Are we talking about a few reads with something like 20nt segments, or contiguous sequences of 100s or 1000s nt?

PacBio 1 of 4, Intro/ First, I want to explore the Pacific Biosciences platform (and their Covid kit) in detail and how it confirms the validity of the Wuhan assembly, and second, use the PacBio literature as an example for my assertion that a major problem with assembly reproducibility concerns the very beginning and terminus of the virus but not its center.

PacBio has a specific kit for Covid: https://www.pacb.com/research-focus/microbiology/public-health/covid-19-sequencing-tools-and-resources/

For details: https://www.pacb.com/wp-content/uploads/Application-Note-HiFiViral-SARS-CoV-2-kit-enables-genomic-surveillance-of-COVID-19.pdf

Three YouTube videos: https://www.youtube.com/watch?v=U5fd3l56dqE , https://www.youtube.com/watch?v=VdXxvOF8QrU, and an older version of the kit with longer reads https://www.youtube.com/watch?v=S7dt0i9z820

For an application to Omicron, which I will discuss in detail later: https://www.pacb.com/blog/omicron-and-beyond-hifiviral-kit-provides-labs-with-a-future-proof-solution-for-emerging-covid-19-variants/

A collection of raw Omicron reads if you want to compare assembly programs: https://downloads.pacbcloud.com/public/dataset/HiFiViral/Jan_2022/

The kit uses the Wuhan sequence (or one of its revisions) as the reference (i.e., template) for the very design of the kit and for the assembly program (they normally don't assemble from scratch, but they can substitute another assembly template besides Wuhan). I assert that if the assembly of Wuhan were way off, the design of the kit would have failed, and generation of consensus assemblies for new variants would also fail. The very success of this kit attests to the tolerable accuracy of the Wuhan assembly. The kit and software are mismatch (mutation) tolerant.

Human reads may have been masked for ethical reasons, since the SRA's website says: "Human metagenomic studies may contain human sequences and require that the donor provide consent to archive their data in an unprotected database. If you would like to archive human metagenomic sequences in the public SRA database please contact the SRA and we will screen and remove human sequence contaminants from your submission." (https://www.ncbi.nlm.nih.gov/sra/docs/submit/)

The NCBI has a tool called Human Scrubber which is used to mask human reads with N bases in SRA submissions (https://ncbiinsights.ncbi.nlm.nih.gov/2023/02/02/scrubbing-human-sequences-sra-submissions/):

> The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT) that will take as input a fastq file and produce as output a fastq.clean file in which all reads identified as potentially of human origin are masked with 'N'.

> You may also request that NCBI applies the HRRT to all SRA data linked to your submitted BioProject (more information below). When requested, all data previously submitted to the BioProject will be queued for scrubbing, and any future data submitted to the BioProject will be automatically scrubbed at load time.

> This tool can be particularly helpful when a submission could be contaminated with human reads not consented for public display. Clinical pathogen and human metagenome samples are common submission types that benefit from applying the Human Scrubber tool.

The usage message of Human Scrubber also says that human reads are masked with N bases by default: https://github.com/ncbi/sra-human-scrubber/blob/master/scripts/scrub.sh.
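What "masking with N" means in practice can be shown on a made-up FASTQ record: the read's bases are replaced with N while the header, separator, and quality lines are left intact. This is only a toy awk sketch of the idea; the real Human Scrubber masks only the reads that its STAT classifier flags as human.

```shell
# Mask the bases of every read in a FASTQ stream with 'N' (toy sketch,
# not the actual Human Scrubber logic). Line 2 of every 4-line record
# is the sequence; gsub replaces each base with N.
printf '@read1\nACGTACGT\n+\nFFFFFFFF\n' | awk 'NR % 4 == 2 {gsub(/./, "N")} 1'
```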

I tried to use ClustalW2 to align the genomes of the three different versions of Wuhan-Hu-1: `brew install clustal-w;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id='MN908947.{1,2,3}>sars.fa;clustalw sars.fa;cat sars.aln`.

The first 27 bases were "CGGTGACGCATACAAAACATTCCCACC" in the first two versions but "-----------ATTAAAGGTTT-----" in the current version, where a dash indicates a deletion, and 6 bases out of the 12-base sequence in the middle were also changed.

Apart from that, the only difference between the three versions was that the last 598 bases of the first version were deleted in the second version, and in the third version a new 44-base sequence was added to the end: "AGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA". The sequence of multiple A bases at the end is known as a poly(A) tail: https://bioinformatics.stackexchange.com/questions/11227/why-does-the-sars-cov2-coronavirus-genome-end-in-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. (The accepted answer at Stack Exchange said that "for coronaviruses in particular, we know that the poly(A) tail is required for replication", but I don't know if the tail is absolutely required in all coronaviruses, because it was missing from 9 out of the 55 coronavirus genomes listed on this page: https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118.)
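A quick way to measure the length of a poly(A) tail in the shell, using the 44-base addition quoted above (the 33 trailing A's are generated programmatically so there is no miscounting):

```shell
# Build the 44-base 3'-end addition (11-base prefix + 33 A's), then
# count the run of A's at the very end with grep and awk.
seq="AGGAGAATGAC$(printf 'A%.0s' $(seq 1 33))"
echo "$seq" | grep -oE 'A+$' | awk '{print length}'
```

This prints 33, matching the poly(A) tail length of the third version of Wuhan-Hu-1; the earlier A's inside the prefix are not counted because the pattern is anchored to the end of the line.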

For some reason the ORF10 protein was also missing from the first and second versions in GenBank. See this thread: https://twitter.com/BillyBostickson/status/1268514460349546497. Replies to the thread pointed out that the last 20 nucleotides of the second version (which are also included in the first and third versions but not at the very end) are "TGTGATTTTAATAGCTTCTT", which is identical to a segment of human chromosome 1, and the extra 598 nucleotides after it in the first version also match human chromosome 1:

- "...or it is simply a repetitive region and yes "automatic machines" (such as HiSeq 3000 / HiSeq 4000 systems) do make that kind of errors in particular near the 3'- (and also 5'-) ends w/o correction. it always helps taking a closer look" (https://twitter.com/ChrisDeZPhD/status/1290064988892274694)

- "No, not poly(A), rather a contamination from RP11-173E24 from human chromosome 1q31.1-31.3, but fraudulent? I don't think so." (https://twitter.com/ChrisDeZPhD/status/1290201009260654595)

- "Thanks I did read it. Still, looks like an obvious reading error at the 3' end by hybridisation with contaminated sample. can happen on Illumina. There is a 20 bp overlap btw CoV-2 3'-end and a complementary DNA-seq from h-chr 1q31.1-31.3. 99% match in 616 bp from nt 40414-41029." (https://twitter.com/ChrisDeZPhD/status/1290218272705531905)

If you do a BLAST search for the last 618 nucleotides of the first version, it's 99.68% identical to "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence". You can simply copy positions 29856 to 30473 from here: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.1. Then paste it here and press the BLAST button: https://blast.ncbi.nlm.nih.gov/Blast.cgi.

So I think that solves the question of why the first version was so much longer than the current version, but I still don't know why the beginning of the sequence was also changed, or why they added the sequence of 44 bases to the end of the third version that was missing from the second version.

BTW I think your MEGAHIT pipeline is missing steps like doing quality control with FastQC, merging reads with SeqPrep, and trimming reads with Trimmomatic: https://research.csc.fi/metagenomics_quality.

Software & Hardware, 1 of 2/ First, I am curious about whether you (Paul) are downloading the programs and datasets and running them locally, or running them on a remote server. Apparently USM is running them on the Amazon cloud.

It's just my old-fashioned mindset to think of local control. The last time I ran DNA programs, such as FASTA, was circa 1989, on a local 16-bit IBM computer running the “DNA-Star” package. Heck, DNA-Star (or parts of it) was even running on the 8-bit Commodore 64.

Second, I am just amazed at the rapidity of it all, and rapidity (i.e., “acceleration”) is the topic of the rest of this thread. I never used BLAST, but I remember reading about it (circa 1990) during the transition from the FASTA era to the current BLAST era. I got the impression that BLAST was precomputed: a “user” didn't start from scratch. BLAST was constantly doing alignments and searches in the background each time someone uploaded a submission. This was certainly true for proteins: alignments of proteins (and their domains), and assignments into families, were constantly updated.

I think that I may have figured it out. Part 2 of 2. If reads near the ends are less abundant, how does this interact with the software? Rounding off to 6E6 reads, a rigorous program may require 36E12 comparisons. This would take forever, so the program may take shortcuts. If the first comparison is started at random, the result may differ with each run. If reads near the edges are rarer, the shortcuts may miss them.
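The all-pairs comparison count is just n squared; for 6E6 reads:

```shell
# An all-against-all comparison of n reads is n*n pairs:
# 6,000,000 squared = 36,000,000,000,000, i.e. 36E12 as stated above.
n=6000000
echo $((n * n))
```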

In the construction of evolutionary trees from protein or DNA sequences, the programs don't do all possible comparisons, but take shortcuts. So multiple trees are generated by running the program several times (maybe hundreds), and the one that gets published may not be "the one true tree", but the tree with the best score, or a consensus of the better scoring trees.

I think that I may have figured it out. Part 1 of 2. If you repeat assembly several times (apparently you did it twice), do you have: 1) agreement in the center? 2) missing sequence in the ends (start and finish)? The PCR reaction probably uses a set of "random" primers, and to capture the ends, you have to be lucky that your set of primers contains a match for each end. So it is likely that you will have missing sequence at the very ends. Furthermore, the reads that are closer to the ends (but still captured), are likely to be less abundant than reads closer to the center.

Two people who would be good to verify this:

1. Kevin McKernan on Twitter or https://kevinmckernan.substack.com

2. Flavinkins (= Dayou15, formerly on Twitter) https://gab.com/Flavinkins

Luc Montagnier and Jean-Claude Perez, virologist analysts, July 2020. They analyzed the sequence of isolate="Wuhan-Hu-1".

For quoting from this paper, I was permanently banned from Twitter. Perez is on Twitter if you want to talk to him (he speaks French): @JCPEREZCODEX

https://www.researchgate.net/publication/342926066_COVID-19_SARS_and_Bats_Coronaviruses_Genomes_Peculiar_Homologous_RNA_Sequences_Jean_Claude_perez_Luc_Montagnier
