How Reliable is Megahit?

Ben

Jul 15, 2023

A Quality Control Study of the Tools Used to Discover the SARS-CoV-2 Genome

31 Comments

Potatodots

Jul 16, 2023 · Liked by Ben

WOW

Tom Childs

Jul 16, 2023

The overlords don’t want billions of people using up the earth’s resources. Satan wants to modify the natural God-given genome into his own image.

The intercession will occur in time to preserve a remnant of the natural genome.

Nanc

Nanc’s Substack

Jul 16, 2023

My question is exactly what is the point? Is this going to do anything? Save a life? This whole mass Democide shit show is exactly that, kill off the population.

Unfortunately, it’s working. Yet, I’ll remain hopeful that just maybe we will be able to find a way to reverse the damage while there are still humans walking the earth. But am not going to hold my breath.

Meg McAteer

Update

Jul 15, 2023

They engineered the viruses - viruses cannot be eradicated from earth.

There's no 'reference' genome - in that us mere mortals could ever be the arbiters of what constitutes a genome. Virus (RNA) forges its path forward through life to get to its next host.

Mere mortals like us can only watch in wonder.

Vonu

Jul 15, 2023

Why did they have to discover what they had made?

Edward Teach

Jul 16, 2023

Am I correct in observing that all the viruses that Megahit struggled with are coronaviruses?

henjin256

Nov 21, 2023

I reposted my comments here: https://mongol-fi.github.io/hamburgmath.html#USMortalitys_Substack_posts.

Basically the reason why you didn't get a complete contig for HKU1 is that the reference genome of HKU1 contains a region where the same 30-base segment is repeated 14 times in a row. And in HIV-1 and HIV-2 there's a long terminal repeat where a long segment at the 5' end of the genome is repeated at the 3' end of the genome. And porcine adenovirus contains a tandem repeat where the same 724-base segment is repeated twice in a row.

When you mixed together reads from multiple different viruses, you failed to get complete contigs for SARS2 and SARS1 because the contigs were split at a spot where there's a 74-base segment that is identical in the reference genomes of SARS2 and SARS1. But I was able to get complete contigs for SARS2 and SARS1 by increasing the maximum k-value of MEGAHIT from 141 to 161.

Sense_strand

Jul 27, 2023

Because you asked a metagenomics tool to assemble three close genomes at the same time. Rookie move.

henjin256

Jul 18, 2023·edited Jul 18, 2023

One of the reasons why the no-virus people were saying that the genome of SARS2 was fake was that in the Wu et al. paper where the Wuhan-Hu-1 reference genome was described, they wrote that the longest contig they got with MEGAHIT was 30,474 bases long but the longest contig they got with Trinity was only 11,760 bases long, so the no-virus people thought that Trinity produced a completely different genome for the virus. And they didn't realize that Trinity just split the genome into a couple of incomplete contigs which likely had only small gaps in between, even though with different settings Trinity may have also produced a complete contig.

However, your experiments show that even when you generated sets of reads which covered the whole genome of a virus, de-novo assemblers like MEGAHIT still occasionally failed to produce a single complete contig for the whole virus. So there's nothing particularly anomalous in how Trinity failed to generate a complete contig from Wu et al.'s reads, even though the reads actually covered the entire genome of Wuhan-Hu-1 apart from the last couple of bases of the poly(A) tail.

BTW the genomes of influenza viruses are about 15,000 bases long, but your influenza A, B, C, and D references are so short because they don't include all genes.

The first sentence of the post says: "The Wu et al. 2020 paper is the first to discover the genetic sequence of the novel pathogen SARS-CoV-2." However I think the team of Winjor Small Mountain Dog discovered it earlier in December 2019: https://www.researchgate.net/profile/Gilles-Demaneuf/publication/360313016_Sequencing_and_early_analysis_of_SARS-CoV-2_27_Dec_2019_-_The_crushed_hopes_of_Little_Mountain_Dog_of_Vision_Medicals_China/links/626fa7afb1ad9f66c89a1d13/Sequencing-and-early-analysis-of-SARS-CoV-2-27-Dec-2019-The-crushed-hopes-of-Little-Mountain-Dog-of-Vision-Medicals-China.pdf.

Nipples Ultra

Jul 17, 2023

As a 40-year veteran of Silicon Valley, I can strongly assure you that you should turn off comments.

henjin256

Jul 16, 2023·edited Jul 16, 2023

In your assembly experiment where you simulated the reads with wgsim, your longest MEGAHIT contig for HKU1 was only about 89% of the length of the HKU1 reference genome.

I also tried using wgsim to generate 100,000 reads for the reference genome of HKU1, and when I ran MEGAHIT to assemble the reads, my longest contig was only 26,535 bases even though the HKU1 reference genome is 29,926 bases: `brew install megahit -s;brew install seqkit samtools;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_006577' >hku1.fa;wgsim -N100000 hku1.fa hku1_{1,2}.fq;megahit -1 hku1_1.fq -2 hku1_2.fq -o megahku;seqkit stat hku1.fa megahku/final.contigs.fa`.

When I aligned the contigs with Bowtie2, I noticed that my three contigs covered positions 54-3136, 3397-29933, and 3397-3727, so there was a short gap from position 3137 to 3396: `bowtie2-build hku1{,}.fa;bowtie2 -p4 -x hku1.fa -fU megahku/final.contigs.fa|samtools sort ->hku1.bam;samtools view hku1.bam|awk -F\\t '{l=length($10);print$4,$6,l,$4+l}'|column -t`.
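The gap can also be worked out from the aligned intervals themselves. Here is a minimal Python sketch, assuming 1-based inclusive coordinates and using the three contig ranges and the 29,926-base reference length quoted in this thread. The function name `uncovered_regions` is only illustrative and is not part of any of the tools mentioned here.

```python
def uncovered_regions(intervals, ref_length):
    """Return 1-based inclusive (start, end) ranges not covered by any interval."""
    gaps = []
    cursor = 1  # first reference position not yet known to be covered
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= ref_length:
        gaps.append((cursor, ref_length))
    return gaps


# Contig ranges taken from the alignment output described above.
contigs = [(54, 3136), (3397, 29933), (3397, 3727)]
for start, end in uncovered_regions(contigs, ref_length=29926):
    # Print start, end, and length of each uncovered stretch.
    print(start, end, end - start + 1)
```

With these inputs the sketch reports the uncovered start of the genome (1-53) and the internal gap 3137-3396, which is 260 bases long.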

When I ran `seqkit subseq -r 3137:3396 hku1.fa`, I noticed that the gap which was not covered by any contig fell within a region where the 30-base segment AATGACGATGAAGATGTTGTTACTGGTGAC was repeated 15 times. You can also see the repeats here: https://www.ncbi.nlm.nih.gov/nuccore/NC_006577.2.

So if MEGAHIT had to assemble the contigs from unpaired reads that were 150 bases long, how could it know how many times the 30-base segment was repeated when it can only see one 150-base window of the genome at a time? One of the main reasons why the paired read layout is used is that it helps with sequencing regions that contain repeats: there are variable-length gaps between the forward and reverse reads, so a read pair covers a region that is longer than an individual read. But even though wgsim generated paired reads, I guess the region with repeats was so long that the paired reads didn't help MEGAHIT assemble the region correctly. And the default read length used by wgsim is actually only 70 bases.
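The "one window at a time" problem can be illustrated with a toy k-mer comparison. This is a rough Python sketch, not MEGAHIT's actual algorithm: it assumes error-free coverage, ignores read pairing, and puts the 30-base unit between random flanks that are hypothetical (not real HKU1 sequence). The value 141 is the default maximum k-value mentioned elsewhere in this thread.

```python
import random

UNIT = "AATGACGATGAAGATGTTGTTACTGGTGAC"  # 30-base repeat unit quoted above


def kmers(seq, k):
    """Set of all k-length substrings, roughly what a de Bruijn graph is built from."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_genome(copies, flank=500, seed=1):
    """Random flanks (hypothetical, same for every call) around `copies` tandem units."""
    rng = random.Random(seed)
    left = "".join(rng.choices("ACGT", k=flank))
    right = "".join(rng.choices("ACGT", k=flank))
    return left + UNIT * copies + right


K = 141

# Repeat block much longer than k: the copy number leaves no trace in the k-mers.
print("14 vs 15 copies identical:", kmers(toy_genome(14), K) == kmers(toy_genome(15), K))

# Repeat block shorter than k: some k-mers span both flanks, so the sets differ.
print("3 vs 4 copies identical:", kmers(toy_genome(3), K) == kmers(toy_genome(4), K))
```

When the repeat block is longer than k, genomes with 14 and 15 copies yield exactly the same set of k-mers, so k-mers alone cannot tell the copy number. Only information that spans the whole block, such as longer reads or read pairs with a large enough insert, can resolve it.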

A paper about HKU1 said: "Genome analysis also revealed various numbers of tandem copies of a perfect 30-base acidic tandem repeat (ATR) which encodes NDDEDVVTGD and various numbers and sequences of imperfect repeats in the N terminus of nsp3 inside the acidic domain upstream of papain-like protease 1 among the 22 genomes." (https://pubmed.ncbi.nlm.nih.gov/16809319/) And NDDEDVVTGD is the translation of the 30-base repeat: `seqkit translate<<<$'>a\nAATGACGATGAAGATGTTGTTACTGGTGAC'`.
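For anyone without seqkit, the same translation can be checked with a few lines of Python. This is a minimal sketch that builds the standard genetic code table by hand; the `translate` function below is my own helper, not a seqkit or Biopython API.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("AATGACGATGAAGATGTTGTTACTGGTGAC"))  # NDDEDVVTGD
```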

You also failed to assemble a complete contig for the porcine adenovirus genome, but it also contains a 723-base segment that is repeated twice between positions 29481 and 30929. This code finds repeats that are 100 bases or longer: `curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_044935' >adeno.fa;grep -v ^\> adeno.fa|tr -d \\n|awk '{x=100;for(i=1;i<=length-x;i++){s=substr($0,i,x);if(s in a)print i,s;a[s]}}'`.
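The awk loop above can be re-expressed in Python for readability. This is a sketch of the same idea (remember every window and report the ones seen before), tested here on a made-up sequence rather than the adenovirus genome. Unlike the awk loop, it also checks the final window of the sequence.

```python
import random


def repeated_windows(seq, window=100):
    """Yield (1-based position, window) for each window already seen upstream."""
    seen = set()
    for i in range(len(seq) - window + 1):
        s = seq[i:i + window]
        if s in seen:
            yield i + 1, s
        seen.add(s)


def random_dna(rng, n):
    """Random DNA of length n (hypothetical test data only)."""
    return "".join(rng.choices("ACGT", k=n))


# Toy input: a 50-base unit duplicated in tandem between two 100-base flanks.
rng = random.Random(0)
unit = random_dna(rng, 50)
toy = random_dna(rng, 100) + unit + unit + random_dna(rng, 100)

hits = [pos for pos, _ in repeated_windows(toy, window=20)]
print(hits[0], hits[-1], len(hits))
```

Every 20-base window that lies wholly inside the second copy is reported, so the hits begin at position 151, the first base of the second copy. On a real genome, a run of consecutive hit positions like this marks the second copy of a repeat.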
