Jul 16



The overlords don’t want billions of people using up the earth’s resources. Satan wants to modify the natural God-given genome into his own image.

The intercession will occur in time to preserve a remnant of the natural genome.


My question is exactly what is the point? Is this going to do anything? Save a life? This whole mass Democide shit show is exactly that, kill off the population.

Unfortunately, it’s working. Yet, I’ll remain hopeful that just maybe we will be able to find a way to reverse the damage while there are still humans walking the earth. But I am not going to hold my breath.


They engineered the viruses - viruses cannot be eradicated from earth.

There's no 'reference' genome - in that us mere mortals could ever be the arbiters of what constitutes a genome. Virus (RNA) forges its path forward through life to get to its next host.

Mere mortals like us can only watch in wonder.


Why did they have to discover what they had made?


Am I correct in observing that all the viruses that Megahit struggled with are coronaviruses?



Thank you for your controls. S.L. said in a video that the final SARS-CoV-2 genomes are not compatible with the SARS-CoV-2 primers. Allegedly the primers do not fit in the genomes, if I understood this properly. (Amplicon-based WGS)


Because you asked a metagenomics tool to assemble three close genomes at the same time. Rookie move.

Jul 18·edited Jul 18

One of the reasons why the no-virus people were saying that the genome of SARS2 was fake comes from the Wu et al. paper where the Wuhan-Hu-1 reference genome was described. The authors wrote that the longest contig they got with MEGAHIT was 30,474 bases long but the longest contig they got with Trinity was only 11,760 bases long, so the no-virus people thought that Trinity produced a completely different genome for the virus. They didn't realize that Trinity just split the genome into a couple of incomplete contigs, which likely had only small gaps in between, and that with different settings Trinity may also have produced a complete contig.

However, your experiments show that even when you generated sets of reads which covered the whole genome of a virus, de novo assemblers like MEGAHIT still occasionally failed to produce a single complete contig for the whole virus. So they also indicate that there's nothing that anomalous in how Trinity failed to generate a complete contig from Wu et al.'s reads, even though the reads actually covered the entire genome of Wuhan-Hu-1 apart from the last couple of bases of the poly(A) tail.

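BTW the genomes of influenza viruses are roughly 13,000 to 15,000 bases long in total, but they are split into seven or eight segments, so the reason why your influenza A, B, C, and D references are so short is presumably that each one covers only a single segment and doesn't include all genes.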

The first sentence of the post says: "The Wu et al. 2020 paper is the first to discover the genetic sequence of the novel pathogen SARS-CoV-2." However, I think the team of Winjor Small Mountain Dog discovered it earlier, in December 2019: https://www.researchgate.net/profile/Gilles-Demaneuf/publication/360313016_Sequencing_and_early_analysis_of_SARS-CoV-2_27_Dec_2019_-_The_crushed_hopes_of_Little_Mountain_Dog_of_Vision_Medicals_China/links/626fa7afb1ad9f66c89a1d13/Sequencing-and-early-analysis-of-SARS-CoV-2-27-Dec-2019-The-crushed-hopes-of-Little-Mountain-Dog-of-Vision-Medicals-China.pdf


As a 40-year veteran of Silicon Valley, I can strongly assure you that you should turn off comments.

Jul 16·edited Jul 16

In your assembly experiment where you simulated the reads with wgsim, your longest MEGAHIT contig for HKU1 was only about 89% of the length of the HKU1 reference genome.

I also tried using wgsim to generate 100,000 read pairs for the reference genome of HKU1, and when I ran MEGAHIT to assemble the reads, my longest contig was only 26,535 bases long even though the HKU1 reference genome is 29,926 bases long: `brew install megahit -s;brew install seqkit samtools;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_006577' >hku1.fa;wgsim -N100000 hku1.fa hku1_{1,2}.fq;megahit -1 hku1_1.fq -2 hku1_2.fq -o megahku;seqkit stat hku1.fa megahku/final.contigs.fa`.
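As a rough sanity check on those numbers, here is a small Python sketch (not part of the original commands). It assumes that `wgsim -N` counts read pairs and that both mates use wgsim's default length of 70 bases. The variable names are only illustrative.

```python
# Figures quoted in the comment above.
reference_length = 29_926   # HKU1 reference genome (NC_006577), in bases
longest_contig = 26_535     # longest MEGAHIT contig from this run, in bases

# Assumed wgsim settings: -N100000 read pairs, default 70 bases per mate.
read_pairs = 100_000
read_length = 70

# Share of the reference that the longest contig accounts for.
contig_fraction = longest_contig / reference_length

# Average number of simulated bases per reference position.
mean_depth = read_pairs * 2 * read_length / reference_length

print(f"longest contig is {contig_fraction:.1%} of the reference length")
print(f"mean simulated depth is about {mean_depth:.0f}x")
```

With a simulated depth in the hundreds, a shortage of reads seems an unlikely explanation for the missing bases.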

When I aligned the contigs with Bowtie2, I noticed that my three contigs covered the positions 54-3136, 3397-29933, and 3397-3727, so there was a short gap from position 3137 to 3396: `bowtie2-build hku1{,}.fa;bowtie2 -p4 -x hku1.fa -fU megahku/final.contigs.fa|samtools sort ->hku1.bam;samtools view hku1.bam|awk -F\\t '{l=length($10);print$4,$6,l,$4+l}'|column -t`.
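The gap can be derived from the three contig intervals with a few lines of Python. This is a minimal sketch that treats the coordinates as 1-based and inclusive; `uncovered_gaps` is a hypothetical helper name, not something from Bowtie2 or samtools.

```python
def uncovered_gaps(intervals):
    """Return the (start, end) gaps between inclusive intervals, after sorting them."""
    gaps = []
    covered_end = None
    for start, end in sorted(intervals):
        # A gap exists when the next interval starts past the covered region.
        if covered_end is not None and start > covered_end + 1:
            gaps.append((covered_end + 1, start - 1))
        covered_end = end if covered_end is None else max(covered_end, end)
    return gaps

# Contig positions reported in the comment above.
contigs = [(54, 3136), (3397, 29933), (3397, 3727)]

for gap_start, gap_end in uncovered_gaps(contigs):
    print(gap_start, gap_end, gap_end - gap_start + 1)  # prints 3137 3396 260
```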

When I ran `seqkit subseq -r 3137:3396 hku1.fa`, I noticed that the gap which was not covered by any contig fell within a region where the 30-base segment AATGACGATGAAGATGTTGTTACTGGTGAC was repeated 15 times. You can also see the repeats here: https://www.ncbi.nlm.nih.gov/nuccore/NC_006577.2.

So if MEGAHIT had to assemble the contigs from unpaired reads that were 150 bases long, how could it know how many times the 30-base segment was repeated, when it can only see one 150-base window of the genome at a time? One of the main reasons why the paired read layout is used is that it helps to sequence regions with repeats: there are variable-length gaps between the forward and reverse reads, so a read pair spans a region that is longer than an individual read. But even though wgsim generated paired reads, I guess the region with repeats was so long that the paired reads didn't help MEGAHIT assemble the region correctly. And the default read length used by wgsim is actually only 70 bases.
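The point about unpaired reads can be illustrated with a toy Python sketch. It builds two synthetic genomes that differ only in the number of copies of the 30-base repeat and compares every error-free read each could produce. The random 300-base flanks, the copy numbers 10 and 15, and the function names are all assumptions for illustration; real reads would also come from both strands and contain errors.

```python
import random

REPEAT = "AATGACGATGAAGATGTTGTTACTGGTGAC"  # 30-base repeat quoted above

def make_genome(copies, seed=1):
    """Synthetic genome: the same random 300-base flanks around `copies` tandem repeats."""
    rng = random.Random(seed)
    left = "".join(rng.choice("ACGT") for _ in range(300))
    right = "".join(rng.choice("ACGT") for _ in range(300))
    return left + REPEAT * copies + right

def windows(genome, length):
    """Set of all substrings of the given length, i.e. all possible error-free reads."""
    return {genome[i:i + length] for i in range(len(genome) - length + 1)}

ten_copies = make_genome(10)
fifteen_copies = make_genome(15)

# 150-base reads never span the whole repeat array, so both genomes
# produce the same set of possible reads.
print(windows(ten_copies, 150) == windows(fifteen_copies, 150))

# 500-base reads can span the whole 450-base array in the 15-copy genome,
# so the two sets of possible reads are no longer the same.
print(windows(ten_copies, 500) == windows(fifteen_copies, 500))
```

In this toy setting the copy number only becomes visible once a single read, or a read pair with a reliable insert size, reaches from one flank to the other.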

A paper about HKU1 said: "Genome analysis also revealed various numbers of tandem copies of a perfect 30-base acidic tandem repeat (ATR) which encodes NDDEDVVTGD and various numbers and sequences of imperfect repeats in the N terminus of nsp3 inside the acidic domain upstream of papain-like protease 1 among the 22 genomes." (https://pubmed.ncbi.nlm.nih.gov/16809319/) And NDDEDVVTGD is the translation of the 30-base repeat: `seqkit translate<<<$'>a\nAATGACGATGAAGATGTTGTTACTGGTGAC'`.
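For anyone without seqkit installed, the same translation can be checked with a short Python sketch that builds the standard genetic code table by hand. The `translate` function here is an illustrative stand-in, not seqkit's implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("AATGACGATGAAGATGTTGTTACTGGTGAC"))  # prints NDDEDVVTGD
```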

You also failed to assemble a complete contig for the porcine adenovirus genome, and that genome also contains a 723-base segment that is repeated twice between positions 29481 and 30929. This code finds repeats that are 100 bases or longer: `curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_044935' >adeno.fa;grep -v ^\> adeno.fa|tr -d \\n|awk '{x=100;for(i=1;i<=length-x+1;i++){s=substr($0,i,x);if(s in a)print i,s;a[s]}}'`.
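The sliding-window idea behind the awk one-liner can also be sketched in Python. The toy sequence below is made up for illustration, with a 10-base unit that occurs twice, and `repeated_windows` is a hypothetical helper name.

```python
def repeated_windows(seq, k=100):
    """Return 1-based start positions of k-base windows that already occurred earlier in seq."""
    seen = set()
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            hits.append(i + 1)
        seen.add(window)
    return hits

# Toy sequence: the 10-base unit appears at positions 3-12 and again at 16-25.
unit = "GATTACAGGC"
toy = "AC" + unit + "TTG" + unit + "CA"

hits = repeated_windows(toy, k=8)
print(hits)               # prints [16, 17, 18]
# A run of n consecutive hits with window size k implies a repeat of n + k - 1 bases.
print(len(hits) + 8 - 1)  # prints 10
```

By the same arithmetic, a 723-base repeat scanned with 100-base windows should show up as a run of 624 consecutive positions.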
