SARS-CoV-2 Genome Assembly (Part 2)

Mar 27, 2023

An update to my post from six months ago.

40 Comments

Mar 27, 2023Liked by Ben

Oh that is very, very interesting. I guess Wu or anyone else involved in the 2020 paper is not providing any answers. Here's little old me thinking that one should be able to replicate the results in published papers.

To me, publishing three different sequences should end it all right there.

Thanks for doing this work and sharing. It's big stuff.

Expand full comment

henjin256

Mar 27, 2023·edited Mar 27, 2023Liked by Ben

I now figured out a less hacky way to show the positions where all sequences in an alignment don't have an identical value. It also works with three or more sequences, so for example this shows all the differences between the three versions of Wuhan-Hu-1:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}|mafft ->temp.aln

Rscript -e 'm=t(as.matrix(Biostrings::readDNAStringSet("temp.aln")));pick=which(rowSums(m!=m[,1])>0);write.table(`rownames<-`(m[pick,],pick),sep=";",quote=F,col.names=NA)'

Or the same in gawk:

awk 'NR>1{gsub("\t"," ")sub("\n","\t")gsub("\n","");print}' RS=\> temp.aln|awk -F\\t '{printf";%s",$1;l=length($2);for(i=1;i<=l;i++){x=substr($2,i,1);a[i][NR]=x;b[i][x]}}END{print"";for(i in b)if(length(b[i])>1){o=i;for(j=1;j<=NR;j++)o=o";"a[i][j];print o}}'

Expand full comment

Reply (2)

Steve

Mar 30, 2023Liked by Ben

Way out of my depth here, but daring to comment ...

Given sequencing this genome has problems in reproducing the original posted work scientifically, and sequencing the entire genome itself has remained elusive, plus the versions all differ when others try to sequence the genome...

Is this as suspicious as it sounds? Or is it just the nature of this genome is a tricky?

Is it a case of because,

"the corona virus is an RNA virus. Because RNA is only a single strand of genetic material as opposed to the double strand of DNA, but also very long, a virus making copies of itself using RNA as a blueprint leads to many copy errors. Between 90-99+% of the copies it makes of itself are replication incompetent"? (copy paste from another substack - Occam's Razor)

Is this why the large variance in sequencing?

What were we testing for in PCR tests of patients?

I find the article above intriguing, but am wanting to be more specific of the implications when discussing with others ... So I do not misinterpret or misrepresent your work

Expand full comment

Reply (1)

henjin256

Apr 1, 2023·edited Apr 4, 2023

In a paper in 2020 titled "A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein", they described the RmYN02 sequence which was similar to some of the BANAL sequences that were published later in 2022: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7211627/. However the whole genome sequence of RmYN02 is not available from GenBank but only GISAID, so I tried to assemble it myself from the raw reads.

My MEGAHIT run produced fairly long contigs of three SARS-like bat viruses: a 29,601-base contig which had 97.4% identity with "Bat coronavirus isolate BANAL20247/Laos/2020" (because RmYN02 is not included at GenBank), a 7,111-base contig which had 99.4% identity with "Betacoronavirus sp. RsYN03 strain bat/Yunnan/RsYN03/2019", and a 4,492-base contig which had 97.2% identity with "Betacoronavirus sp. RmYN07 strain bat/Yunnan/RmYN07/2020". RsYN03 and RmYN07 were originally sequenced from the same pool of samples as my sample, which is why they had such high identity with my contigs: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8188299/#sec2title.

I now did the trimming with trim_galore, which is a bit more convenient than Trimmomatic because it automatically detects which adapters are being used and removes them:

brew install --cask miniconda;conda init ${SHELL##*/};conda -c bioconda install trim-galore

brew install blast seqkit

brew install -s megahit # use `-s` to compile from source because the bottle is compiled against GCC 9

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/009/SRR12432009/SRR12432009_{1,2}.fastq.gz # download about 4 GB of raw reads sequenced from a sample of bat shit

seqkit head -n10000 SRR12432009_1.fastq.gz|seqkit locate -p AGATCGGAAGAG # find reads which feature the Illumina universal adapter sequence or its reverse complement

trim_galore --paired SRR12432009_[12].fastq.gz --length 70 # remove reads shorter than 70 bases, remove Illumina universal adapter sequences, and trim reads with quality under 20

megahit -1 SRR12432009_1_val_1.fq.gz -2 SRR12432009_2_val_2.fq.gz -o megabat

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5' # download an aligned FASTA file of 335 SARS-like viruses

seqkit seq -g sarslike.fa>sarslike.ungap;makeblastdb -in sarslike.ungap -dbtype nucl # make a BLAST database of SARS-like viruses

update_blastdb.pl --showall tsv # list available BLAST databases

update_blastdb.pl --decompress ref_viruses_rep_genomes # download a small databases of viral genomes

<megabat/final.contigs.fa blastn -db 'ref_viruses_rep_genomes sarslike.ungap' -outfmt '6 qseqid sseqid pident length bitscore qlen slen stitle qstart qend sstart send'|tee blastout|awk '$3>=95&&$4>=500'|sort -rnk5|awk '!a[$1]++' # select matches with at least 95% identity and at least 500 bases overlap; show only best match for each read

In an earlier run where I didn't do any trimming, I actually got a bit longer contigs for RsYN03 and RmYN07, so maybe I did the trimming too aggressively. This plot shows the results of my earlier run: https://i.ibb.co/9Txbdgv/1.png. My plot only includes viruses where the match length was at least 500 bases and the identity percent was at least 95%. Most of the viruses I matched are bacteriophages that infect intestinal bacteria like E. coli. My plot only includes matches in two small BLAST databases for viruses, so it doesn't include the contigs which matched bacteria or bat DNA. In order to draw the colored clusters in my plot, I downloaded the sequences of all viruses from GenBank, I aligned them with MAFFT, I made a distance matrix of the alignment, I used the distance matrix as input for hierarchical clustering, and I cut the clustering tree at the height where it had 8 subtrees: `cutree(hclust(as.dist(1-bio3d::seqidentity(bio3d::read.fasta("input.fa")))),8)`.

The genome of RmYN02 might actually be fraudulent though. It was used to demonstrate that the furin cleavage site of SARS 2 has a natural origin, because RmYN02 contained a "PAA" sequence around the furin cleavage site which at that point had not been described in other SARS-like viruses even though it was later featured in some of the BANAL viruses published in 2022. In my alignment of the spike proteins, the region around the furin cleavage site is "QTNSPRRARSVA" in Wuhan-Hu-1, "PAA-------RV" in RmYN02, BANAL-20-116, and BANAL-20-103, "QTNS----RSVA" in BANAL-20-52 and BANAL-20-236, "ASIL----RSTS" in the Zhoushan bat virus ZC45). The point was that the "PAA" sequence would've evolved into the "PRRA" sequence, but it's weird that the "PAA" sequence is not featured in BANAL-52 which is much closer to SARS 2 than BANAL-20-116 or BANAL-20-103 is.

Here's how you can align the spike proteins:

brew install seqkit mafft brewsci/bio/snp-dists;conda install -c bioconda nextclade

nextclade dataset get --name sars-cov-2 --output-dir .

seqkit sort -lr megabat/final.contigs.fa|seqkit range -r 4:4|seqkit seq -rp|nextclade run --input-dataset sars-cov-2 --output-all cladeout

snp-dists sarslike.fa>sarslike.dist

awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i=="NC_045512.2")break;next}$i<=5000{print$1}' sarslike.dist|curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id=$(paste -sd, -)"|seqkit grep -nrp spike\|surface|cat - cladeout/nextclade_gene_S.translation.fasta|mafft ->spike.aln

seqkit subseq -r 692:703 spike.aln|seqkit fx2tab|sed $'s/lcl|//;s/_prot_[^\t]*//'|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print$2,$1" "a[$1]}' <(seqkit seq -n sarslike.fa|sed $'s/ /\t/;s/, complete genome//') -

Expand full comment

henjin256

Mar 29, 2023

When I ran MEGAHIT without Trimmomatic for SRR10971381, my second-longest contig was 16,037 bases long and its best match at BLAST was "Leptotrichia hongkongensis JMUB5056 DNA, complete genome" with 99.18% identity and 99% query coverage. My third-longest contig was 13,657 bases long and its best match at BLAST was "Select seq LR778174.1 Veillonella parvula strain SKV38 genome assembly, chromosome: 1" with 99.82% identity. My fourth-longest contig was 11,777 bases long and its best match was "Leptotrichia sp. oral taxon 212 strain W10393, complete genome" with 99.24% identity. I didn't find any matches for my fifth-longest contig, but my sixth-longest contig was 8,062 bases long and its best match was "Select seq CP068263.2 Homo sapiens isolate CHM13 chromosome 15" with 99.89% identity.

From the "Analysis" tab of the NCBI's sequence read archive, you can see which organisms the reads are estimated to come from: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR10971381&display=analysis. For SRR10971381, 61% of reads are unidentified, 33% are classified under bacteria, 5% under eukaryotes, and only 0.2% under viruses. So the ratio of 33% bacteria to 5% eukaryotes is consistent with my BLAST searches, where three out of four reads were from bacteria and the fourth was a human read.

To get the second-longest contig from the new dataset in this Substack article, run: `curl https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa|seqkit sort -rl|seqkit range -r2:2`. Then paste the result here and press BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn.

Expand full comment

Reply (1)

henjin256

Mar 28, 2023·edited Mar 28, 2023

I tried aligning 194 sequences of SARS 1:

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'

seqkit grep -nrp SARS\ coronavirus sarslike.fa|mafft --thread 4 ->sars1.aln

seqkit subseq -r-100:-1 -w0 sars1.aln|awk '{getline x;print x,$0}'

There was only one sequence which didn't have any gaps at the end, but the number of gaps at the end ranged from 8 to 242 in the other sequences, where in about half of the cases the only difference was the length of the poly(A) tail.

AY461660.1 (SARS coronavirus SoD) had a 24-base segment at the end which was not shared by any other sequence in my alignment (cacattagggctcttccatatagg). However when I ran `seqkit seq -g sars1.aln|seqkit locate -p cacattagggctcttccatatagg`, I found that the 24-base segment was the reverse complement of another 24-base segment near the end of the genome which started 63 bases earlier.

Another assembly error is that the beginning of AY559083.1 (SARS coronavirus Sin3408) includes two copies of a 29-base segment (ggaaaagccaaccaacctcgatctcttgt), where one copy starts at position 22 and another copy starts at position 57: `seqkit grep -nrp AY559083.1 sarslike.fa|seqkit seq -g|seqkit locate -p ggaaaagccaaccaacctcgatctcttgt`. `seqkit locate` doesn't ignore gaps so I used `seqkit seq -g` to remove gaps, but another way to remove gaps is `sed '/^>/!s/-//g'|grep .`.

Expand full comment

Reply (1)