Diffing coronaviruses
We can use standard UNIX tools to investigate the origins of the Wuhan coronavirus! I read on Wikipedia that “2019-nCoV has been reported to have a genome sequence 75% to 80% identical to the SARS-CoV and to have more similarities to several bat coronaviruses.” We can use diff to see those similarities:
$ ./genome_diff MG772933.1 MN988713.1
MG772933.1: 29802 words 26618 89% common 856 3% deleted 2328 8% changed
MN988713.1: 29882 words 26618 89% common 897 3% inserted 2367 8% changed
This says that there’s an 89% similarity between bat CoV (MG772933.1) and human nCoV (MN988713.1). More precisely, they share a subsequence of 26618 bases, out of only ~29800 bases in each genome.
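The 89% figure can be checked by hand from the word counts that wdiff printed. Here is a minimal sketch of that arithmetic; the helper name percent_common is made up for illustration, and the numbers are copied from the output above.

```shell
# Percentage of a genome covered by the shared subsequence, to one decimal place.
# percent_common is a hypothetical helper name, not part of wdiff.
percent_common() {
  awk -v common="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * common / total }'
}

percent_common 26618 29802   # bat CoV, MG772933.1   -> 89.3
percent_common 26618 29882   # human nCoV, MN988713.1 -> 89.1
```

Both values round to the 89% that wdiff reports.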
That genome_diff script looks like this:
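#!/bin/bash
# Download one genome by accession number and save it, as space-separated
# bases, in a file named after the accession.
fetch_genome() {
  curl -s "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=$1" \
    | grep -v '^>' | tr -d '\n' | sed 's/\(.\)/\1 /g' > "$1"
}
fetch_genome "$1"
fetch_genome "$2"
# -s prints statistics; -1, -2 and -3 hide deleted, inserted and common words.
wdiff -s -123 "$1" "$2"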
This script works by fetching each genome from the NCBI database. The strings “MG772933.1” and “MN988713.1” are accession numbers. The API returns the RNA sequence in FASTA format, which looks like:
$ curl -s 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?report=fasta&id=MN988713.1'
>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
...
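Before processing a FASTA file, it can help to sanity-check it by printing the header and counting the bases. The following is a minimal sketch: the function names fasta_header and fasta_length are made up for illustration, and the input is a tiny hand-written record rather than a real download.

```shell
# Print the description line(s) of a FASTA stream, without the leading '>'.
fasta_header() {
  grep '^>' | sed 's/^>//'
}

# Count the sequence characters in a FASTA stream, ignoring headers and newlines.
fasta_length() {
  grep -v '^>' | tr -d '\n' | wc -c | tr -d ' '
}

# A hand-written two-line record, not real NCBI data.
sample='>demo.1 made-up example record
ATTAAAGGTT
TATACC'

printf '%s\n' "$sample" | fasta_header   # the description line
printf '%s\n' "$sample" | fasta_length   # 10 + 6 = 16 bases
```

Run against the real download for MN988713.1, the count should line up with the 29882 words that wdiff reports.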
The FASTA format needs some “massaging” before we can diff it. The first line, starting with >, describes the sequence that follows. We don’t need this metadata, so we strip it with grep -v '^>'. Next, we don’t need those newline characters, so we strip them with tr -d '\n'. Then, because diff works on lines rather than characters, we’ll instead use wdiff, after separating the characters into separate words using sed 's/\(.\)/\1 /g'. This gives us genomes that look like A T A T T A G G ...
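To watch the three massaging steps work without touching the network, the same pipeline can be run on a small local sample. This is only a sketch: fasta_to_words is a made-up wrapper name, and the record fed to it is hand-written, not NCBI data.

```shell
# Turn a FASTA stream into one long line of space-separated bases:
#   grep drops the '>' header, tr joins the lines, sed puts a space after each character.
fasta_to_words() {
  grep -v '^>' | tr -d '\n' | sed 's/\(.\)/\1 /g'
}

# A hand-written record, not real NCBI data.
printf '>demo.1 made-up example record\nATTA\nAAGG\n' | fasta_to_words
echo   # the pipeline strips the newlines, so add one back for the terminal
```

Each base ends up as its own “word”, which is the unit wdiff compares.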
Finally, we can call wdiff -s -123 on these genomes, which gives us some statistics about their similarity. If we omit -s -123, we get the actual base differences between the sequences. For example, check out the end of the sequences:
$ ./genome_diff MG772933.1 MN988713.1 | fold | tail -2
[-A A C C A C-] T [-C G A C A-] {+T+} A G {+G+} A {+G+} A A {+T G+} A [-A A A A
A A A A A A-] {+C+} A A A A A A A A A A A A
Here, [-...-] marks bases found only in the bat CoV, and {+...+} marks bases found only in the human nCoV. We can see that both sequences end with a long run of As, but the bat CoV’s tail is significantly longer. This is known as a “poly(A) tail”.
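To compare the tails numerically rather than by eye, one option is to rebuild each side from the wdiff output and count the trailing As. The sketch below assumes wdiff’s default markers, with the output on a single line (not folded); old_side, new_side and poly_a_length are made-up helper names, and the sample line is hand-written rather than real output.

```shell
# Trim the doubled, leading and trailing spaces left behind once markers are removed.
squeeze() { tr -s ' ' | sed -e 's/^ //' -e 's/ $//'; }

# First file: drop {+inserted+} runs, keep the contents of [-deleted-] runs.
old_side() { sed -e 's/{+[^+]*+}//g' -e 's/\[-//g' -e 's/-\]//g' | squeeze; }

# Second file: drop [-deleted-] runs, keep the contents of {+inserted+} runs.
new_side() { sed -e 's/\[-[^]]*-\]//g' -e 's/{+//g' -e 's/+}//g' | squeeze; }

# Length of the run of As at the very end of a sequence (0 if there is none).
poly_a_length() {
  tr -d ' \n' | awk '{ match($0, /A+$/); print (RSTART ? RLENGTH : 0) }'
}

# Hand-written stand-in for a line of wdiff output.
line='T [-C-] {+G+} A [-A A-] {+C+} A A A'
printf '%s\n' "$line" | old_side | poly_a_length   # 6
printf '%s\n' "$line" | new_side | poly_a_length   # 3
```

In the sample, the first sequence is T C A A A A A A and the second is T G A C A A A, so the first has the longer tail, mirroring what the real diff shows for the bat CoV.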
A different way to see similarities is to use NCBI’s BLAST tool. Enter the accession number MN988713.1, and you’ll get a list of other sequences, ranked by “percent identity”. The most similar are several recent sequences of 2019-nCoV, followed by the “Bat SARS-like coronavirus”, followed by many SARS coronavirus sequences.
Correction 2020-02-11: My script used tr -d -C 'ATGC' to strip newlines, but it should use tr -d '\n'. It’s important, because the full set of FASTA characters in this file also includes S, W, and Y! For the meaning of these, see FASTA format.
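One way to spot characters like these is to tally every distinct character in the sequence. This is a minimal sketch with a made-up function name, base_counts, run on a hand-written record rather than the real genome.

```shell
# Tally each distinct sequence character in a FASTA stream, one "CHAR COUNT" pair per line.
base_counts() {
  grep -v '^>' | tr -d '\n' | fold -w 1 | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Hand-written record containing the ambiguity codes S, W and Y.
printf '>demo.1 made-up example record\nATSA\nWY\n' | base_counts

# The old filter silently drops anything that is not A, T, G or C:
printf 'ATSA\nWY\n' | tr -d -C 'ATGC'; echo   # ATA
# The corrected filter only removes newlines:
printf 'ATSA\nWY\n' | tr -d '\n'; echo        # ATSAWY
```

With the old filter, the two genomes would lose those positions before wdiff ever saw them, which shifts the word counts.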
This page copyright James Fisher 2020. Content is not associated with my employer.