Proving I can code enough to study bioengineering - Part 2

The biology part:

In part 1, we established that we need to calculate the G-C content of a genome. So what is G-C content? It is the percentage of Guanine and Cytosine bases in a given genome sequence. Guanine and Cytosine are two of the 4 different bases of DNA: Adenine, Thymine, Guanine, and Cytosine. These form so-called “base pairs”, which in turn are the building blocks of the DNA double helix. They always come in pairs of A-T and G-C.
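To make the definition concrete before the real coding starts, here is a minimal sketch of the calculation in Python. The function name `gc_content` and the example sequence are my own placeholders, and the sketch assumes the sequence is a plain string made up only of the letters A, T, G, and C.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    # Count guanine and cytosine, then divide by the total number of bases.
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


# Two of the four bases are G or C, so this prints 50.0
print(gc_content("AGCT"))
```

A real genome file would also need handling for unknown bases (often written as "N"), which this sketch ignores.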

Alright, so much for High School biology. Let’s expand on this a bit:

Each base (e.g. a single Adenine) is part of a single building block of the DNA. A single building block consists of:

  • A phosphate group on the outside

  • A sugar group

  • The base (e.g. Adenine) in the center

As we have 4 different bases, we have 4 building blocks to work with in DNA:

  • Phosphate group + sugar group + Adenine (“A”)

  • Phosphate group + sugar group + Thymine (“T”)

  • Phosphate group + sugar group + Guanine (“G”)

  • Phosphate group + sugar group + Cytosine (“C”) 

The phosphate group on the outside makes DNA acidic in nature (hence the “Acid” in the name). The fact that DNA consists of these repeating building blocks makes it a “polymer”, as opposed to a monomer, which is a single block not connected to any other. A polymer is therefore a chain of multiple monomers.

Adenine and Thymine (A-T) always connect to each other, as do Guanine and Cytosine (G-C). A major difference between the two combinations is the number of hydrogen bonds that hold them together: A-T has 2 hydrogen bonds, while G-C has 3.
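The pairing rule and the bond counts fit nicely into two small lookup tables. The following is only a sketch: the names `PAIRS`, `HYDROGEN_BONDS`, `complement`, and `hydrogen_bonds` are made up for illustration, and the sketch ignores the direction in which a strand is read.

```python
# Which base sits opposite which on the other strand.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds per base pair: 2 for A-T, 3 for G-C.
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def complement(strand: str) -> str:
    """Return the bases that pair with the given strand, position by position."""
    return "".join(PAIRS[base] for base in strand.upper())


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding this strand to its partner strand."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(complement("ATGC"))          # TACG
print(hydrogen_bonds("ATATATAT"))  # 16
print(hydrogen_bonds("GCGCGCGC"))  # 24
```

Both example strands are 8 bases long, but the G-C one is held together by 24 bonds instead of 16.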

[Illustration: DNA.PNG]

In the illustration, the pink outline represents the phosphate groups on the outside, the polygons (the shapes with corners connected to the pink) are the sugar groups, and the different coloured circles represent the bases (A-T; G-C).

The dotted lines in between represent the hydrogen bonds: 2 for A-T and 3 for G-C.

Since 3 hydrogen bonds hold better than 2, a G-C combination is more difficult to “break” than an A-T combination. Extrapolating from this, a genome sequence with a higher G-C content is more difficult to split than one with a lower G-C content. But why does that matter?

I will have Wikipedia guide me here, as it mentions the importance of the G-C content in molecular biology for performing PCR (polymerase chain reaction). Now, what is this PCR method? It’s a method to “amplify” a specific DNA sample. Meaning, you make more of the sample DNA to study it in greater detail. To duplicate the sample DNA, the first step is to physically separate the sample DNA double helix into its two chains. This is done by applying high temperatures and breaking the hydrogen bonds that keep these chains, well, chained together.

The G-C content is used to predict the specific temperatures needed to perform the PCR method on the sample DNA. The higher the G-C content, the higher the temperature needed.
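To get a feeling for how base counts turn into a temperature, here is a sketch using a rule of thumb that I am bringing in myself, the so-called Wallace rule: roughly 2 °C for every A or T and 4 °C for every G or C. It is only a rough estimate meant for short stretches of DNA (up to about 20 bases), not the way real PCR temperatures are worked out for a whole genome. The function name `wallace_melting_temp` is a placeholder.

```python
def wallace_melting_temp(sequence: str) -> int:
    """Estimate the melting temperature in degrees Celsius (Wallace rule).

    Assumption: this rule of thumb is only meant for short sequences.
    """
    sequence = sequence.upper()
    at_count = sequence.count("A") + sequence.count("T")
    gc_count = sequence.count("G") + sequence.count("C")
    # 2 degrees per A/T base, 4 degrees per G/C base.
    return 2 * at_count + 4 * gc_count


for seq in ("ATATATAT", "GCGCGCGC"):
    print(seq, wallace_melting_temp(seq), "degrees C")
```

The G-C-only sequence comes out at 32 degrees versus 16 for the A-T-only one, matching the idea that more G-C means more heat is needed to separate the chains.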

With this, I will leave the biology part. I believe we have learned enough for a basic data understanding and therefore should start to code. This will need to wait for the 3rd part of this series though. Until then!
