
Fast estimation of line count on linux

Working with huge data files (millions of lines), I often want to know roughly how many lines there are in a file. You can use the typical wc -l to count the lines exactly, but this takes a while for big files. If all you really need is a rough estimate (to the nearest million or so), here’s a quick script I wrote that can do this.

The idea is to use the file size, along with a guess of how big a line is (based on the average length of the first lines in the file), and then extrapolate to estimate the number of lines. Kudos to Stack Overflow answers for the idea. This is much faster than using wc -l, and accurate enough to get an idea of what you’re dealing with. Here’s wcle (word count line estimate):


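#!/bin/bash
# wcle - word count line estimate
# Fast line-count estimate for huge files
# By Nathan Sheffield, 2014

file=$1        # file to estimate (assumed to be the first argument)
nsample=1000   # number of leading lines to sample (assumed value)

# Bytes in the first $nsample lines
headbytes=$(head -q -n "$nsample" "$file" | wc -c)
#tailbytes=$(tail -q -n "$nsample" "$file" | wc -c)
#echo $headbytes

# File size in bytes (ls -s reports allocated size on disk, close enough for an estimate)
filesize=$(ls -sH --block-size=1 "$file" | cut -f1 -d" ")
#echo $filesize

# Extrapolate: multiply before dividing so integer division keeps precision
est=$((filesize * nsample / headbytes))
echo -n $est
echo " (" $((est / 1000)) "K;" $((est / 1000000)) "M )"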
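To see how the extrapolation behaves, here is a minimal sketch of the same idea wrapped in a function and run on a generated test file. The function name estimate_lines, the default sample size, and the test file contents are assumptions made for illustration. It also reads the file size with wc -c rather than ls -s, which is a swapped-in choice and not what the script above does.

```shell
#!/bin/bash
# Sketch only: estimate_lines is a hypothetical name, not part of wcle.
estimate_lines() {
    local file=$1
    local nsample=${2:-1000}   # assumed default sample size
    local headbytes filesize
    # Bytes in the first $nsample lines
    headbytes=$(head -n "$nsample" "$file" | wc -c)
    # Total bytes in the file (wc -c used here instead of ls -s)
    filesize=$(wc -c < "$file")
    # An empty file gives nothing to extrapolate from
    if (( headbytes == 0 )); then
        echo 0
        return
    fi
    # lines ~ filesize / (headbytes / nsample); multiply first to keep precision
    echo $(( filesize * nsample / headbytes ))
}

# Demo: build a file of 50,000 equal-length lines, then compare
tmp=$(mktemp)
awk 'BEGIN { for (i = 0; i < 50000; i++) print "ACGTACGTACGT" }' > "$tmp"
estimate_lines "$tmp" 100   # estimate from the first 100 lines
wc -l < "$tmp"              # exact count, for comparison
rm -f "$tmp"
```

When every line has the same length, the estimate matches the exact count. On real data the error depends on how representative the first lines are of the rest of the file.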




What is hemimethylated DNA?

It took me a moment to wrap my head around the concept of hemimethylation because I kept mixing it up with parental imprinting, which relies on parent-specific DNA methylation.

DNA hemimethylation is when only one of the two complementary strands is methylated. A hemimethylated site is a single CpG that is methylated on one strand but not on the other. This is not the same thing as allele-specific methylation, which is common in imprinting. In hemimethylation, we’re talking about the 2 strands of a single DNA molecule, both from the same parent. Hemimethylation is important because it can identify de novo methylation events, allowing you to differentiate between de novo and maintenance factors. Because DNA methylation is faithfully propagated during DNA replication (by DNMT1), any hemimethylated sites must have arisen during the last replication round, through either 1) a failure to faithfully propagate a parental methylation signal, or 2) a de novo methylation event. You can differentiate between the two if you know the methylation status of the parent: if the parent strand was entirely methylated, then hemimethylation indicates failure of maintenance. Conversely, if the parent strand was unmethylated, hemimethylation indicates de novo methylation.
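The decision rule in the last two sentences can be written out as a small sketch. The function name classify_cpg and the M/U encoding (M for methylated, U for unmethylated) are assumptions for illustration only, and the inputs are assumed to be valid.

```shell
#!/bin/bash
# Sketch only: classify_cpg is a hypothetical helper.
# $1 = state of the site in the parent (M or U)
# $2 = state of one strand now, $3 = state of the complementary strand now
classify_cpg() {
    local parent=$1 strand1=$2 strand2=$3
    if [ "$strand1" = "$strand2" ]; then
        # Both strands agree, so the site is not hemimethylated
        echo "not hemimethylated"
    elif [ "$parent" = "M" ]; then
        # Parent was methylated but one strand lost the mark
        echo "maintenance failure"
    else
        # Parent was unmethylated but one strand gained the mark
        echo "de novo methylation"
    fi
}

classify_cpg M M U
classify_cpg U M U
classify_cpg M M M
```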


What are chromosomal breakpoints?

Chromosomes can break! When they break, they get breakpoints. Ok, more seriously:

As a cell divides, the chromosomes all line up in the center of the cell during metaphase. Microtubules attach to the chromosomes and pull them apart, so half the DNA ends up in each daughter cell. In meiosis (the division that produces eggs and sperm), the paired chromosomes recombine before they get pulled apart, so the chromosome 5 you pass on to a child, for example, is actually a mix of the chromosome 5 copies you got from your mother and father. During recombination, the chromosomes must break and reattach. “Chromosomal breakpoints” refers to these places where they break. Occasionally something goes wrong and the reattachment happens in the wrong place, which can spell disaster. Usually the term “chromosomal breakpoints” is used in the context of some abnormality.

One of the early examples of this is the Philadelphia chromosome, a translocation between chromosomes 9 and 22. This means a chunk of chromosome 9 ends up on chromosome 22, and a chunk of chromosome 22 ends up on chromosome 9.

The term “breakpoint” refers to the position on the hybrid chromosome where the origin of the sequence switches, from chromosome 9 to chromosome 22 or vice versa.

Other abnormalities besides translocations also have “chromosomal breakpoints”. For example, when comparing human and chimp, there are several “inversions”, where the DNA comes from the same chromosome but a certain part of the chromosome is reversed. Such an inversion would also have breakpoints surrounding it. See this paper for an example.

Scientific Writing

In my opinion, scientific writing is often poor. All too often, when I try to read a paper, I find myself having to re-read sentences or paragraphs before I finally understand the meaning. Much of the time, I can figure out ways to say the important points in fewer words with better structure. Of course, it’s always easy to do this to someone else’s writing. I also find similar mistakes in my own writing when I go back to revise.

I think that one of the big problems is that bad writing is accepted in scientific fields. For example, scientific papers are almost expected to be written in the passive voice, which I find decreases readability. I have been thinking a lot about this lately and wondering what I can do to improve the situation. I am working on developing a workshop for scientific writing that will address this. [2011 update: there is now a website at Duke hosting my scientific writing workshop]

What is Allelic Imbalance?

I was looking around to try to find a good short explanation of what allelic imbalance is, but was unable to find one. I eventually figured it out, and wanted to make a post to clarify this for future searchers:

What is allelic imbalance?

  • A difference in the expression between two alleles.

Humans are diploid organisms, which means we have 2 copies of each gene. Normally, these two copies are expressed at the same level, so mRNA transcribed from the maternal copy and from the paternal copy is present in roughly equal amounts. Sometimes, however, this is not the case. When the ratio of the expression levels is not 1 to 1, we call it “allelic imbalance”. There are a variety of reasons why the expression may vary between the alleles. “Genomic imprinting,” in which epigenetic marks silence either the maternal or the paternal allele depending on its parent of origin, is one case. If one allele is silenced completely, then there will be an extreme case of allelic imbalance. Other scenarios may increase or decrease expression of one particular allele only slightly, resulting in imbalance to a lesser degree. Cis-acting mutations may alter regulation for just one allele through a change to promoter/enhancer regions (transcription factor binding sites), or even through 3′ UTR mutations that affect mRNA stability or microRNA binding.
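As a rough illustration of the ratio, the sketch below takes read counts for the two alleles of one gene and reports the fraction coming from the first allele. The function name allelic_ratio, the example counts, and the 0.1 tolerance around 0.5 are all assumptions for illustration. A real analysis would use a proper statistical test rather than a fixed cutoff.

```shell
#!/bin/bash
# Sketch only: allelic_ratio is a hypothetical helper.
# $1 = reads from allele A, $2 = reads from allele B,
# $3 = allowed deviation from a 0.5 fraction (assumed default 0.1)
allelic_ratio() {
    awk -v a="$1" -v b="$2" -v tol="${3:-0.1}" 'BEGIN {
        total = a + b
        if (total == 0) { print "no reads"; exit }
        frac = a / total              # fraction of reads from allele A
        diff = frac - 0.5
        if (diff < 0) diff = -diff    # absolute deviation from 1 to 1
        printf "%.2f %s\n", frac, (diff > tol ? "imbalanced" : "balanced")
    }'
}

allelic_ratio 50 50    # equal expression from both alleles
allelic_ratio 90 10    # one allele dominates
allelic_ratio 100 0    # one allele fully silenced, as in imprinting
```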

A good source for further information (if you have a subscription) is Detection of Allelic Imbalance in Gene Expression Using Pyrosequencing