Fast estimation of line count on Linux

Working with huge data files (millions of lines), I often want to know roughly how many lines a file has. You can use the usual wc -l to count the lines exactly, but this takes a while for big files. If all you really need is a rough estimate (to the nearest million or so), here’s a quick script I wrote that does this.

The idea is to take the file size, along with a guess of how big a line is (based on the average length of the first lines in the file), and extrapolate to estimate the number of lines. Kudos to Stack Overflow answers for the idea. This is much faster than wc -l, and accurate enough to get an idea of what you’re dealing with, as long as the first lines are representative of the rest of the file. Here’s wcle (word count line estimate):

wcle

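#!/bin/bash
# wcle - word count line estimate
# Fast line-count estimate for huge files
# By Nathan Sheffield, 2014

file=$1
nsample=1000
# Number of bytes in the first $nsample lines
headbytes=$(head -q -n "$nsample" "$file" | wc -c)
#tailbytes=$(tail -q -n "$nsample" "$file" | wc -c)
#echo $headbytes

# Size on disk in bytes (-H follows a symlink given on the command line)
filesize=$(ls -sH --block-size=1 "$file" | cut -f1 -d" ")
#echo $filesize

# Multiply before dividing so integer division does not throw away precision
estimate=$((filesize * nsample / headbytes))
echo -n $estimate
echo " (" $((estimate / 1000)) "K;" $((estimate / 1000000)) "M )"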
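
To sanity-check the extrapolation, here is a minimal sketch that runs the same arithmetic on a generated file and compares it with the exact count. The estimate_lines function is a hypothetical helper, not part of wcle, and it reads the size with wc -c rather than ls. Because every line in the test file has the same length, the estimate matches the exact count; on real data it drifts as far as the first 1000 lines differ from the rest.

```shell
#!/bin/bash
# Sketch only: estimate_lines is a hypothetical helper, not part of wcle.
estimate_lines() {
    local file="$1"
    local nsample="${2:-1000}"
    local headbytes filesize
    headbytes=$(head -n "$nsample" "$file" | wc -c)   # bytes in the sampled lines
    filesize=$(wc -c < "$file")                       # total size in bytes
    echo $(( filesize * nsample / headbytes ))
}

# Build a test file of 50,000 identical 13-byte lines (12 letters + newline)
tmp=$(mktemp)
yes "ACGTACGTACGT" | head -n 50000 > "$tmp"

est=$(estimate_lines "$tmp")
exact=$(wc -l < "$tmp")
echo "estimate: $est  exact: $exact"
rm -f "$tmp"
```

If the file has fewer lines than the sample size, the sample is the whole file and the result is just the sample size, so this approach only makes sense for files much longer than 1000 lines.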

What is hemimethylated DNA?

At first, it took me a moment to wrap my head around the concept of hemimethylation because I always seemed to get it mixed up with parental imprinting, which relies on parent-specific DNA methylation.

DNA hemimethylation is when only one of the two (complementary) strands is methylated. A hemimethylated site is a single CpG that is methylated on one strand but not on the other. This is not the same thing as allele-specific methylation, which is common in imprinting. In hemimethylation, we’re talking about the 2 strands of the same DNA molecule, from the same parent. Hemimethylation is important because it can identify de novo methylation events, allowing you to differentiate between de novo and maintenance factors. Because DNA methylation is faithfully propagated during DNA replication (by DNMT1), any hemimethylated sites must have arisen during the last replication round, through either: 1) a failure to faithfully propagate a parental methylation signal; or 2) a de novo methylation event. You can differentiate between the two if you know the methylation status of the parent: if the parent strand was entirely methylated, then hemimethylation indicates failure of maintenance. Conversely, if the parent strand was unmethylated, hemimethylation indicates de novo methylation.
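
The decision logic above can be written down as a small shell function. This is only an illustration: interpret_site and its 0/1 encoding of the strand calls are hypothetical, and the parent status is assumed to be known from some other measurement.

```shell
#!/bin/bash
# interpret_site TOP BOTTOM PARENT  (hypothetical helper, for illustration)
#   TOP, BOTTOM: 1 if the CpG is methylated on that strand, 0 if not
#   PARENT: "methylated" or "unmethylated" (status before replication)
interpret_site() {
    local top="$1"
    local bottom="$2"
    local parent="$3"
    if [ "$top" = "$bottom" ]; then
        # Both strands agree, so the site is not hemimethylated
        if [ "$top" = 1 ]; then echo "fully methylated"; else echo "unmethylated"; fi
    elif [ "$parent" = "methylated" ]; then
        echo "hemimethylated: maintenance failure"
    else
        echo "hemimethylated: de novo methylation"
    fi
}

interpret_site 1 0 methylated
interpret_site 0 1 unmethylated
interpret_site 1 1 methylated
```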

 

What are chromosomal breakpoints?

Chromosomes can break! When they break, they get breakpoints. Ok, more seriously:

As a cell divides, during metaphase, the chromosomes all line up in the center of the cell. Microtubules attach to the chromosomes and pull them apart, so half the DNA ends up in each daughter cell. In meiosis (the division that produces eggs and sperm), the paired chromosomes recombine before they get pulled apart, so the chromosome 5 you pass on, for example, is actually a mix of the chromosome 5 copies you got from your mother and father. During recombination, the chromosomes must break and reattach. “Chromosomal breakpoints” refers to these places where they break. Occasionally something goes wrong and the reattachment happens in the wrong place, which can spell disaster. Usually the term “chromosomal breakpoints” is used in the context of some abnormality.

One of the early examples of this is the Philadelphia chromosome, a translocation between chromosomes 9 and 22. This means a chunk of chromosome 9 ends up on chromosome 22, and a chunk of chromosome 22 ends up on chromosome 9.

The term “breakpoint” refers to the position on the hybrid chromosome where the sequence switches from its original chromosome to the other, from 9 to 22 or vice versa.

There are other abnormalities besides translocation that also use the term “chromosomal breakpoint” though. For example, when comparing human and chimp, there are several “inversions” — where the DNA comes from the same chromosome, but a certain part of the chromosome is inverted. Such an inversion would also have breakpoints surrounding it. See this paper for an example.

Resizing images in Ubuntu (in Bulk!)

A common problem with sharing photos is that the files are TOO BIG! When I load up my 7-megapixel camera, photos are close to 5 MB each! I want to put them on the net or email them to people, but the size limits me. Here I’ll show you how I run these through a single command to reduce them to whatever size I need, using ImageMagick and a Perl script I wrote. This really is pretty easy and hopefully these instructions are useful even for Linux newbs. Incidentally, this will also work on a Mac if you have ImageMagick and Perl installed, or on a Windows machine under Cygwin.

First you’ll have to install ImageMagick. This should be as easy as sudo apt-get install imagemagick. Of course you need Perl, but that’s standard on any Linux install. Now here’s the Perl script:

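#!/usr/bin/perl
use strict;
use warnings;
use Cwd;

my $some_dir = cwd;
print "$some_dir\n";
opendir(my $dh, $some_dir) || die "can't opendir $some_dir: $!";
mkdir "small" unless -d "small";

while (my $file = readdir($dh)) {
    # Skip hidden files, directories, and Perl scripts (including this one)
    next if ($file =~ /^\./ || -d "$some_dir/$file" || $file =~ /\.pl$/);
    print ":convert -resize 800 -quality 80 $file small/tn_$file\n";
    # The list form of system() bypasses the shell, so file names with spaces work
    system("convert", "-resize", "800", "-quality", "80", $file, "small/tn_$file");
}
closedir $dh;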

Steps

1. Save the above script as resize.pl (copy and paste it into a new text document).

2. Adjust the image width (800 in the example) and quality (80 in the example), and put the script in a folder.

3. Put all the images you want to resize into the same folder.

4. Open a terminal, cd to the folder, and type

perl resize.pl

That’s it! This will make a subfolder called small and put the newly resized images in there, each with a tn_ prefix on its file name.

If you’re not happy with the file size, just change the settings and run it again; it will overwrite the previous output. I find this little script really useful: you can now email however many pictures you want. You can also use this script to make thumbnails for a web site: just set the size to 200 or 300, lower the quality a bit, and you can get images that are around 30K each.
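
If you would rather not use Perl, roughly the same loop can be sketched in plain shell. This is an illustration under assumptions, not the script above: resize_all and the DRY_RUN switch are hypothetical names, only .jpg, .jpeg and .png files are picked up, and the demonstration runs on empty placeholder files in dry-run mode so it only prints the convert commands it would run.

```shell
#!/bin/bash
# Sketch only: resize_all and DRY_RUN are hypothetical names for illustration.
# Usage: resize_all [WIDTH] [QUALITY]   (run inside the folder of images)
resize_all() {
    local width="${1:-800}"
    local quality="${2:-80}"
    local f
    mkdir -p small
    for f in *.jpg *.jpeg *.png; do
        [ -f "$f" ] || continue        # skip patterns that matched nothing
        if [ -n "$DRY_RUN" ]; then
            # Only show what would be run
            echo "convert -resize $width -quality $quality $f small/tn_$f"
        else
            convert -resize "$width" -quality "$quality" "$f" "small/tn_$f"
        fi
    done
}

# Demonstration on placeholder files; nothing is actually converted
demo=$(mktemp -d)
touch "$demo/a.jpg" "$demo/b c.png" "$demo/notes.txt"
planned=$(cd "$demo" && DRY_RUN=1 resize_all 200 70)
echo "$planned"
rm -rf "$demo"
```

Leaving DRY_RUN unset makes the function call convert for real, which requires ImageMagick to be installed.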

Scientific Writing

In my opinion, scientific writing is often poor. All too often, when I try to read a paper, I find myself having to re-read sentences or paragraphs before I finally understand the meaning. Much of the time, I can figure out ways to say the important points in fewer words with better structure. Of course, it’s always easy to do this to someone else’s writing. I also find similar mistakes in my own writing when I go back to revise.

I think that one of the big problems is that bad writing is accepted (in science fields). For example, it almost seems expected that scientific papers are written in the passive voice. I, for one, find this to decrease readability. I have been thinking a lot about this lately and wondering what I can do to improve it. I am working on developing a workshop for scientific writing that will address this. [2011 update: there is now a website at Duke hosting my scientific writing workshop]

What is Allelic Imbalance?

I was looking around to try to find a good short explanation of what allelic imbalance is, but was unable to find one. I eventually figured it out, and wanted to make a post to clarify this for future searchers:

What is allelic imbalance?

  • A difference in the expression between two alleles.

Humans are diploid organisms, which means we have 2 copies of each gene. Normally, these two copies are expressed at the same level. This means that the mRNA transcript from the mother’s copy and the transcript from the father’s copy will have roughly the same number of copies. Sometimes, however, this is not the case. When the ratio of the expression levels is not 1 to 1, we call it “allelic imbalance”. There are a variety of reasons why the expression may vary between the alleles. “Gene imprinting,” when epigenetic marks silence either the maternal or the paternal allele depending on which parent it came from, is one case. If one allele is silenced completely, then there will be an extreme case of allelic imbalance. Other scenarios may increase or decrease expression of one particular allele only slightly, resulting in imbalance to a lesser degree. Cis-acting mutations may alter regulation for just one allele through a change to promoter/enhancer regions (transcription factor binding sites), or even through 3′ UTR mutations that affect mRNA stability or microRNA binding.
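
As a small worked example of the ratio idea, the sketch below turns per-allele read counts into the fraction of expression coming from one allele. The allelic_fraction helper is hypothetical and the counts are made up for illustration; a real analysis would also test whether the deviation from 0.5 is statistically significant.

```shell
#!/bin/bash
# Sketch only: allelic_fraction is a hypothetical helper; counts are made up.
# Usage: allelic_fraction COUNT_A COUNT_B
allelic_fraction() {
    # Fraction of reads from allele A; 0.50 means balanced expression
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / (a + b) }'
}

balanced=$(allelic_fraction 100 100)   # 1:1 ratio
skewed=$(allelic_fraction 150 50)      # 3:1 ratio, partial imbalance
silenced=$(allelic_fraction 200 0)     # one allele fully silenced
echo "balanced: $balanced  skewed: $skewed  silenced: $silenced"
```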

A good source for further information (if you have a subscription) is Detection of Allelic Imbalance in Gene Expression Using Pyrosequencing

Dell Latitude D630 vs Lenovo Thinkpad T61

The Dell Latitude D630 and the Lenovo Thinkpad T61 are in a comparable class, so many have asked which one is better. I did a lot of research on the two of them and finally decided on the D630, mainly for one reason: battery life.

Here I will summarize the information I gathered on the subject in relation to factors that I thought were most important: size, weight, performance, cost, and battery life.

Size

The size is pretty close on these two. Below I have used the 14.1″ widescreen T61, since that’s the one that matches the D630. It turns out the T61 is about a millimeter smaller in every dimension.

T61: 335.5 x 237 x 27.6 – 31.9 (mm)

D630: 337.1 x 238 x 32 (mm)

Weight

I call this difference negligible:

D630: starts at only 4.47 lbs
T61: starts at 5 lbs

Performance

Both had a WXGA+ screen option, which was important to me. The default processor, an Intel Core 2 Duo T7250 at 2.0 GHz, is exactly the same in the base model for both. RAM is irrelevant, because you can pick up 4GB of Corsair memory from Newegg for under $65, so there’s no point in paying the manufacturer to upgrade the RAM (it costs $400+ for 4GB), or in giving extra credit to a machine that comes with 1GB instead of 512MB (because you should upgrade it either way, in my opinion). Hard drives start the same too. The only real difference in starting specs is that the Dell comes with a 6-cell battery while the T61 comes with a 4-cell battery. For me, graphics is irrelevant because I don’t play new video games.

Cost

At the time of purchase, the costs of similar systems were almost identical. The Latitude was slightly (maybe $20) cheaper with basically the same specs, but I just considered them even.

Battery Life

Up until this point, I considered them about even. I probably was leaning toward the T61 because of the reputation Lenovo has for great customer service and long-lasting laptops with no problems. Not that Dell has a negative reputation in either of those categories, but I think consumer opinion is that Lenovo takes the cake. But the battery life is what turned me around.

First of all, the T61 (14.1″ edition) doesn’t have a 9-cell battery option. It only goes up to 7-cell, which, according to the spec sheet, gives up to 6.5 hours (which really means 3 hours of actual use). Furthermore, the optional ultrabay battery is only a 3-cell, and from what I read on the forums, this only gives about an extra 45 minutes to 1 hour of actual use time. On the other hand, the Dell can have a 9-cell primary battery and a 6-cell media bay battery. Besides this, from what I read on the forums, even with just a 6-cell battery the Dell had significantly better battery life than the T61.

Conclusions

I ended up with the D630. I am very happy with it. It seems to be built very well, I haven’t had any problems with it, and I really do get 4+ hours of battery life (using the computer for browsing, programming, etc.) on the 9-cell extended battery. I haven’t seen a need to get the media bay battery yet. I haven’t actually owned a T61 so I can’t give a perfect opinion, but for what it’s worth, I am a happy D630 user.

Here are some helpful links if you’re considering this same issue:

Thinkpad T61 Review

T61 spec sheet

Forum discussion and poll

Installing Ubuntu Linux on a Dell Latitude D630

I recently bought a new laptop from Dell. After weighing the options, I decided on the Latitude D630 over Lenovo’s T61 Thinkpad. Mostly it came down to the Dell having considerably better battery life.

The D630 came with Windows Vista Basic on it, but I prefer Ubuntu. However, I didn’t want to completely abandon Windows, because sometimes there are programs that I want to run that cannot easily be run on Linux (like computer games). I decided to dual-boot.

I installed the 64-bit edition of Ubuntu 7.10 (Gutsy Gibbon) without much difficulty. The Ubuntu installer guided me through an easy repartitioning of the hard drive, claiming a portion of the Windows partition for Linux. The Ubuntu boot loader recognized the Windows Vista installation and correctly prompts me on boot for which OS to load.

I had no problems with drivers. I had read earlier that some of the integrated Dell wireless cards have some driver issues with Ubuntu, so I elected for an upgraded Intel wireless card. If I recall correctly, everything worked perfectly out of the box, except the sound card. I got instructions for the fix from Martti Kuparinen, whose guide is no longer available:

sudo aptitude install linux-backports-modules
sudo gedit /etc/modprobe.d/alsa-base
# In the editor, add the following line to the file, then save:
options snd-hda-intel model=dell-m42

After a reboot, the sound worked fine. So the installation went really well, and I now use Ubuntu almost exclusively on the laptop. I get 4+ hours of battery life on the extended 9 cell battery.
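
If you prefer not to open an editor, the same kind of one-line config change can be scripted. The sketch below is an assumption-laden illustration, not part of the original guide: add_option_line is a hypothetical helper that appends a line only if it is not already present, and the demonstration writes to a temporary file rather than the real /etc/modprobe.d/alsa-base (which would need root).

```shell
#!/bin/bash
# Sketch only: add_option_line is a hypothetical helper.
# Usage: add_option_line FILE LINE
add_option_line() {
    local conf="$1"
    local line="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}

# Demonstration on a temporary file, not the real system config
conf=$(mktemp)
add_option_line "$conf" "options snd-hda-intel model=dell-m42"
add_option_line "$conf" "options snd-hda-intel model=dell-m42"   # second call changes nothing
count=$(grep -c "snd-hda-intel" "$conf")
echo "matching lines: $count"
```

Running it twice leaves a single copy of the line, so the change is safe to repeat.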

Here are a few things I noticed that you may want to consider if you’re thinking about this. First, dual monitor support is difficult. I haven’t been able to get my laptop to connect successfully to a second monitor. Usually I can get it to work after a fashion, but I can only mirror the desktops, and the resolution gives me problems. I was unable to get it to extend the desktop like I wanted it to. This may be due to incompetence on my part, however, as I have only been using Ubuntu (on my desktop) since August 2007; I’m sure it’s possible. Second, as far as I can tell, there is no way in Ubuntu to instruct the computer NOT to charge the battery while it is installed and the laptop is plugged in. I always just have to remove the battery if I don’t want it charged. But wait, you ask, why wouldn’t I want it charged? Because battery life is prolonged when you don’t leave your battery at max charge all the time. Check out these tips on prolonging lithium ion batteries if you’re interested in more information.

64-bit edition?

Of course I was a bit concerned that installing the 64-bit OS would cause me problems, but it hasn’t been too bad. At this point, they always warn you that some programs may not be able to run on a 64-bit OS, so you’re always safer to just install a 32-bit OS, even if you have a 64-bit processor. In fact, the Windows Vista that came with the laptop was 32-bit, which surprised me, because the processor is 64-bit. I have noticed a few difficulties, though, and I’ll highlight them here. Once again, these may be due to my incompetence, but I’ll present them anyway. First, I seem to have trouble with Java applets loading in Firefox. For example, the Facebook image upload applet won’t load for me. A quick Google search shows one possible solution, but you will also notice that it only works for 32-bit Ubuntu. I also had trouble getting MEGA (bioinformatics software) to work. They do claim support for Linux through WINE, but it only works on a 32-bit OS. That’s about the limit, though: I don’t think the 64-bit OS has caused me any other problems.

In conclusion, I think most people who are willing to try can successfully install Ubuntu on a D630.