Wednesday, December 1, 2010

Program to shred contigs into reads with overlap

Most assembly programs seem to limit the length of the reads they accept as input, which makes it hard to feed previously assembled contigs into a further assembly. Shredding the contigs into shorter reads that overlap by 50% or more should retain the benefit of the earlier assembly while staying under that limit.

Here is a small program to shred contigs into 1kb reads with 500 base pair overlap:


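# Shred each contig of a two-line-per-record multi-FASTA file into
# 1 kb reads that overlap by 500 bp.
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";

while (my $line1 = <$fh>) {
    my ($flag, $overlap, $length) = (0, 500, 1000);
    chomp $line1;                      # header line
    my $line2 = <$fh>;                 # sequence line
    last unless defined $line2;
    chomp $line2;
    my $seqlen = length $line2;

    while ($flag < $seqlen) {
        my $nextseq = substr $line2, $flag, $length;
        print $line1 . ":" . $flag . "-" . ($flag + length($nextseq)) . "\n";
        print $nextseq . "\n";
        last if $flag + $length >= $seqlen;   # this read already reaches the contig end
        $flag += $length - $overlap;          # step forward, keeping the overlap
    } # end of seqlen while loop

} # end of file while loop
close $fh;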

This script works on a multi-FASTA file in which each record has its header on one line and its entire sequence on the next line.
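The window arithmetic behind the shredding can be checked on its own. The following is a minimal sketch, not the original script: the helper name shred_windows and the 2300-base example length are assumptions made for illustration. It returns the start and end coordinates of each read for a given contig length, read length and overlap, and stops once a read reaches the end of the contig.

```perl
use strict;
use warnings;

# Return [start, end) coordinate pairs for reads of $length bases
# that overlap their neighbour by $overlap bases.
# shred_windows is a hypothetical helper name.
sub shred_windows {
    my ($seqlen, $length, $overlap) = @_;
    die "overlap must be smaller than read length\n" if $overlap >= $length;
    my @windows;
    for (my $start = 0; $start < $seqlen; $start += $length - $overlap) {
        my $end = $start + $length;
        $end = $seqlen if $end > $seqlen;   # last read may be shorter
        push @windows, [$start, $end];
        last if $end == $seqlen;            # this read reaches the contig end
    }
    return @windows;
}

# Example: a 2300-base contig, 1 kb reads, 500 base overlap.
print join(" ", map { join('-', @$_) } shred_windows(2300, 1000, 500)), "\n";
# prints: 0-1000 500-1500 1000-2000 1500-2300
```

Each read starts 500 bases after the previous one, so every base except those near the contig ends is covered by two reads.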

Tuesday, October 19, 2010

Handling duplicate accession numbers created by flowsim with sfffile command

Having simulated loads of 454 data with Flowsim, I wanted to assemble everything. Unfortunately, Flowsim gives one output .sff file for each input FASTA file. In my case there were some 100,000 FASTA sequences and hence 100,000 .sff files.

Newbler (the assembler for 454 data) just crashed when asked to assemble data spread across this many files. It worked with up to 500 .sff files, though.

When all the files were given as input to sfffile to combine into a single file, it gave a cryptic "segmentation fault" error. It seemed as though all the data generated by Flowsim would be wasted. As a last resort I wrote a script to combine the .sff files two at a time.

The script crashed after running for 100 files. The culprit was Flowsim: it had generated sequences with the same accession number in different runs. sfffile was failing with "Error: Duplicate accession chr1:56036789..56039961_1148- found. Unable to write SFF file.". Further investigation made it clear that duplicate accession numbers occurred rather frequently. Later versions of Flowsim may have fixed this issue (I used version 0.1).

Since the data had already been generated, it had to be "handled". sfffile does not have a flag to ignore duplicate accession numbers. Hence, a simple Perl script that checks whether sfffile failed in the previous step takes care of the duplicate accession numbers. Perl script below:

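 my $startfile="combined1.sff";   # last combined file known to exist
 my $total=100000;                # assumed number of input .sff files
 for my $count (2..$total){
 my $pcount=$count-1;
 my ($filename,$pfilename)=("combined$count.sff","combined$pcount.sff");
 # Build on the previous combined file if it exists, otherwise on the last good one
 my $base=(-e $pfilename)?$pfilename:$startfile;
 print "sfffile -o $filename $base myreads.fasta.$count.sff\n";
 system("sfffile -o $filename $base myreads.fasta.$count.sff");
 # No output file means sfffile failed (e.g. duplicate accession), so skip this input
 if(-e $filename){$startfile=$filename;}
 }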

Monday, July 5, 2010

Youtube bug or hidden feature?

Certain YouTube videos which are considered "inappropriate for some users, as flagged by YouTube's user community" are supposed to be hidden from users who are not logged in or who are under 18 years old. But alas, Google seems to have left a bug (or is it a hidden feature, not decipherable by the underaged?) which allows you to watch any video you want without having to sign in or sign up.

Just give the direct path to the video like "" and the video opens up in the browser. This seems to have been left open to allow external websites to embed YouTube videos. However, one would expect YouTube, with the "brilliance" of Google, to have blocked embed access to the "marked" videos.

Apart from the flaw in YouTube that has been hastily fixed by Google, there seem to be minor bugs, or rather design decisions, that are confusing. Whether they will just keep patching things or consider a redesign of the model will be interesting to watch.

Stars and stripes in perl

Perl can be made to print in color using Term::ANSIColor. This Perl program tries to generate the Stars and Stripes with 50 stars and 13 stripes. Although it's not perfect, with larger gaps between every second row of stars, it does a pretty OK job. Even the 2/5 ratio between the full and partial stripes is valid. This is the longest-surviving version of the flag, having completed 50 long years.

Perl does not seem to be able to print subscripts or superscripts, at least when printing to the terminal. The uneven spacing could probably be avoided by using subscripts or superscripts.

 use Term::ANSIColor qw(:constants);  
 $sixstars=" * * * * * * ";  
 $fivestars=" * * * * * ";  
 $emptystars=" ";  
 $partstripes=" \n";  
 $fullstripes=" \n";  
 print RESET;  
 print BOLD,WHITE,ON_BLUE,$sixstars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$fivestars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$emptystars;  
 print ON_WHITE,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$sixstars;  
 print ON_WHITE,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$fivestars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$emptystars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$sixstars;  
 print ON_WHITE,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$fivestars;  
 print ON_WHITE,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$emptystars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$sixstars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$fivestars;  
 print ON_WHITE,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$emptystars;  
 print ON_WHITE,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$sixstars;  
 print ON_RED,$partstripes,RESET;  
 print BOLD,WHITE,ON_BLUE,$emptystars;  
 print ON_RED,$partstripes,RESET;  
 print ON_WHITE,$fullstripes,RESET;  
 print ON_WHITE,$fullstripes,RESET;  
 print ON_RED,$fullstripes,RESET;  
 print ON_RED,$fullstripes,RESET;  

Wednesday, June 23, 2010

PERL writing to a file at both ends

Writing to a file in Perl 5 (Perl 6 has a few changes) is easy, and the mode specified while opening the file decides what happens to the existing content. Here is a list of different ways to write to a file and the change each brings about in the file.
open(FILE, ">>file.txt") or die $!;   # file.txt is a placeholder name
print FILE "hello\n";
Output: (appended at end of file)
open(FILE, ">file.txt") or die $!;
print FILE "hello\n";
Output: (note that previous text is erased and new file is created)
open(FILE, "+<file.txt") or die $!;
print FILE "hello\n";
Output: (note that previous text is replaced starting at the point where the file pointer is located at the time of printing)

Unfortunately, if we want to add text at the beginning of the file, we need to copy the previous text and write it all back in after writing the new text. In such cases a few of the Perl modules can solve the problem. In principle it's possible for several programs to be writing to the beginning and end of a file at (nearly) the same time.
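The copy-and-rewrite approach to adding text at the beginning of a file can be sketched with core Perl alone. This is a minimal sketch under simple assumptions: the sub name prepend_to_file is made up for illustration, the whole file is assumed to fit in memory, and no other program is assumed to be writing to the file at the same time.

```perl
use strict;
use warnings;

# Put $text at the beginning of the file at $path by reading the old
# content and writing the new text followed by the old content.
# prepend_to_file is a hypothetical helper name.
sub prepend_to_file {
    my ($path, $text) = @_;
    my $old = '';
    if (-e $path) {
        open my $in, '<', $path or die "Cannot read $path: $!";
        local $/;                        # slurp mode: read the whole file at once
        $old = <$in>;
        $old = '' unless defined $old;   # empty file
        close $in;
    }
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $text, $old;
    close $out or die "Cannot close $path: $!";
    return;
}
```

Calling prepend_to_file("file.txt", "hello\n") would leave "hello" as the first line and the earlier content after it. For large files or concurrent writers, a temporary file plus rename, or file locking, would be the safer design.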

Thursday, June 17, 2010

Unix tail bug?

The UNIX tail command is probably one of the most widely used Unix commands. There seems to be a "bug", if you want to call it that, in its functionality.

Try running tail -5 in a directory where the files have data written into them continuously, like log files. Something like "tail -5 *" should get you just the last 5 lines from each of the files in the directory.

However, try running it a few times and you will be surprised to see that for a few of the files more than 5 lines are displayed!!

If you are unlucky enough to use this in a shell script that requires just 5 lines and no more, the script will fail a few times but seem to work just fine at other times. A transient bug which can slip past the best testing.

This happens only when you use wildcards to specify the files, so presumably the error occurs because the command fails to realize it has already read a particular file. Using other non-standard utilities like since or multitail might solve the problem. The programmer could even get clever and keep just the top 5 lines of the output of tail for each file.
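If a script must never see more than 5 lines per file, one option is to skip tail and enforce the limit in the script itself. The following is a minimal sketch of that alternative, not a fix for tail: the sub name last_lines is an assumption, and each file is read once from start to end, so it suits log files of moderate size.

```perl
use strict;
use warnings;

# Return at most the last $n lines of the file at $path.
# last_lines is a hypothetical helper name.
sub last_lines {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @keep;
    while (my $line = <$fh>) {
        push @keep, $line;
        shift @keep if @keep > $n;   # never hold more than $n lines
    }
    close $fh;
    return @keep;
}

# Print a tail-style header and the last 5 lines of every file
# named on the command line.
for my $file (@ARGV) {
    print "==> $file <==\n", last_lines($file, 5);
}
```

Because the buffer is capped at $n lines, the output cannot exceed the limit even if the file grows while it is being read.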

More importantly, is this a bug in Unix tail, in the wildcards, or just an undesired side effect we have to live with? I could see this "bug" in my 9.8. Not sure if this thing is endemic to this flavor. When I tried the same with the GNU core utilities on openSUSE, I had to use the correct form "tail -n 5 *", and this works fine. It's always better to have the GNU core utilities than the standard utilities that come with some of the less frequently updated flavors.

Tuesday, June 8, 2010

Ingenious ways of hiding e-mail from spam bots

Putting up your e-mail address on your website or blog is an open invitation to spam bots to harvest it and start spamming your inbox or congesting your spam folder. A congested spam folder is a big problem if the spam filter puts genuine mail there. Having to sort through the spam can be a nightmare if the folder is full.

We have seen the obvious protections, such as using [at] instead of @, which unfortunately are not good enough as spam bots evolve too. Having an image showing the mail id is mostly safe, but makes it impossible for the user to copy and paste it. Services such as reCAPTCHA have made it possible to protect your mail id with an image. However, for the lazy user this may be too much work just to see the mail id.

A few ingenious ways to hide a mail id in plain sight:
  1. Remove the "meaning of the word that will confuse the spam bot" from the
  2. Use the popular version of the domain in
  3. Do the math in
  4. No numbers in
If this is not enough to confuse the spam bots, we could always use other methods, such as a server-side contact form which hides the mail id or a client-side script to render the mail id.

Sunday, May 30, 2010

Setting up and getting the ball rolling in R language

You can download and install R for Linux, windows or Mac from here. Select the platform that best describes your system and install R.

In Linux, R can be started from the command line or terminal by just typing "R" and pressing enter. Once R has started, we are free to use R commands to perform various functions. You can come out of R by typing "q()" and pressing enter.

Most R installs come with an HTML help manual which opens in the default browser and can be started by typing "help.start()" and pressing enter. You can even look here if your httpd server is running.

Saturday, May 29, 2010

Sheep grazing in the field

Sheep eating grass in a field surrounded by electric fencing. Eco-friendly grass cutting?

Friday, May 28, 2010

Flowers after a rain

Yellow and red hues in the flower sprinkled with water droplets after a light drizzle. The green background adds to the contrast.

Thursday, May 27, 2010

Animal eyes in the dark

See the glow in the eyes of all three goats. I always wondered about the glow in the eyes of animals after seeing them in cartoons. Now I have seen it live through the camera. Rather than the eyes glowing, it seems to be the reflection of the camera flash.

Wednesday, May 26, 2010


Fossil fuels were formed over millions of years from plant biomass. Rapid use of fossil fuels is releasing the carbon in these fuels in the form of carbon dioxide. Hence, there is an increase in the carbon dioxide concentration in the earth's atmosphere. With growing energy demands and depleting fossil fuel reserves, we are in need of alternative renewable sources of energy.

A large-scale transition to renewable energy is not possible in the short term because the current technology for harnessing the alternative sources is not cost effective. The use of biomass to meet energy needs has been explored with significant success. Gaseous fuel products from biomass, such as biohydrogen and biomethane, are considered good sources due to their portability and efficiency.

Hydrogen can be produced by electrolysis or from fossil fuels, on either a small or a large scale. Large-scale production from fossil fuels has the advantage that the carbon dioxide can be captured, either to stimulate plant growth or for storage in chemical form such as carbonates or in underground reservoirs.

Methane production through the anaerobic digestion of wastewater and residues involves hydrogen as an intermediate product, which is rapidly taken up and converted to methane by methane-producing microorganisms. The degradation of organic matter to methane and carbon dioxide by microorganisms in the absence of oxygen is called anaerobic microbial digestion. This digestion occurs in several phases involving many microbes. The complex organic compounds are first degraded to simple molecules. In the second phase these molecules are degraded into organic acids and hydrogen. In the last step the organic acids and hydrogen are converted into methane.

Biophotolysis involves many microalgae and cyanobacteria which are able to split water into hydrogen and oxygen with the aid of absorbed light energy. However, this process is limited by the efficiency of the enzyme involved in the conversion process. The enzyme is inhibited by the oxygen produced in the process of splitting water. Several variants of this process are being developed to separate the hydrogen and oxygen production steps.

Organic compounds like acetic acid are converted into hydrogen and carbon dioxide by bacteria using sunlight, in what is known as photofermentation. However, this process is difficult to scale up as it requires a large surface area to capture the light needed to drive it.

Photosynthesis, CO2, Biomass from Plants & Algae, Biohydrogen

The changing dynamics of energy needs are going to affect not only the way energy is produced but also how it's going to be used. The winning entry for the History channel's City of the Future competition predicts a future for San Francisco that has hydrogen-fueled hover car networks. The city will have specific structures to collect, store and distribute water and power from various sources. Harvesting solar energy would be a very effective contributor. Currently energy is produced by processing biomass into ethanol and biodiesel. However, energy is lost in producing all the other complex compounds that form part of the biomass. Photosynthetic organisms that produce biofuels directly will be more energy efficient than processing the biomass that's produced by plants. This is the concept behind photanol: a fermentation product made by a synthetically designed organism from carbon dioxide, water and solar energy.

Being able to produce hydrogen for use as a fuel by splitting water using solar energy is a long-term goal for overcoming the energy crisis. Various options are being explored to perform the task of splitting. However, using cyanobacteria for photobiological production of hydrogen has been found to be a very promising option. Many microorganisms can produce hydrogen using enzymes called hydrogenases. This biohydrogen can be used as a fuel for various purposes due to its portability.

Two different approaches are being pursued to produce biohydrogen. The first approach is the nitrogenase based approach and it involves knocking out the uptake hydrogenase. The second approach is to introduce a foreign hydrogenase. Both the approaches are being tried out by various companies. Different growth conditions and mechanisms are being observed to get the optimal system.

Interesting developments in using LED technologies as an additional, low-energy artificial supply of light with optimal properties for photosynthesis are being explored. A growing understanding of genetic engineering and of the regulation of transcription and translation will improve the design of the organism used to produce the fuel. Other areas such as mass culturing of the microorganism can also lead to significant cost reduction and stability.

Due to the complex nature of biological systems, various problems arise, such as auto-inhibition of growth in the model organism while it produces the new compound. The resistance mechanisms to such inhibition have to be studied and expressed in the organism to get a higher yield.

Tuesday, May 11, 2010

Energy forests, Salix program & breeding

The increasing demand for energy and decreasing oil production have made it imperative to find and exploit new forms of energy. Various forms of energy such as hydro power, nuclear power, wind and biomass are being used as alternative sources of energy. Biomass has emerged as a very important form of alternative energy due to its renewable nature, low impact on the environment and cost benefits. Growing energy in the form of forest trees has been found to be a sustainable model for the production of energy. However, strict regulatory policies on felling and growth rates are required to keep the forest cover in balance.

Deciduous trees of the genus Salix grow mostly in moist soils in cold and temperate regions of the Northern Hemisphere. Being a perennial crop with a life span of 20 to 25 years, Salix is ideally suited for cultivation with the aim of harvesting for biomass. It requires a low input of fertilizers and pesticides for its growth, making it easy to cultivate.

The advantage of growing these energy forests can be further increased by using the trees for phytoremediation. Short-rotation crops such as willows offer the double advantage of high biomass yields and removal of hazardous compounds through frequent harvests. They can help clean up polluted sites which contain heavy metals such as cadmium.

Breeding programs to improve biomass production, drought and heat tolerance and resistance to pests are underway. With the aim of growing the plants in southern Europe, which has higher temperatures, heat-tolerant strains are being sought. Leaf beetles are the major pests of willows and reduce the biomass by up to 40%. Leaf rust caused by fungi also causes biomass losses in excess of 40%. As a result of the breeding programs, new strains which increase biomass production by 60% have been selected for use.

Genomics-based approaches which use the sequenced genomes of the trees have been used to find genes associated with specific traits. Molecular markers identified by crossing have been associated with the genes concerned. It has been predicted that successful use of knowledge from genome sequencing projects will require identifying the polymorphisms associated with traits of interest, the frequency of superior alleles in the base breeding population and their phenotypic effect. Hence, just sequencing the genomes without a proper understanding of the mechanisms involved in the various traits will not be of much use. Efforts to sequence ESTs and to study the expression patterns associated with different environmental conditions are being undertaken to bridge this gap.

Tuesday, May 4, 2010

Artificial photosynthesis and Synthetic biology

Future global energy needs cannot be met by any single source of energy known to us today. Contributions from different energy sources might make it possible to meet the energy needs. Solar energy is converted into biomass, which can be used as an energy source. However, the production of biomass is an inefficient process. Hence, different approaches to mimic the efficient parts of the system are being attempted.

A hydrogenase enzyme, which catalyzes the formation of hydrogen, is coupled to photosystem II so that water can be used to capture solar energy more effectively. Among the different steps involved in photosynthesis, the following are considered to be worth mimicking.

1. Absorb light and funnel energy
2. Convert energy to charge separated state
3. Couple charge separation to catalysis
4. Higher level of organization

Different molecules and molecular complexes are being perfected for each of the steps in the hope of increasing the efficiency.

Synthetic biology is the design and construction of new biological parts, devices and systems for useful purposes. The purpose of making parts and devices is to be able to have standardized components which could be used to build devices. The registry of standard parts is one such collection of parts such as promoters, ribosome binding sites, protein domains, protein coding sequences, translational units, terminators etc.

The standardized parts known as bricks are characterized and ready to use for that specific function. Computer aided design and simulations will play a significant role in this process. Simulating the model of the system can give results which can be used for designing the system. Unpredictable results may occur while using these design principles due to cross talk between the different components. These have to be taken care of and modularized.

Application areas for synthetic biology are widespread. It could help in fields such as bioenergy, drugs and chemicals, biomaterials, medicine etc. Coordinating the bacteria or yeast involved in fermentation by engineering the microbes could eliminate the need for monitoring the culture as it will be self regulated. With an increased knowledge of the various cellular processes, it should be possible to engineer entire new cellular systems as per requirement. Such a use of synthetic biology to design a cell from ground up might be made possible by integrative synthetic biology.

Sunday, May 2, 2010

The last statue

His hand stopped short of the lever. After having demolished more than a million statues, it seemed weird that he should feel guilt at having to destroy this small statue.

Maybe it was the burden of the knowledge that this was the only statue made by man still standing. If he just pushed the lever, mankind would probably never make another statue ever. All knowledge about the arts had been destroyed in the revolution. It seemed these stones were the only proof of a saner time.

"What's taking you so long?" the supervisor was shouting at him from the ground. If he didn't do it, somebody else would. At least they could not erase the statue from his mind.

He touched the lever with a sinking feeling in his heart and aimed the crushing ball at the statue.

Friday, April 30, 2010

How is email spam found out?

Email spam is not just a problem but a menace, given the increasing number of active spammers. Separating spam from legitimate mail is a very important data classification problem. However, wrongly classifying a legitimate email as spam costs more than letting a spam message through. Legitimate mail is known as ham, and ham that ends up in the spam folder is a false positive, the result of improper classification. Hence, the threshold is selected in such a way as to avoid false positives.

Various features of the mail, such as the sender's mail id (blacklisted or not) and the occurrence in the content of words common to spam mail like 'free', 'buy' etc., are used to judge whether a mail is spam or not.
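A toy version of such feature-based scoring is sketched below. It is only an illustration and not how any real filter works: the word weights, the blacklist penalty of 5, the threshold and the sub names are all made-up assumptions. A higher threshold means fewer legitimate mails are wrongly flagged, at the cost of letting more spam through.

```perl
use strict;
use warnings;

# Hypothetical weights for words that are common in spam.
my %spam_weight = (free => 2, buy => 2, winner => 3, offer => 1);

# Sum the weights of spam words found in the body and add a
# (made-up) penalty of 5 if the sender is blacklisted.
sub spam_score {
    my ($sender, $body, $blacklist) = @_;
    my $score = $blacklist->{ lc($sender) } ? 5 : 0;
    for my $word (lc($body) =~ /[a-z]+/g) {
        $score += $spam_weight{$word} // 0;
    }
    return $score;
}

# Classify as spam (1) only when the score reaches the threshold.
sub is_spam {
    my ($sender, $body, $blacklist, $threshold) = @_;
    return spam_score($sender, $body, $blacklist) >= $threshold ? 1 : 0;
}

# Example usage with an illustrative blacklist and threshold.
my %blacklist = ('bad@example.com' => 1);
print is_spam('bad@example.com', 'Buy now, FREE offer', \%blacklist, 6), "\n";
```

With these weights, a mail from a friend that merely mentions a "free lunch" scores 2 and stays below a threshold of 6.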

Email Spam – detection and anti detection methods

Server-side spam filters such as SpamAssassin make use of rules which classify mail into spam and ham based on the occurrence of different types of words and features. It uses a method based on a neural network algorithm to do the classification. SpamBayes provides tools for Outlook, for Gmail and Yahoo over POP3 and IMAP, and for many other popular email clients. It is based on statistical analysis of the email content.

The methods which rely on content based classification of spam have been very effective as the spammer has to deliver his spam message whose content is very different from a legitimate mail in many ways.

The empire fights back

The spam empire has fought back with many changes. Tools such as Spam checker will check the mail and suggest synonyms or changes to make it look less like spam and more like legitimate mail. Although the spammer can make such changes, he can't make the mail ridiculously complex and incomprehensible. New rules, and new synonyms to escape the rules, make this an ongoing battle between spam and spam detectors. Gmail uses the same content-based approach to decide which ads are relevant for a particular mail user.

Using images in place of the words that might give them away has been a popular method among spammers to avoid detection by such content-based methods. Recognizing the characters and words in the images and checking them for spam is the obvious solution. This seems to be a never-ending battle.

Wednesday, April 28, 2010

Would a rose by any other name smell as sweet?

Whenever there is a huge rush to rename anything from a street to a nation, people often wonder what the point of it all is. Wouldn't a rose by any other name smell as sweet? Well, the name is important, or else we would not have a rename file option in Windows.

Let's see if it would smell as sweet in the cases below, in which renaming was done:

  1. Hot Springs in New Mexico changed its name to Truth or Consequences to get a popular radio program hosted from the city.
  2. Clark in Texas (America) renamed itself DISH and got itself free satellite television for 10 years in the bargain.
  3. Más a Tierra in Chile renamed itself Robinson Crusoe Island.
The new names rock, at least in these 3 cases. Maybe we should have thought a little more and come up with really catchy names for our Indian cities before renaming them. We didn't even get any commercial benefits from renaming them, just the headache of having to change the names in all the places. Here are a few "cool" names that we could have used instead:

Kolkata - A Coal Cat
Mumbai - Bye Mom
Pune - Puma
Bengaluru - U R Ben's Gal

Do let me know if you come up with something cool to rename to.

Finding patterns and rules in english words

I took a few hundred English words starting with the letter "G" and a few starting with the letter "E", hoping to find some pattern or rule in the way the letters after these first letters were arranged. I managed to get as many as 51 rules for the way the letters appeared. The rules were all generated based only on the occurrence of a particular letter at a particular position.

The most effective and pervasive rule was if Letter2(A) => D(G). So in my sample, the words starting with the letter "G" included a lot of words with "A" as the second letter.

If I made the search more knowledgeable by putting in knowledge of vowels and consonants, more rules would turn up with better patterns. It might be even more interesting to try to find patterns in sentences.
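How strongly a rule like Letter2(A) => D(G) holds in a word list can be measured as a simple fraction. The sketch below is an assumption about how such a check could be written, not the method used to generate the 51 rules: the sub name rule_confidence and the sample words are invented for illustration. It returns the share of words with a given letter at a given position that start with the given first letter.

```perl
use strict;
use warnings;

# Confidence of the rule "letter at position $pos is $letter =>
# word starts with $class": the fraction of words having $letter at
# the (1-based) position $pos that start with $class.
# rule_confidence is a hypothetical helper name.
sub rule_confidence {
    my ($words, $pos, $letter, $class) = @_;
    my ($matches, $hits) = (0, 0);
    for my $word (@$words) {
        next unless length($word) >= $pos;
        next unless lc(substr($word, $pos - 1, 1)) eq lc($letter);
        $matches++;
        $hits++ if lc(substr($word, 0, 1)) eq lc($class);
    }
    return $matches ? $hits / $matches : 0;
}

# Illustrative sample: five words have "a" as second letter, three start with "g".
my @sample = qw(gate game gift garden eagle earth echo);
printf "%.2f\n", rule_confidence(\@sample, 2, 'a', 'g');
# prints: 0.60
```

Running the same function over every letter and position would give a table of candidate rules that can be ranked by confidence.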

Tuesday, April 27, 2010

Biodiesel and Photobioreactors

Biodiesel is a fuel similar to diesel that is obtained from oil-rich plants such as rapeseed, soy, oil palm and sunflower, from used cooking oils, or from phototrophic microorganisms. It has the advantage of producing smaller amounts of greenhouse gases than fossil fuels. The main reason for the success of biodiesel is that it can be used without modification to engines and distribution systems.

Biodiesel has some of the same problems as bioethanol. It can start a competition for land with other agricultural crops, causing a decrease in food supply or an increase in food prices. Hence, the focus is on microalgae and cyanobacteria to produce biodiesel. Since the microbes can grow in saline environments, they are not as much of a threat to food crop cultivation. These methods for producing biodiesel from microbes are still experimental and slow. Developments in bioreactor design and genetic modification of the microbes may make these methods more viable in the future.

Growing the microbes required for biodiesel production requires photobioreactors as the microbes get their energy by photosynthesis. The photobioreactors can be mainly classified into open and closed systems. Open systems are lakes and natural ponds which can be used to grow the microorganisms. Closed systems are tubular or flat panel shaped bioreactors. The tubular bioreactors can be horizontal or vertical. Closed bioreactors have the advantage of not being contaminated and can be easily controlled. Open bioreactors have cost benefits.

The design of the bioreactor is driven by various factors such as light considerations, gas exchange, nutrient availability, product recovery and contamination. Proper mixing is required to ensure time for both dark and light reactions to occur. Cooling is required to remove the heat due to high irradiation. Too much light is absorbed by the cells at the surface of the culture and lost as heat; this is known as the shading problem. Genetically engineering the cells to have smaller photosynthetic antennae seems to reduce this problem considerably.

Idea: The shading problem can be overcome by having cells of two different types in the reactor. The first type of cells do the light reaction and are positioned at the surface of the culture. Second type of cells do the dark reaction below the surface. The two cell types interact and exchange the products of their respective reactions through the medium.

Monday, April 26, 2010

Accessing Windows drives from the Linux terminal

Many of us use Linux and Windows on the same machine. The new breed of Linux installs does a wonderful job of mounting the Windows partitions for us. We can open the files and work on them from the comfort of the GUI.

Some files and programs need to be run from the command line or terminal, especially in the case of Linux. How do we access the Windows drives on the command line, since Linux does not have the C, D and E drives implemented there?

The drives can be easily accessed from the /media/ folder. Just type cd /media in the terminal and you will land in the media folder. This folder will contain the list of all drives mounted in the recent past. Many of these folders will not be accessible outside the 'root' group of users. To know which folders are your Windows drives, just run ls -lrt in the terminal. The Windows drive folders will have read, write and execute permission for you. Just cd into that folder and access all the files on the Windows drives from the Linux terminal.
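The permission check done by eye with ls -lrt can also be scripted. This is a minimal sketch under the assumption that the drives are mounted directly under /media (some distributions use /media/username instead); the sub name accessible_dirs is made up for illustration. It lists the subfolders that the current user can read, write and enter.

```perl
use strict;
use warnings;

# Return the names of the subdirectories of $dir for which the current
# user has read, write and execute permission.
# accessible_dirs is a hypothetical helper name.
sub accessible_dirs {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @entries = readdir $dh;
    closedir $dh;
    my @found;
    for my $entry (sort @entries) {
        next if $entry eq '.' || $entry eq '..';
        my $path = "$dir/$entry";
        push @found, $entry if -d $path && -r $path && -w $path && -x $path;
    }
    return @found;
}

# List candidate Windows drive folders, if /media exists and is readable.
if (-d '/media' && -r '/media') {
    print "$_\n" for accessible_dirs('/media');
}
```

Note that when run as root every folder passes the test, so the script is only informative for an ordinary user.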

Sunday, April 25, 2010

First contact - fear of the aliens returns

The very first time we humans meet an alien race is a dream/nightmare scenario that has been played out many times in books and movies alike. The very existence of alien life forms has been postulated and hypothesized based on probability calculations. Considering the size and content of the universe, it might be expected that life could originate in other parts of the universe. Whether these life forms will just be microbes or intelligent life forms is a totally different question.

The scientific community has seen contact with an alien race as a positive thing, going as far as to send out signals into space with directions to our planet. The search for extraterrestrial intelligence is underway, with many different approaches being tried. Benefits of a friendly alien race aside, the possibility of life outside planet Earth gives us hope of a second home which we may one day require.

Is it really wise to be doing this? What if the alien race just plunders Earth's resources? Concerns regarding this have been raised by the cosmologist Stephen Hawking, famous for his book "A Brief History of Time". Be it microbial or any other form of life we encounter, we may not be ready to face the complications that might arise. New and deadly diseases of alien origin could wipe out the entire planet. So do we just lie low till we are technically more advanced, or do we try our luck in finding an alien?

Taking candy from a baby

"Taking candy from a baby" is defined as something that is easy to achieve in Wiktionary and many other sites. However, is our interpretation correct? Come to think of it, there are a lot of things that are easier than taking candy from a baby. Similes such as "easy as falling off a log" describe something that's just easy to achieve, as nothing mean is involved.

A more appropriate usage would be doing something mean thats easy at the same time. Does it just reflect that being mean is so much part of everyday lives that we are just not thinking about it at all? Just take a look at this video, you will get an idea of how mean it really is. The baby will surely cry for a while. Taking candy from a baby sure requires a strong will and a stone heart.

It seems there is also an explanation that the origin of this simile has nothing to do with candies or babies. Its interpreted as taking C.A.N.D (abbreviation for a type of cargo)from Bay B of a ship. Since ships generally use Bay B as the sick bay, its impossible to locate it, let alone take the cargo out of it. So basically taking candy from Bay B was impossible. A linguist (April 1st joke) explains how the expression originated on board ships and got to mean something else after coming ashore.

We may never be able to prove if this is just a result of liberal attitudes towards being mean or the result of not having Bay B in ships.

Thursday, April 22, 2010

A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome

A huge effort is being put into finding a treatment for AIDS. HIV, the causative agent of AIDS, has been studied in ever increasing detail to produce effective antiviral therapies. The high rate of replication and mutability of the virus leads to rapid drug resistance in the virus. Efforts to overcome the AIDS pandemic would require drugs or drug regimens that can control the drug resistance in the virus.

Reverse transcriptase is one of the viral enzymes and is required for reverse transcribing the viral RNA into DNA. This step is required for the viral genes to get integrated into the host genome. Only after integrating into the host genome can the virus replicate and propagate. Drugs that inactivate this enzyme can be very effective in stopping the replication of the virus. However, the rapid emergence of drug resistance in the enzyme has made it difficult to treat AIDS with any single drug. Among the 25 drugs currently used in HIV therapy, 12 attempt to inhibit the reverse transcriptase enzyme.

Drug resistance generally occurs due to a non-linear combination of mutations. Being able to predict if a drug will be effective against a particular mutant has been a useful tool in treatment. Further research has also given details about the mechanisms of drug resistance development and functionally important regions in the enzyme.

In this study, local physicochemical properties of a protein sequence were used to understand and predict drug resistance. Annotated data from the Stanford HIV database was used to classify the mutants into three groups labeled as susceptible, moderately resistant and resistant. Monte Carlo feature selection was used to select the best features from a total of 7 * 560 properties. The selected features were used to generate rules to classify the sequences into the correct class. The method was tried on the data available for different antivirals.

Evaluation of the results was done by 10-fold cross validation of the data. Performance of the classifier was assessed based on prediction accuracy and area under the curve. The sites identified as responsible for the resistance were found to correlate with the known sites. New sites that could lead to resistance have also been predicted. The newly discovered sites seem to cause resistance by disturbing the complex network of hydrophobic and polar interactions responsible for the stability of the tertiary structure.

Wednesday, April 21, 2010

Bioethanol - fuel of the future?

Most of the alternative energy sources such as solar, wind and nuclear energy have the major drawback of not being useful as automobile fuels. Automobiles are one of the major consumers of crude fuels today. This makes it necessary to have an alternative energy source that can be used with the automobile engines in use today with little or no modification. Bioethanol is one such alternative which has shown significant potential.

Ethanol is produced by fungi such as Saccharomyces cerevisiae and bacteria such as Zymomonas mobilis. The raw material for this production of ethanol is sugar plants, cereals or lignocellulose. The use of food crops for ethanol production has the disadvantage of having a negative impact on food production. Hence, the use of lignocellulose is a very attractive alternative.

Ethanol production is dependent on having effective production and storage of raw materials, pretreatment, fermentation, the production step itself and transport and use of the final product. Each of these steps has many problems which have to be overcome. Cost benefits and impact on other agricultural products are the main concerns with respect to bioethanol.

Production of raw materials for ethanol production has to consider the impact on the environment due to increased usage of fertilizers and pesticides. There has also been significant concern regarding the reduction in the rain forest to meet the energy needs. However, sugar cane is not grown on rain forest land and is not actually having any impact on the rain forests.

Storage of raw materials has to provide the optimal conditions to maintain the correct water content for later use in fermentation. The raw material should also be protected from contamination and degradation during storage. Improvements in the fermentation and refinement of ethanol are also required to get better yields. Ethanol production has the advantage of being produced locally in most of the regions. However, concerns include over-utilization of land and destruction of rain forests. Integration of the different steps in the production of ethanol will increase efficiency.

Bioethanol is more sustainable than fossil fuels, but it may not be able to solely fulfill the growing need for energy.

Tuesday, April 20, 2010

Solar Energy and Solar cells

With the growing need for energy, alternative energy sources are being developed and refined. Solar energy is an attractive alternative as it is a relatively clean, renewable source of energy. Solar energy has been utilized in various ways such as for direct heating, electricity production and biomass production.

The energy needs by the year 2050 have been projected to be 28 TW in comparison to the currently used 11 TW. Although solar energy could probably provide a significant share of the required energy, it needs to be made available at a reasonable cost. The price of a 100 W silicon panel for converting solar to electric energy is 350 to 400 US dollars, which is too expensive to be practical. The exponential growth of about 40% per year has been mostly driven by huge subsidies from the government.

A solar cell is a device that converts solar energy directly into electricity[1]. The first generation solar cells transform light energy by using crystalline or amorphous silicon as inorganic solid-state material. The first generation cells are very expensive due to the cost involved in purification and production of the solar cells. The second generation solar cells make use of thin film as the core of the solar cell. The third generation of solar cells is inspired by photosynthesis and has shown the potential to be more cost effective.

Dye-sensitized solar cells have been used to generate a potential gradient to generate electricity. These solar cells have shown good performance in diffuse light and have a low investment cost to initiate production. The dye stability has been improved up to 15 years in sunlight by continued research. Titanium dioxide has emerged as the semiconductor of choice due to its abundance, non-toxicity, cost and compatibility.

Solar cells are facing the problem of scalability as the third generation cells are not cost effective at large scale. Further developments in the field would be focused on better conversion efficiency and lower cost of production and maintenance.

My idea:

A biological model such as a living organism capable of generating the potential gradient could be an idea worth exploring, as the cost of production could be reduced. Many organisms are known to be capable of maintaining potential gradients. The challenge would probably be to combine the potential gradients of individual cells to get a net higher potential.

Sunday, April 18, 2010

Copy Number Variation - the root of all evil and good?

Human genomes contain repeated segments of DNA through most of the genome. When the number of copies of the repeats varies between different human beings, we have a copy number variation. A few regions of the genome are popular locations for such variations. These regions, which have different numbers of copies, are known as copy number variant regions.

Copy number variable regions (CNVRs) are of particular interest as they can be responsible for diseases and phenotypic differences between individuals. Similar to single nucleotide polymorphisms (SNPs), which are disease markers, these regions have been associated with many conditions. These regions have also been associated with resistance to infection by HIV and malaria.

The importance of these regions becomes apparent as copy number variation maps are being generated and updated into genomic variant databases. This type of data will be useful in understanding the relations between CNVRs and specific characters. The evolutionary impact of these regions could also be analysed to get an understanding of how evolution proceeded.

It will require a clearer understanding of the role of CNVRs to really appreciate how much they influence the different characters. Maybe they are the root of all evil and good, but then again they might just be a part of the bigger puzzle.

Tuesday, April 13, 2010

Monday Morning - Swami and Friends

Our hate for Monday mornings may have something to do with the same things as Swaminathan's feelings of unpleasantness about this day. The joy of having enjoyed the freedom of Saturday and Sunday is overshadowed by the dreary nature of Monday mornings. Apart from the obviously long list of things to complete from the past two days, there is always this feeling of foreboding about Monday which is difficult to overcome.

Anything could happen on a Monday. Maybe Mr Ebenezar would take upon himself the duty of teaching us idiots the futility of idol worship. Worse still, we might feel the insane urge to contradict and question him. Even if we do manage to get through all this ruckus, will we have the sense not to tell anybody at home about it during the meal, and so avoid getting a stern letter written to the principal the next day?

Everything is not bad, as we have a few things to look forward to even on Mondays. There is the 12.30 mail to watch from the window. If we manage to solve the 5 arithmetic “puzzles” correctly, life would not be that bad.

Hydrogen from solar energy and water?

Industrialization has been largely driven by the continued discovery of oil reserves. However, the number of oil findings is decreasing. A future with no oil left to use is a reality we have to face. Apart from the obvious problem of scarcity, fuels such as oil, coal and gas are known to contribute to the problem of global warming. The situation is further complicated by the growing need for energy from users who are yet to start using the energy resources.

Many alternative strategies such as solar energy, wind, nuclear, tidal, geothermal etc. have been proposed to solve these problems. Although these alternative sources might be able to provide energy, it might not be possible to use them effectively as fuels for transportation systems. Transportation systems, being the major consumer of fuels today, may need a different approach. Loss of energy during the conversion process has made it necessary to have a direct product which can be used as a fuel.

Use of solar energy to produce fuels such as hydrogen has gained importance in this context. Hydrogen could be directly used as a fuel, and the lack of carbon in the fuel source makes it a rather clean source of energy. The problems of scarcity and global warming can be tackled with this interesting approach. Two main approaches are being pursued to achieve this goal of using water to produce hydrogen using solar energy. The first approach is the photobiological method, which aims to create or alter a biological system to convert solar energy into hydrogen using water as raw material. The second approach is the chemical method, which uses photosystems or molecules that imitate photosystems, coupled to other molecules, to drive reactions that convert water into hydrogen.

Photosystem II uses solar energy to oxidize water, releasing electrons. This reaction is rather efficient, although the other steps happening in the biological systems are not as efficient. Hence, the aim is to mimic just this step of the process from nature. The chemical approach has used molecules such as ruthenium linked to the photosystems to act as an electron acceptor from manganese. This is used to drive the reaction to produce hydrogen from water. The enzyme hydrogenase, which can catalyse the reaction to produce hydrogen, is used in this second step of the reaction.

Biological systems such as Nostoc produce hydrogen in specialised cells using the enzyme nitrogenase. Currently large and small scale reactors are being developed to produce hydrogen from such biological systems and make them as effective as possible.

The direct methods of producing fuels have been found to be much more effective than the indirect methods which require the energy to be converted to electricity which is then used to split the water molecules by electrolysis.

Sunday, April 11, 2010

Brick BAT - iGEM brick biosafety Assessment tool

With the increasing popularity of genetic engineering and synthetic biology, we are on the way to malicious biological content. We have seen programs like spyware and malware come out of the IT revolution. What horrors will come out of the advances in biology?

What if the crops of entire nations are held to ransom by a pathogenic viral strain? Worse still, the human population may be threatened. Organisms that copy our genetic material and take it for analysis without our permission are a distant possibility. Such spyteria (spy + bacteria) are a threat we must get ready to face.

A popular synthetic biology contest called iGEM (International Genetically Engineered Machines) is encouraging an engineering-based approach to synthetic biology. Although they are very serious about biosafety, a categorisation system for the various parts and components was not being used. I have come up with a simple categorisation scheme called Brick BAT. This is a set of questions to classify the components into different levels of biosafety.

Tuesday, March 30, 2010

Blast - XML output parser

The script below parses the output from BLAST run with the XML output option and stores the values in arrays, one array per field.

#---blast output file
my $infile = "blastout";
open(IN, "<", $infile) or die "Could not open $infile: $!";

my $taketill = 10; #number of hits to read; not set in the original post, adjust as needed

while(my $line = <IN>){
chomp $line; #without this the newline is pushed as an extra array element
if ($line =~ /\<Hit_num\>/) {#reading hit number
my ($hitnum) = split(/\<\/Hit_num\>/,(split(/\<Hit_num\>/, $line))[1]);
last if ($hitnum > $taketill); #reading only the first $taketill hits
push(@hitnums,$hitnum);}
if ($line =~ /\<Hit_def\>/) {#reading definition
push(@hitdefs,split(/\<\/Hit_def\>/,(split(/\<Hit_def\>/, $line))[1]));}
if ($line =~ /\<Hit_len\>/) {#reading length
push(@hitlens,split(/\<\/Hit_len\>/,(split(/\<Hit_len\>/, $line))[1]));}
if ($line =~ /\<Hsp_bit-score\>/) {#reading bitscore
push(@bitscores,split(/\<\/Hsp_bit-score\>/,(split(/\<Hsp_bit-score\>/, $line))[1]));}
if ($line =~ /\<Hsp_score\>/) {#reading score
push(@scores,split(/\<\/Hsp_score\>/,(split(/\<Hsp_score\>/, $line))[1]));}
if ($line =~ /\<Hsp_evalue\>/) {#reading evalue
push(@evalues,split(/\<\/Hsp_evalue\>/,(split(/\<Hsp_evalue\>/, $line))[1]));}
if ($line =~ /\<Hsp_query-from\>/) {#reading query match begin
push(@qfroms,split(/\<\/Hsp_query-from\>/,(split(/\<Hsp_query-from\>/, $line))[1]));}
if ($line =~ /\<Hsp_query-to\>/) {#reading query match end
push(@qtos,split(/\<\/Hsp_query-to\>/,(split(/\<Hsp_query-to\>/, $line))[1]));}
if ($line =~ /\<Hsp_hit-from\>/) {#reading hit match begin
push(@hfroms,split(/\<\/Hsp_hit-from\>/,(split(/\<Hsp_hit-from\>/, $line))[1]));}
if ($line =~ /\<Hsp_hit-to\>/) {#reading hit match end
push(@htos,split(/\<\/Hsp_hit-to\>/,(split(/\<Hsp_hit-to\>/, $line))[1]));}
if ($line =~ /\<Hsp_query-frame\>/) {#reading query frame
push(@qframes,split(/\<\/Hsp_query-frame\>/,(split(/\<Hsp_query-frame\>/, $line))[1]));}
if ($line =~ /\<Hsp_hit-frame\>/) {#reading hit frame
push(@hframes,split(/\<\/Hsp_hit-frame\>/,(split(/\<Hsp_hit-frame\>/, $line))[1]));}
if ($line =~ /\<Hsp_identity\>/) {#reading identities
push(@identities,split(/\<\/Hsp_identity\>/,(split(/\<Hsp_identity\>/, $line))[1]));}
if ($line =~ /\<Hsp_positive\>/) {#reading positives
push(@positives,split(/\<\/Hsp_positive\>/,(split(/\<Hsp_positive\>/, $line))[1]));}
if ($line =~ /\<Hsp_align-len\>/) {#reading alignment length
push(@algnlens,split(/\<\/Hsp_align-len\>/,(split(/\<Hsp_align-len\>/, $line))[1]));}
}#end of file while loop
close(IN);
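The repeated split calls can also be folded into one small helper. The following is only a minimal sketch, assuming each value sits on a single line between its opening and closing tag (as in the script above). The names tag_value and collect_fields and the sample lines are made up for illustration and are not part of BLAST or of the original script.

```perl
use strict;
use warnings;

# Return the text between <$tag> and </$tag> on one line, or undef if the tag is absent.
sub tag_value {
    my ($line, $tag) = @_;
    return $line =~ /<\Q$tag\E>(.*?)<\/\Q$tag\E>/ ? $1 : undef;
}

# Collect the requested fields into a hash of arrays, keyed by tag name.
sub collect_fields {
    my ($lines, $tags) = @_;
    my %values;
    for my $line (@$lines) {
        for my $tag (@$tags) {
            my $value = tag_value($line, $tag);
            push @{ $values{$tag} }, $value if defined $value;
        }
    }
    return \%values;
}

# Made-up sample lines in the shape of BLAST XML output (hypothetical data).
my @sample = (
    "  <Hit_num>1</Hit_num>",
    "  <Hit_def>example protein</Hit_def>",
    "  <Hsp_bit-score>52.4</Hsp_bit-score>",
    "  <Hsp_evalue>2e-08</Hsp_evalue>",
);
my $fields = collect_fields(\@sample, ['Hit_num', 'Hit_def', 'Hsp_bit-score', 'Hsp_evalue']);
print "$_: @{ $fields->{$_} }\n" for sort keys %$fields;
```

For anything beyond a quick script, a real XML parser would probably be more robust than line-by-line matching, since it does not depend on how the file is laid out.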

Friday, March 19, 2010

Strand-specific translation of DNA to amino acid sequence in Perl

Perl script to read the annotation file, translate the ORFs according to their strand, insert the translations into the MySQL database and write them into a multifasta file.

use DBI;

$dbh = DBI->connect('DBI:mysql:meta', 'root', 'password'
) || die "Could not connect to database: $DBI::errstr";

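my ($infile) = @ARGV;
open(IN, "<", $infile) or die "Could not open $infile: $!";
my $outfile ="multifasta";
open(OUT, ">", $outfile) or die "Could not open $outfile: $!";

my ($jcvi_read,$strand,$start,$stop,$pep,$ofset,$leng,$wrtseq,@seq);

while(my $line = <IN>){
chomp $line;

my @a = split(/ /, $line);
my @c = split(/\s+/, $line);
my @b = split(/\_/, $a[0]);
$jcvi_read = $b[2];#assumption: read id taken as in the metadata loading script; this assignment is missing in the original post
$start = $c[3];
$stop = $c[4];
$strand = $c[6];
$pep = $c[8];
$leng = $stop - $start + 1;#length of the ORF in bases

$sth = $dbh->prepare("SELECT sequence FROM metadata WHERE jcvi_read=?");
$sth->execute($jcvi_read);

@seq = $sth->fetchrow_array();

$ofset=$start - 1;
$wrtseq=substr $seq[0], $ofset, $leng;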


#function to translate DNA to Amino acid based on standard genetic code
sub codon2aa{
my ($codon) = @_;
$codon= uc $codon;
my(%genetic_code) = (
'TCA'=>'S', #Serine
'TCC'=>'S', #Serine
'TCG'=>'S', #Serine
'TCT'=>'S', #Serine
'TTC'=>'F', #Phenylalanine
'TTT'=>'F', #Phenylalanine
'TTA'=>'L', #Leucine
'TTG'=>'L', #Leucine
'TAC'=>'Y', #Tyrosine
'TAT'=>'Y', #Tyrosine
'TAA'=>'_', #Stop
'TAG'=>'_', #Stop
'TGC'=>'C', #Cysteine
'TGT'=>'C', #Cysteine
'TGA'=>'_', #Stop
'TGG'=>'W', #Tryptophan
'CTA'=>'L', #Leucine
'CTC'=>'L', #Leucine
'CTG'=>'L', #Leucine
'CTT'=>'L', #Leucine
'CCA'=>'P', #Proline
'CAT'=>'H', #Histidine
'CAA'=>'Q', #Glutamine
'CAG'=>'Q', #Glutamine
'CGA'=>'R', #Arginine
'CGC'=>'R', #Arginine
'CGG'=>'R', #Arginine
'CGT'=>'R', #Arginine
'ATA'=>'I', #Isoleucine
'ATC'=>'I', #Isoleucine
'ATT'=>'I', #Isoleucine
'ATG'=>'M', #Methionine
'ACA'=>'T', #Threonine
'ACC'=>'T', #Threonine
'ACG'=>'T', #Threonine
'ACT'=>'T', #Threonine
'AAC'=>'N', #Asparagine
'AAT'=>'N', #Asparagine
'AAA'=>'K', #Lysine
'AAG'=>'K', #Lysine
'AGC'=>'S', #Serine
'AGT'=>'S', #Serine
'AGA'=>'R', #Arginine
'AGG'=>'R', #Arginine
'CCC'=>'P', #Proline
'CCG'=>'P', #Proline
'CCT'=>'P', #Proline
'CAC'=>'H', #Histidine
'GTA'=>'V', #Valine
'GTC'=>'V', #Valine
'GTG'=>'V', #Valine
'GTT'=>'V', #Valine
'GCA'=>'A', #Alanine
'GCC'=>'A', #Alanine
'GCG'=>'A', #Alanine
'GCT'=>'A', #Alanine
'GAC'=>'D', #Aspartic Acid
'GAT'=>'D', #Aspartic Acid
'GAA'=>'E', #Glutamic Acid
'GAG'=>'E', #Glutamic Acid
'GGA'=>'G', #Glycine
'GGC'=>'G', #Glycine
'GGG'=>'G', #Glycine
'GGT'=>'G', #Glycine
);

if(exists $genetic_code{$codon}){
return $genetic_code{$codon};
}
#codons that are not in the table (for example those containing N) are translated as X
return 'X';
}

if($strand eq '-')
#reverse complementing -ve strands
{$wrtseq=reverse $wrtseq;
$wrtseq=~ tr/ATGC/TACG/;}

my $dna=$wrtseq;
#my $dna="TCATTCTCATTC";
my $protein='';
my $codon3;
for(my $i=0; $i<(length($dna)-2); $i+=3){
$codon3=substr($dna,$i,3);
$protein.= codon2aa($codon3);
}
print $pep."\n";
$dbh->do("INSERT into iddata (jcvi_read,strand,start,stop,protein,orf_id) VALUES(?,?,?,?,?,?)", undef, $jcvi_read,$strand,$start,$stop,$protein,$pep);

print OUT ">".$jcvi_read."|".$pep."\n";
print OUT $protein."\n";

}#end of file while loop
close(IN);
close(OUT);
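The translation step can be tried out without a database. The following is a minimal sketch of the same idea (reverse complement the minus strand, then translate codon by codon), not the script above. It builds the standard genetic code from the compact layout of NCBI translation table 1 instead of typing out all 64 codons, and it marks stop codons with '*' rather than '_'. The function names reverse_complement and translate are chosen here for illustration.

```perl
use strict;
use warnings;

# Build the standard genetic code from the compact NCBI table 1 layout
# (bases ordered T, C, A, G at each codon position); '*' marks stop codons.
my @bases = qw(T C A G);
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %genetic_code;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $genetic_code{"$first$second$third"} = substr($amino_acids, $index++, 1);
        }
    }
}

# Reverse the sequence and swap each base for its complement.
sub reverse_complement {
    my ($dna) = @_;
    my $revcomp = scalar reverse uc $dna;
    $revcomp =~ tr/ACGT/TGCA/;
    return $revcomp;
}

# Translate in frame 1 of the given strand; unknown codons become 'X'.
sub translate {
    my ($dna, $strand) = @_;
    $dna = $strand eq '-' ? reverse_complement($dna) : uc $dna;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $dna; $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= exists $genetic_code{$codon} ? $genetic_code{$codon} : 'X';
    }
    return $protein;
}

print translate('ATGGCCTAA', '+'), "\n";   # plus strand ORF
print translate('TTAGGCCAT', '-'), "\n";   # the same ORF given on the minus strand
```

Both calls should print the same peptide, since the second sequence is the reverse complement of the first.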

Thursday, March 4, 2010

Updating MySQL database from Perl

Perl script to read a multifasta file, calculate GC content and GC skew, and insert the values into a MySQL database.


use DBI;

$dbh = DBI->connect('DBI:mysql:meta', 'root', 'password'
) || die "Could not connect to database: $DBI::errstr";

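my ($infile) = @ARGV;
open(IN, "<", $infile) or die "Could not open $infile: $!";

my ($jcvi_read,$leng,$gccont,$gcskew,$seq)=("",0,0,0,"");

while(my $line = <IN>){
chomp $line;
if ($line =~ /^\>/) {#reading header line

if($jcvi_read ne ""){
$dbh->do("INSERT into metadata (jcvi_read,full_length,gc_content,gc_skew_orient,sequence) VALUES(?,?,?,?,?)", undef, $jcvi_read,$leng,$gccont,$gcskew,$seq);#inserting the previous record into mySql Database
($gccont,$gcskew,$seq)=(0,0,"");#resetting the counters for the next sequence
}
my @a = split(/\//, $line);
my @b = split(/\_/, $a[0]);
$jcvi_read = $b[2];
my @c = split(/\=/, $a[1]);
$leng = $c[1];
# print $jcvi_read."\n";
}
else{#reading and analysing the sequence
$seq = $seq.$line;
my @char =split(//,$line);
foreach (@char){
if($_ eq "G"|| $_ eq "C"){$gccont++;}# Calculating GC content (count of G and C)
if($_ eq "G"){$gcskew++;}#Calculating GC skew (count of G minus count of C)
if($_ eq "C"){$gcskew--;}
}
#print $gcskew."\n";
}
}#end of file while loop

#the last record has no following header line, so it is inserted here
if($jcvi_read ne ""){
$dbh->do("INSERT into metadata (jcvi_read,full_length,gc_content,gc_skew_orient,sequence) VALUES(?,?,?,?,?)", undef, $jcvi_read,$leng,$gccont,$gcskew,$seq);
}
close(IN);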
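The script stores raw counts: the number of G and C bases, and the number of G minus the number of C. GC content and GC skew are also often reported as normalised values, (G+C)/length and (G-C)/(G+C). The following is a small sketch, assuming an upper or lower case DNA string as input, that returns both forms. The function name gc_stats is made up for this example.

```perl
use strict;
use warnings;

# Return (G+C count, G-C count, GC fraction, normalised GC skew) for a sequence.
sub gc_stats {
    my ($seq) = @_;
    $seq = uc $seq;
    my $g = ($seq =~ tr/G//);   # tr with an empty replacement only counts
    my $c = ($seq =~ tr/C//);
    my $length = length $seq;
    my $gc_fraction = $length ? ($g + $c) / $length : 0;
    my $gc_skew = ($g + $c) ? ($g - $c) / ($g + $c) : 0;
    return ($g + $c, $g - $c, $gc_fraction, $gc_skew);
}

my ($gc_count, $raw_skew, $gc_fraction, $gc_skew) = gc_stats('GGGCATAT');
printf "GC count=%d raw skew=%d GC fraction=%.3f GC skew=%.3f\n",
    $gc_count, $raw_skew, $gc_fraction, $gc_skew;
```

Counting with tr avoids splitting the sequence into single characters, which matters little for short reads but helps for long sequences.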

Saturday, February 27, 2010

GANESH and Tuberculosis - Genomic Signature

Ganesh is a Hindu god who might have something to do with tuberculosis. No, it's not a divine or supernatural connection. It's more of a coincidental connection.

As we enter the post-genomic era of creating synthetic organisms, the trend is to sign the genome with the name of the company or author of the genome in the genetic code. Assuming the various living organisms that exist today are the gods' creations, they could be expected to have been signed by the gods. BRAMHA, the god of creation, would be expected to top the list. However, since the genetic code does not have a "B", it can be ruled out.

Lord Ganesh has signed his name in many genes, as a BlastP search will tell you. Don't believe me? Take a look at the conserved protein from Mycobacterium tuberculosis T46 in NCBI.

>gi|289443584|ref|ZP_06433328.1| conserved hypothetical protein [Mycobacterium tuberculosis T46]

The genetic code seems to be more secular than most, as it contains signatures from Allah as well. Take a look at another conserved protein from Pseudomonas. The Greek goddess Demeter could also be located in Paenibacillus.
>gi|289627952|ref|ZP_06460906.1| enterobactin synthase subunit F [Pseudomonas syringae pv. aesculi str. NCPPB3681]

Do you know of any other secret messages that the gods encoded in the genome?