Monday, March 25, 2013

R - Defining Your Own Color Schemes for HeatMaps

This post is intended for those who are new to R, and was inspired by a small post on Martin's bioblog.

First, we plot a "correlation heatmap" using the same logic that Martin uses. In our example, let's use the movies dataset that comes with ggplot2.

We take the 6 genre columns and compute the correlation matrix for them.
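A minimal sketch of that step (the subsetting is mine; note that newer versions of ggplot2 have since moved the movies data into the ggplot2movies package):

library(ggplot2)   # older ggplot2 versions bundle the movies dataset;
                   # newer ones moved it to the ggplot2movies package
genreCols <- c("Action", "Animation", "Comedy", "Drama",
               "Documentary", "Romance")
movieGenres <- movies[, genreCols]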
Here's what the matrix looks like:

> cor(movieGenres) # 6x6 cor matrix
                  Action   Animation      Comedy        Drama
Action       1.000000000 -0.05443315 -0.08288728  0.007760094
Animation   -0.054433153  1.00000000  0.17967294 -0.179155441
Comedy      -0.082887284  0.17967294  1.00000000 -0.255784957
Drama        0.007760094 -0.17915544 -0.25578496  1.000000000
Documentary -0.069487718 -0.05204238 -0.14083580 -0.173443622
Romance     -0.023355368 -0.06637362  0.10986485  0.103545195
            Documentary     Romance
Action      -0.06948772 -0.02335537
Animation   -0.05204238 -0.06637362
Comedy      -0.14083580  0.10986485
Drama       -0.17344362  0.10354520
Documentary  1.00000000 -0.07157792
Romance     -0.07157792  1.00000000


When we plot with the default colors we get:
It is difficult to see the details in the tiles. If you want better control over the colors, you can use the handy colorRampPalette() function and combine that with scale_fill_gradient2.
Let's say that we want "red" colors for negative correlations and "green" for positives.
(We can gray out the 1 along the diagonal.)
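Here is a sketch of what that can look like (the melt() reshape and the exact colors are my choices; the original code may differ):

library(reshape2)                  # for melt()
corMat <- cor(movieGenres)
diag(corMat) <- NA                 # gray out the 1's on the diagonal
corDF <- melt(corMat, varnames = c("Genre1", "Genre2"),
              value.name = "Correlation")
ggplot(corDF, aes(Genre1, Genre2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "darkgreen",
                       limits = c(-1, 1), na.value = "grey50")
# scale_fill_gradientn() works too, with a vector of colors built by
# colorRampPalette(c("red", "white", "darkgreen"))(11)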

Doing this produces:

If there are values close to 1 or to -1, those will pop out visually. Values close to 0 are a lot more muted.

Hope that helps someone.

Reference: Using R: Correlation Heatmap with ggplot2 (Martin's bioblog)





Monday, March 18, 2013

R - Simple Recursive XML Parsing

This is intended for those who are starting out in R and interested in parsing an XML document recursively. It uses Duncan Temple Lang's XML package.

If you only want to read certain types of nodes, then XPath is great, and this document by Duncan Temple Lang is perfect for that.

However, if you want to read the whole document, you have to recursively visit every node. Here's how I ended up doing it. The generic function visitNode could be useful if you are just starting out reading XML in R.
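Here is a minimal sketch of such a visitor (the traversal logic is the point; the printing is just an example, and sample.xml is a placeholder file name):

library(XML)

visitNode <- function(node, depth = 0) {
  if (is.null(node)) return(invisible(NULL))
  # do something with the current node; here, print its name, indented
  cat(rep("  ", depth), xmlName(node), "\n", sep = "")
  # then recurse into each child
  for (child in xmlChildren(node)) {
    visitNode(child, depth + 1)
  }
}

doc <- xmlParse("sample.xml")   # placeholder file name
visitNode(xmlRoot(doc))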

The full code, along with a sample XML file to test it is here.



Monday, March 11, 2013

Simulating Allele Counts in a Population Using R

This post is inspired by the Week 7 lectures of the Coursera course "Introduction to Genetics and Evolution" (I highly recommend this course for anyone interested in genetics, BTW.) Professor Noor uses a University of Washington program called AlleleA1 for trying out scenarios.

We can just as well use R to get an intuitive feel for how Alleles and Genotypes propagate or die out in populations.

Basic Scenario

There are N individuals on an isolated island. Say we are interested in two specific alleles (big "A" or small "a"). This in turn means the individuals can have 3 genotypes: AA, Aa or aa. The individuals mate in pairs, produce two offspring, and die. (Thus the total population remains the same generation after generation.)
The genotype of the offspring depends on those of the parents. A 'gamete' carries only one parental allele, depending on the parent's genotype: an AA parent can only produce gametes of type A, an aa parent only type a, but an Aa parent can produce either type.

A Punnett square mapping parents' gametes to offspring genotypes:

      | A  | a
   ---+----+----
    A | AA | Aa
    a | Aa | aa

With these simple rules, we can use an R simulation script to observe what happens to the allele frequencies over generations. (The goal here is to learn to use R for Monte Carlo simulations.)

Writing the R Script from scratch

I toyed around with the idea of using character strings for the genotypes and the alleles. But then I realized that there are only three genotypes, so I could just as easily represent them with the numbers 1, 2 and 3 in a simple R vector.
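For example (the encoding and the starting mix are just my illustration):

# genotype encoding: 1 = AA, 2 = Aa, 3 = aa
N <- 20
population <- c(rep(1, N/2), rep(3, N/2))   # start half AA, half aa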


With that done, we can write very simple functions for the procreation process.
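Something along these lines (a sketch; the function names are mine, and pA is the probability that an Aa parent transmits A, 0.5 for fair transmission):

# one gamete from a parent: AA (1) always gives "A", aa (3) always "a",
# and Aa (2) gives either, "A" with probability pA
gamete <- function(genotype, pA = 0.5) {
  if (genotype == 1) return("A")
  if (genotype == 3) return("a")
  if (runif(1) < pA) "A" else "a"
}

# combine one gamete from each parent into an offspring genotype code
offspring <- function(parent1, parent2, pA = 0.5) {
  nA <- sum(c(gamete(parent1, pA), gamete(parent2, pA)) == "A")
  c(3, 2, 1)[nA + 1]   # 0 A's -> aa (3), 1 -> Aa (2), 2 -> AA (1)
}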
With these useful functions, we can take one generation and produce another, 2 offspring for each set of 2 parents.
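One way to sketch that step:

# shuffle the population into random pairs; each pair leaves 2 offspring
nextGeneration <- function(pop, pA = 0.5) {
  pop <- sample(pop)
  kids <- numeric(length(pop))
  for (i in seq(1, length(pop), by = 2)) {
    kids[i]     <- offspring(pop[i], pop[i + 1], pA)
    kids[i + 1] <- offspring(pop[i], pop[i + 1], pA)
  }
  kids
}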
Putting it all together, we can generate multiple trials.

We also need to compute the allele counts for each generation; for plotting I use ggplot.
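A sketch of how that might fit together (it relies on the functions above; the plotting details in the original script may differ):

# copies of allele A: each AA (1) carries 2, each Aa (2) carries 1
countA <- function(pop) 2 * sum(pop == 1) + sum(pop == 2)

# run one trial for G generations, tracking the A count
runTrial <- function(pop, G, pA = 0.5) {
  counts <- numeric(G)
  for (g in 1:G) {
    counts[g] <- countA(pop)
    pop <- nextGeneration(pop, pA)
  }
  counts
}

library(ggplot2)
G <- 100
trials <- do.call(rbind, lapply(1:5, function(t)
  data.frame(trial = factor(t), generation = 1:G,
             alleleA = runTrial(population, G))))
ggplot(trials, aes(generation, alleleA, color = trial)) + geom_line()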



Using this simple Monte Carlo "toy" we can develop quite a bit of intuition.



For small starting populations, either the big A or the small a allele takes over the entire population fairly quickly. Given a large enough number of generations, one of the alleles invariably gets wiped out.














As one example, we can see that even a small increase in the transmission probability of allele A, to 0.53 (up from 0.5), makes it take over quite dramatically.











Conversely, setting it to a value under 0.5 means that the big A allele gets wiped out of the entire population.














The entire R script can be found here. You can download the code and play with various starting scenarios, changing the starting population counts, the number of generations, and the probabilities.


References:
  1. Coursera.org (Introduction to Genetics and Evolution by Mohamed Noor, Week 7 lectures)
  2. AlleleA1 software from the University of Washington

Saturday, February 23, 2013

Simulating Population Growth in Cities Using R

R is great for anyone who wants to get started learning simulation (both discrete-event and agent-based, with stochastic elements in the process).

This post is inspired by Matt Asher's "quick-and-dirty" R simulation work on Population Growth. Matt uses it to create aRt. I felt that his core idea provided a very good framework to introduce various concepts of simulation.

The Basic Idea behind this Simulation

We take an area (it could be a country, state, city) that is made up of units (cells) that can be settled (occupied).
We allow the initial 'pioneers' to settle first. (Seeding the cells)
We then establish some rules for future residents to settle. (Settling Rules)
We plot the area, and collect some stats to study the growth of the population under different starting conditions and different settling rules.

Pseudo-Code for the Simulation

Define parameters: area width and height,
number of pioneers P and future settlers S

for each pioneer p {
  settle p according to the seeding rules in a valid cell in the area
}
for each settler s {
  find an unoccupied cell
  settle s in that cell, if the settling criteria permit it
}
Print iteration stats
Print out the plot of the area
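In R, the area can be set up as a dataframe of cells, with random seeding of the pioneers looking something like this (a sketch; the names are mine):

# the area: one row per cell
width <- 30; height <- 30
area <- expand.grid(x = 1:width, y = 1:height)
area$status <- 0          # 0 = unoccupied, 1 = pioneer, 2 = settler

# random seeding: P pioneers dropped into random cells
P <- 30
area$status[sample(nrow(area), P)] <- 1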

A Few Changes from Matt's R code

  1. I switched from his matrix to a dataframe, and from image() to ggplot for plotting.
  2. Delineated the initial parameters, seeding, and settling clearly, so that anyone who downloads the code can experiment.
  3. Created a few different seeding functions for the pioneers (beyond random seeding).
  4. Matt allowed settling if the new cell was adjacent to an occupied cell. I played around with a few different settling functions (rectilinear, diagonal, etc.).
  5. Added a function to collect a few statistics regarding population growth.
  6. Ran the simulation a few hundred times (without plotting) to compare different growth and settling schemes.

Color Coding the Population

  • Black cells : Unoccupied
  • Blue cells: Pioneers
  • Orange cells: Settlers who found a cell in 1-2 attempts
  • Yellow cells: Settlers who found a cell after more than 2 scouting attempts
The full R code I used for these runs can be found here.

Here are some runs with a 30x30 area.

Here's what one run looked like. (Blue cells are the initial pioneers. The other colors represent the number of attempts it took each settler to find a home cell.)

Trying Out Different Seeding Schemes (for the pioneers)


As opposed to starting out with randomly strewn cells, what if we started with a densely populated central rectangle (a city downtown) and let the settlers build around it?
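A hypothetical seeding function for that (the block size is arbitrary):

# occupy a central w x h block of cells
seedDowntown <- function(area, w = 6, h = 6) {
  inBlock <- with(area, x > (max(x) - w) / 2 & x <= (max(x) + w) / 2 &
                        y > (max(y) - h) / 2 & y <= (max(y) + h) / 2)
  area$status[inBlock] <- 1
  area
}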



I also tried a rectangular ring (an annulus of pioneer cells) as the starting point, to see how the population grew inside and outside it. The seeding function looked like this:
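Since the original snippet isn't shown here, this is my reconstruction of the idea:

# occupy a square ring of cells between the inner and outer radii
seedRing <- function(area, inner = 8, outer = 10) {
  cx <- (max(area$x) + 1) / 2
  cy <- (max(area$y) + 1) / 2
  d <- pmax(abs(area$x - cx), abs(area$y - cy))   # Chebyshev distance
  area$status[d >= inner & d <= outer] <- 1
  area
}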



One final seeding scheme I tried was to have two thick bands (columns of pioneer cells) and let the population grow from there. Seeding function:
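Again, a sketch of the idea (the band positions are arbitrary):

# occupy two thick vertical bands of columns
seedBands <- function(area, band1 = 5:7, band2 = 24:26) {
  area$status[area$x %in% c(band1, band2)] <- 1
  area
}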


Different Population Growth Schemes

The default settling rule that Matt Asher uses is a good one: if a given cell is adjacent to any occupied cell (one of its 8 neighbors), then it becomes a valid cell to settle down in.

Just for experimentation, I tried a few other schemes.

1. Rectilinear Growth

Rule: You can only settle in the 4 cells due north, south, east, or west of an occupied cell.

2. Diagonal Growth

Rule: A settler can occupy any cell that is diagonally adjacent to an occupied cell.
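As a sketch, the three adjacency tests differ only in how they compare the (dx, dy) offsets between a candidate cell and an occupied cell:

isRectilinearNeighbor <- function(dx, dy) abs(dx) + abs(dy) == 1      # N, S, E, W
isDiagonalNeighbor    <- function(dx, dy) abs(dx) == 1 & abs(dy) == 1 # diagonals
isAnyNeighbor         <- function(dx, dy) pmax(abs(dx), abs(dy)) == 1 # Matt's 8-neighbor rule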

The code can be found here.


Collecting Stats

Now that we have all the functionality that we wanted, it is time to start collecting some statistics for each run ('replication'). There are many possibilities for what we could collect in a population simulation.
Here are a couple:
  • How many settlers found a 'home cell'?
  • How many scouting attempts did each settler make before they found a home?

Making Multiple Replications
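A sketch of the replication loop that produced the stats below (runOnce is a hypothetical stand-in for one full simulation, returning each settler's number of scouting attempts, with NA for settlers who never found a home):

nReps <- 5
st <- data.frame(Iter = 1:nReps, FoundHome = NA, NumSettlers = 350)
for (i in 1:nReps) {
  attempts <- runOnce(nSettlers = 350)       # hypothetical function
  st$FoundHome[i] <- sum(!is.na(attempts))
}
st$Percent <- 100 * st$FoundHome / st$NumSettlers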



> st
  Iter FoundHome NumSettlers  Percent
1    1       243         350 69.42857
2    2       214         350 61.14286
3    3       213         350 60.85714
4    4       258         350 73.71429
5    5       235         350 67.14286

The full R code I used for these runs can be found here. Feel free to give it a spin and try out your own settling/growth schemes.
Reference: Matt Asher's post on his StatisticsBlog

Wednesday, December 7, 2011

A Prerequisite to Being a Data Scientist

So what should be in the toolkit of people who call themselves data scientists?

A fundamental skill is the ability to manipulate data. A data scientist should be familiar and comfortable with a number of platforms and scripting tools to get the job done. What is difficult in Excel might be trivial in R. And when R struggles, you can switch to Unix tools (or a programming language such as Python) to get that portion of the data munging done. Along the way, you pick up a lot of tips and tricks. For example: how do you read a big data file in R?
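For instance, one common trick is to tell read.table() up front what to expect, instead of letting it guess (bigfile.csv and the column classes are placeholders):

big <- read.table("bigfile.csv", header = TRUE, sep = ",",
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 1e6, comment.char = "")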

The goal is to get the job done. Familiarity with a wide variety of tools, and expertise in some, is the hallmark of any good data scientist.

Friday, December 2, 2011

O'Reilly's Data Science Kit - Books

It is not as if I don't have enough books (and material on the web) to read. But this list compiled by the O'Reilly team should make any data analyst salivate.

http://shop.oreilly.com/category/deals/data-science-kit.do

The Books and Video included in the set are:

  1. Data Analysis with Open Source Tools
  2. Designing Data Visualizations
  3. An Introduction to Machine Learning with Web Data (Video)
  4. Beautiful Data
  5. Think Stats
  6. R Cookbook
  7. R in a Nutshell
  8. Programming Collective Intelligence

Wednesday, November 30, 2011

Tips for getting started on Kaggle (datamining)

Ever since I heard about Kaggle.com at this year's Bay Area Data Mining Camp, I've wanted to participate. But I was feeling somewhat intimidated.
Jeremy Howard's "Intro to Kaggle" talk at yesterday's MeetUp (DataMining for a Cause) was exactly what I needed.
He had a number of tips for beginners. His was exactly the talk I was looking for, though I didn't know it beforehand. I am sharing some of his tips here, in case they help others as well.

Jeremy Howard's Tips for Getting Started on Data Mining competitions at Kaggle

* Visit the Kaggle site and spend at least 30 minutes every day hanging around. Read the forum, the competition pages, and read the Kaggle blog
* It is much better to start participating in competitions that are just starting up, rather than in ones where hundreds of entries and teams are already well on their way
* Aim to make at least one submission each and every day
* Jeremy himself participates in competitions to see where he stands, and to learn and get better
* He'd start out making trivial submissions (all zeros, alternating zeros, or all entries set to the average) until his algorithm got better
* A lot of people who compete use R (and SAS, Excel or Python)
* Nearly 50% of the winning entries use Random Forest techniques.
* If you place in the top 3, that is great. But personal improvement and learning should be the goal.
* As you get better, you might get invited to "private competitions."
* Every day, strive to do a little better and improve your submission's performance, scoring and ranking

Related Links: