Tuesday, October 18, 2011

Fusion Tables by Google

Google's Fusion Tables looks impressive for anyone who wants to try geo-visualizations of their data. You don't need much programming experience to use it.

For those who want to try it out, here's a nice intro that Kathryn Hurley presented at the recent SVCC (Silicon Valley Code Camp). Combined with ShpEscape (note the spelling), it becomes very powerful.

The Guardian (UK), the Texas Tribune, and WNYC appear to be among the organizations taking advantage of it.
I'll post a couple of their examples soon. If you have a Google account, it's easy to test out Fusion Tables.

Related Link: Journalist’s guide to mapping data by county, district using ShpEscape

Monday, October 17, 2011

Get the Basics right - Suggestion for R Beginners

I am always looking for suggestions on how to get better at R, especially for beginners. So when I see someone who has become adept at it, I ask them how they got there.

This weekend, at the Bay Area ACM Data Mining Camp, one person gave me what seemed like a good suggestion. Just wanted to share it here, for anyone else who's just getting started.

He told me that R has a huge number of libraries, and if you start down that path you may learn how to use a library or two without ever learning the basics of data manipulation, which is one of R's main strengths.

His suggestion:
Get some data and learn to manipulate it: understand the differences between vectors, data frames, arrays and matrices. Only once you have this down should you start exploring the different libraries. Don't rush in to try every new library that someone praises.
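To make that suggestion concrete, here is a minimal sketch in base R (no extra libraries) of the four structures he mentioned. The variable names and values are made up purely for illustration.

```r
# A vector: one dimension, every element has the same type.
v <- c(2, 4, 6, 8, 10, 12)

# A matrix: the same values arranged in two dimensions.
# R fills matrices column by column.
m <- matrix(v, nrow = 2)

# An array: like a matrix, but with any number of dimensions.
a <- array(1:24, dim = c(2, 3, 4))

# A data frame: a list of equal-length columns that may differ in type.
df <- data.frame(id    = 1:3,
                 name  = c("a", "b", "c"),
                 score = c(9.5, 7, 8.25))

# Mixing types inside a vector coerces everything to a single type.
mixed <- c(1, "two", 3)

# str() gives a compact look at what each object really is.
str(v)
str(m)
str(a)
str(df)
str(mixed)

# Indexing differs by structure.
v[2]        # second element of the vector
m[2, 3]     # row 2, column 3 of the matrix
a[1, 2, 3]  # one cell of the 3-dimensional array
df$score    # one column of the data frame
df[df$score > 8, ]  # rows of the data frame that meet a condition
```

Running `str()` on your own data is a good habit when a function complains about its input: very often the object is not the structure you assumed it was.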

Sunday, October 16, 2011

Geo-doodlers - Paul Butler and FlowingData

I found this great R visualization example via an R-Blogger post that xingmowang made. (One more good reason to read lots of field-related blogs!)

Here's the image:

If this were merely eye candy, I would have enjoyed it but not included it here. But it was done in R, and that means the rest of us can learn from it!

When Paul Butler writes about how he created it, he shares with us how he had to tweak it, and how the results surprised him. That is true data-doodling. You guide things along, but then the data surprises (or delights) you.

I also like this small bit of musing that he includes:
"What really struck me, though, was knowing that the lines didn't represent coasts or rivers or political borders, but real human relationships. Each line might represent a friendship made while travelling, a family member abroad, or an old college friend pulled away by the various forces of life."

For those of us who are new to R, this example has a few things to try. Take any dataset with Lat/Long values in it, and plot it over a world map. Once you can do that successfully, try this.
(Also pointed out courtesy of Xingmowang.)
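
As a starting point for the first exercise, here is a rough sketch (mine, not Paul Butler's code). It uses `quakes`, a dataset that ships with R and has `lat` and `long` columns, and assumes the `maps` package for the basemap. If `maps` is not installed, it falls back to a plain scatter plot.

```r
# quakes ships with R (datasets package): 1000 seismic events near Fiji,
# with columns lat, long, depth, mag and stations.
data(quakes)
pts <- quakes[, c("long", "lat", "mag")]

if (requireNamespace("maps", quietly = TRUE)) {
  # "world2" is the Pacific-centred world map (longitudes 0 to 360).
  # It suits this dataset because its longitudes run past 180.
  maps::map("world2", col = "grey60")
  points(pts$long, pts$lat, pch = 16, cex = 0.4, col = "red")
} else {
  # Fallback without a basemap: a plain longitude/latitude scatter plot.
  plot(pts$long, pts$lat, pch = 16, cex = 0.4,
       xlab = "Longitude", ylab = "Latitude")
}
```

To use your own data, swap in a data frame with longitude and latitude columns. If your longitudes run from -180 to 180, the `"world"` database in `maps` is the one to use instead of `"world2"`.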

We may not all create infographics that are great, but these examples will point us in the right direction.

Wednesday, October 12, 2011

A true data-doodler - Christophe Ladroue (R ddply and plyr on Triathlon Results)

To me, this post by Christophe Ladroue personifies what data doodlers do.

They take a dataset that is of interest to them (in his case, his triathlon results) and then they manipulate the numbers to see what insights can be drawn. Most bloggers show only their final results, which look great, but for wannabe data doodlers it is much more fun to see the process. It is often messy, but that's the way we learn.

In Christophe's example, you will see some data cleanup, followed by plots of averages and medians across categories. Then he starts to squeeze insights out of the data. That is true data analysis.
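
If you want to practice the same pattern before reading his post, here is a small sketch of my own (not Christophe's code, and not his triathlon data). It uses the built-in `airquality` dataset: clean up missing values, compute the mean and median per category, then plot them. The `plyr` part only runs if that package is installed.

```r
# airquality ships with R: daily readings for May to September,
# with missing values (NA) in the Ozone column.
data(airquality)

# Cleanup step: keep only the rows where Ozone was actually measured.
clean <- airquality[!is.na(airquality$Ozone), ]

# Mean and median Ozone per month, using base R only.
means   <- tapply(clean$Ozone, clean$Month, mean)
medians <- tapply(clean$Ozone, clean$Month, median)
summary_tbl <- data.frame(Month        = as.integer(names(means)),
                          mean_ozone   = as.vector(means),
                          median_ozone = as.vector(medians))
print(summary_tbl)

# The same summary with plyr's ddply, if plyr is available.
if (requireNamespace("plyr", quietly = TRUE)) {
  by_month <- plyr::ddply(clean, "Month", plyr::summarise,
                          mean_ozone   = mean(Ozone),
                          median_ozone = median(Ozone))
  print(by_month)
}

# Plot mean and median side by side for each month.
bars <- t(as.matrix(summary_tbl[, c("mean_ozone", "median_ozone")]))
barplot(bars, beside = TRUE, names.arg = summary_tbl$Month,
        legend.text = c("mean", "median"),
        xlab = "Month", ylab = "Ozone (ppb)")
```

Where the mean sits well above the median, a few high readings are pulling the average up. That kind of gap is usually the first thing worth poking at.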

Check out his full post.

What does it mean to be a Data Scientist?


Check out this talk by John Rauser of Amazon at the 2011 Strata Conference. It is an excellent intro to the field.

Sunday, October 9, 2011

The Skills of a Data Miner

Data mining is not only statistics, even if statistics is the most recognized academic component of it. It also includes data cleaning, machine learning and data visualization.
The scarce factor is the ability to understand that data and extract value from it.
Hal Varian, Google

The full article by Luca Sbardella published in QuantMind is well worth a read.