Welcome...
Computational Analytics is designed to share concepts and code relating to data analytics and computational modeling. Some of the analytic topics are hard core math and others are a bit more life related. You will find a new post each month. We maintain our occasional contact (non-spamy) through a mostly monthly email and if you would like to join the mailing list, please use the subscribe link above. We do not sell your info nor do we sell advertising space so this page will remain technical. Suggestions, questions and comments are welcome via the contact link.
Thank you and happy computing! || Randall Shane, PhD
December 2011
SQLite: Select, Where and Ordering and R Barplot
Lets start slicing and dicing! One of the main activities performed is simply eyeballing our data. In this post, let's look at how we use select statements, where clauses to implement filtering and ordering for sorting. The syntax you see here is specific to SQLite but the idea is the same across different implementations of relational databases and the Structured Query Language better known as SQL. Remember, you can click to enlarge any image.
First, if you need the database file with the data loaded, HERE is a copy. You can open it using the SQLite manager for Firefox (see previous post for the software) using the Tools Menu and select the SQLite manager. To open the file, Database, Connect Database and select your file. You'll begin on the Master table tab (left column) and you can click the triangle and select the Nuclear_Alternative_Energy tab to see the table.
We will begin with a simple select and the fact that we can see our table means that the select is already structured for us. Hit the bar on the top of the page and select the 'Execute SQL' area.
You will see the following SQL statement: SELECT * FROM tablename. This is the basic syntax of a SQL statement and we can just follow the template. If you would like to see your entire table, simply alter the tablename to your table. The * just means all. Change it to read: SELECT * FROM Nuclear_Alternative_Energy and hit the button labeled 'Run SQL'. You'll see your table.
Good. Now, I only want to see the columns with the country name and the years 2005, 2006, 2007 and 2008. I modify the statement to include the desired column names instead of the * for all and hit 'Run SQL'. The statement looks like this:
SELECT "Country Name", "2005", "2006", "2007", "2008" FROM Nuclear_Alternative_Energy;
Use the names of the columns and separate them with a comma. Also, enclose the names with a quote. This is actually for 2 reasons. The first column, country name, possesses a space and since the space is used to parse on, we have to communicate to the database to ignore the space for parsing and use it in the field name. Second, the columns begin with a number. If you need to know the column names, hit the triangle next to the table name in the left pane (which is generally called the object browser) and you'll see them. Let's get into a good habit as well by terminating our statement with a semicolon. This is a good idea as you will no doubt move into using SQL in a larger environment and you will want more than one statement.
OK, moving along, I want only the following countries: US, Canada and Cuba, since one of my favorite peeps is Cuban. (Hello, by the way, been a while!!) We limit the row dimension by implementing a where clause. Quite simply we append the following: WHERE "Country Name" = "United States" OR "Country Name" = "Canada" OR "Country Name" = "Cuba". The whole statement looks like this: SELECT "Country Name", "2005", "2006", "2007", "2008" FROM Nuclear_Alternative_Energy WHERE "Country Name" = "United States" OR "Country Name" = "Canada" OR "Country Name" = "Cuba";
Lastly, I would like it in descending order by 2008 usage so we will use the order by clause. Append ORDER BY “2008” DESC. SQL Statement: SELECT "Country Name", "2005", "2006", "2007", "2008" FROM Nuclear_Alternative_Energy WHERE "Country Name" = "United States" OR "Country Name" = "Canada" OR "Country Name" = "Cuba" ORDER BY "2008" DESC;
For fun, (and for Nancy who wants more R), let's plot this. Hit the Actions button and click save Result(CSV) to File and save it to your desktop as output.csv. Fire up and R session and drop the output file into your R working directory which you can find with getwd().
Heres the steps to input and plot the file. Youll get a warning on the input about an incomplete line but dont sweat that!
And here's a bar plot of just 2005 and a little more spruced up grouped colored bar plot of all years.
Thats all for now. Have a happy holiday and enjoy the season!! NYC is chilly but this is always the best time of year up here!! Come visit! (seriously, if you're in NYC, drop me a line and Ill meet you for a cocktail!)
November 2011
SQLite for Analytics
Hellooooo! I hope this post finds you all well! From here, we are going to have a little different tenor. The future direction will concentrate on using a simple open source tool, namely SQLite for some interesting analytics. Todays post will get you set up and using SQLite and we'll proceed by integrating R, visualization and all sorts of interesting items.
First things first, lets get installed. Everyone has an opinion on what they like to use as far as Windows, Mac, Linux, etc. There are merits to all and I would like to present this material so that no audience is alienated. In order to facilitate this, we will use SQLite and a plug in for the Firefox web browser named SQLite Manager. Now, if you are using Internet Explorer or Chrome (my fav) or something else, continue to use what you like. There is no crime in having 2 web browsers. First download and install Firefox. If you already use it, then skip to the next step.
Firefox: http://www.mozilla.org/en-US/firefox/new/
Now, the SQLite manager install. Open Firefox ( I was prompted on my first open to import stuff and use it as my default browser. I said no to both as I prefer Chrome for web browsing). Next, type into the search bar firefox sqlite manager and the first entry will be for SQLite Manager :: Add-ons for Firefox Mine looks like this:
Double click it and hit Add to Firefox. This will download the extension, prompt you for the install and then restart. To run the extension, on Windows, Hit the Firefox button in the upper left corner and under Web Developer you will find the SQLite Manager. In Mac, it is under Tools and you'll see the extension.
Select it and you will see the sqlite manager like this:
OK, lets play with some data!
The first step is to import a delimited data file. Here is a test file: Nuclear_Alternative_Energy.csv Save this file to your desktop and then go back to the SQLite Manager. This is a data set that displays alternative and nuclear energy usage by country from 2000 to present as a percentage of total energy consumed. This type of data is available free from the World Bank, (http://data.worldbank.org/). Thank you, World Bank!
This is a delimited file with column titles on the first line, delimited by the comma with double quotes around the text fields. Overall, this is the most common format of data found. If you're using Excel, you can always export delimited data and Excel will also read delimited data natively.
Let's get to work:
1. Create a new database: Its the icon that looks like a page with the corner folded down. You'll be prompted to name it and also for a location to save it. I named mine Energy Data and saved it to the desktop.
2. Import the data set. The icon has the little arrow and cylinder slice. You should adjust your settings for your data and name your table. I named my table Nuclear_Alternative_Energy. My settings are listed below.
3. You will next be prompted to define the data columns. This, in the database world is called a schema. For the moment, don't worry about all different types of data. These are very versitile but like most things, you will use a few of them most often. For now, please simply mimic the definitions I have listed here:
4. Done, we have data! Double click your table in the left column and you will see your data. You should see 244 rows in your table. The various tabs will give you some options. If you hit Browse & Search, you can see the table, clicking column headers will sort. The Structure tab shows you the SQL statement you just made to construct the table. Execute SQL is the one we will use to query the data.
Good work! Thats all for now else this post gets wayyyyy too loooonnngggg! Play with the interface, its pretty simple. Next time, we'll do where clauses (filtering), grouping and sorting!
Enjoy!
September 2011
Data Reconnaissance - Getting to Know Your Information
This month we are discussing surveying data. With every data set comes the daunting task of trying to determine what is in the data. How do we do this efficiently? There are several data attributes that should be considered before mapping an efficient plan for data reconnaissance as differing types of data will require different testing. The main item is to determine if the data is qualitative or quantitative. Qualitative data is usually descriptive and quantitative data is often numeric. This greatly affects how we treat our information and the tests we use to develop a sense of what we have.
Qualitative data is categorical and can be divided into two subcategories, nominal and ordinal. Nominal data offers differentiation only, like eye color. Ordinal data provides differentiation and order, like zip codes. Neither type are attributes for which we would apply mathematics. They are both examples of descriptive information.
Quantitative data possesses the subcategories of interval and ratio. For interval data, the differences between the attributes are important. For example, 80 degrees fahrenheit is warmer then 30 degrees fahrenheit. Ratio data adds the property of its namesake. If you have 4 chihuahuas, you have twice as many chihuahuas as someone who has 2. The ratio becomes another important property and can be integral to our understanding.
The last consideration before we devise some schemes for interpretation is to decide what represents a missing data unit and also to decide its dispensation. In products like ACL, a missing number is a zero. There is no differentiation between a zero and a null in numeric data. Is this important? Should we hold the field first as character data to evaluate the presence of nulls? These are the considerations that warrant further though. In the R environment, missing data is often represented by NA and will be silently dropped when graphing or applying functions. Keep this in mind because if you work in R, you'll notice there are no warnings for this. Data held is MySQL will often possess both including a null for missing data and a zero for zero data. There is no blanket answer for resolving this situation and often the decision is situationally dependent. The options are obvious; excerpt the missing entries, permute a correlating value or leave them as is. This is something that must be considered before moving forward.
A few simple commands apply to both qualitative and quantitative data. This includes a preview of the first few rows, the number of data units and normally some confirmation of data types to determine if permutation is necessary and also what we are working with. The syntax for ACL, R and SQL follow: (Click for full size)
     
     

Qualitative Data can be easily surveyed assessing the number of permutations and performing some simple sorting on the data set. In R, its interesting as we can display these visually with one simple plot command. Syntax follows: (Click for full size)
     
     

Quantitative data necessitates making some decisions before surveying based upon the type of data and what the reviewer is looking for. Usually, one is attempting to determine the range of values and usually the distribution. Syntax follows: (Click for full size)
     
     

Obviously, these are relatively simple examples but the point here is not to produce the most complicated analysis known to mankind, (that'll come later!). The point here is that after a few simple steps, you have some idea of what is in the table. The differences in representation are that ACL and SQL represent data in a much more tabular format while R utilizes visualization to enhance the understanding. Everyone absorbs differently and depending on the data, and the day of the week, I use whatever suits my mood. In many ways, that is the beauty of analytics. Happy surveying!!
June 2011
GPS Address Resolution - Using Google Maps to resolve GPS location
Those of you who know me know I have a fascination with geolocation. Several years ago, I was having a conversation with a gentleman from Pennsylvania named Hunter and we discussed the viability of using a geolocative service to perform address matching for commercial purposes. This lead to some research and ultimately a methodology that functioned but was clunky. The further piece of the puzzle came this year when I had to write a package in the R language for a class at Stanford. I decided to take this idea to the next level and integrate with the Google Maps API.
If youre looking to resolve two list of addresses against one another list, I have written a package in R named geoPlot which you can find on CRAN-R. This package will compare 2 lists of addresses by performing geolocation and provide both a delimited output and a graphical map plot. Please note that due to dependencies for integration with RgoogleMaps, this package is only available for R on Windows although originally written and tested on Mac and Linux. I am working on the non-graphical version for Unix environments and will update this post when it is ready. If you would like a copy of the source, you can download it HERE and install it as a local package.
Here's the logic of how it works:
Did you know you can resolve an address to its GPS coordinates by rendering a string through your web browser? Let's say I want to find out the GPS coordinates of the Empire State Building. The street address is 350 5th Avenue, Manhattan, NY 10001. The syntax of the HTTP string I need to construct looks like this:
So, the string once constructed looks like this:
The output here is being designated in XML although there are other options. The output is the following: (Click the image to see it full size or just paste the line in yourself!)

You have the options of parsing the XML document and using point coordinates which are shown here on the line contained in the point tag:
You can also use a set of box coordinates found enclosed in the extended data tags:
With a little web scraping, you can implement this data almost anywhere.
Soooo, where on the globe do you live???
See you next time! [R]
















