Yesterday I tried to do some data processing on a really big data set in MS Excel. Wow, did it not like handling all that data! Every time I clicked on a different ribbon, the screen wouldn't even register the click. So I took the hint and decided to do my data processing in R.
One of the tasks I needed to do was calculate, for each row of the dataset, the maximum of the monetary values in 5 different fields. The first thing I noticed was that the regular max() function in R doesn't handle a row of all-NA values gracefully (with na.rm = TRUE it returns -Inf, along with a warning). So, I decided to create a "safe" max function:
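A minimal sketch of such a "safe" max (assuming we want NA back, rather than -Inf, when every value is missing) looks something like this:

# "Safe" max: drop NAs, and return NA instead of -Inf when nothing is left
safe.max <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  max(x)
}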
Finding that it worked, I then constructed a simple for loop to iterate through my ~395,000 rows, roughly like the sketch below. As you can imagine, this was taking forever! After much looking around, I realized that the best solution was actually a base function: apply().
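The loop itself isn't worth reproducing exactly, but it was along these lines (the column indices 214:218 are the same ones used in the apply() call below):

# Slow row-by-row version: indexing a data frame one row at a time is costly in R
row_max <- numeric(nrow(big.dataset))
for (i in seq_len(nrow(big.dataset))) {
  row_max[i] <- safe.max(as.numeric(big.dataset[i, 214:218]))
}
big.dataset$max_money <- row_max

Repeatedly subsetting a data frame row by row like this is a big part of why it crawled.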
I constructed my "max" variable with one simple line of code:

big.dataset$max_money = apply(as.matrix(big.dataset[, 214:218]), 1, function(x) safe.max(x))
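To see what that one-liner does on something small, here is the same pattern on a made-up toy data frame (the column names are purely illustrative). apply() with MARGIN = 1 hands each row of the matrix to the function, and passing safe.max directly, without the anonymous wrapper, works just as well:

# Toy example of the same row-wise apply() pattern (hypothetical data)
toy <- data.frame(a = c(10, NA, 3), b = c(NA, NA, 7), c = c(2, NA, 5))
toy$max_money <- apply(as.matrix(toy), 1, safe.max)
toy$max_money  # 10 NA 7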
Compared to the for loop, this method was a breeze: it took less than a minute to get through the whole data set. Moral of the story? When you're dealing with lots of data, code as efficiently as possible!
