<?xml version="1.0" encoding="UTF-8"?>
 <rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (rahuldave)</title>
    <link>https://pinboard.in/u:rahuldave/public/</link>
    <description>recent bookmarks from rahuldave</description>
    <items>
      <rdf:Seq>	<rdf:li rdf:resource="http://www.cscs.umich.edu/~crshalizi/weblog/cat_36-402.html"/>
	<rdf:li rdf:resource="http://www.johnmyleswhite.com/notebook/2012/05/03/cumplyr-extending-the-plyr-package-to-handle-cross-dependencies/"/>
	<rdf:li rdf:resource="http://www.r-bloggers.com/measuring-time-series-characteristics/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2012/05/01/big-data-is-easy/"/>
	<rdf:li rdf:resource="http://wiki.stdout.org/rcookbook/"/>
	<rdf:li rdf:resource="http://www.r-bloggers.com/comparing-julia-and-r%e2%80%99s-vocabularies/"/>
	<rdf:li rdf:resource="http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/"/>
	<rdf:li rdf:resource="http://www.r-bloggers.com/data-visualization/"/>
	<rdf:li rdf:resource="http://www.r-bloggers.com/abc-in-roma-r-lab-2/"/>
	<rdf:li rdf:resource="http://www.r-bloggers.com/modeling-trick-the-signed-pseudo-logarithm/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2012/02/29/comparing-r-to-smoking/"/>
	<rdf:li rdf:resource="http://chrisblattman.com/2012/02/17/dear-statisticians-please-start-using-your-powers-for-good-not-evil/"/>
	<rdf:li rdf:resource="http://dl.acm.org/citation.cfm?id=1195284"/>
	<rdf:li rdf:resource="http://stat.asu.edu/~chavez/CCCPUB/Towards%20real%20time%20epidemiology%20data%20assimilation,%20modeling%20and%20anomaly%20detection%20of%20health%20surveillance%20data%20streams.pdf"/>
	<rdf:li rdf:resource="http://www.lancs.ac.uk/staff/diggle/ENAR_slides.pdf"/>
	<rdf:li rdf:resource="http://feeds.arstechnica.com/~r/arstechnica/index/~3/NzKuPgcpSdY/seeing-a-power-law-in-data-doesnt-make-it-real.ars"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2012/02/01/the-universal-solvent-of-statistics/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2012/01/02/r-in-action/"/>
	<rdf:li rdf:resource="http://feedproxy.google.com/~r/TheEndeavour/~3/gBBNTQY8bwk/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2011/04/20/teaching-bayesian-stats-backward/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2011/04/14/significance-testing-and-congress/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2011/04/13/pericchi-statistical-significance/"/>
	<rdf:li rdf:resource="http://chrisblattman.com/2011/04/11/how-i-regard-almost-every-empirical-development-or-conflict-paper-i-know/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2010/04/29/simple-approximation-to-normal-distribution/"/>
	<rdf:li rdf:resource="http://www.mailund.dk/index.php/2010/04/26/is-r-an-epic-fail/"/>
	<rdf:li rdf:resource="http://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/"/>
      </rdf:Seq>
    </items>
  </channel><item rdf:about="http://www.cscs.umich.edu/~crshalizi/weblog/cat_36-402.html">
    <title>Advanced Data Analysis from an Elementary Point of View</title>
    <dc:date>2012-05-03T20:41:03+00:00</dc:date>
    <link>http://www.cscs.umich.edu/~crshalizi/weblog/cat_36-402.html</link>
    <dc:creator>rahuldave</dc:creator><dc:subject>data statistics</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:rahuldave/b:b69e5412b4c2/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:data"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johnmyleswhite.com/notebook/2012/05/03/cumplyr-extending-the-plyr-package-to-handle-cross-dependencies/">
    <title>cumplyr: Extending the plyr Package to Handle Cross-Dependencies</title>
    <dc:date>2012-05-03T14:44:49+00:00</dc:date>
    <link>http://www.johnmyleswhite.com/notebook/2012/05/03/cumplyr-extending-the-plyr-package-to-handle-cross-dependencies/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Introduction
For me, Hadley Wickham‘s reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split up your data into several subsets, perform some computation on each of these subsets and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations up into subsets based on the group assignments, applying mean to those subsets and then pooling the results back together again.

The Split-Apply-Combine Strategy is Broader than plyr
The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that plyr implements one very specific version of the Split-Apply-Combine strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. For computations that involve cross-dependencies between observations, this makes plyr inapplicable: cumulative quantities like running means and broadly local quantities like kernelized means cannot be computed using plyr. To highlight that concern, let’s consider three very simple data analysis problems.

Computing Forward-Running Means
Suppose that you have the following data set:



Time
Value


1
1


2
3


3
5


To compute a forward-running mean, you need to split this data into three subsets:



Time
Value


1
1




Time
Value


1
1


2
3




Time
Value


1
1


2
3


3
5


In each of these clearly non-disjoint subsets, you would then compute the mean of Value and combine the results to give:



Time
Value


1
1


2
2


3
3


This sort of computation occurs often enough in a simpler form that R provides tools like cumsum and cumprod to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to computed on subsets that are not disjoint.

Computing Backward-Running Means
Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:



Time
Value


3
5




Time
Value


2
3


3
5




Time
Value


1
1


2
3


3
5


And the final result is:



Time
Value


1
3


2
4


3
5


Computing Local Means (AKA Kernelized Means)
Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately proceeding and succeeding neighbors. This computation must create the following subsets of data:



Time
Value


1
1


2
3




Time
Value


1
1


2
3


3
5




Time
Value


2
3


3
5


Within these non-disjoint subsets, means are computed and the result is:



Time
Value


1
2


2
3


3
4


A Strategy for Handling Non-Disjoint Subsets
How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system that uses a splitting operation that partitions our data into subsets in which each subset satisfies a group of equality constraints: you split the data into groups in which Variable 1 = Value 1 AND Variable 2 = Value 2, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.

Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I’ll describe just three instantiations of this broader strategy.

Using Upper Bounds
Here, we divide data into groups in which Variable 1 <= Value 1 AND Variable 2 <= Value 2, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint Variable = Value 1 AND Variable 2 <= Value 2. If the upper bound is the Time variable, these contraints will allow us to compute the forward-moving mean we described earlier.

Using Lower Bounds
Instead of using upper bounds, we can use lower bounds to divide data into groups in which Variable >= Value 1 AND Variable 2 >= Value 2, etc. This allows us to implement the backward-moving mean described earlier.

Using Norm Balls
Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we look at a specific hypothetical equality constraint like Variable 1 = Value 1 and then loosen it so that norm(Variable 1 - Value 1) <= r. When r = 0, this system gives the original equality constraint. But when r > 0, we produce a "ball" of data around the constraint whose tightness is r. This lets us estimate the local means from our third example.

Implementation
To demo these ideas in a usable fashion, I've created a draft package for R called cumplyr. Here is an extended example of its usage in solving simple variants of the problems described in this post:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
library('cumplyr')
 
data <- data.frame(Time = 1:5, Value = seq(1, 9, by = 2))
 
iddply(data,
       equality.variables = c('Time'),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c('Time'),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c('Time'),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 1),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 2),
       func = function (df) {with(df, mean(Value))})
 
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 5),
       func = function (df) {with(df, mean(Value))})

You can download this package from GitHub and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.

Comparing plyr with cumplyr
In the long run, I'm hoping to make the functions in cumplyr robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.

For the time being, I would advise any users of cumplyr to make sure that you do not use cumplyr for anything that plyr could already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.

But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.

Happy hacking!
]]></description>
<dc:subject>Programming Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:a5482cf69a97/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Programming"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.r-bloggers.com/measuring-time-series-characteristics/">
    <title>Measuring time series characteristics</title>
    <dc:date>2012-05-02T07:51:05+00:00</dc:date>
    <link>http://www.r-bloggers.com/measuring-time-series-characteristics/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[
(This article was first published on   Research tips » R, and kindly contributed to R-bloggers)      


A few years ago, I was working on a project where we measured various characteristics of a time series and used the information to determine what forecasting method to apply or how to cluster the time series into meaningful groups. The two main papers to come out of that project were:


Wang, Smith and Hyndman (2006) Characteristic-​​based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335–364.
Wang, Smith-Miles and Hyndman (2009) “Rule induction for forecasting method selection: meta-​​learning the characteristics of univariate time series”, Neurocomputing, 72, 2581–2594.

I’ve since had a lot of requests for the code which one of my coauthors has been helpfully emailing to anyone who asked. But to make it easier, we thought it might be helpful if I post some updated code here. This is not the same as the R code we used in the paper, as I’ve improved it in several ways (so it will give different results). If you just want the code, skip to the bottom of the post.

Finding the period of the data
Usually in time series work, we know the period of the data (if the observations are monthly, the period is 12, for example). But in this project, some of our data was of unknown period and we wanted a method to automatically determine the appropriate period. The method we used was based on local peaks and troughs in the ACF. But I’ve since devised a better approach (prompted on crossvalidated.com) using an estimate of the spectral density:


find.freq <- function(x)
{
  n <- length(x)
  spec <- spec.ar(c(na.contiguous(x)),plot=FALSE)
  if(max(spec$spec)>10) # Arbitrary threshold chosen by trial and error.
  {
    period <- round(1/spec$freq[which.max(spec$spec)])
    if(period==Inf) # Find next local maximum
    {
      j <- which(diff(spec$spec)>0)
      if(length(j)>0)
      {
        nextmax <- j[1] + which.max(spec$spec[j[1]:500])
        if(nextmax <= length(spec$freq))
          period <- round(1/spec$freq[nextmax])
        else
          period <- 1
      }
      else
        period <- 1
    }
  }
  else
    period <- 1
 
  return(period)
}

 
The function is called find.freq because time series people often call the period of seasonality the “frequency” (which is of course highly confusing).

Decomposing the data into trend and seasonal components
We needed a measure of the strength of trend and the strength of seasonality, and to do this we decomposed the data into trend, seasonal and error terms.

Because not all data could be decomposed additively, we first needed to apply an automated Box-Cox transformation. We tried a range of Box-Cox parameters on a grid, and selected the one which gave the most normal errors. That worked ok, but I’ve since found some papers that provide quite good automated Box-Cox algorithms that I’ve implemented in the forecast package. So this code uses Guerrero’s (1993) method instead.

For seasonal time series, we decomposed the transformed data using an stl decomposition with periodic seasonality.

For non-seasonal time series, we estimated the trend of the transformed data using penalized regression splines via the mgcv package.


decomp <- function(x,transform=TRUE)
{
  require(forecast)
  # Transform series
  if(transform & min(x,na.rm=TRUE) >= 0)
  {
    lambda <- BoxCox.lambda(na.contiguous(x))
    x <- BoxCox(x,lambda)
  }
  else
  {
    lambda <- NULL
    transform <- FALSE
  }
  # Seasonal data
  if(frequency(x)>1)
  {
    x.stl <- stl(x,s.window="periodic",na.action=na.contiguous)
    trend <- x.stl$time.series[,2]
    season <- x.stl$time.series[,1]
    remainder <- x - trend - season
  }
  else #Nonseasonal data
  {
    require(mgcv)
    tt <- 1:length(x)
    trend <- rep(NA,length(x))
    trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
    season <- NULL
    remainder <- x - trend
  }
  return(list(x=x,trend=trend,season=season,remainder=remainder,
    transform=transform,lambda=lambda))
}

 

Putting everything on a [0,1] scale
We wanted to measure a range of characteristics such as strength of seasonality, strength of trend, level of nonlinearity, skewness, kurtosis, serial correlatedness, self-similarity, level of chaoticity (is that a word?) and the periodicity of the data. But we wanted all these on the same scale which meant mapping the natural range of each measure onto [0,1]. The following two functions were used to do this.


# f1 maps [0,infinity) to [0,1]
f1 <- function(x,a,b)
{
  eax <- exp(a*x)
  if (eax == Inf)
    f1eax <- 1
  else
    f1eax <- (eax-1)/(eax+b)
  return(f1eax)
}
 
# f2 maps [0,1] onto [0,1]
f2 <- function(x,a,b)
{
  eax <- exp(a*x)
  ea <- exp(a)
  return((eax-1)/(eax+b)*(ea+b)/(ea-1))
}

 
The values of  and  in each function were chosen so the measure had a 90th percentile of 0.10 when the data were iid standard normal, and a value of 0.9 using a well-known benchmark time series.

Calculating the measures
Now we are ready to calculate the measures on the original data, as well as on the adjusted data (after removing trend and seasonality).


measures <- function(x)
{
  require(forecast)
 
  N <- length(x)
  freq <- find.freq(x)
  fx <- c(frequency=(exp((freq-1)/50)-1)/(1+exp((freq-1)/50)))
  x <- ts(x,f=freq)
 
  # Decomposition
  decomp.x <- decomp(x)
 
  # Adjust data
  if(freq > 1)
    fits <- decomp.x$trend + decomp.x$season
  else # Nonseasonal data
    fits <- decomp.x$trend
  adj.x <- decomp.x$x - fits + mean(decomp.x$trend, na.rm=TRUE)
 
  # Backtransformation of adjusted data
  if(decomp.x$transform)
    tadj.x <- InvBoxCox(adj.x,decomp.x$lambda)
  else
    tadj.x <- adj.x
 
  # Trend and seasonal measures
  v.adj <- var(adj.x, na.rm=TRUE)
  if(freq > 1)
  {
    detrend <- decomp.x$x - decomp.x$trend
    deseason <- decomp.x$x - decomp.x$season
    trend <- ifelse(var(deseason,na.rm=TRUE) < 1e-10, 0, 
      max(0,min(1,1-v.adj/var(deseason,na.rm=TRUE))))
    season <- ifelse(var(detrend,na.rm=TRUE) < 1e-10, 0,
      max(0,min(1,1-v.adj/var(detrend,na.rm=TRUE))))
  }
  else #Nonseasonal data
  {
    trend <- ifelse(var(decomp.x$x,na.rm=TRUE) < 1e-10, 0,
      max(0,min(1,1-v.adj/var(decomp.x$x,na.rm=TRUE))))
    season <- 0
  }
 
  m <- c(fx,trend,season)
 
  # Measures on original data
  xbar <- mean(x,na.rm=TRUE)
  s <- sd(x,na.rm=TRUE)
 
  # Serial correlation
  Q <- Box.test(x,lag=10)$statistic/(N*10)
  fQ <- f2(Q,7.53,0.103)
 
  # Nonlinearity
  p <- terasvirta.test(na.contiguous(x))$statistic
  fp <- f1(p,0.069,2.304)
 
  # Skewness
  s <- abs(mean((x-xbar)^3,na.rm=TRUE)/s^3)
  fs <- f1(s,1.510,5.993)
 
  # Kurtosis
  k <- mean((x-xbar)^4,na.rm=TRUE)/s^4
  fk <- f1(k,2.273,11567)
 
  # Hurst=d+0.5 where d is fractional difference.
  H <- fracdiff(na.contiguous(x),0,0)$d + 0.5
 
  # Lyapunov Exponent
  if(freq > N-10)
      stop("Insufficient data")
  Ly <- numeric(N-freq)
  for(i in 1:(N-freq))
  {
    idx <- order(abs(x[i] - x))
    idx <- idx[idx < (N-freq)]
    j <- idx[2]
    Ly[i] <- log(abs((x[i+freq] - x[j+freq])/(x[i]-x[j])))/freq
    if(is.na(Ly[i]) | Ly[i]==Inf | Ly[i]==-Inf)
      Ly[i] <- NA
  }
  Lyap <- mean(Ly,na.rm=TRUE)
  fLyap <- exp(Lyap)/(1+exp(Lyap))
 
  m <- c(m,fQ,fp,fs,fk,H,fLyap)
 
  # Measures on adjusted data
  xbar <- mean(tadj.x, na.rm=TRUE)
  s <- sd(tadj.x, na.rm=TRUE)
 
  # Serial
  Q <- Box.test(adj.x,lag=10)$statistic/(N*10)
  fQ <- f2(Q,7.53,0.103)
 
  # Nonlinearity
  p <- terasvirta.test(na.contiguous(adj.x))$statistic
  fp <- f1(p,0.069,2.304)
 
  # Skewness
  s <- abs(mean((tadj.x-xbar)^3,na.rm=TRUE)/s^3)
  fs <- f1(s,1.510,5.993)
 
  # Kurtosis
  k <- mean((tadj.x-xbar)^4,na.rm=TRUE)/s^4
  fk <- f1(k,2.273,11567)
 
  m <- c(m,fQ,fp,fs,fk)
  names(m) <- c("frequency", "trend","seasonal",
    "autocorrelation","non-linear","skewness","kurtosis",
    "Hurst","Lyapunov",
    "dc autocorrelation","dc non-linear","dc skewness","dc kurtosis")
 
  return(m)
}

 

Here is a quick example applied to Australian monthly gas production:


library(forecast)
measures(gas)
     frequency              trend           seasonal    autocorrelation 
        0.1096             0.9989             0.9337             0.9985 
    non-linear           skewness           kurtosis              Hurst 
        0.4947             0.1282             1.0000             0.9996 
      Lyapunov dc autocorrelation      dc non-linear        dc skewness 
        0.5662             0.1140             0.0538             0.1743 
   dc kurtosis 
        1.0000

 
The function is far from perfect, and it is not hard to find examples where it fails. For example, it doesn’t work with multiple seasonality — try measure(taylor) and check the seasonality. Also, I’m not convinced the kurtosis provides anything useful here, or that the skewness measure is done in the best way possible. But it was really a proof of concept, so we will leave it to others to revise and improve the code.

In our papers, we took the measures obtained using R, and produced self-organizing maps using Viscovery. There is now a som package in R for that, so it might be possible to integrate that step into R as well. The dendogram was generated in matlab, although that could now also be done in R using the ggdendro package for example.


Download the code in a single file.



To leave a comment for the author, please follow the link and comment on his blog:  Research tips » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

]]></description>
<dc:subject>R_bloggers forecasting R Research_tips statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:c791c19684a6/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R_bloggers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:forecasting"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Research_tips"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2012/05/01/big-data-is-easy/">
    <title>Big data is easy</title>
    <dc:date>2012-05-01T14:16:12+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2012/05/01/big-data-is-easy/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Big data is easy; big models are hard.

If you just wanted to use simple models with tons of data, that would be easy. You could resample the data, throwing some of it away until you had a quantity of data you could comfortably manage.

But when you have tons of data, you want to take advantage of it and ask questions that simple models cannot answer. (”Big” data is often indirect data.) So the problem isn’t that you have a lot of data, it’s that you’re using models that require a lot of data. And that can be very hard.

I am not saying people should just use simple models. No, people are right to want to take advantage of their data, and often that does require complex models. (See Brad Efron’s explanation why.) But the primary challenge is intellectual, not physical.

Related post: Big data and humility

]]></description>
<dc:subject>Statistics Probability_and_Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:ebac9e0bdf19/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://wiki.stdout.org/rcookbook/">
    <title>Cookbook for R » Cookbook for R</title>
    <dc:date>2012-04-30T13:48:23+00:00</dc:date>
    <link>http://wiki.stdout.org/rcookbook/</link>
    <dc:creator>rahuldave</dc:creator><dc:subject>statistics</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:rahuldave/b:c57520f4b4c4/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.r-bloggers.com/comparing-julia-and-r%e2%80%99s-vocabularies/">
    <title>Comparing Julia and R’s Vocabularies</title>
    <dc:date>2012-04-09T14:00:19+00:00</dc:date>
    <link>http://www.r-bloggers.com/comparing-julia-and-r%e2%80%99s-vocabularies/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[
(This article was first published on   John Myles White » Statistics, and kindly contributed to R-bloggers)      


While exploring the Julia manual recently, I realized that it might be helpful to put the basic vocabularies of Julia and R side-by-side for easy comparison. So I took Hadley Wickham’s R Vocabulary section from the book he’s putting together on the devtools wiki, put all of the functions Hadley listed into a CSV file, and proceeded to fill in entries where I knew of an obvious Julia equivalent to an R function.

The results are on GitHub and, as they stand today, are shown below:



R
Julia
Category
Subcategory


https://

github.com/

hadley/devtools/

wiki/vocabulary
http://

julialang.org/

manual/

standard-

library-reference/
Resources
Vocabulary


?
help
Basics
First Functions


str

Basics
First Functions


%in%

Basics
Operators


match

Basics
Operators


=
=
Basics
Operators


<-
=
Basics
Operators


<<-

Basics
Operators


assign

Basics
Operators


$
[]
Basics
Operators


[]
[]
Basics
Operators


[[]]
[]
Basics
Operators


replace

Basics
Operators


head

Basics
Operators


tail

Basics
Operators


subset

Basics
Operators


with

Basics
Operators


within

Basics
Operators


all.equal

Basics
Comparison


identical

Basics
Comparison


!=
!=
Basics
Comparison


==
==
Basics
Comparison


>
>
Basics
Comparison


>=
>=
Basics
Comparison


<
<
Basics
Comparison


<=
<=
Basics
Comparison


is.na

Basics
Comparison


is.nan

Basics
Comparison


is.finite

Basics
Comparison


complete.cases

Basics
Comparison


*
*
Basics
Basic Math


+
+
Basics
Basic Math


-
-
Basics
Basic Math


/
/
Basics
Basic Math


^
^
Basics
Basic Math


%%
mod (%%)
Basics
Basic Math


%/%
div
Basics
Basic Math


abs
abs
Basics
Basic Math


sign
sign
Basics
Basic Math


acos
acos
Basics
Basic Math


acosh
acosh
Basics
Basic Math


asin
asin
Basics
Basic Math


asinh
asinh
Basics
Basic Math


atan
atan
Basics
Basic Math


atan2
atan2
Basics
Basic Math


atanh
atanh
Basics
Basic Math


sin
sin
Basics
Basic Math


sinh
sinh
Basics
Basic Math


cos
cos
Basics
Basic Math


cosh
cosh
Basics
Basic Math


tan
tan
Basics
Basic Math


tanh
tanh
Basics
Basic Math


ceiling
ceil
Basics
Basic Math


floor
floor
Basics
Basic Math


round
round
Basics
Basic Math


trunc
trunc
Basics
Basic Math


signif

Basics
Basic Math


exp
exp
Basics
Basic Math


log
log
Basics
Basic Math


log10
log10
Basics
Basic Math


log1p
log1p
Basics
Basic Math


log2
log2
Basics
Basic Math


logb

Basics
Basic Math


sqrt
sqrt
Basics
Basic Math


cummax

Basics
Basic Math


cummin

Basics
Basic Math


cumprod
cumprod
Basics
Basic Math


cumsum
cumsum
Basics
Basic Math


diff
diff
Basics
Basic Math


max
max
Basics
Basic Math


min
min
Basics
Basic Math


prod
prod
Basics
Basic Math


sum
sum
Basics
Basic Math


range

Basics
Basic Math


mean
mean
Basics
Basic Math


median
median
Basics
Basic Math


cor
cor_pearson
Basics
Basic Math


cov
cov_pearson
Basics
Basic Math


sd
std
Basics
Basic Math


var
var
Basics
Basic Math


pmax

Basics
Basic Math


pmin

Basics
Basic Math


rle

Basics
Basic Math


function
function
Basics
Functions


missing

Basics
Functions


on.exit

Basics
Functions


return
return
Basics
Functions


invisible

Basics
Functions


&
&
Basics
Logical & Set Operations


|
|
Basics
Logical & Set Operations


!
!
Basics
Logical & Set Operations


xor

Basics
Logical & Set Operations


all
all
Basics
Logical & Set Operations


any
any
Basics
Logical & Set Operations


intersect
intersect
Basics
Logical & Set Operations


union
union
Basics
Logical & Set Operations


setdiff

Basics
Logical & Set Operations


setequal

Basics
Logical & Set Operations


which
find
Basics
Logical & Set Operations


c
[] ({})
Basics
Vectors and Matrices


matrix
[] ({})
Basics
Vectors and Matrices


length
size (length)
Basics
Vectors and Matrices


dim
size
Basics
Vectors and Matrices


ncol
size(x, 1)
Basics
Vectors and Matrices


nrow
size(x, 2)
Basics
Vectors and Matrices


cbind
hcat
Basics
Vectors and Matrices


rbind
vcat
Basics
Vectors and Matrices


names

Basics
Vectors and Matrices


colnames

Basics
Vectors and Matrices


rownames

Basics
Vectors and Matrices


t
‘
Basics
Vectors and Matrices


diag
eye
Basics
Vectors and Matrices


sweep

Basics
Vectors and Matrices


as.matrix

Basics
Vectors and Matrices


data.matrix

Basics
Vectors and Matrices


c
[] ({})
Basics
Making Vectors


rep

Basics
Making Vectors


seq
[from:by:to]
Basics
Making Vectors


seq_along

Basics
Making Vectors


seq_len
[1:len]
Basics
Making Vectors


rev
reverse
Basics
Making Vectors


sample

Basics
Making Vectors


choose
factorial
Basics
Making Vectors


factorial
factorial
Basics
Making Vectors


combn

Basics
Making Vectors


(is/as).(character/numeric/logical)

Basics
Making Vectors


list
HashTable ([])
Basics
Lists & Data Frames


unlist

Basics
Lists & Data Frames


data.frame

Basics
Lists & Data Frames


as.data.frame

Basics
Lists & Data Frames


split

Basics
Lists & Data Frames


expand.grid

Basics
Lists & Data Frames


if
if
Basics
Control Flow


&&
&&
Basics
Control Flow


||
||
Basics
Control Flow


for
for
Basics
Control Flow


while
while
Basics
Control Flow


next
continue
Basics
Control Flow


break
break
Basics
Control Flow


switch

Basics
Control Flow


ifelse

Basics
Control Flow


fitted

Statistics
Linear Models


predict

Statistics
Linear Models


resid

Statistics
Linear Models


rstandard

Statistics
Linear Models


lm

Statistics
Linear Models


glm

Statistics
Linear Models


hat

Statistics
Linear Models


influence.measures

Statistics
Linear Models


logLik

Statistics
Linear Models


df

Statistics
Linear Models


deviance

Statistics
Linear Models


formula

Statistics
Linear Models


~

Statistics
Linear Models


I

Statistics
Linear Models


anova

Statistics
Linear Models


coef

Statistics
Linear Models


confint

Statistics
Linear Models


vcov

Statistics
Linear Models


contrasts

Statistics
Linear Models


apropos(‘\\.test$’)

Statistics
Miscellaneous Statistical Tests


beta
beta
Statistics
Random Numbers


binom
binom
Statistics
Random Numbers


cauchy
cauchy
Statistics
Random Numbers


chisq
chisq
Statistics
Random Numbers


exp
exp
Statistics
Random Numbers


f
f
Statistics
Random Numbers


gamma
gamma
Statistics
Random Numbers


geom
geom
Statistics
Random Numbers


hyper
hyper
Statistics
Random Numbers


lnorm
lnorm
Statistics
Random Numbers


logis
logis
Statistics
Random Numbers


multinom
multinom
Statistics
Random Numbers


nbinom
nbinom
Statistics
Random Numbers


norm
norm
Statistics
Random Numbers


pois
pois
Statistics
Random Numbers


signrank
signrank
Statistics
Random Numbers


t
t
Statistics
Random Numbers


unif
unif (rand)
Statistics
Random Numbers


weibull
weibull
Statistics
Random Numbers


wilcox
wilcox
Statistics
Random Numbers


birthday
birthday
Statistics
Random Numbers


tukey
tukey
Statistics
Random Numbers


crossprod
*
Statistics
Matrix Algebra


tcrossprod
*
Statistics
Matrix Algebra


eigen
eig
Statistics
Matrix Algebra


qr
qr
Statistics
Matrix Algebra


svd
svd
Statistics
Matrix Algebra


%*%
*
Statistics
Matrix Algebra


%o%

Statistics
Matrix Algebra


outer

Statistics
Matrix Algebra


rcond

Statistics
Matrix Algebra


solve
\
Statistics
Matrix Algebra


duplicated

Statistics
Ordering and Tabulating


unique

Statistics
Ordering and Tabulating


merge

Statistics
Ordering and Tabulating


order

Statistics
Ordering and Tabulating


rank

Statistics
Ordering and Tabulating


quantile
quantile
Statistics
Ordering and Tabulating


sort
sort
Statistics
Ordering and Tabulating


table

Statistics
Ordering and Tabulating


ftable

Statistics
Ordering and Tabulating


ls
whos
Working with R
Workspace


exists

Working with R
Workspace


get

Working with R
Workspace


rm

Working with R
Workspace


getwd
getcwd
Working with R
Workspace


setwd
setcwd
Working with R
Workspace


q
Ctrl-D
Working with R
Workspace


source
load
Working with R
Workspace


install.packages

Working with R
Workspace


library

Working with R
Workspace


require

Working with R
Workspace


help
help
Working with R
Help


?
help
Working with R
Help


help.search

Working with R
Help


apropos

Working with R
Help


RSiteSearch

Working with R
Help


citation

Working with R
Help


demo

Working with R
Help


example

Working with R
Help


vignette

Working with R
Help


traceback

Working with R
Debugging


browser

Working with R
Debugging


recover

Working with R
Debugging


options(error =)

Working with R
Debugging


stop

Working with R
Debugging


warning

Working with R
Debugging


message

Working with R
Debugging


tryCatch
try/catch
Working with R
Debugging


try
try
Working with R
Debugging


print
print (println)
I/O
Output


cat

I/O
Output


message

I/O
Output


warning

I/O
Output


dput

I/O
Output


format

I/O
Output


sink

I/O
Output


data

I/O
Reading and Writing Data


count.fields

I/O
Reading and Writing Data


read.csv
csvread
I/O
Reading and Writing Data


read.delim
dlmread
I/O
Reading and Writing Data


read.fwf

I/O
Reading and Writing Data


read.table

I/O
Reading and Writing Data


library(foreign)

I/O
Reading and Writing Data


write.table
dlmwrite
I/O
Reading and Writing Data


readLines
readlines
I/O
Reading and Writing Data


writeLines

I/O
Reading and Writing Data


load

I/O
Reading and Writing Data


save

I/O
Reading and Writing Data


readRDS

I/O
Reading and Writing Data


saveRDS

I/O
Reading and Writing Data


dir

I/O
Files and Directories


basename

I/O
Files and Directories


dirname

I/O
Files and Directories


file.path

I/O
Files and Directories


path.expand

I/O
Files and Directories


file.choose

I/O
Files and Directories


file.copy

I/O
Files and Directories


file.create

I/O
Files and Directories


file.remove

I/O
Files and Directories


path.rename

I/O
Files and Directories


dir.create

I/O
Files and Directories


file.exists

I/O
Files and Directories


tempdir

I/O
Files and Directories


tempfile

I/O
Files and Directories


download.file

I/O
Files and Directories


ISOdate

Special Data
Date / Time


ISOdatetime

Special Data
Date / Time


strftime

Special Data
Date / Time


strptime

Special Data
Date / Time


date

Special Data
Date / Time


difftime

Special Data
Date / Time


julian

Special Data
Date / Time


months

Special Data
Date / Time


quarters

Special Data
Date / Time


weekdays

Special Data
Date / Time


library(lubridate)

Special Data
Date / Time


grep
match
Special Data
Character Manipulation


agrep

Special Data
Character Manipulation


gsub

Special Data
Character Manipulation


strsplit
split
Special Data
Character Manipulation


chartr

Special Data
Character Manipulation


nchar
strlen
Special Data
Character Manipulation


tolower

Special Data
Character Manipulation


toupper

Special Data
Character Manipulation


substr

Special Data
Character Manipulation


paste
join
Special Data
Character Manipulation


library(stringr)

Special Data
Character Manipulation


factor

Special Data
Factors


levels

Special Data
Factors


nlevels

Special Data
Factors


reorder

Special Data
Factors


relevel

Special Data
Factors


cut

Special Data
Factors


findInterval

Special Data
Factors


interaction

Special Data
Factors


options(stringsAsFactors = FALSE)

Special Data
Factors


array
[]
Special Data
Array Manipulation


dim
size
Special Data
Array Manipulation


dimnames

Special Data
Array Manipulation


aperm

Special Data
Array Manipulation


library(abind)

Special Data
Array Manipulation



I’d like to note that holes in the list of Julia functions can exist for several reasons:

The language does not yet have the relevant features. This is true of things like factor() or data.frame().
The language has draft implementations of the relevant features, but they are not yet ready to make their way into this list. This is true of Doug Bates’ GLM code, for example.
I simply don’t know what the Julia equivalent is for an R function, but it may well exist. If you know of one, please fork the GitHub repository I’m using and revise the CSV file appropriately. I’ll integrate relevant pull requests as soon as I can find time.

In addition to explaining the presence of the many holes you can see this in this list, I’d also like to note how quickly these holes are being filled in: Doug Bates already finished a wrapper for the Rmath library, which means that Julia now has tools for calculating the PDF’s, CDF’s, and inverse CDF’s of most statistical distributions as well as the ability to draw random samples from them. That means that almost any sort of MCMC you’d like to do is already possible in Julia. (I, for one, am really interested to see if someone will use Julia’s sparse matrix support and these new Rmath functions to build MCMC code that’s easy on the eyes while also running at an appropriately fast speed on complicated, big data problems like matrix factorizations.)

On my end, I’ve been working on filling some of the missing entries in this list by adding in pieces that I think I understand well enough to implement from scratch, such as:

Optimization algorithms (optim.jl):

Simulated annealing
Gradient descent
Newton’s method

Statistical hypothesis tests (stats.jl):

t-Tests

Utility functions (utils.jl):

range
keys
cummax
cummin



To leave a comment for the author, please follow the link and comment on his blog:  John Myles White » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

]]></description>
<dc:subject>R_bloggers programming statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:4a08330142b5/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R_bloggers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:programming"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/">
    <title>Simulated Annealing in Julia</title>
    <dc:date>2012-04-04T20:38:53+00:00</dc:date>
    <link>http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Building Optimization Functions for Julia
In hopes of adding enough statistical functionality to Julia to make it usable for my day-to-day modeling projects, I’ve written a very basic implementation of the simulated annealing (SA) algorithm, which I’ve placed in the same JuliaVsR GitHub repository that I used for the code for my previous post about Julia. For easier reading, my current code for SA is shown below:

The Simulated Annealing Algorithm

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
# simulated_annealing()
# Arguments:
# * cost: Function from states to the real numbers. Often called an energy function, but this algorithm works for both positive and negative costs.
# * s0: The initial state of the system.
# * neighbor: Function from states to states. Produces what the Metropolis algorithm would call a proposal.
# * temperature: Function specifying the temperature at time i.
# * iterations: How many iterations of the algorithm should be run? This is the only termination condition.
# * keep_best: Do we return the best state visited or the last state visisted? (Should default to true.)
# * trace: Do we show a trace of the system's evolution?
 
function simulated_annealing(cost,
                             s0,
                             neighbor,
                             temperature,
                             iterations,
                             keep_best,
                             trace)
 
  # Set our current state to the specified intial state.
  s = s0
 
  # Set the best state we've seen to the intial state.
  best_s = s0
 
  # We always perform a fixed number of iterations.
  for i = 1:iterations
    t = temperature(i)
    s_n = neighbor(s)
    if trace println("$i: s = $s") end
    if trace println("$i: s_n = $s_n") end
    y = cost(s)
    y_n = cost(s_n)
    if trace println("$i: y = $y") end
    if trace println("$i: y_n = $y_n") end
    if y_n <= y
      s = s_n
      if trace println("Accepted") end
    else
      p = exp(- ((y_n - y) / t))
      if trace println("$i: p = $p") end
      if rand() <= p
        s = s_n
        if trace println("Accepted") end
      else
        s = s
        if trace println("Rejected") end
      end
    end
    if trace println() end
    if cost(s) < cost(best_s)
      best_s = s
    end
  end
 
  if keep_best
    best_s
  else
    s
  end
end

I’ve tried to implement the algorithm as it was presented by Bertsimas and Tsitsiklis in their 1993 review paper in Statistical Science. The major differences between my implementation and their description of the algorithm is that (1) I’ve made it possible to keep the best solution found during the search rather than always use the last solution found and (2) I’ve made no effort to select a value for their d parameter carefully: I’ve simply set it to 1 for all of my examples, though my code will allow you to specify another rule for setting the temperature of the annealer at time t other than the 1 / log(t) cooling scheme I’ve been using. (In fact, the code currently forces you to select a cooling scheme, since there are no default arguments in Julia yet.)

I chose simulated annealing as my first optimization algorithm to implement in Julia because it’s a natural relative of the Metropolis algorithm that I used in the previous post. Indeed, coding up an implementation of SA made me appreciate the fact that the Metropolis algorithm as used in Bayesian statistics can be considered a special case of the SA algorithm in which the temperature is always 1 and in which the cost function for a state with posterior probability p is -log(p).

Coding up the SA algorithm for myself also me made realize why it’s important that it uses an additive comparison of cost functions rather than a ratio (as in the Metropolis algorithm): the ratio goes haywire when the cost function can take on both positive and negative values (which, of course, doesn’t matter for Bayesian methods because probabilities are strictly non-negative). I discovered this when I initially tried to code up SA from my inaccurate memory without first consulting the literature and discovered that I couldn’t get a ratio-based algorithm to work no matter how many times I tried changing the cooling schedule.

To test out the SA implementation I’ve written, I’ve written two tests cases that attempt to minimize the Rosenbrock and Himmelbrau functions, which I found listed as examples of hard-to-minimize functions in the Wikipedia description of the Nelder-Mead method. You can see those two examples below this paragraph. In addition, I’ve used R to generate plots showing how the SA algorithm works under repeated application on the same optimization problem; in these plots, I’ve used a heatmap to show the cost functions value at each (x, y) position, colored crosshairs to indicate the position of a true minimum of the function in question, and red dots to indicate the purported solutions found by my implementation of SA.

Finding the Minimum of the Rosenbrock Function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Find a solution of the Rosenbrock function using SA.
load("simulated_annealing.jl")
load("../rng.jl")
 
function rosenbrock(x, y)
  (1 - x)^2 + 100(y - x^2)^2
end
 
function neighbors(z)
  [rand_uniform(z[1] - 1, z[1] + 1), rand_uniform(z[2] - 1, z[2] + 1)]
end
 
srand(1)
 
solution = simulated_annealing(z -> rosenbrock(z[1], z[2]),
                               [0, 0],
                               neighbors,
                               log_temperature,
                               10000,
                               true,
                               false)


Finding the Minima of the Himmelbrau Function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Find a solution of the Himmelbrau function using SA.
load("simulated_annealing.jl")
load("../rng.jl")
 
function himmelbrau(x, y)
  (x^2 + y - 11)^2 + (x + y^2 - 7)^2
end
 
function neighbors(z)
  [rand_uniform(z[1] - 1, z[1] + 1), rand_uniform(z[2] - 1, z[2] + 1)]
end
 
srand(1)
 
solution = simulated_annealing(z -> himmelbrau(z[1], z[2]),
                               [0, 0],
                               neighbors,
                               log_temperature,
                               10000,
                               true,
                               false)


Moving Forward
Now that I’ve got a form of SA working, I’m interested in coding up a suite of optimization functions for Julia so that I can start to do maximum likelihood estimation in pure Julia. Once that’s possible, I can use Julia to do real science, e.g. when I need to fit simple models for which finding the MLE is appropriate. (I will leave the development of cleaner statistical functions for special cases of maximum likelihood estimation to more capable people, like Douglas Bates, who has already produced some GLM code.)

At present my code is meant simply to demonstrate how one could write an implementation of simulated annealing in Julia. I’m sure that the code can be more efficient and I suspect that I’ve violated some of the idioms of the language. In addition, I’d much prefer that this function use default values for many of the arguments, as there is no reason that an end-user needs to be concerned with finding the best cooling schedule if SA seems to work out of the box on their problem with the cooling schedule I’ve been using.

With those disclaimers about my code in place, I’d like to think that I haven’t made any mathematical errors and that this function can be used by others. So, I’d ask that those interested please tear apart my code so that I can make it usable as a general purpose function for optimization in Julia. Alternatively, please tell me that there’s no need for a pure Julia implementation of SA because, for example, there are nice C libraries that would be much easier to link to than to re-implement.

With an implementation of SA in place, I’ll probably start working on implementing L-BFGS-S soon, which is the other optimization algorithm I use often in R. (To be honest, I use L-BFGS-S almost exclusively, but SA was much easier to implement.)

Incidentally, this code base demonstrates how I view the relationship between R and Julia: Julia is a beautiful new language that is still missing many important pieces. We can all work together to build the best pieces of R that are missing from Julia. While we’re working on improving Julia, we’ll need to keep using R to handle things like visualization of our results. For this post, I turned back to ggplot2 for all of the graphics generation.
]]></description>
<dc:subject>Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:60a23d91daaa/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/">
    <title>Monkeying with Bayes’ theorem</title>
    <dc:date>2012-03-09T19:01:17+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.

So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.

In other words, if we change Bayes’ theorem (!) the algorithm works better. He goes on to explain

Now should we dig up Bayes and notify him that he was wrong? No, I don’t think that’s it. …

I imagine most statisticians would respond that this cannot possibly be right. While it appears to work, there must be some underlying reason why and we should find that reason before using an algorithm based on an ad hoc tweak.

While such a reaction is understandable, it’s also a little hypocritical. Statisticians are constantly drawing inference from empirical data without understanding the underlying mechanisms that generate the data. When analyzing someone else’s data, a statistician will say that of course we’d rather understand the underlying mechanism than fit statistical models, that just not always possible. Reality is too complicated and we’ve got to do the best we can.

I agree, but that same reasoning applied at a higher level of abstraction could be used to accept Norvig’s translation algorithm. Here’s this model (derived from spurious math, but we’ll ignore that). Let’s see empirically how well it works.



]]></description>
<dc:subject>Statistics Bayesian Probability_and_Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:856f4bf519b8/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Bayesian"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.r-bloggers.com/data-visualization/">
    <title>Data visualization</title>
    <dc:date>2012-03-04T22:37:45+00:00</dc:date>
    <link>http://www.r-bloggers.com/data-visualization/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[
(This article was first published on   Research tips » R, and kindly contributed to R-bloggers)      


For those who have not read the seminal works of Tufte and Cleveland, please hang your heads in shame. To salvage some sense of self-worth, you can then head over to Solomon Messing’s blog where he is starting a series on data visualization based on the principles developed by Tufte and Cleveland (with R examples).

The classics are also worth reading, and remain relevant despite the 20 or 30 years that have elapsed since they appeared.





To leave a comment for the author, please follow the link and comment on his blog:  Research tips » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

]]></description>
<dc:subject>R_bloggers graphics R statistics Uncategorized</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:c48507b81035/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R_bloggers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:graphics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Uncategorized"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.r-bloggers.com/abc-in-roma-r-lab-2/">
    <title>ABC in Roma [R lab #2]</title>
    <dc:date>2012-03-02T06:07:14+00:00</dc:date>
    <link>http://www.r-bloggers.com/abc-in-roma-r-lab-2/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[
(This article was first published on   Xi'an's Og » R, and kindly contributed to R-bloggers)      


Here are the R codes of the second R lab organised by Serena Arima in supplement of my lectures (now completed!). This morning I covered ABC model choice and the following example is the benchmark used in the course (and in the paper) about the impact of summary statistics. (Warning! It takes a while to run…)


n.iter=10000
n=c(10,100,1000)
n.sims=100
prob.m1=matrix(0,nrow=n.sims,ncol=length(n))
prob.m2=prob.m1

### Simulation from Normal model
for(sims in 1:n.sims){

## True data generation from the Normal distribution and summary statistics
for(i in 1:length(n)){
 y.true=rnorm(n[i],0,1)
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m1[sims,i]=sum(dist.m1<epsilon)/(2*n.iter*0.01)
}}

### Simulation from the Laplace model
for(sims in 1:n.sims){

## True data generation from the Laplace distribution and summary statistics
 for(i in 1:length(n)){
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}

# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}

# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}}

# Visualize the results
boxplot(data.frame(prob.m1[,1],prob.m2[,1]),
names=c("M1","M2"),main="N=10")
boxplot(data.frame(prob.m1[,2],prob.m2[,2]),
names=c("M1","M2"),main="N=100")
boxplot(data.frame(prob.m1[,3],prob.m2[,3]),
names=c("M1","M2"),main="N=1000")


Once again, I had a terrific time teaching this “ABC in Roma” course, and not only for the immense side benefit of enjoy Roma in a particularly pleasant weather (for late February).

Filed under: R, Statistics, University life Tagged: ABC, La Sapienza, PhD course, R, Roma        

To leave a comment for the author, please follow the link and comment on his blog:  Xi'an's Og » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

]]></description>
<dc:subject>R_bloggers ABC La_Sapienza PhD_course R Roma statistics University_life</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:65af70b24476/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R_bloggers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:ABC"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:La_Sapienza"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:PhD_course"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Roma"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:University_life"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.r-bloggers.com/modeling-trick-the-signed-pseudo-logarithm/">
    <title>Modeling Trick: the Signed Pseudo Logarithm</title>
    <dc:date>2012-03-02T05:19:42+00:00</dc:date>
    <link>http://www.r-bloggers.com/modeling-trick-the-signed-pseudo-logarithm/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[
(This article was first published on   Win-Vector Blog » R, and kindly contributed to R-bloggers)      


Much of the data that the analyst uses exhibits extraordinary range.  For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World).  Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling).  One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression.  So data from distributions like the lognormal is well served by a log() transformation (that transforms the data closer to Gaussian) prior to analysis.  However, not all data is appropriate for a log-transform (such as data with zero or negative values).  We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss).Log-transforming data is essential when analyzing systems that operate in relative terms or are “scale invariant” (such as financial returns).   For example Geometric Brownian motion  is a stochastic process central to a number of financial models (such as option pricing).  Geometric Brownian motion is actually the exponential of a standard linear Brownian motion (where increments are normal or Gaussian).   In this case a logarithmic transformation actually moves from the observed data to a more natural frame of reference (where increments are additive instead of being multiplicative).  However, a major shortcoming of the log transform is its inability to deal with zero values and negative values.

A signed value we have often been asked to characterize or predict is: profit and loss (often called P&L).    One natural model for P&L would be as the difference between a revenue and an expense.  If the revenue and expense were both normally distributed then their difference would also be normal (and we would not need any stabilizing transform).  However, if the revenue and expense were both log-normally distributed (say both proportional to task size or some other log-normal parameter) then the difference is not normal (it retains the propensity for extreme values or heavy tails of the original distributions).  And for many financial size measures (company size, contract size and so on) the log-normal distribution is a much more realistic model than the normal distribution.  In some situations P&L’s are formed from completely observed revenues and expenses (so we can model everything without sign problems), in other situations the signed P&L from an unobserved (or unrecorded) underling process and we are forced to deal with signed quantities.

For signed data we suggest the following transformation (code in R):



pseudoLog10 <- function(x) { asinh(x/2)/log(10) }



asinh() is a somewhat ugly function that is the inverse of sinh().   sinh() is defined as:



sinh(x) = (e^x - e^(-x))/2



The important point is for x such that |x| is large 2*sinh(x) rapidly approaches sign(x)*e^(|x|).  Thus we should expect asinh(x/2) to look a lot like sign(x)*log(|x|) (which is why we call it a signed pseudo logarithm).   For pseudoLog10() we take the previous function divided by log(10) to ensure that we are in log-10 like units (i.e. pseudoLog10(100) is nearly 2, pseudoLog10(1000) is nearly 3 and so on).  Business audiences tend to have an easier time with log-10 (or dB) units (which can be explained as counting the number of decimal digits) than natural log or log-e units.

So for large positive numbers pseudoLog10() pretty much behaves like log10() (itself a standard transform).  In fact pseudoLog10() has the following nice properties:


pseudoLog10(x) is defined for all real x.
pseudoLog10(0) = 0.
pseudoLog10(-x) = -pseudoLog10(x).
pseudoLog10(x) is monotone in x.
For x such that |x| is large: pseudoLog10(x) is very near sign(x)*log10(|x|).

We strongly recomend trying this transformation before feeding heavy tail data into a linear or logistic model.

However, we can not  recommend the transformation for presentation.   Consider the simple case of plotting the distribution or density of normal data with mean zero and standard-deviation 10 (see  My Favorite Graphs for description of a density plot):





library(ggplot2)
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
d <- data.frame(x=rnorm(n=1000,sd=100))
ggplot(d) + geom_density(aes(x=x))






The density plot shows what we would expect- a near normal distribution (most points towards the center and mass falling off quickly as we move away).  However, the plot of the pseudoLog10 transformed data is not what we would hope:





d$pseudoLog10x <- pseudoLog10(d$x)
ggplot(d) + geom_density(aes(x=pseudoLog10x))






The data density (falsely) appears bimodal!  This is because the pseudoLog10() transform is compressing ranges more and more violently as we move away from the origin (and not compressing near the origin).  So as we move away from origin: the product of the real data density times the degree of range compression climbs, achieves a maximum and then falls.   This phenomena (which is just a “change of variables” for densities) gives us the bimodal appearance for unimodal distributions that have significant mass outside of the range [-10,10].   The bimodal appearance is mostly a fact about the transform not really a feature of the underlying data.

We see value in examining at the relative sizes and centers of these two modes for asymmetric distributions (such as the profit and loss statement for a set of accounts that are mostly losing money).  The position and relative sizes of the modes gives us an initial hint what to look for (helps with questions like: “are total losses driven by many accounts or by few accounts” and so on).    We can not, however, recommend the pseudoLog10() transform for presentation.  The most striking feature of the graph is almost always the bimodal appearance of the data; and the bimodal appearance is almost always an artifact of the transform (not a real feature of the data).   You can not in good conscious push a presentation where the most prominent and exciting observation is not in fact in the data.

We do still recommend trying the pseudoLog10() transform when building a linear or logistic model with wide ranged data.  The transformation usefully compresses range which allows the modeled coefficients to be a function of most of the data and not a function of a few extreme values.  Models that depend on most of their data (or on central estimates from their data) tend to be safer, achieve higher statistical significance and cross-validate more reliably.  Models that are dominated by a few extreme values tend to be unsafe, not achieve statistical significance and not cross-validate reliably.  The bimodal artifact can work in the favor of modeling as it tends to compress a transformed variable into “typical positive example” and “typical negative example” while still allowing magnitudes to enter the model in some form.

Used with care the pseudoLog10() or arcsinh() transform can be an important data preparation step for signed data with large range.  Many financial summaries (such as P&L) meet these conditions and often profit from the transform.

Related posts:
Your Data is Never the Right Shape



To leave a comment for the author, please follow the link and comment on his blog:  Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

]]></description>
<dc:subject>R_bloggers arcsinh R singed_pseudo_logarithm stabilizing_transform statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:d3b534855889/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R_bloggers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:arcsinh"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:singed_pseudo_logarithm"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:stabilizing_transform"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2012/02/29/comparing-r-to-smoking/">
    <title>Comparing R to smoking</title>
    <dc:date>2012-02-29T20:46:29+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2012/02/29/comparing-r-to-smoking/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Francois Pinard comparing the R programming language to smoking:

Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.

I’ve never gotten to the point that I would call using R pleasurable.

Quote via Stats in the Wild

Related posts:

Reviews of R in Action and The Art of R Programming
R quirks

]]></description>
<dc:subject>Statistics Rstats</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:6cd625f8ee7e/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Rstats"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://chrisblattman.com/2012/02/17/dear-statisticians-please-start-using-your-powers-for-good-not-evil/">
    <title>Dear statisticians: Please start using your powers for good not evil</title>
    <dc:date>2012-02-17T15:44:19+00:00</dc:date>
    <link>http://chrisblattman.com/2012/02/17/dear-statisticians-please-start-using-your-powers-for-good-not-evil/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[And here I waste all my time predicting outbreaks of violence, when I could be doing this:
As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
…About a year after Pole created his pregnancy-prediction model, a man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry…
“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.
On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
Actually, sir, I don’t think you do.
Full article. h/t @justinwolfers and Forbes
 
   
]]></description>
<dc:subject>statistics forecasting prediction</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:a5311996a72e/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:forecasting"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:prediction"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://dl.acm.org/citation.cfm?id=1195284">
    <title>Bayesian probabilistic reasoning applied to mathematical epidemiology for predictive spatiotemporal analysis of infectious diseases</title>
    <dc:date>2012-02-14T19:39:02+00:00</dc:date>
    <link>http://dl.acm.org/citation.cfm?id=1195284</link>
    <dc:creator>rahuldave</dc:creator><dc:subject>epidemiology statistics</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:rahuldave/b:0a80d6d9bc44/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:epidemiology"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://stat.asu.edu/~chavez/CCCPUB/Towards%20real%20time%20epidemiology%20data%20assimilation,%20modeling%20and%20anomaly%20detection%20of%20health%20surveillance%20data%20streams.pdf">
    <title>[untitled]</title>
    <dc:date>2012-02-14T19:38:34+00:00</dc:date>
    <link>http://stat.asu.edu/~chavez/CCCPUB/Towards%20real%20time%20epidemiology%20data%20assimilation,%20modeling%20and%20anomaly%20detection%20of%20health%20surveillance%20data%20streams.pdf</link>
    <dc:creator>rahuldave</dc:creator><dc:subject>statistics epidemiology</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:rahuldave/b:4b66882af2ee/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:epidemiology"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.lancs.ac.uk/staff/diggle/ENAR_slides.pdf">
    <title>[untitled]</title>
    <dc:date>2012-02-14T19:38:18+00:00</dc:date>
    <link>http://www.lancs.ac.uk/staff/diggle/ENAR_slides.pdf</link>
    <dc:creator>rahuldave</dc:creator><dc:subject>statistics epidemiology</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:rahuldave/b:eb2de3539c1f/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:epidemiology"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://feeds.arstechnica.com/~r/arstechnica/index/~3/NzKuPgcpSdY/seeing-a-power-law-in-data-doesnt-make-it-real.ars">
    <title>Linking correlation to causation with power laws and scale free systems</title>
    <dc:date>2012-02-09T19:30:19+00:00</dc:date>
    <link>http://feeds.arstechnica.com/~r/arstechnica/index/~3/NzKuPgcpSdY/seeing-a-power-law-in-data-doesnt-make-it-real.ars</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[  
  
  
  

        
    
An essential part of science involves finding correlations between two sets of measurements and seeking explanations for those correlations. However, relationships can be suggested by data even when they don't actually exist, and correlations may occur due to random fluctuations rather than a deep underlying principle (as the infamous "correlation does not equal causation" cliché suggests). These errors are easy to make, and the scientific literature is full of them.



So how can researchers establish if a correlation is both real and meaningful? In a Perspective in the February 10 issue
of Science, Michael P.H. Stumpf and Mason A. Porter
examine the type of correlation known as a power law, where one set of measurements is related to a second via an exponent. They argue that two things must be in place for a power law
to be valid as a predictive model: it must hold over a wide range of data to eliminate chance associations, and it must have a plausible mechanism to explain why the correlation showed up in the data.
    
          
      
        
    


      Read the comments on this post


   
]]></description>
<dc:subject>News Science correlations powerlaw statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:2b1eec98d001/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:News"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Science"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:correlations"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:powerlaw"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2012/02/01/the-universal-solvent-of-statistics/">
    <title>The universal solvent of statistics</title>
    <dc:date>2012-02-01T16:02:21+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2012/02/01/the-universal-solvent-of-statistics/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Andrew Gelman just posted an interesting article on the philosophy of Bayesian statistics. Here’s my favorite passage.

This reminds me of a standard question that Don Rubin … asks in virtually any  situation:  “What would you do if you had all the data?”  For me, that  “what would you do” question is one of the universal solvents of  statistics.

Emphasis added.

I had not heard Don Rubin’s question before, but I think I’ll be asking it often. It reminds me of Alice’s famous dialog with the Cheshire Cat:

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to,” said the Cat.

“I don’t much care where–” said Alice.

“Then it doesn’t matter which way you go,” said the Cat.



Related post: Irrelevant uncertainty

]]></description>
<dc:subject>Statistics Uncategorized Bayesian Probability_and_Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:c7d00ee2e732/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Uncategorized"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Bayesian"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2012/01/02/r-in-action/">
    <title>R in Action</title>
    <dc:date>2012-01-02T20:16:15+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2012/01/02/r-in-action/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Comments]]></description>
<dc:subject>Statistics Books Probability_and_Statistics Rstats</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:8d9bdb6d3ec6/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Books"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Rstats"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://feedproxy.google.com/~r/TheEndeavour/~3/gBBNTQY8bwk/">
    <title>R in Action</title>
    <dc:date>2012-01-02T16:32:31+00:00</dc:date>
    <link>http://feedproxy.google.com/~r/TheEndeavour/~3/gBBNTQY8bwk/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[No Starch Press sent me a copy of The Art of R Programming last Fall and I wrote a review of it here. Then a couple weeks ago, Manning sent me a copy of R in Action. Here I’ll give a quick comparison of the two books, then focus specifically on R in Action.





Comparing R books

Norman Matloff, author of The Art of R Programming, is a statistician-turned-computer scientist. As the title may imply, Matloff’s book has more of a programmer’s perspective on R as a language.

Robert Kabacoff, author of R in Action, is a psychology professor-turned-statistical consultant. And as its title may imply, Kabacoff’s book is more about using R to analyze data. That is, the book is organized by analytical task rather than by language feature.

Many R books are organized like a statistical text. In fact, many are statistics texts, organized according to the progression of statistical theory with R code sprinkled in. R in Action is organized roughly in the order of steps one would take to analyze data, starting with importing data and ending with producing reports.

In short, The Art of R Programming is for programmers, R in Action is for data analysts, and most other R books I’ve seen are for statisticians. Of course a typical R user is to some extent a programmer, an analyst, and a statistician. But this comparison gives you some idea which book you might want to reach for depending on which hat you’re wearing at the moment. For example, I’d pick up The Art of R Programming if I had a question about interfacing R and C, but I’d pick up R in Action if I wanted to read about importing SAS data or using the ggplot2 graphics package.

R in Action

Kabacoff begins his book off with two appropriate quotes.

What is the use of a book, without pictures or conversations? — Alice, Alice in Wonderland

It’s wonderous, with treasures to satiate desires both subtle and gross; but it’s not for the timid. — Q,  “Q Who?” Star Trek: The Next Generation

R in Action is filled with pictures and conversations. It is also a treasure chest of practical information.

The first third of the book concerns basic data management and graphics. This much of the book would be accessible to someone with no background in statistics. The middle third of the book is devoted to basic statistics: correlation, linear regression, etc. The final third of the book contains more advanced statistics and graphics. (I was pleased to see the book has an appendix on using Sweave and odfWeave to produce reports.)

R in Action includes practical details that I have not seen in other books on R. Perhaps this is because the book is focused on analyzing and graphing data rather than exploring the dark corners of R or rounding out statistical theory.

Kabacoff says that he wrote the book that he wishes he’d had years ago. I also wish I’d had his book years ago.

Related links:

R programming for those coming from other languages (referenced in R in Action)

Calling C++ from R

Better R console fonts

]]></description>
<dc:subject>Statistics Books Probability_and_Statistics Rstats</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:1e2ce682352d/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Books"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Rstats"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2011/04/20/teaching-bayesian-stats-backward/">
    <title>Teaching Bayesian stats backward</title>
    <dc:date>2011-04-20T15:04:42+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2011/04/20/teaching-bayesian-stats-backward/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Most presentations of Bayesian statistics I’ve seen start with elementary examples of Bayes’ Theorem. And most of these use the canonical example of testing for rare diseases. But the connection between these examples and Bayesian statistics is not obvious at first. Maybe this isn’t the best approach.

What if we begin with the end in mind? Bayesian calculations produce posterior probability distributions on parameters. An effective way to teach Bayesian statistics might be to start there. Suppose we had probability distributions on our parameters. Never mind where they came from. Never mind classical objections that say you can’t do this. What if you could? If you had such distributions, what could you do with them?

For starters, point estimation and interval estimation become trivial. You could, for example, use the distribution mean as a point estimate and the area between two quantiles as an interval estimate. The distributions tell you far more than  point estimates or interval estimates could; these estimates are simply summaries of the information contained in the distributions.

It makes logical sense to start with Bayes’ Theorem since that’s the tool used to construct posterior distributions. But I think it makes pedagogical sense to start with the posterior distribution and work backward to how one would come up with such a thing.

Bayesian statistics is so named because Bayes’ Theorem is essential to its calculations. But that’s a little like classical statistics Central Limitist statistics because it relies heavily on the Central Limit Theorem.

The key idea of Bayesian statistics is to represent all uncertainty by probability distributions. That idea can be obscured by an early emphasis on calculations.

Related posts:

Interview with David Spiegelhalter
Occam’s razor and Bayes’ theorem
Four reasons to use Bayesian inference

]]></description>
<dc:subject>Statistics Bayesian Education</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:941288c31c21/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Bayesian"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Education"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2011/04/14/significance-testing-and-congress/">
    <title>Significance testing and Congress</title>
    <dc:date>2011-04-14T13:55:51+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2011/04/14/significance-testing-and-congress/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[The US Supreme Court’s criticism of significance testing has been in the news lately. Here’s a criticism of significance testing involving the US Congress. Consider the following syllogism.


If a person is an American, he is not a member of Congress.
This person is a member of Congress.
Therefore he is not American.

The initial premise is false, but the reasoning is correct if we assume the initial premise is true.

The premise that Americans are never members of Congress is clearly false. But it’s almost true! The probability of an American being a member of Congress is quite small, about 535/309,000,000. So what happens if we try to salvage the syllogism above by inserting “probably” in the initial premise and conclusion?


If a person is an American, he is probably not a member of Congress.
This person is a member of Congress.
Therefore he is probably not American.

What went wrong? The probability is backward. We want to know the   probability that someone is American given he is a member of   Congress, not the probability he is a member of Congress given he is American.

Science continually uses flawed reasoning analogous to the example above. We start with a “null hypothesis,” a hypothesis we seek to disprove. If our data are highly unlikely assuming this hypothesis, we reject that hypothesis.


If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

Again the probability is backward. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.

We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!

I found this illustration in The Earth is Round (p < 0.05) by Jacob Cohen (1994). Cohen in turn credits Pollard and Richardson (1987) in his references.

Related posts:

How insignificant is significance testing?
Five criticisms of significance testing
Most published research results are false
Classical statistics in a nutshell

]]></description>
<dc:subject>Statistics Bayesian Probability_and_Statistics Science</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:22c91fe22f79/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Bayesian"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Science"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2011/04/13/pericchi-statistical-significance/">
    <title>How insignificant is statistical significance?</title>
    <dc:date>2011-04-13T20:48:56+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2011/04/13/pericchi-statistical-significance/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Luis Pericchi sent me a brief note commenting on the recent US Supreme Court decision involving statistical significance and medical reporting. Here is his paper, about a page and a half.

How insignificant is statistical significance? (PDF)

Related post: Significance testing and Congress

]]></description>
<dc:subject>Statistics Probability_and_Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:1fdfa4260e22/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://chrisblattman.com/2011/04/11/how-i-regard-almost-every-empirical-development-or-conflict-paper-i-know/">
    <title>How I regard almost every empirical development or conflict paper I know</title>
    <dc:date>2011-04-11T13:14:05+00:00</dc:date>
    <link>http://chrisblattman.com/2011/04/11/how-i-regard-almost-every-empirical-development-or-conflict-paper-i-know/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[
Man I love xkcd.
 
   
]]></description>
<dc:subject>research statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:149625a03fe2/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:research"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2010/04/29/simple-approximation-to-normal-distribution/">
    <title>Simple approximation to normal distribution</title>
    <dc:date>2010-04-29T14:50:17+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2010/04/29/simple-approximation-to-normal-distribution/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Here’s a simple approximation to the normal distribution I just ran across. The density function is

f(x) = (1 + cos(x))/2π

over the interval (-π, π). The plot below graphs this density with a solid blue line. For comparison, the density of a normal distribution with the same variance is plotted with a dashed orange line.



The approximation is good enough to use for teaching. Students may benefit from doing an exercise twice, once with this approximation and then again with the normal distribution. Having an approximation they can integrate in closed form may help take some of the mystery out of the normal distribution.

The approximation may have practical uses. The agreement between the PDFs isn’t great. However, the agreement between the CDFs (which is more important) is surprisingly good. The maximum difference between the two CDFs is only 0.018. (The differences between the PDFs oscillate, and so their integrals, the CDFs, are closer together.)

I ran across this approximation here. It goes back to the 1961 paper “A cosine approximation to the normal distribution” by D. H. Raab and E. H. Green, Psychometrika, Volume 26, pages 447-450.

Update 1: See the paper referenced in the first comment. It gives a much more accurate approximation using a logistic function. The cosine approximation is a little simpler and may be better for teaching. However, the logistic approximation has infinite support. That could be an advantage since students might be distracted by the finite support of the cosine approximation.

The logistic approximation for the standard normal CDF is

F(x) = 1/(1 + exp(-0.07056 x3 – 1.5976 x))

and has a maximum error of 0.00014 at x = ± 3.16.

Update 2:

How might you use this approximation the other way around, approximating a cosine by a normal density? See Always invert.

Related posts:

Rolling dice for normal samples
Sums of uniform random variables

]]></description>
<dc:subject>Math Statistics Probability_and_Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:847189f70cbf/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Math"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.mailund.dk/index.php/2010/04/26/is-r-an-epic-fail/">
    <title>Is R an ‘epic fail’?</title>
    <dc:date>2010-04-26T06:03:19+00:00</dc:date>
    <link>http://www.mailund.dk/index.php/2010/04/26/is-r-an-epic-fail/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Is R an ‘epic fail’?

Something as popular and widespread as R can hardly be called a ‘failure’ in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.

For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language.  If you don’t want to do any programming, don’t muck about with R!  There are lots of visualisation tools and statistical tools that are much easier to use.

Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn’t a bad choice.

You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to there really isn’t any way around it.  Just like there isn’t any Royal Road to mathematics (as Euclid is supposed to have said).

Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?
]]></description>
<dc:subject>Work programming R statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:a98065d24020/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Work"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:programming"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:R"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/">
    <title>Estimating the chances of something that hasn’t happened yet</title>
    <dc:date>2010-03-30T13:51:23+00:00</dc:date>
    <link>http://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/</link>
    <dc:creator>rahuldave</dc:creator><description><![CDATA[Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos. Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?

To take another example, suppose you are testing children for perfect pitch. You’ve tested 100 children so far and haven’t found any with perfect pitch. Do you conclude that children don’t have perfect pitch? You know that some do because you’ve heard of instances before. But your data suggest perfect pitch in children is at least rare. But how rare?

The rule of three gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. In the perfect pitch example, you could conclude that fewer than 3% of children have perfect pitch.

Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.” The situation is a little different with the perfect pitch example, however, because you may know before you start that the probability cannot be zero.

If the sight of math makes you squeamish, you might want to stop reading now. Just remember that if you haven’t seen something happen in N observations, a good estimate is that the chances of it happening are less than 3/N.

What makes the rule of three work? Suppose the probability of what you’re looking for is p. If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05, i.e. we want to solve (1-p)n = 0.05 for p. Taking logs of both sides, n log(1-p) = log(0.05) ≈ -3. Since log(1-p) is approximately -p for small values of p, we have p ≈ 3/n.

The derivation above gives the frequentist perspective. I’ll now give the Bayesian derivation of the same result. Then you can say “p is probably less than 3/N” in clear conscience since Bayesians are allowed to make such statements.

Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 – exp(-3) as N gets large, and 1 – exp(-3) ≈ 0.95.

Related posts:

What  is a confidence interval?
Probability mistake can give a good approximation
Four reasons to use Bayesian inference

]]></description>
<dc:subject>Statistics Bayesian Probability_and_Statistics</dc:subject>
<dc:identifier>https://pinboard.in/u:rahuldave/b:55bc56ab748b/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Bayesian"/>
	<rdf:li rdf:resource="https://pinboard.in/u:rahuldave/t:Probability_and_Statistics"/>
</rdf:Bag></taxo:topics>
</item>
</rdf:RDF>