<?xml version="1.0" encoding="UTF-8"?>
 <rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (cshalizi)</title>
    <link>https://pinboard.in/u:cshalizi/public/</link>
    <description>recent bookmarks from cshalizi</description>
    <items>
      <rdf:Seq>	<rdf:li rdf:resource="https://arxiv.org/abs/2502.20755"/>
	<rdf:li rdf:resource="https://dspace.mit.edu/handle/1721.1/155358"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/2109.03582"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/2105.03481"/>
	<rdf:li rdf:resource="https://projecteuclid.org/euclid.aos/1611889233"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/2012.09828"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/1506.02785"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/1903.11117"/>
	<rdf:li rdf:resource="https://projecteuclid.org/euclid.aos/1597370670"/>
	<rdf:li rdf:resource="https://projecteuclid.org/euclid.ejs/1576573369"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/1810.11953"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/1910.08883"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/1909.13464"/>
	<rdf:li rdf:resource="https://arxiv.org/abs/1602.02210"/>
	<rdf:li rdf:resource="http://auai.org/uai2015/proceedings/papers/230.pdf"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1411.2045"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1409.2344"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1407.1212"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1405.0558"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1001.0591"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1406.2083"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1210.4584"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1307.1954"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1207.6076"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1305.0423"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1304.4564"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1304.5939"/>
	<rdf:li rdf:resource="http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00442"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1304.0796"/>
	<rdf:li rdf:resource="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.ejs/1347974672"/>
	<rdf:li rdf:resource="http://normaldeviate.wordpress.com/2012/07/14/modern-two-sample-tests/"/>
	<rdf:li rdf:resource="http://jmlr.csail.mit.edu/papers/v13/gretton12a.html"/>
	<rdf:li rdf:resource="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aos/1176350835"/>
	<rdf:li rdf:resource="http://arxiv.org/abs/1202.1561"/>
	<rdf:li rdf:resource="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aoms/1177731355"/>
	<rdf:li rdf:resource="http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145469&amp;arnumber=6018305&amp;tag=1"/>
	<rdf:li rdf:resource="http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.tm10576"/>
      </rdf:Seq>
    </items>
  </channel><item rdf:about="https://arxiv.org/abs/2502.20755">
    <title>[2502.20755] Minimax Optimal Kernel Two-Sample Tests with Random Features</title>
    <dc:date>2025-03-16T19:31:58+00:00</dc:date>
    <link>https://arxiv.org/abs/2502.20755</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy) for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently, minimax optimal two-sample tests have been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the computational complexity of the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter and the kernel. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similar (with a small drop in power) to the exact test."]]></description>
<dc:subject>to:NB hilbert_space statistics two-sample_tests random_features</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:e765167ec668/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hilbert_space"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:random_features"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://dspace.mit.edu/handle/1721.1/155358">
    <title>Likelihood-Free Hypothesis Testing and Applications of the Energy Distance</title>
    <dc:date>2024-12-06T14:04:25+00:00</dc:date>
    <link>https://dspace.mit.edu/handle/1721.1/155358</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["This thesis studies questions in nonparametric testing and estimation that are inspired by machine learning. One of the main problems of our interest is likelihood-free hypothesis testing: given three samples X, Y and Z with sample sizes n, n and m respectively, one must decide whether the distribution of Z is closer to that of X or that of Y. We fully characterize the problem’s sample complexity for multiple distribution classes and with high probability. We uncover connections to two-sample, goodness-of-fit and robust testing, and show the existence of a trade-off of the form mn ≍ k/ε^4, where k is an appropriate notion of complexity and ε is the total variation separation between the distributions of X and Y. We generalize our problem to allow Z to come from a mixture of the distributions of X and Y, and propose a kernel-based test for its solution, and also verify the existence of a trade-off between m and n on experimental data from particle physics. In addition, we demonstrate that the family of “classifier accuracy” tests are not only popular in practice but also provably near-optimal, recovering and simplifying a multitude of classical and recent results. Finally, we study affine classifiers as a tool for estimation and testing, with the key technical tool being a connection to the energy distance. In particular, we propose a density estimation routine based on minimizing the generalized energy distance, targeting smooth densities and Gaussian mixtures. We interpret our results in terms of half-space separability over these classes, and derive analogous results for discrete distributions. As a consequence we deduce that any two discrete distributions are well-separated by a half-space, provided their support is embedded as a packing of a high-dimensional unit ball. We also scrutinize two recent applications of the energy distance in the two-sample testing literature."
]]></description>
<dc:subject>to:NB to_read hypothesis_testing two-sample_tests statistics via:_onionesque kernel_methods goodness-of-fit</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:d56e7c266c6a/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:via:_onionesque"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:goodness-of-fit"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/2109.03582">
    <title>[2109.03582] Higher Order Kernel Mean Embeddings to Capture Filtrations of Stochastic Processes</title>
    <dc:date>2023-02-15T19:57:50+00:00</dc:date>
    <link>https://arxiv.org/abs/2109.03582</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Stochastic processes are random variables with values in some space of paths. However, reducing a stochastic process to a path-valued random variable ignores its filtration, i.e. the flow of information carried by the process through time. By conditioning the process on its filtration, we introduce a family of higher order kernel mean embeddings (KMEs) that generalizes the notion of KME and captures additional information related to the filtration. We derive empirical estimators for the associated higher order maximum mean discrepancies (MMDs) and prove consistency. We then construct a filtration-sensitive kernel two-sample test able to pick up information that gets missed by the standard MMD test. In addition, leveraging our higher order MMDs we construct a family of universal kernels on stochastic processes that allows to solve real-world calibration and optimal stopping problems in quantitative finance (such as the pricing of American options) via classical kernel-based regression methods. Finally, adapting existing tests for conditional independence to the case of stochastic processes, we design a causal-discovery algorithm to recover the causal graph of structural dependencies among interacting bodies solely from observations of their multidimensional trajectories."

--- ETA after reading: This feels like a strange paper.  I'm not sure I truly understand what their "signature statistics" do, nor do I quite get the claimed advantage of higher-order process kernels over "first-order" kernels.  (Proofs are referred to other, older papers.)  And the notion of "causality" between processes seems very weird, since I don't see how it accounts for the flow of time, and of influence, within or across processes; they're being treated like big but indecomposable objects.  Probably should track down references and see if this makes more sense when I put those together.]]></description>
<dc:subject>to:NB stochastic_processes kernel_methods causal_discovery time_series statistical_inference_for_stochastic_processes hilbert_space re:codename:catherine_wheel two-sample_tests statistics have_read path_signatures</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:105636bb7bad/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:stochastic_processes"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:causal_discovery"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:time_series"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistical_inference_for_stochastic_processes"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hilbert_space"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:codename:catherine_wheel"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:path_signatures"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/2105.03481">
    <title>[2105.03481] Stein's Method Meets Statistics: A Review of Some Recent Developments</title>
    <dc:date>2021-05-12T18:13:21+00:00</dc:date>
    <link>https://arxiv.org/abs/2105.03481</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Stein's method is a collection of tools for analysing distributional comparisons through the study of a class of linear operators called Stein operators. Originally studied in probability, Stein's method has also enabled some important developments in statistics. This early success has led to a high research activity in this area in recent years. The goal of this survey is to bring together some of these developments in theoretical statistics as well as in computational statistics and, in doing so, to stimulate further research into the successful field of Stein's method and statistics. The topics we discuss include: explicit error bounds for asymptotic approximations of estimators and test statistics, a measure of prior sensitivity in Bayesian statistics, tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, control variate techniques, and goodness-of-fit testing."]]></description>
<dc:subject>to:NB steins_method probability statistics monte_carlo two-sample_tests goodness-of-fit re:codename:catherine_wheel</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:42b841f98d00/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:steins_method"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:probability"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:monte_carlo"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:goodness-of-fit"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:codename:catherine_wheel"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://projecteuclid.org/euclid.aos/1611889233">
    <title>Kim, Ramdas, Singh, Wasserman: Classification accuracy as a proxy for two-sample testing</title>
    <dc:date>2021-02-04T15:31:39+00:00</dc:date>
    <link>https://projecteuclid.org/euclid.aos/1611889233</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["When data analysts train a classifier and check if its accuracy is significantly different from chance, they are implicitly performing a two-sample test. We investigate the statistical properties of this flexible approach in the high-dimensional setting. We prove two results that hold for all classifiers in any dimensions: if its true error remains ϵ-better than chance for some ϵ>0 as d,n→∞, then (a) the permutation-based test is consistent (has power approaching to one), (b) a computationally efficient test based on a Gaussian approximation of the null distribution is also consistent. To get a finer understanding of the rates of consistency, we study a specialized setting of distinguishing Gaussians with mean-difference δ and common (known or unknown) covariance Σ, when d/n→c∈(0,∞). We study variants of Fisher’s linear discriminant analysis (LDA) such as “naive Bayes” in a nontrivial regime when ϵ→0 (the Bayes classifier has true accuracy approaching 1/2), and contrast their power with corresponding variants of Hotelling’s test. Surprisingly, the expressions for their power match exactly in terms of n, d, δ, Σ, and the LDA approach is only worse by a constant factor, achieving an asymptotic relative efficiency (ARE) of 1/√π for balanced samples. We also extend our results to high-dimensional elliptical distributions with finite kurtosis. Other results of independent interest include minimax lower bounds, and the optimality of Hotelling’s test when d=o(n). Simulation results validate our theory, and we present practical takeaway messages along with natural open problems."]]></description>
<dc:subject>to:NB hypothesis_testing two-sample_tests classifiers high-dimensional_statistics heard_the_talk kith_and_kin singh.aarti wasserman.larry ramdas.aaditya</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:9a8de542290c/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:classifiers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:high-dimensional_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:heard_the_talk"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kith_and_kin"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:singh.aarti"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:wasserman.larry"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:ramdas.aaditya"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/2012.09828">
    <title>[2012.09828] Nonparametric Two-Sample Hypothesis Testing for Random Graphs with Negative and Repeated Eigenvalues</title>
    <dc:date>2020-12-18T10:33:02+00:00</dc:date>
    <link>https://arxiv.org/abs/2012.09828</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We propose a nonparametric two-sample test statistic for low-rank, conditionally independent edge random graphs whose edge probability matrices have negative eigenvalues and arbitrarily close eigenvalues. Our proposed test statistic involves using the maximum mean discrepancy applied to suitably rotated rows of a graph embedding, where the rotation is estimated using optimal transport. We show that our test statistic, appropriately scaled, is consistent for sufficiently dense graphs, and we study its convergence under different sparsity regimes. In addition, we provide empirical evidence suggesting that our novel alignment procedure can perform better than the naïve alignment in practice, where the naïve alignment assumes an eigengap."]]></description>
<dc:subject>to:NB network_data_analysis re:network_differences two-sample_tests</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:aa95516da3d0/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:network_data_analysis"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/1506.02785">
    <title>[1506.02785] On the Error of Random Fourier Features</title>
    <dc:date>2020-12-14T01:48:11+00:00</dc:date>
    <link>https://arxiv.org/abs/1506.02785</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Kernel methods give powerful, flexible, and theoretically grounded approaches to solving many problems in machine learning. The standard approach, however, requires pairwise evaluations of a kernel function, which can lead to scalability issues for very large datasets. Rahimi and Recht (2007) suggested a popular approach to handling this problem, known as random Fourier features. The quality of this approximation, however, is not well understood. We improve the uniform error bound of that paper, as well as giving novel understandings of the embedding's variance, approximation error, and use in some machine learning methods. We also point out that surprisingly, of the two main variants of those features, the more widely used is strictly higher-variance for the Gaussian kernel and has worse bounds."]]></description>
<dc:subject>random_features kernel_methods approximation computational_statistics concentration_of_measure two-sample_tests regression schneider.jeff have_read to_teach:childs_garden_of_statistical_learning_theory in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:bdadd4e91fb9/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:random_features"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:approximation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:computational_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:concentration_of_measure"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:regression"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:schneider.jeff"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_teach:childs_garden_of_statistical_learning_theory"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/1903.11117">
    <title>[1903.11117] Testing for Differences in Stochastic Network Structure</title>
    <dc:date>2020-11-25T14:52:28+00:00</dc:date>
    <link>https://arxiv.org/abs/1903.11117</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["How can one determine whether a community-level treatment, such as the introduction of a social program or trade shock, alters agents' incentives to form links in a network? This paper proposes analogues of a two-sample Kolmogorov-Smirnov test, widely used in the literature to test the null hypothesis of "no treatment effects", for network data. It first specifies a testing problem in which the null hypothesis is that two networks are drawn from the same random graph model. It then describes two randomization tests based on the magnitude of the difference between the networks' adjacency matrices as measured by the 2→2 and ∞→1 operator norms. Power properties of the tests are examined analytically, in simulation, and through two real-world applications. A key finding is that the test based on the ∞→1 norm can be substantially more powerful than that based on the 2→2 norm for the kinds of sparse and degree-heterogeneous networks common in economics."]]></description>
<dc:subject>to:NB network_data_analysis re:network_differences two-sample_tests hypothesis_testing to_read</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:ba62b7e5b0dc/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:network_data_analysis"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://projecteuclid.org/euclid.aos/1597370670">
    <title>Ghoshdastidar, Gutzeit, Carpentier, von Luxburg: Two-sample hypothesis testing for inhomogeneous random graphs</title>
    <dc:date>2020-11-18T21:44:11+00:00</dc:date>
    <link>https://projecteuclid.org/euclid.aos/1597370670</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["The study of networks leads to a wide range of high-dimensional inference problems. In many practical applications, one needs to draw inference from one or few large sparse networks. The present paper studies hypothesis testing of graphs in this high-dimensional regime, where the goal is to test between two populations of inhomogeneous random graphs defined on the same set of n vertices. The size of each population m is much smaller than n, and can even be a constant as small as 1. The critical question in this context is whether the problem is solvable for small m.
"We answer this question from a minimax testing perspective. Let P, Q be the population adjacencies of two sparse inhomogeneous random graph models, and d be a suitably defined distance function. Given a population of m graphs from each model, we derive minimax separation rates for the problem of testing P=Q against d(P,Q)>ρ. We observe that if m is small, then the minimax separation is too large for some popular choices of d, including total variation distance between corresponding distributions. This implies that some models that are widely separated in d cannot be distinguished for small m, and hence, the testing problem is generally not solvable in these cases.
"We also show that if m>1, then the minimax separation is relatively small if d is the Frobenius norm or operator norm distance between P and Q. For m=1, only the latter distance provides small minimax separation. Thus, for these distances, the problem is solvable for small m. We also present near-optimal two-sample tests in both cases, where tests are adaptive with respect to sparsity level of the graphs."]]></description>
<dc:subject>to:NB to_read statistics two-sample_tests network_data_analysis re:network_differences</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:979b239f43f9/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:network_data_analysis"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://projecteuclid.org/euclid.ejs/1576573369">
    <title>Kim, Lee, Lei: Global and local two-sample tests via regression</title>
    <dc:date>2020-11-16T16:11:48+00:00</dc:date>
    <link>https://projecteuclid.org/euclid.ejs/1576573369</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness."]]></description>
<dc:subject>to:NB two-sample_tests nonparametrics high-dimensional_statistics regression kith_and_kin lee.ann_b. lei.jing heard_the_talk</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:d95b8656b5cd/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:nonparametrics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:high-dimensional_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:regression"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kith_and_kin"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:lee.ann_b."/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:lei.jing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:heard_the_talk"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/1810.11953">
    <title>[1810.11953] Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift</title>
    <dc:date>2019-10-29T15:00:54+00:00</dc:date>
    <link>https://arxiv.org/abs/1810.11953</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. Machine learning (ML) systems, however, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently. This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy. We focus on several datasets and various perturbations to both covariates and label distributions with varying magnitudes and fractions of data affected. Interestingly, we show that across the dataset shifts that we explore, a two-sample-testing-based approach, using pre-trained classifiers for dimensionality reduction, performs best. Moreover, we demonstrate that domain-discriminating approaches tend to be helpful for characterizing shifts qualitatively and determining if they are harmful."
]]></description>
<dc:subject>to:NB dataset_shift machine_learning model_checking lipton.zachary two-sample_tests statistics</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:7ff5939f686e/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:dataset_shift"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:machine_learning"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:model_checking"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:lipton.zachary"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/1910.08883">
    <title>[1910.08883] The Exact Equivalence of Independence Testing and Two-Sample Testing</title>
    <dc:date>2019-10-22T13:44:52+00:00</dc:date>
    <link>https://arxiv.org/abs/1910.08883</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Testing independence and testing equality of distributions are two tightly related statistical hypotheses. Several distance and kernel-based statistics are recently proposed to achieve universally consistent testing for either hypothesis. On the distance side, the distance correlation is proposed for independence testing, and the energy statistic is proposed for two-sample testing. On the kernel side, the Hilbert-Schmidt independence criterion is proposed for independence testing and the maximum mean discrepancy is proposed for two-sample testing. In this paper, we show that two-sample testing are special cases of independence testing via an auxiliary label vector, and prove that distance correlation is exactly equivalent to the energy statistic in terms of the population statistic, the sample statistic, and the testing p-value via permutation test. The equivalence can be further generalized to K-sample testing and extended to the kernel regime. As a consequence, it suffices to always use an independence statistic to test equality of distributions, which enables better interpretability of the test statistic and more efficient testing."]]></description>
<dc:subject>to:NB two-sample_tests dependence_measures statistics hypothesis_testing</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:348b88959e6d/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:dependence_measures"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/1909.13464">
    <title>[1909.13464] Network Differential Connectivity Analysis</title>
    <dc:date>2019-10-01T16:17:41+00:00</dc:date>
    <link>https://arxiv.org/abs/1909.13464</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Identifying differences in networks has become a canonical problem in many biological applications. Here, we focus on testing whether two Gaussian graphical models are the same. Existing methods try to accomplish this goal by either directly comparing their estimated structures, or testing the null hypothesis that the partial correlation matrices are equal. However, estimation approaches do not provide measures of uncertainty, e.g., p-values, which are crucial in drawing scientific conclusions. On the other hand, existing testing approaches could lead to misleading results in some cases. To address these shortcomings, we propose a qualitative hypothesis testing framework, which tests whether the connectivity patterns in the two networks are the same. Our framework is especially appropriate if the goal is to identify nodes or edges that are differentially connected. No existing approach could test such hypotheses and provide corresponding measures of uncertainty, e.g., p-values. We investigate theoretical and numerical properties of our proposal and illustrate its utility in biological applications. Theoretically, we show that under appropriate conditions, our proposal correctly controls the type-I error rate in testing the qualitative hypothesis. Empirically, we demonstrate the performance of our proposal using simulation datasets and applications in cancer genetics and brain imaging studies."]]></description>
<dc:subject>to:NB network_data_analysis hypothesis_testing two-sample_tests statistics re:network_differences</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:612fcc0d1d37/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:network_data_analysis"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="https://arxiv.org/abs/1602.02210">
    <title>[1602.02210] Classification accuracy as a proxy for two sample testing</title>
    <dc:date>2019-05-28T17:02:27+00:00</dc:date>
    <link>https://arxiv.org/abs/1602.02210</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["When data analysts train a classifier and check if its accuracy is significantly different from a half, they are implicitly performing a two-sample test. We investigate the statistical optimality of this indirect but flexible method in the high-dimensional setting of d/n→c∈(0,∞). We provide a concrete answer for the case of distinguishing Gaussians with mean-difference δ and common (known or unknown) covariance Σ, by contrasting the indirect approach using variants of linear discriminant analysis (LDA) such as naive Bayes, with the direct approach using corresponding variants of Hotelling's test. Somewhat surprisingly, the indirect approach achieves the same power as the direct approach in terms of n,d,δ,Σ, and is only worse by a constant factor, achieving an asymptotic relative efficiency of 1/π for the balanced sample case. Other results of independent interest are provided, such as minimax lower bounds, and optimality of Hotelling's test when d=o(n). Simulation results validate our theory, and we present practical takeaway messages along with several open problems."]]></description>
<dc:subject>to:NB classifiers two-sample_tests statistics hypothesis_testing kith_and_kin ramdas.aaditya wasserman.larry singh.aarti</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:8d9ae6124edc/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:classifiers"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kith_and_kin"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:ramdas.aaditya"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:wasserman.larry"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:singh.aarti"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://auai.org/uai2015/proceedings/papers/230.pdf">
    <title>Training generative neural networks via maximum mean discrepancy optimization</title>
    <dc:date>2015-07-15T14:02:10+00:00</dc:date>
    <link>http://auai.org/uai2015/proceedings/papers/230.pdf</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic—informally speaking, a good generator network produces samples that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. [2]. We compare to the adversarial nets framework introduced by Goodfellow et al. [1], in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD."

--- On first glance, there's no obvious limitation to neural networks, and indeed it's rather suggestive of indirect inference (to me)]]></description>
<dc:subject>to:NB simulation stochastic_models neural_networks machine_learning two-sample_tests hypothesis_testing nonparametrics kernel_methods statistics computational_statistics ghahramani.zoubin</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:34338d71a393/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:simulation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:stochastic_models"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:neural_networks"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:machine_learning"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:nonparametrics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:computational_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:ghahramani.zoubin"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1411.2045">
    <title>[1411.2045] Multivariate f-Divergence Estimation With Confidence</title>
    <dc:date>2015-01-22T05:04:09+00:00</dc:date>
    <link>http://arxiv.org/abs/1411.2045</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several nonparametric divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of f-divergence between two distributions from a finite number of samples. This estimator has MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error."]]></description>
<dc:subject>estimation entropy_estimation information_theory statistics two-sample_tests in_NB hero.alfred_o._iii</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:0dbac386b55d/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:entropy_estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:information_theory"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hero.alfred_o._iii"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1409.2344">
    <title>[1409.2344] A nonparametric two-sample hypothesis testing problem for random dot product graphs</title>
    <dc:date>2015-01-20T13:33:07+00:00</dc:date>
    <link>http://arxiv.org/abs/1409.2344</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We consider the problem of testing whether two finite-dimensional random dot product graphs have generating latent positions that are independently drawn from the same distribution, or distributions that are related via scaling or projection. We propose a test statistic that is a kernel-based function of the adjacency spectral embedding for each graph. We obtain a limiting distribution for our test statistic under the null and we show that our test procedure is consistent across a broad range of alternatives."]]></description>
<dc:subject>network_data_analysis hypothesis_testing two-sample_tests re:network_differences statistics to_read in_NB to_teach:graphons</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:041f7a5d64d9/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:network_data_analysis"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_teach:graphons"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1407.1212">
    <title>[1407.1212] Comparison of multivariate distributions using quantile-quantile plots and related tests</title>
    <dc:date>2015-01-20T01:59:24+00:00</dc:date>
    <link>http://arxiv.org/abs/1407.1212</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["The univariate quantile-quantile (Q-Q) plot is a well-known graphical tool for examining whether two data sets are generated from the same distribution or not. It is also used to determine how well a specified probability distribution fits a given sample. In this article, we develop and study a multivariate version of the Q-Q plot based on the spatial quantile. The usefulness of the proposed graphical device is illustrated on different real and simulated data, some of which have fairly large dimensions. We also develop certain statistical tests that are related to the proposed multivariate Q-Q plot and study their asymptotic properties. The performance of those tests are compared with that of some other well-known tests for multivariate distributions available in the literature."]]></description>
<dc:subject>goodness-of-fit two-sample_tests statistics in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:f635706b9701/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:goodness-of-fit"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1405.0558">
    <title>[1405.0558] The Falling Factorial Basis and Its Statistical Applications</title>
    <dc:date>2014-12-02T00:39:08+00:00</dc:date>
    <link>http://arxiv.org/abs/1405.0558</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We study a novel spline-like basis, which we name the "falling factorial basis", bearing many similarities to the classic truncated power basis. The advantage of the falling factorial basis is that it enables rapid, linear-time computations in basis matrix multiplication and basis matrix inversion. The falling factorial functions are not actually splines, but are close enough to splines that they provably retain some of the favorable properties of the latter functions. We examine their application in two problems: trend filtering over arbitrary input points, and a higher-order variant of the two-sample Kolmogorov-Smirnov test."]]></description>
<dc:subject>to:NB have_read splines nonparametrics statistics two-sample_tests kith_and_kin tibshirani.ryan</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:dfcdfbd65639/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:splines"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:nonparametrics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kith_and_kin"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:tibshirani.ryan"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1001.0591">
    <title>[1001.0591] Comparing Distributions and Shapes using the Kernel Distance</title>
    <dc:date>2014-10-16T15:15:28+00:00</dc:date>
    <link>http://arxiv.org/abs/1001.0591</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Starting with a similarity function between objects, it is possible to define a distance metric on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis, measure theory and geometric measure theory, and have a rich structure that includes an isometric embedding into a (possibly infinite dimensional) Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. 
"In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: (i) We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that runs in near-linear time in the size of $P \cup Q$ (note that an explicit calculation would take quadratic time). (ii) We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n + poly(1/epsilon, log n)). (iii) We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of point sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets. These take advantage of the reproducing kernel Hilbert space and a new relation linking binary range spaces to continuous range spaces with bounded fat-shattering dimension."]]></description>
<dc:subject>to:NB kernel_estimators two-sample_tests statistics probability re:network_differences have_read</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:4ffff8f257a3/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_estimators"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:probability"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1406.2083">
    <title>[1406.2083] Kernel MMD, the Median Heuristic and Distance Correlation in High Dimensions</title>
    <dc:date>2014-07-12T00:29:26+00:00</dc:date>
    <link>http://arxiv.org/abs/1406.2083</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["This paper is about two related methods for two sample testing and independence testing which have emerged over the last decade: Maximum Mean Discrepancy (MMD) for the former problem and Distance Correlation (dCor) for the latter. Both these methods have been suggested for high-dimensional problems, and sometimes claimed to be unaffected by increasing dimensionality of the samples. We will show theoretically and practically that the power of both methods (for different reasons) does actually decrease polynomially with dimension. We also analyze the median heuristic, which is a method for choosing tuning parameters of translation invariant kernels. We show that different bandwidth choices could result in the MMD decaying polynomially or even exponentially in dimension."]]></description>
<dc:subject>to:NB hypothesis_testing two-sample_tests kernel_estimators dependence_measures kith_and_kin wasserman.larry singh.aarti ramdas.aaditya high-dimensional_statistics</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:ca409b80b289/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_estimators"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:dependence_measures"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kith_and_kin"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:wasserman.larry"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:singh.aarti"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:ramdas.aaditya"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:high-dimensional_statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1210.4584">
    <title>[1210.4584] Two-Sample Testing in High-Dimensional Models</title>
    <dc:date>2014-03-05T14:21:05+00:00</dc:date>
    <link>http://arxiv.org/abs/1210.4584</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We propose novel methodology for testing equality of model parameters between two high-dimensional populations. The technique is very general and applicable to a wide range of models. The method is based on sample splitting: the data is split into two parts; on the first part we reduce the dimensionality of the model to a manageable size; on the second part we perform significance testing (p-value calculation) based on a restricted likelihood ratio statistic. Assuming that both populations arise from the same distribution, we show that the restricted likelihood ratio statistic is asymptotically distributed as a weighted sum of chi-squares with weights which can be efficiently estimated from the data. In high-dimensional problems, a single data split can result in a "p-value lottery". To ameliorate this effect, we iterate the splitting process and aggregate the resulting p-values. This multi-split approach provides improved p-values. We illustrate the use of our general approach in two-sample comparisons of high-dimensional regression models ("differential regression") and graphical models ("differential network"). In both cases we show results on simulated data as well as real data from recent, high-throughput cancer studies."]]></description>
<dc:subject>hypothesis_testing high-dimensional_statistics two-sample_tests statistics re:network_differences in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:e310404d6b50/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:high-dimensional_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1307.1954">
    <title>[1307.1954] B-tests: Low Variance Kernel Two-Sample Tests</title>
    <dc:date>2014-02-11T21:35:29+00:00</dc:date>
    <link>http://arxiv.org/abs/1307.1954</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice."]]></description>
<dc:subject>to:NB two-sample_tests kernel_methods hilbert_space hypothesis_testing statistics</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:94ff051ff32e/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to:NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hilbert_space"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1207.6076">
    <title>[1207.6076] Equivalence of distance-based and RKHS-based statistics in hypothesis testing</title>
    <dc:date>2013-11-17T20:15:33+00:00</dc:date>
    <link>http://arxiv.org/abs/1207.6076</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests."]]></description>
<dc:subject>kernel_methods hilbert_space two-sample_tests statistics nonparametrics to_read in_NB entableted independence_tests</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:4290f416a332/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hilbert_space"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:nonparametrics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:entableted"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:independence_tests"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1305.0423">
    <title>[1305.0423] Testing Hypotheses by Regularized Maximum Mean Discrepancy</title>
    <dc:date>2013-05-03T14:57:03+00:00</dc:date>
    <link>http://arxiv.org/abs/1305.0423</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding results are obtained on: challenging EEG data, MNIST, the Berkley Covertype, and the Flare-Solar dataset."]]></description>
<dc:subject>two-sample_tests statistics hilbert_space kernel_methods to_read in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:bca72578fa28/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hilbert_space"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1304.4564">
    <title>[1304.4564] A high-dimensional two-sample test for the mean using random subspaces</title>
    <dc:date>2013-04-23T18:06:43+00:00</dc:date>
    <link>http://arxiv.org/abs/1304.4564</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["A common problem in genetics is that of testing whether a set of highly dependent gene expressions differ between two populations, typically in a high-dimensional setting where the data dimension is larger than the sample size. Most high-dimensional tests for the equality of two mean vectors rely on naive diagonal or trace estimators of the covariance matrix, ignoring dependencies between variables. A test recently proposed by Lopes et al. (2012) implicitly incorporates dependencies by using random pseudo-projections to a lower-dimensional space. Their test offers higher power when the variables are dependent, but lacks desirable invariance properties and relies on asymptotic p-values that are too conservative. We illustrate how a permutation approach can be used to obtain p-values for the Lopes et al. test and how modifying the test using random subspaces leads to a test statistic that is invariant under linear transformations of the marginal distributions. The resulting test does not rely on assumptions about normality or the structure of the covariance matrix. We show by simulation that the new test has higher power than competing tests in realistic settings motivated by microarray gene expression data. We also discuss the computational aspects of high-dimensional permutation tests and provide an efficient R implementation of the proposed test."]]></description>
<dc:subject>random_projections two-sample_tests hypothesis_testing statistics in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:c1229ad954ac/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:random_projections"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1304.5939">
    <title>[1304.5939] Exact and asymptotically robust permutation tests</title>
    <dc:date>2013-04-23T18:01:18+00:00</dc:date>
    <link>http://arxiv.org/abs/1304.5939</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Given independent samples from P and Q, two-sample permutation tests allow one to construct exact level tests when the null hypothesis is P=Q. On the other hand, when comparing or testing particular parameters $\theta$ of P and Q, such as their means or medians, permutation tests need not be level $\alpha$, or even approximately level $\alpha$ in large samples. Under very weak assumptions for comparing estimators, we provide a general test procedure whereby the asymptotic validity of the permutation test holds while retaining the exact rejection probability $\alpha$ in finite samples when the underlying distributions are identical. The ideas are broadly applicable and special attention is given to the k-sample problem of comparing general parameters, whereby a permutation test is constructed which is exact level $\alpha$ under the hypothesis of identical distributions, but has asymptotic rejection probability $\alpha$ under the more general null hypothesis of equality of parameters. A Monte Carlo simulation study is performed as well. A quite general theory is possible based on a coupling construction, as well as a key contiguity argument for the multinomial and multivariate hypergeometric distributions."]]></description>
<dc:subject>hypothesis_testing two-sample_tests statistics in_NB have_read</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:65e76d4ff0ca/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00442">
    <title>Relative Density-Ratio Estimation for Robust Distribution Comparison</title>
    <dc:date>2013-04-04T17:32:51+00:00</dc:date>
    <link>http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00442</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Divergence estimators based on direct approximation of density ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is a challenging task in practice. In this letter, we use relative divergences for distribution comparison, which involves approximation of relative density ratios. Since relative density ratios are always smoother than corresponding ordinary density ratios, our proposed method is favorable in terms of nonparametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach."

--- This is _not_ the relative density between p and q in the Handcock-Morris sense, just the ratio between p and ap+(1-a)q, for adjustable a.  (This is to keep the density ratio from going to infinity anywhere.)  The whole thing seems like a bit of a hack...]]></description>
<dc:subject>density_estimation statistics two-sample_tests goodness-of-fit have_read in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:f476c5fd4154/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:density_estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:goodness-of-fit"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1304.0796">
    <title>[1304.0796] Direction-Projection-Permutation for High Dimensional Hypothesis Tests</title>
    <dc:date>2013-04-04T16:35:06+00:00</dc:date>
    <link>http://arxiv.org/abs/1304.0796</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Motivated by the prevalence of high dimensional low sample size datasets in modern statistical applications, we propose a general nonparametric framework, Direction-Projection-Permutation (DiProPerm), for testing high dimensional hypotheses. The method is aimed at rigorous testing of whether lower dimensional visual differences are statistically significant. Theoretical analysis under the non-classical asymptotic regime of dimension going to infinity for fixed sample size reveals that certain natural variations of DiProPerm can have very different behaviors. An empirical power study both confirms the theoretical results and suggests DiProPerm is a powerful test in many settings. Finally DiProPerm is applied to a high dimensional gene expression dataset."]]></description>
<dc:subject>two-sample_tests high-dimensional_statistics goodness-of-fit hypothesis_testing statistics to_teach:undergrad-ADA have_read in_NB visual_display_of_quantitative_information</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:94595c32ca65/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:high-dimensional_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:goodness-of-fit"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_teach:undergrad-ADA"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:visual_display_of_quantitative_information"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.ejs/1347974672">
    <title>Sriperumbudur , Fukumizu , Gretton , Schölkopf , Lanckriet : On the empirical estimation of integral probability metrics</title>
    <dc:date>2012-09-18T14:08:02+00:00</dc:date>
    <link>http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.ejs/1347974672</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Given two probability measures, ℙ and ℚ defined on a measurable space, S, the integral probability metric (IPM) is defined as $$\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})=\sup\left\{\left\vert \int_S f\,d\mathbb{P}-\int_S f\,d\mathbb{Q}\right\vert : f\in\mathcal{F}\right\},$$ where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. By appropriately choosing $\mathcal{F}$, various popular distances between $\mathbb{P}$ and $\mathbb{Q}$, including the Kantorovich metric, Fortet-Mourier metric, dual-bounded Lipschitz distance (also called the Dudley metric), total variation distance, and kernel distance, can be obtained.
"In this paper, we consider the problem of estimating $\gamma_{\mathcal{F}}$ from finite random samples drawn i.i.d. from ℙ and ℚ. Although the above mentioned distances cannot be computed in closed form for every ℙ and ℚ, we show their empirical estimators to be easily computable, and strongly consistent (except for the total-variation distance). We further analyze their rates of convergence. Based on these results, we discuss the advantages of certain choices of $\mathcal{F}$ (and therefore the corresponding IPMs) over others; in particular, the kernel distance is shown to have three favorable properties compared with the other mentioned distances: it is computationally cheaper, the empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension $d$ of the space (for $S=\mathbb{R}^d$). We also provide a novel interpretation of IPMs and their empirical estimators by relating them to the problem of binary classification: while the IPM between class-conditional distributions is the negative of the optimal risk associated with a binary classifier, the smoothness of an appropriate binary classifier (e.g., support vector machine, Lipschitz classifier, etc.) is inversely related to the empirical estimator of the IPM between these class-conditional distributions."]]></description>
<dc:subject>to_read kernel_methods statistics probability two-sample_tests in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:ca18d39b56f2/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:probability"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://normaldeviate.wordpress.com/2012/07/14/modern-two-sample-tests/">
    <title>Modern Two-Sample Tests « Normal Deviate</title>
    <dc:date>2012-07-14T15:50:36+00:00</dc:date>
    <link>http://normaldeviate.wordpress.com/2012/07/14/modern-two-sample-tests/</link>
    <dc:creator>cshalizi</dc:creator><dc:subject>two-sample_tests statistics hypothesis_testing wasserman.larry</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:7d452ff9efd6/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:wasserman.larry"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://jmlr.csail.mit.edu/papers/v13/gretton12a.html">
    <title>A Kernel Two-Sample Test</title>
    <dc:date>2012-04-03T00:34:03+00:00</dc:date>
    <link>http://jmlr.csail.mit.edu/papers/v13/gretton12a.html</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests."]]></description>
<dc:subject>to_read hilbert_space kernel_methods goodness-of-fit statistics concentration_of_measure probability two-sample_tests re:network_differences in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:2958bc7b3490/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hilbert_space"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:kernel_methods"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:goodness-of-fit"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:concentration_of_measure"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:probability"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aos/1176350835">
    <title>Henze : A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences</title>
    <dc:date>2012-02-17T22:53:30+00:00</dc:date>
    <link>http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aos/1176350835</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["For independent $d$-variate random samples $X_1, \cdots, X_{n_1}$ i.i.d. $f(x)$, $Y_1, \cdots, Y_{n_2}$ i.i.d. $g(x)$, where the densities $f$ and $g$ are assumed to be continuous a.e., consider the number $T$ of all $k$ nearest neighbor comparisons in which observations and their neighbors belong to the same sample. We show that, if $f = g$ a.e., the limiting (normal) distribution of $T$, as $\min(n_1, n_2) \rightarrow \infty, n_1/(n_1 + n_2) \rightarrow \tau, 0 < \tau < 1$, does not depend on $f$. An omnibus procedure for testing the hypothesis $H_0: f = g$ a.e. is obtained by rejecting $H_0$ for large values of $T$. The result applies to a general distance (generated by a norm on $\mathbb{R}^d$) for determining nearest neighbors, and it generalizes to the multisample situation."]]></description>
<dc:subject>to_read statistics hypothesis_testing two-sample_tests re:AoS_project in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:bb7bbf941041/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:to_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:AoS_project"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://arxiv.org/abs/1202.1561">
    <title>[1202.1561] Tree Models for Difference and Change Detection in a Complex Environment</title>
    <dc:date>2012-02-10T05:18:16+00:00</dc:date>
    <link>http://arxiv.org/abs/1202.1561</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["A new family of tree models is proposed, which we call "differential trees." A differential tree model is constructed from multiple data sets and aims to detect distributional differences between them. The new methodology differs from the existing difference and change detection techniques in its nonparametric nature, model construction from multiple data sets, and applicability to high-dimensional data. Through a detailed study of an arson case in New Zealand, where an individual is known to have been laying vegetation fires within a certain time period, we illustrate how these models can help detect changes in the frequencies of event occurrences and uncover unusual clusters of events in a complex environment."

--- After reading, I think their exposition is needlessly hard to follow, but let me take a stab at it.  In an ordinary classification tree, we are interested in the distribution of the class labels Y given the predictors X, i.e., Pr(Y|X), and make splits on X so that (in essence) the conditional entropy H[Y|X] becomes small.  This is of course equivalent to making splits so that the divergence of Pr(Y|X) from Pr(Y) is maximized.  What they are interested in is not classification but _describing_ how the different classes are distinct, so the relevant distribution is Pr(X|Y), and they want a big divergence between Pr(X) and Pr(X|Y).

ETA: Published version:
http://projecteuclid.org/euclid.aoas/1346418578 .  Haven't compared it to what I read.
]]></description>
<dc:subject>re:network_differences statistics hypothesis_testing density_estimation decision_trees have_read data_mining two-sample_tests in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:1d69327d5561/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:re:network_differences"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:density_estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:decision_trees"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:data_mining"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aoms/1177731355">
    <title>Scheffe : Statistical Inference in the Non-Parametric Case (1943)</title>
    <dc:date>2012-02-08T21:30:25+00:00</dc:date>
    <link>http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aoms/1177731355</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA[We knew nothing.]]></description>
<dc:subject>have_read statistics nonparametrics history_of_statistics estimation hypothesis_testing two-sample_tests in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:e1e4a2fb000b/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:have_read"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:nonparametrics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:history_of_statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145469&amp;arnumber=6018305&amp;tag=1">
    <title>f-Divergence Estimation and Two-Sample Homogeneity Test Under Semiparametric Density-Ratio Models</title>
    <dc:date>2012-02-07T19:15:38+00:00</dc:date>
    <link>http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145469&amp;arnumber=6018305&amp;tag=1</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["A density ratio is defined by the ratio of two probability densities. We study the inference problem of density ratios and apply a semiparametric density-ratio estimator to the two-sample homogeneity test. In the proposed test procedure, the $f$-divergence between two probability densities is estimated using a density-ratio estimator. The $f$ -divergence estimator is then exploited for the two-sample homogeneity test. We derive an optimal estimator of $f$-divergence in the sense of the asymptotic variance in a semiparametric setting, and provide a statistic for two-sample homogeneity test based on the optimal estimator. We prove that the proposed test dominates the existing empirical likelihood score test. Through numerical studies, we illustrate the adequacy of the asymptotic theory for finite-sample inference."]]></description>
<dc:subject>statistics density_estimation information_theory hypothesis_testing two-sample_tests in_NB density_ratio_estimation</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:d50cd7cd174b/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:density_estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:information_theory"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:density_ratio_estimation"/>
</rdf:Bag></taxo:topics>
</item>
<item rdf:about="http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.tm10576">
    <title>Nonparametric Tests for Homogeneity Based on Non-Bipartite Matching</title>
    <dc:date>2012-01-16T16:13:17+00:00</dc:date>
    <link>http://pubs.amstat.org/doi/abs/10.1198/jasa.2011.tm10576</link>
    <dc:creator>cshalizi</dc:creator><description><![CDATA["Given a sequence of observations, has a change occurred in the underlying probability distribution with respect to observation order? This problem of detecting change points arises in a variety of applications including health prognostics for mechanical systems, syndromic disease surveillance in geographically dispersed populations, anomaly detection in information networks, and multivariate process control in general. Detecting change points in high-dimensional settings is challenging, and most change-point methods for multidimensional problems rely upon distributional assumptions or the use of observation history to model probability distributions. We present three new nonparametric statistical tests for heterogeneity based on the combinatorial properties of minimum non-bipartite matching (MNBM). The key idea underlying each of these tests is that if a sequence of independent random observations undergoes a change in distribution—either an abrupt “shift” or a gradual “drift”—a MNBM based on inter-point distances tends to produce pairings that are closer in the sequence labeling than would be the case if the observations were drawn from the same distribution. Our tests follow on the work of Rosenbaum (2005) who used MNBM to derive a simple cross-match test statistic for the two-sample problem based on this idea. Similar ideas are present in the minimum spanning tree (MST) test derived by Friedman and Rafsky (1979, 1981). We extend these approaches by utilizing ensembles of orthogonal MNBMs which greatly increase information extraction from the data, leading to tests that compare favorably to parametric procedures while maintaining level and good power properties across distributions."]]></description>
<dc:subject>statistics hypothesis_testing density_estimation change-point_problem two-sample_tests in_NB</dc:subject>
<dc:source>https://pinboard.in/</dc:source>
<dc:identifier>https://pinboard.in/u:cshalizi/b:53436bb49b5d/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:statistics"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:hypothesis_testing"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:density_estimation"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:change-point_problem"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:two-sample_tests"/>
	<rdf:li rdf:resource="https://pinboard.in/u:cshalizi/t:in_NB"/>
</rdf:Bag></taxo:topics>
</item>
</rdf:RDF>