You are currently browsing the monthly archive for January 2009.

While trying to prove some series (which should be easy to guess) were divergent, I stumbled across the following cute little result. We write \log^{(k)}x = \log \log \log .... \log x with k \logs.

Consider the integral \int_a^b \frac{dt}{t \log t \log \log t ... \log^{(n)}t}.

Then the substitution t=e^s gives this integral as equal to \int_{\log a}^{\log b} \frac{ds}{s \log s \log \log s .... \log^{(n-1)} s}, so making n such substitutions will just reduce this integral to solving \int \frac{dx}{x}, which is well-known.

This term I am trialling the use of a laptop in taking lecture notes in two part IA courses, to see if it worked out more effectively than my rather messy hand-written notes. There is a good chance I might abandon this method after a few more lectures, particularly in Probability, but for the time being I thought I would keep them updated online in case anyone might find them useful (if they’ve missed a lecture or whatever…).

A health warning worth mentioning is that my notes are often excessively concise, and omit or paraphrase quite a lot of what the lecturer writes. This is a note style that suits me (leaving me to fill in the gaps during revision, and omitting any material I am very comfortable with), but might need to be borne in mind if you are using them to help learn the course.

Any corrections or comments would be well-received.

Probability
Dynamics and Relativity

Update: I have actually completed both sets of notes, but do not wish to post them on the internet for public consumption (as this might undermine lecturers’ control of their teaching next year). Contact me if you would be interested and I’ll gladly email them to you.

My lack of recent activity was owing to a boat club trip to Seville, with lots of sun, rowing and walking, but little internet access. However, now I am back, and during the trip the following curious pseudo-problem was posed.

Many Cambridge college boat clubs keep track of pairs of their members who have had romantic liasons (precise definitions of which vary from club to club but “sexual kissing” seems to be the standard benchmark), and so build up what are known as incest charts, which are essentially graphs where each member is a vertex and two people are joined by an edge if they have engaged one another in acts of ‘incest’.

First and Third had plans to put an anonymous (the vertices are left unnamed) incest chart on their website, but given the great size of the club, the actual generation of the graphic creates an interesting problem which isn’t exactly mathematical but clearly requires some kind of algorithmic mathematical handling.

Problem: Given a graph, expressed as a set of edges (any vertices of degree zero are ignored), find an algorithm for generating a graphical representation of the graph which is maximally intelligible to someone viewing the website. In particular, edges should cross as infrequently as possible and edges should be short and ideally straight.

This is a problem familiar to anyone who has done graph theory. You get half way through a diagram and then end up having to include lots of loopy things to make your graph make sense. It seems, to the author, a difficult task. If anyone has any ideas or knows of such an algorithm already, I would be interested to hear.

In Part IA Vectors and Matrices this theorem was proved, but the following proof (which I thought of during the lecture and seemed most natural) wasn’t lectured, or if it was it wasn’t clear to me that this was the approach taken.

Claim: Let A be an n\times n matrix with characteristic equation f(x)=0. Then f(A)=0.

Proof: We prove it in the special case where A is diagonalisable.

A is diagonalisable, so there exists a basis of eigenvectors e_1,e_2,...,e_n. It will suffice to show that the linear map corresponding to L= f(A) sends everything to zero.

Any vector can be written v=v_1e_1+v_2e_2+....+v_ne_n, so by the linearity of L it suffices to show that each component is mapped to the zero vector. But clearly L(v_je_j) = v_jL(e_j) = v_j f(\lambda_j)e_j = 0e_j = 0, where \lambda_j is the eigenvalue of e_j.

So we are done. 🙂

In my last post I discovered a surprising result about the distribution of |x|: x \in [0,1]^d, namely that as the dimension of the cube grew, the distribution actually seemed to get thinner, converging to almost a spike at \sqrt{d/3}. I am going to attempt to generalise that result a little here, and draw some (mainly conceptual and heuristic) conclusions to inform future intuition about such problems.

So we now look at a set of random variables X_1,X_2,...,X_d which all have the same distribution (say, that of X, which we assume to be nonconstant), and consider the distribution of S = X_1+X_2+....+X_d.

By following a similar method to our investigation in the previous post, straightforward calculation and applications of the linearity of expectation and the mutual independence of the variables give:

  • \mu = \mathbb{E}(S) = d\mathbb{E}(X);
  • \sigma^2 = \text{Var}(S) = d\mathbb{E}(X^2) - d\mathbb{E}(X)^2 = d\text{Var(X)}.

Therefore, as d increases, the standard deviation increases at a much slower rate than the mean. In particular, we can apply Chebyshev’s inequality and read off two quick results.

  1. For any k, \epsilon>0, \mathbb{P}(|S-\mu| \geq kd^{1/2+\epsilon}) \leq \frac{Var(X)}{k^2d^{2\epsilon}} \rightarrow 0.
  2. For k=\lambda\text{Var}^{1/2}(X), \lambda >1\mathbb{P}(|S-\mu| \geq kd^{1/2}) \leq \frac{1}{\lambda^2}.

These can be used to give quantitative estimates of the very quick drop in density of the distribution away from the mean. We shall conclude by briefly revisiting the result of the past post, and then drawing some general intuitive pointers.

In the previous problem, we set X to have the square of a uniformly distributed variable on {[0,1]}, and deduce that the result we require (with \alpha = 1/2) is essentially (ignoring some lower order terms) equivalent to finding c>0 such that for all sufficiently large d

\mathbb{P}(|S-\mu| \leq 3^{-1/2}d^{1/2}) > c.

But this is immediate from the second result above and the fact that 3^{-1/2}>\text{Var}^{1/2}(U([0,1]^d)^2) = \sqrt{4/45}.

Reflection on this general phenomenon

The intuition I can use to make sense of this result derives from the heuristic “if you take more samples, you are more likely to get an average”. Although in our case we are just taking a sum, this is essentially equivalent to taking an average, and it is perhaps unsurprising that the distribution in the limit is therefore actually very dense around the expected value and almost zero everywhere else. It might actually be sensible to turn this intuition into mathematics.

So set T = S/d, and multiplying past results by suitable constants give us that

  • \mathbb{E}(T) = \mathbb{E}(X)
  • \text{Var}(T) = d^{-1}\text{Var}(X)

In other words, the variance is inversely proportional to the number of samples we take before averaging, which is roughly what we would intuitively expect. So this result is actually, however surprising it seemed at first, really just the primary school idea that increasing sample size increases the accuracy of a result, but multiplying everything by d.

This has been the (to me very surprising) result of my enquiry into the following beautiful problem, posed to the author by Prof. Ben Green. I am recording my findings primarily for my own reference, and would urge any reader not familiar with the problem to at least try to attack it themselves before reading on.

Question: If points (if you like, vectors) x,y are selected randomly and independently with uniform probability from \ [ 0,1]^d, show there exists c>0 such that \mathbb{P}(||x|-|y|| \leq 1) \geq c for all d.

My initial attempts to solve this problem were not particularly successful, since trying to relate |x| and |y| as they both varied over their distributions seemed tricky. My initial heuristic was along the lines of: “The random variable |x| ranges over the interval 0 \leq |x| \leq \sqrt{d} so as d grows the distribution must spread itself across a greater distance, hence be less at any given point under a suitable continuous transform to keep the distribution ‘appropriate’ (with mean at the correct, also growing, place). Thus while the distribution does peak near the middle, it does not seem likely that a given O(1) length neighbourhood can have \Omega (1) probability as d \rightarrow \infty .”

With this heuristic in mind, the only method of approach which seemed likely to work was to take some kind of sum or integral over various (and a number growing with d) disjoint intervals on the distribution to get a spread-out enough estimate to eliminate the lack of an \Omega(1) probability on any given interval. However, actually going ahead and doing something like this proved to be very tricky, especially for one such as myself with little formal probabilistic training.

So I decided to leave the above heuristic aside and just see how far from \Omega(1) the probability looked like around a natural-looking modal point. I achieved the following surprising result.

Theorem 1: For a fixed constant \alpha,

\mathbb{P}(||x|-\sqrt{\frac{d}{3}}| \leq \alpha) \geq 1 - (\sqrt{15}\alpha-\frac{3\sqrt{5}\alpha^2}{2\sqrt{d}})^{-2}.

In particular, the above question is a trivial consequence of the case \alpha=\frac{1}{2}.

Owing to the nature of the Euclidean metric, it seemed natural to set up variables as follows:

Let X_1,X_2,...,X_d \in [0,1] be selected randomly and uniformly, and set F = \sum_i X_i^2. A simple application of the linearity of expectation showed that \mathbb{E}(F) = d/3, and as F = |x|^2 seemed a much easier object to work with than |x| I decided to take a look at \mathbb{P}(||x|-\sqrt{d/3}| \leq \alpha) for some constant \alpha >0 to be played with later.

This is equal to \mathbb{P}(||x| \in [\sqrt{d/3}-\alpha, \sqrt{d/3}+\alpha])

= \mathbb{P}(F \in [d/3-2\alpha\sqrt{d/3}+\alpha^2, d/3+2\alpha\sqrt{d/3} + \alpha^2])

which is then clearly \geq \mathbb{P}(|F-d/3| \leq 2\alpha\sqrt{d/3} - \alpha^2).

Which looks like the kind of quantity one can bound above by Chebyshev’s inequality and get a lower bound around \Omega (d^{1/2}\text{Var}(F)^{-1/2}). The initial heuristic relied on the ‘near uniformity’ assumption that this variance would be of magnitude just a little less than d^2, but here I would be very pleasantly surprised.

\mathbb{E}(F^2) = \sum_i \mathbb{E}(X_i^4) + \sum_{i\not=j} \mathbb{E}(X_i^2X_j^2) = \frac{d}{5} + \frac{d(d-1)}{9}

But this implies the much lower than expected \text{Var}(F) = \frac{d}{5}-\frac{d}{9} = \frac{4d}{45}. In other words, the heuristic based on assuming F is roughly smoothly distributed was fallacious. Instead, the distribution of F has a ‘spike’ at its expected value and is almost zero away from it.

So \mathbb{P}(||x| \in [\sqrt{d/3}-\alpha, \sqrt{d/3}+\alpha]) \geq 1 - \mathbb{P}(|F-d/3| > 2\alpha\sqrt{d/3} - \alpha^2).

Writing 2\alpha\sqrt{d/3} - \alpha^2 = (\sqrt{15}\alpha-\frac{3\sqrt{5}\alpha^2}{2\sqrt{d}})\text{Var}^{1/2}(F) we can apply Chebyshev’s inequality to gain the lower bound \mathbb{P}(||x| \in [\sqrt{d/3}-\alpha, \sqrt{d/3}+\alpha]) \geq 1 - (\sqrt{15}\alpha-\frac{3\sqrt{5}\alpha^2}{2\sqrt{d}})^{-2}, which is precisely theorem 1.

So my heuristic has been torn to pieces, and, instead, as d gets large, the ‘amount of space’ in the d-dimensional unit cube whose distance from the origin is within a certain (small) distance of \sqrt{d/3}  actually must be rather a lot. For example, take \alpha=\frac{1}{2}. Then P(||x|-\sqrt{d/3} \geq 1-(\frac{\sqrt{15}}{2}-O(d^{-1/2}))^{-2}, so as d tends to infinity the probability approaches a value of at least \frac{11}{15}, so more than two thirds of points in a unit cube of high dimension are (with maximum error \frac{1}{2}) roughly distance \sqrt{\frac{d}{3}} from the origin (or, by symmetry, any other vertex).

I found this result initially very surprising (and sent Prof. Green a semi-apologetic email in which I acknowledged that I could just be having an evening of insanity and the next morning discover my estimates were based on false assumptions). The basic reason the smooth expected distribution ended up being a spike was that in the variance calculation the quadratic terms cancelled out, leaving only smaller order terms. In another post in the next few days I shall try to investigate whether this phenomenon (where the quadratic parts of E(X^2), E(X)^2 are equal) can be found in any interesting families of general situations, as this now seems likely and should be investigated to inform my intuition in future.