Visualizing a decade and a half of physics research


In our last post we described a technique known as Latent Dirichlet Allocation (LDA) for classifying physics research papers by topic. Using nothing but the data, we came up with what seem to be some pretty informative categories. The really interesting question is whether those empirically derived topic categories bear any resemblance to the clusters that emerge from the data when we look at who's citing whom. If our topical groupings are useful, we'd expect clusters of papers that share large numbers of citations to also tend to share the same topic.


We used Gephi, an excellent open source viz platform that Quid used to launch their product beta, to visualize the physic.gexf data object that we produced in the last post. Though Gephi supposedly supports a ton of different input file formats, we’ve had the best luck with its native gexf XML format. The settings we used to produce this graph were:

  • Force Atlas 2 layout
  • 4 cores
  • Tolerance: 1.0
  • No approximation
  • Scaling: 25
  • Gravity: 0.5
  • Prevent overlap

To get started with Gephi, check out the quick start guide or this YouTube video on Facebook network visualization.


As a first pass we kept the graph simple, visualizing only those papers that had at least 50 connections to other papers in the original data set. Circles represent papers; when a paper cites another paper, that creates a link between circles. The more heavily cited a paper is, the larger its circle. The topic classifications from the last post control the colors of the circles and connections. Here's what we came up with:


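(As an aside, the 50-connection filter can also be applied in code before loading the graph into Gephi. Here's a minimal sketch using networkx; the input file name matches the gexf from the last post, and the output file name is made up.)

    # Keep only papers with at least 50 citation links, then re-export for Gephi
    import networkx as nx

    G = nx.read_gexf("physic.gexf")
    keep = [node for node, degree in G.degree() if degree >= 50]
    nx.write_gexf(G.subgraph(keep), "physic_filtered.gexf")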
We can see that there is very strong alignment between clustering pattern and topic classification for several of the most important topics. The pink and green topic clusters on the left are related to quantum and quantum electrodynamic theory. While they form fairly distinct topic groupings, a fair number of papers from each category cite, or are cited by, papers from other clusters. The degree of representation of these two topics (33% of the total sample) shows that this is a major focus of academic research. At the same time, the broad distribution of nodes from these topics reveals that quantum theory supports many other subfields in the physics literature.

The most important paper in terms of number of citations in the data set is represented by the large blue circle toward the upper right of the plot. It corresponds to Large N Field Theories, String Theory and Gravity, Aharony et al. (1999). Some of the keywords associated with this paper's topic were geometry, space, mathematical, differentiable and manifolds.

The clay-red topic in the center of the graph is an interesting one. Though this topic only represents 3.74% of the sample, it has an unusual number of highly-cited, centrally located papers. Some keywords associated with this topic were gauge, theory, supersymmetric, su, seiberg, yang, moduli, and theories. It could be that this topic is at least partially tied to several key authors whose names come up a lot in the abstracts of papers that cite them, and that the work they do is also broadly relevant across the physics field.

The orange topic to the right of the plot contains the keywords abelian, dual, magnetic, monopole, higgs, topological, field, and electric. Electromagnetism is probably a fair designation for this topic. It's a relatively large, homogeneous grouping that doesn't seem particularly closely related to other clusters. In particular, it has almost no overlap with cosmology (the yellow clusters at the top of the plot).

Follow up

We’d love your thoughts on how we can improve any and all aspects of this analysis. What did you find helpful? What was confusing? Would you like to chat with us about how you can apply this to your own work? E-commerce companies, intelligence analysts at DoD, retailers, political campaigns, and many other actors have found this type of analysis to be extremely revealing. Whether it’s a prospective engagement or just you sharing your thoughts on how to improve this example we’d love to hear from you.

To start a conversation, check out our website or get in touch with us at +1 (619) 365-4231.


Using Latent Dirichlet Allocation (LDA) to topically classify research papers




This project started with a relatively simple goal: use network data freely available from Stanford to create a network graph showing how over a decade of physics papers published since 1992 are related to one another via citation. We wanted to color this graph by topic, and therein lay the problem: no topical information was present in the metadata. If we wanted to assign topics to papers, we'd have to make those assignments ourselves.

Latent Dirichlet Allocation (LDA) is a really cool method of simultaneously generating and assigning topic categories to a body of text. The intuition is pretty straightforward: we can think of documents as mixtures of topics, and we can discover the words associated with each topic by examining the words that different documents have in common. The technicals are a bit more complicated; Edwin Chen takes a pretty good stab at them here and here. We use a common fitting method called collapsed Gibbs sampling, an application of Bayesian statistics and Monte Carlo sampling.
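For the curious, the core of collapsed Gibbs sampling is a single update rule (the notation below is ours, not something from the post): the topic assignment z_i of every word in every document is repeatedly resampled, conditioned on all the other current assignments, with probability roughly

    P(z_i = k \mid z_{-i}, w) \;\propto\; \frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i} + V\beta} \times \left( n_{d_i,-i}^{(k)} + \alpha \right)

The first factor asks "how much does topic k like this word?" (the count of word w_i currently assigned to topic k, over all words assigned to topic k, smoothed by the prior beta across a vocabulary of size V), and the second asks "how much does this document like topic k?" (the count of words in document d_i assigned to topic k, smoothed by the prior alpha). Sweep over every word enough times and the assignments settle into a high-probability configuration.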


LDA doesn't work on raw text; it works on word counts over a fixed vocabulary, and rare words mostly add noise. So we condensed our abstracts into counts of the most common words in the literature by building a "term-document matrix," counting the occurrences of a word in an abstract only if that word occurred in at least 100 of the 25,000+ abstracts in the data. We built this matrix using python's textmining package.
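Here's a minimal sketch of that step. We used the textmining package for the actual analysis; the same matrix can be built with scikit-learn's CountVectorizer (the variable abstracts, a list of abstract strings, is an assumption):

    # Build a term-document matrix, keeping only words that appear
    # in at least 100 of the 25,000+ abstracts
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(min_df=100)   # drop words seen in fewer than 100 abstracts
    X = vectorizer.fit_transform(abstracts)    # documents x vocabulary matrix of counts
    vocab = vectorizer.get_feature_names_out()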

The most important parameter that the user controls in LDA is the number of topics used for the generative model. This is a judgment call. We checked out Wikipedia's Outline of Physics article and decided to start with 25 topics.

Once we'd structured our text data for LDA and decided on the number of topics to feed into the generative model, the last thing we needed to do was select the number of Gibbs sampling iterations. Experimentally, we found that the log likelihood function was essentially flat after iteration 200.
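Putting those pieces together: the lda package is one Python library that implements collapsed Gibbs sampling for LDA (it's just one option, and the snippet below is a sketch rather than the exact code behind this analysis). Continuing from the X and vocab objects above:

    # Fit LDA with 25 topics via collapsed Gibbs sampling; the log likelihood
    # flattens out after roughly 200 iterations, so 500 is more than enough
    import numpy as np
    import lda

    model = lda.LDA(n_topics=25, n_iter=500, random_state=1)
    model.fit(X)   # X must be a matrix of integer word counts

    # Print the top 8 words for each topic
    for k, word_dist in enumerate(model.topic_word_):
        top_words = np.array(vocab)[np.argsort(word_dist)][:-9:-1]
        print("Topic {}: {}".format(k, ", ".join(top_words)))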


Here were some of the most important topics that emerged from our analysis, along with some interesting words associated with them:

  • Topic 21 (17.5%): algebra, quantum, group, representation, operators, deformed
  • Topic 24 (16%): supersymmetry, supersymmetric, duality, chiral, symmetries, cern, gauge 
  • Topic 18 (9.67%): abelian, dual, magnetic, monopole, higgs, topological, field, electric
  • Topic 8 (7.67%): geometry, space, mathematical, differentiable, manifolds 
  • Topic 3 (6.64%): noncommutative, string, wave, spectrum, background, scattering
  • Topic 5 (5.7%): dimensional, lattice, partition, scaling
  • Topic 4 (4.33%): black, hole, holes, entropy, dilation, dimensional, hawking, einstein 

The analysis seems to do a pretty good job identifying sub-topics in the physics literature. For example, we had no idea that there was something called supersymmetric gauge theory, which is closely related to quantum electrodynamics, until we Googled the keywords from topic 24. Topic 4 clearly relates to cosmology and black holes.

You can easily replicate this analysis for yourself by stopping by our GitHub repo. Check out our next post, where we visualize the results of this analysis in Gephi to see how well these topics actually line up with the citation data.


When linear regressions fail (and what to do about it)

Summary: Machine learning for marketers

In this post we argue that direct marketers need to get familiar with machine learning. You can think of machine learning as black-box predictive algorithms that are more sophisticated, and require more computing power, than linear regression. We’ve open-sourced the data and code for this study so that you (or your analytics team) can replicate our analysis. Here are the key takeaways: 

  • Machine learning can improve your marketing campaign ROI by 15 – 50%. If you’re not using machine learning for your predictive analytics stack you’re leaving money on the table.
  • Conventional regression-based forecasting is especially vulnerable to breakdown when used as a predictive response modeling tool. In fact, forecasts with linear models can do more harm than good. 
  • By randomly sampling both the rows and columns of a training data set multiple times analysts can develop a series of linear models that substantially improve on the performance of a single regression. 
  • Random forests (a classic machine learning algorithm) significantly outperform linear models.  

If you need more convincing, read on. If you'd like to talk with our team about how you can start benefiting from machine learning today, check out our website or get in touch with us at +1 (619) 365-4231. We will replicate the analysis presented here gratis on your own past campaign data to prove the value of this concept.

Study methodology

The data set used for this study is an anonymized version of actual customer data. Conventionally, analysts validate and tune models using random samples of the training set. However, we want to know how models trained on past data perform when applied to future campaigns. To test this we took the first 50% of marketing transactions recorded to date in 2015 and used the predictive models trained with that data to predict outcomes in the subsequent 50%. 
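In code, that temporal split is just a sort and a cut down the middle. A minimal sketch (the file and column names here are assumptions):

    import pandas as pd

    df = pd.read_csv("transactions_2015.csv").sort_values("transaction_date")
    cut = len(df) // 2
    train, future = df.iloc[:cut], df.iloc[cut:]   # train on the first half, score the second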

We present two graphics that should be familiar to most marketers: lift charts and cumulative gains charts.
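If you haven't built one before, a cumulative gains curve just sorts leads from best to worst predicted score and tracks what fraction of all actual responders you've captured as you work down the list; lift is that capture rate divided by what random targeting would give you. A rough sketch (scores and y, the model's predictions and the true 0/1 outcomes as numpy arrays, are assumptions):

    import numpy as np

    order = np.argsort(-scores)                   # best-scored leads first
    gains = np.cumsum(y[order]) / y.sum()         # fraction of responders captured so far
    depth = np.arange(1, len(y) + 1) / len(y)     # fraction of leads contacted so far
    lift = gains / depth                          # improvement over random targeting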

All models are calculated using the scikit-learn python library, which, along with Vowpal Wabbit and R, comprises the bulk of our predictive analytics stack here at Polhooper.

The trouble with normal linear regression

Data sets have two types of correlations. The first is “signal,” real relationships between the variables that apply broadly to the problem being studied. The second source of correlation is “noise,” relationships between variables that, while statistically significant in a particular data set, are not generally valid. The trouble with regular linear regression is that it fits both the signal and the noise in a set of training data. When used to forecast outcomes in new data, this tendency to fit to noise can lead to worse-than-random prediction performance. 

To improve the performance of the base linear regression we use a technique called bootstrap aggregation. By randomly sampling both the rows and columns of a training data set, training a linear model, and repeating this process multiple times we can improve model performance by reducing the tendency of a single model to fit on noise in the data. With bootstrap aggregation we build a whole bunch of models with subsampled data and average their predictions together to get a single, final prediction. 
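Here's a minimal sketch of that ensemble using scikit-learn's BaggingRegressor, with the settings described a couple of paragraphs below (250 models, 50% row subsamples, 25% column subsamples). The training matrices X_train, y_train and the scoring set X_future are assumptions:

    # Bootstrap-aggregated linear regression: 250 models, each trained on a random
    # 50% of the rows and 25% of the columns; predictions are averaged
    from sklearn.ensemble import BaggingRegressor
    from sklearn.linear_model import LinearRegression

    bagged = BaggingRegressor(
        LinearRegression(),
        n_estimators=250,
        max_samples=0.5,      # 50% of rows per model
        max_features=0.25,    # 25% of columns per model
        bootstrap=True,
    )
    bagged.fit(X_train, y_train)
    scores = bagged.predict(X_future)   # averaged predictions, used to rank leads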

(Aggregating a bunch of linear models looks something like this)

For our ensemble we trained 250 linear models, using 50% subsamples of the original data and 25% subsamples of the original feature set for each model. Here's what the gain and lift charts, respectively, looked like for the simple linear regression (green) compared to the bagged regression ensemble (purple):


Key take-aways:

  • A single linear regression fit on the whole data set (green line) actually becomes worse than random at selecting qualified leads, depending on how many leads you contact.
  • The ensemble of linear models with the randomly-sampled data improves prediction performance by 15 – 20% over the single linear model.
  • In general, linear regression isn’t that great at forecasting customer engagement for this problem.

Adding in some machine learning

Random forests are a classic and very powerful machine learning algorithm (theory nerds can check out the original paper by Breiman, 2001). The algorithm uses the same process that we described above for linear models, except instead of training linear regressions it trains a bunch of classification and regression trees. We added a random forest model (yellow line) and a logistic regression (orange line) to the lift and gain analysis for comparison:
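For reference, here's roughly how those two additional models might be set up with scikit-learn (the hyperparameters and the X_train, y_train, X_future variables are assumptions, not our exact settings):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    # Random forest: the same bagging idea as above, but over decision trees,
    # with an extra random subset of features considered at every split
    rf = RandomForestClassifier(n_estimators=250, random_state=1)
    rf.fit(X_train, y_train)
    rf_scores = rf.predict_proba(X_future)[:, 1]      # probability that a lead responds

    logit = LogisticRegression(max_iter=1000)
    logit.fit(X_train, y_train)
    logit_scores = logit.predict_proba(X_future)[:, 1]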


Key take-aways:

  • A random forest improves prediction performance over the second-best model by 15 – 30%, depending on the number of leads contacted. This is a 30 – 50% improvement over a single linear model. 

Putting this in dollars-and-cents terms, let's say that in a given month this customer planned to contact 200,000 out of the 1,000,000 leads available, and to select those leads with a customer response model. Also assume that 20% of contacted leads convert into customers, and that the lifetime value of a customer averages $1,000. In this case, machine learning-based targeting grosses $800K in revenue, while the logistic/linear ensemble grosses $676K and the linear regression grosses $568K.

In other words, not using machine learning for targeting in this example would have cost the company $124 – 232K!


Machine learning makes predictions better.

Better predictions mean better targeting and stronger marketing ROIs.

If you or your data provider aren’t using machine learning you’re missing out.

To start a conversation, check out our website or get in touch with us at +1 (619) 365-4231.

We’ve open sourced the data for this study via Dropbox and the code via our GitHub account. Check out the README file before beginning for instructions on setting things up. Email us with any questions, suggestions, or, especially, results that are better than what we’ve put out here.



We’re changing things up!

We recently migrated our primary website from Squarespace to Strikingly. In the process, we realized that it made sense to have our blog separate from our primary landing page (even though we're keeping it within the same domain). For the blog itself, WordPress made sense; among other reasons, it makes it easy to embed live and static code snippets within blog text.

New posts involving promised open source marketing data to follow shortly. Stay tuned!
