Using Latent Dirichlet Allocation (LDA) to topically classify research papers

repo: https://github.com/polhooper/physics

data: https://snap.stanford.edu/data/cit-HepTh.html

Background

This project started with a relatively simple goal: use freely available network data from Stanford to create a network graph showing how more than a decade of physics papers published since 1992 are related to one another via citation. We wanted to color this graph by topic, and therein lay the problem: no topical information was present in the metadata. If we wanted to assign topics to papers, we’d have to make those assignments ourselves.

Latent Dirichlet Allocation (LDA) is a really cool method for simultaneously generating topic categories from a body of text and assigning documents to them. The intuition is pretty straightforward: we can think of documents as mixtures of topics, and we can discover the words associated with each topic by examining the words that different documents have in common. The technicals are a bit more complicated; Edwin Chen takes a pretty good stab at explaining them on his blog. We fit the model with a common method called collapsed Gibbs sampling, an application of Bayesian statistics and Markov chain Monte Carlo.
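We didn’t hand-roll the sampler for this project, but the core of the collapsed Gibbs update is short enough to sketch. The following is illustrative Python, not the code in our repo: each token’s topic assignment is resampled from a conditional distribution built from document-topic and topic-word counts.

    import numpy as np

    def collapsed_gibbs_lda(docs, n_topics, vocab_size, n_iter=200,
                            alpha=0.1, beta=0.01, seed=0):
        """Toy collapsed Gibbs sampler for LDA.

        docs: list of documents, each a list of integer word ids (< vocab_size).
        Returns document-topic counts and topic-word counts.
        """
        rng = np.random.default_rng(seed)
        n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
        n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
        n_k = np.zeros(n_topics)                 # total tokens assigned to each topic
        z = []                                   # current topic of every token

        # Random initialization of topic assignments
        for d, doc in enumerate(docs):
            z_d = rng.integers(n_topics, size=len(doc))
            z.append(z_d)
            for w, k in zip(doc, z_d):
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

        for _ in range(n_iter):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    # Remove this token's current assignment from the counts
                    n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                    # Conditional distribution over topics for this token
                    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                    k = rng.choice(n_topics, p=p / p.sum())
                    # Put the token back under its (possibly new) topic
                    z[d][i] = k
                    n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

        return n_dk, n_kw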

Process

LDA works on word counts over a fixed vocabulary rather than on raw text, so we needed to condense our abstracts into counts of the most common words in the literature. We did this by building a “term-document matrix,” counting the occurrences of a word in an abstract only if that word appeared in at least 100 of the 25,000+ abstracts in the data. We built this matrix using Python’s textmining package.
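Here is roughly what that step looks like, assuming the textmining package’s TermDocumentMatrix interface and a list of abstract strings called abstracts (a sketch, not the exact code in the repo):

    import textmining

    # Build the term-document matrix from the abstracts
    tdm = textmining.TermDocumentMatrix()
    for abstract in abstracts:
        tdm.add_doc(abstract)

    # rows(cutoff=100) keeps only words that appear in at least 100 abstracts;
    # the first row is the vocabulary, the remaining rows are per-abstract counts
    rows = tdm.rows(cutoff=100)
    vocab = next(rows)
    counts = [row for row in rows]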

The most important parameter the user controls in LDA is the number of topics in the generative model. This is a judgment call. We checked out Wikipedia’s Outline of Physics article and decided to start with 25 topics.

Once we’d structured our text data for LDA and decided on the number of topics to feed into the generative model, the last thing we needed to do was select the number of Gibbs sampling iterations. Experimentally, we found that the log likelihood was essentially flat after iteration 200.
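A minimal version of the fitting step might look like the following. This sketch uses the lda package, which implements collapsed Gibbs sampling; our repo may wire things up differently. Here counts and vocab are the term-document matrix and vocabulary built above.

    import numpy as np
    import lda

    # Fit LDA via collapsed Gibbs sampling on the term-document counts
    X = np.array(counts, dtype=np.int64)
    model = lda.LDA(n_topics=25, n_iter=500, random_state=1)
    model.fit(X)

    # The recorded log likelihoods let you eyeball convergence; in our runs
    # the curve was essentially flat after roughly 200 iterations
    print(model.loglikelihoods_)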

Results

Here were some of the most important topics that emerged from our analysis, along with some interesting words associated with them:

  • Topic 21 (17.5%): algebra, quantum, group, representation, operators, deformed
  • Topic 24 (16%): supersymmetry, supersymmetric, duality, chiral, symmetries, cern, gauge 
  • Topic 18 (9.67%): abelian, dual, magnetic, monopole, higgs, topological, field, electric
  • Topic 8 (7.67%): geometry, space, mathematical, differentiable, manifolds 
  • Topic 3 (6.64%): noncommutative, string, wave, spectrum, background, scattering
  • Topic 5 (5.7%): dimensional, lattice, partition, scaling
  • Topic 4 (4.33%): black, hole, holes, entropy, dilation, dimensional, hawking, einstein 

The analysis seems to do a pretty good job identifying sub-topics in the physics literature. For example, we had no idea that there was something called supersymmetric gauge theory, which is closely related to quantum electrodynamics, until we Googled the keywords from topic 24. Topic 4 clearly relates to cosmology and black holes.
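For anyone reproducing this, the topic shares and top words reported above can be pulled from a fitted model along these lines (again following the lda package’s attribute names, which may differ from the repo’s own script):

    # doc_topic_: per-document topic proportions; topic_word_: per-topic word distributions
    doc_topic = model.doc_topic_
    topic_word = model.topic_word_

    topic_share = doc_topic.mean(axis=0)  # average weight of each topic across abstracts
    for k in np.argsort(topic_share)[::-1][:7]:
        top_words = np.array(vocab)[np.argsort(topic_word[k])[::-1][:8]]
        print("Topic %d (%.2f%%): %s" % (k, 100 * topic_share[k], ", ".join(top_words)))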

You can easily replicate this analysis for yourself by stopping by our GitHub repo. Check out our next post, where we visualize the results of this analysis in Gephi to see how well these topics actually relate to the citation data.
