PubMed model created using NIH SBIR funding

As part of a recent SBIR grant awarded by NIH, we have clustered PubMed records from 2000 through mid-2019. The resulting model contains 16.26 million documents in 35,965 clusters and is available for anyone to use.

The method used to cluster PubMed records, along with a comparison against the relative accuracy of an earlier PubMed model, is described in a short paper we presented at the 2018 STI meeting in Leiden. That paper can be found here.

A list giving the PMID, publication year, and cluster number for each record is available in a gzipped file (73MB).
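For readers who want to work with the gzipped list directly rather than through Excel, the following is a minimal Python sketch that counts how many documents fall in each cluster. The file layout is an assumption, not something stated above: the sketch supposes one record per line with three tab-separated fields in the order PMID, publication year, cluster number, and no header row. The function name `cluster_sizes` and its parameters are hypothetical, so adjust `delimiter` and `has_header` to match the actual file.

```python
import csv
import gzip
from collections import Counter


def cluster_sizes(path, delimiter="\t", has_header=False):
    """Count documents per cluster in a gzipped PMID/year/cluster file.

    Assumed layout (not confirmed by the source): three delimited columns
    per line, in the order PMID, publication year, cluster number.
    """
    sizes = Counter()
    # Open in text mode so the file is decompressed and decoded on the fly.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        if has_header:
            next(reader, None)  # discard the header row if one exists
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            cluster = int(row[2])
            sizes[cluster] += 1
    return sizes
```

Calling `cluster_sizes("pubmed_clusters.txt.gz")` (a placeholder file name) would return a `Counter` keyed by cluster number. If the layout assumption holds, the counts should sum to roughly 16.26 million across 35,965 clusters, which is a quick sanity check on the download.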

An Excel file that contains a list of features associated with each cluster can be downloaded here (197MB). To understand what a particular cluster is about, enter a cluster number (between 0 and 35964) in the yellow box at the top of the “RA5_SHEET” worksheet. The sheet will populate with information about the cluster, retrieved from the other worksheets. Several cluster-level metrics are provided at the bottom of that sheet, including:

  • Fraction of papers with industry addresses
  • Average number of patent references to papers in the cluster
  • Research level
  • Numbers of funding types and grants referenced per paper
  • Amount of NIH funding per paper
  • Cluster percentile for each of the above metrics

Questions can be addressed to Kevin Boyack at kboyack at