PubMed model created using NIH SBIR funding
As part of a recent SBIR grant awarded by NIH, we have clustered PubMed records from 2000 through mid-2019. The resulting model contains 16.26 million documents in 35,965 clusters and is available for anyone to use.
The method used to cluster PubMed records, along with an assessment of the relative accuracy of an earlier PubMed model, is described in a short paper we presented at the 2018 STI meeting in Leiden. That paper can be found here.
The PMID, publication year, and cluster number for each record are available in a gzipped file (73MB).
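For readers who want to work with the assignment file programmatically, the following is a minimal Python sketch of how it might be read. The file layout is an assumption, not something stated here: the sketch assumes one record per line with three tab-delimited columns in the order PMID, publication year, cluster number. The file name `pubmed_clusters.txt.gz` and the function names are hypothetical. Check the actual file and adjust the delimiter and column order as needed.

```python
import csv
import gzip
from collections import Counter


def iter_assignments(path, delimiter="\t"):
    """Yield (pmid, year, cluster) tuples from a gzipped, delimited text file.

    Assumed layout (not confirmed by the source): three columns per line,
    in the order PMID, publication year, cluster number.
    """
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            # Skip blank lines and a header row, if the file has one.
            if len(row) < 3 or not row[0].strip().isdigit():
                continue
            yield int(row[0]), int(row[1]), int(row[2])


def cluster_sizes(assignments):
    """Count how many records fall in each cluster."""
    return Counter(cluster for _pmid, _year, cluster in assignments)


if __name__ == "__main__":
    # Hypothetical file name; substitute the path of the downloaded file.
    sizes = cluster_sizes(iter_assignments("pubmed_clusters.txt.gz"))
    for cluster, count in sizes.most_common(10):
        print(cluster, count)
```

The rows are streamed rather than loaded into one large structure, since the file covers roughly 16 million records.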
An Excel file that contains a list of features associated with each cluster can be downloaded here (197MB). To understand what a particular cluster is about, enter a cluster number (between 0 and 35964) in the yellow box at the top of the “RA5_SHEET” worksheet. The sheet will then populate with information about that cluster, retrieved from the other worksheets. Several cluster-level metrics are provided at the bottom of that sheet, including:
- Fraction of papers with industry addresses
- Average number of patent references to papers in the cluster
- Research level
- Numbers of funding types and grants referenced per paper
- Amount of NIH funding per paper
- Cluster percentile for each of the above metrics
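As an illustration of what a cluster percentile can mean, here is a minimal Python sketch under one common definition: a cluster's percentile for a metric is the percentage of clusters whose value is less than or equal to its own. This definition is an assumption made for illustration. The workbook may compute its percentiles differently, and the function name `percentile_ranks` is hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Return, for each value, the percentage of values that are <= it.

    This is one common percentile-rank convention, assumed here for
    illustration; it is not necessarily the one used in the workbook.
    """
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right gives the count of entries <= v in the sorted list.
    return [100.0 * bisect_right(ordered, v) / n for v in values]
```

For example, given a list with one metric value per cluster (say, the fraction of papers with industry addresses), the function returns a list of the same length holding each cluster's percentile under this convention, with tied clusters sharing the same percentile.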
Questions can be addressed to Kevin Boyack at kboyack at mapofscience.com