Clustering More Than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
Abstract:
This project aims to provide a highly accurate interactive map of medical research that can be easily used by both technical and non-technical users. Most science maps in use today are small in scale and have not been validated. Accurate decisions require high-quality, high-coverage data, well-defined and tested data analysis workflows, and a resulting representation that matches the visual perception and cognitive processing capabilities of human users.
Phase I of this project compares and determines the relative accuracies of maps of medical research based on commonly used text-based and citation-based similarity measures at a scale of over two million documents.
Team
The project is led by SciTech Strategies Inc. in collaboration with the Cyberinfrastructure for Network Science Center at Indiana University. There are subcontracts to different researchers and one company. The full team comprises:
- Kevin W. Boyack, Richard Klavans, SciTech Strategies Inc.
- Katy Börner, Russell J. Duhon, Nianli Ma, Indiana University
- Bob Schijvenaars, Aaron Sorensen, Collexis Holdings Inc.
- André Skupin, San Diego State University
The following people, although not part of the formal team, will also contribute to the project.
- Edmund Talley, National Institutes of Health
- Dave Newman, University of California, Irvine
Please cite as: Boyack, Kevin W., David Newman, Russell Jackson Duhon, Richard Klavans, Michael Patek, Joseph R. Biberstine, Bob Schijvenaars, André Skupin, Nianli Ma, and Katy Börner. 2011. "Clustering More Than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches". PLoS ONE 6(3): 1-11.
Datasets
All work will be documented in real time and at a level of detail that supports exact replication. All subcontractors will have access to this documentation as well as to intermediate data results. While the Scopus data cannot be made available, all Medline-based derivative data will be made freely available from this page. Data compilation and statistics are documented in STS-Documentation.pdf.
Raw data
List of PMIDs | sts-pmids.txt.gz (4.9MB) |
List of stop words | sts-stop-words.txt.gz |
List of PMIDs with titles and abstracts | pmid-title-abstr1.txt.gz (538MB) |
List of PMIDs with titles and abstracts | pmid-title-abstr2.txt.gz (528MB) |
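The raw files above are gzip-compressed text, and the larger ones exceed 500MB, so reading them as a stream is usually preferable to decompressing them to disk first. The following is a minimal sketch in Python, not part of the project's own tooling. It assumes that sts-pmids.txt.gz holds one numeric PMID per line and is UTF-8 encoded; the actual layout is described in STS-Documentation.pdf and should be checked there. The function names `iter_lines` and `load_pmids` are hypothetical.

```python
import gzip
from typing import Iterator, Set


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield stripped, non-empty lines from a gzip-compressed text file."""
    # Text mode ("rt") decompresses on the fly, so the file is never
    # fully expanded in memory or on disk.
    with gzip.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def load_pmids(path: str) -> Set[int]:
    """Collect PMIDs into a set.

    ASSUMPTION: one numeric PMID per line. Lines that are not purely
    numeric (for example a header) are skipped rather than raising.
    """
    pmids: Set[int] = set()
    for line in iter_lines(path):
        if line.isdigit():
            pmids.add(int(line))
    return pmids
```

With the file downloaded locally, `load_pmids("sts-pmids.txt.gz")` would return the set of document identifiers, which can then be used to filter the title and abstract files while streaming them with `iter_lines`.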
Analysis Input Data
Title/Abstract term adjacency list | sts-text-adj.gz (642MB) |
MeSH adjacency list | sts-mesh-adj.gz (98MB) |
Term Frequency Data
Title/Abstract term frequencies | sts-text-freq.gz (17MB) |
MeSH term frequencies | sts-mesh-freq.gz (211KB) |
Analysis results will also be made available from this site. They will comprise:
Analysis Result Data
Linkage-based analysis |
||
Co-citation | sts-cocite-sim.gz | sts-cocite-clust.gz (56MB) |
Bibliographic coupling | sts-bibcoup-topn.sim.gz (121MB) | sts-bibcoup-clust.gz (9.1MB) |
Direct citation | sts-directcit-topn.sim.gz (67MB) | sts-direct-clust.gz (8.8MB) |
Title/Abstract term analysis |
||
Co-occurrence | sts-TA-co-topn.sim.gz (212MB) | sts-TA-co-clust.gz (8.4MB) |
LSA | sts-TA-lsa-topn.sim.gz (194MB) | sts-TA-lsa-clust.gz (8.9MB) |
Topic model (UCI) | sts-TA-topics-uci.sim.gz (117MB) | sts-TA-topics-clust.gz (9.4MB) |
Collexis | sts-TA-collx-topn.sim.gz (146MB) | sts-TA-collx-clust.gz (9.3MB) |
MeSH analysis |
||
Co-occurrence | sts-mesh-co.sim.gz (155MB) | sts-mesh-co-clust.gz (9.5MB) |
LSA | sts-mesh-lsa-topn.sim.gz (198MB) | sts-mesh-lsa-clust.gz (10MB) |
Self-organizing maps (SOM) | sts-mesh-som.sim | sts-mesh-som.clust.gz (9.4MB) |
Collexis | sts-mesh-collx.sim.gz (149MB) | sts-mesh-collx-clust.gz (9.3MB) |
Other analysis |
||
NCBI related records data | sts-ncbi-topn.sim.gz (115MB) | sts-ncbi-clust.gz (9.4MB) |
Validation |
||
Bib coupling coherence result | bc-lev1-coh.gz (387KB) | |
Co-citation coherence result | cc-lev1-coh.gz (379KB) | |
Direct citation coherence result | dc-lev1-coh.gz (590KB) | |
Co-word MeSH coherence result | co-mesh-lev1-coh.gz (290KB) | |
LSA MeSH coherence result | lsa-mesh-lev1_coh.gz (294KB) | |
SOM MeSH coherence result | som-mesh-lev1-coh.gz (346KB) | |
Collexis MeSH coherence result | collx-mesh-lev1-coh.gz (314KB) | |
Co-word TA coherence result | co-ta-lev1-coh.gz (251KB) | |
LSA TA coherence result | lsa-ta-lev1-coh.gz (280KB) | |
NCBI coherence result | ncbi-lev1-coh.gz (340KB) | |
Collexis TA coherence result | collx-ta-lev1-coh.gz (340KB) | |
Topics TA coherence result | topic-ta-lev1.gz (284KB) |
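Several of the similarity files listed above carry `topn` in their names, which suggests that only the strongest similarity values per document were retained. The sketch below illustrates that general idea in Python; it is an assumption about what the naming implies, not a description of the project's actual procedure. The input layout (triples of document id, document id, similarity value), the value of `n`, and the function name `top_n_per_document` are all hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# ASSUMPTION: each similarity record is (doc_a, doc_b, similarity).
Edge = Tuple[str, str, float]


def top_n_per_document(
    edges: Iterable[Edge], n: int
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep, for every document, its n most similar neighbours."""
    neighbours: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for doc_a, doc_b, sim in edges:
        # Similarity is treated as symmetric, so record both directions.
        neighbours[doc_a].append((doc_b, sim))
        neighbours[doc_b].append((doc_a, sim))
    # heapq.nlargest returns the pairs sorted by descending similarity.
    return {
        doc: heapq.nlargest(n, pairs, key=lambda pair: pair[1])
        for doc, pairs in neighbours.items()
    }
```

This version holds all pairs in memory, which is fine for illustration; at the scale of two million documents a streaming or disk-backed variant would be needed.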
Acknowledgements
This project is funded by NIH SBIR Contract HHSN268200900053C.
Indiana University's Big Red supercomputer used in this study is supported by the National Science Foundation under Grant Nos. ACI-0338618, OCI-0451237, OCI-0535258, and OCI-0504075. This research was supported in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative of Indiana University is supported in part by Lilly Endowment, Inc. This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).