
PubMatrixR is an R package that performs systematic literature searches on PubMed and PMC databases using pairwise combinations of search terms. It creates co-occurrence matrices showing the number of publications that mention both terms from two different sets, enabling researchers to explore relationships between genes, diseases, pathways, or any other biomedical concepts.
This repository maintains and extends the original
PubMatrixR package with improved validation, offline-safe
tests/vignettes, and heatmap helpers.
Interactive Shiny App: https://toledoem.shinyapps.io/pubmatrix-app/
No installation required - just open the link and start analyzing!
install.packages("PubMatrixR")

# Install remotes if you haven't already
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
# Install PubMatrixR
remotes::install_github("ToledoEM/PubMatrixR-v2")

PubMatrixR requires the following R packages:

- pbapply - Progress bars for apply functions
- pheatmap - Static heatmap generation
- xml2 - XML parsing for API responses
- readODS - To export results in OpenDocument Spreadsheet (Excel compatible) format, for hyperlink export

library(PubMatrixR)
# Define two sets of search terms
genes_set1 <- c("SREBP1", "SOX4", "GLP1R")
genes_set2 <- c("NR1H4", "liver", "obesity")
# Perform the search and create a matrix
result <- PubMatrix(
  A = genes_set1,
  B = genes_set2,
  Database = "pubmed",
  daterange = c(2010, 2024),
  outfile = "my_results",
  export_format = "csv" # Options: NULL (no export), "csv", or "ods"
)
# Create a heatmap with overlap percentages and Euclidean clustering
plot_pubmatrix_heatmap(result)

PubMatrix() is the main function: it performs pairwise literature searches and generates co-occurrence matrices.

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | character | - | Path to a file containing search terms (alternative to the A/B vectors) |
| A | character vector | NULL | First set of search terms |
| B | character vector | NULL | Second set of search terms |
| API.key | character | NULL | NCBI E-utilities API key (optional, increases rate limits) |
| Database | character | "pubmed" | Database to search: "pubmed" or "pmc" |
| daterange | numeric vector | NULL | Date range as c(start_year, end_year) |
| outfile | character | NULL | Base filename for outputs (without extension). Required if export_format is specified. |
| export_format | character | NULL | Export format for the hyperlinked results matrix. Options: NULL (default, no file export), "csv" (Excel-compatible with HYPERLINK formulas), or "ods" (LibreOffice/OpenOffice format). |
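
To make the pairwise idea concrete, here is a minimal sketch of how two term sets expand into one query per cell of the result. It is an illustration only: the helper name build_query and the quoted "x AND y" query syntax are assumptions, and the query string PubMatrixR actually sends to NCBI is not shown in this README.

```r
# Toy term sets with the same shape as the A and B arguments.
A <- c("SREBP1", "SOX4")
B <- c("liver", "obesity", "NR1H4")

# Hypothetical helper: one query string per (B term, A term) pair.
# The quoted "x AND y" form is an assumption, not the package's documented syntax.
build_query <- function(b, a) paste0('"', b, '" AND "', a, '"')

# outer() evaluates the helper for every combination of B (rows) and A (columns).
queries <- outer(B, A, build_query)
dimnames(queries) <- list(B, A)

print(queries)
```

Each cell of the final matrix would then hold the hit count returned for the query in the same position.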
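Returns a matrix-like data frame where rows correspond to the search terms in B, columns correspond to the search terms in A, and each cell holds the number of publications that mention both terms.

When using the file parameter, the input file should contain: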
term1_from_A
term2_from_A
term3_from_A
#
term1_from_B
term2_from_B
term3_from_B
The # character separates the two sets of search
terms.
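
As a rough illustration of this layout, the sketch below writes a small terms file and splits it back into two sets at the # line. The helper read_term_sets is hypothetical and is not part of PubMatrixR; how the package itself parses the file is not shown here, so treat this as a way to check a file before passing it to the file parameter.

```r
# Write a small example file in the documented layout.
terms_file <- tempfile(fileext = ".txt")
writeLines(c("insulin", "glucose", "#", "liver", "adipose tissue"), terms_file)

# Hypothetical helper (not part of PubMatrixR): split the lines at the first "#".
read_term_sets <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]            # drop blank lines
  sep <- which(lines == "#")[1]            # position of the separator line
  if (is.na(sep)) stop("No '#' separator line found in ", path)
  list(
    A = lines[seq_len(sep - 1)],           # everything before the separator
    B = lines[-seq_len(sep)]               # everything after the separator
  )
}

sets <- read_term_sets(terms_file)
str(sets)
```

Multi-word terms such as "adipose tissue" stay intact because the file is read line by line.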
PubMatrixR provides dedicated functions for creating heatmaps from PubMatrix results.
plot_pubmatrix_heatmap() creates a formatted heatmap that displays overlap percentages in its cells and orders rows and columns by Euclidean distance clustering.

- Cell Values: Overlap percentages derived from co-occurrence counts
- Clustering Method: Euclidean distance on the overlap percentage matrix

| Parameter | Type | Default | Description |
|---|---|---|---|
| matrix | numeric matrix | - | A PubMatrix result matrix containing publication co-occurrence counts |
| title | character | "PubMatrix Co-occurrence Heatmap" | Heatmap title |
| cluster_rows | logical | TRUE | Whether to cluster rows using Euclidean distance |
| cluster_cols | logical | TRUE | Whether to cluster columns using Euclidean distance |
| show_numbers | logical | TRUE | Display overlap percentages in cells |
| filename | character | NULL | Optional filename to save plot |
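
For readers unfamiliar with Euclidean clustering, the following base R sketch shows what ordering rows and columns by Euclidean distance means. It uses a made-up matrix in place of real overlap percentages, and it does not reproduce the package's internal percentage calculation or plotting code, which are not documented here.

```r
# Toy matrix standing in for an overlap-percentage matrix (values are made up).
pct <- matrix(
  c(10, 12, 80,
    11, 13, 78,
    90, 85,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("termB1", "termB2", "termB3"),
                  c("termA1", "termA2", "termA3"))
)

# Euclidean distance between rows, followed by hierarchical clustering.
row_tree <- hclust(dist(pct, method = "euclidean"))

# Same for columns: transpose so that columns are treated as rows.
col_tree <- hclust(dist(t(pct), method = "euclidean"))

# Reorder the matrix the way a clustered heatmap would display it.
ordered <- pct[row_tree$order, col_tree$order]
print(ordered)
```

Rows with similar profiles (here termB1 and termB2) end up next to each other, which is what makes blocks of related terms visible in the heatmap.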
# First generate a matrix
result <- PubMatrix(A = c("gene1", "gene2"), B = c("disease1", "disease2"))
# Create heatmap with overlap percentages and Euclidean clustering
plot_pubmatrix_heatmap(result)
# Save to file
plot_pubmatrix_heatmap(result, filename = "my_heatmap.png")

pubmatrix_heatmap() is a thin wrapper around plot_pubmatrix_heatmap() for quick visualization.
# Quick heatmap using defaults
pubmatrix_heatmap(result, title = "Quick PubMatrix Heatmap")

library(PubMatrixR)
# Define gene sets
genes_of_interest <- c("TP53", "BRCA1", "EGFR", "MYC")
pathways <- c("apoptosis", "DNA repair", "cell cycle", "oncogene")
# Perform search
results <- PubMatrix(
  A = genes_of_interest,
  B = pathways,
  Database = "pubmed",
  daterange = c(2015, 2024),
  outfile = "gene_pathway_matrix"
)
# View results
print(results)
#            TP53 BRCA1 EGFR  MYC
# apoptosis  1456   234  567  890
# DNA repair  789  1456  123  234
# cell cycle 1234   456  890  567
# oncogene    567   123  789 1456

library(PubMatrixR)
library(msigdf)
library(dplyr)
# Extract gene symbols from MSigDB pathways
wnt_genes <- msigdf::msigdf.human %>%
  filter(grepl("wnt", geneset, ignore.case = TRUE)) %>%
  pull(symbol) %>%
  unique() %>%
  sample(10) # Sample 10 genes for demonstration
obesity_genes <- msigdf::msigdf.human %>%
  filter(grepl("obesity", geneset, ignore.case = TRUE)) %>%
  pull(symbol) %>%
  unique() %>%
  sample(10) # Sample 10 genes for demonstration
# Search for co-occurrences
wnt_obesity_matrix <- PubMatrix(
  A = wnt_genes,
  B = obesity_genes,
  Database = "pubmed",
  outfile = "wnt_obesity_cooccurrence"
)
# Create heatmap with overlap percentages and Euclidean clustering
plot_pubmatrix_heatmap(wnt_obesity_matrix)

Create a file called search_terms.txt:
insulin
glucose
diabetes
metabolic syndrome
#
liver
pancreas
adipose tissue
muscle
Then run:
results <- PubMatrix(
  file = "search_terms.txt",
  Database = "pubmed",
  daterange = c(2020, 2024),
  outfile = "metabolic_tissue_matrix"
)
# Create heatmap visualization
plot_pubmatrix_heatmap(results)

# Get better rate limits with an API key
results <- PubMatrix(
  A = c("CRISPR", "base editing", "prime editing"),
  B = c("therapeutic", "clinical trial", "safety"),
  API.key = "your_ncbi_api_key_here",
  Database = "pubmed",
  daterange = c(2020, 2024),
  outfile = "gene_editing_therapeutics"
)

When the outfile and export_format parameters are specified, PubMatrixR generates a results file with clickable hyperlinks:

| Format | Parameter Value | File Extension | Use Case |
|---|---|---|---|
| No Export | export_format = NULL (default) | - | Results returned only to R environment, no file saved |
| CSV | export_format = "csv" | .csv | Excel-compatible format with HYPERLINK formulas for direct linking to PubMed searches |
| ODS | export_format = "ods" | .ods | LibreOffice/OpenOffice format with embedded hyperlinks, better for cross-platform compatibility |
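
To show what a hyperlinked cell can look like, here is a small sketch that builds a spreadsheet HYPERLINK formula pointing at a PubMed search. The helper make_hyperlink_cell is hypothetical, and both the URL layout and the formula text are assumptions for illustration; the exact strings PubMatrixR writes to its export files may differ.

```r
# Hypothetical helper (not part of PubMatrixR): build one hyperlinked cell.
make_hyperlink_cell <- function(term_a, term_b, count) {
  # Percent-encode the combined query so spaces become %20.
  query <- utils::URLencode(paste(term_a, "AND", term_b), reserved = TRUE)
  # Public PubMed search URL; assumed here, not taken from the package source.
  url <- paste0("https://pubmed.ncbi.nlm.nih.gov/?term=", query)
  # HYPERLINK(url, label): the count is shown and the cell links to the search.
  sprintf('=HYPERLINK("%s", %d)', url, as.integer(count))
}

cell <- make_hyperlink_cell("TP53", "apoptosis", 1456)
print(cell)
```

When such a string is written to a CSV file, a spreadsheet application that evaluates formulas will display the count and open the search on click.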
The output filename follows the pattern:
{outfile}_result.{extension}
All formats include:
# No file export - results only in R
result <- PubMatrix(A = genes, B = diseases, Database = "pubmed")
# Export as CSV with hyperlinks
result <- PubMatrix(
  A = genes,
  B = diseases,
  Database = "pubmed",
  outfile = "my_results",
  export_format = "csv"
)
# Creates: my_results_result.csv
# Export as ODS (LibreOffice format)
result <- PubMatrix(
  A = genes,
  B = diseases,
  Database = "pubmed",
  outfile = "my_results",
  export_format = "ods"
)
# Creates: my_results_result.ods

Create heatmaps using the dedicated heatmap functions:
# Basic heatmap with overlap percentages and Euclidean clustering
plot_pubmatrix_heatmap(your_matrix)
# Save heatmap to file
plot_pubmatrix_heatmap(your_matrix,
  filename = "my_heatmap.png",
  title = "Custom Title")

Features of the visualization:

- Color gradient from a light shade (#fee5d9) to dark red (#99000d) representing publication counts

To improve search speed and avoid rate limiting:

- Pass an NCBI API key through the API.key parameter

Reference: NCBI E-utilities documentation
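
One way to supply the key without hard-coding it in scripts is to read it from an environment variable. The sketch below assumes a variable named NCBI_API_KEY and a hypothetical helper get_ncbi_key; neither is defined by PubMatrixR. For context, NCBI documents a limit of about 3 requests per second without a key and 10 with one.

```r
# Hypothetical helper: read the key from an environment variable.
# The variable name NCBI_API_KEY is an assumption, not a PubMatrixR convention.
get_ncbi_key <- function(var = "NCBI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (nzchar(key)) key else NULL   # NULL matches the documented API.key default
}

api_key <- get_ncbi_key()

# Usage (not run here because it needs network access):
# results <- PubMatrix(
#   A = c("CRISPR"), B = c("safety"),
#   API.key = api_key, Database = "pubmed"
# )
```

Returning NULL when the variable is unset means the same script runs with or without a key.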
PubMatrixR is particularly useful for:
Empty Results: If many searches return 0 results, try:
Rate Limiting Errors: If you encounter HTTP 429 errors:
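
A generic mitigation for rate limiting is to retry with increasing pauses. The sketch below is a minimal exponential backoff wrapper; with_retry is a hypothetical helper, and it assumes the failing call signals an ordinary R error, which this README does not confirm for PubMatrix().

```r
# Hypothetical helper: retry a zero-argument function with exponential backoff.
with_retry <- function(fun, max_tries = 4, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a value instead of aborting.
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # Out of attempts: re-signal the last error.
    if (attempt == max_tries) stop(result)
    # Wait base_wait, then 2x, then 4x, ... seconds before the next attempt.
    Sys.sleep(base_wait * 2^(attempt - 1))
  }
}

# Usage (not run here because it needs network access):
# results <- with_retry(function() {
#   PubMatrix(A = c("CRISPR"), B = c("safety"), Database = "pubmed")
# })
```

Combining this with an API key and smaller term sets keeps the total number of requests per second lower.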
Long Search Times: For large matrices:
This project is licensed under the MIT License - see the LICENSE file for details.