An interactive visualization of U.S. Census FSRDC-linked research outputs
Project Summary
Dashboard Features & Methodology (2025 Update):
Data cleaning, deduplication, and enrichment pipeline: combines eight group CSVs, joins project-level metadata, and enriches records through the Crossref and OpenAlex APIs, with rate limiting, retries, and local caching for reproducibility and completeness.
Interactive EDA: Bar chart of top 10 RDCs (highlighting Boston/Michigan dominance), publication trends over time (with data cutoff caveat for 2024/2025), and top 10 authors (showing name clustering and prolific contributors).
Co-authorship network insights: Top author pairs and collaboration patterns, revealing both strong partnerships and distributed networks.
Publication trends for top 5 RDCs: Comparative growth and recent peaks, especially for Michigan and Boston.
Clustering & PCA: KMeans (k=3, k=4) and DBSCAN on standardized/enriched metadata, with 2D PCA projections. Clusters interpreted as long-term journal projects, short-term/draft outputs, and specialized/region-specific studies.
Classification: Random Forest multi-class model with OutputType_x as the target; rare classes are dropped and SMOTE balances the remaining classes. Evaluated with a confusion matrix and classification report.
Text analysis: TF-IDF KMeans (k=5) and LDA topic modeling on OutputTitle, surfacing themes like housing/inequality, public health, labor markets, COVID-19, and healthcare/rural policy. Word cloud visualization included.
DES (Discrete Event Simulation): Models project lifecycle from start to publication, quantifies bottlenecks (approval/data access), and estimates median time to publication (3–4 years).
Searchable/filterable data table and CSV download for further exploration.
Modern UI: Responsive design, dark/light mode toggle, and clear navigation for accessibility.
Team and data/tool information section for transparency and reproducibility.
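The enrichment step listed above (rate limiting, retries, local caching) could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name enrich_doi, the cache layout, the delay values, and the example DOI are all hypothetical, and the real Crossref/OpenAlex request is replaced by an injected callable so the sketch runs offline.

```python
import json
import tempfile
import time
from pathlib import Path


def enrich_doi(doi, fetch, cache_dir, retries=3, base_delay=0.0, min_interval=0.0):
    """Return metadata for `doi`, using a local JSON cache and simple retries.

    `fetch` is a hypothetical callable (doi -> dict) standing in for a
    Crossref or OpenAlex request; it is injected so no network is needed here.
    """
    cache_file = Path(cache_dir) / (doi.replace("/", "_") + ".json")
    if cache_file.exists():
        # Cache hit: reuse the stored response so reruns are reproducible.
        return json.loads(cache_file.read_text())

    last_error = None
    for attempt in range(retries):
        time.sleep(min_interval)  # crude rate limit between requests
        try:
            record = fetch(doi)
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
            continue
        cache_file.write_text(json.dumps(record))
        return record
    raise RuntimeError(f"enrichment failed for {doi}") from last_error


# Demo with a fake fetcher that fails once, then succeeds.
calls = {"n": 0}


def flaky_fetch(doi):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated transient failure")
    return {"doi": doi, "title": "Example title"}


cache = tempfile.mkdtemp()
first = enrich_doi("10.1000/example", flaky_fetch, cache)
second = enrich_doi("10.1000/example", flaky_fetch, cache)  # served from cache
```

In the demo, the second call never reaches the fetcher because the first successful response was written to the cache directory. In practice the delays would be set to nonzero values that respect each API's usage policy.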
Key Results & Insights (2025):
Data pipeline: Ensured a clean, deduplicated, and richly annotated dataset by combining multiple sources and robust API enrichment with error handling.
Top RDCs: Boston and Michigan are clear leaders in research output, with a sharp drop after the top 5, indicating output concentration in leading institutions.
Publication trends: Steady growth since 2011, peaking in 2023; sharp drop in 2024/2025 is due to incomplete data, not a real decline.
Authors: High concentration among a few prolific authors (notably J. Wang, Z. Wang, Y. Zhang); the recurring surnames with different initials may reflect common names or high-output research groups.
Co-authorship: Top pairs have 6–8 joint publications, indicating strong but distributed collaboration networks. Many prolific authors collaborate widely, not just in pairs.
Clustering & PCA: Three main research clusters (long-term journal projects, short-term drafts, specialized outputs); DBSCAN isolates outliers and edge cases. The 2D PCA projection did not separate output types cleanly, likely due to non-linear feature relationships.
Classification: With OutputType_x as the target, rare classes were dropped, SMOTE balanced the remaining classes, and a Random Forest produced multi-class predictions. Model performance was evaluated with a confusion matrix and classification report.
Text analysis: TF-IDF KMeans and LDA reveal major themes: housing/inequality, public health, labor markets, COVID-19, economic shocks, and healthcare/rural policy. Word cloud and topic lists included for interpretability.
DES: Median time from project start to publication is 3–4 years, with bottlenecks at approval and data access stages. Highlights the need to streamline these phases for faster research output.
Comprehensive view: The dashboard provides an interactive, end-to-end exploration of FSRDC-linked research, supporting both high-level insights and detailed drill-downs for further analysis.
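The SMOTE balancing step mentioned under Classification above works by creating synthetic minority-class samples between existing ones. The sketch below illustrates only that core interpolation idea in plain Python. It is a simplified stand-in, not the library implementation used in the project, and the function name smote_like, the sample points, and the parameter values are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between minority-class
    points and one of their k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical 2-feature minority class with 4 samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training split only, so that evaluation metrics are computed on real data.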
Top 10 RDCs by Research Output
The chart below highlights the top 10 Research Data Centers (RDCs) by the number of associated research outputs. Boston and Michigan are the clear leaders, with a sharp drop after the fifth-ranked center (Triangle), indicating a concentration of research output within these leading institutions. The average number of research outputs per RDC is approximately 807, reflecting the high productivity of the top centers.
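A ranking and per-RDC average of this kind can be derived from a simple count of outputs per center. The sketch below uses a handful of made-up rows, and the variable names are illustrative. It shows one plausible way such figures are computed, not necessarily how the dashboard's numbers were produced.

```python
from collections import Counter

# Hypothetical rows: one RDC label per research output.
outputs = ["Boston", "Michigan", "Boston", "Triangle", "Michigan", "Boston", "Baruch"]

counts = Counter(outputs)
top_rdcs = counts.most_common(3)  # [(rdc, n_outputs), ...] in descending order
mean_per_rdc = sum(counts.values()) / len(counts)  # total outputs / number of RDCs
```

Note that a mean computed this way is sensitive to a few very large centers, so reporting the median alongside it can give a fuller picture of a skewed distribution.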
Top Co-Author Pairs
Mariko Sakakibara & Natarajan Balasubramanian: 8
Benjamin A. Campbell & Rajshree Agarwal: 8
Benjamin A. Campbell & Martin Ganco: 8
Charles Courtemanche & James Marton: 8
Abigail Cooke & Thomas Kemeny: 7
Nuri Ersahin & Rustom M. Irani: 6
Debarshi K. Nandy & Karthik Krishnan: 6
Andrew B Bernard & J. Bradford Jensen: 6
Karthik Krishnan & Thomas J. Chemmanur: 6
Martin Ganco & Rajshree Agarwal: 6
Insight: Top co-author pairs have collaborated on up to 8 publications, indicating strong research partnerships. However, these pair counts are low compared with the hundreds of publications attributed to the most prolific authors. This suggests that many leading authors collaborate widely across different teams, or that author parsing splits teams in a way that undercounts some collaborations.
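Pair counts like those listed above can be obtained by enumerating every two-author combination within each publication. The following is a minimal sketch under stated assumptions: the author strings, the ";" separator, and the variable names are hypothetical and may not match the dataset's actual format.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author strings, one per publication, separated by ";".
author_strings = [
    "A. Smith; B. Jones; C. Lee",
    "B. Jones; A. Smith",
    "C. Lee; D. Patel",
]

pair_counts = Counter()
for raw in author_strings:
    # Normalise whitespace and drop duplicate names within a publication.
    authors = sorted({name.strip() for name in raw.split(";") if name.strip()})
    # Sorting first makes (A, B) and (B, A) count as the same pair.
    pair_counts.update(combinations(authors, 2))

top_pairs = pair_counts.most_common(1)
```

The parsing step matters here: if the same person appears under different spellings, or a separator is misread, pairs are split across several keys and their counts come out too low.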
Publication Trend Over Years
This chart illustrates the growth in FSRDC-linked publications over time. There is a general upward trend from 1993 to 2023, with a slow but steady increase until the early 2000s, then a more noticeable acceleration after 2004–2005. Peak publication activity occurs in 2023. The sharp decline in 2024 and 2025 is due to incomplete data for those years, not an actual drop in research output.
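One way to keep the incomplete years from distorting the trend is to flag them explicitly when aggregating. The sketch below assumes a cutoff constant (LAST_COMPLETE_YEAR) and uses made-up year values; both are illustrative and would need to be set from the real dataset.

```python
from collections import Counter

# Hypothetical publication years, one per output.
years = [2021, 2022, 2022, 2023, 2023, 2023, 2024]
LAST_COMPLETE_YEAR = 2023  # assumed data cutoff; adjust to the actual dataset

per_year = Counter(years)
trend = [
    {"year": y, "count": per_year[y], "complete": y <= LAST_COMPLETE_YEAR}
    for y in sorted(per_year)
]
# Restrict peak detection to complete years so a partial year cannot mislead.
peak = max((row for row in trend if row["complete"]), key=lambda r: r["count"])
```

The "complete" flag can then drive the chart styling, for example by drawing partial years with a dashed line or a lighter color.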
Publication Trends Over Time for Top 5 RDCs
This chart shows yearly publication counts for the top 5 RDCs (Michigan, Boston, Baruch, Penn State, Triangle). Michigan and Boston have scaled their research output more dramatically in recent years, with significant peaks around 2020–2023. All top 5 RDCs show growth, but there is year-to-year variability and a sharp drop in 2024–2025 due to data cutoff.
Top 10 Authors
The chart shows the top 10 most prolific authors in the FSRDC ecosystem. J. Wang leads with 682 publications, followed by Z. Wang (485), Y. Zhang (343), J. Lee (323), L. Zhang (311), M. Finger (294), Y. Chen (293), A. Sharma (288), S. Bhattacharya (277), and H. Kim (264).
Insight: There is a high concentration of publications among a few authors, with J. Wang being exceptionally prolific. The recurrence of surnames like "Wang" and "Zhang" (with different initials) suggests either common names or the presence of research groups/teams with high output. Many prolific authors collaborate widely, as seen in the co-authorship analysis, rather than only in fixed pairs.
PCA & Clustering Visualizations
Explore different 2D projections and clusterings of the research outputs. Use the dropdowns to switch between PCA by OutputType, KMeans (3/4 clusters), DBSCAN, and TF-IDF KMeans text clusters.
TF-IDF KMeans Text Clustering
This plot shows clusters based on KMeans applied to TF-IDF features of research titles, revealing thematic groupings in the text.
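To make the idea behind TF-IDF features concrete, the sketch below computes simple TF-IDF weights by hand and compares titles with cosine similarity. It uses a textbook formula (term frequency times log(N/df)), which differs in detail from the smoothed and normalized variants found in common libraries, and the titles are invented examples. KMeans would then group vectors like these.

```python
import math
from collections import Counter

# Hypothetical titles standing in for OutputTitle values.
titles = [
    "housing inequality in urban markets",
    "urban housing and income inequality",
    "rural healthcare access policy",
]
docs = [t.lower().split() for t in titles]
n_docs = len(docs)
# Document frequency: number of titles containing each term.
df = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Map each term to tf * idf, with idf = log(n_docs / df)."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = [tfidf(d) for d in docs]
sim_housing = cosine(vectors[0], vectors[1])  # two housing-related titles
sim_mixed = cosine(vectors[0], vectors[2])    # housing vs healthcare
```

The two housing titles share weighted terms and so score higher than the housing and healthcare pair, which share none. This also shows a limitation: similarity here is based on shared words, so titles on the same theme with different vocabulary will not be grouped together.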
Topic Modeling Insights
We applied Latent Dirichlet Allocation (LDA) and TF-IDF KMeans clustering to analyze research titles and uncover major themes in FSRDC research outputs.
LDA topic modeling identified five prominent research themes:
Housing, inequality, and urban development
Gender gaps and STEM education outcomes
COVID-19 and small business impact
Economic shocks and unemployment
Healthcare access and rural policy
TF-IDF KMeans clustering grouped titles by the similarity of their weighted terms, surfacing clusters such as:
Titles focused on housing, income, and urban inequality
Research related to public health, rural access, or healthcare policy
Studies on labor markets, job mobility, and economic shocks
These methods provide a powerful lens for exploring research focus areas and recurring themes in thousands of outputs. The topic lists and word cloud below summarize the most frequent and important terms in the FSRDC research corpus.
This word cloud visualizes the most prominent terms identified through LDA topic modeling of research titles. Larger words appear more frequently across top topics, reflecting the main research areas in the FSRDC corpus.
Discrete Event Simulation (DES) Insights
To illustrate data flow in the FSRDC ecosystem, we implemented a conceptual Discrete Event Simulation (DES). The simulation models project lifecycles from proposal submission to publication, capturing delays from RDC approval, data access, and analysis.
Entities: Projects submitted to RDCs
Events: Proposal approval, data access granted, output published
Resources: RDC data specialists and reviewers
Key Insight: Median time from start to output is 3–4 years, with substantial queueing at approval and access stages.
This simulation highlights the importance of streamlining RDC processes to improve research throughput.
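The entities, events, and resources described above can be sketched as a small queueing simulation. Everything numeric below is an illustrative assumption: the arrival rate, staffing levels, and service times, all in months. The helper serve_fifo is hypothetical, and the sketch is not calibrated to reproduce the 3–4 year estimate.

```python
import heapq
import random
import statistics


def serve_fifo(arrivals, n_servers, service_time):
    """Push jobs through a FIFO queue with `n_servers`; return finish times and waits."""
    free_at = [0.0] * n_servers  # min-heap of times at which each server becomes free
    heapq.heapify(free_at)
    finished, waits = {}, {}
    for job, arrive in sorted(arrivals.items(), key=lambda kv: kv[1]):
        start = max(arrive, heapq.heappop(free_at))
        waits[job] = start - arrive  # time spent queueing for a server
        finished[job] = start + service_time()
        heapq.heappush(free_at, finished[job])
    return finished, waits


rng = random.Random(42)
# Entities: projects submitted to RDCs, about one new proposal per month.
submitted, t = {}, 0.0
for project in range(200):
    t += rng.expovariate(1 / 1.0)
    submitted[project] = t

# Resources: 2 reviewers for approval, 1 data specialist for access (assumed).
approved, wait_approval = serve_fifo(submitted, 2, lambda: rng.expovariate(1 / 1.8))
granted, wait_access = serve_fifo(approved, 1, lambda: rng.expovariate(1 / 0.9))
# Analysis and writing need no shared resource in this sketch, only elapsed time.
published = {p: granted[p] + rng.uniform(18, 36) for p in granted}

total_months = [published[p] - submitted[p] for p in submitted]
median_total = statistics.median(total_months)
median_wait_approval = statistics.median(wait_approval.values())
median_wait_access = statistics.median(wait_access.values())
```

Comparing median_wait_approval and median_wait_access against the service times shows where queueing, rather than the work itself, dominates. Rerunning with an extra reviewer or specialist is a quick way to test how much a given staffing change would shorten the total time.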
Project Lifecycle Bottleneck Chart
This bar chart shows the simulated median time spent in each major phase of an FSRDC project. It visualizes delays at the approval, data access, analysis, and publication stages.