1 Abstract
Escalating pharmaceutical prices in the United States present a significant barrier to healthcare access and exacerbate financial strain on patients. This study investigates geographic and demographic disparities in prescription drug costs by analyzing Medicare Part D plan pricing at the county level. To manage and analyze over 100 GB of complex data, we developed a scalable data engineering pipeline utilizing AWS S3, DuckDB, and the Parquet file format. The pipeline integrates datasets from the Centers for Medicare & Medicaid Services (CMS), the U.S. Census Bureau, and the Food and Drug Administration (FDA). We applied statistical analysis and machine learning techniques to examine how demographic and market factors influence drug pricing. Our findings reveal that the number of available Medicare plans in a county is the most influential predictor of price for the majority of drugs studied, underscoring the impact of market dynamics over direct demographic characteristics. While certain demographic variables, such as median income, education level, and racial composition, showed predictive relevance for specific medications, no single demographic factor consistently explained price variation across all drugs. Although our analysis does not conclusively demonstrate disproportionate pricing impacts on any one demographic group, it highlights a complex and inequitable pricing landscape. These insights challenge the efficacy of current Medicare price negotiation mechanisms and suggest that federal and state policymakers should pursue price normalization strategies. Ensuring that the lowest negotiated drug prices are broadly accessible could enhance transparency and affordability for all Medicare beneficiaries.
2 Introduction
Escalating pharmaceutical prices in the U.S. are reshaping public health policy, widening healthcare disparities, and straining the financial well-being of millions. A recent KFF poll found that 60% of adults take at least one prescription, and 25% take four or more (“Public Opinion on Prescription Drugs and Their Prices” 2024). Yet, as pharmaceutical technology advances, affordability remains a significant concern. Even with coverage through Medicare, Medicaid, or private insurance, many patients face high out-of-pocket costs that contribute to financial stress and, in some cases, treatment avoidance.
According to the Commonwealth Fund (“2025 Scorecard on State Health System Performance, Fragile Progress, Continuing Disparities” 2025), 30% of U.S. adults carry medical debt, and nearly 57% of under-insured adults delay or avoid necessary treatment due to cost. This financial strain is compounded by the fragmented structure of public insurance programs like Medicare and Medicaid, which serve only specific populations and vary widely by state and county. These regional differences in plan availability and coverage often translate into significant disparities in pricing and out-of-pocket costs, further undermining equitable access to care.
While medications are only one aspect of medical care, they are essential in preventing more costly interventions. Chronic conditions such as diabetes, cardiovascular disease, and kidney disease can often be effectively managed with medication, reducing the need for hospitalization or surgery. A recent study by UNC Health (UNC Health 2025) found that oral semaglutide significantly lowered the risk of heart attacks and strokes. (Semaglutide is marketed as Ozempic and Rybelsus for diabetes and as Wegovy for weight loss; Rybelsus is the oral formulation.) This underscores that medication affordability is not merely a matter of convenience but a foundation of equitable, cost-effective healthcare.
This project examines the costs associated with Medicare prescription drug plans and their impact on different demographic groups across the U.S. By analyzing how plan pricing varies by state and county, we aim to better understand the financial burden on under-served populations. The findings will highlight geographic and socioeconomic disparities in medication affordability and may inform policy changes or targeted interventions to advance health equity.
To support this analysis, we integrated publicly available data from the Centers for Medicare & Medicaid Services (CMS), the U.S. Census Bureau, and the Food and Drug Administration (FDA). These datasets provide detailed information on drug pricing, clinical use, demographic characteristics, and public insurance coverage specifically for Medicare patients. We brought together all datasets into a single DuckDB database on AWS S3, allowing for efficient and scalable data queries. This setup supports a thorough investigation of drug spending patterns and helps model costs based on demographic factors.
3 Background
3.1 Preliminary Research
3.1.1 Domestic Impact
In the United States, a 2009 study titled “Variation in Drug Prices at Pharmacies: Are Prices Higher in Poorer Areas?” (Gellad et al. 2009) examined mean retail prices for four commonly prescribed medications across Florida. The researchers compared these prices to median ZIP code income data from the 2000 census. Their findings revealed that, on average, drug prices in the poorest ZIP codes were 9% higher than the statewide average, suggesting a troubling correlation between socioeconomic status and pharmaceutical costs.
3.1.2 Global Impact
Internationally, disparities in drug pricing persist. A 2021 report from the U.S. Government Accountability Office (U.S. Government Accountability Office 2021) found that prices for selected brand-name drugs in the United States were, on average, higher than those in Australia, Canada, and France. While some medications were priced only slightly above international benchmarks, others cost up to ten times more than their counterparts abroad, underscoring the extent of pricing variability and the relative inefficiency of the U.S. pharmaceutical market.
3.2 Factors Influencing Medication Affordability
According to the National Academies of Sciences, Engineering, and Medicine (National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division; Board on Health Care Services; Committee on Ensuring Patient Access to Affordable Drug Therapies 2017), many factors affect medication affordability. The factors most relevant to our investigation are “the interaction of market power, health insurance, and the lack of effective incentives for controlling product price”, “unequal bargaining power between buyers and sellers”, “insurance benefit designs with significant patient cost-sharing provisions”, and “inadequate performance of patient assistance programs and other public programs intended to make medicines more affordable for patients.”
For consumers, few levers exist to lower costs. One option is choosing where to purchase medications, though this assumes access to multiple pharmacies and the ability to travel. The other is selecting a health insurance plan, but the availability of plan options varies widely by region. In many cases, geographic location, something individuals often cannot control, significantly limits affordable choices.
3.3 Data Landscape
Medication prices vary depending on one’s insurance plan, the pharmacy, the dosage of medication, generic or name-brand, and more. When looking for medication prices for our analysis, we initially explored several APIs from online pharmacies, including GoodRx (GoodRx 2025) and CostPlusDrugs (Cost Plus Drugs 2025), which aim to improve consumer price transparency. However, these sources could not be reliably linked to the county-level demographic data essential to our analysis.
Consequently, we turned to the Centers for Medicare & Medicaid Services (CMS), which offers several of the most comprehensive public datasets on healthcare coverage and drug pricing. CMS maintains over 100 datasets spanning prescription costs, plan structures, and beneficiary demographics (Centers for Medicare & Medicaid Services 2025). One Medicare-based dataset, in particular, allowed for medication pricing to be tied to geographic location, which could then be paired with demographic data. While the dataset is limited to Medicare beneficiaries, it still serves as a strong proxy for broader drug pricing trends. Medicare Part D covers over 50 million people and negotiates prices through private insurers, making it a major market influencer. Its pricing often anchors costs across the healthcare system, suggesting that other buyers likely pay similar or higher prices.
For demographic data, the natural choice was the data collected every 10 years by the U.S. Census (U.S. Census Bureau 2025).
3.4 Medicare Overview
Medicare eligibility typically begins at age 65, though individuals with disabilities, end-stage renal disease (ESRD), or ALS may qualify earlier (U.S. Department of Health & Human Services 2022). Enrollment must occur within three months before or after one’s 65th birthday to avoid late penalties.
Medicare consists of four main parts (Medicare.gov 2025b):
- Part A – Hospital insurance (inpatient care, skilled nursing, hospice, home health)
- Part B – Medical insurance (doctor visits, outpatient care, equipment, preventive services)
- Part D – Prescription drug coverage
- Medigap – Supplemental insurance (available only with Original Medicare)
Coverage can be obtained through:
- Original Medicare – Includes Parts A and B; Part D and Medigap must be purchased separately.
- Medicare Advantage (Part C) – Bundles Parts A, B, and often D, offered by private insurers.
3.5 Medicare Part D (Drug Coverage) Costs
Medicare Part D costs vary by plan and may include a monthly premium, an annual deductible (up to $590 in 2025), copayments (fixed amounts), and coinsurance (percentage-based costs) (Medicare.gov 2025a). Prices for the same medication and dosage can differ across pharmacies and even within the same pharmacy depending on the selected plan. Not all plans cover all medications, adding another layer of complexity. Given that most 65-year-olds take multiple prescriptions, choosing the most cost-effective plan and pharmacy can be overwhelming, making Medicare enrollment confusing.
4 Data
4.1 Data Collection
4.1.1 Data Sources
We selected the CMS Quarterly Prescription Drug Plan Formulary as our primary data source, utilizing pricing, formulary, plan information, and geographic locator files to map drug prices and plan availability at the county level. The formulary offered a rare combination of granularity, geographic coverage, and accessibility, making it a practical foundation for our analysis.
Demographic data selection was straightforward. U.S. Census county-level data was collected using the tidycensus package (Walker and Herman 2025) in R. Key variables included age, education, gender, race, total population, median income, and poverty rates.
To interpret drug references in the CMS datasets, which list medications by National Drug Code (NDC) only, we added the FDA’s NDC Directory (U.S. Food and Drug Administration 2025) to match codes with drug names.
In addition to the formulary, we integrated three national-level CMS datasets (Medicare Part B, Part D, and Medicaid, 2019–2023) to contextualize drug spending trends. Although lacking county-level detail, these datasets highlighted medications with the highest national financial impact.
4.1.2 Data Cleaning and Technical Challenges
Although the previous section may suggest that combining and storing the data was straightforward, data cleaning and processing proved to be a significant challenge, as is typical of data science projects.
The CMS Quarterly Prescription Drug Plan Formulary datasets were significantly larger than those our team had previously worked with. Each quarterly file from 2019 Q1 through 2025 Q1 contained approximately 10–12 GB of compressed data. To manage scope, we narrowed our focus to the 2023 Q4 dataset. Even within this single quarter, the formulary included pricing data for approximately 6,336 unique medications (identified by NDC codes), across multiple dosage strengths and 30-, 60-, and 90-day supply options. These were distributed across 5,644 unique contract/plan combinations. Some Medicare plan options were listed at the county level, while others were available only at the broader regional level, either as Medicare Advantage (MA, also known as Part C) or Prescription Drug Plans (PDP). To align these regional plans with our census-based demographic data, we disaggregated them by county. Medication prices varied not only by NDC code, but also by supply duration, dosage strength (e.g., 10 mg, 30 mg, 50 mg daily), delivery mechanism (e.g., pill, patch, injection, liquid), and Medicare plan.
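The disaggregation of regional plans to counties described above can be sketched as a simple expansion step. This is a minimal illustration in Python (the project itself used R and DuckDB), and every plan ID, region name, county code, and field name below is hypothetical rather than taken from the CMS files.

```python
from collections import defaultdict

# Hypothetical lookup: which counties (by FIPS code) belong to each plan region.
region_to_counties = {
    "REGION_A": ["01001", "01003"],
    "REGION_B": ["02013"],
}

# Hypothetical plan records: some are offered per county, others per region.
plans = [
    {"plan_id": "H0001-001", "scope": "county", "area": "01001"},
    {"plan_id": "S0002-010", "scope": "region", "area": "REGION_A"},
    {"plan_id": "S0003-020", "scope": "region", "area": "REGION_B"},
]


def disaggregate(plans, region_to_counties):
    """Return a mapping of county FIPS code -> sorted list of available plan IDs."""
    by_county = defaultdict(set)
    for plan in plans:
        if plan["scope"] == "county":
            counties = [plan["area"]]
        else:
            # A regional plan is treated as available in every county of its region.
            counties = region_to_counties[plan["area"]]
        for fips in counties:
            by_county[fips].add(plan["plan_id"])
    return {fips: sorted(ids) for fips, ids in by_county.items()}


plans_by_county = disaggregate(plans, region_to_counties)
print(plans_by_county)
```

After this step every plan is keyed by county, so it can be joined directly to county-level census variables.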
A key limitation of the CMS formulary data was the absence of drug names, listing only 11-digit National Drug Code (NDC) identifiers. Meanwhile, the overall spending datasets included drug names but lacked NDCs, preventing direct linkage. Since NDCs are not easily interpretable, converting them to readable names was essential for analysis.
To bridge this gap, we used the FDA’s National Drug Code Directory (U.S. Food and Drug Administration 2025), which provides detailed drug identification. However, formatting inconsistencies, such as missing leading zeros and compressed segments, required standardizing both datasets to a 9-digit format by removing package-level detail and adjusting segment lengths.
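One plausible way to perform this standardization is sketched below in Python. It assumes the FDA directory supplies hyphenated labeler-product codes and that the CMS files supply 11-digit codes that may have lost leading zeros; the exact rules the team applied are not given in the text, so the function names and sample codes here are illustrative only.

```python
def fda_to_ndc9(product_ndc: str) -> str:
    """Convert a hyphenated labeler-product code (4-4, 5-3, or 5-4) to 9 digits.

    The labeler segment is zero-padded to 5 digits and the product segment
    to 4 digits; any package segment after a second hyphen is ignored.
    """
    labeler, product = product_ndc.split("-")[:2]
    return labeler.zfill(5) + product.zfill(4)


def cms_to_ndc9(ndc11) -> str:
    """Restore leading zeros on an 11-digit NDC, then drop the 2-digit package segment."""
    digits = str(ndc11).strip().zfill(11)
    return digits[:9]


# Illustrative (made-up) codes: both sources reduce to the same 9-digit key.
print(fda_to_ndc9("0003-0894"))   # labeler 0003 -> 00003, product 0894
print(cms_to_ndc9(3089421))       # numeric value that lost its leading zeros
```

With both sources reduced to the same 9-digit key, the FDA names can be joined to the CMS pricing rows on that key.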
After extensive cleaning, we successfully matched FDA drug names to CMS formulary entries. Fortunately, naming conventions in the spending data aligned closely with FDA standards (95% match), enabling us to relate national spending figures to county-level prices and plan availability. This integration created a robust, interconnected database across multiple large-scale datasets.
During the geographic analysis, we identified a unique complication related to Connecticut. In 2022, all Connecticut counties were renamed and had their boundaries redrawn (Federal Register: The Daily Journal of the United States Government 2022). While our primary focus was on Q4 2023 data, a discrepancy emerged: CMS data continued to use the pre-2022 county names and boundaries, whereas the U.S. Census Bureau adopted the updated designations. Because the county shapes also changed, we applied engineering judgment to map the old county definitions to the new ones. As a result, any county-level analysis for Connecticut should be interpreted as an approximation. Data for all other U.S. counties remained accurate once county-name spellings were matched.
4.2 Data Engineering
The project faced substantial challenges in gathering and assembling the data, chief among them the scale and accessibility of the files. The volume of data, which exceeded 100 GB compressed (nearly 300 GB uncompressed), far surpassed the file size limitations imposed by GitHub. While GitHub does offer a Large File Storage (LFS) extension, it was not sufficient for the scope of this project. Moreover, the team needed a collaborative environment where all members could access, modify, and validate shared datasets without working in silos, which could risk data inconsistency.
To address this, and upon the recommendation of Professor Jed Rembold, we explored several cloud storage options. After evaluating the alternatives, Amazon Web Services (AWS) S3 (Simple Storage Service) was selected for its flexibility, scalability, and broad support within the data science community. Using an S3 bucket, we created a centralized, cloud-based storage system that allowed team members to upload, retrieve, and organize large datasets seamlessly.
Implementing this solution was not trivial. Neither team member had prior experience with AWS, and the learning curve was significant. Support from large language models (LLMs) such as the AWS integrated LLM assistant, Microsoft Copilot, GitHub Copilot and ChatGPT proved instrumental in navigating technical setup, access permissions, and authentication. The result was a robust, scalable storage infrastructure that underpinned our data pipeline and enabled consistent access and collaboration throughout the project.
Setting up the AWS S3 bucket was only the first step in implementing a robust data infrastructure. Professor Rembold also recommended using DuckDB, an in-process SQL OLAP (Online Analytical Processing) database designed for high-performance analytical queries on large datasets. Unlike traditional database systems, DuckDB operates without a dedicated server, making it ideal for decentralized, collaborative data science workflows.
A third critical recommendation was to convert all raw data, originally provided in CSV and text formats by CMS, the FDA, and the U.S. Census Bureau, into Parquet. Parquet is a columnar storage format optimized for analytical workloads. Parquet offers efficient compression and schema evolution, and it integrates seamlessly with DuckDB. By adopting Parquet, we reduced storage costs and improved query speed, allowing us to manage and process the dataset more effectively within the AWS S3 environment.
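As a rough illustration of the conversion step, the Python sketch below only builds the text of a DuckDB `COPY` statement that rewrites a CSV file as Parquet; it does not run it. The file name and bucket are hypothetical, and executing the statement against S3 would additionally require a DuckDB connection with the `httpfs` extension and valid AWS credentials, which are assumed rather than shown.

```python
def csv_to_parquet_sql(csv_path: str, parquet_path: str) -> str:
    """Build a DuckDB COPY statement that rewrites a CSV file as Parquet.

    read_csv_auto lets DuckDB infer the delimiter and column types.
    Paths are interpolated as-is, so they must not contain single quotes.
    """
    return (
        f"COPY (SELECT * FROM read_csv_auto('{csv_path}')) "
        f"TO '{parquet_path}' (FORMAT PARQUET);"
    )


statement = csv_to_parquet_sql(
    "pricing_2023q4.csv",                          # hypothetical local file
    "s3://example-bucket/pricing_2023q4.parquet",  # hypothetical bucket
)
print(statement)
```

The resulting string would be passed to whichever DuckDB client is in use (the team worked from R).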
Together, AWS S3, DuckDB, and Parquet formed a cohesive, scalable solution. The team was able to interface with this infrastructure using R, enabling the creation of reusable scripts that could be stored and version-controlled in GitHub. This structure supported asynchronous collaboration, ensured data consistency, and also allowed us to explore advanced features of RStudio, particularly its GitHub integration.
Over 100 Parquet files were ultimately stored in the AWS S3 bucket, many representing variations of plan files, geographic locator files, drug formulary files, and pricing data from the CMS Quarterly Prescription Drug Plan Formulary dataset. While not all files were used in the final analysis, early-stage exploration required access to every quarter from 2019 to 2023. In contrast, the U.S. Census, FDA NDC, and CMS spending datasets were far smaller and could have been managed locally. However, the CMS formulary data’s size and complexity made scalable storage essential. Though it introduced some engineering challenges, using S3 was ultimately the right choice. It enabled the creation of a structured, relational database across a large, disparate dataset.
We implemented an ELT (Extract, Load, Transform) pipeline, extracting data from source systems and loading it into an AWS S3 bucket as a centralized warehouse. Transformations and analysis were performed using R scripts and DuckDB’s SQL interface on Parquet files. Version control via GitHub enabled collaborative, reproducible workflows and scalable, iterative analysis.
Standardizing NDC formats across CMS and FDA datasets allowed consistent linkage of drug identifiers, enabling reliable mapping between county-level pricing, national spending data, and drug names. By the end of the data engineering phase, we had built a fully connected architecture that linked U.S. Census demographics to Medicare Part D pricing, mapped those prices to NDC codes, and connected them to national drug spending. This relational framework supported robust geographic and demographic analysis of medication costs across the U.S.
5 Analysis
5.1 Scope Reduction Leveraging National Spending Data
One of the first goals was to understand the connected data at a high level by condensing and aggregating details. Due to the granularity of the formulary data, it was necessary to narrow the focus rather than analyze all 6,000+ medications. The CMS spending datasets provided a clearer, structured view and helped identify high-impact drugs. Using this data, we generated top drug lists by total spend, claims, and doses (Figure 1), ultimately selecting a few key medications for deeper analysis. Initially, we reviewed the top 10 and top 25 by total spending.
We were not able to match all of the top drugs to our dataset, so we used only those with full matches: 14 of the top 25 medications. Table 1 below lists the conditions these medications treat.
Total Spend (USD) | Generic Name | Brand Name(s) | Common Treatment Conditions |
---|---|---|---|
$67.3B | Apixaban | Eliquis | AFib (Atrial Fibrillation), DVT/PE (Deep Vein Thrombosis / Pulmonary Embolism), Post-surgery Prophylaxis |
$43.1B | Adalimumab | Humira, Biosimilars | chronic inflammatory or autoimmune conditions such as RA (Rheumatoid Arthritis), PsA (Psoriatic Arthritis), AS (Ankylosing Spondylitis), CD (Crohn’s Disease), UC (Ulcerative Colitis), Psoriasis, JIA (Juvenile Idiopathic Arthritis), uveitis |
$30.9B | Dulaglutide | Trulicity | Type 2 Diabetes |
$29.9B | Lenalidomide | Revlimid | Multiple Myeloma, MDS (Myelodysplastic Syndromes) |
$28.4B | Rivaroxaban | Xarelto | AFib (Atrial Fibrillation), DVT/PE (Deep Vein Thrombosis / Pulmonary Embolism) |
$26.9B | Empagliflozin | Jardiance | Type 2 Diabetes, CV risk |
$26.7B | Semaglutide | Ozempic, Wegovy | Type 2 Diabetes, Obesity |
$18.4B | Paliperidone Palmitate | Invega Sustenna/Trinza | Schizophrenia |
$16.7B | Etanercept | Enbrel | RA (Rheumatoid Arthritis), PsA (Psoriatic Arthritis), AS (Ankylosing Spondylitis), Psoriasis |
$15.7B | Insulin Aspart | Novolog, Fiasp | Type 1 & 2 Diabetes |
$15.2B | Ustekinumab | Stelara | Psoriasis, PsA (Psoriatic Arthritis), CD (Crohn’s Disease) |
$14.5B | Ibrutinib | Imbruvica | blood cancers such as CLL (Chronic Lymphocytic Leukemia), MCL (Mantle Cell Lymphoma), and WM (Waldenström’s Macroglobulinemia) |
$12.1B | Tiotropium Bromide | Spiriva | COPD (Chronic Obstructive Pulmonary Disease), Asthma (adjunct) |
$11.6B | Liraglutide | Victoza, Saxenda | Type 2 Diabetes, Obesity |
5.2 Exploratory Data Analysis
5.2.1 Price Variance Across Counties
After mapping individual medication prices to U.S. counties by which Medicare plans were available, we found far more variability in pricing than expected. Some drugs had over 6,000 unique prices nationwide, with many others exceeding 4,000. A box plot of the top-spending medications is shown below, ordered from highest to lowest total national expenditure (Figure 2). This visualization highlights the wide variance in drug pricing. Notably, the absolute cost of a medication does not necessarily align with its total national spending. For example, Apixaban, the drug with the highest overall spend, costs only about $10 for a 30-day supply, while another top-spending drug is priced at approximately $26,000 for the same duration.
5.2.2 Quantity of Available Medicare Plans
One contributing factor to the wide range of drug prices is the number of Medicare plans available within a given county. Not all prices shown in the previous figure are accessible to every Medicare patient, as plan availability varies by location. To further illustrate this variability at the county level, the bar chart below displays the number of Medicare plans available in each county within a sample state (Figure 3).
5.2.3 Data Summary Simplifications
Given the number of plans within each county (and the variation in drug prices by plan, pharmacy, dosage, and delivery mechanism), we chose to simplify the data by calculating the median 30-day supply price for each drug (based on the 9-digit NDC code) within each county. While 60-day and 90-day supply prices were also available, they were excluded to streamline the analysis. For much of the subsequent work, this median county-level price per drug will serve as our primary target variable.
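The aggregation described above amounts to a grouped median. The following Python sketch shows the idea on a handful of made-up rows; the tuple layout and all values are assumptions for illustration, not the actual CMS schema.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (county FIPS, 9-digit NDC, days of supply, price in USD).
rows = [
    ("01001", "000030894", 30, 10.0),
    ("01001", "000030894", 30, 14.0),
    ("01001", "000030894", 30, 12.0),
    ("01001", "000030894", 90, 30.0),  # excluded: not a 30-day supply
    ("01003", "000030894", 30, 11.0),
    ("01003", "000030894", 30, 15.0),
]


def median_30_day_price(rows):
    """Return {(county, ndc9): median price} using 30-day supply rows only."""
    grouped = defaultdict(list)
    for fips, ndc9, days, price in rows:
        if days == 30:
            grouped[(fips, ndc9)].append(price)
    return {key: median(prices) for key, prices in grouped.items()}


medians = median_30_day_price(rows)
print(medians)
```

Using the median rather than the mean keeps a single unusually expensive plan from dominating a county's summary price.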
5.2.4 Geographic Medication Price Patterns
Our ultimate goal is to use demographic data (tied to geographic location) to predict drug prices in order to figure out which demographic factors have the largest impact on medication prices. Before jumping into the demographic data, we looked at the geographic data.
An early concern in the development of this project was whether prices would vary by location at all. A common misconception is that Medicare drug prices are the same across the U.S., given that Medicare is a national public insurance program. The results were surprising: drug prices vary significantly by region, sometimes at the state level and sometimes at the county level. A scan across the drugs shows no consistent geographic pattern for many of them (see Figure 4). The maps show the median price for a 30-day supply of each drug by county. The color scale is different for each drug, with the highest value for each drug shown in yellow and the lowest in purple.
Looking more closely at two drugs, Semaglutide stands out at the state level, as shown in Figure 5. New York has the lowest median 30-day supply price of any state in the country.
Liraglutide also stands out: it shows a clear state-wide price difference in states such as New Mexico and Minnesota, but a county-level price difference in counties scattered across other states, including Nebraska, Oregon, Washington, Tennessee, Missouri, and Virginia (Figure 6).
Our team was surprised by what we observed once the median prices were plotted on maps. There are patterns, but they vary geographically and differ from drug to drug. To see large maps for all of the top drugs, go to this Shiny app.
5.2.5 Correlations
The age, education, racial, and gender data from the Census are provided as absolute counts. We normalized them by dividing each metric by the total population to convert them to percentages. We then explored which demographic metrics correlate most closely with the prices for each medication (Figure 7). If all the medications are correlated with similar demographic metrics, all drugs can be modeled together. If not, the individual drugs will need to be modeled separately. We filtered out weak correlations, keeping only those with Pearson correlation coefficients less than -0.1 or greater than 0.1. There is no clear pattern among the top drugs in their correlations with the demographic metrics. Thus, we concluded that each medication needs to be analyzed independently.
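The normalization and correlation filter can be sketched as follows. This Python example uses four invented counties and two invented demographic columns; only the procedure (divide counts by population, compute Pearson's r against the median price, keep |r| > 0.1) follows the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical county records with absolute counts and a median drug price.
counties = [
    {"population": 1000, "age_65_plus": 200, "female": 500, "median_price": 100.0},
    {"population": 2000, "age_65_plus": 500, "female": 800, "median_price": 110.0},
    {"population": 4000, "age_65_plus": 1200, "female": 1600, "median_price": 120.0},
    {"population": 500, "age_65_plus": 175, "female": 250, "median_price": 130.0},
]

prices = [c["median_price"] for c in counties]
kept = {}
for metric in ("age_65_plus", "female"):
    # Convert the absolute count to a share of the county's total population.
    shares = [c[metric] / c["population"] for c in counties]
    r = pearson(shares, prices)
    if abs(r) > 0.1:  # drop weak correlations, as described in the text
        kept[metric] = r

print(kept)
```

In this toy data the age share rises in step with price and is kept, while the female share is uncorrelated with price and is filtered out.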
We also introduced a feature called “Affordability”, defined as price divided by income. This metric also showed minimal correlation with the other demographic variables: even the strongest positive and negative correlations had Pearson coefficients with magnitudes below 0.04, indicating negligible relationships. It therefore yielded no meaningful insights for our analysis.
6 Results
Our primary hypothesis was that certain demographic groups may experience greater hardship due to rising medication prices, particularly in counties where these demographics are concentrated. We asked whether it’s possible to use county-level demographic data, along with median medication prices, to build a predictive model for drug costs. If successful, such a model could offer short-term guidance, helping individuals with specific medical needs identify counties with more favorable pricing, and long-term potential to inform policy and industry efforts aimed at addressing pricing inequities.
6.1 Statistical Testing - Phase 1
To assess how varying prices for the selected drugs affect different demographic groups, our team constructed a series of hypotheses. The first concerned race. Most U.S. counties have populations that skew heavily toward individuals who identify as white: across all U.S. counties, the white share of the population is 95% at the median and 93% at the mean. For our first test, we compared counties that were at least 95% white with those below that threshold.
Before proceeding, we had to address the assumptions that determine which tests can be applied. We were not initially confident in assuming that our data were normally distributed. The spread of the median and mean price values across counties and within various medication groups did visually appear normal. However, the Shapiro-Wilk test returned p-values well below 0.05 in every case, indicating a departure from normality. We also explored the Anderson-Darling test as an alternative; the results were similar, with high AD statistic values again suggesting a lack of normality. Accordingly, our team chose a non-parametric test to evaluate our hypotheses.
The Wilcoxon rank-sum (Mann-Whitney U) test was selected to evaluate two independent groups, specified by the hypothesis threshold. We considered the observations to be independent of each other since the census is completed at the county level, with no counties overlapping.
Ultimately, the Mann-Whitney U Test is less powerful than parametric tests when assumptions are met. It also lacks confidence intervals and is only comparing the ranks and not the actual values. The benefit of taking this approach for our data is that fewer overall assumptions are necessary. Since our team was condensing a number of variables, this approach was preferred to starting with an excessive number of assumptions.
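To make the mechanics of the rank-sum test concrete, here is a minimal pure-Python sketch. It ranks the pooled observations (averaging ranks for ties), computes the U statistic for the first group, and derives a two-sided p-value from the large-sample normal approximation without tie or continuity corrections. That simplification is an assumption of this sketch; a statistics package such as R's `wilcox.test` would normally be used, and the sample values are invented.

```python
from math import erf, sqrt


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p-value via normal approximation)."""
    pooled = sorted((value, group) for group, sample in ((0, a), (1, b)) for value in sample)
    n = len(pooled)
    rank_sum_a = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if pooled[k][1] == 0:
                rank_sum_a += avg_rank
        i = j
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (u - mean_u) / sd_u
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return u1, p


# Hypothetical county prices for the two groups defined by a threshold.
above_threshold = [1, 2, 3, 4, 5]
below_threshold = [6, 7, 8, 9, 10]
u_stat, p_value = mann_whitney_u(above_threshold, below_threshold)
print(u_stat, p_value)
```

Because only ranks enter the calculation, the result says whether one group tends to have higher prices than the other, not by how many dollars.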
6.1.1 Hypothesis 1
Null Hypothesis: The distribution of medication costs in counties with >=95% white population is not statistically different from the distribution in counties with <95% white population.
Alternative Hypothesis: The distribution of medication costs in counties with >=95% white population is statistically different from that in counties with <95% white population.
Assumptions:
- Independent groups
- Ordinal or continuous data
- Same shape of distribution (when comparing medians)
- Random sampling
- Independence within each group
Accordingly, our team elected to compare mean prices, which notably exposes our analysis to outlier pricing. The assumption of independence within each group is also stretched, given the manner in which the data were assembled, as discussed above.
Drug Name | p-value |
---|---|
Adalimumab | 3.37e-21 |
Apixaban | 5.46e-21 |
Insulin Aspart | 6.31e-18 |
Semaglutide | 2.53e-13 |
Paliperidone Palmitate | 6.19e-13 |
Rivaroxaban | 4.67e-12 |
Liraglutide | 1.44e-10 |
Empagliflozin | 4.23e-04 |
Tiotropium Bromide | 1.16e-02 |
Dulaglutide | 1.30e-02 |
Lenalidomide | 6.55e-02 |
Etanercept | 7.12e-02 |
Ustekinumab | 7.29e-02 |
Ibrutinib | 4.93e-01 |
We can reasonably conclude that for 10 of our 14 focus medications there is a statistically significant difference in mean prices between the two groups, and we reject the null hypothesis for these medications (see Table 2 above). To dive deeper, our team constructed density plots to examine the direction of this difference. Figure 8 shows two of these medications, Semaglutide and Dulaglutide: the direction of the difference shifts at different price ranks, and the shift depends on the medication.
Drilling down again, with these two medications in focus, actual values are compared below in Table 3.
Drug Name | >=95% White | Mean of Mean Price | SD Mean Price |
---|---|---|---|
Dulaglutide | FALSE | 444.00 | 3.74 |
Dulaglutide | TRUE | 444.00 | 3.71 |
Semaglutide | FALSE | 168.00 | 2.58 |
Semaglutide | TRUE | 167.00 | 1.84 |
The findings are interesting but inconclusive. In the plot, we can clearly see that certain groups have slight advantages at certain prices. However, the mean of the mean prices and the standard deviations show practically no difference for Dulaglutide and only a small difference for Semaglutide. This does not provide much clarity regarding any particular advantage afforded to counties where at least 95% of the population identifies as white. The findings are hardly definitive and lack confidence intervals, which demonstrates the limitations of the Mann-Whitney U test.
6.1.2 Hypothesis 2
Using the same assumptions, we next set aside race entirely and considered the percentage of the population living below the poverty level. Across all counties in the census data, this share is 13% at the median and 14% at the mean. We established a new hypothesis.
Null Hypothesis: The distribution of medication costs in counties with >=13% below poverty population is not statistically different from the distribution in counties with <13% below poverty population.
Alternative Hypothesis: The distribution of medication costs in counties with >=13% below poverty population is statistically different from the distribution in counties with <13% below poverty population.
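The rank-based comparison behind these hypotheses can be sketched in a few lines. The block below is a minimal, standard-library illustration of a two-sided Mann-Whitney U test using the normal approximation, with no tie or continuity correction. The function names and the county prices are hypothetical and are not the study's code or data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # assumes no tie correction
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value

# Hypothetical mean prices (USD) for two groups of counties.
high_poverty = [166.8, 167.1, 167.4, 166.5, 167.0, 166.9]
low_poverty = [168.2, 167.9, 168.4, 167.6, 168.0, 168.3]
u_stat, p_value = mann_whitney_u(high_poverty, low_poverty)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

With thousands of counties per group, a library routine that applies tie and continuity corrections would be preferable to this approximation.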
Drug Name | p-value |
---|---|
Semaglutide | 1.43e-20 |
Dulaglutide | 2.49e-16 |
Ibrutinib | 4.41e-13 |
Insulin Aspart | 2.34e-10 |
Adalimumab | 3.46e-10 |
Etanercept | 2.40e-08 |
Ustekinumab | 8.88e-08 |
Apixaban | 1.90e-06 |
Empagliflozin | 6.70e-06 |
Liraglutide | 1.12e-03 |
Lenalidomide | 1.43e-03 |
Paliperidone Palmitate | 2.70e-02 |
Tiotropium Bromide | 8.55e-02 |
Rivaroxaban | 1.54e-01 |
Again, we find statistically significant outcomes for many of the selected medications. Twelve of the fourteen medications yield p-values below 0.05, suggesting that the null hypothesis be rejected. (See Table 4 above.)
In Figure 9 we look more closely at six medications to expand the investigation into these differences and the direction of any potential advantage.
As with hypothesis 1, the price differences depend on the medication and the price rank position. While the null hypothesis is rejected, the challenge of analyzing multiple medications is clearly demonstrated. For both hypotheses, rejecting the null across many separate tests carries the risk of a Type I error, or false positive. There are clearly price differences relative to certain demographic groups. Whether those differences are driven by the demographics remains unclear.
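One common way to limit false positives when many medications are tested at once is a step-down correction such as Holm's method. The sketch below applies it to the p-values reported in Table 4 purely as an illustration. This correction was not part of the analysis described here, and the function name is hypothetical.

```python
def holm_rejections(p_values, alpha=0.05):
    """Return the names whose null hypothesis is rejected under Holm's step-down rule."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = []
    for rank, (name, p) in enumerate(ordered):
        # The p-value at 0-based position `rank` is compared with alpha / (m - rank).
        if p > alpha / (m - rank):
            break  # stop at the first failure; all larger p-values are retained
        rejected.append(name)
    return rejected

# p-values copied from Table 4.
table_4 = {
    "Semaglutide": 1.43e-20,
    "Dulaglutide": 2.49e-16,
    "Ibrutinib": 4.41e-13,
    "Insulin Aspart": 2.34e-10,
    "Adalimumab": 3.46e-10,
    "Etanercept": 2.40e-08,
    "Ustekinumab": 8.88e-08,
    "Apixaban": 1.90e-06,
    "Empagliflozin": 6.70e-06,
    "Liraglutide": 1.12e-03,
    "Lenalidomide": 1.43e-03,
    "Paliperidone Palmitate": 2.70e-02,
    "Tiotropium Bromide": 8.55e-02,
    "Rivaroxaban": 1.54e-01,
}

uncorrected = [name for name, p in table_4.items() if p < 0.05]
corrected = holm_rejections(table_4)
print(len(uncorrected), "rejections without correction")
print(len(corrected), "rejections with Holm's correction")
```

Under this illustrative correction, the smallest p-values remain significant, while borderline results such as Paliperidone Palmitate (p = 0.027) no longer clear the adjusted threshold.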
6.2 Statistical Testing - Phase 2
In order to further explore statistical significance, we assumed a normal distribution of the data. This assumption enabled the use of traditional parametric testing methods. Specifically, we conducted two-sample t-tests, mirroring the structure used in the first phase of statistical testing, where the dataset was divided based on demographic medians. Additionally, we grouped counties into multiple subsets using demographic quartiles rather than medians, allowing for more granular comparisons. This stratification enabled the application of the Tukey HSD test to assess pairwise differences across these groups.
Using a two-sample t-test allowed us to reconsider our second hypothesis. We again encountered low p-values. Taking Semaglutide as one example and using the same split at the median poverty level, the resulting p-value was 2.2e-16, a strong indicator of statistical significance that suggests rejecting the null hypothesis.
Building on this, we applied the Tukey HSD test to the quartile-based groups at a 95% confidence level. Again, we found p-values that supported rejecting the null hypothesis, indicating that the average mean prices across groups were not equal. Interestingly, the only comparison with an adjusted p-value at or above 5% was between two adjacent quartile groups (13-17% vs. 10-13% poverty, p = 0.238). Every other comparison fell below this threshold, although the adjacent 10-13% vs. 0-10% comparison did so only narrowly (p = 0.0498). (See Table 5 below.)
Comparison | Difference | Lower CI | Upper CI | Adjusted p-value |
---|---|---|---|---|
>17% Poverty - 0-10% Poverty | -1.1555147 | -1.4430153 | -0.8680141 | 0.0000000 |
>17% Poverty - 10-13% Poverty | -0.8564752 | -1.1503721 | -0.5625784 | 0.0000000 |
>17% Poverty - 13-17% Poverty | -0.6371890 | -0.9262280 | -0.3481500 | 0.0000001 |
13-17% Poverty - 0-10% Poverty | -0.5183257 | -0.8123963 | -0.2242551 | 0.0000361 |
10-13% Poverty - 0-10% Poverty | -0.2990395 | -0.5978861 | -0.0001928 | 0.0497806 |
13-17% Poverty - 10-13% Poverty | -0.2192862 | -0.5196131 | 0.0810407 | 0.2382320 |
We repeated this process across several medications and demographic variables, consistently observing statistical significance. However, when examining the actual differences in average prices between groups, the magnitude of these differences was often minimal. Continuing with Semaglutide as an example, counties in the lowest poverty quartile (0–10%) had an average mean price of $168, while those in the highest poverty quartile (>17%) had an average of $167, a mere $1 difference. Notably, counties with higher poverty levels were paying slightly less on average.
To better understand the implications of this finding, we considered affordability by dividing the mean price by the median income of each county. This revealed that although prices were lower in higher-poverty areas, the cost represented a larger proportion of income, highlighting a disparity in affordability. This relationship is illustrated in Table 6 and Figure 10 below.
Poverty Group | Mean Price | SD Price | Mean Affordability | SD Affordability | Sample Size (n) |
---|---|---|---|---|---|
0–10% Poverty | 168.00 | 2.35 | 0.00209 | 0.000409 | 788 |
10–13% Poverty | 168.00 | 2.60 | 0.00246 | 0.000332 | 725 |
13–17% Poverty | 167.00 | 2.36 | 0.00275 | 0.000351 | 772 |
>17% Poverty | 167.00 | 1.69 | 0.00340 | 0.000613 | 846 |
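The contrast between the price gap and the affordability gap can be made concrete with a small calculation. The sketch below uses the group-level values from Table 6 for the lowest and highest poverty quartiles. The `affordability` helper and the example income of 80,000 are hypothetical and only illustrate the ratio described above.

```python
def affordability(mean_price, median_income):
    """Share of a county's median income represented by the mean price."""
    return mean_price / median_income

# Group-level values copied from Table 6 (Semaglutide).
low_poverty = {"mean_price": 168.00, "mean_affordability": 0.00209}   # 0-10% poverty
high_poverty = {"mean_price": 167.00, "mean_affordability": 0.00340}  # >17% poverty

# Relative change in price when moving from the lowest to the highest poverty group.
price_change_pct = 100 * (high_poverty["mean_price"] - low_poverty["mean_price"]) / low_poverty["mean_price"]

# How much larger the income share is in the highest poverty group.
burden_ratio = high_poverty["mean_affordability"] / low_poverty["mean_affordability"]

print(f"Price change: {price_change_pct:.2f}%")
print(f"Income-share ratio: {burden_ratio:.2f}x")

# Hypothetical single county: a 168.00 mean price against an 80,000 median income.
print(f"Example affordability: {affordability(168.00, 80000.0):.5f}")
```

Read this way, a price difference of well under 1% sits alongside an income share that is roughly 1.6 times larger in the highest poverty group.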
We employed both t-tests and Tukey HSD tests to evaluate a range of medications across various demographic subsets, tailoring each analysis to specific hypotheses about potential demographic impacts on drug pricing. While many results showed statistical significance, further inspection revealed that the actual differences in mean prices, affordability metrics, and other summary statistics were often marginal.
This distinction between statistical and practical significance became a turning point. Statistical significance indicates only that a result is unlikely to be due to random chance; it does not guarantee that the difference is meaningful or actionable. In our case, the detected differences, though statistically valid, lacked the magnitude necessary to inform policy or guide consumer decisions.
Recognizing this limitation, we pivoted our focus toward structural market factors, particularly the relationship between population size and the number of available Medicare Part D plans in each county. This shift allowed us to explore whether plan availability aligns with population needs, a perspective that offers more tangible insights into affordability and access. By emphasizing market dynamics over demographic predictors, we uncovered patterns that more directly influence pricing and equity across regions.
6.3 Population vs Number of Available Medicare Plans
Given the lack of clear patterns in medication prices across the maps and within isolated demographic groups, we took a closer look at individual demographic measures, starting with population. Do counties with a larger population have more Medicare plan options, and does this competition lead to lower drug prices? Figure 11 below shows a scatter plot of the number of available Medicare plans against county population, with a linear fit as a trend line.
While there is a slight trend, the fit is weak overall. The model is statistically significant (p-value < 0.05) and shows a moderate positive relationship between the number of Medicare plans and population (slope = 0.0195, or 1 more Medicare plan with a 1.95% increase in the population). However, about two-thirds of the variation is left unexplained (adjusted R-squared = 0.3444), so we do not want to over-interpret this model as being predictive. Given the limited correlation between the number of available plans and population, we do not think population on its own has much of an impact on medication prices.
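For readers who want to see how the slope and adjusted R-squared of such a fit are obtained, here is a minimal ordinary least squares sketch. It treats population as the predictor and plan count as the response, which is an assumption about the model's orientation. The data values and the `fit_line` helper are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns slope, intercept, R^2, adjusted R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 for one predictor: the penalty uses n - 2 residual degrees of freedom.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adj_r2

# Hypothetical counties: population (thousands) and number of available plans.
population = [12, 35, 60, 150, 420, 900, 48, 75]
plans = [22, 25, 24, 30, 33, 41, 27, 23]
slope, intercept, r2, adj_r2 = fit_line(population, plans)
print(f"slope={slope:.4f}, R^2={r2:.3f}, adjusted R^2={adj_r2:.3f}")

# Share of variation left unexplained at the adjusted R-squared reported above.
unexplained_share = 1 - 0.3444
print(f"unexplained share: {unexplained_share:.4f}")
```

The last line reproduces the "about two-thirds" figure: an adjusted R-squared of 0.3444 leaves 0.6556 of the variation unexplained.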
From here, we jumped into creating a model across all our demographics (age, education, gender, and racial breakdowns, as well as median income, and percent below the poverty line).
6.4 Machine Learning
6.4.1 Model Selection
In order to model the relationship between medication prices and demographic features, we employed a variety of machine learning algorithms, each with unique strengths and assumptions. Below is a brief overview of the algorithms explored:
Linear Model (LM): Linear regression is a fundamental technique used to model the relationship between a continuous outcome variable and one or more predictors, assuming a linear relationship. It is interpretable and computationally efficient but may underperform when the underlying patterns are nonlinear or complex.
Elastic Net (ENET): Elastic Net is a regularized regression method that combines both L1 (Lasso) and L2 (Ridge) penalties. It is particularly effective when dealing with multicollinearity or when performing variable selection in high-dimensional datasets. Elastic Net encourages sparsity while maintaining group stability among correlated features.
Gradient Boosting Machine (GBM): GBM is an ensemble technique that builds models sequentially, with each new model correcting the errors of the previous one. It is capable of capturing complex nonlinear relationships and is known for its high predictive performance, although it can be prone to overfitting if not properly tuned.
K-Nearest Neighbors (KNN): KNN is a non-parametric algorithm that classifies or predicts outcomes based on the average value (or majority class) of the k closest observations in the feature space. It is simple and intuitive but can struggle with high-dimensional data and is sensitive to the choice of distance metric and value of k.
Random Forest (RF): Random Forest is an ensemble method that constructs a large number of decision trees during training and outputs the average prediction (for regression) or majority vote (for classification). It offers good performance with low risk of overfitting, handles missing data well, and provides insights into feature importance.
Support Vector Machine (SVM): SVM is a powerful classifier and regressor that aims to find the optimal hyperplane that maximally separates classes or fits data with minimal error (via support vectors). It works well in high-dimensional spaces and with complex boundaries but can be computationally intensive and sensitive to kernel and parameter choices.
Extreme Gradient Boosting (XGBoost): XGBoost is an optimized and scalable implementation of gradient boosting. It includes advanced features such as regularization, parallel processing, and handling of missing values. XGBoost typically outperforms many other algorithms in structured data problems and is widely used in machine learning competitions and real-world applications.
6.4.2 Model Performance
Looping over every drug, we trained all of the models described above. A heatmap of the resulting model performance is provided below in Figure 12. As expected, the gradient boosting machine, random forest, and extreme gradient boosting models performed best across all drugs. However, while these models achieve acceptable R² values on the training data, their R² values on the test data are significantly lower. This suggests that the models are overfitting to the training data and that their hyperparameters need tuning to improve performance.
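The train-versus-test gap used here as a sign of overfitting can be reproduced on a toy problem. The sketch below deliberately swaps in a tiny k-nearest-neighbors regressor rather than the boosted models used in the study, and it runs on synthetic data. With k = 1 the model memorizes the training set, so its training R² is perfect while its test R² is not. All names and data are hypothetical.

```python
import random

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

# Synthetic data: a weak linear signal plus noise, split 150 train / 50 test.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0, 2) for x in xs]
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

results = {}
for k in (1, 25):
    train_pred = [knn_predict(train_x, train_y, x, k) for x in train_x]
    test_pred = [knn_predict(train_x, train_y, x, k) for x in test_x]
    results[k] = {
        "train": r_squared(train_y, train_pred),
        "test": r_squared(test_y, test_pred),
    }
    print(f"k={k}: train R^2={results[k]['train']:.3f}, test R^2={results[k]['test']:.3f}")
```

A larger k gives up some training fit in exchange for smoother predictions, which is the same trade-off that hyperparameter tuning targets in the boosted models.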
6.4.3 Tuning to Prevent Overfitting
To improve model performance and prevent overfitting, we tuned the hyperparameters of the extreme gradient boosting models. The tuning process involved adjusting parameters such as the number of rounds, maximum depth, and learning rate (eta), among others, to optimize performance on the validation set. Our most overfit model, XGB6, has a nearly perfect fit to the training data for all drugs, yet it also has the best fit to the test data. If the model had the key features that drive median price, we would expect the test R² values to be much closer to the training R² values, and there would be an optimum point where a worse fit to the training data is accompanied by a better fit to the test data. Instead, as the parameters are tuned to prevent overfitting, XGB7, XGB8, and XGB9 fit the training data progressively worse and also fit the test data worse. We therefore conclude that the features we have are not the best predictors of median drug price. The performance of the tuned XGB models is shown in Figure 13.
6.4.4 Feature Importance
The most important features for the initial XGB model across all drugs are shown in the heatmap below (Figure 14). The most obvious pattern is that num_contract_plans (the number of available Medicare plans) is the most important feature for 12 of the 14 top drugs. We previously examined the connection between total population and the number of available plans in a county and saw only a mild correlation. For the other two drugs, the top features were income per capita and the percent of the population with a master's degree.
6.4.5 Feature Direction
To determine whether more Medicare plan options in a county increase or decrease the drug price, we ran a simple linear regression, for each drug, of median price on the number of available Medicare plans in a county. The results are shown in Figure 15 below. The percent of price variance attributable to the number of available Medicare plans is taken from the R² value and shown as the size of the points. The direction of the relationship is shown on the X-axis as a percent of the median drug price: for every 10 Medicare plans added to a county, the drug price increases or decreases by the percent shown. Relationships with a p-value < 0.05 are treated as significant and are color coded in the plot. For the most affected drug, the number of plans accounts for up to ~5% of the variance in price, which is not a very large impact. In addition, some drugs increase in price and others decrease with the addition of 10 more Medicare plan options in a county. In general, the number of Medicare plans and the demographics are not the only factors that drive drug price, but the number of plans is the most important factor we found in our feature set.
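The two quantities plotted per drug, the percent price change per 10 additional plans and the share of variance explained, can be derived from a one-predictor fit as sketched below. This is an illustration under stated assumptions: the `plan_effect` helper and the sample values are hypothetical, and the significance test is omitted.

```python
from statistics import median

def plan_effect(plans, prices, step=10):
    """Return (percent change in median price per `step` plans, R^2) for one drug."""
    n = len(plans)
    mean_x = sum(plans) / n
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in plans)
    syy = sum((y - mean_y) ** 2 for y in prices)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(plans, prices))
    slope = sxy / sxx                    # price change per additional plan
    r2 = (sxy * sxy) / (sxx * syy)       # squared correlation equals R^2 with one predictor
    pct_per_step = 100 * step * slope / median(prices)
    return pct_per_step, r2

# Hypothetical counties for a single drug: plan counts and median prices (USD).
plans = [18, 22, 25, 27, 30, 34, 38, 41]
prices = [167.9, 168.4, 167.2, 168.0, 166.9, 167.5, 166.8, 167.1]
pct, r2 = plan_effect(plans, prices)
print(f"{pct:+.2f}% of median price per 10 plans, R^2 = {r2:.3f}")
```

A negative percentage means the price falls as plans are added, and a small R² means that plan count explains little of the price variance even when the slope is nonzero.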
6.4.6 Machine Learning Conclusions
The application of diverse machine learning algorithms provided valuable insights into the complex relationships between demographic factors and medication prices across U.S. counties. Despite the initial hypothesis that demographic variables might strongly predict drug pricing disparities, our results indicate that the quantity of Medicare plans available within a county is the most consistent and influential predictor across the majority of the top-spending medications. Other demographic features, such as education level, income, and racial composition, showed variable importance depending on the specific drug but did not emerge as universally strong predictors.
The superior performance of GBM, RF, and XGB models highlights the nonlinear and multifaceted nature of drug pricing patterns, which simpler linear models struggled to capture effectively. This complexity underscores the challenge of modeling medication prices purely based on demographics and points to the potential influence of additional factors not captured in our current dataset.
Overall, these findings suggest that while demographic characteristics contribute to medication price variation, market dynamics (reflected by Medicare plan availability) play a pivotal role. Future work may benefit from incorporating additional economic, regulatory, and supply-chain data to better explain and predict geographic disparities in medication costs. This knowledge could ultimately inform policies aimed at improving affordability and equity in drug pricing.
7 Conclusions
7.1 Insights and Recommendations
Our analysis reveals significant geographic disparities in Medicare Part D drug pricing across the United States. While our data does not conclusively demonstrate that these pricing variations disproportionately affect any single demographic group, the inconsistencies raise serious concerns about the effectiveness and equity of Medicare’s price negotiation mechanisms. The observed variation suggests that drug costs could be more uniformly regulated, pointing to inefficiencies in the current system.
For many Medicare beneficiaries, particularly older adults and individuals with limited financial resources, the challenge of navigating complex plan options to secure affordable coverage remains significant. Despite the availability of plan finder tools and licensed advisors, the process is often overwhelming for the very population the program is designed to support. This complexity undermines both the accessibility and fairness of Medicare Part D, highlighting a need for more user-friendly systems and clearer guidance to ensure equitable and affordable access to prescription drug coverage.
The Inflation Reduction Act of 2022 represents a pivotal step toward addressing the long-standing issue of unsustainable drug pricing in the United States. Regulatory actions by the U.S. Department of Health and Human Services (HHS) and the Centers for Medicare & Medicaid Services (CMS) (Centers for Medicare & Medicaid Services 2024b) have demonstrated promising potential to curb excessive price increases. Notably, in December 2024, CMS (Centers for Medicare & Medicaid Services 2024a) announced cost savings for 64 drugs whose prices had risen faster than inflation, an encouraging indicator of progress.
However, the full impact of these measures remains to be seen. Much of the projected savings will not be realized until 2026, and their effectiveness will depend on several factors, including pharmaceutical industry responses, future pricing trends, and the extent to which the affected medications are utilized by beneficiaries. Continued monitoring and evaluation will be essential to ensure these reforms translate into meaningful improvements in affordability and access.
To promote greater equity and transparency in Medicare Part D, federal and state policymakers could collaborate to ensure that the lowest negotiated drug prices are made broadly accessible, rather than restricted to specific plans or contracts. This approach presents a clear opportunity to advance price normalization and improve the visibility of treatment costs. By making pricing more consistent and transparent, policymakers can better serve Medicare beneficiaries and reinforce the integrity and fairness of the program.
Price normalization across Medicare Part D plans could yield far-reaching benefits that extend beyond the immediate Medicare population. By reducing regional and plan-based disparities, normalization would foster a more predictable and equitable pricing environment for all stakeholders, including healthcare providers, pharmacies, and insurers. It could also streamline administrative processes by minimizing the complexity of managing varied pricing structures, thereby enhancing the efficiency of healthcare delivery and reducing operational costs.
Fortunately, many private companies (like GoodRx and CostPlusDrugs) are actively working to promote price transparency and secure the best possible prices for commonly prescribed medications. Their efforts demonstrate that innovation in medication pricing structures is feasible. Policymakers must embrace these emerging modern approaches and support initiatives that promote consistent and transparent pricing across the healthcare system. This is not about enforcing rigid price controls, but rather about ensuring fairness, clarity, and accessibility in how drug prices are determined and communicated.
Normalized drug pricing across Medicare Part D plans could play a critical role in curbing overall healthcare spending by discouraging price inflation and fostering competitive pricing among pharmaceutical manufacturers. When prices are consistent and transparent, it becomes significantly easier for policymakers, researchers, and healthcare administrators to evaluate cost-effectiveness, identify inefficiencies, and implement targeted reforms. This clarity can lead to more strategic resource allocation, improved public health outcomes, and a more sustainable healthcare system.
The movement toward transparent and normalized drug pricing offers a promising path forward for consumers and the healthcare system alike. With private companies paving the way, policymakers have a unique chance to foster a more equitable pricing landscape, one built on clarity, consistency, and shared benefit.
7.2 Ethical Considerations
7.2.1 Equity and Fairness
Leveraging census data introduces inherent biases, as certain populations, such as undocumented individuals, transient communities, or those with limited access to digital infrastructure, may be underrepresented. Throughout our data collection, cleaning, and analysis processes, we have taken deliberate steps to avoid stigmatizing or overgeneralizing any group.
We recognize that systemic disparities in healthcare access, driven by socioeconomic status, geography, race, and other factors, can influence both the availability and affordability of medications. Additionally, using Medicare pricing as a proxy for broader drug costs may obscure trends that would be more visible with commercial or retail pricing data. This methodological choice, while practical, may limit the general applicability of our findings and should be interpreted with caution.
7.2.2 Purpose and Impact
The primary goal of this project is to contribute to a more equitable and transparent healthcare system in the United States. By linking CMS formulary data with demographic insights from the census, we aim to uncover patterns that can inform future research, policy development, and targeted interventions. We hope that this work serves as a foundation for ongoing efforts to address longstanding inequities in healthcare accessibility and affordability.
7.2.3 Privacy
All data used in this project was sourced from publicly available datasets that do not contain personally identifiable information. No attempt was made to re-identify individuals through data linkage or analysis. Our approach prioritizes responsible data handling and respects the privacy of all individuals that might be represented.
7.3 Future Research
One of the most significant challenges encountered in this project was the sheer size and complexity of the datasets. To make the analysis feasible, data had to be condensed and aggregated, which limited the granularity of insights. However, this also highlights the substantial potential for deeper and more nuanced analysis using these connected datasets in future work.
7.3.1 Longitudinal Analysis
The CMS formulary data that was collected is published quarterly, offering a valuable opportunity to evaluate trends over time. Future research could explore:
- Price fluctuations within specific contracts and plans.
- Temporal shifts in drug availability or coverage.
- Policy impacts on pricing and access over time.
This longitudinal approach could reveal how changes in healthcare policy, market dynamics, or demographic shifts influence drug pricing and accessibility.
7.3.2 Expanded Demographic Analysis
While this project incorporated basic demographic overlays, future work could benefit from:
- More granular demographic segmentation.
- Intersectional analysis to understand how overlapping identities affect access to medications and plan availability.
7.3.3 Geographic and Environmental Context
Incorporating spatial and environmental factors could enhance understanding of healthcare access disparities:
- Rural vs. urban classifications to assess geographic inequities in drug coverage and pricing.
- Healthcare infrastructure mapping, such as proximity to pharmacies or providers, to contextualize formulary access.
7.3.4 Sociopolitical Context
Exploring the influence of sociopolitical factors could add depth to the analysis:
- Political leanings of regions (e.g., voting patterns) to examine correlations with healthcare policy adoption or plan availability.
- State-level policy differences, such as Medicaid expansion or drug pricing regulations, to assess their impact on formulary design.