I was looking to do a project that will challenge my newly learned skills in R after taking the courses Building Data Visualization Tools and Reproducible Templates for Analysis and Dissemination. I wanted a topic that is interesting, relevant, timely and with some measure of importance. to create beautiful and interesting plots and presentation.
While browsing the internet, I came upon the Global Tuberculosis Report 2017 by the World Health Organization or WHO. The report impresssed upon me the need for sustained and concerted efforts to meet the goal of the WHO to end the TB epidemic by 2030. TB has been with us for a long time and the WHO is making an enormous effort to finally rid the world of this disease. However, there are many obstacles along the way.
The social and economic impacts are devastating, including poverty, stigma and discrimination. While the world has committed to ending the TB epidemic by 2030, actions and investments don’t match the political rhetoric. - Dr Tedros Adhanom Ghebreyesus, WHO Director-General
The measure for success for this goal is a 90% reduction in TB deaths and 80% reduction in TB incidence rate by 2030 compared to 2015. The report was outstandingly well written in a language that is easily understandable and organized in a logical and meaningful way. Plus it is accompanied by a website/repository where data in the report can easily be navigated and downloaded. Equally important, the subject of the data was something that I can relate to as a former healthcare provider and as a citizen of a developing country like the Philippines.
Although 2030 seems conveniently far in to the future, the WHO has set goals for the year 2020 and 2025 to ensure that countries are on trak to meet the 2030 end goal. The first milestones of the End TB Strategy set for 2020 are a 35% reduction in TB deaths and a 20% reduction in TB incidence, compared with levels in 2015; and that no TB patients and their households should face catastrophic costs as a result of TB disease.
For this exercise, I will be exploring the global TB data for the year 2016 and examine the progress that has been made since the year 2000. My goal is to bring attention to the problem of TB and the need for immediate, sustained, and timely response to meet the target Goals of the WHO and to create beautiful and interesting plots using R.
The Global Tuberculosis Report 2017 was concisely written and compels me to lift lines of text directly rather than paraphrase what was written. Text copied directly from the report will be set in block lines of text like the one below.
In 2016, there were an estimated 1.3 million TB deaths among* HIV-negative people and an additional 374 000 deaths among HIV-positive people. TB is the ninth leading cause of death worldwide and the leading cause from a single infectious agent, ranking above HIV/AIDS.
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(gridExtra)
TB.burden <- read.csv("TB_burden_countries_2018-05-20.csv", ### read in file and save
stringsAsFactors = FALSE) ### as an R object TB.burden
stat_2016 <- TB.burden %>% ### take The R object TB.burden
select(country, year, e_inc_num, e_mort_num) %>% ### select the variables we need
filter(year == 2016) %>% ### filter data year 2016 only
summarize(incidence = sum(e_inc_num)/365, ### summarize incidence and mortality into
deaths = sum(e_mort_num)/365) ### number of cases per day
### save as R object mean_inc_death
g1 <- stat_2016 %>% ### Take R object mean_inc_death
gather(key = "key", ### prepare data
value = "value") %>% ### for plotting with ggplot
ggplot(aes(x = key, ### Plot var key on the x axis
y = value, ### plot var value on the y axis
fill = key)) + ### plot color as var key
geom_bar(stat = "identity", ### plot data as a barplot
position = "dodge") +
labs(title = "Number of Deaths/New Cases per Day", ### add title, subtitle, caption
subtitle = "Year 2016",
x = "",
y = "",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") + ### use preset theme
theme(axis.text = element_text(colour = "white"), ### set axis labels as white
axis.title = element_text(colour = "white"), ### set title color as white
panel.grid.major.x = element_blank(), ### remove grid elements
title = element_text(size = 8)) + ### set font size to 8
guides(fill = FALSE) + ### remove legend
scale_fill_manual(values = c("coral2", "darkorchid")) ### set color manually
###TB.burden %>% select(country, year, e_inc_num, e_mort_num) %>% group_by(year) %>% summarize(incidence = sum(e_inc_num)/365, deaths = sum(e_mort_num)/365) %>%gather(key = "key", value = "value", -year) %>% ggplot(aes(x = year, y = value, fill = key)) + geom_bar(stat = "identity", position = "dodge") + labs(title = "Mean Number of Deaths and New Cases", subtitle = "per day", x = "", y = "", caption = "WHO | Tuberculosis Data") + theme_hc(bgcolor = "darkunica") + theme(axis.text = element_text(colour = "white"), axis.title = element_text(colour = "white"), legend.position = "right", panel.grid.major.x = element_line(color = "gray50")) + scale_fill_manual(values = c("dodgerblue", "tomato"))
eincnum_m <- names(TB.burden)[grep("^e_inc_num_m", ### find variables that begin
names(TB.burden))][c(1, 4, 7)] ### with e_inc_num_m
allcasesM <- TB.burden %>% ### take R object TB.burden
filter(year == "2016") %>% ### filter year 2016 only
select(year, eincnum_m) ### select var year and those
### beginning with e_inc_num_m
names(allcasesM) <- gsub("^e_inc_num_m", "age", ### change variable names
names(allcasesM))
sum_allcasesM <- allcasesM %>% ### summarize data according
group_by(year) %>% ### to age groups
summarize(age014 = sum(age014),
age15plus = sum(age15plus),
all = sum(age)) %>%
mutate(gender = "male") ### add variable gender
eincnum_f <- names(TB.burden)[grep("^e_inc_num_f",
names(TB.burden))][c(1, 4, 7)]
allcasesF <- TB.burden %>%
filter(year == "2016") %>%
select(year, eincnum_f)
names(allcasesF) <- gsub("^e_inc_num_f", "age",
names(allcasesF))
sum_allcasesF <- allcasesF %>%
group_by(year) %>%
summarize(age014 = sum(age014),
age15plus = sum(age15plus),
all = sum(age))
sum_allcasesF$gender <- "female"
sum_allcasesF$age014 <- sum_allcasesF$age014 * -1 ### make variables negative
sum_allcasesF$age15plus <- sum_allcasesF$age15plus * -1
sum_allcasesF$all <- sum_allcasesF$all * -1
allcase_df <- rbind(sum_allcasesM, sum_allcasesF) ### rowbind maledatframes
allcase_plotdf <- gather(allcase_df, ### prepare data for plotting
key = "key",
value = "value",
-year, -gender)
g2 <- ggplot(allcase_plotdf, ### take allcase_plotdf
aes(x = key, ### plot key on x axis
y = value, ### plot value on y axis
fill = gender)) + ### plot gende as color
geom_bar(stat = "identity") + ### create a barplot
coord_flip() + ### flip x and y axis
labs(title = "Estimated incidence of TB cases 2016", ### add title, subtitle
subtitle = "by age and sex", ### caption and y axis title
x = "",
y = "Absolute number of cases (Millions)",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") + ### use preset theme
theme(axis.text = element_text(colour = "white"), ### color axis title as white
axis.title = element_text(colour = "white"), ### COLOR title as white
legend.position = "right", ### put legend on the right
panel.grid.major.x = element_line(color = "gray50"), ### set color for grid lines
title = element_text(size = 8), ### set font size to 8
legend.key.size = unit(0.25, "cm"),
legend.title = element_blank()) +
scale_y_continuous(breaks = c(-2500000, ### set y axis tick labels
0,
2500000,
5000000),
labels = c(2.5,
0,
2.5,
5)) +
scale_fill_manual(values = c("maroon2", "mediumblue")) ### set color manually
grid.arrange(g1, g2, nrow = 1) ### arrange position of plot
In 2016, 4548 people died from TB each day. That’s about 3 persons per minute. Approximatley 28498 new cases of TB were recorded daily worldwide or 20 persons per minute.
An estimated 10.4 million people fell ill with TB in 2016: 90% were adults, 65% were male, 10% were people living with HIV (74% in Africa)
I downloaded the data dictionary which contains a definition of the different variables in the diffrent datasets available for download from the WHO repository.
Next I downloaded the WHO TB burden estimates. The data includes WHO-generated estimates of TB mortality, incidence (including disaggregation by age and sex and incidence of TB/HIV), case fatality ratio, treatment coverage (previously called case detection rate), proportion of TB cases that have rifampicin-resistant TB (RR-TB, which includes cases with multidrug-resistant TB, MDR-TB), RR/MDR-TB among notified pulmonary TB cases and latent TB infection among children aged under 5.
*WHO measures the burden of tuberculosis (TB) disease in terms of:
- incidence - the number of new and relapse cases of TB arising in a given time period, usually 1 year;
- prevalence - the number of cases of TB at a given point in time; and
- mortality - the number of deaths caused by TB in a give time period, usually 1 year.
Incidence is reported either as absolute number of new cases or as a proportion of the population per 100,000
Below are separate tabs to view the different steps taken to read the data in R. Click on them from left to right to follow the steps chronologically. We already read in the TB burden estimates data earlier to produce the plot of the incidence and number of deaths per day earlier. The steps are shown here for instructional puposes.
In case you encounter a variable name in the codes that you don’t understand, you can search for its meaning by clicking on the rightmost tab marked Reading the TB burden estimates data
and entering the variable name in the search tab. The search engine is case-sensitive so make sure spell the name accurately. You can also search using keywords in case you don’t know the variable name you are interested in.
In the interest of reproducible research, the codes for this article can be seen by clicking on the rectangular tabs marked “code” on the right saide margin in each section. For students of R programming feel free to copy the codes. Freely have I received and freely i give.
We loaded the packages: dplyr
, tidyr
, and ggplot2
earlier to create the plot. We will load other packages to further extend the basic functionality of R to help us manipulate and plot the data.
library(tibble)
library(ggmap)
library(ggrepel)
library(ggalt)
library(DT)
TB.dic <- read.csv("TB_data_dictionary_2018-05-20.csv", ### read file in R
stringsAsFactors = FALSE) ### read strings as strings
var_dic <- TB.dic[,1] ### store variable names in a vector
defofterms <- TB.dic[, c(1,4)] ### subset 1st and last column of the dictionary
head(TB.dic)
## variable_name dataset code_list
## 1 bmu Community engagement
## 2 bmu_community_impl Community engagement
## 3 bmu_ref_data Community engagement
## 4 bmu_rxsupport_data Community engagement
## 5 bmu_rxsupport_data_coh Community engagement
## 6 community_data_available Community engagement 0=No; 1=Yes
## definition
## 1 Number of TB Basic Management Units in the country
## 2 Number of TB Basic Management Units which implemented community-based referrals or any form of community treatment adherence support
## 3 Number of Basic Management Units with data on referrals by community health workers / community volunteers
## 4 Number of Basic Management Units with data on community treatment adherence support
## 5 Total number of patients who started TB treatment in the Basic Management Units with data on community treatment adherence support
## 6 Are data available on community-based referrals or any form of community treatment adherence support?
TB.burden <- read.csv("TB_burden_countries_2018-05-20.csv", ### read in file
stringsAsFactors = FALSE)
var_burden <- names(TB.burden) ### store variable names in a vector
var_burden_in_dic <-var_dic[var_dic %in% var_burden] ### find which variable names in the in the TB.burden dataframe in the dictionary
var_burden_def <- subset(defofterms, ### create a dataframe of the variable
defofterms$variable_name %in% var_burden_in_dic) ### names and their definition
datatable(var_burden_def) ### show variable and definition
The map below shows the countries with higher incidence of TB (per 100,000 population) in lighter shades of blue. Incidence varied widely among countries, from under 10 per 100 000 population in most high-income countries to 150-300 in most of the 30 high TB burden countries, and above 500 in a few countries including the Democratic People’s Republic of Korea, Lesotho, Mozambique, the Philippines and South Africa.
world_map <- map_data("world") ### Create data for world map
world_map <- world_map[world_map$region != "Antarctica",] ### remove Antartica
map_region <- unique(world_map$region) ### get names of countries
TB_country <- unique(TB.burden$country) ### get names of countries
TB.burden_recoded <- TB.burden %>% ### create subset of the
select(country, ### TB.burden data and
g_whoregion, ### select variables that
year, ### we'll need to plot
e_pop_num, ### the data
e_inc_100k,
e_inc_num,
e_inc_num_014,
e_inc_num_15plus,
e_inc_num_f,
e_inc_num_f014,
e_inc_num_f15plus,
e_inc_num_m,
e_inc_num_m014,
e_inc_num_m15plus,
e_mort_num,
e_mort_100k,
e_mort_exc_tbhiv_100k)
### reconcile spelling of the names of the countries in the two data frames before merging
TB.burden_recoded$country <- recode(TB.burden_recoded$country,
"Antigua and Barbuda" = "Antigua",
"Bolivia (Plurinational State of)" = "Bolivia",
"Bonaire, Saint Eustatius and Saba" = "Bonaire",
"British Virgin Islands" = "VirginIslands",
"Brunei Darussalam" = "Brunei",
"Cabo Verde" = "Cape Verde",
"Côte d'Ivoire" = "Ivory Coast",
"Congo" = "Republic of Congo",
"Czechia" = "Czech Republic",
"Democratic People's Republic of Korea" = "South Korea",
"Iran (Islamic Republic of)" = "Iran",
"Lao People's Democratic Republic" = "Laos",
"Micronesia (Federated States of)" = "Micronesia",
"Republic of Korea" = "North Korea",
"Republic of Moldova" = "Moldova",
"Russian Federation" = "Russia",
"Saint Kitts and Nevis" = "Nevis",
"Saint Vincent and the Grenadines" = "Saint Vincent",
"Sint Maarten (Dutch part)" = "Sint Maarten",
"Syrian Arab Republic" = "Syria",
"The Former Yugoslav Republic of Macedonia" = "Macedonia",
"Trinidad and Tobago" = "Trinidad",
"United Kingdom of Great Britain and Northern Ireland" = "UK",
"United States of America" = "USA",
"United Republic of Tanzania" = "Tanzania",
"Venezuela (Bolivarian Republic of)"= "Venezuela",
"Viet Nam" = "Vietnam",
"Wallis and Futuna Islands" = "Wallis and Futuna")
inset_plot <- TB.burden %>% select(year, e_inc_100k) %>% filter(year == 2016) %>% ggplot(aes(x = e_inc_100k)) + geom_histogram(Color = "yellow", fill = "slateblue")+ theme_hc(bgcolor = "darkunica") + labs(title = "Histogram of\nincidence of TB", subtitle = "per 100,000 population", x = "", y = "") +theme(title = element_text(size = 8), axis.text = element_text(color = "white"))
g <- ggplotGrob(inset_plot)
###g <- qplot(data = TB.burden, x = e_inc_100k, geom = "histogram", fill = I("slateblue"))
world_map_joined <- left_join(world_map, ### merged world map data
TB.burden_recoded %>%
filter(year == 2016), ### with TB data
by = c('region' = 'country'))
ggplot(data = world_map_joined) + geom_polygon(aes (x = long,
y = lat,
group = group,
fill = e_inc_100k)) +
labs(title = "The Global Map of TB 2016",
subtitle = "Estimated Incidence (per 100,000)",
caption = "WHO | Tuberculosis Data",
x ="", y = "") +
scale_fill_gradient(name = "incidence ",
guide = "colourbar") +
theme_hc(bgcolor = "darkunica") +
scale_colour_hc("darkunica") +
theme(legend.key.width = unit(2, "cm"),
panel.grid.major.y = element_blank(),
legend.just = c("right","bottom"),
axis.text = element_blank()) +
annotation_custom(grob = g,
xmin = -Inf,
xmax = -80,
ymin = -Inf,
ymax = 20)
The 30 High TB Burden Countries were higlighted to focus global action on TB in countries where progress is most needed. High TB Burden Countries account for 87.1% of estimated absolute number of incident cases and 86.5% estimated absolute number of mortality cases. The 20 countries with the highest estimated absolute number of incident TB cases, plus the top 10 countries with the highest estimated TB incidence per capita make up the list.
top30 <- c("Angola", ### filter countries
"Bangladesh", ### that belong to the
"Brazil", ### High Burden Countries
"Cambodia",
"Central African Republic",
"China",
"Republic of Congo",
"South Korea",
"Democratic Republic of the Congo",
"Ethiopia",
"India",
"Indonesia",
"Kenya",
"Lesotho",
"Liberia",
"Mozambique",
"Myanmar",
"Namibia",
"Nigeria",
"Pakistan",
"Papua New Guinea",
"Philippines",
"Russia",
"Sierra Leone",
"South Africa",
"Thailand",
"Tanzania",
"Vietnam",
"Zambia",
"Zimbabwe")
g1 <- TB.burden_recoded %>% select(country, year, e_inc_num) %>% filter(year == 2016) %>% mutate(hbc = ifelse(country %in% top30, TRUE, FALSE)) %>% group_by(hbc) %>% summarise(sum_inc = sum(e_inc_num)) %>% mutate(perc = paste0(round(sum_inc/sum(sum_inc)*100,1), "%")) %>% ggplot(aes(x = hbc, y=sum_inc, fill = factor(hbc))) + geom_bar(stat = "identity") + geom_text(aes(label = perc), vjust = 1.6, color = "white", size = 3.5) + theme_hc(bgcolor = "darkunica") + labs(title = "Incidence of TB cases in 2016 among\nHigh Burden Countries and\nthe rest of the world", subtitle = "estimated number of incident cases" , caption = "WHO | Tuberculosis Data", x = "", y = "") + guides(fill = FALSE) + scale_x_discrete(labels = c("rest of the world", "High Burden Countries")) + scale_fill_manual(values = c("mediumblue", "maroon2")) + theme(axis.text.x = element_text(colour = "white"), title = element_text(size = 9))
g2 <- TB.burden_recoded %>% select(country, year, e_mort_num) %>% filter(year == 2016) %>% mutate(hbc = ifelse(country %in% top30, TRUE, FALSE)) %>% group_by(hbc) %>% summarise(sum_mort = sum(e_mort_num)) %>% mutate(perc = paste0(round(sum_mort/sum(sum_mort)*100,1), "%")) %>% ggplot(aes(x = hbc, y=sum_mort, fill = factor(hbc))) + geom_bar(stat = "identity") + geom_text(aes(label = perc), vjust = 1.6, color = "white", size = 3.5) + theme_hc(bgcolor = "darkunica") + guides(fill = FALSE) + labs(title = "Mortality of TB cases in 2016 among\nHigh Burden Countries and\nthe rest of the world", subtitle = "estimated number of mortality cases", caption = "WHO | Tuberculosis Data", x = "", y = "") + scale_x_discrete(labels = c("rest of the world", "High Burden Countries")) + scale_fill_manual(values = c("mediumblue", "maroon2")) + theme(axis.text.x = element_text(colour = "white"), title = element_text(size = 9))
grid.arrange(g1, g2, nrow =1)
The top 20 by estimated absolute number (in alphabetical order): - Angola, Bangladesh, Brazil, China, DPR Korea, DR Congo, Ethiopia, India, Indonesia, Kenya, Mozambique, Myanmar, Nigeria, Pakistan, Philippines, Russian Federation, South Africa, Thailand, UR Tanzania, Viet Nam
The additional 10 by estimated incidence rate per 100 000 population and with a minimum number of 10 000 cases per year (in alphabetical order): - Cambodia, Central African, Republic of Congo, Lesotho, Liberia, Namibia, Papua New Guinea, Sierra Leone, Zambia, Zimbabwe.
The map below highlight the 30 countries with the highest burden of TB in different shades of pink. Countries not included in the list are in gray. A darker shade indicate countries with higher burden of disease. The ggplot2
package offers a lot of flexibility to alter different aspects of the plot to suit different situations.
top30mapdf <- TB.burden_recoded %>% ### take 2016 data from
filter(year == 2016,
country %in% top30) %>% ### high burden countries
select(country, ### select var needed
e_inc_num) %>% ### arrange incidence
arrange(desc(e_inc_num)) ### in descending order
top30_joined <- left_join(world_map, ### merge data by
top30mapdf, ### by country
by = c('region' = 'country'))
ggplot() + geom_polygon(data = top30_joined, ### plot data
aes (x = long,
y = lat,
group = group,
fill = e_inc_num)) +
labs(title = "The 30 High Burden Countries",
subtitle = "estimated number of incident TB cases",
caption = "WHO | Tuberculosis Data") +
scale_fill_gradient(name = "incidence ",
guide = "colourbar",
low = "plum",
high = "mediumvioletred") +
theme_hc(bgcolor = "darkunica") +
scale_colour_hc("darkunica") +
theme(legend.key.width = unit(2, "cm"),
title = element_text(colour = "plum"))
We can include labels to facilitate identificaton of the countries in the list and their geographic locations.
top30_latlon <- read.csv("top30_latlon2.csv", ### get de and longitude position
stringsAsFactors = FALSE) ### of high burden countries
ggplot(data = top30_joined) + ### plot map
geom_polygon(aes (x = long,
y = lat,
group = group,
fill = e_inc_num)) +
labs(title = "The 30 High Burden Countries",
subtitle = "estimated number of incident cases", ### add labels
caption = "WHO | Tuberculosis Data") +
scale_fill_gradient(name = "incidence", ### change name of legened
guide = "colourbar") +
theme_hc(bgcolor = "darkunica") + ### use preset theme
scale_colour_hc("darkunica") +
theme(legend.key.width = unit(2, "cm")) + ### set legend dimensions
geom_text_repel(data = top30_latlon, ### label countries
aes(x = lon, ### avoid overlapping labels
y = lat,
label = country),
nudge_x = 2,
size = 3,
color = "lawngreen")
The plot below shows the ranking among the 30 High Burden countries in terms of absolute number of incident cases. It also highlight how far India is from the rest of the world in terms of absolute number of incident cases.
An estimated 10.4 million people fell ill with TB in 2016: 56% were in five countries: India, Indonesia, China, the Philippines and Pakistan.
TB.burden_recoded %>%
filter(year == 2016,
country %in% top30) %>%
select(country,
e_inc_num, year) %>%
arrange(desc(e_inc_num)) %>%
ggplot(aes(x = reorder(country,
e_inc_num),
y = e_inc_num)) +
geom_point(size = 3,
color = "firebrick1") +
coord_flip() +
scale_y_reverse() +
labs(title = "The 30 High Burden Countries",
subtitle = "Estimated number of incident cases",
x = "",
y = "Count",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"))
Notice in the left plot below how the values and ranking changed when incidence is reported in proportion to the size of the countries population as compared to absolute number of new cases like in the plot above. In terms of incidence per 100,000 population, South Africa ranks number 1, followed by Lesotho, and the Philippines, third. The plot on the right below shows the estimated total population number per country. The other countries on the list are dwarfed by China and India in terms of population size.
TB.burden_recoded$country_abbrev <- recode(TB.burden_recoded$country,
"Central African Republic" = "Cen African Rep",
"Democratic Republic of the Congo" = "DP Congo")
g1 <- TB.burden_recoded %>%
filter(year == 2016,
country %in% top30) %>%
select(country_abbrev,
e_inc_100k, year) %>%
arrange(desc(e_inc_100k)) %>%
ggplot(aes(x = reorder(country_abbrev,
e_inc_100k),
y = e_inc_100k)) +
geom_point(size = 3,
color = "cyan") +
coord_flip() +
scale_y_reverse() +
labs(title = "The 30 High Burden Countries",
subtitle = "Estimated incidence per 100000 population",
x = "",
y = "incidence (per 100,000)",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
title = element_text(size = 10),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"))
g2 <- TB.burden_recoded %>%
filter(year == 2016,
country %in% top30) %>%
select(country_abbrev,
e_pop_num, year) %>%
arrange(desc(e_pop_num)) %>%
ggplot(aes(x = reorder(country_abbrev,
e_pop_num),
y = e_pop_num)) +
geom_point(size = 3,
color = "purple1") +
coord_flip() +
scale_y_reverse() +
labs(title = "The 30 High Burden Countries",
subtitle = "Estimated total population",
x = "",
y = "Count",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
title = element_text(size = 10),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"))
grid.arrange(g1, g2, nrow=1)
For the plots below we will be using the variable e_inc_100k
which reports the incidence of TB per 100,000 population. The plots earlier were constructed using the variable e_inc_num
which shows the absolute number of incident cases per country. In this way, the value that is reported does not only take into account the number of cases per country but also the proportion of the population that is affected. The TB mortality rate (per 100 000 population) fell by 37% between 2000 and 2016.
inc_2016 <- TB.burden_recoded %>%
select(country,
year, e_inc_100k) %>%
group_by(year) %>%
summarise(incidence = sum(e_inc_100k)) %>%
ggplot(aes(x = year,
y = incidence)) +
geom_point(color = "seagreen",
size = 2) +
geom_line(color = "royalblue") +
labs(title = "Estimated Global Incidence of TB cases",
subtitle = "per 100,000",
x = "Year",
y = "Estimated Incidence (per 100,000)",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"),
title = element_text(size = 8))
mort_2016 <- TB.burden_recoded %>%
select(country,
year,
e_mort_exc_tbhiv_100k) %>%
group_by(year) %>%
summarise(mortality = sum(e_mort_exc_tbhiv_100k)) %>%
ggplot(aes(x = year,
y = mortality)) +
geom_point(size = 2,
color = "tomato") +
geom_line(color = "navajowhite") +
labs(title = "Estimated Global mortality of TB cases",
subtitle = "Excluding HIV (per 100,000)",
x = "Year",
y = "Estimated mortality (per 100,000)",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"),
title = element_text(size = 8))
grid.arrange(inc_2016, mort_2016, nrow = 1)
Globally, the TB mortality rate is falling at about 3% per year. TB incidence is falling at about 2% per year and 16% of TB cases die from the disease; by 2020, these figures need to improve to 4-5% per year and 10%, respectively, to reach the first (2020) milestones of the End TB Strategy
The graph below shows how high the incidence and mortality were in the African Region in 2000 and the significiant improvement that has been achieved from 2000 to 2016.
g1 <- TB.burden_recoded %>%
select(country,
g_whoregion,
year,
e_inc_100k) %>%
group_by(g_whoregion, year) %>%
summarize(incidence = sum(e_inc_100k)) %>%
ggplot(aes(x = year,
y = incidence,
color = g_whoregion)) +
geom_line() +
geom_point(size = 2) +
labs(title = "Estimated Incidence per WHO Region",
x = "",
y = "Estimated Incidence (per 100,000)",
caption = "WHO | Tuberculosis Data") +
guides(colour = guide_legend("WHO Region")) +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
panel.grid.major.x = element_line(color = "gray50"),
title = element_text(size = 8),
legend.position = "bottom",
legend.title = element_text(size = 10))
g2 <- TB.burden_recoded %>%
select(country,
g_whoregion,
year,
e_mort_exc_tbhiv_100k) %>%
group_by(g_whoregion, year) %>%
summarize(incidence = sum(e_mort_exc_tbhiv_100k)) %>%
ggplot(aes(x = year,
y = incidence,
color = g_whoregion)) +
geom_line() +
geom_point(size = 2) +
labs(title = "Estimated mortality per WHO Region",
subtitle = "Excluding HIV (per 100,000)",
x = "",
y = "Estimated Incidence (per 100,000)",
caption = "WHO | Tuberculosis Data") +
guides(colour = guide_legend("WHO Region")) +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
panel.grid.major.x = element_line(color = "gray50"),
title = element_text(size = 8), legend.position = "bottom",
legend.title = element_text(size = 10))
grid.arrange(g1, g2, nrow = 1)
Regionally, the fastest decline in TB incidence is in the WHO European Region (4.6% from 2015 to 2016). The fastest declines in the TB mortality rate are in the WHO European Region and the WHO Western Pacific Region (6.0% and 4.6% per year, respectively, since 2010).
In the plot below, Incidence is plotted on the x-axis, with the highest values on the left and 0 on the right. For most High Burden countries, the dark purple colored points representing the year 2000 are located on the left and the light yellow colored point representing the year 2016 is on the right, indicating a decrease in the number of incident cases across the years. Notice however that in the case of Angola, Liberia, and Mozambique, the opposite is true. In the case of the Philippines the yellow colored point is in the middle, indicating that there was some resurgence in the number of incident cases in recent years.
The distance between the purple colored points and the yellow colored points quantifies the amount of change. The longer the distance, the greater the amount of change. Tanzania, Zambia, Cambodia, and Ethiopia seem to be getting there slowly but surely while Zimbabwe, Kenya, and the Central African Republic are in a hurry to get there by leaps and bounds. The goal set by the WHO is less than 10 tuberculosis cases per 100,000.
The plot below was inspired by a plot created by eugejoh. I tried to recreate it with some changes added.
TB.burden_recoded %>%
filter(country %in% top30) %>%
ggplot(aes(x = reorder(country_abbrev, e_inc_100k),
y = e_inc_100k,
color = year)) +
geom_point(size = 3) +
coord_flip() +
scale_y_reverse() +
labs(title = "Estimated Incidence",
subtitle = "per 100,000",
x = "",
y = "Estimated Incidence (per 100,000)",
caption = "WHO | Tuberculosis Data") +
scale_color_gradient(high = "yellow",
low = "purple") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"))
The decline since 2010 has exceeded 4% per year in several high TB burden countries, including Ethiopia, Kenya, Lesotho, Namibia, the Russian Federation, the United Republic of Tanzania, Zambia and Zimbabwe.
The plot below is a variant of the plot above except that it shows the estimate TB mortality per 100,000. It conveys the rate of decline more effectively by showing the angle of the slope of the line. The steeper the slope the faster the rate of decline. It also shows whether the downward trend was continuous or whether there were periods of reversal.
There was a steady rapid decline in Ethiopia, Central African Republic, and South Korea while the gentler slope of the line for India, Indonesia and the Philippines show a slow but steady decline.
TB.burden_recoded %>%
filter(country %in% top30) %>%
ggplot(aes(x = year,
y = e_mort_exc_tbhiv_100k,
color = country)) +
geom_point() +
geom_line() +
facet_wrap(~country_abbrev, ncol = 5) +
guides(color= FALSE) +
labs(title = "Estimated Mortality among High Burden Countries 2000 - 2016",
subtitle = "Excluding HIV (per 100,000)",
x = "year",
y = "Estimated Mortality (per 100,000)",
caption = "WHO | Tuberculosis Data") +
theme(axis.text.x = element_blank()) +
theme_hc(bgcolor = "darkunica")
For direct comparison between countries, the plot below provides a better view. Countries that might be considered outliers or different from the rest, are highlighted. It also reveals the upward and downward trends clearly across the years. Identifying which line represents a certain country can be difficult.
TB.burden_recoded %>%
filter(country %in% top30) %>%
ggplot(aes(x = year,
y = e_mort_exc_tbhiv_100k,
color = country_abbrev)) +
geom_point() +
geom_line() +
labs(title = "Decline in Mortality among High Burden Countries 2000 - 2016",
x = "year",
y = "Estimated Mortality (per 100,000)",
caption = "WHO | Tuberculosis Data") +
theme(axis.text.x = element_blank(),
legend.position = "bottom") +
theme_hc(bgcolor = "darkunica") +
theme(legend.title = element_blank())
Each point represents a high TB burden country on the interactive plot below. Estimated mortality is plotted on the x axis and estimated incidence on the y axis. The size of the points map the estimated population number and the color maps the WHO region where the country belongs. Year is depicted by the movement and position of the datapoint on the plot.
Hovering the pointer using the mouse on a datapoint/country reveals all the information mentioned above. You can adjust the slider at the bottom of the plot to whichever year you desire to show the data for a particular year.
Repeatedly playing the video or sequence of events backward and forward allows one to sort of freeze the moment in time and compare the progress of each country year by year.
library(plotly)
gg <- TB.burden_recoded %>% filter(country %in% top30) %>% ggplot(aes(x = e_mort_100k, y = e_inc_100k, color = g_whoregion)) + geom_point(aes(size = e_pop_num, frame = year, ids = country)) + labs(x= "Estimated Mortality (per 100,000)", y = "Estimated Incidence (per 100,000)") + theme_hc(bgcolor = "darkunica") + scale_colour_hc("darkunica") + theme(legend.position = "bottom")
ggplotly(gg)
A diverging barplot is a wonderful tool to show a summary of the data. Coloring the bars as to whether the number of incident TB cases increased or declined conveys the information in an expedient manner. Let’s look at the period between 2010 to 2016 and see whether incidence increased or decreased for each country.
Our list of countries is quite long which resuilts in a very long plot.
inc2010 <- TB.burden %>%
select(country,
year,
e_inc_100k) %>%
filter(year == 2010) %>%
select(country, e_inc_100k)
inc2016 <- TB.burden %>%
select(country,
year,
e_inc_100k) %>%
filter(year == 2016) %>%
select(country, e_inc_100k)
names(inc2010)[2] <- 'e_inc_2010'
names(inc2016)[2] <- 'e_inc_2016'
merged_df <- merge(inc2010, inc2016)
merged_df$diff <- merged_df$e_inc_2010 - merged_df$e_inc_2016
merged_df$type <- ifelse(merged_df$diff < 0, 'below', 'above')
merged_df <- merged_df %>% filter(diff != 0)
merged_df$country <- as.factor(merged_df$country)
merged_df <- merged_df %>% select(country, diff, type) %>% arrange(desc(diff))
rank <- merged_df$country
merged_df$country <- factor(merged_df$country, levels = rank)
below <- merged_df %>% filter(type == "below")
ggplot(merged_df,
aes(x=country,
y=diff)) +
geom_bar(stat='identity',
aes(fill=type),
width=.5) +
coord_flip() +
scale_fill_manual(name = "status",
values = c("turquoise2",
"orangered1"),
labels = c("improved",
"worsened")) +
labs(title = "Change in Estimated Incidence",
subtitle = "per 100,000 population",
caption = "WHO | Tuberculosis Data",
y = "",
x= "Number of cases") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
axis.title.y = element_blank())
Of the 218 countries in the WHO list of countries, 44 had an increase in estimated incidence of TB cases between the period 2010 to 2016. Among these, 3 were from the High Burden countries. These were: Mozambique, Liberia, Philippines
The CFR is the proportion of people with TB who die from the disease; it can be approximated as the number of TB deaths divided by TB incidence in the same year
Fighting TB requires significant investments in infrastructure, personnel, and technology that is incorporated in a program that seeks to form a unified approach to providing timely access to diagnosis and treatment. Getting the patient to come in for diagnosis and then ensuring daily compliance to treatment regimen 6 to 9 months has been a perenial problem. The cfr in a way is a gauge of how well services are provided or delivered.
There is considerable country variation in the CFR, from under 5% in a few countries to more than 20% in most countries in the WHO African Region. This shows considerable inequalities among countries in access to TB diagnosis and treatment that need to be addressed.
TB.burden %>%
filter(year == 2016,
cfr != is.na(cfr)) %>%
select(country,
cfr,
year) %>%
arrange(desc(cfr)) %>%
ggplot(aes(x = reorder(country,
cfr),
y = cfr)) +
geom_bar(fill = "dodgerblue",
stat = "identity") +
coord_flip() +
labs(title = "The 30 High Burden Countries",
subtitle = "Case Fatality Ratio",
x = "", y = "Percent",
caption = "WHO | Tuberculosis Data") +
theme_hc(bgcolor = "darkunica") +
theme(axis.text = element_text(colour = "white"),
axis.title = element_text(colour = "white"),
legend.position = "right",
panel.grid.major.x = element_line(color = "gray50"))
It is noteworthy to point out that even though the Philippines, China, and Cambodia are among the High Burden TB Countries, they rank high in terms of low Case Fatality Ratio among the countries of the world.
CFR is a key indicator for monitoring progress towards the 2020 and 2025 milestones. A CFR of 6% is required to achieve the 2025 global milestone for reductions in TB deaths and cases.
Based on the available data we have examined, I come to the following conclusions.
The number of new cases and deaths from TB are on the decline, but it has to decline even faster in order to meet the milestone targets for 2020 and 2025 and the WHO end goal for 2030.
The number of new TB cases and deaths from TB are concentrated in several countries. These countries need special attention and probably assistance in order to keep pace with the goals set by WHO.
There is substantial variation in population density, incidence and mortality among the different countries of the world which suggests that an individualized approach or strategy will be required to address the TB situation in each one.
The goal of the WHO is overaching, some might even say it is ambitious but the fact is it is high time that the TB problem is brought to it’s successful conclusion. TB is curable and the means to treat it is available but it will require a global effort and political will from every country to see its end.
I hope you enjoyed this bit of presentation. Please let me know if there are any mistakes in the analysis and plotting of the data or any comments that you may have. I am open to constructive comments. You can open a pull request from this repository or email me at eofilda@yahoo.com.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.7.1 maps_3.3.0 DT_0.4 ggalt_0.4.0
## [5] ggrepel_0.8.0 ggmap_2.6.1 tibble_1.4.2 bindrcpp_0.2.2
## [9] gridExtra_2.3 ggthemes_3.5.0 ggplot2_2.2.1 tidyr_0.8.1
## [13] dplyr_0.7.5
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 lattice_0.20-35 png_0.1-7
## [4] assertthat_0.2.0 rprojroot_1.3-2 digest_0.6.15
## [7] proj4_1.0-8 mime_0.5 R6_2.2.2
## [10] plyr_1.8.4 backports_1.1.2 evaluate_0.10.1
## [13] httr_1.3.1 pillar_1.2.3 RgoogleMaps_1.4.2
## [16] rlang_0.2.1 lazyeval_0.2.1 data.table_1.11.4
## [19] geosphere_1.5-7 extrafontdb_1.0 rmarkdown_1.10
## [22] proto_1.0.0 labeling_0.3 extrafont_0.17
## [25] stringr_1.3.1 htmlwidgets_1.2 munsell_0.5.0
## [28] shiny_1.1.0 compiler_3.5.0 httpuv_1.4.3
## [31] pkgconfig_2.0.1 htmltools_0.3.6 tidyselect_0.2.4
## [34] viridisLite_0.3.0 later_0.7.3 MASS_7.3-49
## [37] grid_3.5.0 jsonlite_1.5 xtable_1.8-2
## [40] Rttf2pt1_1.3.6 gtable_0.2.0 magrittr_1.5
## [43] scales_0.5.0 KernSmooth_2.23-15 stringi_1.1.7
## [46] mapproj_1.2.6 reshape2_1.4.3 promises_1.0.1
## [49] sp_1.3-1 ash_1.0-15 rjson_0.2.20
## [52] RColorBrewer_1.1-2 tools_3.5.0 glue_1.2.0
## [55] purrr_0.2.5 crosstalk_1.0.0 jpeg_0.1-8
## [58] yaml_2.1.19 colorspace_1.3-2 knitr_1.20
## [61] bindr_0.1.1