Rewriting Herstory


Using text analysis to explore powerful female counterparts of well-known men in history.


What would a "woman's world" look like?

History education around the world is overwhelmingly male-centric. Across Time Magazine's analysis of the most-googled people in history, the Smithsonian's list of the Most Significant Americans of All Time, Wikipedia's most-viewed pages about people, the list of S&P 500 company leaders, and the list of Nobel Prize laureates, female representation never exceeded 20%. In fact, there were more men named "John" than there were women on the boards of S&P 500 companies.

Using the R packages rvest, tm, and slam, together with Wikipedia's lists of women, I scraped each of these sources to compile the biographies of the top 50 influential men in history and of over 7,000 women. To achieve industry diversity, I matched on the following occupations:

Astronauts  |  Astronomers  |  Business People  |  Composers  |  Explorers  |  Political Leaders  |  Inventors  |  Mathematicians  |  Philosophers  |  Scientists  |  Writers  |  Film Directors  |  Civil Rights Leaders  |  Artists  |  Computer Scientists

By computing the cosine similarity between each man's row in a document-term matrix and the rows of the women in his industry, I collected the top-matching counterpart for each man. The [unfinished] results are below.

Famous Men  |  Description  |  Famous Women  |  Description  |  Match Quality
William Shakespeare  |  Poet, Playwright  |  Delia Bacon  |  Playwright, known for her work on the authorship attribution of Shakespeare's plays  |  Unsure
Aristotle  |  Philosopher  |  Mary Louise Gill  |  Professor of Philosophy, focuses on Aristotle  |  Unsure
Charles Darwin  |  Biologist  |  Mary Anne Whitby  |  Introduced silkworm cultivation to the UK with Darwin  |  Good
Christopher Columbus  |  Explorer  |  Carol Beckwith  |  Photojournalist who documented indigenous tribes of Africa  |  Good
Wolfgang Amadeus Mozart  |  Composer  |  Jitka Snizkova  |  Czech composer, President of the Mozart Society  |  Unsure
Leonardo da Vinci  |  Inventor  |  Mary the Jewess  |  Known as the first alchemist of the Western world; credited with inventing chemical apparatus  |  Good
Winston Churchill  |  Political Leader  |  Christy Clark  |  35th Premier of British Columbia  |  Unsure
Walt Disney  |  Businessperson  |  Vanna Bonta  |  Italian-American writer, actress, and inventor; voice talent on Beauty and the Beast  |  Bad
Marco Polo  |  Explorer  |  Freya Stark  |  Explorer and writer of two dozen books on her travels in the Middle East; one of the first non-Arabs to travel the Arabian Desert  |  Good
Confucius  |  Philosopher  |  Iris Murdoch  |  British philosopher and novelist, Dame Commander of the Order of the British Empire, ranked among the top 50 greatest writers  |  Good
Benjamin Franklin  |  Inventor  |  Mary Dixon Kies  |  First American woman to receive a patent, for her straw hats  |  Unsure
Neil Armstrong  |  Astronaut  |  Peggy Whitson  |  First female commander of the ISS, 3rd in cumulative EVA time (longest for women)  |  Good
Andrew Carnegie  |  Businessperson  |  Karin Forseke  |  Swedish CEO of D. Carnegie & Co, an investment bank  |  Bad
J.P. Morgan  |  Businessperson  |  Zoe Cruz  |  Former Co-President of Morgan Stanley  |  Bad

Initial takeaways

The quality of the matches varies widely. High-quality matches included Neil Armstrong vs. Peggy Whitson and Marco Polo vs. Freya Stark. Many of the weaker matches paired a famous man with a woman who researched him, rather than with a woman of comparable achievements.

The most obviously and amusingly incorrect match was J.P. Morgan vs. Zoe Cruz, the former Co-President of Morgan Stanley. The algorithm pulled the common Wharton freshman mistake of confusing these two banks. Can't blame it though - I didn't know the difference until just last year!
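One plausible reason for both kinds of weak match is that a biography which repeats the man's name shares a heavily weighted term with his page. Here is a minimal sketch of that effect, using made-up term counts rather than the scraped data. The vectors `man` and `woman` and the helper `cosine` are hypothetical names introduced only for this illustration.

```r
# Hypothetical stemmed term counts for two pages (not real data).
# Both pages mention "morgan" often but otherwise overlap very little.
man   <- c(morgan = 12, bank = 8, steel = 5, railroad = 6, equiti = 0, trade = 0)
woman <- c(morgan = 10, bank = 4, steel = 0, railroad = 0, equiti = 7, trade = 6)

# Cosine similarity between two count vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity with the shared name included, and with it dropped
with_name    <- cosine(man, woman)
without_name <- cosine(man[names(man) != "morgan"],
                       woman[names(woman) != "morgan"])

print(round(c(with_name = with_name, without_name = without_name), 3))
```

In this toy case the score drops sharply once the shared name is removed, which suggests the name alone can carry a match. Whether the same holds for the real pages is untested here.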

Moving forward

The final product will be a searchable and ever-growing database of counterparts throughout history to promote the education and appreciation of female achievements. Email me if you're interested in collaborating!

R code


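# Packages used throughout
library(rvest)  # read_html, html_nodes, html_attr, html_text, %>%
library(tm)     # VCorpus, tm_map, DocumentTermMatrix (stemDocument needs SnowballC installed)
library(slam)   # tcrossprod_simple_triplet_matrix, row_sums

# Reading Women Wiki Pages
# Collect the href of every link on the list page
womencs3 = read_html("")
womencs3 = html_attr(html_nodes(womencs3, css="a"), "href")
womencs3 = womencs3[!is.na(womencs3)]  # drop anchors without an href

# Prepend the site prefix to build full page URLs
womencs3full = paste("", womencs3, sep="")

# Download each page and collapse its paragraphs into one string;
# pages that fail to load are reported and stay NA
womencs3text = rep(NA_character_, length(womencs3full))
for (i in seq_along(womencs3full)) {
  tryCatch({
    z = read_html(womencs3full[i])
    z = html_text(html_nodes(z, css="p"))
    womencs3text[i] <- paste(z, collapse=" ")
  }, error=function(e){cat("ERROR :", conditionMessage(e), "\n")})
}

# Keep only the pages that were read, so links and texts stay aligned
ok = !is.na(womencs3text)
womencs3full = womencs3full[ok]
womencs3text = womencs3text[ok]

# Read Man's Wiki Link
url = ""
mantext = read_html(url) %>% html_nodes("p") %>% html_text()

# Using Cosine
# Stack the women's texts first and the man's text last
mantext = paste(mantext, collapse=" ")
mantext.df = data.frame("Name", mantext, stringsAsFactors=FALSE)
womentext.df = data.frame("Name", womencs3text, stringsAsFactors=FALSE)
colnames(mantext.df) = c("Person", "Text")
colnames(womentext.df) = c("Person", "Text")
alltext = rbind(womentext.df, mantext.df)

# Turn to Corpus
# Clean each document, then count stemmed terms per document
corp = VCorpus(VectorSource(alltext$Text))
corp = tm_map(corp, removePunctuation)
corp = tm_map(corp, removeNumbers)
corp = tm_map(corp, content_transformer(tolower))
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, stemDocument)
corp = tm_map(corp, stripWhitespace)
dtm <- DocumentTermMatrix(corp)

# Find Highest % Match
# Cosine similarity: pairwise dot products divided by the product of row norms
norms = sqrt(row_sums(dtm^2))
cosine_sim <- tcrossprod_simple_triplet_matrix(dtm, dtm)/(norms %*% t(norms))
diag(cosine_sim) = 0  # ignore each document's match with itself

# The man is the last row; pick the woman whose page scores highest against his
matchedvalues = cosine_sim[length(womencs3text)+1,]
top = sort(matchedvalues, decreasing = TRUE)[1]
womencs3full[which(matchedvalues == top)]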
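
As a sanity check of the similarity formula above, here is a minimal base-R sketch on a toy document-term matrix. The counts and the row names (`woman_a`, `woman_b`, `woman_c`, `man`) are made up for illustration. Plain `tcrossprod` and `rowSums` are swapped in for slam's sparse routines so the block runs without extra packages.

```r
# Toy document-term matrix: rows are documents, columns are term counts.
# All values are invented for illustration.
m <- rbind(
  woman_a = c(2, 0, 1, 0),
  woman_b = c(0, 3, 0, 1),
  woman_c = c(1, 0, 3, 0),
  man     = c(2, 0, 2, 0)
)

# Pairwise dot products divided by the product of the row norms
norms <- sqrt(rowSums(m^2))
cosine_sim <- tcrossprod(m) / outer(norms, norms)

# Ignore self-matches, then read off the man's row
diag(cosine_sim) <- 0
matchedvalues <- cosine_sim["man", ]

# Name of the best-scoring counterpart
best <- names(which.max(matchedvalues))
print(round(matchedvalues, 3))
print(best)
```

Because cosine similarity divides by the row norms, a long biography is not favored merely for containing more words. Only the direction of the term-count vector matters.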