Case studies
June 2020
Demographics behind 1,000 names of COVID-19 victims published by NYT
Charts: gender; age (18-39, 40-59, 60+); race (White American, African American, Asian American, Native American, two or more races); Hispanic origin (Hispanic, non-Hispanic); ethnicity (British, Hispanic, French, Italian, African, Jewish, East European, Germanic, East Asian, Indian, Nordic, Japanese, Arab)
The New York Times published the names of 1,000 people (out of more than 100,000) who died from COVID-19 in the US. This sad publication features the name, age, short bio and state of residence of each person, compiled from numerous publications across the country. Even sadder is the fact that this is less than 1% of all victims. The print edition of the newspaper dedicated several full pages to the names of the victims.
While it is a tragic tribute to COVID-19 victims, many researchers fighting COVID-19 are interested in case-by-case individual data because it can be fed into machine learning algorithms designed to fight COVID-19 in one way or another. Access to case data opens up several vital opportunities:
Get demographic insights to see the most affected demographics
Analyze location data to see the most affected geographies
Analyze medical data (conditions, treatment, test results, x-rays, etc.) to find contributing factors
Such case data is almost absent from the public domain, leaving a large gap in the opportunity to leverage the power of citizen science and the large number of data researchers worldwide.
Goal of the research
The NYT list of 1,000 names is not a solution to this problem, of course. It is a relatively small and incomplete data set of cases (e.g. it contains no useful medical information). However, it does contain names, so we decided to conduct a case study measuring the demographics of COVID-19 victims as a proof of concept of how Demografy can serve as one step in using AI to combat COVID-19: getting richer demographic data for each case.
For the present case study, we used the online version of the New York Times publication containing the data set of 1,000 names of US COVID-19 victims. Unfortunately, the web page displaying the data uses the HTML5 canvas element, so the data can't be copied directly by trivial automated means. As a result, we compiled the data set manually.
Data for each person contains the following information:
Full name
Age
State
City or county (optional)
Short bio
After manually collecting all the data in tabular format, we cleaned it using a program written for this purpose. The program normalizes names by removing middle names, initials and abbreviations such as "Jr." or "Sr.", leaving the first and last names as separate fields. Demografy was then used to classify demographic indicators. Demografy uses supervised machine learning algorithms to classify gender, age group, race, Hispanic origin and ethnicity from first and last names.
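The name clean-up step can be sketched roughly as follows. This is a minimal illustration in Python, not the program actually used: the function name `normalize_name`, the suffix list and the rule for dropping single-letter initials are all assumptions made for the example.

```python
import re

# Assumed suffix list; the rules of the original clean-up program are not published.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(full_name):
    """Return (first, last), dropping middle names, initials and suffixes."""
    # Split on whitespace and commas, then strip periods ("Jr." -> "Jr", "A." -> "A").
    tokens = [t.strip(".") for t in re.split(r"[\s,]+", full_name.strip())]
    # Drop generational suffixes such as "Jr" or "Sr".
    tokens = [t for t in tokens if t.lower() not in SUFFIXES]
    # Drop empty tokens and single-letter initials.
    tokens = [t for t in tokens if len(t) > 1]
    if not tokens:
        return ("", "")
    if len(tokens) == 1:
        return (tokens[0], "")
    # Keep only the first and last remaining tokens; middle names are discarded.
    return (tokens[0], tokens[-1])


# Hypothetical example names, not taken from the data set.
print(normalize_name("John A. Smith Jr."))
print(normalize_name("Mary Ann Jones"))
```

A rule this simple truncates multi-word surnames (for example "de la Cruz" becomes "Cruz"), so a production clean-up would likely need extra handling for such cases.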
More details about the methodology are available on request.
March 2018
How we measured Twitter demographics using AI
Source: Pew Research and comScore, compared with Demografy
Method: surveys and verified panel data, compared with Demografy's AI
Charts: gender; age (18-39, 40-59, 60+)
We are proud to publish our first case study. In it, we used Demografy to measure the age range and gender of US Twitter users and compared our results with two separate studies of Twitter demographics published by Pew Research and comScore. We also give a brief overview of existing technologies for measuring the demographics of website audiences.
Goal of the research
One of the reasons behind the research is to assess Demografy’s accuracy. However, before conducting this research we had already tested Demografy's accuracy by comparing its results with hundreds of thousands of publicly available self-reported records of real people.
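A check of this kind reduces to counting how often a predicted label matches the self-reported one. The sketch below is illustrative only: the function name `accuracy` and the toy labels are hypothetical and do not come from Demografy's actual test data.

```python
def accuracy(predicted, self_reported):
    """Share of records where the predicted label equals the self-reported label."""
    if len(predicted) != len(self_reported):
        raise ValueError("both lists must have the same length")
    if not predicted:
        return None  # accuracy is undefined for an empty sample
    matches = sum(p == t for p, t in zip(predicted, self_reported))
    return matches / len(predicted)


# Toy, made-up labels used purely to show the call.
predicted = ["female", "male", "male", "female"]
self_reported = ["female", "male", "female", "female"]
print(accuracy(predicted, self_reported))
```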
The research has three key aspects:
Explore and compare available approaches to measuring Internet demographics.
Benchmark our own performance against known and well-established solutions.
Implement a proof of concept of a brand new method of measuring demographics that eliminates the disadvantages of traditional approaches.
July 2018
Measuring gender bias in films: comparing our results with Google's
Female on-screen time
Source: Google, compared with Demografy
Method: face and gender recognition in video tracks, compared with gender detection in cast names
Charts: by MPAA rating (PG, PG-13, R); Oscar-winning vs. all movies; by genre (Horror, Romance, Comedy, Sci-Fi, Drama, Biography, Action, Crime)
Inspired by the Google/GD-IQ 2017 study on gender bias in the 300 top-grossing US movies of 2014–2016, we used Demografy to measure female on-screen time in the same movies by analyzing the names of the main cast. Although the two studies use different technologies and measure slightly different indicators, we got very close results and decided to publish them.
Goal of the research
As in our previous study of Twitter demographics, our challenge was to apply Demografy's technology to a real-world task whose results can be cross-verified against a reputable source. Google’s 2017 study was a perfect choice for this purpose.
Difference in measured indicators:
First of all, we’d like to highlight the difference between the two studies.
Google measured the share of female on-screen time by analyzing video frames with detected human faces.
Demografy measured the share of female cast members by analyzing the names of the top 10 cast members, assuming that the first 10 actors in the cast list represent the main characters in the movie.
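The name-based indicator can be sketched as below. This is a hedged illustration, assuming the gender label for each cast member has already been produced by a classifier; the function name `female_cast_share`, the label strings and the handling of unclassified names are assumptions, not Demografy's actual implementation.

```python
def female_cast_share(cast_genders, top_n=10):
    """Share of female labels among the first top_n cast members with a known gender.

    cast_genders: gender labels in billing order, e.g. "female", "male" or "unknown".
    """
    top = cast_genders[:top_n]
    # Assumption: names that could not be classified are left out of the denominator.
    known = [g for g in top if g in ("female", "male")]
    if not known:
        return None
    return sum(g == "female" for g in known) / len(known)


# Made-up cast list: only the first 10 entries are considered.
cast = ["male"] * 6 + ["female"] * 4 + ["female", "female"]
print(female_cast_share(cast))
```

Because this counts people rather than seconds on screen, it approximates the on-screen time measured from video but is not the same quantity.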
September 2018
Measuring racial and gender diversity in films using AI
Race/ethnicity and gender composition of film cast and crew
Source: USC Annenberg Inclusion Initiative, compared with Demografy
Method: manual data compilation from open sources, compared with race/ethnicity and gender detection in names using ML
Charts: cast by race/ethnicity (White American, African American, Hispanic, Asian American, Native American, Middle Eastern, Other); directors by race/ethnicity (non-Black and non-Asian, Asian American, African American); by gender (cast, producers, writers, directors)
Building on our previous case study of gender bias in films, we decided to measure more demographic indicators using our technology. In this case study we measured the gender composition of different roles in the film industry as well as the racial and ethnic composition of actors in top-grossing US movies. We compared our results with USC Annenberg’s 2015 report since it contains comprehensive and detailed data on the measured demographic compositions.
About the Annenberg Inclusion Initiative
The Annenberg Inclusion Initiative (AII) is the leading think tank in the world studying diversity and inclusion in entertainment through original research and sponsored projects. Beyond research, it develops targeted, research-based solutions to tackle inequality.
What data is measured
We measured gender and racial demographics for the following data:
Gender of directors, producers, writers and actors in the top-grossing US movies of 2015
Race and ethnicity of actors and directors in the top-grossing US movies of 2015