The New York Times published the names of 1,000 people (out of more than 100,000) who died from COVID-19 in the US. It is a sad publication featuring the name, age, short bio and state of residence of each person, compiled from numerous publications across the country. Even sadder is the fact that this is less than 1% of all victims. The print edition of the newspaper dedicated several full pages to the names of the victims.
While it is a tragic tribute to COVID-19 victims, many researchers fighting COVID-19 are interested in individual, case-by-case data, because such data can be fed into machine learning algorithms designed to fight COVID-19 in one way or another. Access to case data opens up a variety of potentially vital opportunities:
Get demographic insights to see the most affected demographics
Analyze location data to see the most affected geographies
Analyze medical data (conditions, treatment, test results, x-rays, etc.) to identify contributing factors
Yet such case data is almost absent from the public domain, leaving a large gap in the opportunity to leverage the power of citizen science and the large number of data researchers worldwide.
Goal of the research
The NYT list of 1,000 names is not a solution to this problem, of course. It is a relatively small and incomplete data set of cases (e.g. it doesn't contain useful medical information). However, it does contain names, so we decided to conduct a case study measuring the demographics of COVID-19 victims, as a proof of concept of how Demografy can serve as one step in using AI to combat COVID-19: obtaining richer demographic data for each case.
For the present case study, we used the online version of the New York Times publication, which contains a data set of 1,000 names of US COVID-19 victims. Unfortunately, the web page displaying the data uses the HTML5 canvas element, so the data can't be copied directly by trivial automated means. As a result, we compiled the data set manually.
Data for each person contains the following information:
Full name
Age
State
City or county (optional)
Short bio
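As an illustration only, the fields listed above could be held in a small record type once the manual transcription is saved as CSV. This is a minimal sketch, not the format actually used in the study: the class name, the column names (full_name, age, state, locality, bio) and the read_records helper are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VictimRecord:
    """One transcribed entry. Field names are hypothetical."""
    full_name: str
    age: int
    state: str
    locality: Optional[str]  # city or county; None when not given
    bio: str


def read_records(csv_text: str) -> List[VictimRecord]:
    """Parse CSV text with a header row into VictimRecord objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Treat an empty or missing locality cell as "not provided".
        locality = (row.get("locality") or "").strip() or None
        records.append(
            VictimRecord(
                full_name=row["full_name"].strip(),
                age=int(row["age"]),
                state=row["state"].strip(),
                locality=locality,
                bio=row["bio"].strip(),
            )
        )
    return records
```

Keeping the city or county as an optional field mirrors the fact that not every entry includes it.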
After manually collecting all the data in tabular format, we cleaned it using a program written for this purpose. The program normalizes names by removing middle names, initials and abbreviations such as "Jr." or "Sr.", leaving the first and last names as separate fields. Demografy was then used to classify demographic indicators. Demografy uses supervised machine learning algorithms to classify gender, age group, race, Hispanic origin and ethnicity from first and last names.
More details about the methodology are available on request.
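The name normalization step described above can be sketched roughly as follows. This is not the actual cleaning program, which is not published here; the function name normalize_name, the suffix list and the single-letter rule for initials are assumptions chosen for illustration.

```python
from typing import Tuple

# Generational suffixes to drop. The list is an assumption; the source
# only names "Jr." and "Sr." as examples.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(full_name: str) -> Tuple[str, str]:
    """Return (first_name, last_name), dropping middle names,
    single-letter initials and generational suffixes."""
    # Split on whitespace and strip trailing/leading dots and commas.
    tokens = [t.strip(".,") for t in full_name.split()]
    kept = [
        t for t in tokens
        if t                                # skip empty leftovers
        and len(t) > 1                      # skip initials such as "A."
        and t.lower() not in SUFFIXES       # skip "Jr.", "Sr.", ...
    ]
    if not kept:
        return ("", "")
    if len(kept) == 1:
        return (kept[0], "")
    # Everything between the first and last token is treated as a middle name.
    return (kept[0], kept[-1])
```

A sketch this simple has obvious limits: multi-word surnames (for example ones beginning with "de la") would be reduced to their final word, so a real cleaning step would likely need extra rules for such cases.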