December 10, 2019

What Should Our PhD in Data Science Imply?

As we prepare to offer a PhD degree in data science, a highly interdisciplinary field encompassing every imaginable discipline, several questions arise. What does such a degree imply? What are employers expecting? What will a faculty actually trained in data science look like? We are contemplating all of these questions as the program comes together. Here I focus not so much on the nuts and bolts of such a program as on the philosophy: what do we wish to accomplish? Philosophy is in the name of the degree, after all.

At a recent School of Data Science (SDS) search committee meeting we were discussing the qualifications needed for an Associate Dean of Academic & Faculty Affairs. We listed a PhD in data science. After a brief pause there was laughter. Does anyone in the world actually have a PhD in data science? A web search turns up very few universities, in the US at least, that offer a PhD exclusively in data science. Many universities offer a data science specialization as part of a broader degree, for example in biomedical informatics, business analytics or computer science. Currently, the US universities offering a “pure” PhD in data science appear to be:

The structure of these programs is not dissimilar to that of programs in other fields, although expected durations vary. Here is an approximation:

Years 1-2 – core courses, electives and research rotations
End of year 2 – qualifying exam to assess research potential
End of year 2 – identify research advisor(s)
Year 3 onwards – research project with advisor(s)
Year 3 – proposition exam of research topic
End of year 4 – graduate

Standard stuff. More importantly, what do these programs, and what do we, want our graduates to have accomplished when they walk away with that degree? How will our PhD graduates in data science with a specialization in x differ from graduates in x with a specialization in data science? By analogy in public health, what distinguishes a PhD in Public Health from a DrPH (Doctor of Public Health)? The former is designed for research-based contributions to the field; the latter is for leadership roles in practice-based settings (e.g., health department director, health officer). But even there the difference is shaky.

Deep research in some aspect of data science may or may not already belong to an existing field. For example, contributions to the fundamentals of deep learning most likely already belong to computer science and engineering; fundamental contributions in cloud computing belong to systems engineering or elsewhere, and so on. On the other hand, where does data ethics currently belong? Where will it belong in the future, perhaps in the social sciences, perhaps not? In short, since data science is a composite of existing disciplines (statistics, computer science, applied mathematics and associated domains), would not a PhD represent a deeper study across all these disciplines than one would experience in a master’s degree, including a deeper dive into a domain area of specialization, with increased emphasis placed on developing the research method?

Another way of thinking about our PhD graduates is as Π (pi) shaped rather than T shaped. Both shapes have broad expertise, but where the T has a single deep dive into a specific domain area, the Π has two: one into a specific domain area and one into data science.

If so, what should a PhD graduate in data science look like to an employer, either in academia or the private sector? We asked this question of our SDS advisory board, a distinguished group of private sector experts. We couch their response, as well as our own knowledge of a career in academia, in terms we are using to operationalize our data science school.

Value – determining the value of the research we do, accounting for the natural tensions between social good and business practice. Example PhD training areas:

Ethics for Data Science
Privacy and regulatory concerns with protected information
Sociology and its intersection with data science

Design – the ability to both consume data and produce data products of the highest value. Example PhD training areas:

Human computer interaction
Data representation and manipulation – e.g., metadata, ontologies
Data characteristics – e.g., sparsity, high dimensionality, complexity
Data visualization
Study design

Systems – infrastructure and architectures to support big data/data science. Example PhD training areas:

Cybersecurity
Databases
Cloud & distributed computing
Sensors
Algorithms and data structures
Signal processing

Analytics – statistical and machine learning theory and its application to analyze, infer, simulate and predict. Example PhD training areas:

Theory
- Statistics – inference, Bayesian, multivariate
- Probability
- Graphs and networks
- Linear algebra & linear models
- Game theory, decision theory
Application
- Deep learning/neural networks
- Natural language processing

Practice – brings all of the above together in the form of research in one or more specific domain areas, for example, biomedical data sciences, digital humanities, finance. Whatever the domain area, there is a need for professional development in the areas that are so important to a successful research career:

Communication – written and verbal
Study design
Time management
The art of academia – effective grant and paper writing, collaboration, personnel management, etc.

Finally, there is training in the guiding principles by which we are establishing the school, namely:

Interdisciplinarity – comfort in multiple traditional disciplines
Provision of open knowledge in all we do – written articles, data accessibility and usability, software availability
Reproducibility – a prerequisite to the provision of open knowledge
Diversity, equality, inclusion – with respect to all with whom we work
Innovation & translation – research that makes a difference

At this point in time, no single mentor is likely to meet the complete needs of our PhD students. Research is best conducted through dual mentorship, pairing an expert in data science with a domain expert. This model is known to have worked well in other emergent interdisciplinary fields, for example bioinformatics. Research rotations allow the student to explore domains and research topics before homing in on a specific career direction.

Conferring a PhD in data science will be an experiment. How do you evaluate a PhD degree in predictive urban modeling vs radiological image analysis vs real-time stock market analysis? What are the comparative rubrics? Should there be comparable rubrics for such diverse domains? Should there be a written thesis to define original work in all domains? Are high-quality data and software a partial or complete substitute for a thesis? The first students, like those at New York University, are brave souls. Or are they? Industry tells us it wants researchers in data science with a depth of research experience beyond the capstone project found in an MS degree, and professors are needed now to teach the next generation of data scientists. Those brave souls will be in high demand.
