Datacast Episode 51: Research and Tooling for Computer Vision Systems with Jason Corso
The 51st episode of Datacast is my conversation with Professor Jason Corso — the new director of the Stevens Institute for Artificial Intelligence and the co-founder/CEO of Voxel51. Give it a listen to hear about his wide-ranging computer vision research in image registration, medical imaging, visual segmentation, video understanding, and robotics; his courses at SUNY Buffalo and the University of Michigan; his startup Voxel51, which builds dataset analysis tools; his views on doing good research; common threads between professorship and entrepreneurship; and much more.
Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) Stitcher, (5) iHeart Radio, (6) Radio Public, (7) Breaker, and (8) TuneIn
Key Takeaways
Here are the highlights from my conversation with Professor Corso:
On Studying Computer Science at Loyola College
As an undergraduate, I was responsible for thinking critically about both liberal arts and computer science topics. For someone like me with broad interests, this was quite rewarding, as it led me to focus on analytical problem solving and logical reasoning across widely different topics.
I had close relationships with nearly all of my professors during college, which allowed me to get into research pretty early. During the summer of my sophomore year, I worked with a computer graphics professor on developing mathematical models for image registration.
On Doing a Ph.D. at Johns Hopkins
I shifted from computer graphics to computer vision research, primarily because I was more interested in the applied mathematics underneath vision than in graphics — in the sense that I could quantify and analyze performance from a systems point of view.
My underlying thesis created a shared perceptual space between the computer vision system and the human user to enrich the environment with things like augmented reality and multitask gestures.
On Tackling Brain Cancer at UCLA
This was my introduction to visual segmentation. We studied a specific invasive type of brain tumor called glioblastoma multiforme (GBM). From a scientific point of view, can we measure the active tumor's size and the amount of swelling from one scan to the next? And can we use those measurements as prognostic indicators of tumor growth and the patient’s longevity?
We developed a set of methods called segmentation by weighted aggregation, which builds a graph-based representation of the MRI images.
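As a rough illustration of the graph-based starting point, not the full weighted-aggregation algorithm (whose multilevel coarsening is more involved), neighboring voxels are typically linked by an affinity that decays with their intensity difference. A minimal sketch, with `beta` as an illustrative parameter:

```python
import numpy as np

# Sketch: each MRI voxel becomes a graph node; neighboring voxels are joined
# by an edge whose weight (affinity) shrinks as their intensities differ.
# Segmentation by weighted aggregation then recursively coarsens this graph,
# merging strongly coupled nodes into regions such as tumor and edema.
def affinity(intensity_a, intensity_b, beta=10.0):
    """Edge weight between two neighboring voxels (beta is illustrative)."""
    return np.exp(-beta * abs(intensity_a - intensity_b))
```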
On Teaching Bayesian Vision and Pattern Recognition at SUNY Buffalo
My “Intro to Pattern Recognition” is a core Machine Learning course: what can you do from a computational perspective, backed by mathematics? We would go from the history of discriminant functions in the 1950s all the way to support vector machines and multi-layer perceptrons.
My “Bayesian Vision” is an advanced Computer Vision course: if you want to pull information out of images that is understandable by humans, what methods should you study? We covered generative and graph-based models like Markov and Conditional Random Fields, which complement contemporary statistics-based methods.
On Image Understanding Research
I created the term “Generalized Image Understanding” while writing my NSF CAREER proposal. Today, we would call that image captioning. I called the relationships between perception, semantics, and physical constraints cognitive system entanglement. When you want to solve a vision understanding problem like image classification, you make tacit assumptions about the semantics, the language, and the operating environment. My research for the last 15 years has been motivated by not wanting to make tacit assumptions about those three pillars of the entanglement.
The proposed methodology in this 2007 work was about the relationship between things that can be extracted at the low-level (physical environment and semantic constraints), at the mid-level (graphical structure over the images), and at the high-level (driven by human eyes). The ultimate goal was to generate language outputs from the system.
On Video Understanding Research
ISTARE is a system that can understand human actions and how such actions relate to objects in the scene. We built graph-based models over the 3D space-time domain to understand the flow of information over time. Modern deep learning methods require a great deal of data to understand temporal evolution. On the other hand, if we can extract a hierarchy from a graph-based representation, it’s natural to relate something 200 frames from now to the current time.
Action Bank is a bridge between classical and deep learning ideas for the activity recognition problem. Instead of looking for interesting feature points, we created action spaces by manually selecting exemplar video snippets from the training data. We would embed a novel video in each action space and use an off-the-shelf classifier to determine which action it was. This 2012 CVPR paper was the first to operate on a large dataset like UCF-50.
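A minimal sketch of that pipeline, assuming a hypothetical `respond` function that stands in for Action Bank's actual detector (in the paper, a max-pooled spatiotemporal correlation between a template clip and the input video):

```python
import numpy as np
from sklearn.svm import LinearSVC

def bank_embedding(video, bank, respond):
    """Represent a video as its vector of responses to every bank template.

    `respond(video, template)` is a hypothetical stand-in for the paper's
    action detectors; the resulting vector is the "action space" embedding.
    """
    return np.array([respond(video, template) for template in bank])

def train_action_classifier(embeddings, labels):
    """The off-the-shelf classifier step, here a linear SVM from scikit-learn."""
    clf = LinearSVC()
    clf.fit(embeddings, labels)
    return clf
```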
Most of the time, in computer vision, we think of videos as sequences of frames. A voxel is a volume element in a space-time video, whereas a super-voxel is a collection of those voxels that has some notion of meaningfulness in a signal sense. LIBSVX is our library of super-voxel and video segmentation methods, coupled with a principled evaluation benchmark based on quantitative 3D criteria for good super-voxels.
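In code terms, the data layout looks roughly like this (the random labels below are purely illustrative; LIBSVX's methods compute meaningful ones):

```python
import numpy as np

# A video as a space-time volume of shape (frames, height, width);
# each element of the array is a voxel.
video = np.random.rand(30, 120, 160)

# A super-voxel segmentation assigns every voxel an integer region id;
# a super-voxel is simply the set of voxels sharing one id.
labels = np.random.randint(0, 50, size=video.shape)

# All voxels belonging to super-voxel 7, e.g., for region statistics:
region = video[labels == 7]
print(region.size, region.mean())
```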
On Teaching Computer Vision at the University of Michigan
My “Foundations of Computer Vision” course is an introductory graduate-level course in computer vision. As computer vision is a broad research domain, it’s often hard to map its distinct topics to their applications. I tried to organize the information and distill it into a more digestible form. There are two halves of the course:
In the first half, we talked about representing visual data. What can we do using images as functions? How about images as points (coefficients living in vector spaces)? And images as graphs?
In the second half, we looked at end-to-end case studies: reducing one type of image to another type of image, extracting representations from images, and matching representations across different images.
My “Advanced Topics in Computer Vision” follows a more classical seminar format and discusses what’s hot right now. We read papers, picked apart their details, and implemented them as projects.
On Developing BubbleNets
BubbleNets is research done jointly with Brent Griffin, a research scientist at Michigan. He was looking at ways to drive robots' manipulation of complicated objects by segmentation from a video feed. As a proxy to that motivation, we looked at the video object segmentation literature, where we attempted to segment the dominant moving objects in the video. Or, more recently, given a human segmentation of an object in the first frame, could we propagate that segment through the whole video?
We observed that the first video frame is not always the best one to annotate in practice. BubbleNets basically said: if we can compare the relative goodness of any two frames, without knowing which object will be labeled, then sorting the frames with the bubble-sort algorithm surfaces the best one to annotate.
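A minimal sketch of that deep-sorting idea, where `prefer` is a hypothetical stand-in for the learned BubbleNets comparison network:

```python
def select_annotation_frame(frames, prefer):
    """Bubble-sort frames with a learned pairwise comparator.

    `prefer(a, b)` stands in for the BubbleNets network: it returns True
    when annotating frame `a` is predicted to yield better video object
    segmentation than annotating frame `b`.
    """
    frames = list(frames)
    n = len(frames)
    for i in range(n):
        for j in range(n - 1 - i):
            # Swap adjacent frames whenever the comparator predicts the
            # later frame is the better one to annotate.
            if prefer(frames[j + 1], frames[j]):
                frames[j], frames[j + 1] = frames[j + 1], frames[j]
    return frames[-1]  # the predicted best frame bubbles to the end
```

The key point is that only relative comparisons between frames are needed; no absolute quality score per frame is ever predicted.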
On Voxel51’s Inception
My co-founder, Brian Moore, was a Ph.D. student at Michigan in 2014. He took my Computer Vision course, the first one that I taught at Michigan. He was always asking and answering questions, and we built our relationship from there.
I was getting exhausted with writing papers. What impact would they have? Thus, my motivation for founding Voxel51 is to bring my research into the world as a product, garner impact, and touch society somehow.
Initially, we founded Voxel51 as an LLC funded by an NIST grant. It was great to start a company with dilution-free money. The output of that grant was an open-source toolkit called ETA (Extensible Toolkit for Analytics), a general-purpose library that supports building computer vision systems.
A couple of years later, we decided to raise a VC round. Ultimately, we landed on a flagship product called FiftyOne.
On Lessons Learned Building FiftyOne
In academic computer vision research, you assume that you are given a dataset, and your paper is about what you do on that dataset. No one gets much credit for the dataset. That’s a problem because, in industry, you don’t have the dataset to start with. You really need to learn how to build the dataset. Another huge observation we had is that there were not many dataset analysis tools out there.
Data science is a field of cowboys and cowgirls. We love to control our environment. I think all the end-to-end platforms launched in the past few years will deconstruct into tools. We decided to build a tool that can handle the dataset problem. FiftyOne is an open-source tool that enables you to ask questions about your data quickly.
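For a flavor of asking questions about your data quickly, here is a small example assuming the pip-installable fiftyone package and its bundled quickstart sample dataset (the exact API may vary across versions; see the project docs):

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load a small sample dataset that ships with the library.
dataset = foz.load_zoo_dataset("quickstart")

# Ask a question about the data: which predictions are low-confidence?
# Views are computed on demand and never modify the underlying dataset.
low_conf = dataset.filter_labels("predictions", F("confidence") < 0.3)

# Browse the matching samples interactively in the FiftyOne App.
session = fo.launch_app(low_conf)
```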
How did we design the product?
We failed fast and made a lot of mistakes. With an agile process, we set goals and gave the developers full freedom.
We learned that engineers are not good designers. We thus built contractual relationships with highly skilled designers.
Open-source is a requirement in AI/ML to build the credibility of the tool.
At the beginning of FiftyOne, we iterated on the tool many times with friendly users we had worked with in the past. By the time we launched a few months later, we already had a dozen people who had played with the code, touched the tool, and understood its potential value.
When people request help, you have to answer them quickly. Otherwise, you’ll lose their attention.
On Being A Director For Stevens Institute for AI
I have focused mostly on Computer Vision in my career, but I have been thinking increasingly about broader AI issues. I’m motivated to have a voice on AI’s limitations, data ethics, and responsible research.
With this Institute at Stevens, my vision is to support social AI both from a technical point of view and through engagement with socially relevant issues. We believe that AI will empower every human being to live a better life, have a bigger impact, and make better decisions.
I can see how everything I have learned in the past 20 years and every connection I have built will set an exceptional foundation to build the institute into an important player in the future development of modern AI.
On Being A Professor vs. Being A Founder
The first similarity is creation:
As a professor, you create research outputs, the culture in your lab, and the students whom you mentor.
As a founder, you create the company culture, the employees, and the product decisions.
The second similarity is empathy:
As a professor, you need to listen to colleagues and students about what’s happening in the problem space to guide your research.
As a founder, you need to listen to your customers and employees to guide your company's directions.
The major difference is the goal:
As a technical founder, I learned a lot about lean product development and the importance of focus on a single product.
As an academic, I could move between different projects.
Delivering a paper is very different from delivering a product that someone would download and use.
On Advice For Aspiring AI Researchers
The best way to get citations is to release code with your papers. When you release code, make sure it works as described in the paper.
The incentives are sometimes not well-aligned with good research. Researchers fight for deadlines rather than look at the bigger problems. It would pay off in the long game to look at those problems that require you to understand and analyze over multiple years.
On Trends in Computer Vision Research
When I started in video 10 years ago, it was not a popular research area because the time to paper was longer. Now I see video everywhere. In fact, the most popular feature request for Voxel51 is video support.
Learning with less data (self-supervision, transfer learning) is growing and is a promising problem space to be in.
The direction towards reproducibility and accountability for publications is encouraging.
Show Notes
(2:13) Jason went over his experience studying Computer Science as an undergraduate at Loyola College in Baltimore, where he got early exposure to academic research in image registration.
(4:31) Jason described his graduate school experience at Johns Hopkins University, where he completed his Ph.D. on “Techniques for Vision-Based Human-Computer Interaction,” which proposed the Visual Interaction Cues paradigm.
(9:31) During his time as a Post-Doc Fellow at UCLA, Jason helped develop automatic segmentation and recognition techniques for brain tumors to improve the accuracy of diagnosis and treatment.
(14:27) From 2007 to 2014, Jason was a professor in the Computer Science and Engineering department at SUNY-Buffalo. He covered the content of two graduate-level courses on Bayesian Vision and Intro to Pattern Recognition that he taught.
(18:20) On the topic of metric learning, Jason proposed an approach to data analysis and modeling for computer vision called “Active Clustering.”
(21:35) On the topic of image understanding, Jason created Generalized Image Understanding — a project that examined a unified methodology that integrates low-, mid-, and high-level elements for visual inference (equivalent to image captioning today).
(24:51) On the topic of video understanding, Jason worked on ISTARE: Intelligent Spatio-Temporal Activity Reasoning Engine, whose objective is to represent, learn, recognize, and reason over activities in persistent surveillance videos.
(27:46) Jason dissected Action Bank — a high-level representation of video activity, which comprises many individual action detectors sampled broadly in semantic space and viewpoint space.
(35:30) Jason unpacked LIBSVX — a library of super-voxel and video segmentation methods coupled with a principled evaluation benchmark based on quantitative 3D criteria for good super-voxels.
(40:06) Jason gave an overview of AI research activities at the University of Michigan, where he was a professor of Electrical Engineering and Computer Science from 2014 to 2020.
(41:09) Jason covered the problems and projects in his graduate-level courses on Foundations of Computer Vision and Advanced Topics in Computer Vision at Michigan.
(44:56) Jason went over his recent research on video captioning and video description.
(47:03) Jason described his exciting software called BubbleNets, which chooses the best video frame for a human to annotate.
(51:44) Jason shared anecdotes of Voxel51’s inception and key takeaways that he has learned.
(01:05:25) Jason talked about Voxel51’s Physical Distancing Index, which tracks the global coronavirus pandemic’s impact on social behavior.
(01:07:47) Jason discussed his exciting new chapter as the new director of the Stevens Institute for Artificial Intelligence.
(01:11:28) Jason identified the differences and similarities between being a professor and being a founder.
(01:14:55) Jason gave his advice to individuals who want to make a dent in AI research.
(01:16:14) Jason mentioned the trends in computer vision research that he is most excited about at the moment.
(01:17:23) Closing segment.
His Contact Info
His Recommended Resources
Jeff Siskind (Professor at Purdue University)
CJ Taylor (Professor at the University of Pennsylvania)
Kristen Grauman (Professor at the University of Texas at Austin)
About the show
Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journey and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths, from scientists and analysts to founders and investors, to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.