Datacast Episode 54: Information Retrieval Research, Data Science For Space Missions, and Open-Source Software with Chris Mattmann

February 5, 2021 James Le

The 54th episode of Datacast is my conversation with Chris Mattmann, the Chief Technology and Innovation Officer at NASA JPL. Give it a listen to hear about his wide-ranging career: doing a Ph.D. in Software Architectures, being an Assistant Professor, and leading Information Retrieval & Data Science research at USC; developing Apache Tika and sitting on the board of the Apache Software Foundation; developing next-generation data systems that support space and earth science missions at NASA JPL; writing a book about TensorFlow; and growing up in Los Angeles.

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) iHeart Radio, (5) Radio Public, and (6) TuneIn

Key Takeaways

Here are the highlights from my interview with Chris Mattmann.

On Studying Computer Science at USC

USC was a lot of hard work — sitting in the computer lab learning Linux, which I knew nothing about at the time.

Data structures really kicked it off for me. I learned how to store state information using algorithms. The key for me was learning linear algebra, which is about transforming states using computation. That was a nice complement to computer science.

In my second year at USC, I got a JPL job and became interested in databases and data modeling.

On Working at JPL as an Undergraduate

Late one night, I sat in the computer lab and applied for jobs posted on the bulletin board. One role called for a database programmer to program earthquake databases. I was a decent enough programmer, so I got the job.

Within the first three weeks at JPL, my project got canceled. They put me on a few other earthquake projects with the scientists at Caltech: writing SQL queries for atmospheric science, putting files on disk for the scientists to search, etc.

On His Ph.D. Thesis about Software Architecture

I graduated with my bachelor’s and was working at JPL. I was motivated to get a Master’s degree to get additional income in the future.

The second class that I took for my Master’s was called “Software Engineering for Embedded Systems,” taught by Dr. Nenad Medvidović, a young professor just coming out of a top-notch software engineering lab at UCI. He inspired me to pursue research afterward.

At JPL, I was working a lot on designing data systems. I realized that software architects sometimes could not explain the design decisions they made to handle data dissemination. The contribution of my Ph.D. dissertation was the following:

I first came up with a representation of different data distribution use cases/scenarios.
Then I developed a suite of machine learning algorithms to predict, select, and visualize these scenarios with high confidence/accuracy.
The end goal was to replicate software design decisions for complicated data systems without the system designer's supervision.

I believe my work changed how we capture engineering data related to JPL and defined processes for evaluating data system design.

On Developing Apache Tika

Towards the end of my Ph.D., I took a class on search engines, one of the very few search engine classes across the US. For the final project, I used Nutch, an open-source framework by Doug Cutting (who invented Hadoop and Lucene), to create an RSS-parsing toolkit. Then, I became interested in the vibrant Apache open-source community and became a Nutch committer.

Fast forward to 2007: Nutch was re-architected based on the structure of MapReduce. Nutch was this big, grandiose web crawler, but inside it lay a distributed file system, a distributed computational platform, a user interface of ranking and scoring methods, and even a content detection framework. Jérôme Charron and I pitched the idea of Tika as a separate content detector from Nutch. Later on, Jukka Zitting came on board to push Tika to the finish line. The first generation of Tika got into financial institutions like Equifax and FICO.
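
Content detection here means identifying a file's type from its bytes rather than trusting its name. Below is a minimal Python sketch of one common technique, matching a file's leading "magic bytes" against known signatures. The function name `detect_mime_type` and the small signature table are illustrative assumptions. Tika itself is a Java library that combines several detection strategies, so this is not its implementation.

```python
# Illustrative sketch only: a tiny magic-byte detector, not Tika's code.
# The signatures below are well-known file headers; the table is a small
# hypothetical subset chosen for the example.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]

DEFAULT_TYPE = "application/octet-stream"


def detect_mime_type(data: bytes) -> str:
    """Guess a MIME type from the first bytes of a file."""
    if not data:
        return DEFAULT_TYPE

    # First pass: compare the start of the data with each known signature.
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type

    # Fallback: treat anything that decodes as UTF-8 as plain text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return DEFAULT_TYPE
    return "text/plain"
```

In practice you would pass in the first few kilobytes of a file, for example `detect_mime_type(open(path, "rb").read(4096))`, where `path` is whatever file you want to inspect.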

In the second generation of Tika, I worked on the MEMEX project with DARPA to stop human trafficking, using search engines that could mine the dark web for information. As we improved Tika to support multimedia formats, a journalist-programmer used the framework to analyze the Panama Papers data leak.

On Teaching at USC

“Software Architectures”: My advisor Neno created that class, where most of the content came from his book “Software Architecture: Foundations, Theory, and Practice.” I helped to teach it after graduating from USC.

“Information Retrieval and Web Search Engines”: My former teacher Ellis Horowitz created that class, and I helped teach it across different semesters. It taught students the foundations of search engines. Given my systems background, I focused on technologies like Lucene, Solr, Nutch, and Elasticsearch.

“Content Detection and Analysis for Big Data”: We spent a lot of time talking about big data: the 5Vs, data mash-up, etc. The first assignment was about data enrichment, adding more features to the dataset. The second assignment was about content extraction at scale, generating structured data from unstructured data. The third assignment was about visualization and communication of your data science. I have taught this class using UFO data, polar data, job data in South America, etc.

On Leading the Information Retrieval and Data Science Group at USC

At the IRDS group, we have trained 40 Master’s and 36 undergraduate students as research assistants, along with 3 post-docs. NSF has funded us with a few big grants on polar cyberinfrastructure and open-source software. Today, we spend time working on Sparkler (the web crawler) and Dockerized ML capabilities for Tika. These projects serve as the research arm for what we finally operationalize at NASA.

The group is also a good pipeline for giving USC students a chance to partner with NASA projects. I helped fill half of the search engine team via this group.

On His NASA JPL Career

In the first 10 years, I worked in engineering and science, initially on data systems and then on missions. I worked on projects such as the Orbiting Carbon Observatory space mission and the Soil Moisture Active Passive earth science mission. I was also the computing lead for the Airborne Snow Observatory airborne mission.

After that, I went into technology development. From year 10 to 15, I built technology programs worth 60 to 70 million dollars with DARPA, NSF, and commercial industry.

From year 15 until now, I have been in the IT department, where my goal is to mature the people and the data science discipline at JPL. As the Deputy CTO and then the division manager reporting to the CIO, I have made sure that AI can be used for NASA’s missions (like creating robots that run on Mars's surface).

On The Apache Software Foundation

The Apache Software Foundation is a 501(c)(3) non-profit organization with a billion-dollar valuation. Tools like Hadoop and Spark would not exist without it.

In the last decade, being on the Apache board was like being plugged into everything important in software — projects starting up, how companies are using them, big decisions on open-source licensing, etc.

What excites me about open-source software nowadays: (1) MLOps and open-source ML frameworks developed by big organizations (like TensorFlow), (2) the future of learning with fewer labels (zero-shot and one-shot learning, for instance), and (3) AutoML that automates data science workflows such as model development, selection, and evaluation.

On “A Vision For Data Science”

To get the best out of big data, I believe that four advancements are necessary:

Algorithm Integration: Methods for integrating diverse algorithms seamlessly into big-data architectures need to be found.
Development and Stewardship: Software development and archiving should be brought together under one roof.
Many Formats: Data reading must become automated among formats.
People Power: Ultimately, the interpretation of vast streams of scientific data will require a new breed of researcher equally familiar with science and advanced computing.

On Writing “Machine Learning with TensorFlow”

While writing the book, the biggest challenge for me was getting access to systems that let me go from training models on my laptop to distributed training.

Compared to the first edition, here are my core contributions:

Updating code to the latest TensorFlow version 2.3
Recognizing the importance of data preparation

On Differences Between Academia and Industry

Deadline: In academia, deadlines are soft and squishy. In industry, they are not. You have to meet deliverables.
Value: In academia, you invest in yourself. In industry, you need to generate value for the organization.
Discipline: In academia, you need a little bit of discipline. In industry, discipline is everything: you follow routines and contribute at various levels.

On The Tech Community in Los Angeles

The tech scene here is dispersed.

The whole Silicon Beach thing consists of entertainment and aerospace engineering companies, coupled with software technology. We have many engineering and business innovators, ranging from big institutions like NASA to universities like USC and Caltech. Furthermore, we put back into the city what we get out of it.

Timestamps

(2:55) Chris went over his experience studying Computer Science as an undergraduate at the University of Southern California in the late 90s.
(5:26) Chris recalled working as a Software Engineer at NASA Jet Propulsion Lab in his sophomore year at USC.
(9:54) Chris continued his education at USC with an M.S. and then a Ph.D. in Computer Science. Under the guidance of Dr. Nenad Medvidović, his Ph.D. thesis is called “Software Connectors For Highly-Distributed And Voluminous Data-Intensive Systems.” He proposed DISCO, a software architecture-based systematic framework for selecting software connectors based on eight key dimensions of data distribution.
(16:28) Towards the end of his Ph.D., Chris started getting involved with the Apache Software Foundation. More specifically, he developed the original proposal and plan for Apache Tika (a content detection and analysis toolkit) in collaboration with Jérôme Charron. Tika was later used to extract data from the Panama Papers, exposing how wealthy individuals exploited offshore tax regimes.
(24:58) Chris discussed his process of writing “Tika In Action,” which he co-authored with Jukka Zitting in 2011.
(27:01) Since 2007, Chris has been a professor in the Department of Computer Science at USC Viterbi School of Engineering. He went over the principles covered in his course titled “Software Architectures.”
(29:49) Chris touched on the core concepts and practical exercises that students could gain from his course “Information Retrieval and Web Search Engines.”
(32:10) Chris continued with his advanced course called “Content Detection and Analysis for Big Data” in recent years (check out this USC article).
(36:31) Chris also served as the Director of the USC’s Information Retrieval and Data Science group, whose mission is to research and develop new methodology and open source software to analyze, ingest, process, and manage Big Data and turn it into information.
(41:07) Chris unpacked the evolution of his career at NASA JPL: Member of Technical Staff -> Senior Software Architect -> Principal Data Scientist -> Deputy Chief Technology and Innovation Officer -> Division Manager for the AI, Analytics, and Innovation team.
(44:32) Chris dove deep into MEMEX, a JPL project that aims to develop software that extends online search capabilities to the deep web, the dark web, and nontraditional content.
(48:03) Chris briefly touched on XDATA, a JPL research effort to develop new computational techniques and open-source software tools to process and analyze big data.
(52:23) Chris described his work on the Object-Oriented Data Technology platform, an open-source data management system originally developed by NASA JPL and then donated to the Apache Software Foundation.
(55:22) Chris shared the scientific challenges and engineering requirements associated with developing the next generation of reusable science data processing systems for NASA’s Orbiting Carbon Observatory space mission and the Soil Moisture Active Passive earth science mission.
(01:01:05) Chris talked about his work on NASA’s Machine Learning-based Analytics for Autonomous Rover Systems — which consists of two novel capabilities for future Mars rovers (Drive-By Science and Energy-Optimal Autonomous Navigation).
(01:04:24) Chris quantified the Apache Software Foundation's impact on the software industry in the past decade and discussed trends in open-source software development.
(01:07:15) Chris unpacked his 2013 Nature article called “A vision for data science” — in which he argued that four advancements are necessary to get the best out of big data: algorithm integration, development and stewardship, diverse data formats, and people power.
(01:11:54) Chris revealed the challenges of writing the second edition of “Machine Learning with TensorFlow,” a technical book with Manning that teaches the foundational concepts of machine learning and the TensorFlow library's usage to build powerful models rapidly.
(01:15:04) Chris mentioned the differences between working in academia and industry.
(01:16:20) Chris described the tech and data community in the greater Los Angeles area.
(01:18:30) Closing segment.

Machine Learning with TensorFlow

His Contact Info

His Recommended Resources

Doug Cutting (Founder of Lucene and Hadoop)
Hilary Mason (Ex Data Scientist at bit.ly and Cloudera)
Jukka Zitting (Staff Software Engineer at Google)
“The One Minute Manager” (by Ken Blanchard and Spencer Johnson)

This is the 40% discount code that is good for all Manning’s products in all formats: poddcast19.

About the show

Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journey and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths, from scientists and analysts to founders and investors, to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.