Datacast Episode 29: From Bioinformatics to Natural Language Processing with Leonard Apeltsin
Datacast’s 29th episode is my conversation with Leonard Apeltsin, a research fellow at the Berkeley Institute for Data Science. Listen to dig into his background in Bioinformatics, his consulting experience for various startups in the Bay Area, his role as a co-founder and AI lead at Primer AI, his thoughts on the current NLP research realm, his resources on data science in healthcare, his upcoming book “Data Science Bookcamp”, and much more.
Dr. Leonard Apeltsin is a research fellow at the Berkeley Institute for Data Science. He holds a Ph.D. in Biomedical Informatics from UCSF and a BS in Biology and Computer Science from Carnegie Mellon University. Leonard was a Senior Data Scientist & Engineering Lead at Primer AI, a machine learning company that specializes in using advanced Natural Language Processing techniques to analyze terabytes of unstructured text data. As a founding team member, Leonard helped expand the Primer AI team from four employees to over 80 people. Outside of Data Science and ML, Leonard enjoys scuba diving, salsa dancing, and making short documentary films.
Show Notes
(2:18) Leonard discussed his undergraduate experience at Carnegie Mellon — where he studied Biology and Computer Science.
(5:10) Leonard decided to pursue a Ph.D. in Bioinformatics at the University of California — San Francisco.
(6:27) Leonard described his Ph.D. research that focused on finding hidden patterns in genetically-linked diseases.
(9:42) Leonard went deep into clustering algorithms (Markov Clustering and Louvain) and their applications such as protein and news article similarity.
(13:21) Leonard shared his story of starting a data science consultancy with various client startups.
(17:58) Leonard discussed the interesting consulting projects that he worked on, from detecting plagiarism to predicting insurance bills.
(22:04) Leonard shared practical tips to learn technical concepts.
(23:23) Leonard reflected on his experience working with a string of startups including Accretive Health, Quid, and Stride Health.
(26:06) Leonard joined Primer AI as a founding team member back in early 2015. The startup applies state-of-the-art NLP techniques to build machines that read and write.
(30:31) Leonard discussed the technical challenges of developing algorithms that let Primer’s products scale to languages other than English.
(34:28) Leonard unpacked his technical post “Russian NLP” on Primer’s blog.
(38:17) Leonard talked about the advances in the NLP research domain that he is most excited about in 2020 (XLNet >>> BERT).
(41:10) Leonard discussed the challenges of scaling the data-driven culture across Primer AI as the company grows.
(46:20) Leonard mentioned different use cases of Primer for finance, government, and corporate clients.
(51:41) Leonard talked about his decision to leave Primer and become a Data Science Health Innovation Fellow at the Berkeley Institute for Data Science.
(54:30) Leonard went over applications of data science in healthcare that will be adopted widely in the next few years.
(1:02:45) Leonard discussed his process of writing a book called “Data Science Bookcamp.”
(1:07:21) Leonard revealed how he chose the case studies to be included in the book.
(1:10:27) Closing segment.
His Contact Info
His Recommended Resources
spaCy (Open-Source Library for Advanced NLP)
fastText (NLP library from Facebook)
XLNet: Generalized Autoregressive Pretraining for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Federated Learning with Differential Privacy: Algorithms and Performance Analysis
Differential Privacy-Enabled Federated Learning for Sensitive Health Data
Fitbit and Apple Watch
Walter Pitts, who invented neural networks
Paul Werbos, who invented back-propagation
Fei-Fei Li, who constructed the ImageNet dataset
“The Signal and The Noise” by Nate Silver
You can read the completed chapters of “Data Science Bookcamp” on the Manning Website:
Permanent discount code: poddcast19
5 free eBook codes: dcdsprf-B373, dcdsprf-CA3B, dcdsprf-299E, dcdsprf-6E5, and dcdsprf-9660 (activated and will last for 2 months)