Datacast Episode 42: Privacy-Preserving NLP with Patricia Thaine

The 42nd episode of Datacast is my conversation with Patricia Thaine — CEO of Private AI. Give it a listen to hear about her educational background in liberal arts and English literature, her transition to studying Computational Linguistics, her Ph.D. research in privacy-preserving NLP at the University of Toronto, her work with Private AI and the Vector Institute, her thoughts on privacy integration into NLP and Speech Recognition applications, and much more.

Patricia Thaine is a Computer Science Ph.D. Candidate at the University of Toronto and a Postgraduate Affiliate at the Vector Institute researching privacy-preserving natural language processing, with a focus on applied cryptography. Her research interests also include computational methods for lost language decipherment.

Listen to the show on: (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) Stitcher, (5) Overcast, and (6) Breaker!

Key Takeaways

Below are highlights from my conversation with Patricia:

On Her Undergraduate Education

  • The liberal arts program at John Abbott College helped me gain a better understanding of how the world works and hone my writing and critical thinking skills.

  • I then studied English Literature at Concordia University, thinking it was my great love. After a year, though, I felt there were other things I could learn. I tried out International Relations and Philosophy, but neither captured the essence of how the world works.

  • I tried out Linguistics and really appreciated its pattern-matching structure. Then I came across something called Computational Linguistics that caught my interest, took my first programming class at McGill University, decided to switch majors, and eventually pursued a Master’s degree in Computational Linguistics at the University of Toronto.

On Being a Graduate Student at the University of Toronto

  • It has been a dream for me. During the initial interview with a professor who later became my advisor, he talked about the different projects that I could be involved with, including lost language decipherment and the analysis of ancient languages. That’s something I have always wanted to do, so I immediately accepted an offer to become his Master’s student.

  • Later on, my research shifted to writing systems, which are somewhat under-appreciated. I studied how to match sounds with particular characters in lost ancient languages and how to determine the syntax of languages, among other things.

On Privacy-Preserving Natural Language Processing and Speech Processing

  • The world is leaning towards laws that enforce strict privacy requirements (GDPR and CCPA). These laws set parameters around what you can and cannot do with data. A lot of the technology has not necessarily caught up to that (for companies to do what they want to do while staying compliant).

  • Research in privacy-preserving NLP is especially exciting and important because natural language contains the most sensitive data that we produce. I include speech processing in this category as well, considering that speech carries even more personal information than pure text (socio-economic background, education, gender, etc.).

  • Privacy goes hand-in-hand with security. Privacy ensures appropriate user access and confidentiality, making it easier to keep data secure and avoid leaks.

On Perfectly Privacy-Preserving AI Application

  • I wrote a guide on “Perfectly Privacy-Preserving AI” to showcase the different parts of the ML pipeline that people need to be careful about (concerning privacy) and how they fit together.

  • Federated learning brings computation to the devices where the data is collected. It needs to be combined with differential privacy, which allows us to make generalizations about the data rather than learn specific information about individuals. One example is adding differentially-private noise to the data to make the system more robust (see the sketch after this list).

  • On top of these, we can add secure multiparty computation, so that the resulting model weights can be combined with the weights of other models securely.

  • There has been less research on model privacy than on data privacy.
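To make the combination of federated learning, differential privacy, and secure aggregation described above more concrete, here is a minimal Python sketch (using NumPy) of how a device-side model update might be clipped and noised before being averaged. The function names, clipping norm, and noise level are illustrative assumptions chosen for exposition, not Private AI's implementation, and the plain averaging step stands in for what would be done with secure multiparty computation in practice.

```python
import numpy as np

def clip_and_add_noise(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a local model update to a fixed L2 norm, then add Gaussian noise.

    This follows the usual Gaussian-mechanism recipe in differentially
    private federated learning; clip_norm and noise_std here are
    illustrative, not calibrated to a specific (epsilon, delta) budget.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

def federated_average(local_updates):
    """Average the noised updates from all devices (plain FedAvg).

    In a real deployment, this sum would be computed with secure
    aggregation / multiparty computation so that no party ever sees
    an individual device's update in the clear.
    """
    return np.mean(local_updates, axis=0)

# Toy example: three devices each produce a local weight update.
rng = np.random.default_rng(42)
device_updates = [rng.normal(size=4) for _ in range(3)]
noised = [clip_and_add_noise(u, rng=rng) for u in device_updates]
global_update = federated_average(noised)
print(global_update)
```

Because the noise is added on-device before anything leaves the phone, the server only ever sees perturbed, aggregated updates rather than any individual's raw data.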

On Founding and Running Private AI

  • In academic research, I can afford to work on impractical and theoretical things that can potentially lead to future knowledge.

  • On the commercial side, the hardest thing for me is finding the right problems and building hand-crafted tools around them, drawing on the understanding gained in the academic realm.

  • There’s a massive gap in the market: there were no privacy-preserving tools for developers without a background in private machine learning. Private AI builds generalizable, easy-to-integrate tools to address that need.

  • Our primary use cases are (1) transferring sensitive datasets between different teams within organizations and (2) filtering dataset queries to reduce the amount of sensitive data being passed around.

  • Another interesting use case is direct integration into apps and browser extensions.

  • The most important thing about starting a business is to talk to a lot of people. Those conversations give you a sense of whether or not your hypothesis is sensible. It’s also best to build a prototype to make sure that people’s words match their actions.

On Being an Effective Researcher

  • I am a huge fan of combining topics, for example, privacy with NLP, or background in biology/healthcare with computer science.

  • Machine learning, on its own, is excellent for conducting in-depth theoretical research. Still, if you add domain knowledge and go deep into two main areas, you get an explosive combination that can lead to incredible outcomes.

Video: https://www.youtube.com/watch?v=bsW7sFWpBUg

Show Notes

  • (2:55) Patricia talked about her interest in learning languages and living in different cultures.

  • (4:05) Patricia talked about her experience volunteering as a translator at the International Network of Street Papers.

  • (5:00) Patricia studied Liberal Arts at John Abbott College, English Literature at Concordia University, and Computer Science and Linguistics at McGill University during her undergraduate years.

  • (8:06) Patricia worked as a Research Assistant at the McGill Language Development Lab, which studies how children learn different types of words and sentences.

  • (9:15) Patricia described her graduate school experience at the University of Toronto, where she researched lost language decipherment and writing systems.

  • (11:19) Patricia talked about MedStory, which is a text-oriented visual prototype built to support the complexity of medical narratives (spearheaded by Nicole Sultanum).

  • (12:35) Patricia explained her research paper, “Vowel and Consonant Classification through Spectral Decomposition.”

  • (15:29) Patricia unpacked her blog post, “Why is Privacy-Preserving NLP Important?”

  • (19:02) Patricia dissected her paper “Privacy-Preserving Character Language Modelling” that proposes a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption.

  • (21:13) Patricia wrote a two-part series called “Homomorphic Encryption for Beginners” (a toy illustration of homomorphic addition appears after these show notes).

  • (22:21) Patricia unwrapped her paper “Efficient Evaluation of Activation Functions over Encrypted Data” that shows how to represent the value of any function over a defined and bounded interval, given encrypted input data, without needing to decrypt any intermediate values before obtaining the function’s output.

  • (25:33) Patricia elaborated on her paper “Extracting Bark-Frequency Cepstral Coefficients from Encrypted Signals,” which claims that extracting spectral features from encrypted signals is the first step towards achieving secure end-to-end automatic speech recognition over encrypted data.

  • (27:38) Patricia explained why privacy is an essential attribute for speech recognition applications.

  • (29:53) Patricia discussed her comprehensive guide on “Perfectly Privacy-Preserving AI” which dives into the four pillars of perfectly privacy-preserving AI and outlines potential combinatorial solutions to satisfy all four pillars.

  • (37:53) Patricia shared her take on the differences between working in academic and commercial settings (she is the founder and CEO of Private AI).

  • (40:50) Patricia talked about Private AI’s GALATEA Anonymization Suite, which anonymizes data at the source and encrypts it using quantum-safe cryptography.

  • (45:05) Patricia emphasized the importance of talking to customers when building a commercial product.

  • (46:58) Patricia shared her experience as a Postgraduate Affiliate at Vector Institute, which works with institutions, industry, startups, incubators, and accelerators to advance AI research and drive its application, adoption, and commercialization across Canada.

  • (49:09) Patricia shared her advice for young researchers: go deep into at least two domains and combine that knowledge.

  • (50:30) Patricia shared her excitement for privacy and NLP research in the upcoming years.

  • (52:36) Closing segment.
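Several of the papers above, at (19:02), (21:13), and (22:21), build on homomorphic encryption, which makes it possible to compute on encrypted data without ever decrypting it. As a beginner-level illustration of additive homomorphism, the property that multiplying two ciphertexts yields an encryption of the sum of their plaintexts, here is a toy Paillier example in pure Python (3.9+ for math.lcm). The tiny hard-coded primes and all parameter choices are purely illustrative assumptions, not the schemes or parameters used in the papers discussed.

```python
import math
import random

def paillier_keygen(p=2357, q=2551):
    """Generate a toy Paillier keypair from small hard-coded primes.

    Real deployments use primes of 1024+ bits; these tiny values exist
    only to make the arithmetic easy to follow.
    """
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                      # standard simple choice of generator
    mu = pow(lam, -1, n)           # valid because g = n + 1
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    """Encrypt an integer m (0 <= m < n) under the public key."""
    n, g = pub
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    """Decrypt a ciphertext back to its integer plaintext."""
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    l = (x - 1) // n               # the "L" function from Paillier's scheme
    return (l * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
pub, priv = paillier_keygen()
c1, c2 = encrypt(pub, 17), encrypt(pub, 25)
c_sum = (c1 * c2) % (pub[0] ** 2)
print(decrypt(priv, c_sum))  # 42, obtained without ever decrypting c1 or c2
```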

Her Contact Info

Her Recommended Resources