Datacast Episode 64: Improving Access to High-Quality Data with Fabiana Clemente

The 64th episode of Datacast is my conversation with Fabiana Clemente, the Co-Founder and Chief Data Officer of YData, whose mission is to help companies and individuals become industry leaders by solving AI's true hidden secret: access to high-quality data.

Our wide-ranging conversation touches on her educational background in applied mathematics and data management, her time working as a developer building big data solutions, her foray into the data science universe, the genesis behind YData, synthetic data generation, differential privacy, model explainability, open-source as a strategy, and much more.

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) TuneIn, (5) RadioPublic, (6) Stitcher, and (7) Breaker.

Key Takeaways

Here are highlights from my conversation with Fabiana:

On Studying Applied Mathematics

College was an exciting experience, as I had the opportunity to touch on many different areas of mathematics. Some areas were more related to physics, for example; others were more related to statistics, which is applicable in a wider range of situations. I took classes in optimization, multivariate data analysis, applied statistics, and data science (even though no one called it “data science” at the time). It was interesting to get that sense of mixing the theoretical and applied sides of math. Overall, the experience gave me a broader scope of what I could do as a professional.

On Building IoT Solutions at Vodafone

This opportunity came about during my attendance at the first Web Summit in Lisbon. An architect from Vodafone reached out to me. They had a position for a data solutions architect, someone with experience building databases and extracting analytics from them. This was quite a jump for me, as I had always been a developer before that. Being an architect designing scalable solutions for IoT was definitely a big challenge.

In IoT, there are two different perspectives on how data can be used: in batch or in real time. So, in addition to the high data volume I was dealing with, I had to set up an infrastructure that could handle different requirements for how quickly insights had to be delivered. I was defining, from scratch, an architecture that (1) could cope with these requirements and (2) was scalable and flexible enough for new requirements in the future.

At the time, the big data architectures influencing my decision came from Spotify and Uber.

However, the main concern was not designing the architecture to cope with the requirements but convincing people to accept something new and different from the status quo (RDBMS and traditional storage systems).

This meant telling them that there were new open-source solutions to process data at the enterprise level. This also meant restructuring teams to accept and work with the new architecture.

On Getting Into Data Science

As a solution architect at Vodafone, I was building conceptually and making sure that any architecture could work with the other existing systems within the company. However, I was no longer developing myself, and I missed my time as a developer. I enjoyed learning new things and exploring them on my own. That led to my decision to work as a data scientist.

  • The first project I worked on was document information extraction. Given a receipt, how could I extract the number of items bought, the price of those items, etc., in an automated way? I started exploring the world of deep learning.

  • The second project was about human behavior, how to extract insights from people’s movements. I worked mostly with sensor data coming from mobile phones.

On Data Science vs. Business Intelligence

Data science can be considered a part of the BI scope, which is about using data to give business-specific insights. From a technical perspective:

  • BI requires knowledge about data warehousing, ETL pipelines, and visualization tools. BI sets the base for any company looking to get data-driven insights.

  • Data science is much more experimental, while building BI solutions has a very well-defined methodology. As a data scientist, you can explore the data as you wish. I don’t feel the same level of creativity on the BI side.

On Founding YData

During my time as a data scientist, I felt the pain of accessing high-quality data. At times, I had to wait for six months or more to have access to data just to build a proof-of-concept. That was not feasible at all. How can I get faster access to data without concerns about its sensitivity and privacy, the real blockers to data access? At the other end of the spectrum, if you are doing data science at a small company, you have access to almost everything. That made me a bit uncomfortable, because it would be easy to extract insights about the wrong person.

While talking with my co-founder, we understood that this was not a problem just in my reality but also in his. As we talked with other people, we saw that this was their reality as well. Data science teams were struggling with this. With these questions in mind, we started exploring the data privacy space to better understand what types of solutions could provide the privacy needed and enable data scientists to do their work in a timely manner. I dug into privacy-enhancing technologies such as federated learning, differential privacy, and, eventually, synthetic data.

Synthetic data is what I called friendly data science.

On Techniques to Generate Synthetic Data

Understanding that synthetic data was the way to go, we started exploring the techniques and algorithms that could make this possible. In the world of generative models, Bayesian networks, GANs, and LSTMs are all feasible for the job. But we definitely had another concern:

How can we make synthetic data with the same utility, fidelity, and quality as the original data, while still guaranteeing variability and privacy?

Out of the possible solutions, GANs appeared to be the best fit for the job. Of course, GANs are well-known for unstructured data such as images, audio, and text. However, there was a lack of research on using GANs for tabular and time-dependent data. The GAN concept of having two networks working against each other is an interesting way to ensure that the generated data has the quality, utility, and privacy needed.
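To make that adversarial idea concrete, here is a minimal, hypothetical sketch in PyTorch (not YData's actual code): a generator maps random noise to synthetic tabular rows, while a discriminator learns to separate real rows from generated ones. The column count and layer sizes are assumptions for illustration only.

    # Minimal sketch of the two-network setup for tabular data (illustrative only).
    import torch
    import torch.nn as nn

    N_FEATURES = 8    # number of numerically encoded columns (assumption)
    NOISE_DIM = 16    # size of the generator's random input (assumption)

    generator = nn.Sequential(
        nn.Linear(NOISE_DIM, 64), nn.ReLU(),
        nn.Linear(64, N_FEATURES),          # one synthetic row per noise vector
    )
    discriminator = nn.Sequential(
        nn.Linear(N_FEATURES, 64), nn.ReLU(),
        nn.Linear(64, 1),                   # logit: "real" vs "synthetic"
    )

    bce = nn.BCEWithLogitsLoss()
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    def train_step(real_rows: torch.Tensor):
        batch = real_rows.size(0)
        noise = torch.randn(batch, NOISE_DIM)
        fake_rows = generator(noise)

        # Discriminator: push real rows toward label 1, synthetic rows toward 0.
        d_loss = bce(discriminator(real_rows), torch.ones(batch, 1)) + \
                 bce(discriminator(fake_rows.detach()), torch.zeros(batch, 1))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator: try to make the discriminator label synthetic rows as real.
        g_loss = bce(discriminator(generator(noise)), torch.ones(batch, 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        return d_loss.item(), g_loss.item()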

On GANs for Synthetic Data Generation

Synthetic data for tabular and time-series data is a new subject with its own peculiarities. Data science teams, the target users of this kind of solution, need to be able to trust data generated using neural networks.

We decided to open-source our code with the main GAN architectures popular in the research, starting with an implementation of the vanilla GAN. However, the challenges are well known, such as mode collapse (where the generated data does not portray the real behavior of the original data) and high-dimensional data processing.

  • Conditional GAN brings an improvement. In this case, we already know the target on which we want to condition the training, which helps with model performance. However, other factors, such as hyperparameters, also impact the network.

  • Wasserstein GAN changes the loss function used. The better we define the loss function for our network, the closer the results get to what we want. This loss function behaves in different ways with different types of data (see the sketch after this list).

  • There’s a big jump going from tabular data to sequential/time-series/time-dependent data. We have to be careful about the quality of each generated data record and how the time component influences the generation process. This requires a new type of architecture, such as TimeGAN.
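As a rough illustration of the loss change the Wasserstein GAN bullet refers to, here is a hedged sketch (assuming PyTorch and a critic network defined elsewhere): the binary "real vs. fake" objective is replaced by a score gap between real and synthetic batches, with weight clipping as the original way to keep the critic approximately Lipschitz (gradient penalty is a common alternative not shown here).

    # Illustrative WGAN-style losses; "critic" is any scoring network (assumption).
    import torch

    def critic_loss(critic, real_rows, fake_rows):
        # The critic wants real rows to score high and synthetic rows to score low.
        return -(critic(real_rows).mean() - critic(fake_rows).mean())

    def generator_loss(critic, fake_rows):
        # The generator wants the critic to score its synthetic rows highly.
        return -critic(fake_rows).mean()

    def clip_critic_weights(critic, clip_value=0.01):
        # Original WGAN trick to constrain the critic after each update.
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)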

We plan to keep updating this learning journey about generating synthetic data: how to achieve it, how to trust it, and how to educate the broader data science community about the benefits it can bring.

On Differential Privacy

With differential privacy, you introduce noise to the data you have to make it harder to re-identify someone. You can define how hard it should be to re-identify someone, which is known as the privacy budget. The noise you apply determines how much privacy you gain versus how much data utility you lose.
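As a minimal illustration of that trade-off, here is a sketch using the Laplace mechanism, a standard differential privacy building block (the query, data, and numbers are purely illustrative): epsilon is the privacy budget, and a smaller epsilon means more noise, so more privacy but less utility.

    # Laplace mechanism sketch: noise scale = sensitivity / epsilon (illustrative).
    import numpy as np

    def dp_count(values, threshold, epsilon, sensitivity=1.0):
        """Differentially private count of values above a threshold."""
        true_count = int(np.sum(np.asarray(values) > threshold))
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return true_count + noise

    ages = np.random.randint(18, 90, size=1_000)   # toy data (assumption)
    print(dp_count(ages, threshold=65, epsilon=0.1))  # more noise, more privacy
    print(dp_count(ages, threshold=65, epsilon=5.0))  # less noise, more utility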

I think differential privacy and synthetic data are highly complementary.

You have different data privacy needs at different stages or within different parts of your organization. Synthetic data allows data science teams to explore the information they have. But let's say your domain is healthcare. Then you want an extra step to ensure that the new synthetic data is even more private. The combination makes a lot of sense: you generate the synthetic data with differential privacy guarantees to add extra perturbation.

I don’t know if this is a con or a feature of differential privacy: if you want more privacy, you will lose some of the data's utility. If you apply differential privacy in the wrong manner, you will extract the wrong insights from your data.

On “The Cost of Poor Data Quality”

While extracting insights from data, you have to ensure that the data is correct; otherwise, you get invalid insights. If you apply these invalid insights to your business, you might make decisions that lead to big financial losses. Even if you build different models, the data you choose for those models has a bigger impact at the end of the day. Thus, it's important to take the following factors into consideration in a data platform (a small sketch of how some of them can be measured follows the list):

  • Accuracy stands for how good your data is for the problem that you have.

  • Completeness covers both the missing values you can measure directly in the data and the implicit missingness of external quantities that were never captured.

  • Consistency identifies whether there are conflicting signals within your data and helps you catch errors.

  • Reliability signals how much you trust your data.

  • Up-to-dateness is the most important one: whether you have access to the data in a timely manner.
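Here is the small, hypothetical pandas sketch referenced above, showing how some of these factors can be turned into measurable checks; the columns and rules are assumptions for illustration only.

    # Toy data quality checks with pandas (columns and rules are illustrative).
    import pandas as pd

    df = pd.DataFrame({
        "age": [34, None, 52, -3],
        "signup_date": pd.to_datetime(["2021-01-04", "2021-02-11", None, "2021-03-02"]),
        "country": ["PT", "PT", "pt", "US"],
    })

    # Completeness: share of non-missing values per column.
    completeness = 1 - df.isna().mean()

    # Consistency: flag values that violate simple domain rules
    # (negative ages, mixed-case country codes).
    inconsistent_age = (df["age"] < 0).sum()
    inconsistent_country = (~df["country"].str.isupper()).sum()

    # Up-to-dateness: how stale is the most recent record?
    staleness = pd.Timestamp.today() - df["signup_date"].max()

    print(completeness, inconsistent_age, inconsistent_country, staleness, sep="\n")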

On Model Explainability

Model explainability lets us know the impact that certain variables have on a decision. You want to know why the model made a decision so that you are aware of any flaws or biases in the data. In that sense, model explainability can help you understand potential problems with your data. With good data quality, you are not afraid to justify why you trust a machine's decision.
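One common way to inspect the impact that variables have on a decision is permutation importance; the sketch below uses scikit-learn on a toy model purely as an illustration, not as a description of YData's approach.

    # Permutation importance on a toy classifier (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

    # Higher values mean the model's predictions depend more on that feature.
    for i, importance in enumerate(result.importances_mean):
        print(f"feature_{i}: {importance:.3f}")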

On Open-Source-as-a-Strategy

My co-founder and I enjoy well-developed open-source projects. They have enabled us throughout our careers to learn new things and experiment with new technologies. In addition, they gave us the opportunity to experiment with these projects before buying anything.

Open-source is also about the community. If you give back without the expectation to receive, you create a sense of community where everyone is willing to contribute.

It’s proven that you develop better solutions when more people are thinking about them rather than just one person. Because synthetic data is something so new in the market, we understood that we needed to go open-source and show its value to the community. This is the educational path that we foresee.

Open-source also ties into our product. If the data science community trusts the data we are generating, it's far easier to convince people that they need a solution like the YData platform.

On Hiring

The use of synthetic data with deep learning is rarely found in the development ecosystem. Therefore, candidates should be excited about the opportunity to work on cutting-edge technologies.

As an early-stage startup, we have to be sure that the values we are setting for the company right at the beginning can attract good developers.

YData is a place of collaboration, where feedback is valuable and goes both ways between founders and employees.

On Women in Tech

I am quite active in the Portuguese women-in-tech community. I am very present in discussions, podcasts, and public speaking about YData and entrepreneurship. Doing that as a woman is a way to inspire others to follow a similar path.

Show Notes

  • (02:06) Fabiana talked about her Bachelor’s degree in Applied Mathematics from the University of Lisbon in the early 2010s.

  • (04:18) Fabiana shared lessons learned from her first job out of college as a Siebel and BI Developer at Novabase.

  • (05:13) Fabiana discussed unique challenges while working as an IoT Solutions Architect at Vodafone.

  • (09:56) Fabiana mentioned projects she contributed to as a Data Scientist at startups such as ODYSAI and Habit Analytics.

  • (12:44) Fabiana talked about the two Master’s degrees she got while working in the industry (Applied Econometrics from Lisbon School of Economics and Management and Business Intelligence from NOVA IMS Information Management School).

  • (14:41) Fabiana explained the difference between data science and business intelligence.

  • (18:01) Fabiana shared the founding story of YData, the first data-centric platform with synthetic data, where she is currently the Chief Data Officer.

  • (21:32) Fabiana discussed different techniques to generate synthetic data, including oversampling, Bayesian Networks, and generative models.

  • (24:01) Fabiana unpacked the key insights in her blog series on generating synthetic tabular data.

  • (29:40) Fabiana summarized novel design and optimization techniques to cope with the challenges of training GAN models.

  • (33:44) Fabiana brought up the benefits of using Differential Privacy as a complement to synthetic data generation.

  • (38:07) Fabiana unpacked her post “The Cost of Poor Data Quality,” where she defined data quality based on factors such as accuracy, completeness, consistency, reliability, and, above all, whether the data is up to date.

  • (42:11) Fabiana explained the important role that data quality plays in ensuring model explainability.

  • (44:57) Fabiana reasoned about YData’s decision to pursue the open-source strategy.

  • (47:47) Fabiana discussed her podcast called “When Machine Learning Meets Privacy” in collaboration with the MLOps Slack community.

  • (49:14) Fabiana briefly shared the challenges encountered in getting the first cohort of customers for YData.

  • (50:12) Fabiana went over valuable lessons to attract the right people who are excited about YData’s mission.

  • (51:52) Fabiana shared her take on the data community in Lisbon and her effort to inspire more women to join the tech industry.

  • (53:47) Closing segment.

Fabiana’s Contact Info

YData’s Resources

Mentioned Content

Blog Posts

Podcast

People

  • Jean-Francois Rajotte (Resident Data Scientist at the University of British Columbia)

  • Sumit Mukherjee (Associate Professor of Statistics at Columbia University)

  • Andrew Trask (Leader at OpenMined, Research Scientist at DeepMind, Ph.D. Student at the University of Oxford)

  • Théo Ryffel (Co-Founder of Arkhn, Ph.D. Student at ENS and INRIA, Leader at OpenMined)

Recent Announcements/Articles

About the show

Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests from a wide range of career paths, from scientists and analysts to founders and investors, to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.