Datacast Episode 60: Algorithms and Data Structures For Massive Datasets with Dzejla Medjedovic
The 60th episode of Datacast is my interview with Dzejla Medjedovic— the Assistant Professor of Computer Science at the International University of Sarajevo and the author of “Algorithms and Data Structures for Massive Datasets.”
We had a wide-ranging conversation covering her foray into studying Computer Science, her Ph.D. in Applied Algorithms at Stony Brook University, her love for teaching, her industry internships at Microsoft, her Manning book, the tech scene in Sarajevo, and much more.
Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) TuneIn, (5) RadioPublic, (6) Stitcher, and (7) iHeart Radio
Key Takeaways
Here are highlights from my conversation with Dzejla:
On Getting Into Computer Science
I was the first generation to enter the Sarajevo School of Science and Technology, the first private school in Bosnia. Unlike in the US, up until recently, there were only state universities in Bosnia. SSIT offers lectures in English and promises a modern education with international faculty. That sounded exciting to me, so I started in a cohort of 60 people.
During high school, I enjoyed studying math and languages the most. I had a hunch that studying Computer Science would be similar to studying Math. I went to SSST’s website to look at the Computer Science department's curriculum and remembered looking at courses called “Computational Complexity” and “Algorithms.” I was like: “yeah, those sound kind of like math.” In fact, there is the aspect of discrete mathematics that is related to how good you are at problem-solving.
It was a unique experience. We did not have many resources, but we had a lot of enthusiasm. The professors there were building a new institution very passionately. Every student was taking a predetermined set of classes. Additionally, students had a lot of personal communication with the professors. For someone like me who has spent all my life in Bosnia (and in Sarajevo), interacting with an American professor opened up my eyes and informed me about new opportunities — making me feel brave enough to study abroad later.
Furthermore, we had a professor who used to be a US coach for ACM regional programming competitions for his team at the university where he worked. He introduced us to that kind of contest. That’s where I fell in love with programming. Sometimes, I would not sleep all night because I was thinking about a solution to a particular problem.
On Doing a Ph.D. at Stony Brook
I applied for a Ph.D. right out of college. To be honest, it was very shocking for me to go to another continent. Additionally, I went to a large university where there were a lot of smart people. It was a huge reality check at first — going from being the best student in your class to (definitely) not being the best student in your class. Being a Ph.D. student is a write of passage and the angstiest time of your life. In retrospect, It was a great time, but I remembered at that time, it was not so great.
Overall, the Ph.D. experience was very challenging but probably the best thing I have done. I was fortunate to have Michael Bender as my advisor. On the one hand, he is an academic who works in the algorithms field (most of his papers are filled with theorem proofs). But at the same time, he started a company that designs a database engine based on algorithms and data structures that he studied in theory. It was not rare for a faculty member to start a company. Still, it was rare to have somebody from algorithms (super theoretical field) to start a company using those theories. My Ph.D. thesis resulted from one of his insights: in theory, you make some assumptions; but in practice, you often see those are not the cases. Can we write papers that validate our algorithms to be working in real settings?
Stony Brook was an amazing school academically. I applied to the university by accident. There was a book for preparing for programming contests (that I attended during college) written by Steven Skiena, a professor at Stony Brook. That’s how I learned about Stony Brook. If you are in Bosnia, you probably know about a couple of Ivy Leagues school only. The only downside is the lack of community, which Stony Brook suffers from being very close to New York City. I came from a very tight-knit community in Bosnia, so being on an isolated campus felt a bit lonely.
On Her Ph.D. Thesis
If you are a Computer Science student, you definitely have to take a course in data structure and a course in algorithms by the time you graduate. Regardless of the programming languages, there’s an underlying idea behind that program. The better your algorithm is, the faster your program runs. Recently, the amount of data that we have been dealing with is growing at a tremendous rate. However, that wouldn’t be such a problem if computer hardware development is going at the same pace. Unfortunately, data has grown in size at a much faster rate than the RAM has grown in size.
If you want to design an efficient algorithm, you need to pay attention to how fast the data will be transferred from cloud/disk/SSD to your RAM, where the computation can be done. The main aspect of the problem is not the number of computational steps the algorithm has to take but how much time it spends moving data back and forth. That was the focus of my Ph.D. thesis.
The field of external memory algorithms dates back way before my thesis. In massive datasets, external memory algorithms are the most efficient ones. For my thesis, I worked on a particular variant of sorting: given a database with items of varying sizes, the sorting algorithm's efficiency also changes. That was the problem that I deal with.
On Teaching
Teaching is a good way to understand a topic. When I teach the same topic for the 5th or 6th time in a row, I am always surprised that I always learn something new. The main thing that I like about teaching is sharing knowledge in the social setting. I like people, and I like knowledge, so teaching is those two in one.
Something that I learned over the years is that algorithms are often taught as an abstract theoretical subject. The professor does not connect the design of an algorithm and the implementation of an algorithm. This happens because algorithm researchers are interested in the high-level idea and not interested in low-level implementation details. Something that can help the students is that the professor actually writes the code/shows an example for the algorithm. It’s sometimes hard to empathize with the students who know nothing about the subject.
As time goes by, I see more and more value in teaching concretely versus abstractly. That helps all students, not just the struggling students.
On Interning at Microsoft
In the first summer, I was in the Systems Center team working on the App Controller application. During this time, businesses start to move their services to the cloud. App Controller's goal is to enable people who work at a particular business to diagnose and fix technical issues with the business services. In other words, this helps enterprise clients get more insights and control of their services.
App Controller's problem is that: when the user loads the dashboard, the loading time for services is really long. My project, at first, was to decrease this response time by developing a cache. Due to business needs, this cache had to be implemented with the product by the end of the summer. I collaborated with talented software engineers and learned best practices on how to write production-level code.
In the second summer, I designed an app for visualizing the performance of different services in another team.
In industry, you must be in touch with customer needs. The human side of things is essential as you work with your colleagues. My theoretical algorithms study is more of a lonely endeavor (you can sit on a tree proving a theorem). Something that I did not like about industry so much is that everything moves so fast.
These days, many people are moving from academia to industry, which is making the industry much more creative and interested in doing research.
On Being A Professor
My motivation to pursue an academic job was contact with students, knowledge sharing, and freedom. Being a professor in Bosnia is very different from being a professor in the US. I felt like I learned many new things from the Ph.D. experience that might benefit Bosnian students. Most of my colleagues who graduated with me stayed in the US. Furthermore, I want to be close to my family in Bosnia.
A negative thing about Bosnia's academic job is that our government does not grant many funds for research. This is unlike the US system, in which professors apply to grant, get money from the government, and employ graduate students to conduct research. Therefore, Bosnia does not have a good research output at all. I have to work on research in my free time because I have many teaching and administrative duties. That’s a pity because you need to learn new things all the time by doing research for you to be a good professor.
On Writing “Algorithms and Data Structures for Massive Datasets”
Our book introduces a number of algorithms and data structures for engineers and data scientists working with a massive amount of data. That topic can get abstract fast, so it’s crucial to show industry use cases. For example, if we mention Bloom Filter, we can discuss how Bitcoin uses Bloom Filter. Getting an example that is interesting, realistic, and useful is the key to our writing process.
I want to write a book where: after reading it, somebody knows where to use the techniques he/she has learned.
On The Tech Community in Sarajevo
There has been a tech boom in Sarajevo recently, which is exciting. On the one hand, Bosnia has a very high unemployment rate; however, tech is the only sector where unemployment is close to 0. There are a lot of young people studying technical subjects in universities and getting employment right away.
Something less than ideal is that most of the work done by Bosnian software companies is outsourcing, which is often not the most exciting part. However, this is changing rapidly.
Show Notes
(01:58) Dzejla described her undergraduate experience studying Computer Science at the Sarajevo School of Science and Technology back in the mid-2000s.
(07:59) Dzejla recapped her overall experience getting a Ph.D. in Computer Science at Stony Brook University.
(14:38) Dzejla unpacked the key research problem in her Ph.D. thesis titled “Upper and Lower Bounds on Sorting and Searching in External Memory.”
(19:13) Dzejla went over the details of her paper “Don’t Thrash: How to Cache Your Hash on Flash,” which describes the Cascade Filter, an approximate-membership-query data structure that scales beyond main memory, that is an alternative to the well-known Bloom-filter data structure.
(24:41) Dzejla elaborated on her work “The batched predecessor problem in external memory,” which studies the lower bounds in three external memory models: the I/O comparison model, the I/O pointer-machine model, and the index-ability model.
(29:56) Dzejla shared her learnings from being a Teaching Assistant for the Introduction to Algorithms course at Stony Brook (both at the undergraduate and graduate level).
(35:08) Dzejla went over her summer internships at Microsoft’s Server and Tools Division during her Ph.D.
(41:06) Dzejla reasoned about her decision to return to Sarajevo School of Science and Technology as an Assistant Professor of Computer Science.
(47:22) Dzejla dissected the essential concepts and methods covered in her Data Structures, Introductory Algorithms, Advanced Algorithms, and Algorithms for Big Data courses taught at SSIT.
(48:42) Dzejla provided a brief overview of the Computer Science/Software Engineering department at the International University of Sarajevo (where she has been a professor since 2017.
(50:57) Dzejla briefly talked about the courses that she taught at IUS, including Intro to Programming, Human-Computer Interaction, and Algorithms/Data Structures.
(52:49) Dzejla shared the challenges of writing Algorithms and Data Structures for Massive Datasets, which introduces data processing and analytics techniques specifically designed for large distributed datasets.
(56:14) Dzejla explained concepts in Part 1 of the book — including Hash Tables, Approximate Membership, Bloom Filters, Frequency/Cardinality Estimation, Count-Min Sketch, and Hyperloglog.
(58:38) Dzejla provided a brief overview of techniques to handle streaming data in Part 2 of the book.
(01:00:14) Dzejla mentioned the data structures for large databases and external-memory algorithms in Part 3 of the book.
(01:02:15) Dzejla shared her thoughts about the tech community in Sarajevo.
(01:04:16) Closing segment.
Dzejla’s Contact Info
Mentioned Content
Papers
“Upper and Lower Bounds on Sorting and Searching in External Memory” (Dzejla’s Ph.D. Thesis, 2014)
People
Erik Demaine (Computer Science Professor at MIT)
Michael Bender (Computer Science Professor at Stony Brook, Dzejla’s Ph.D. Advisor)
Joseph Mitchell (Computational Geometry Professor at Stony Brook)
Steven Skiena (Computer Science Professor at Stony Brook)
Jeff Erickson (Computer Science Professor at UIUC)
Books
“Algorithms and Data Structures for Massive Datasets” (by Dzejla Medjedovic, Emin Tahirovic, and Ines Dedovic)
“The Algorithm Design Manual” (by Steven Skiena)
Here is a permanent 40% discount code (good for all Manning products in all formats) for Datacast listeners: poddcast19. Link at http://mng.bz/4MAR.
Here is one free eBook code good for a copy of Algorithms and Data Structures for Massive Datasets for a lucky listener: algdcsr-7135. Link at http://mng.bz/Q2y6
About the show
Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journey and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths - from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.