How to Prepare a Machine Learning Dataset using Tkinter (Python)

Here is the link for the Instagram Kaggle dataset: https://www.kaggle.com/datasets/shmalex/instagram-images/data

If you just want to see the code/what I did, scroll down to "The Solution".

Intro

I recently completed the Zero to Mastery Machine Learning/Data Science course (no affiliation) and was wondering what to do next. The advice from the instructor was to take the fundamentals that I'd learned in the course and apply them by building my own machine learning project. Additionally, he recommended a few principles to abide by when choosing a project:

  • The project should interest you

  • It should help you learn something

  • You should be able to tell a story about it afterwards

  • Look for the crossover - take one of your skills/interests (in this case, machine learning) and pair it with another

Using these principles to guide me and being the ridiculous Casanova that I am, I decided to tackle a rather ambitious project (at least, I think so): a Tinder bot. Not just a regular Tinder bot that auto-swipes right on everybody (that's easy), but an intelligent Tinder bot that swipes according to my own preferences, and then even has conversations with my matches to set up dates (supervised by me of course - this is all just for fun/to practice ML, not to actually completely automate my dating life).

The Data

The first step of the project is to train a machine learning model to classify pictures of women according to my preferences - "Yes"/swipe right, or "No"/swipe left. To do this, I would need lots of pictures of women, and then labels for them based on my preferences. But where to get lots of pictures of women?

Even though there's a trillion images of women on the internet, getting them into a nice, clean dataset isn't as easy as it seems. For starters, most of the images are hosted on sites that either have policies or active defense mechanisms against any sort of program/bot that downloads their images (and saving a few thousand images to my hard drive manually isn't something I was about to attempt).

Furthermore, an "image of a woman" is pretty vague. Many of the images that meet that definition aren't close to what you'd actually encounter while perusing an online dating app like Tinder. And even though this is my first real project, I want my model to be good (eg, accurate). And to get that, I have to have better data.

The first image set I found that had promise was this one here: https://www.kaggle.com/datasets/playlist/men-women-classification. It's a dataset with a bunch of pictures of men, women, and children, designed to train a model to tell male from female.

While it was a start, it had a couple of problems. First, the images usually weren't the style you'd find in an online dating profile (selfies, selfies with friends, Instagram-style photos, etc.). Most are professional-style photos of the type you'd find in magazine ads or photo shoots, and many have text overlaid on them.

A second problem is that many of them are either of celebrities or taken directly from clips of movies/shows, which hardly mimics the selfie-laden profiles of Tinder. There are an inordinate number of Wonder Woman/superhero pictures (mostly Gal Gadot), such that the model would be trained to predict that any girl in a costume is my dream match.

I labeled this dataset and used it to initially build and train my model, but I knew I wanted something better.

I searched around the net for a while on ways to build an image dataset. Since I'm new, I wasn't 100% sure what I was really looking for, but I had an idea. For a while I considered writing a script to make calls to the Tinder API so that I could just download the pictures of the profiles I swiped on (I found a tutorial for it here: https://www.youtube.com/watch?v=xd9-aYxIGpA&ab_channel=OscarAlsingFerreira). But the Tinder API isn't public, so there's no documentation for it; additionally, that video was made over six years ago, meaning the API has almost certainly changed since then and the code would need to be reworked, if it still worked at all.

While I was up for the challenge, I still wanted something easier, given that the purpose of this project isn't to test my web-scraping skills but to build machine learning skills. I couldn't find much, though. I was on the verge of just going the Tinder-scraping route when I stumbled upon this dataset on Kaggle: https://www.kaggle.com/datasets/shmalex/instagram-images/data.

It's a monster of a dataset - just tons and tons of images from Instagram. Since Tinder photos are often just pulled from Instagram and vice versa, I figured this dataset would do nicely.

It's almost 100GB. After trying and failing for a while to build a script in Kaggle notebooks to download only part of the set, I just downloaded the whole thing. (Seriously, given how large some Kaggle datasets can be, I was floored to find that there's no easy way to download only part of one. I found ways to download a single image file at a time, but that's no help when I need thousands of images. I'm pretty sure it's possible for someone with the skill and the dedication, but again, the purpose of this project is to build ML skills, so I just bit the bullet and downloaded the whole damn thing.)
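For reference, the single-file route looks roughly like the sketch below, using the official Kaggle Python client. The file path inside the dataset is just a placeholder - you'd have to look up the real paths one by one, which is exactly why doing this thousands of times isn't practical.

from kaggle.api.kaggle_api_extended import KaggleApi  # pip install kaggle

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Download one file from the dataset.
# The file name below is a placeholder, not a real path in the dataset.
api.dataset_download_file(
    "shmalex/instagram-images",
    file_name="some/folder/example.jpg",
    path="downloads",
)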

So now I have over 1.2 million random Instagram photos saved on my hard drive. And by random, I mean random - there's guys, there's girls, there's children, there's scenery, there's hair, there's girls' nails (LOTS of girls' nails), there's food, there's drink, there's architecture - you name it. It's Instagram, after all. People just take pictures of what's going on in their lives and post it.

But that's not quite what I need - for my model, I need pictures of women, and I need labels. So that was the next step.

The Problem

So now I had all this data, but what to do next? Going through the files manually, selecting the images of women, and then adding labels by hand in a CSV file would take a lifetime. I wanted something better.

I wanted a GUI that would show me the images one by one. For each image, I could choose whether it was relevant or not (i.e., a picture of a woman that might be found on a Tinder profile) - if it was, I could choose whether I would swipe right on her or swipe left, thus creating the labels. Each relevant image would be copied to a designated folder, and its file name and the label I chose (1 or 0) would be saved to a CSV file. I wanted all of this to happen automatically so that I could simply flip through the images one after another - not unlike Tinder, the very app I'm training my model for.

The Solution

With a little help from ChatGPT and the tkinter module (which I had learned about previously in the 100 Days of Python course by Angela Yu), I was able to whip up a primitive Tinder-like interface to sort through the mass of Instagram photos I had saved on my computer. Here's the code: https://github.com/cyborgrob/tinder_vision/blob/main/ig_dataset_sorter.py
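If you just want the gist without opening the repo, here's a stripped-down sketch of that kind of labeling loop (not the actual script linked above - the folder paths, file pattern, and CSV layout are placeholders to adapt to your own data):

import csv
import shutil
from pathlib import Path
import tkinter as tk
from PIL import Image, ImageTk  # pip install pillow

SOURCE_DIR = Path("instagram_images")  # folder of unsorted images (placeholder)
KEEP_DIR = Path("labeled_images")      # relevant images get copied here
CSV_PATH = Path("labels.csv")          # filename,label rows get appended here

class LabelerApp:
    def __init__(self, root):
        self.root = root
        self.images = sorted(SOURCE_DIR.glob("*.jpg"))
        if not self.images:
            raise SystemExit(f"No .jpg files found in {SOURCE_DIR}")
        self.index = 0
        KEEP_DIR.mkdir(exist_ok=True)

        self.image_widget = tk.Label(root)
        self.image_widget.pack()

        # One button per decision: swipe right (1), swipe left (0), or skip.
        tk.Button(root, text="Swipe right (1)", command=lambda: self.save(1)).pack(side=tk.LEFT)
        tk.Button(root, text="Swipe left (0)", command=lambda: self.save(0)).pack(side=tk.LEFT)
        tk.Button(root, text="Not relevant / skip", command=self.next_image).pack(side=tk.LEFT)

        self.show_image()

    def show_image(self):
        path = self.images[self.index]
        img = Image.open(path)
        img.thumbnail((600, 600))              # shrink to fit the window
        self.tk_img = ImageTk.PhotoImage(img)  # keep a reference so it isn't garbage-collected
        self.image_widget.config(image=self.tk_img)
        self.root.title(f"{self.index + 1} / {len(self.images)}: {path.name}")

    def save(self, label):
        # Copy the relevant image and append its filename + label to the CSV.
        path = self.images[self.index]
        shutil.copy(path, KEEP_DIR / path.name)
        with open(CSV_PATH, "a", newline="") as f:
            csv.writer(f).writerow([path.name, label])
        self.next_image()

    def next_image(self):
        self.index += 1
        if self.index >= len(self.images):
            self.root.destroy()  # no more images to label
        else:
            self.show_image()

if __name__ == "__main__":
    root = tk.Tk()
    LabelerApp(root)
    root.mainloop()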

And here's a quick demo I made of how it works:

As I mention in the vid, I'm still pretty new to machine learning, and there might be a better/more efficient way to sort the data. But for my purposes this seems to work great, and with a little modification anybody should be able to adapt it to their own projects.

If you have any comments or questions be sure to let me know!