Where did I find the data?
The data is from Kaggle Pet Adoption
-
The original raw datasets contain 25 features and 14,993 samples in train.csv and 24 features and 3,972 samples in test.csv.
-
Columns are: ‘AdoptionSpeed’, ‘Age’, ‘Breed1’, ‘Breed2’, ‘Color1’, ‘Color2’, ‘Color3’, ‘Description’, ‘Dewormed’, ‘Fee’,‘Fur Length’, ‘Gender’, ‘Health’, ‘MaturitySize’,‘PetID’, ‘PhotoAmt’, ‘Quantity’, ‘RescuerID’, ‘State’,‘Sterilized’, ‘Vaccinated’, ‘VideoAmt’.
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
-
Drop the columns ‘PhotoAmt’, ‘VideoAmt’, ‘RescuerID’, ‘State’
-
breed_labels.csv and color_labels.csv list the number and corresponding breed name or color, so the resulting cleaned dataset removes all number in ‘Breed1’, ‘Breed2’, and 3 color-related columns, and insert the top 5 breed name and all colors correspondingly.
-
The top 5 breeds for cat or dog samples cover at least 75% of metadata, and many of the other breeds have less than 5 samples per breed name, or even 1 sample per breed name.