The PetFinder.my Adoption Prediction dataset was acquired from Kaggle.com (PetFinder.my, 2019). The original raw data contain 25 features and 14,993 samples in train.csv and 24 features and 3,972 samples in test.csv; after data cleaning, the cleaned datasets contain 21 features and 14,993 samples in train.csv and 20 features and 3,972 samples in test.csv. The cleaned train dataset includes the features ‘AdoptionSpeed’, ‘Age’, ‘Breed1’, ‘Breed2’, ‘Gender’, ‘Color1’, ‘Color2’, ‘Color3’, ‘MaturitySize’, ‘FurLength’, ‘Vaccinated’, ‘Dewormed’, ‘Sterilized’, ‘Health’, ‘Quantity’, ‘Fee’, ‘Description’, ‘PetID’, and ‘DataType’; the test dataset has no ‘AdoptionSpeed’ variable. Per the original dataset instructions, breed_labels.csv and color_labels.csv map integer codes to the corresponding breed and color names, so the cleaning process replaces all codes in ‘Breed1’, ‘Breed2’, and the three color-related columns with the matching names, keeping the top 5 breed names and all colors. Because a tremendously large number of breeds is listed and involved, the cleaned datasets focus on the top 5 breeds for cats and the top 5 breeds for dogs in the ‘Breed1’ variable and adjust the related integer values in the ‘Breed2’ variable accordingly. These top 5 breeds per species cover at least 75% of the test and train datasets combined, while many of the remaining breeds have fewer than 5 samples each, or even a single sample. In addition, the variables ‘Type’, ‘MaturitySize’, ‘FurLength’, ‘Vaccinated’, ‘Dewormed’, ‘Sterilized’, and ‘Health’ are all coded as integers from 0 to 3 or 4, so the cleaning process also replaces these integers with their matching string values. A list of the features used in this project, together with each feature's data type and a description, is available in Appendix A.
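The code-to-label replacement described above can be sketched with pandas. Note that the integer codes and label names below are made-up stand-ins for illustration; in the project, the real mappings come from breed_labels.csv and color_labels.csv, and the size labels follow the Kaggle data dictionary.

```python
import pandas as pd

# Illustrative (not real) subsets of the lookup tables; the actual
# mappings are read from breed_labels.csv and color_labels.csv.
breed_map = {265: "Domestic Short Hair", 307: "Mixed Breed",
             266: "Domestic Medium Hair", 0: "Not Specified"}
color_map = {1: "Black", 2: "Brown", 3: "Golden", 0: "Not Specified"}
size_map = {0: "Not Specified", 1: "Small", 2: "Medium",
            3: "Large", 4: "Extra Large"}

df = pd.DataFrame({
    "Breed1": [265, 307], "Breed2": [0, 266],
    "Color1": [1, 2], "Color2": [3, 0], "Color3": [0, 0],
    "MaturitySize": [2, 4],
})

# Replace integer codes with their string labels column by column.
for col in ["Breed1", "Breed2"]:
    df[col] = df[col].map(breed_map)
for col in ["Color1", "Color2", "Color3"]:
    df[col] = df[col].map(color_map)
df["MaturitySize"] = df["MaturitySize"].map(size_map)

print(df["Breed1"].tolist())  # ['Domestic Short Hair', 'Mixed Breed']
```

The same `map` pattern extends to ‘Type’, ‘FurLength’, and the other integer-coded health variables, each with its own small lookup dict.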
Detailed descriptions of the prediction variable ‘AdoptionSpeed’ are listed in Appendix B; note that no pet data are available for adoption speeds between 90 and 100 days.
The variables ‘PhotoAmt’, ‘RescuerID’, ‘State’, and ‘VideoAmt’ are not used for visualization or NLP and are dropped from the cleaned dataset. All variables except ‘AdoptionSpeed’, ‘Age’, and ‘Fee’ take string input values. The ‘Name’ variable is standardized by condensing the values ‘No name yet’, ‘No Name Yet’, ‘no name’, and NaN into a single value, ‘No Name’. Line charts and pie charts are created with Plotly written in Python; correlation plots, scatterplots, and word clouds are generated with the seaborn and matplotlib packages in Python. Tableau is used to create interactive plots for every categorical variable, and d3 is used to generate an additional interactive histogram and pie chart. The NLTK package in Python supports all data cleaning on the ‘Description’ variable for NLP, including punctuation removal, tokenization, stop-word removal, lowercasing, stemming, and lemmatization. First, a Multinomial Naïve Bayes classifier was applied to the split train and test sets to classify the ‘AdoptionSpeed’ variable. The sklearn package in Python supports most of the modeling requirements, such as splitting the data into train and test sets, building n-gram and Term Frequency-Inverse Document Frequency (TF-IDF) vectorizers, selecting the best model by evaluation metrics, fitting the split train data, and plotting confusion matrices. Last, ensemble methods with a 2-gram vectorizer and a TF-IDF vectorizer are deployed to find the best parameters for a Random Forest model; each model is then fitted, used to predict, and evaluated with a confusion matrix, and the two best models are compared.
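The ‘Description’ cleaning steps can be sketched as below. To keep the sketch self-contained, a tiny inline stop-word list stands in for `nltk.corpus.stopwords`, and tokenization is shown with a plain `split`; in the project itself, NLTK supplies every step, including stemming with `nltk.stem.PorterStemmer` and lemmatization with `nltk.stem.WordNetLemmatizer`.

```python
import string

# Tiny illustrative stop-word list; the project uses nltk.corpus.stopwords.
STOP_WORDS = {"a", "an", "the", "is", "and", "to", "for", "of", "in"}

def clean_description(text: str) -> list[str]:
    # 1. lowercasing
    text = text.lower()
    # 2. punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. tokenization (nltk.word_tokenize in the project)
    tokens = text.split()
    # 4. stop-word removal; stemming/lemmatization would follow here
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_description("A playful kitten, looking for a loving home!"))
# ['playful', 'kitten', 'looking', 'loving', 'home']
```

The cleaned token lists are what the n-gram and TF-IDF vectorizers consume in the modeling steps.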
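The Multinomial Naïve Bayes step can be sketched with sklearn as follows: split the data, vectorize descriptions with TF-IDF, fit, predict, and compute a confusion matrix. The toy texts and labels are made up for illustration; the real inputs are the cleaned ‘Description’ and ‘AdoptionSpeed’ columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for cleaned descriptions and AdoptionSpeed classes 0-4.
texts = ["playful kitten loves people", "shy older cat needs quiet home",
         "energetic puppy good with kids", "calm senior dog house trained",
         "friendly kitten very cuddly", "gentle dog walks well on leash"] * 5
labels = [0, 4, 1, 3, 0, 2] * 5

# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)

# TF-IDF vectorizer (here with unigrams and bigrams) feeding MultinomialNB.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Confusion matrix for evaluation (plotted with sklearn in the project).
cm = confusion_matrix(y_test, pred)
print(cm.sum() == len(y_test))  # every test sample is counted once
```

Swapping `TfidfVectorizer` for `CountVectorizer` yields the plain n-gram variant mentioned above.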
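The ensemble step can be sketched with `GridSearchCV`: search Random Forest hyper-parameters once under a 2-gram count vectorizer and once under a TF-IDF vectorizer, then compare the two best models. The toy data and the parameter grid are illustrative assumptions, not the project's actual settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up stand-ins for cleaned descriptions and AdoptionSpeed classes.
texts = ["playful kitten loves people", "shy older cat needs quiet home",
         "energetic puppy good with kids", "calm senior dog house trained",
         "friendly kitten very cuddly", "gentle dog walks well on leash"] * 5
labels = [0, 4, 1, 3, 0, 2] * 5

# Illustrative hyper-parameter grid for the Random Forest stage.
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [None, 10]}

results = {}
for name, vec in [("2-gram", CountVectorizer(ngram_range=(2, 2))),
                  ("tfidf", TfidfVectorizer())]:
    pipe = Pipeline([("vec", vec),
                     ("rf", RandomForestClassifier(random_state=42))])
    search = GridSearchCV(pipe, param_grid, cv=3)
    search.fit(texts, labels)
    results[name] = (search.best_params_, search.best_score_)

# Compare the two best models' parameters and cross-validated scores.
for name, (params, score) in results.items():
    print(name, params, round(score, 3))
```

Each `best_estimator_` would then be refit, used to predict on the held-out test split, and evaluated with a confusion matrix as in the Naïve Bayes step.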