Aysha Rahman
PHY 231
I grew up playing Pokemon video games, so the Pokemon franchise is near and dear to my heart. There are seven generations of Pokemon, each released alongside its own set of video games. I wanted to look at the Complete Pokemon Dataset (https://www.kaggle.com/rounakbanik/pokemon/version/1) from Kaggle, which contains all seven current generations of Pokemon with their types, stats, and other information, all scraped from the site serebii.net. There are a total of 801 Pokemon, with 41 columns of information in the dataset.
I want to explore the data and look at correlations of different statistics, as well as create different teams of six Pokemon each for various purposes, and ultimately create an ideal Pokemon team to battle a particular character from one of the games.
#importing whatever I can think of, just in case I need it
import pandas as pd
import ast
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sbn
from altair import Chart, X, Y, Color, Scale
import altair as alt
#from vega_datasets import data #raised an error and is not used below
import requests
from bs4 import BeautifulSoup
matplotlib.style.use('ggplot')
from prettytable import PrettyTable as pt
from tabulate import tabulate as tb
#reading the csv file
poke = pd.read_csv("pokemon.csv")
poke.head()
I want to look at how each Pokemon statistic (attack, defense, special attack, special defense, speed, hit points, and total stats) compares to the others. Is there any correlation, and could we predict one stat based off of others? Let's look at it graphically.
#Graphs comparing stats
alt.Chart(poke).mark_circle().encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative'),
color='generation:N' #the dataset has no 'Region' column, so color by generation instead
).properties(
width=150,
height=150
).repeat(
row=['attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'hp'],
column=['attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'hp']
).interactive()
Now let us look at the actual correlation between each stat.
#Looking at the correlation between each stat
stat='attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'hp'#, 'base_total'
for x in stat:
    a = poke[x]
    for y in stat:
        b=a.corr(poke[y])
        print(x,"vs", y, b)
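The nested loop above prints every pair twice, plus each stat against itself. As a minimal sketch, the same Pearson correlations can be read from a single matrix with pandas' `DataFrame.corr()`. The small `toy` table below holds invented numbers standing in for `poke`, so its values say nothing about the real dataset.

```python
import pandas as pd

# Invented stand-in for the real `poke` dataframe (values are illustrative only)
toy = pd.DataFrame({
    'attack':  [50, 60, 80, 100, 120],
    'defense': [40, 70, 75, 90, 110],
    'speed':   [90, 45, 60, 30, 70],
})

# Pearson correlation of every column with every other column
corr_matrix = toy.corr()
print(corr_matrix.round(2))

# Keep each pair once (no self-pairs, no mirrored duplicates),
# sorted from most to least correlated
cols = list(corr_matrix.columns)
pairs = [(a, b, corr_matrix.loc[a, b])
         for i, a in enumerate(cols) for b in cols[i + 1:]]
pairs.sort(key=lambda p: p[2], reverse=True)
for a, b, r in pairs:
    print(a, "vs", b, round(r, 3))
```

With the real data, `poke[list(stat)].corr()` should give the same numbers as the loop, just arranged as a table.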
There doesn't seem to be any strong correlation. Still, I want to see how well we can predict one Pokemon statistic using others. The two most related pairs of stats seem to be defense and special defense, and then special defense and special attack. So, let's pick special attack, special defense, and defense as our starting example to see if one can be predicted using the others.
model = LinearRegression()
model.fit(poke[['sp_attack','sp_defense']], poke.defense)
mean_squared_error(poke.defense, model.predict(poke[['sp_attack','sp_defense']]))
poke['predictions'] = model.predict(poke[['sp_attack','sp_defense']])
Chart(poke).mark_circle().encode(x='sp_attack', y='defense') + \
Chart(poke).mark_circle(color='red').encode(x='sp_attack', y='predictions')
There doesn't seem to be much of a correlation between any of the stats, so it makes sense that the predictions are not very close to our actual data. Visually, they are a lot closer than I would expect given the huge mean squared error, which is an interesting observation, but they are still far off. And since these are the three most correlated stats, trying to make other predictions using battle stats will not be a fruitful endeavor.
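A mean squared error is hard to judge by itself because it is measured in squared stat points. Here is a rough sketch of two easier yardsticks: the root mean squared error (back in stat points) and R², the share of the variance the model explains. The data below is randomly generated as a stand-in for the three stat columns, so the printed numbers are not results for the Pokemon dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (NOT the real stats): defense depends weakly on sp_defense
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'sp_attack': rng.integers(20, 150, size=200),
    'sp_defense': rng.integers(20, 150, size=200),
})
toy['defense'] = 40 + 0.3 * toy.sp_defense + rng.normal(0, 25, size=200)

# Same model setup as above: predict defense from the two special stats
features = toy[['sp_attack', 'sp_defense']]
model = LinearRegression().fit(features, toy.defense)
predicted = model.predict(features)

mse = mean_squared_error(toy.defense, predicted)
rmse = np.sqrt(mse)                      # typical size of an error, in stat points
r2 = r2_score(toy.defense, predicted)    # 1.0 = perfect, 0.0 = no better than the mean

print("MSE:", round(mse, 1))
print("RMSE:", round(rmse, 1))
print("R^2:", round(r2, 3))
```

An MSE in the hundreds can look huge while the RMSE is only a couple dozen stat points, which may be why the red and blue points look closer than the error suggests.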
However, there are other items we could look at besides battle stats. Could properties such as height, weight, base happiness, capture rate, experience growth, and base egg steps correlate with any of the battle stats? Let's look at the same thing we did with the battle stats, but this time look at battle stats vs other properties of each species.
alt.Chart(poke).mark_circle().encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative'),
color='generation:N' #the dataset has no 'Region' column, so color by generation instead
).properties(
width=150,
height=150
).repeat(
row=['height_m', 'weight_kg', 'base_happiness', 'capture_rate', 'experience_growth', 'base_egg_steps'],
column=['attack', 'defense', 'sp_attack', 'sp_defense', 'speed', 'hp','base_total']
).interactive()
Even more so than with the battle stats against each other, it's pretty clear that these other properties of Pokemon species have no relationship with battle stats. This doesn't tell us anything new about what makes a Pokemon strong, but it has satisfied my curiosity, and it reinforces what we found previously about Pokemon characteristics being pretty arbitrary. It makes sense that these properties would be unpredictable, so as to increase diversity amongst Pokemon species.
Before we get on to creating teams, let's look at different Pokemon types. First, we'll see what all the different types are, and then we can create dataframes for each type that we can use later. In creating the dataframes, we will consider the fact that some Pokemon have two types, so we want those Pokemon to show up in the dataframes for both of their types.
#To make dataframes for each type, I'm starting by creating a series that includes each of the 18 types.
types = pd.Series(poke.type1.unique())
#types = poke.type1.unique()
types
#Using a for loop to create a dataframe for each type using both primary 'type1' and secondary 'type2'.
for i in types:
    print (i)
    #vars() at the top level of the notebook creates a variable named after the type, e.g. `water`
    vars()[i] = poke[(poke.type1==i) | (poke.type2==i)]
    print("Dataframe has been created.")
We've created a dataframe for each type using a for loop. A Pokemon can have two types and we stated that we wanted the same Pokemon to show up in the dataframes for both of its types, so our loop looks at both its primary and secondary type to see if it belongs in each dataframe. Now, let us use these dataframes to create yet another dataframe, this time one that will tell us the strongest Pokemon of each type based off of the sum of their total stats, which in the dataset is listed under the column "base_total".
#A dataframe with the strongest non-legendary Pokemon of each type, based off of the sum of their stats 'base_total'
strongest = []
for t in types:
    of_type = poke[(poke.type1==t) | (poke.type2==t)]
    strongest.append(of_type[of_type.is_legendary==False].sort_values(['base_total'],ascending=False).head(1))
#DataFrame.append was removed in pandas 2.0, so the rows are collected in a list and concatenated
bytype = pd.concat(strongest)
bytype[['name','type1','type2']]
There we have it! These are the strongest non-legendary Pokemon of each type. Notice Salamence appears twice; since it is both dragon and flying type, it seems that it is the strongest Pokemon of both those types. Now, I'm interested in knowing: what are the strongest stats for this group of strongest Pokemon? Is there a particular stat that most of them are highest in?
#base total chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='base_total',
y='name',
color='name',
)
#attack chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='attack',
y='name',
color='name',
)
#special attack chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='sp_attack',
y='name',
color='name',
)
#defense chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='defense',
y='name',
color='name',
)
#special defense chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='sp_defense',
y='name',
color='name',
)
#speed chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='speed',
y='name',
color='name',
)
#health point chart for strongest Pokemon of each type
alt.Chart(bytype).mark_bar().encode(
x='hp',
y='name',
color='name',
)
All of them have a base total of 550 or more, 2/3 of them have attacks over 120, and all but four have a special attack near or greater than 120. However, most of them do not seem very high in defense; only 5 have a defense greater than 120, and only 3 have a special defense near or greater than 120. Only 3 have health points near or greater than 100. Speed seems to have the most normal looking distribution.
These numbers are somewhat arbitrary; I've used them as a point of comparison because they visually look like a reasonable cutoff for "stronger" vs "weaker" Pokemon in a specific statistic.
From looking at this, it seems there aren't clear trends in what makes a Pokemon strong; there are a variety of combinations of stats that can make each Pokemon as strong as it is. Out of the strongest of each type, it seems that most of them have higher attack stats, but a few of them seem to rely more strongly on defense. There are over 800 Pokemon, so it makes sense that there is such a variety of combinations of battle stats that make each species unique.
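The counts above were read off the charts by eye. A minimal sketch of tallying them in code might look like the following, assuming one strict cutoff per stat. The `cutoffs` dictionary and the `toy` rows are hypothetical placeholders, not the real `bytype` values.

```python
import pandas as pd

# Invented rows standing in for `bytype` (not real Pokemon stats)
toy = pd.DataFrame({
    'name':    ['A', 'B', 'C', 'D'],
    'attack':  [130, 95, 125, 110],
    'defense': [80, 140, 100, 125],
    'hp':      [70, 105, 90, 100],
})

# Hypothetical cutoffs for calling a stat "high"
cutoffs = {'attack': 120, 'defense': 120, 'hp': 100}

# For each stat, count how many rows are strictly above its cutoff
high_counts = {stat: int((toy[stat] > limit).sum()) for stat, limit in cutoffs.items()}
print(high_counts)

# For each row, find the stat in which it scores highest
toy['best_stat'] = toy[list(cutoffs)].idxmax(axis=1)
print(toy[['name', 'best_stat']])
```

Since raw stats are compared directly here, a stat that runs lower across the board (such as hit points in this toy table) will rarely be picked as the best one; that is a limitation of the sketch.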
What are the six Pokemon with the highest total stats? If we want to create a Pokemon team with them, which Pokemon would they be?
#Making a dataframe for the team of Pokemon with highest stats
high_list = poke.sort_values(['base_total'], ascending = False)
high = high_list.head(6)
high[['name','base_total','type1','type2','is_legendary']]
This is definitely interesting, but notice that all the Pokemon are legendary. I particularly want to look at non-legendary Pokemon, since they are more widely available to catch in games. Let's look at the same thing, but without legendary Pokemon.
#dataframe excluding legendary Pokemon
nonleg=poke[poke.is_legendary==False]
#highest base total stats excluding legendary Pokemon
high_nonleg = nonleg.sort_values(['base_total'], ascending = False)
highest = high_nonleg.head(6)
highest[['name','base_total','type1','type2','is_legendary']]
This is a fairly diverse team in terms of Pokemon types. However, is it really the best team in specific situations? Let's pick a Pokemon game and find a Champion to battle. I'll use the example of Pokemon LeafGreen, where the Champion's name is Green. His team consists of the following Pokemon: Pidgeot, Alakazam, Rhydon, Gyarados, Arcanine, and Venusaur. We will make a dataframe for his team, and then see if we can build an ideal team to fight his based off of type.
#Champion Green's team
Pidgeot = poke[poke.name=="Pidgeot"]
Alakazam = poke[poke.name=="Alakazam"]
Rhydon = poke[poke.name=="Rhydon"]
Gyarados = poke[poke.name=="Gyarados"]
Arcanine = poke[poke.name=="Arcanine"]
Venusaur = poke[poke.name=="Venusaur"]
green = pd.concat([Pidgeot, Alakazam, Rhydon, Gyarados, Arcanine, Venusaur])
#green = pd.concat((poke[poke.name=="Pidgeot"],poke[poke.name=="Alakazam"],poke[poke.name=="Rhydon"],poke[poke.name=="Gyarados"],poke[poke.name=="Arcanine"],poke[poke.name=="Venusaur"]), axis=0, join='outer', ignore_index=False)
green[['name','type1','type2','attack','defense','sp_attack','sp_defense','speed',]]
Which types is each Pokemon weakest against? The dataset has columns labeled "against_" followed by a Pokemon type. If the value is less than 1, then that type is not very effective against the particular Pokemon (a value of 0 means it has no effect at all); if the value is 1, then it has normal effectiveness; if the value is greater than 1, then that type is super effective.
green=green.rename(columns=({'against_fight':'against_fighting'}))
gagainst=green[['name','against_bug','against_dark','against_dragon','against_electric','against_fairy','against_fighting','against_fire','against_flying','against_ghost','against_grass','against_ground','against_ice','against_normal','against_poison','against_psychic','against_rock','against_steel','against_water']]
gagainst
From this, it seems Pidgeot is weakest against electric, ice, and rock.
Alakazam is weakest against bug, dark, and ghost.
Rhydon is weak against fighting, ground, ice, and steel, but super weak against grass and water.
Gyarados is weak against rock and super weak against electric.
Arcanine is weak against ground, rock, and water.
Venusaur is weak against fire, flying, ice, and psychic.
So, the types we want on our team are electric, ice, rock, bug, dark, ghost, grass, water, ground, fire, flying, and psychic. However, we do not need all of these types. If we have electric, water, ice, and either bug, dark, or ghost types on our team, then we've covered the weaknesses of each of Green's Pokemon.
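The coverage claim above can be checked in code rather than by reading the table. This is only a sketch: the multipliers below are hand-entered from the standard type chart for four attacking types, whereas the real dataset has one "against_" column per type, and `covers` is a hypothetical helper name.

```python
import pandas as pd

# Hand-entered damage multipliers for a few attacking types only
toy = pd.DataFrame({
    'name': ['Pidgeot', 'Alakazam', 'Rhydon', 'Gyarados', 'Arcanine', 'Venusaur'],
    'against_electric': [2, 1, 0, 4, 1, 0.5],
    'against_water':    [1, 1, 4, 0.5, 2, 0.5],
    'against_ice':      [2, 1, 2, 1, 0.5, 2],
    'against_dark':     [1, 2, 1, 1, 1, 1],
}).set_index('name')

# Strip the prefix so the columns are plain type names
toy.columns = [c.replace('against_', '') for c in toy.columns]

# Types that are super effective (multiplier > 1) against each Pokemon
weak_to = {name: list(row[row > 1].index) for name, row in toy.iterrows()}

# A set of attacking types covers the team if every Pokemon is weak to at least one
def covers(chosen, weaknesses):
    return all(any(t in chosen for t in weak) for weak in weaknesses.values())

print(weak_to)
print(covers({'electric', 'water', 'ice', 'dark'}, weak_to))
print(covers({'electric', 'water'}, weak_to))
```

The second call leaves Alakazam and Venusaur uncovered, which is why the smaller set is not enough.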
leaf = pd.concat([
    t[t.is_legendary==False].sort_values(['base_total'], ascending = False).head(1)
    for t in (electric, water, ice, bug, dark, ghost)
])
leaf[['name','type1','type2']]
Cool. We have a well-rounded team, equipped to take on Champion Green. We've accomplished our goal!
But wait! Each Pokemon game has only certain generations of Pokemon available. In Pokemon LeafGreen, only Generation 1 Pokemon are present. Using the information about types we want to use against Green, what would an ideal team of Pokemon realistically look like in the game, when Pokemon from other generations aren't present?
leafgreen = pd.concat([
    t[t.generation==1].sort_values(['base_total'], ascending = False).head(1)
    for t in (electric, water, ice, bug, dark, ghost)
])
leafgreen[['name','type1','type2','is_legendary','generation']]
We now have a team that's all from Generation 1, so we could conceivably have this team in Pokemon LeafGreen. However, we didn't select for only non-legendary Pokemon, so it would be a little difficult to get this team. Let's get our final team by choosing the strongest non-legendary Pokemon from Generation 1 of each of the types we identified as Green's weaknesses.
#The types of Pokemon Green's team is weak against
gteam = 'electric','water','ice','bug','dark','ghost'
#Creating a dataframe where we filter by legendary status, generation, and type, and select the strongest ones
picks = []
for t in gteam:
    of_type = nonleg[(nonleg.type1==t) | (nonleg.type2==t)]
    picks.append(of_type[of_type.generation==1].sort_values(['base_total'],ascending=False).head(1))
#DataFrame.append was removed in pandas 2.0, so the rows are collected in a list and concatenated
lg = pd.concat(picks)
lg[['name','type1','type2','is_legendary','generation']]
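We finally made it! Our end result tells us that the best team to fight Champion Green with in Pokemon LeafGreen is: Jolteon, Gyarados, Lapras, Pinsir, Persian, and Gengar. One caveat: no Generation 1 species is a dark type in LeafGreen itself, so Persian most likely appears here because the dataset records the dark typing of its Generation 7 Alolan form.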
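The three filters used for the teams above (type, generation, legendary status) could be folded into one reusable helper. This is a minimal sketch, not the code that produced the results: `strongest_of_type` and its parameters are hypothetical names, the `toy` rows are invented, and it uses `nlargest` in place of sorting and taking the head.

```python
import pandas as pd

# Invented rows standing in for `poke` (names and numbers are placeholders)
toy = pd.DataFrame({
    'name':         ['A', 'B', 'C', 'D', 'E'],
    'type1':        ['electric', 'electric', 'water', 'water', 'electric'],
    'type2':        [None, 'flying', 'ice', None, None],
    'base_total':   [525, 580, 535, 640, 600],
    'generation':   [1, 1, 1, 2, 3],
    'is_legendary': [0, 1, 0, 0, 0],
})

def strongest_of_type(df, type_name, generation=None, allow_legendary=False):
    """Return the row with the highest base_total among Pokemon of one type."""
    # A Pokemon counts for a type if either its primary or secondary type matches
    picked = df[(df.type1 == type_name) | (df.type2 == type_name)]
    if generation is not None:
        picked = picked[picked.generation == generation]
    if not allow_legendary:
        picked = picked[picked.is_legendary == 0]
    return picked.nlargest(1, 'base_total')

# One pick per wanted type, restricted to Generation 1 and non-legendary
team = pd.concat([strongest_of_type(toy, t, generation=1)
                  for t in ['electric', 'water', 'ice']])
print(team[['name', 'type1', 'type2']])
```

In this toy table the same row is picked for both water and ice, just as Salamence showed up twice earlier; a real team builder would probably want to skip Pokemon that have already been chosen.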
We looked at Pokemon battle stats and tried to find correlations between stats, and then between battle stats and other Pokemon properties. In both cases, there was no strong correlation present. Because of this, we could not reliably predict one property based off of others. This makes sense because Pokemon species are meant to be diverse to give every player a unique experience, and having clear trends in what makes a "good" Pokemon would mean everyone would gravitate toward the same ones.
After that, we defined what it means to be a strong Pokemon, which we did simply by summing across battle stats for each Pokemon. We could have taken a more sophisticated route, but this sum was easy to use, especially since it was already present in the dataset. We then used that definition to create a few teams. The first was a team of the six "strongest" Pokemon, which all happened to be legendary Pokemon. The second was a team of the six "strongest" non-legendary Pokemon. The third team was created to beat Champion Green from the Pokemon LeafGreen game, using our definition of a strong Pokemon and the information we had about type strengths and weaknesses from the dataset. The fourth team built on that principle and restricted the choices to Generation 1, the Pokemon actually available in-game. The fifth team brought everything together to select only the strongest non-legendary Pokemon of Generation 1 that were strongest against Green. Despite not gaining much clear information about battle stats, we were able to create some well-rounded and realistic Pokemon teams. Perhaps I will use this final team myself the next time I play Pokemon LeafGreen.