Ejercicio misceláneo: Análisis exploratorio y Feature Engineering
Contenido
Ejercicio misceláneo: Análisis exploratorio y Feature Engineering¶
import pandas as pd
import numpy as np
import os
from scipy import stats
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Input In [1], in <module>
2 import numpy as np
3 import os
----> 4 from scipy import stats
ModuleNotFoundError: No module named 'scipy'
ruta_archivo = os.path.join("titanic", "train.csv")
df = pd.read_csv(os.path.join("titanic", "train.csv"), index_col='PassengerId')
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1. Elimine las variables/columnas “Ticket” y “Cabin”¶
df.drop(columns=['Cabin'], inplace=True)
df.drop(columns=['Ticket'], inplace=True)
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|---|
PassengerId | |||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | S |
df.describe(include='all')
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|---|
count | 892.000000 | 892.000000 | 892 | 892 | 715.000000 | 892.000000 | 892.000000 | 891.000000 | 890 |
unique | NaN | NaN | 892 | 2 | NaN | NaN | NaN | NaN | 3 |
top | NaN | NaN | Jonsson, Mr. Carl | male | NaN | NaN | NaN | NaN | S |
freq | NaN | NaN | 1 | 578 | NaN | NaN | NaN | NaN | 645 |
mean | 0.383408 | 2.309417 | NaN | NaN | 29.742196 | 0.522422 | 0.381166 | 32.204208 | NaN |
std | 0.486489 | 0.835923 | NaN | NaN | 14.561951 | 1.102264 | 0.805706 | 49.693429 | NaN |
min | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN |
25% | 0.000000 | 2.000000 | NaN | NaN | 20.250000 | 0.000000 | 0.000000 | 7.910400 | NaN |
50% | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | 14.454200 | NaN |
75% | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | 31.000000 | NaN |
max | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | 512.329200 | NaN |
2. Encuentre los mejores valores para completar la variable “Embarked” para los pasajeros con datos faltantes¶
df[df.Embarked.isnull()]
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | IsMother | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId |
#Tome en cuenta que simplemente emplear el valor que más aparece casi nunca es la mejor alternativa
#Para obtener mejores resultados deberá apoyarse en los resultados de un análisis exploratorio de datos
#aplicado a las variable(s) que puedan estar relacionadas con 'Embarked'
df['Embarked'].value_counts()
S 645
C 168
Q 77
Name: Embarked, dtype: int64
# Obtenemos el puerto más común según la clase y el sexo
df.groupby(['Pclass', 'Sex'])['Embarked'].agg(pd.Series.mode)
Pclass Sex
1 female S
male S
2 female S
male S
3 female S
male S
Name: Embarked, dtype: object
# Como podemos observar, todos los valores que se obtuvieron en la celda de arriba pertenecen al puerto 'S'
# Entonces, reemplazaremos los valores NaN por el puerto 'S'
df['Embarked'].fillna('S', inplace=True)
df[df.Embarked.isnull()]
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | IsMother | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId |
3. Complete el código de la siguiente función para extraer el título de cada pasajero¶
def extractTitle(name):
title_mapping = {'mr' : 'Mr',
'mrs' : 'Mrs',
'miss' : 'Miss',
'master' : 'Master',
'don' : 'Sir',
'rev' : 'Sir',
'dr' : 'Officer',
'mme' : 'Mrs',
'ms' : 'Mrs',
'major' : 'Officer',
'lady' : 'Lady',
'sir' : 'Sir',
'mlle' : 'Miss',
'col' : 'Officer',
'capt' : 'Officer',
'the countess' : 'Lady',
'jonkheer' : 'Sir',
'dona' : 'Lady'
}
title = name.split(', ')[1].split('.')[0].lower()
return title_mapping[title]
df['Title'] = df['Name'].map(lambda name : extractTitle(name))
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId | ||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | S | Mr |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | S | Miss |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | S | Mrs |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | S | Mr |
df.describe(include='all')
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|
count | 892.000000 | 892.000000 | 892 | 892 | 715.000000 | 892.000000 | 892.000000 | 891.000000 | 890 | 892 |
unique | NaN | NaN | 892 | 2 | NaN | NaN | NaN | NaN | 3 | 7 |
top | NaN | NaN | Simmons, Mr. John | male | NaN | NaN | NaN | NaN | S | Mr |
freq | NaN | NaN | 1 | 578 | NaN | NaN | NaN | NaN | 645 | 518 |
mean | 0.383408 | 2.309417 | NaN | NaN | 29.742196 | 0.522422 | 0.381166 | 32.204208 | NaN | NaN |
std | 0.486489 | 0.835923 | NaN | NaN | 14.561951 | 1.102264 | 0.805706 | 49.693429 | NaN | NaN |
min | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN |
25% | 0.000000 | 2.000000 | NaN | NaN | 20.250000 | 0.000000 | 0.000000 | 7.910400 | NaN | NaN |
50% | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | 14.454200 | NaN | NaN |
75% | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | 31.000000 | NaN | NaN |
max | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | 512.329200 | NaN | NaN |
4. Encuentre los mejores valores para completar la variable “Age” para los pasajeros con datos faltantes¶
Tome en cuenta que simplemente emplear el valor de tendencia central (mediana) de todo el conjunto casi nunca es la mejor alternativa Para obtener mejores resultados deberá apoyarse en los resultados de un análisis exploratorio de datos aplicado a las variable(s) que puedan estar relacionadas con “Age”
df[df['Age'].isnull()].head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId | ||||||||||
6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 8.4583 | Q | Mr |
18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 13.0000 | S | Mr |
20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 7.2250 | C | Mrs |
27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 7.2250 | C | Mr |
29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 7.8792 | Q | Miss |
df[df['Age'].isnull()].shape
(177, 10)
df['Age'].median()
28.0
df.corr()
Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|
Survived | 1.000000 | -0.338965 | -0.079388 | -0.034886 | 0.082010 | 0.257307 |
Pclass | -0.338965 | 1.000000 | -0.365161 | 0.082599 | 0.017994 | -0.549500 |
Age | -0.079388 | -0.365161 | 1.000000 | -0.308847 | -0.189989 | 0.096067 |
SibSp | -0.034886 | 0.082599 | -0.308847 | 1.000000 | 0.414985 | 0.159651 |
Parch | 0.082010 | 0.017994 | -0.189989 | 0.414985 | 1.000000 | 0.216225 |
Fare | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 |
Según la matriz de correlación, Age tiene la correlación más alta con Pclass. Entonces, agruparemos los datos según este parámetro. También los agruparemos según el sexo, o sea, que los datos estarán agrupados por clase y por sexo. Porque las edades de las personas pueden variar según su sexo.
pasajeros_por_clase_y_sexo = df.groupby(['Pclass', 'Sex'])
mediana_age_pasajeros = pasajeros_por_clase_y_sexo['Age'].transform('median')
# Reemplazar los datos nulos de Age
df['Age'].fillna(mediana_age_pasajeros, inplace=True)
df[df['Age'].isnull()].head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|
PassengerId |
5. Cree una nueva variable “IsMother” (1=es madre, 0=no es madre)¶
Responda la siguiente pregunta sustentando su respuesta con los resultados de un análisis exploratorio de datos.¶
Las madres tuvieron mayor probabilidad de sobrevivir al accidente del Titanic?¶
Elegimos a las madres como las mujeres que están casadas y que viajen al menos con un Parch(parientes que sean padres o hijos).
No incluímos a las mujeres con el título de Miss, para evitar incluir a las mujeres que no son madres pero que viajan con sus papás.
Hay que tener en cuenta que en esa época no habían muchas madres solteras (por el machismo), así que no se pierde mucha información en caso de que una mujer sea madre y no esté casada.
def es_mujer_y_no_soltera_y_viaja_con_hijos(fila):
return (fila['Parch'] > 0) & (fila['Sex'] == 'female') & (fila['Title'] != 'Miss')
def es_madre(fila):
if es_mujer_y_no_soltera_y_viaja_con_hijos(fila):
return 1
return 0
df['IsMother'] = df.apply(es_madre, axis=1)
df.tail()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | IsMother | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | |||||||||||
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 30.00 | S | Miss | 0 |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 21.5 | 1 | 2 | 23.45 | S | Miss | 0 |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 30.00 | C | Mr | 0 |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 7.75 | Q | Mr | 0 |
1044 | 0 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | NaN | S | Mr | 0 |
df.loc[(df['IsMother'] == 1)]
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | IsMother | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | |||||||||||
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 11.1333 | S | Mrs | 1 |
26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 31.3875 | S | Mrs | 1 |
99 | 1 | 2 | Doling, Mrs. John T (Ada Julia Bone) | female | 34.0 | 0 | 1 | 23.0000 | S | Mrs | 1 |
141 | 0 | 3 | Boulos, Mrs. Joseph (Sultana) | female | 21.5 | 0 | 2 | 15.2458 | C | Mrs | 1 |
167 | 1 | 1 | Chibnall, Mrs. (Edith Martha Bowerman) | female | 35.0 | 0 | 1 | 55.0000 | S | Mrs | 1 |
168 | 0 | 3 | Skoog, Mrs. William (Anna Bernhardina Karlsson) | female | 45.0 | 1 | 4 | 27.9000 | S | Mrs | 1 |
248 | 1 | 2 | Hamalainen, Mrs. William (Anna) | female | 24.0 | 0 | 2 | 14.5000 | S | Mrs | 1 |
252 | 0 | 3 | Strom, Mrs. Wilhelm (Elna Matilda Persson) | female | 29.0 | 1 | 1 | 10.4625 | S | Mrs | 1 |
255 | 0 | 3 | Rosblom, Mrs. Viktor (Helena Wilhelmina) | female | 41.0 | 0 | 2 | 20.2125 | S | Mrs | 1 |
256 | 1 | 3 | Touma, Mrs. Darwis (Hanne Youssef Razi) | female | 29.0 | 0 | 2 | 15.2458 | C | Mrs | 1 |
260 | 1 | 2 | Parrish, Mrs. (Lutie Davis) | female | 50.0 | 0 | 1 | 26.0000 | S | Mrs | 1 |
269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.0 | 0 | 1 | 153.4625 | S | Mrs | 1 |
273 | 1 | 2 | Mellinger, Mrs. (Elizabeth Anne Maidment) | female | 41.0 | 0 | 1 | 19.5000 | S | Mrs | 1 |
280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35.0 | 1 | 1 | 20.2500 | S | Mrs | 1 |
300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | 247.5208 | C | Mrs | 1 |
313 | 0 | 2 | Lahtinen, Mrs. William (Anna Sylfven) | female | 26.0 | 1 | 1 | 26.0000 | S | Mrs | 1 |
320 | 1 | 1 | Spedden, Mrs. Frederic Oakley (Margaretta Corn... | female | 40.0 | 1 | 1 | 134.5000 | C | Mrs | 1 |
324 | 1 | 2 | Caldwell, Mrs. Albert Francis (Sylvia Mae Harb... | female | 22.0 | 1 | 1 | 29.0000 | S | Mrs | 1 |
329 | 1 | 3 | Goldsmith, Mrs. Frank John (Emily Alice Brown) | female | 31.0 | 1 | 1 | 20.5250 | S | Mrs | 1 |
363 | 0 | 3 | Barbara, Mrs. (Catherine David) | female | 45.0 | 0 | 1 | 14.4542 | C | Mrs | 1 |
395 | 1 | 3 | Sandstrom, Mrs. Hjalmar (Agnes Charlotta Bengt... | female | 24.0 | 0 | 2 | 16.7000 | S | Mrs | 1 |
417 | 1 | 2 | Drew, Mrs. James Vivian (Lulu Thorne Christian) | female | 34.0 | 1 | 1 | 32.5000 | S | Mrs | 1 |
424 | 0 | 3 | Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria ... | female | 28.0 | 1 | 1 | 14.4000 | S | Mrs | 1 |
438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.0 | 2 | 3 | 18.7500 | S | Mrs | 1 |
441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) | female | 45.0 | 1 | 1 | 26.2500 | S | Mrs | 1 |
473 | 1 | 2 | West, Mrs. Edwy Arthur (Ada Mary Worth) | female | 33.0 | 1 | 2 | 27.7500 | S | Mrs | 1 |
499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 151.5500 | S | Mrs | 1 |
507 | 1 | 2 | Quick, Mrs. Frederick Charles (Jane Richards) | female | 33.0 | 0 | 2 | 26.0000 | S | Mrs | 1 |
524 | 1 | 1 | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | female | 44.0 | 0 | 1 | 57.9792 | C | Mrs | 1 |
534 | 1 | 3 | Peter, Mrs. Catherine (Catherine Rizk) | female | 21.5 | 0 | 2 | 22.3583 | C | Mrs | 1 |
559 | 1 | 1 | Taussig, Mrs. Emil (Tillie Mandelbaum) | female | 39.0 | 1 | 1 | 79.6500 | S | Mrs | 1 |
568 | 0 | 3 | Palsson, Mrs. Nils (Alma Cornelia Berglund) | female | 29.0 | 0 | 4 | 21.0750 | S | Mrs | 1 |
582 | 1 | 1 | Thayer, Mrs. John Borland (Marian Longstreth M... | female | 39.0 | 1 | 1 | 110.8833 | C | Mrs | 1 |
601 | 1 | 2 | Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr... | female | 24.0 | 2 | 1 | 27.0000 | S | Mrs | 1 |
609 | 1 | 2 | Laroche, Mrs. Joseph (Juliette Marie Louise La... | female | 22.0 | 1 | 2 | 41.5792 | C | Mrs | 1 |
611 | 0 | 3 | Andersson, Mrs. Anders Johan (Alfrida Konstant... | female | 39.0 | 1 | 5 | 31.2750 | S | Mrs | 1 |
639 | 0 | 3 | Panula, Mrs. Juha (Maria Emilia Ojala) | female | 41.0 | 0 | 5 | 39.6875 | S | Mrs | 1 |
658 | 0 | 3 | Bourke, Mrs. John (Catherine) | female | 32.0 | 1 | 1 | 15.5000 | Q | Mrs | 1 |
671 | 1 | 2 | Brown, Mrs. Thomas William Solomon (Elizabeth ... | female | 40.0 | 1 | 1 | 39.0000 | S | Mrs | 1 |
679 | 0 | 3 | Goodwin, Mrs. Frederick (Augusta Tyler) | female | 43.0 | 1 | 6 | 46.9000 | S | Mrs | 1 |
737 | 0 | 3 | Ford, Mrs. Edward (Margaret Ann Watson) | female | 48.0 | 1 | 3 | 34.3750 | S | Mrs | 1 |
755 | 1 | 2 | Herman, Mrs. Samuel (Jane Laver) | female | 48.0 | 1 | 2 | 65.0000 | S | Mrs | 1 |
764 | 1 | 1 | Carter, Mrs. William Ernest (Lucile Polk) | female | 36.0 | 1 | 2 | 120.0000 | S | Mrs | 1 |
775 | 1 | 2 | Hocking, Mrs. Elizabeth (Eliza Needs) | female | 54.0 | 1 | 3 | 23.0000 | S | Mrs | 1 |
780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.0 | 0 | 1 | 211.3375 | S | Mrs | 1 |
800 | 0 | 3 | Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go... | female | 30.0 | 1 | 1 | 24.1500 | S | Mrs | 1 |
802 | 1 | 2 | Collyer, Mrs. Harvey (Charlotte Annie Tate) | female | 31.0 | 1 | 1 | 26.2500 | S | Mrs | 1 |
821 | 1 | 1 | Hays, Mrs. Charles Melville (Clara Jennings Gr... | female | 52.0 | 1 | 1 | 93.5000 | S | Mrs | 1 |
824 | 1 | 3 | Moor, Mrs. (Beila) | female | 27.0 | 0 | 1 | 12.4750 | S | Mrs | 1 |
856 | 1 | 3 | Aks, Mrs. Sam (Leah Rosen) | female | 18.0 | 0 | 1 | 9.3500 | S | Mrs | 1 |
857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.0 | 1 | 1 | 164.8667 | S | Mrs | 1 |
859 | 1 | 3 | Baclini, Mrs. Solomon (Latifa Qurban) | female | 24.0 | 0 | 3 | 19.2583 | C | Mrs | 1 |
872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 52.5542 | S | Mrs | 1 |
880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 83.1583 | C | Mrs | 1 |
881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 26.0000 | S | Mrs | 1 |
886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 29.1250 | Q | Mrs | 1 |