Ejercicio misceláneo: Análisis exploratorio y Feature Engineering

import pandas as pd
import numpy as np
import os
from scipy import stats
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [1], in <module>
      2 import numpy as np
      3 import os
----> 4 from scipy import stats

ModuleNotFoundError: No module named 'scipy'
ruta_archivo = os.path.join("titanic", "train.csv")
df = pd.read_csv(os.path.join("titanic", "train.csv"), index_col='PassengerId')
df.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

1. Elimine las variables/columnas “Ticket” y “Cabin”

df.drop(columns=['Cabin'], inplace=True)
df.drop(columns=['Ticket'], inplace=True)
df.head()
Survived Pclass Name Sex Age SibSp Parch Fare Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S
df.describe(include='all')
Survived Pclass Name Sex Age SibSp Parch Fare Embarked
count 892.000000 892.000000 892 892 715.000000 892.000000 892.000000 891.000000 890
unique NaN NaN 892 2 NaN NaN NaN NaN 3
top NaN NaN Jonsson, Mr. Carl male NaN NaN NaN NaN S
freq NaN NaN 1 578 NaN NaN NaN NaN 645
mean 0.383408 2.309417 NaN NaN 29.742196 0.522422 0.381166 32.204208 NaN
std 0.486489 0.835923 NaN NaN 14.561951 1.102264 0.805706 49.693429 NaN
min 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 0.000000 NaN
25% 0.000000 2.000000 NaN NaN 20.250000 0.000000 0.000000 7.910400 NaN
50% 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 14.454200 NaN
75% 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 31.000000 NaN
max 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 512.329200 NaN

2. Encuentre los mejores valores para completar la variable “Embarked” para los pasajeros con datos faltantes

df[df.Embarked.isnull()]
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title IsMother
PassengerId
#Tome en cuenta que simplemente emplear el valor que más aparece casi nunca es la mejor alternativa
#Para obtener mejores resultados deberá apoyarse en los resultados de un análisis exploratorio de datos
#aplicado a las variable(s) que puedan estar relacionadas con 'Embarked'
df['Embarked'].value_counts()
S    645
C    168
Q     77
Name: Embarked, dtype: int64
# Obtenemos el puerto más común según la clase y el sexo
df.groupby(['Pclass', 'Sex'])['Embarked'].agg(pd.Series.mode)
Pclass  Sex   
1       female    S
        male      S
2       female    S
        male      S
3       female    S
        male      S
Name: Embarked, dtype: object
# Como podemos observar, todos los valores que se obtuvieron en la celda de arriba pertenecen al puerto 'S'
# Entonces, reemplazaremos los valores NaN por el puerto 'S'
df['Embarked'].fillna('S', inplace=True)
df[df.Embarked.isnull()]
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title IsMother
PassengerId

3. Complete el código de la siguiente función para extraer el título de cada pasajero

def extractTitle(name):
    title_mapping = {'mr' : 'Mr', 
               'mrs' : 'Mrs', 
               'miss' : 'Miss', 
               'master' : 'Master',
               'don' : 'Sir',
               'rev' : 'Sir',
               'dr' : 'Officer',
               'mme' : 'Mrs',
               'ms' : 'Mrs',
               'major' : 'Officer',
               'lady' : 'Lady',
               'sir' : 'Sir',
               'mlle' : 'Miss',
               'col' : 'Officer',
               'capt' : 'Officer',
               'the countess' : 'Lady',
               'jonkheer' : 'Sir',
               'dona' : 'Lady'
                 }
    title = name.split(', ')[1].split('.')[0].lower()
    return title_mapping[title]
df['Title'] =  df['Name'].map(lambda name : extractTitle(name))
df.head()
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S Mr
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C Mrs
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S Miss
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S Mrs
5 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S Mr
df.describe(include='all')
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title
count 892.000000 892.000000 892 892 715.000000 892.000000 892.000000 891.000000 890 892
unique NaN NaN 892 2 NaN NaN NaN NaN 3 7
top NaN NaN Simmons, Mr. John male NaN NaN NaN NaN S Mr
freq NaN NaN 1 578 NaN NaN NaN NaN 645 518
mean 0.383408 2.309417 NaN NaN 29.742196 0.522422 0.381166 32.204208 NaN NaN
std 0.486489 0.835923 NaN NaN 14.561951 1.102264 0.805706 49.693429 NaN NaN
min 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 0.000000 NaN NaN
25% 0.000000 2.000000 NaN NaN 20.250000 0.000000 0.000000 7.910400 NaN NaN
50% 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 14.454200 NaN NaN
75% 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 31.000000 NaN NaN
max 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 512.329200 NaN NaN

4. Encuentre los mejores valores para completar la variable “Age” para los pasajeros con datos faltantes

Tome en cuenta que simplemente emplear el valor de tendencia central (mediana) de todo el conjunto casi nunca es la mejor alternativa Para obtener mejores resultados deberá apoyarse en los resultados de un análisis exploratorio de datos aplicado a las variable(s) que puedan estar relacionadas con “Age”

df[df['Age'].isnull()].head()
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title
PassengerId
6 0 3 Moran, Mr. James male NaN 0 0 8.4583 Q Mr
18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 13.0000 S Mr
20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 7.2250 C Mrs
27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 7.2250 C Mr
29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 7.8792 Q Miss
df[df['Age'].isnull()].shape
(177, 10)
df['Age'].median()
28.0
df.corr()
Survived Pclass Age SibSp Parch Fare
Survived 1.000000 -0.338965 -0.079388 -0.034886 0.082010 0.257307
Pclass -0.338965 1.000000 -0.365161 0.082599 0.017994 -0.549500
Age -0.079388 -0.365161 1.000000 -0.308847 -0.189989 0.096067
SibSp -0.034886 0.082599 -0.308847 1.000000 0.414985 0.159651
Parch 0.082010 0.017994 -0.189989 0.414985 1.000000 0.216225
Fare 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000

Según la matriz de correlación, Age tiene la correlación más alta con Pclass. Entonces, agruparemos los datos según este parámetro. También los agruparemos según el sexo, o sea, que los datos estarán agrupados por clase y por sexo. Porque las edades de las personas pueden variar según su sexo.

pasajeros_por_clase_y_sexo = df.groupby(['Pclass', 'Sex'])
mediana_age_pasajeros = pasajeros_por_clase_y_sexo['Age'].transform('median')
# Reemplazar los datos nulos de Age
df['Age'].fillna(mediana_age_pasajeros, inplace=True)

df[df['Age'].isnull()].head()
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title
PassengerId

5. Cree una nueva variable “IsMother” (1=es madre, 0=no es madre)

Responda la siguiente pregunta sustentando su respuesta con los resultados de un análisis exploratorio de datos.

Las madres tuvieron mayor probabilidad de sobrevivir al accidente del Titanic?

Elegimos a las madres como las mujeres que están casadas y que viajen al menos con un Parch(parientes que sean padres o hijos).

No incluímos a las mujeres con el título de Miss, para evitar incluir a las mujeres que no son madres pero que viajan con sus papás.

Hay que tener en cuenta que en esa época no habían muchas madres solteras (por el machismo), así que no se pierde mucha información en caso de que una mujer sea madre y no esté casada.

def es_mujer_y_no_soltera_y_viaja_con_hijos(fila):
    return (fila['Parch'] > 0) & (fila['Sex'] == 'female') & (fila['Title'] != 'Miss')

def es_madre(fila):
    if es_mujer_y_no_soltera_y_viaja_con_hijos(fila):
        return 1
    return 0

df['IsMother'] = df.apply(es_madre, axis=1)
df.tail()
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title IsMother
PassengerId
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.00 S Miss 0
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female 21.5 1 2 23.45 S Miss 0
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.00 C Mr 0
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.75 Q Mr 0
1044 0 3 Storey, Mr. Thomas male 60.5 0 0 NaN S Mr 0
df.loc[(df['IsMother'] == 1)]
Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title IsMother
PassengerId
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 11.1333 S Mrs 1
26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 31.3875 S Mrs 1
99 1 2 Doling, Mrs. John T (Ada Julia Bone) female 34.0 0 1 23.0000 S Mrs 1
141 0 3 Boulos, Mrs. Joseph (Sultana) female 21.5 0 2 15.2458 C Mrs 1
167 1 1 Chibnall, Mrs. (Edith Martha Bowerman) female 35.0 0 1 55.0000 S Mrs 1
168 0 3 Skoog, Mrs. William (Anna Bernhardina Karlsson) female 45.0 1 4 27.9000 S Mrs 1
248 1 2 Hamalainen, Mrs. William (Anna) female 24.0 0 2 14.5000 S Mrs 1
252 0 3 Strom, Mrs. Wilhelm (Elna Matilda Persson) female 29.0 1 1 10.4625 S Mrs 1
255 0 3 Rosblom, Mrs. Viktor (Helena Wilhelmina) female 41.0 0 2 20.2125 S Mrs 1
256 1 3 Touma, Mrs. Darwis (Hanne Youssef Razi) female 29.0 0 2 15.2458 C Mrs 1
260 1 2 Parrish, Mrs. (Lutie Davis) female 50.0 0 1 26.0000 S Mrs 1
269 1 1 Graham, Mrs. William Thompson (Edith Junkins) female 58.0 0 1 153.4625 S Mrs 1
273 1 2 Mellinger, Mrs. (Elizabeth Anne Maidment) female 41.0 0 1 19.5000 S Mrs 1
280 1 3 Abbott, Mrs. Stanton (Rosa Hunt) female 35.0 1 1 20.2500 S Mrs 1
300 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0 1 247.5208 C Mrs 1
313 0 2 Lahtinen, Mrs. William (Anna Sylfven) female 26.0 1 1 26.0000 S Mrs 1
320 1 1 Spedden, Mrs. Frederic Oakley (Margaretta Corn... female 40.0 1 1 134.5000 C Mrs 1
324 1 2 Caldwell, Mrs. Albert Francis (Sylvia Mae Harb... female 22.0 1 1 29.0000 S Mrs 1
329 1 3 Goldsmith, Mrs. Frank John (Emily Alice Brown) female 31.0 1 1 20.5250 S Mrs 1
363 0 3 Barbara, Mrs. (Catherine David) female 45.0 0 1 14.4542 C Mrs 1
395 1 3 Sandstrom, Mrs. Hjalmar (Agnes Charlotta Bengt... female 24.0 0 2 16.7000 S Mrs 1
417 1 2 Drew, Mrs. James Vivian (Lulu Thorne Christian) female 34.0 1 1 32.5000 S Mrs 1
424 0 3 Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria ... female 28.0 1 1 14.4000 S Mrs 1
438 1 2 Richards, Mrs. Sidney (Emily Hocking) female 24.0 2 3 18.7500 S Mrs 1
441 1 2 Hart, Mrs. Benjamin (Esther Ada Bloomfield) female 45.0 1 1 26.2500 S Mrs 1
473 1 2 West, Mrs. Edwy Arthur (Ada Mary Worth) female 33.0 1 2 27.7500 S Mrs 1
499 0 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1 2 151.5500 S Mrs 1
507 1 2 Quick, Mrs. Frederick Charles (Jane Richards) female 33.0 0 2 26.0000 S Mrs 1
524 1 1 Hippach, Mrs. Louis Albert (Ida Sophia Fischer) female 44.0 0 1 57.9792 C Mrs 1
534 1 3 Peter, Mrs. Catherine (Catherine Rizk) female 21.5 0 2 22.3583 C Mrs 1
559 1 1 Taussig, Mrs. Emil (Tillie Mandelbaum) female 39.0 1 1 79.6500 S Mrs 1
568 0 3 Palsson, Mrs. Nils (Alma Cornelia Berglund) female 29.0 0 4 21.0750 S Mrs 1
582 1 1 Thayer, Mrs. John Borland (Marian Longstreth M... female 39.0 1 1 110.8833 C Mrs 1
601 1 2 Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr... female 24.0 2 1 27.0000 S Mrs 1
609 1 2 Laroche, Mrs. Joseph (Juliette Marie Louise La... female 22.0 1 2 41.5792 C Mrs 1
611 0 3 Andersson, Mrs. Anders Johan (Alfrida Konstant... female 39.0 1 5 31.2750 S Mrs 1
639 0 3 Panula, Mrs. Juha (Maria Emilia Ojala) female 41.0 0 5 39.6875 S Mrs 1
658 0 3 Bourke, Mrs. John (Catherine) female 32.0 1 1 15.5000 Q Mrs 1
671 1 2 Brown, Mrs. Thomas William Solomon (Elizabeth ... female 40.0 1 1 39.0000 S Mrs 1
679 0 3 Goodwin, Mrs. Frederick (Augusta Tyler) female 43.0 1 6 46.9000 S Mrs 1
737 0 3 Ford, Mrs. Edward (Margaret Ann Watson) female 48.0 1 3 34.3750 S Mrs 1
755 1 2 Herman, Mrs. Samuel (Jane Laver) female 48.0 1 2 65.0000 S Mrs 1
764 1 1 Carter, Mrs. William Ernest (Lucile Polk) female 36.0 1 2 120.0000 S Mrs 1
775 1 2 Hocking, Mrs. Elizabeth (Eliza Needs) female 54.0 1 3 23.0000 S Mrs 1
780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton Mc... female 43.0 0 1 211.3375 S Mrs 1
800 0 3 Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go... female 30.0 1 1 24.1500 S Mrs 1
802 1 2 Collyer, Mrs. Harvey (Charlotte Annie Tate) female 31.0 1 1 26.2500 S Mrs 1
821 1 1 Hays, Mrs. Charles Melville (Clara Jennings Gr... female 52.0 1 1 93.5000 S Mrs 1
824 1 3 Moor, Mrs. (Beila) female 27.0 0 1 12.4750 S Mrs 1
856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.0 0 1 9.3500 S Mrs 1
857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45.0 1 1 164.8667 S Mrs 1
859 1 3 Baclini, Mrs. Solomon (Latifa Qurban) female 24.0 0 3 19.2583 C Mrs 1
872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 52.5542 S Mrs 1
880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 83.1583 C Mrs 1
881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 26.0000 S Mrs 1
886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 29.1250 Q Mrs 1