Data Reshaping

Antes de analizar los datos, se necesita formar los datos obtenidos en un formato regular y que sea procesable por el algoritumo que luego utilizaremos. Es necesario asegurar que todos los datos correspondan con las variables. También, es necesario lidiar con los valores nulos, si es que hubiese. En términos generales, se puede decir que Data Reshaping es cambiar la manera en que los datos están organizados en coumnas y filas.

  1. Join

  2. Union

  3. Stack, Unstack

  4. Pivot

  5. Melt

Join

Join, o merge, es el proceso de unir dos DataFrame diferentes en uno solo. Por ejemplo, si tenemos dos DataFrames que contienen diferente información pero sobre los mismos clientes, podemos unirlos en uno solo.

Aquí algunos enlaces extras:

pandas.merge

Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1

Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 2

Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 3

import pandas as pd
import numpy as np
import os
mall_customers_info = pd.read_csv(os.path.join("csv", "mall_customers_info.csv"))
mall_customers_info.tail()
CustomerID Gender Age Annual_Income
146 147 Male 48 77
147 148 Female 32 77
148 149 Female 34 78
149 150 Male 34 78
150 151 Female 48 80
mall_customers_info.shape
(151, 4)
mall_customers_score = pd.read_csv(os.path.join("csv", "mall_customers_score.csv"))
mall_customers_score.tail()
CustomerID Spending Score (1-100)
146 147 36
147 148 74
148 149 22
149 150 90
150 152 95
mall_customers_score.shape
(151, 2)
#customer_data = pd.merge(mall_customers_info[['CustomerID','Gender','Annual_Income']], mall_customers_score, on='CustomerID')
customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID')
customer_data.tail()
CustomerID Gender Age Annual_Income Spending Score (1-100)
145 146 Male 28 77 97
146 147 Male 48 77 36
147 148 Female 32 77 74
148 149 Female 34 78 22
149 150 Male 34 78 90
customer_data.shape
(150, 5)
customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='left')
customer_data.tail()
CustomerID Gender Age Annual_Income Spending Score (1-100)
146 147 Male 48 77 36.0
147 148 Female 32 77 74.0
148 149 Female 34 78 22.0
149 150 Male 34 78 90.0
150 151 Female 48 80 NaN
customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='right')
customer_data.tail()
CustomerID Gender Age Annual_Income Spending Score (1-100)
146 147 Male 48.0 77.0 36
147 148 Female 32.0 77.0 74
148 149 Female 34.0 78.0 22
149 150 Male 34.0 78.0 90
150 152 NaN NaN NaN 95
customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='outer')
customer_data.tail()
CustomerID Gender Age Annual_Income Spending Score (1-100)
147 148 Female 32.0 77.0 74.0
148 149 Female 34.0 78.0 22.0
149 150 Male 34.0 78.0 90.0
150 151 Female 48.0 80.0 NaN
151 152 NaN NaN NaN 95.0
customer_data = pd.merge(mall_customers_info,mall_customers_score,on='CustomerID', how='inner')
customer_data.tail()
CustomerID Gender Age Annual_Income Spending Score (1-100)
145 146 Male 28 77 97
146 147 Male 48 77 36
147 148 Female 32 77 74
148 149 Female 34 78 22
149 150 Male 34 78 90
customer_data.shape
(150, 5)

Union

pandas.concat

mall_customers_more = pd.read_csv(os.path.join("csv", "customers_data_2.csv"))
mall_customers_more.head()
CustomerID Gender Age Annual_Income Spending Score (1-100)
0 151 Male 43 78 17
1 152 Male 39 78 88
2 153 Female 44 78 20
3 154 Female 38 78 76
4 155 Female 47 78 16
mall_customers_more.shape
(50, 5)
customers_data_all = pd.concat([customer_data, mall_customers_more])
#customers_data_all.sample(10)
customers_data_all.tail(10)
CustomerID Gender Age Annual_Income Spending Score (1-100)
40 191 Female 34 103 23
41 192 Female 32 103 69
42 193 Male 33 113 8
43 194 Female 38 113 91
44 195 Female 47 120 16
45 196 Female 35 120 79
46 197 Female 45 115 28
47 198 Male 32 115 74
48 199 Male 32 170 70
49 200 Male 30 165 73
customers_data_all.shape
(200, 5)
customers_data_all.reset_index(inplace=True, drop=True)
#customers_data_all.sample(10)
customers_data_all.tail(10)
CustomerID Gender Age Annual_Income Spending Score (1-100)
190 191 Female 34 103 23
191 192 Female 32 103 69
192 193 Male 33 113 8
193 194 Female 38 113 91
194 195 Female 47 120 16
195 196 Female 35 120 79
196 197 Female 45 115 28
197 198 Male 32 115 74
198 199 Male 32 170 70
199 200 Male 30 165 73

Stack, Unstack

Reshape using Stack() and unstack() function in Pandas python

datos_mensuales = pd.read_csv('./csv/monthly_data.csv')
datos_mensuales.head(5)
YYYY JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC YEAR
0 2008 10140 10239 10050 10111 10159 10159 10141 10117 10178 10148 10125 10182 10146
1 2009 10137 10140 10140 10141 10188 10168 10128 10165 10208 10166 10041 10068 10141
2 2010 10151 10034 10168 10194 10158 10166 10158 10129 10147 10135 10057 10133 10136
3 2011 10182 10161 10227 10192 10182 10154 10123 10130 10149 10182 10194 10099 10165
4 2012 10194 10286 10271 10053 10159 10127 10139 10155 10149 10109 10108 10085 10153
# Preparacion: usar 'YYYY' como el ID/Indice
datos_mensuales.set_index('YYYY', inplace=True)
datos_mensuales.head(5)
JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC YEAR
YYYY
2008 10140 10239 10050 10111 10159 10159 10141 10117 10178 10148 10125 10182 10146
2009 10137 10140 10140 10141 10188 10168 10128 10165 10208 10166 10041 10068 10141
2010 10151 10034 10168 10194 10158 10166 10158 10129 10147 10135 10057 10133 10136
2011 10182 10161 10227 10192 10182 10154 10123 10130 10149 10182 10194 10099 10165
2012 10194 10286 10271 10053 10159 10127 10139 10155 10149 10109 10108 10085 10153
#En valor en cada columna se transforma en una fila
datos_mensuales.stack()
YYYY      
2008  JAN     10140
      FEB     10239
      MAR     10050
      APR     10111
      MAY     10159
              ...  
2017  SEP     10135
      OCT     10176
      NOV     10141
      DEC     10120
      YEAR    10160
Length: 130, dtype: int64
athletes = pd.read_csv('./csv/athletes.csv')
athletes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11538 entries, 0 to 11537
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             11538 non-null  int64  
 1   name           11538 non-null  object 
 2   nationality    11538 non-null  object 
 3   sex            11538 non-null  object 
 4   date_of_birth  11538 non-null  object 
 5   height         11208 non-null  float64
 6   weight         10879 non-null  float64
 7   sport          11538 non-null  object 
 8   gold           11538 non-null  int64  
 9   silver         11538 non-null  int64  
 10  bronze         11538 non-null  int64  
 11  info           131 non-null    object 
dtypes: float64(2), int64(4), object(6)
memory usage: 1.1+ MB
athletes.head(5)
id name nationality sex date_of_birth height weight sport gold silver bronze info
0 736041664 A Jesus Garcia ESP male 1969-10-17 1.72 64.0 athletics 0 0 0 NaN
1 532037425 A Lam Shin KOR female 1986-09-23 1.68 56.0 fencing 0 0 0 NaN
2 435962603 Aaron Brown CAN male 1992-05-27 1.98 79.0 athletics 0 0 1 NaN
3 521041435 Aaron Cook MDA male 1991-01-02 1.83 80.0 taekwondo 0 0 0 NaN
4 33922579 Aaron Gate NZL male 1990-11-26 1.81 71.0 cycling 0 0 0 NaN
weight_mean_by_sport_and_sex = athletes.groupby(['sport', 'sex'])['weight'].mean()
weight_mean_by_sport_and_sex
sport              sex   
aquatics           female     62.284483
                   male       82.219061
archery            female     64.301587
                   male       80.079365
athletics          female     60.152542
                   male       74.777680
badminton          female     61.209877
                   male       76.156627
basketball         female     75.377622
                   male      100.297872
boxing             female           NaN
                   male             NaN
canoe              female     66.457944
                   male       82.150000
cycling            female     60.207254
                   male       72.576052
equestrian         female     58.634146
                   male       72.954887
fencing            female     62.733871
                   male       78.785124
football           female     61.061069
                   male       74.451713
golf               female     63.200000
                   male       79.000000
gymnastics         female     49.555024
                   male       63.254545
handball           female     70.789157
                   male       95.431694
hockey             female     60.425926
                   male       77.375000
judo               female     65.392157
                   male       84.616740
modern pentathlon  female     58.000000
                   male       73.916667
rowing             female     69.776190
                   male       86.504615
rugby sevens       female     66.595890
                   male       90.450331
sailing            female     63.194969
                   male       77.122066
shooting           female     62.675676
                   male       81.068966
table tennis       female     57.530120
                   male       72.558140
taekwondo          female     61.258065
                   male       74.809524
tennis             female     64.670455
                   male       80.417476
triathlon          female     54.563636
                   male       66.814815
volleyball         female     70.684211
                   male       89.421875
weightlifting      female     68.788462
                   male       87.538961
wrestling          female     61.805310
                   male       85.372881
Name: weight, dtype: float64
#Mueve cada valor del ultimo nivel de un indice mutinivel a un columna
weight_mean_by_sport_and_sex.unstack()
sex female male
sport
aquatics 62.284483 82.219061
archery 64.301587 80.079365
athletics 60.152542 74.777680
badminton 61.209877 76.156627
basketball 75.377622 100.297872
boxing NaN NaN
canoe 66.457944 82.150000
cycling 60.207254 72.576052
equestrian 58.634146 72.954887
fencing 62.733871 78.785124
football 61.061069 74.451713
golf 63.200000 79.000000
gymnastics 49.555024 63.254545
handball 70.789157 95.431694
hockey 60.425926 77.375000
judo 65.392157 84.616740
modern pentathlon 58.000000 73.916667
rowing 69.776190 86.504615
rugby sevens 66.595890 90.450331
sailing 63.194969 77.122066
shooting 62.675676 81.068966
table tennis 57.530120 72.558140
taekwondo 61.258065 74.809524
tennis 64.670455 80.417476
triathlon 54.563636 66.814815
volleyball 70.684211 89.421875
weightlifting 68.788462 87.538961
wrestling 61.805310 85.372881

Pivot

pandas.DataFrame.pivot

#El ID se repite para cada uno de las propiedaded del producto :/
products = pd.DataFrame({'id': [823905, 823905,
                         235897, 235897, 235897,
                         983422, 983422],
                  'item': ['prize', 'unit', 
                           'prize', 'unit', 'stock', 
                           'prize', 'stock'],
                  'value': [3.49, 'kg',
                            12.89, 'l', 50,
                            0.49, 4]})
products
id item value
0 823905 prize 3.49
1 823905 unit kg
2 235897 prize 12.89
3 235897 unit l
4 235897 stock 50
5 983422 prize 0.49
6 983422 stock 4

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html

De manera similar a unstack, sirve para mover fila como columnas y así construir un DF con más columnas y menos filas

Usar id como el indice de cada fila y los valores en la columna “item” para crear columnas. Los valores que debe aparecer en cada columnas esta en la columna “value

products.pivot(index='id', columns='item', values='value')
item prize stock unit
id
235897 12.89 50 l
823905 3.49 NaN kg
983422 0.49 4 NaN
stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')
stocks.head(10)
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017
5 2019-03-01 AAPL 174.28 175.15 172.89 174.97 25886167
6 2019-03-04 AAPL 175.69 177.75 173.97 175.85 27436203
7 2019-03-05 AAPL 175.94 176.00 174.54 175.53 19737419
8 2019-03-06 AAPL 174.67 175.49 173.94 174.52 20810384
9 2019-03-07 AAPL 173.87 174.44 172.02 172.50 24796374
stocks.pivot(index='symbol', columns='date', values='volume')
date 2019-03-01 2019-03-04 2019-03-05 2019-03-06 2019-03-07
symbol
AAPL 25886167 27436203 19737419 20810384 24796374
AMZN 4974877 6167358 3681522 3996001 4957017
GOOG 1450316 1446047 1443174 1099289 1166559

Melt

pandas.DataFrame.melt

melt() hace lo opuesto a pivot()

grades = pd.DataFrame([[6, 4, 5], [7, 8, 7], [6, 7, 9], [6, 5, 5], [5, 2, 7]], 
                       index = ['Mary', 'John', 'Ann', 'Pete', 'Laura'],
                       columns = ['test_1', 'test_2', 'test_3'])
grades.reset_index(inplace=True)
grades
index test_1 test_2 test_3
0 Mary 6 4 5
1 John 7 8 7
2 Ann 6 7 9
3 Pete 6 5 5
4 Laura 5 2 7
grades.melt(id_vars=['index']) # indicar las columna que identifican a cada entidad
index variable value
0 Mary test_1 6
1 John test_1 7
2 Ann test_1 6
3 Pete test_1 6
4 Laura test_1 5
5 Mary test_2 4
6 John test_2 8
7 Ann test_2 7
8 Pete test_2 5
9 Laura test_2 2
10 Mary test_3 5
11 John test_3 7
12 Ann test_3 9
13 Pete test_3 5
14 Laura test_3 7