Data Reshaping
Before analyzing the data, we need to shape what we collected into a regular format that the algorithm we will later use can process. We must make sure that every value corresponds to the right variable, and we also have to deal with null values, if there are any. Broadly speaking, data reshaping means changing the way the data is organized into columns and rows.
Join¶
A join, or merge, is the process of combining two different DataFrames into one. For example, if we have two DataFrames that hold different information about the same customers, we can merge them into a single DataFrame.
Some extra links:
Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 1
Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 2
Pandas : How to Merge Dataframes using Dataframe.merge() in Python – Part 3
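Before working with the mall customer files, here is a minimal sketch of the idea with made-up data: two DataFrames that share a key column are matched row by row on that key.

```python
import pandas as pd

# Two small DataFrames that share a 'CustomerID' key (illustrative data)
info = pd.DataFrame({'CustomerID': [1, 2, 3], 'Age': [25, 32, 41]})
score = pd.DataFrame({'CustomerID': [1, 2, 3], 'Score': [80, 55, 97]})

# merge() matches rows by the key and combines the columns of both frames
combined = pd.merge(info, score, on='CustomerID')
print(combined)  # each CustomerID row now carries both Age and Score
```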
import pandas as pd
import numpy as np
import os
mall_customers_info = pd.read_csv(os.path.join("csv", "mall_customers_info.csv"))
mall_customers_info.tail()
CustomerID | Gender | Age | Annual_Income | |
---|---|---|---|---|
146 | 147 | Male | 48 | 77 |
147 | 148 | Female | 32 | 77 |
148 | 149 | Female | 34 | 78 |
149 | 150 | Male | 34 | 78 |
150 | 151 | Female | 48 | 80 |
mall_customers_info.shape
(151, 4)
mall_customers_score = pd.read_csv(os.path.join("csv", "mall_customers_score.csv"))
mall_customers_score.tail()
CustomerID | Spending Score (1-100) | |
---|---|---|
146 | 147 | 36 |
147 | 148 | 74 |
148 | 149 | 22 |
149 | 150 | 90 |
150 | 152 | 95 |
mall_customers_score.shape
(151, 2)
#customer_data = pd.merge(mall_customers_info[['CustomerID','Gender','Annual_Income']], mall_customers_score, on='CustomerID')
# With no 'how' argument, merge() performs an inner join by default
customer_data = pd.merge(mall_customers_info, mall_customers_score, on='CustomerID')
customer_data.tail()
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
145 | 146 | Male | 28 | 77 | 97 |
146 | 147 | Male | 48 | 77 | 36 |
147 | 148 | Female | 32 | 77 | 74 |
148 | 149 | Female | 34 | 78 | 22 |
149 | 150 | Male | 34 | 78 | 90 |
customer_data.shape
(150, 5)
# A left join keeps every row of mall_customers_info; customers without a score get NaN
customer_data = pd.merge(mall_customers_info, mall_customers_score, on='CustomerID', how='left')
customer_data.tail()
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
146 | 147 | Male | 48 | 77 | 36.0 |
147 | 148 | Female | 32 | 77 | 74.0 |
148 | 149 | Female | 34 | 78 | 22.0 |
149 | 150 | Male | 34 | 78 | 90.0 |
150 | 151 | Female | 48 | 80 | NaN |
# A right join keeps every row of mall_customers_score; IDs missing from the info table get NaN
customer_data = pd.merge(mall_customers_info, mall_customers_score, on='CustomerID', how='right')
customer_data.tail()
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
146 | 147 | Male | 48.0 | 77.0 | 36 |
147 | 148 | Female | 32.0 | 77.0 | 74 |
148 | 149 | Female | 34.0 | 78.0 | 22 |
149 | 150 | Male | 34.0 | 78.0 | 90 |
150 | 152 | NaN | NaN | NaN | 95 |
# An outer join keeps every row from both sides, filling the gaps with NaN
customer_data = pd.merge(mall_customers_info, mall_customers_score, on='CustomerID', how='outer')
customer_data.tail()
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
147 | 148 | Female | 32.0 | 77.0 | 74.0 |
148 | 149 | Female | 34.0 | 78.0 | 22.0 |
149 | 150 | Male | 34.0 | 78.0 | 90.0 |
150 | 151 | Female | 48.0 | 80.0 | NaN |
151 | 152 | NaN | NaN | NaN | 95.0 |
# An inner join (the default) keeps only the CustomerIDs present in both DataFrames
customer_data = pd.merge(mall_customers_info, mall_customers_score, on='CustomerID', how='inner')
customer_data.tail()
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
145 | 146 | Male | 28 | 77 | 97 |
146 | 147 | Male | 48 | 77 | 36 |
147 | 148 | Female | 32 | 77 | 74 |
148 | 149 | Female | 34 | 78 | 22 |
149 | 150 | Male | 34 | 78 | 90 |
customer_data.shape
(150, 5)
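To see exactly which keys caused the row counts above to differ, `merge()` accepts an `indicator` parameter that labels each output row as coming from the left input, the right input, or both. A sketch with made-up keys mimicking the situation in these files:

```python
import pandas as pd

# CustomerID 151 only exists on the left, 152 only on the right (illustrative data)
left = pd.DataFrame({'CustomerID': [149, 150, 151], 'Age': [34, 34, 48]})
right = pd.DataFrame({'CustomerID': [149, 150, 152], 'Score': [22, 90, 95]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
audit = pd.merge(left, right, on='CustomerID', how='outer', indicator=True)
print(audit[['CustomerID', '_merge']])
```

Filtering on `_merge` is a quick way to audit which rows an inner join would drop.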
Union¶
A union stacks DataFrames that share the same columns on top of each other; in pandas this is done with pd.concat().
mall_customers_more = pd.read_csv(os.path.join("csv", "customers_data_2.csv"))
mall_customers_more.head()
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
0 | 151 | Male | 43 | 78 | 17 |
1 | 152 | Male | 39 | 78 | 88 |
2 | 153 | Female | 44 | 78 | 20 |
3 | 154 | Female | 38 | 78 | 76 |
4 | 155 | Female | 47 | 78 | 16 |
mall_customers_more.shape
(50, 5)
customers_data_all = pd.concat([customer_data, mall_customers_more])
#customers_data_all.sample(10)
customers_data_all.tail(10)
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
40 | 191 | Female | 34 | 103 | 23 |
41 | 192 | Female | 32 | 103 | 69 |
42 | 193 | Male | 33 | 113 | 8 |
43 | 194 | Female | 38 | 113 | 91 |
44 | 195 | Female | 47 | 120 | 16 |
45 | 196 | Female | 35 | 120 | 79 |
46 | 197 | Female | 45 | 115 | 28 |
47 | 198 | Male | 32 | 115 | 74 |
48 | 199 | Male | 32 | 170 | 70 |
49 | 200 | Male | 30 | 165 | 73 |
customers_data_all.shape
(200, 5)
# After concat, the original row indices repeat (0-149, then 0-49); rebuild a single 0-199 index
customers_data_all.reset_index(inplace=True, drop=True)
#customers_data_all.sample(10)
customers_data_all.tail(10)
CustomerID | Gender | Age | Annual_Income | Spending Score (1-100) | |
---|---|---|---|---|---|
190 | 191 | Female | 34 | 103 | 23 |
191 | 192 | Female | 32 | 103 | 69 |
192 | 193 | Male | 33 | 113 | 8 |
193 | 194 | Female | 38 | 113 | 91 |
194 | 195 | Female | 47 | 120 | 16 |
195 | 196 | Female | 35 | 120 | 79 |
196 | 197 | Female | 45 | 115 | 28 |
197 | 198 | Male | 32 | 115 | 74 |
198 | 199 | Male | 32 | 170 | 70 |
199 | 200 | Male | 30 | 165 | 73 |
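The separate `reset_index` step can also be folded into the concatenation itself: `pd.concat` accepts `ignore_index=True`, which builds the fresh 0..n-1 index directly. A sketch with toy frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# ignore_index=True discards both original indices and numbers the rows 0..3
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.index.tolist())
```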
Stack, Unstack¶
Reshape using Stack() and unstack() function in Pandas python
datos_mensuales = pd.read_csv('./csv/monthly_data.csv')
datos_mensuales.head(5)
YYYY | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008 | 10140 | 10239 | 10050 | 10111 | 10159 | 10159 | 10141 | 10117 | 10178 | 10148 | 10125 | 10182 | 10146 |
1 | 2009 | 10137 | 10140 | 10140 | 10141 | 10188 | 10168 | 10128 | 10165 | 10208 | 10166 | 10041 | 10068 | 10141 |
2 | 2010 | 10151 | 10034 | 10168 | 10194 | 10158 | 10166 | 10158 | 10129 | 10147 | 10135 | 10057 | 10133 | 10136 |
3 | 2011 | 10182 | 10161 | 10227 | 10192 | 10182 | 10154 | 10123 | 10130 | 10149 | 10182 | 10194 | 10099 | 10165 |
4 | 2012 | 10194 | 10286 | 10271 | 10053 | 10159 | 10127 | 10139 | 10155 | 10149 | 10109 | 10108 | 10085 | 10153 |
# Preparation: use 'YYYY' as the ID/index
datos_mensuales.set_index('YYYY', inplace=True)
datos_mensuales.head(5)
JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
YYYY | |||||||||||||
2008 | 10140 | 10239 | 10050 | 10111 | 10159 | 10159 | 10141 | 10117 | 10178 | 10148 | 10125 | 10182 | 10146 |
2009 | 10137 | 10140 | 10140 | 10141 | 10188 | 10168 | 10128 | 10165 | 10208 | 10166 | 10041 | 10068 | 10141 |
2010 | 10151 | 10034 | 10168 | 10194 | 10158 | 10166 | 10158 | 10129 | 10147 | 10135 | 10057 | 10133 | 10136 |
2011 | 10182 | 10161 | 10227 | 10192 | 10182 | 10154 | 10123 | 10130 | 10149 | 10182 | 10194 | 10099 | 10165 |
2012 | 10194 | 10286 | 10271 | 10053 | 10159 | 10127 | 10139 | 10155 | 10149 | 10109 | 10108 | 10085 | 10153 |
# The value in each column is turned into a row
datos_mensuales.stack()
YYYY
2008 JAN 10140
FEB 10239
MAR 10050
APR 10111
MAY 10159
...
2017 SEP 10135
OCT 10176
NOV 10141
DEC 10120
YEAR 10160
Length: 130, dtype: int64
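The stacked result is a Series with a two-level index (year, month). Calling `reset_index` on it produces a tidy long-format DataFrame with one row per year/month pair. A sketch with a small wide table of the same shape as the monthly data:

```python
import pandas as pd

# A tiny wide table: one row per year, one column per month (illustrative values)
wide = pd.DataFrame({'JAN': [10, 20], 'FEB': [11, 21]},
                    index=pd.Index([2008, 2009], name='YYYY'))

# stack() pivots columns into an inner index level; reset_index flattens it
tidy = wide.stack().reset_index()
tidy.columns = ['YYYY', 'month', 'value']
print(tidy)
```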
athletes = pd.read_csv('./csv/athletes.csv')
athletes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11538 entries, 0 to 11537
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 11538 non-null int64
1 name 11538 non-null object
2 nationality 11538 non-null object
3 sex 11538 non-null object
4 date_of_birth 11538 non-null object
5 height 11208 non-null float64
6 weight 10879 non-null float64
7 sport 11538 non-null object
8 gold 11538 non-null int64
9 silver 11538 non-null int64
10 bronze 11538 non-null int64
11 info 131 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 1.1+ MB
athletes.head(5)
id | name | nationality | sex | date_of_birth | height | weight | sport | gold | silver | bronze | info | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 736041664 | A Jesus Garcia | ESP | male | 1969-10-17 | 1.72 | 64.0 | athletics | 0 | 0 | 0 | NaN |
1 | 532037425 | A Lam Shin | KOR | female | 1986-09-23 | 1.68 | 56.0 | fencing | 0 | 0 | 0 | NaN |
2 | 435962603 | Aaron Brown | CAN | male | 1992-05-27 | 1.98 | 79.0 | athletics | 0 | 0 | 1 | NaN |
3 | 521041435 | Aaron Cook | MDA | male | 1991-01-02 | 1.83 | 80.0 | taekwondo | 0 | 0 | 0 | NaN |
4 | 33922579 | Aaron Gate | NZL | male | 1990-11-26 | 1.81 | 71.0 | cycling | 0 | 0 | 0 | NaN |
weight_mean_by_sport_and_sex = athletes.groupby(['sport', 'sex'])['weight'].mean()
weight_mean_by_sport_and_sex
sport sex
aquatics female 62.284483
male 82.219061
archery female 64.301587
male 80.079365
athletics female 60.152542
male 74.777680
badminton female 61.209877
male 76.156627
basketball female 75.377622
male 100.297872
boxing female NaN
male NaN
canoe female 66.457944
male 82.150000
cycling female 60.207254
male 72.576052
equestrian female 58.634146
male 72.954887
fencing female 62.733871
male 78.785124
football female 61.061069
male 74.451713
golf female 63.200000
male 79.000000
gymnastics female 49.555024
male 63.254545
handball female 70.789157
male 95.431694
hockey female 60.425926
male 77.375000
judo female 65.392157
male 84.616740
modern pentathlon female 58.000000
male 73.916667
rowing female 69.776190
male 86.504615
rugby sevens female 66.595890
male 90.450331
sailing female 63.194969
male 77.122066
shooting female 62.675676
male 81.068966
table tennis female 57.530120
male 72.558140
taekwondo female 61.258065
male 74.809524
tennis female 64.670455
male 80.417476
triathlon female 54.563636
male 66.814815
volleyball female 70.684211
male 89.421875
weightlifting female 68.788462
male 87.538961
wrestling female 61.805310
male 85.372881
Name: weight, dtype: float64
# unstack() moves each value of the innermost level of a multi-level index into a column
weight_mean_by_sport_and_sex.unstack()
sex | female | male |
---|---|---|
sport | ||
aquatics | 62.284483 | 82.219061 |
archery | 64.301587 | 80.079365 |
athletics | 60.152542 | 74.777680 |
badminton | 61.209877 | 76.156627 |
basketball | 75.377622 | 100.297872 |
boxing | NaN | NaN |
canoe | 66.457944 | 82.150000 |
cycling | 60.207254 | 72.576052 |
equestrian | 58.634146 | 72.954887 |
fencing | 62.733871 | 78.785124 |
football | 61.061069 | 74.451713 |
golf | 63.200000 | 79.000000 |
gymnastics | 49.555024 | 63.254545 |
handball | 70.789157 | 95.431694 |
hockey | 60.425926 | 77.375000 |
judo | 65.392157 | 84.616740 |
modern pentathlon | 58.000000 | 73.916667 |
rowing | 69.776190 | 86.504615 |
rugby sevens | 66.595890 | 90.450331 |
sailing | 63.194969 | 77.122066 |
shooting | 62.675676 | 81.068966 |
table tennis | 57.530120 | 72.558140 |
taekwondo | 61.258065 | 74.809524 |
tennis | 64.670455 | 80.417476 |
triathlon | 54.563636 | 66.814815 |
volleyball | 70.684211 | 89.421875 |
weightlifting | 68.788462 | 87.538961 |
wrestling | 61.805310 | 85.372881 |
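The same groupby-then-unstack pattern in a compact, self-contained sketch with toy data; `unstack` also accepts a `fill_value` for (row, column) combinations that never occur, instead of leaving NaN:

```python
import pandas as pd

df = pd.DataFrame({'sport': ['judo', 'judo', 'golf'],
                   'sex': ['female', 'male', 'female'],
                   'weight': [65.0, 84.0, 63.0]})

# Group by two keys, then move the inner key ('sex') into columns
means = df.groupby(['sport', 'sex'])['weight'].mean()
table = means.unstack(fill_value=0)  # golf/male never occurs -> 0 instead of NaN
print(table)
```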
Pivot¶
# The ID is repeated for each of the product's properties :/
products = pd.DataFrame({'id': [823905, 823905,
235897, 235897, 235897,
983422, 983422],
'item': ['prize', 'unit',
'prize', 'unit', 'stock',
'prize', 'stock'],
'value': [3.49, 'kg',
12.89, 'l', 50,
0.49, 4]})
products
id | item | value | |
---|---|---|---|
0 | 823905 | prize | 3.49 |
1 | 823905 | unit | kg |
2 | 235897 | prize | 12.89 |
3 | 235897 | unit | l |
4 | 235897 | stock | 50 |
5 | 983422 | prize | 0.49 |
6 | 983422 | stock | 4 |
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html
Similarly to unstack, pivot() moves rows into columns, building a DataFrame with more columns and fewer rows.
Here we use 'id' as the index of each row and the values in the “item” column to create the columns. The value that should appear in each column comes from the “value” column.
products.pivot(index='id', columns='item', values='value')
item | prize | stock | unit |
---|---|---|---|
id | |||
235897 | 12.89 | 50 | l |
823905 | 3.49 | NaN | kg |
983422 | 0.49 | 4 | NaN |
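One caveat: `pivot()` raises a `ValueError` when an (index, columns) pair appears more than once, because it would not know which value to place in the cell. `pivot_table()` handles duplicates by aggregating them (mean by default). A sketch with made-up duplicate entries:

```python
import pandas as pd

# id 1 has two 'price' rows -> pivot() would raise, pivot_table() aggregates
sales = pd.DataFrame({'id': [1, 1, 2],
                      'item': ['price', 'price', 'price'],
                      'value': [3.0, 5.0, 7.0]})

table = sales.pivot_table(index='id', columns='item', values='value', aggfunc='mean')
print(table)
```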
stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')
stocks.head(10)
date | symbol | open | high | low | close | volume | |
---|---|---|---|---|---|---|---|
0 | 2019-03-01 | AMZN | 1655.13 | 1674.26 | 1651.00 | 1671.73 | 4974877 |
1 | 2019-03-04 | AMZN | 1685.00 | 1709.43 | 1674.36 | 1696.17 | 6167358 |
2 | 2019-03-05 | AMZN | 1702.95 | 1707.80 | 1689.01 | 1692.43 | 3681522 |
3 | 2019-03-06 | AMZN | 1695.97 | 1697.75 | 1668.28 | 1668.95 | 3996001 |
4 | 2019-03-07 | AMZN | 1667.37 | 1669.75 | 1620.51 | 1625.95 | 4957017 |
5 | 2019-03-01 | AAPL | 174.28 | 175.15 | 172.89 | 174.97 | 25886167 |
6 | 2019-03-04 | AAPL | 175.69 | 177.75 | 173.97 | 175.85 | 27436203 |
7 | 2019-03-05 | AAPL | 175.94 | 176.00 | 174.54 | 175.53 | 19737419 |
8 | 2019-03-06 | AAPL | 174.67 | 175.49 | 173.94 | 174.52 | 20810384 |
9 | 2019-03-07 | AAPL | 173.87 | 174.44 | 172.02 | 172.50 | 24796374 |
stocks.pivot(index='symbol', columns='date', values='volume')
date | 2019-03-01 | 2019-03-04 | 2019-03-05 | 2019-03-06 | 2019-03-07 |
---|---|---|---|---|---|
symbol | |||||
AAPL | 25886167 | 27436203 | 19737419 | 20810384 | 24796374 |
AMZN | 4974877 | 6167358 | 3681522 | 3996001 | 4957017 |
GOOG | 1450316 | 1446047 | 1443174 | 1099289 | 1166559 |
Melt¶
melt()
does the opposite of pivot(): it gathers columns back into rows.
grades = pd.DataFrame([[6, 4, 5], [7, 8, 7], [6, 7, 9], [6, 5, 5], [5, 2, 7]],
index = ['Mary', 'John', 'Ann', 'Pete', 'Laura'],
columns = ['test_1', 'test_2', 'test_3'])
grades.reset_index(inplace=True)
grades
index | test_1 | test_2 | test_3 | |
---|---|---|---|---|
0 | Mary | 6 | 4 | 5 |
1 | John | 7 | 8 | 7 |
2 | Ann | 6 | 7 | 9 |
3 | Pete | 6 | 5 | 5 |
4 | Laura | 5 | 2 | 7 |
grades.melt(id_vars=['index']) # specify the columns that identify each entity
index | variable | value | |
---|---|---|---|
0 | Mary | test_1 | 6 |
1 | John | test_1 | 7 |
2 | Ann | test_1 | 6 |
3 | Pete | test_1 | 6 |
4 | Laura | test_1 | 5 |
5 | Mary | test_2 | 4 |
6 | John | test_2 | 8 |
7 | Ann | test_2 | 7 |
8 | Pete | test_2 | 5 |
9 | Laura | test_2 | 2 |
10 | Mary | test_3 | 5 |
11 | John | test_3 | 7 |
12 | Ann | test_3 | 9 |
13 | Pete | test_3 | 5 |
14 | Laura | test_3 | 7 |
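`melt()` also accepts `var_name` and `value_name` to give the generated columns meaningful names, and the melted frame can be pivoted back into the wide form. A sketch with a smaller grades table:

```python
import pandas as pd

grades = pd.DataFrame({'name': ['Mary', 'John'],
                       'test_1': [6, 7],
                       'test_2': [4, 8]})

# var_name/value_name rename the default 'variable'/'value' columns
tidy = grades.melt(id_vars=['name'], var_name='test', value_name='grade')
print(tidy)

# pivot() undoes the melt, recovering one column per test
wide = tidy.pivot(index='name', columns='test', values='grade')
```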