파이썬 - Pandas 실습 -3
winemag-data.csv 파일을 reviews 로 읽는다.
import pandas as pd
import numpy as np
reviews = pd.read_csv("winemag-data.csv",index_col = 0)
reviews.head(2)
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
리뷰 데이터프레임에서 points 컬럼의 median 값은?
reviews['points'].median()
88.0
나라를 중복되지 않도록 가져와서 countries 변수에 저장하고, 화면에 출력하시오.
countries = reviews['country'].unique()
print(countries)
['Italy' 'Portugal' 'US' 'Spain' 'France' 'Germany' 'Argentina' 'Chile'
'Australia' 'Austria' 'South Africa' 'New Zealand' 'Israel' 'Hungary'
'Greece' 'Romania' 'Mexico' 'Canada' nan 'Turkey' 'Czech Republic'
'Slovenia' 'Luxembourg' 'Croatia' 'Georgia' 'Uruguay' 'England' 'Lebanon'
'Serbia' 'Brazil' 'Moldova' 'Morocco' 'Peru' 'India' 'Bulgaria' 'Cyprus'
'Armenia' 'Switzerland' 'Bosnia and Herzegovina' 'Ukraine' 'Slovakia'
'Macedonia' 'China' 'Egypt']
각 국가별로는 몇개의 리뷰가 있는지, 각국가별 리뷰수를 구하시오.
reviews['country'].value_counts()
US 54504
France 22093
Italy 19540
Spain 6645
Portugal 5691
Chile 4472
Argentina 3800
Austria 3345
Australia 2329
Germany 2165
New Zealand 1419
South Africa 1401
Israel 505
Greece 466
Canada 257
Hungary 146
Bulgaria 141
Romania 120
Uruguay 109
Turkey 90
Slovenia 87
Georgia 86
England 74
Croatia 73
Mexico 70
Moldova 59
Brazil 52
Lebanon 35
Morocco 28
Peru 16
Ukraine 14
Czech Republic 12
Macedonia 12
Serbia 12
Cyprus 11
India 9
Switzerland 7
Luxembourg 6
Bosnia and Herzegovina 2
Armenia 2
Slovakia 1
Egypt 1
China 1
Name: country, dtype: int64
리뷰 데이터프레임의 price 컬럼 값에서, price의 평균값을 뺀 값을, centered_price 라고 저장하시오.
centered_price = reviews['price'] - reviews['price'].mean()
centered_price
0 NaN
1 -20.363389
2 -21.363389
3 -22.363389
4 29.636611
...
129966 -7.363389
129967 39.636611
129968 -5.363389
129969 -3.363389
129970 -14.363389
Name: price, Length: 129971, dtype: float64
나는 경제적이므로, 가격대비 포인트가 가장 큰 와인을 사려한다. 해당 와인의 title은?
reviews.head(2)
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
# 포인트 / 가격 = 가성비
price_point = reviews['points']/reviews['price']
reviews.loc[price_point == price_point.max(),]
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64590 | US | There's a lot going on in this Merlot, which i... | NaN | 86 | 4.0 | California | California | California Other | NaN | NaN | Bandit NV Merlot (California) | Merlot | Bandit |
126096 | Romania | Notes of sun-dried hay and green flower highli... | UnWineD | 86 | 4.0 | Viile Timisului | NaN | Unknown | Anna Lee C. Iijima | NaN | Cramele Recas 2011 UnWineD Pinot Grigio (Viile... | Pinot Grigio | Cramele Recas |
사람들이 어떤와인을 더 많이 거론했는지 보려한다.
“tropical” 이 들어있는 리뷰의 갯수를 세고, “fruity” 라고 들어있는 리뷰의 갯수를 세어서
판다스 시리즈로 descriptor_counts 변수로 만들어 보자.
def A(strings):
if 'tropical' in strings:
return 1
else:
return 0
A('Aromas include tropical fruit, broom, brimston')
1
tropical_counts = reviews['description'].apply(A)
tropical_counts.sum()
3607
def B(strings):
if 'fruity' in strings:
return 1
else:
return 0
furity_counts = reviews['description'].apply(B)
furity_counts.sum()
9090
descriptor_counts = pd.Series( data = [3607,9090],index = ['tropical','fruity'])
descriptor_counts
tropical 3607
fruity 9090
dtype: int64
reviews.head(3)
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
# str.contains 를 이용하여 구하기
descrip = reviews['description'].str.replace(",","")
reviews['description'].str.contains('fruity').sum()
9090
별점 시스템을 만들려고 한다. 따라서 별점에 대한 데이터가 필요하다.
별점은 1,2,3 즉 3개로 만들것이다.
포인트가 95점 이상이면 3점, 85점 이상이면 2점, 나머지는 1점으로 할 것이다.
리뷰데이터를 통해 각 데이터의 별점을 구하시오.
def star(point):
if point >= 95:
return 3
elif point >= 85:
return 2
else:
return 1
reviews['별점'] = reviews['points'].apply( star )
reviews.head(4)
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | 별점 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | 2 |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | 2 |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm | 2 |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian | 2 |
reviews.loc[ reviews['별점'] == 3,].head()
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | 별점 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
345 | Australia | This wine contains some material over 100 year... | Rare | 100 | 350.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards | 3 |
346 | Australia | This deep brown wine smells like a damp, mossy... | Rare | 98 | 350.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscadelle... | Muscadelle | Chambers Rosewood Vineyards | 3 |
347 | Germany | Dusty, saffron-spiced earthiness is juxtaposed... | Kiedrich Gräfenberg Trockenbeerenauslese | 97 | 775.0 | Rheingau | NaN | NaN | Anna Lee C. Iijima | NaN | Robert Weil 2014 Kiedrich Gräfenberg Trockenbe... | Riesling | Robert Weil | 3 |
348 | Australia | Deep mahogany. Dried fig and black tea on the ... | Grand | 97 | 100.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Grand Muscat (R... | Muscat | Chambers Rosewood Vineyards | 3 |
349 | Australia | RunRig is always complex, and the 2012 doesn't... | RunRig | 97 | 225.0 | South Australia | Barossa | NaN | Joe Czerwinski | @JoeCz | Torbreck 2012 RunRig Shiraz-Viognier (Barossa) | Shiraz-Viognier | Torbreck | 3 |
리뷰의 region_2 컬럼에 데이터가 비어있는 경우에는, ‘Unknown’으로 셋팅하자.
reviews['region_2'].isna().sum()
79460
reviews['region_2'] = reviews['region_2'].fillna('Unknown')
reviews.head(5)
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | 별점 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | Unknown | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | 2 |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | Unknown | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | 2 |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm | 2 |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | Unknown | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian | 2 |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks | 2 |
포인트로 그룹바이하고, 몇개의 데이터가 있는지 카운팅
reviews['points'].value_counts()
88 17207
87 16933
90 15410
86 12600
89 12226
91 11359
92 9613
85 9530
93 6489
84 6480
94 3758
83 3025
82 1836
95 1535
81 692
96 523
80 397
97 229
98 77
99 33
100 19
Name: points, dtype: int64
reviews.groupby('points')['points'].count()
points
80 397
81 692
82 1836
83 3025
84 6480
85 9530
86 12600
87 16933
88 17207
89 12226
90 15410
91 11359
92 9613
93 6489
94 3758
95 1535
96 523
97 229
98 77
99 33
100 19
Name: points, dtype: int64
포인트별 가격의 최소값을 구해본다.
reviews.groupby('points')['price'].min()
points
80 5.0
81 5.0
82 4.0
83 4.0
84 4.0
85 4.0
86 4.0
87 5.0
88 6.0
89 7.0
90 8.0
91 7.0
92 11.0
93 12.0
94 13.0
95 20.0
96 20.0
97 35.0
98 50.0
99 44.0
100 80.0
Name: price, dtype: float64
각 국가별로, 가격의 최소값, 최대값 은?
reviews.groupby('country')['price'].agg([np.min,np.max])
amin | amax | |
---|---|---|
country | ||
Argentina | 4.0 | 230.0 |
Armenia | 14.0 | 15.0 |
Australia | 5.0 | 850.0 |
Austria | 7.0 | 1100.0 |
Bosnia and Herzegovina | 12.0 | 13.0 |
Brazil | 10.0 | 60.0 |
Bulgaria | 8.0 | 100.0 |
Canada | 12.0 | 120.0 |
Chile | 5.0 | 400.0 |
China | 18.0 | 18.0 |
Croatia | 12.0 | 65.0 |
Cyprus | 11.0 | 21.0 |
Czech Republic | 15.0 | 45.0 |
Egypt | NaN | NaN |
England | 25.0 | 95.0 |
France | 5.0 | 3300.0 |
Georgia | 9.0 | 40.0 |
Germany | 5.0 | 775.0 |
Greece | 8.0 | 79.0 |
Hungary | 10.0 | 764.0 |
India | 10.0 | 20.0 |
Israel | 8.0 | 150.0 |
Italy | 5.0 | 900.0 |
Lebanon | 13.0 | 75.0 |
Luxembourg | 16.0 | 30.0 |
Macedonia | 15.0 | 20.0 |
Mexico | 8.0 | 108.0 |
Moldova | 8.0 | 42.0 |
Morocco | 14.0 | 40.0 |
New Zealand | 7.0 | 130.0 |
Peru | 10.0 | 68.0 |
Portugal | 5.0 | 1000.0 |
Romania | 4.0 | 320.0 |
Serbia | 15.0 | 42.0 |
Slovakia | 16.0 | 16.0 |
Slovenia | 7.0 | 90.0 |
South Africa | 5.0 | 330.0 |
Spain | 4.0 | 770.0 |
Switzerland | 21.0 | 160.0 |
Turkey | 14.0 | 120.0 |
US | 4.0 | 2013.0 |
Ukraine | 6.0 | 13.0 |
Uruguay | 10.0 | 130.0 |
댓글남기기