7 분 소요

winemag-data.csv 파일을 reviews 로 읽는다.

import pandas as pd
import numpy as np
reviews = pd.read_csv("winemag-data.csv",index_col = 0)
reviews.head(2)
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos

리뷰 데이터프레임에서 points 컬럼의 median 값은?

reviews['points'].median()
88.0

나라를 중복되지 않도록 가져와서 countries 변수에 저장하고, 화면에 출력하시오.

countries = reviews['country'].unique()
print(countries)
['Italy' 'Portugal' 'US' 'Spain' 'France' 'Germany' 'Argentina' 'Chile'
 'Australia' 'Austria' 'South Africa' 'New Zealand' 'Israel' 'Hungary'
 'Greece' 'Romania' 'Mexico' 'Canada' nan 'Turkey' 'Czech Republic'
 'Slovenia' 'Luxembourg' 'Croatia' 'Georgia' 'Uruguay' 'England' 'Lebanon'
 'Serbia' 'Brazil' 'Moldova' 'Morocco' 'Peru' 'India' 'Bulgaria' 'Cyprus'
 'Armenia' 'Switzerland' 'Bosnia and Herzegovina' 'Ukraine' 'Slovakia'
 'Macedonia' 'China' 'Egypt']

각 국가별로는 몇개의 리뷰가 있는지, 각국가별 리뷰수를 구하시오.

reviews['country'].value_counts()
US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
Chile                      4472
Argentina                  3800
Austria                    3345
Australia                  2329
Germany                    2165
New Zealand                1419
South Africa               1401
Israel                      505
Greece                      466
Canada                      257
Hungary                     146
Bulgaria                    141
Romania                     120
Uruguay                     109
Turkey                       90
Slovenia                     87
Georgia                      86
England                      74
Croatia                      73
Mexico                       70
Moldova                      59
Brazil                       52
Lebanon                      35
Morocco                      28
Peru                         16
Ukraine                      14
Czech Republic               12
Macedonia                    12
Serbia                       12
Cyprus                       11
India                         9
Switzerland                   7
Luxembourg                    6
Bosnia and Herzegovina        2
Armenia                       2
Slovakia                      1
Egypt                         1
China                         1
Name: country, dtype: int64

리뷰 데이터프레임의 price 컬럼 값에서, price의 평균값을 뺀 값을, centered_price 라고 저장하시오.

centered_price = reviews['price'] - reviews['price'].mean()
centered_price
0               NaN
1        -20.363389
2        -21.363389
3        -22.363389
4         29.636611
            ...    
129966    -7.363389
129967    39.636611
129968    -5.363389
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

나는 경제적이므로, 가격대비 포인트가 가장 큰 와인을 사려한다. 해당 와인의 title은?

reviews.head(2)
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
# 포인트 / 가격 = 가성비
price_point = reviews['points']/reviews['price']
reviews.loc[price_point == price_point.max(),]
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
64590 US There's a lot going on in this Merlot, which i... NaN 86 4.0 California California California Other NaN NaN Bandit NV Merlot (California) Merlot Bandit
126096 Romania Notes of sun-dried hay and green flower highli... UnWineD 86 4.0 Viile Timisului NaN Unknown Anna Lee C. Iijima NaN Cramele Recas 2011 UnWineD Pinot Grigio (Viile... Pinot Grigio Cramele Recas

사람들이 어떤와인을 더 많이 거론했는지 보려한다.

“tropical” 이 들어있는 리뷰의 갯수를 세고, “fruity” 라고 들어있는 리뷰의 갯수를 세어서

판다스 시리즈로 descriptor_counts 변수로 만들어 보자.

def A(strings):
    if 'tropical' in strings:
        return 1
    else:
        return 0
A('Aromas include tropical fruit, broom, brimston')
1
tropical_counts = reviews['description'].apply(A)
tropical_counts.sum()
3607
def B(strings):
    if 'fruity' in strings:
        return 1
    else:
        return 0
furity_counts = reviews['description'].apply(B)
furity_counts.sum()
9090
descriptor_counts = pd.Series( data = [3607,9090],index = ['tropical','fruity'])
descriptor_counts
tropical    3607
fruity      9090
dtype: int64
reviews.head(3)
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
# str.contains 를 이용하여 구하기
descrip = reviews['description'].str.replace(",","")
reviews['description'].str.contains('fruity').sum()
9090

별점 시스템을 만들려고 한다. 따라서 별점에 대한 데이터가 필요하다.

별점은 1,2,3 즉 3개로 만들것이다.

포인트가 95점 이상이면 3점, 85점 이상이면 2점, 나머지는 1점으로 할 것이다.

리뷰데이터를 통해 각 데이터의 별점을 구하시오.

def star(point):
    if point >= 95:
        return 3
    elif point >= 85:
        return 2
    else:
        return 1
reviews['별점'] = reviews['points'].apply( star )
reviews.head(4)
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery 별점
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia 2
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos 2
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm 2
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian 2
reviews.loc[ reviews['별점'] == 3,].head()
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery 별점
345 Australia This wine contains some material over 100 year... Rare 100 350.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards 3
346 Australia This deep brown wine smells like a damp, mossy... Rare 98 350.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscadelle... Muscadelle Chambers Rosewood Vineyards 3
347 Germany Dusty, saffron-spiced earthiness is juxtaposed... Kiedrich Gräfenberg Trockenbeerenauslese 97 775.0 Rheingau NaN NaN Anna Lee C. Iijima NaN Robert Weil 2014 Kiedrich Gräfenberg Trockenbe... Riesling Robert Weil 3
348 Australia Deep mahogany. Dried fig and black tea on the ... Grand 97 100.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Grand Muscat (R... Muscat Chambers Rosewood Vineyards 3
349 Australia RunRig is always complex, and the 2012 doesn't... RunRig 97 225.0 South Australia Barossa NaN Joe Czerwinski @JoeCz Torbreck 2012 RunRig Shiraz-Viognier (Barossa) Shiraz-Viognier Torbreck 3

리뷰의 region_2 컬럼에 데이터가 비어있는 경우에는, ‘Unknown’으로 셋팅하자.

reviews['region_2'].isna().sum()
79460
reviews['region_2'] = reviews['region_2'].fillna('Unknown')
reviews.head(5)
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery 별점
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna Unknown Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia 2
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN Unknown Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos 2
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm 2
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore Unknown Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian 2
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks 2

포인트로 그룹바이하고, 몇개의 데이터가 있는지 카운팅

reviews['points'].value_counts()
88     17207
87     16933
90     15410
86     12600
89     12226
91     11359
92      9613
85      9530
93      6489
84      6480
94      3758
83      3025
82      1836
95      1535
81       692
96       523
80       397
97       229
98        77
99        33
100       19
Name: points, dtype: int64
reviews.groupby('points')['points'].count()
points
80       397
81       692
82      1836
83      3025
84      6480
85      9530
86     12600
87     16933
88     17207
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1535
96       523
97       229
98        77
99        33
100       19
Name: points, dtype: int64

포인트별 가격의 최소값을 구해본다.

reviews.groupby('points')['price'].min()
points
80      5.0
81      5.0
82      4.0
83      4.0
84      4.0
85      4.0
86      4.0
87      5.0
88      6.0
89      7.0
90      8.0
91      7.0
92     11.0
93     12.0
94     13.0
95     20.0
96     20.0
97     35.0
98     50.0
99     44.0
100    80.0
Name: price, dtype: float64

각 국가별로, 가격의 최소값, 최대값 은?

reviews.groupby('country')['price'].agg([np.min,np.max])
amin amax
country
Argentina 4.0 230.0
Armenia 14.0 15.0
Australia 5.0 850.0
Austria 7.0 1100.0
Bosnia and Herzegovina 12.0 13.0
Brazil 10.0 60.0
Bulgaria 8.0 100.0
Canada 12.0 120.0
Chile 5.0 400.0
China 18.0 18.0
Croatia 12.0 65.0
Cyprus 11.0 21.0
Czech Republic 15.0 45.0
Egypt NaN NaN
England 25.0 95.0
France 5.0 3300.0
Georgia 9.0 40.0
Germany 5.0 775.0
Greece 8.0 79.0
Hungary 10.0 764.0
India 10.0 20.0
Israel 8.0 150.0
Italy 5.0 900.0
Lebanon 13.0 75.0
Luxembourg 16.0 30.0
Macedonia 15.0 20.0
Mexico 8.0 108.0
Moldova 8.0 42.0
Morocco 14.0 40.0
New Zealand 7.0 130.0
Peru 10.0 68.0
Portugal 5.0 1000.0
Romania 4.0 320.0
Serbia 15.0 42.0
Slovakia 16.0 16.0
Slovenia 7.0 90.0
South Africa 5.0 330.0
Spain 4.0 770.0
Switzerland 21.0 160.0
Turkey 14.0 120.0
US 4.0 2013.0
Ukraine 6.0 13.0
Uruguay 10.0 130.0

댓글남기기