「Rプログラミング入門」をPythonで書き直す

何の話かというと

作者: Garrett Grolemund,大橋真也,長尾高弘
出版社/メーカー: オライリージャパン
発売日: 2015/03/25
メディア: 大型本
この商品を含むブログを見る

某編集長から上記の書籍が送られてきて、「これは、次はRの本を書けという指示か？？？？」と勘ぐってみたものの、筆者はPython派なので、「これと同じことは全部Pythonでもできるんだよー」と言いたくなって、このエントリーを書き始めた次第です。ちなみに、この本、Rの入門書としてはよくできているので、これのPython版ができたら、それはそれで役に立つ気もします。

なお、このエントリーでは、あくまでコードの部分だけを書き直して、RとPythonの差異についての説明だけを行ないます。コードそのものの説明については、上記の書籍をご購入ください。

環境準備

IPython, numpy, pandas, matplotlib あたりが入っていればOKです。個人的には、Canopy Express を愛用しています。導入手順は、下記資料のp.5〜p.7を参照。

ソフトウェアエンジニアのための「機械学習理論」入門・ハンズオン演習ガイド from Etsuji Nakai

コマンドは、IPythonのCLIから実行します。numpy, pyplot, pandasは下記でインポートしてあって、それぞれ、np, plt, pdで参照できる前提です。SeriesとDataFrameはよく使うので、明示的にインポートしておきます。

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame

1.1 Rのユーザーインターフェイス

リスト（ベクトル）はnumpyのarrayオブジェクトを使用します。arange()でリストを作成する場合、末尾に指定した値はリストに含まれないので注意。

In [1]: np.arange(100,131)
Out[1]: 
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130])

In [2]: np.arange(1,7)
Out[2]: array([1, 2, 3, 4, 5, 6])

1.2 オブジェクト

数値演算の際は、int型とfloat型の区別に注意。1/2 と 1/2.0 は結果が異なります。

In [1]: die = np.arange(1,7)

In [2]: die
Out[2]: array([1, 2, 3, 4, 5, 6])

In [3]: die / 2.0
Out[3]: array([ 0.5,  1. ,  1.5,  2. ,  2.5,  3. ])

In [4]: die * die
Out[4]: array([ 1,  4,  9, 16, 25, 36])

「ベクトルのリサイクル規則」はPythonには無いので、下記のようにtile()で明示的に繰り返します。

In [5]: die + np.tile(np.arange(1,3), 3)
Out[5]: array([2, 4, 4, 6, 6, 8])

ベクトル演算（内積、外積）を行うときは、リストではなく、(1,N)行列として定義しておく必要があります。

In [6]: die = np.array([range(1,7)])

In [7]: die
Out[7]: array([[1, 2, 3, 4, 5, 6]])

In [8]: die.T
Out[8]: 
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

内積は「横ベクトル×縦ベクトル」、外積は「縦ベクトル×横ベクトル」で計算します。

In [9]: np.dot(die, die.T)
Out[9]: array([[91]])

In [10]: np.dot(die.T, die)
Out[10]: 
array([[ 1,  2,  3,  4,  5,  6],
       [ 2,  4,  6,  8, 10, 12],
       [ 3,  6,  9, 12, 15, 18],
       [ 4,  8, 12, 16, 20, 24],
       [ 5, 10, 15, 20, 25, 30],
       [ 6, 12, 18, 24, 30, 36]])

リストのままだと、内積は計算できますが、外積が計算できません。

In [11]: die = np.arange(1,7)

In [12]: die
Out[12]: array([1, 2, 3, 4, 5, 6])

In [13]: np.dot(die, die)
Out[13]: 91

1.3 関数

factorial()はいくつかの選択肢がありますが、scipy.math.factorial()を使うのが安全。（これ以外は、引数にリストがとれない。）

In [1]: np.round(3.1415)
Out[1]: 3.0

In [2]: import scipy

In [3]: scipy.math.factorial(3)
Out[3]: 6

In [4]: die = np.arange(1,7)

In [5]: mean(die)
Out[5]: 3.5

In [6]: round(mean(die))
Out[6]: 4.0

In [7]: round(3.1415, ndigits=2)
Out[7]: 3.14

リストからのランダムサンプリングは、np.random.choice()を使います。デフォルトでは、replace=True（元に戻すサンプリング）なので注意。

In [8]: np.random.choice(die, size=2, replace=False)
Out[8]: array([1, 4])

1.4 元に戻すサンプリング

replace=Trueを明示的に指定してみた。

In [1]: die = np.arange(1,7)

In [2]: np.random.choice(die, size=2, replace=True)
Out[2]: array([4, 1])

1.5 独自関数の書き方

戻り値は、明示的にreturnする必要があります。

In [1]: def roll():
   ...:         die = np.arange(1,7)
   ...:         dice = np.random.choice(die, size=2, replace=True)
   ...:         return sum(dice)
   ...: 

In [2]: roll()
Out[2]: 11

In [3]: def roll2(bones=np.arange(1,7)):
   ...:         dice = np.random.choice(bones, size=2, replace=True)
   ...:         return sum(dice)
   ...: 

In [4]: roll2(bones=np.arange(1,5))
Out[4]: 4

2.1 パッケージ

散布図は、plt.scatter()を使います。冪乗は「**」で計算します。

In [1]: x = np.arange(-1,1.2,0.2)

In [2]: y = x ** 3

In [3]: plt.scatter(x,y)
Out[3]: <matplotlib.collections.PathCollection at 0x4c44e90>

ヒストグラムは、plt.hist()です。区間の幅ではなく、区間の個数（bins）を指定します。

In [4]: x = np.array([1,2,2,2,3,3])

In [5]: plt.hist(x,bins=4)
Out[5]: 
(array([ 1.,  0.,  3.,  2.]),
 array([ 1. ,  1.5,  2. ,  2.5,  3. ]),
 <a list of 4 Patch objects>)

関数の繰り返し呼び出しは、リストの内包表記を使います。

In [6]: def roll():
   ...:         die = np.arange(1,7)
   ...:         dice = np.random.choice(die, size=2, replace=True)
   ...:         return sum(dice)
   ...: 

In [7]: [ roll() for x in np.arange(10) ]
Out[7]: [4, 7, 8, 7, 5, 8, 8, 5, 5, 8]

In [8]: rolls = [ roll() for x in np.arange(10000) ]

In [9]: plt.hist(rolls, bins=11)
Out[9]: 
(array([  274.,   543.,   833.,  1081.,  1346.,  1695.,  1437.,  1097.,
          851.,   566.,   277.]),
 array([  2.        ,   2.90909091,   3.81818182,   4.72727273,
          5.63636364,   6.54545455,   7.45454545,   8.36363636,
          9.27272727,  10.18181818,  11.09090909,  12.        ]),
 <a list of 11 Patch objects>)

2.2 ヘルプページに教えてもらう

IPythonのヘルプ機能を使いましょう。

In [1]: sqrt?
In [2]: np.random.choice?

練習問題の解答は次の通り。np.random.choice()のpオプションの指定では、float型で計算している点に注意。

In [3]: def roll():
   ...:         die = np.arange(1,7)
   ...:         dice = np.random.choice(die, size=2, replace=True,
   ...:             p = [1.0/8, 1.0/8, 1.0/8, 1.0/8, 1.0/8, 3.0/8])
   ...:         return sum(dice)
   ...: 

In [4]: rolls = [ roll() for x in np.arange(10000) ]

In [5]: plt.hist(rolls,bins=11)
Out[5]: 
(array([  170.,   322.,   479.,   631.,   777.,  1595.,  1367.,  1241.,
         1075.,   897.,  1446.]),
 array([  2.        ,   2.90909091,   3.81818182,   4.72727273,
          5.63636364,   6.54545455,   7.45454545,   8.36363636,
          9.27272727,  10.18181818,  11.09090909,  12.        ]),
 <a list of 11 Patch objects>)

3.1 アトミックベクトル

インスタンスの型は、type()、および、isinstance()で確認。

In [1]: die = np.arange(1,7)

In [2]: type(die)
Out[2]: numpy.ndarray

In [3]: isinstance(die, np.ndarray)
Out[3]: True

ただし、ndarrayの中身は、dtypeで確認が必要。

In [4]: die.dtype
Out[4]: dtype('int64')

整数型と文字列型の例。文字列はシングルクォートの方が標準的。

In [5]: int = np.array([1L, 5L])
In [6]: text = np.array(['ace', 'hearts'])

小数点を付けない数値は、デフォルトで整数型になるので注意。

In [7]: type(4)
Out[7]: int

In [8]: type(4.0)
Out[8]: float

虚数はjで表現。

In [9]: comp = np.array([1+1j, 1+2j, 1+3j])

In [10]: comp.dtype
Out[10]: dtype('complex128')

3.2 属性

名前付きのベクトルを扱う際は、pandasのSeriesオブジェクトを使います。

In [1]: die = Series(range(1,7))

In [2]: die.index = ['one','two','three','four','five','six']

In [3]: die
Out[3]: 
one      1
two      2
three    3
four     4
five     5
six      6
dtype: int64

In [4]: die + 1
Out[4]: 
one      2
two      3
three    4
four     5
five     6
six      7
dtype: int64

In [5]: die.index = ['uno', 'dos', 'tres', 'cuatro', 'cinco', 'seis']

In [6]: die
Out[6]: 
uno       1
dos       2
tres      3
cuatro    4
cinco     5
seis      6
dtype: int64

3.3 行列

行列を生成する際は、reshape()が便利です。

In [1]: die.index = ['one','two','three','four','five','six']

In [2]: die.reshape(2,3)
Out[2]: 
array([[1, 2, 3],
       [4, 5, 6]])

In [3]: die.reshape(2,3,order=1)
Out[3]: 
array([[1, 3, 5],
       [2, 4, 6]])

3.4 配列

多次元配列もreshape()で生成できます。

In [1]: np.array([range(11,15), range(21,25), range(31,35)]).reshape(3,2,2)
Out[1]: 
array([[[11, 12],
        [13, 14]],

       [[21, 22],
        [23, 24]],

       [[31, 32],
        [33, 34]]])

練習問題の解答：しばし考えた結果、これが簡単かと。

In [2]: np.array([['ace','king','queen','jack','ten'],['spades']*5]).T
Out[2]: 
array([['ace', 'spades'],
       ['king', 'spades'],
       ['queen', 'spades'],
       ['jack', 'spades'],
       ['ten', 'spades']], 
      dtype='|S6')

3.8 データフレーム

Rのデータフレームは、pandasのDataFrameオブジェクトが対応します。

In [1]: df = DataFrame({
   ...:     'face':['ace','two','six'],
   ...:     'suit':['clubs','clubs','clubs'],
   ...:     'value':[1,2,3]})

In [2]: df
Out[2]: 
  face   suit  value
0  ace  clubs      1
1  two  clubs      2
2  six  clubs      3

トランプのデッキは次の定義で一撃。

In [3]: face = ['king','queen','jack','ten','nine','eight',
   ...:         'seven','six','five','four','three','two','ace']

In [4]: suit = ['spades', 'clubs', 'diamonds', 'hearts']

In [5]: value = range(13,0,-1)

In [6]: deck = DataFrame({
   ...:     'face': np.tile(face,4),
   ...:     'suit': np.repeat(suit,13),
   ...:     'value': np.tile(value,4)})

In [7]: deck
Out[7]: 
     face      suit  value
0    king    spades     13
1   queen    spades     12
2    jack    spades     11
3     ten    spades     10
4    nine    spades      9
5   eight    spades      8
6   seven    spades      7
7     six    spades      6
8    five    spades      5
9    four    spades      4
10  three    spades      3
11    two    spades      2
12    ace    spades      1
13   king     clubs     13
14  queen     clubs     12
15   jack     clubs     11
16    ten     clubs     10
17   nine     clubs      9
18  eight     clubs      8
19  seven     clubs      7
20    six     clubs      6
21   five     clubs      5
22   four     clubs      4
23  three     clubs      3
24    two     clubs      2
25    ace     clubs      1
26   king  diamonds     13
27  queen  diamonds     12
28   jack  diamonds     11
29    ten  diamonds     10
30   nine  diamonds      9
31  eight  diamonds      8
32  seven  diamonds      7
33    six  diamonds      6
34   five  diamonds      5
35   four  diamonds      4
36  three  diamonds      3
37    two  diamonds      2
38    ace  diamonds      1
39   king    hearts     13
40  queen    hearts     12
41   jack    hearts     11
42    ten    hearts     10
43   nine    hearts      9
44  eight    hearts      8
45  seven    hearts      7
46    six    hearts      6
47   five    hearts      5
48   four    hearts      4
49  three    hearts      3
50    two    hearts      2
51    ace    hearts      1

4.1 値の選択

DataFrameからの行／列の抽出は、下記を参照。

・pandasのDataFrameからのデータ選択

抽出したオブジェクトは、オリジナルのコピーではなく、オリジナルを参照するDataviewになっているので要注意。

4.3 デッキのシャッフル

インデックスのpermutationで一撃。

In [5]: deck2 = deck.reindex(np.random.permutation(deck.index))

In [6]: deck.head()
Out[6]: 
    face    suit  value
0   king  spades     13
1  queen  spades     12
2   jack  spades     11
3    ten  spades     10
4   nine  spades      9

In [7]: deck2.head()
Out[7]: 
    face    suit  value
46   six  hearts      6
12   ace  spades      1
7    six  spades      6
13  king   clubs     13
41  jack  hearts     11

5.2 論理添字

練習問題の解答（その1）

In [18]: deck2['face']=='ace'
Out[18]: 
43    False
1     False
13    False
49    False
0     False
18    False
25     True
30    False
...以下略...

In [19]: sum(deck2['face']=='ace')
Out[19]: 4

練習問題の解答（その2）

DataFrameの値を書き換える時は、.locメソッドで行と列を指定して値を代入します。

In [15]: deck2[deck2['suit']=='hearts'].index
Out[15]: Int64Index([39, 40, 50, 42, 43, 45, 49, 44, 41, 46, 47, 51, 48], dtype='int64')

In [16]: deck2.loc[deck2[deck2['suit']=='hearts'].index,'value']
Out[16]: 
39    13
40    12
50     2
42    10
43     9
45     7
49     3
44     8
41    11
46     6
47     5
51     1
48     4
Name: value, dtype: int64

In [17]: deck2.loc[deck2[deck2['suit']=='hearts'].index,'value']=1

In [18]: deck2[deck2['suit']=='hearts'].head()
Out[18]: 
     face    suit  value
39   king  hearts      1
40  queen  hearts      1
50    two  hearts      1
42    ten  hearts      1
43   nine  hearts      1

5.3 欠損情報

DataFrameの欠損データは、np.nanを用います。平均や合計を計算する際は、自動的に0とみなされます。

In [1]: df = DataFrame([np.nan]).append(DataFrame(range(1,51)), ignore_index=True)

In [2]: df.head()
Out[2]: 
    0
0 NaN
1   1
2   2
3   3
4   4

In [3]: df.isnull().head()
Out[3]: 
       0
0   True
1  False
2  False
3  False
4  False

In [4]: df.mean()
Out[4]: 
0    25.5
dtype: float64

In [5]: mean(df)
Out[5]: 
0    25.5
dtype: float64

第6章環境

うーん。Rの環境って、要するに変数のスコープの管理機構？　Pythonだったら普通にクラス化するだけかと。。。。

In [1]: cat mydeck.py
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame

class Deck:
    def __init__(self):
        face = ['king','queen','jack','ten','nine','eight',
                'seven','six','five','four','three','two','ace']
        suit = ['spades', 'clubs', 'diamonds', 'hearts']
        value = range(13,0,-1)
        self.deck = DataFrame({
                        'face': np.tile(face,4),
                        'suit': np.repeat(suit,13),
                        'value': np.tile(value,4)})

    def shuffle(self):
        self.deck = self.deck.reindex(np.random.permutation(self.deck.index))

    def deal(self):
        card = self.deck[0:1]
        self.deck = self.deck.drop(self.deck[0:1].index)
        return card


In [2]: import mydeck

In [3]: deck = mydeck.Deck()

In [4]: deck.shuffle()

In [5]: deck.deal()
Out[5]: 
   face    suit  value
9  four  spades      4

In [6]: deck.deal()
Out[6]: 
    face      suit  value
34  five  diamonds      5

Ⅲ部　スロットマシン

すいません。。。ここは普通のプログラム入門なので、最後の結果だけ。。。。

In [1]: cat myslot.py
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame
import itertools

class Slot:
    def __init__(self):
        self.wheel_data = {'DD':0.03, '7':0.03, 'BBB':0.06, 'BB':0.1, 'B':0.25,
                'C':0.01, '0':0.52}
        self.wheel = self.wheel_data.keys()
        self.prob = self.wheel_data.values()

    def get_df(self):
        df = DataFrame(columns=['Var1','Var2','Var3'])
        for x in itertools.product(self.wheel,repeat=3):
            df = df.append(Series(x, index=['Var1','Var2','Var3']),
                            ignore_index=True)
        df['prob1'] = df['Var1'].map(self.wheel_data)
        df['prob2'] = df['Var2'].map(self.wheel_data)
        df['prob3'] = df['Var3'].map(self.wheel_data)
        df['prob'] = df['prob1']*df['prob2']*df['prob3']
        df['prize'] = np.nan
        for index, line in df.iterrows():
            symbols = line[['Var1','Var2','Var3']]
            df.loc[index, 'prize'] = self.score(symbols)
        return df

    def run_emulation(self, n):
        wheel = self.wheel_data.keys()
        prob = self.wheel_data.values()
        samples = np.random.choice(wheel, size=3*n,
                    replace=True, p=prob).reshape(n,3)
        return float(sum([self.score(x) for x in samples]))/n

    def get_symbols(self):
        wheel = self.wheel_data.keys()
        prob = self.wheel_data.values()
        return np.random.choice(wheel, size=3, replace=True, p=prob)

    def score(self, symbols):
        diamonds = sum(symbols == 'DD')
        cherries = sum(symbols == 'C')
        slots = symbols[symbols != 'DD']
        same = len(set(slots.tolist())) == 1
        bars = [ x in ['B', 'BB', 'BBB'] for x in slots ]

        if diamonds == 3:
            prize = 100 
        elif same:
            payouts = {'7':80, 'BBB':40, 'BB':25, 'B':10, 'C':10, '0':0}
            prize = payouts[slots[0]]
        elif all(bars):
            prize = 5
        elif cherries > 0:
            prize = [0,2,5][cherries + diamonds]
        else:
            prize = 0

        prize *= 2 ** diamonds

        return prize
                        
    def play(self):
        symbols = self.get_symbols()
        sc = self.score(symbols)
        print '%s %s %s' % (symbols[0], symbols[1], symbols[2])
        print '$%d' % sc

In [2]: import myslot; slot=myslot.Slot(); df = slot.get_df()

In [3]: sum(df.prob * df.prize)
Out[3]: 0.93435599999999985

In [4]: time slot.run_emulation(1000000)
CPU times: user 52 s, sys: 45 ms, total: 52 s
Wall time: 52.1 s
Out[4]: 0.922667

めもめも

このブログに記載の内容は個人の見解であり、必ずしも所属組織の立場、戦略、意見を代表するものではありません。

「Rプログラミング入門」をPythonで書き直す

何の話かというと

環境準備

1.1 Rのユーザーインターフェイス

1.2 オブジェクト

1.3 関数

1.4 元に戻すサンプリング

1.5 独自関数の書き方

2.1 パッケージ

2.2 ヘルプページに教えてもらう

3.1 アトミックベクトル

3.2 属性

3.3 行列

3.4 配列

3.8 データフレーム

4.1 値の選択

4.3 デッキのシャッフル

5.2 論理添字

5.3 欠損情報

第6章環境

Ⅲ部　スロットマシン

何の話かというと

環境準備

1.1 Rのユーザーインターフェイス

1.2 オブジェクト

1.3 関数

1.4 元に戻すサンプリング

1.5 独自関数の書き方

2.1 パッケージ

2.2 ヘルプページに教えてもらう

3.1 アトミックベクトル

3.2 属性

3.3 行列

3.4 配列

3.8 データフレーム

4.1 値の選択

4.3 デッキのシャッフル

5.2 論理添字

5.3 欠損情報

第6章 環境

Ⅲ部 スロットマシン

第6章環境

Ⅲ部　スロットマシン