Apache Arrow is a top-level project of the Apache Software Foundation. It is designed to serve as a cross-platform data layer that speeds up big-data analytics workloads.

## Installing the Python version of Arrow

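Arrow defines a cross-platform in-memory data interchange format. Install the Python package with conda: `conda install -c conda-forge pyarrow`. The official installation instructions recommend conda; see the Apache Arrow documentation for details.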

<pyarrow.lib.Buffer at 0x7f46706d0ab0>


### Using Arrow with NumPy

<pyarrow.lib.DoubleArray object at 0x7f467b68aea8>
[
0.688175,
0.979032,
0.91343,
0.725985,
0.469235,
0.373089,
0.792048,
0.472252,
0.615361,
0.693604,
...
0.240511,
0.162609,
0.518071,
0.816558,
0.736163,
0.509702,
0.914533,
0.879404,
0.979877,
0.883003
]


### Using Arrow with pandas

['_column',
'_validate',
'append_column',
'cast',
'column',
'columns',
'drop',
'equals',
'flatten',
'from_arrays',
'from_batches',
'from_pandas',
'itercolumns',
'num_columns',
'num_rows',
'remove_column',
'schema',
'set_column',
'shape',
'to_batches',
'to_pandas',
'to_pydict']


### Importing CSV data

26 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

164 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Arrow I/O operations

934 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
969 µs ± 88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

8.76 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.73 µs ± 9.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.26 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## Comparing pickle and Arrow

207 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

200 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
206 ms ± 7.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4.72 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.4 ms ± 42.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)