I. NumPy
1. Array Copies
(1) No copy
(2) View (shallow copy)
(3) Deep copy
# Loosely speaking, the heap is like a hard disk: larger than the stack but slower to access; the actual data lives there.
# The stack is like RAM: smaller than the heap but faster; it holds the names (references) that point at the data.
# Copying: shallow and deep copies both create a new stack entry (a new array object), but a shallow copy shares the same heap memory while a deep copy gets its own.
# No copy: stack and heap are both the same; the same object is simply given another name.
import numpy as np
a = np.arange(12)
print("修改前的 a = ")
print(a)
# No copy
b = a # no copy is made here; this just binds another name to the same object
print(b is a) # True: b and a are the same object
# Shallow copy: view
# only the stack entry (array object) is copied; the heap memory (data buffer) is shared
c = a.view()
# c and a are different array objects, but they still point to the same heap memory
print(c is a)
# modifying c therefore also changes a
c[0] = 100
print("浅拷贝修改后的 a = ")
print(a) # 发生改变
# Deep copy: copy
# both the stack entry and the heap memory are copied, so no addresses are shared
d = a.copy()
print(d is a)
d[1] = 200
print("a after deep copy modification = ")
print(a) # a is unchanged, because the two arrays use different heap memory
# ravel(): shallow vs deep copy when flattening an array
print("="*40)
a1 = np.random.rand(2,4)
print("原始a1 = ")
print(a1)
a2 = a1.ravel()
a2[0] = 0
print("a2 = ")
print(a2)
print("浅拷贝后 a1 = ")
print(a1) # 改变
# flatten() always returns a deep copy
a3 = a1.flatten()
a3[1] = 100
print("a3 = ")
print(a3)
print("深拷贝后 a1 = ")
print(a1) # 不改变,因为堆区内存已经不一样了
Output:
a before modification =
[ 0 1 2 3 4 5 6 7 8 9 10 11]
True
False
a after shallow copy modification =
[100 1 2 3 4 5 6 7 8 9 10 11]
False
a after deep copy modification =
[100 1 2 3 4 5 6 7 8 9 10 11]
========================================
original a1 =
[[0.28861696 0.90031111 0.16516921 0.57358814]
[0.94398138 0.30219887 0.6312207 0.86375758]]
a2 =
[0. 0.90031111 0.16516921 0.57358814 0.94398138 0.30219887
0.6312207 0.86375758]
a1 after shallow copy modification =
[[0. 0.90031111 0.16516921 0.57358814]
[0.94398138 0.30219887 0.6312207 0.86375758]]
a3 =
[ 0. 100. 0.16516921 0.57358814 0.94398138
0.30219887 0.6312207 0.86375758]
a1 after deep copy modification =
[[0. 0.90031111 0.16516921 0.57358814]
[0.94398138 0.30219887 0.6312207 0.86375758]]
Summary
Array operations involve three kinds of copying:
- No copy: plain assignment; nothing is copied, the same stack entry just gets a different name
- Shallow copy: only the stack entry is copied; the heap memory it points to is not
- Deep copy: both the stack entry and the heap memory are copied
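A quick way to check which case you are dealing with, without modifying any data, is an array's .base attribute or np.shares_memory. A minimal sketch, reusing the names from the example above:
import numpy as np
a = np.arange(12)
b = a        # no copy: another name for the same object
c = a.view() # shallow copy: new array object, shared data buffer
d = a.copy() # deep copy: new array object, new data buffer
print(b is a)                 # True  - same object
print(c.base is a)            # True  - c's data is owned by a
print(np.shares_memory(a, c)) # True  - same heap memory
print(np.shares_memory(a, d)) # False - independent heap memory
print(a.ravel().base is a)    # True  - ravel() returns a view when it can
print(a.flatten().base is a)  # False - flatten() always copies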
2. CSV File Operations
import numpy as np
scores = np.random.randint(0,100,size=(20,2))
scores
# np.savetxt saves an array to a text file
# comments="" stops the default "# " prefix from being written in front of the header
np.savetxt("score.csv",scores,delimiter=",",header="English,Math",comments="",fmt="%d")
# np.loadtxt reads data back from a text file
# skiprows skips the first rows; here the header row "English,Math" is text and cannot be converted to int
b = np.loadtxt("score.csv",dtype=np.int32,delimiter=",",skiprows=1)
b
Output:
help(np.savetxt)
# Output
Help on function savetxt in module numpy:
savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)
Save an array to a text file.
Parameters
----------
fname : filename or file handle
If the filename ends in ``.gz``, the file is automatically saved in
compressed gzip format. `loadtxt` understands gzipped files
transparently.
X : 1D or 2D array_like
Data to be saved to a text file.
fmt : str or sequence of strs, optional
A single format (%10.5f), a sequence of formats, or a
multi-format string, e.g. 'Iteration %d -- %10.5f', in which
case `delimiter` is ignored. For complex `X`, the legal options
for `fmt` are:
* a single specifier, `fmt='%.4e'`, resulting in numbers formatted
like `' (%s+%sj)' % (fmt, fmt)`
* a full string specifying every real and imaginary part, e.g.
`' %.4e %+.4ej %.4e %+.4ej %.4e %+.4ej'` for 3 columns
* a list of specifiers, one per column - in this case, the real
and imaginary part must have separate specifiers,
e.g. `['%.3e + %.3ej', '(%.15e%+.15ej)']` for 2 columns
delimiter : str, optional
String or character separating columns.
newline : str, optional
String or character separating lines.
.. versionadded:: 1.5.0
header : str, optional
String that will be written at the beginning of the file.
.. versionadded:: 1.7.0
footer : str, optional
String that will be written at the end of the file.
.. versionadded:: 1.7.0
comments : str, optional
String that will be prepended to the ``header`` and ``footer`` strings,
to mark them as comments. Default: '# ', as expected by e.g.
``numpy.loadtxt``.
.. versionadded:: 1.7.0
encoding : {None, str}, optional
Encoding used to encode the outputfile. Does not apply to output
streams. If the encoding is something other than 'bytes' or 'latin1'
you will not be able to load the file in NumPy versions < 1.14. Default
is 'latin1'.
.. versionadded:: 1.14.0
See Also
--------
save : Save an array to a binary file in NumPy ``.npy`` format
savez : Save several arrays into an uncompressed ``.npz`` archive
savez_compressed : Save several arrays into a compressed ``.npz`` archive
Notes
-----
Further explanation of the `fmt` parameter
(``%[flag]width[.precision]specifier``):
flags:
``-`` : left justify
``+`` : Forces to precede result with + or -.
``0`` : Left pad the number with zeros instead of space (see width).
width:
Minimum number of characters to be printed. The value is not truncated
if it has more characters.
precision:
- For integer specifiers (eg. ``d,i,o,x``), the minimum number of
digits.
- For ``e, E`` and ``f`` specifiers, the number of digits to print
after the decimal point.
- For ``g`` and ``G``, the maximum number of significant digits.
- For ``s``, the maximum number of characters.
specifiers:
``c`` : character
``d`` or ``i`` : signed decimal integer
``e`` or ``E`` : scientific notation with ``e`` or ``E``.
``f`` : decimal floating point
``g,G`` : use the shorter of ``e,E`` or ``f``
``o`` : signed octal
``s`` : string of characters
``u`` : unsigned decimal integer
``x,X`` : unsigned hexadecimal integer
This explanation of ``fmt`` is not complete, for an exhaustive
specification see [1]_.
References
----------
.. [1] `Format Specification Mini-Language
<https://docs.python.org/library/string.html#format-specification-mini-language>`_,
Python Documentation.
Examples
--------
>>> x = y = z = np.arange(0.0,5.0,1.0)
>>> np.savetxt('test.out', x, delimiter=',') # X is an array
>>> np.savetxt('test.out', (x,y,z)) # x,y,z equal sized 1D arrays
>>> np.savetxt('test.out', x, fmt='%1.4e') # use exponential notation
help(np.loadtxt)
# Output
Help on function loadtxt in module numpy:
loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None, *, like=None)
Load data from a text file.
Each row in the text file must have the same number of values.
Parameters
----------
fname : file, str, or pathlib.Path
File, filename, or generator to read. If the filename extension is
``.gz`` or ``.bz2``, the file is first decompressed. Note that
generators should return byte strings.
dtype : data-type, optional
Data-type of the resulting array; default: float. If this is a
structured data-type, the resulting array will be 1-dimensional, and
each row will be interpreted as an element of the array. In this
case, the number of columns used must match the number of fields in
the data-type.
comments : str or sequence of str, optional
The characters or list of characters used to indicate the start of a
comment. None implies no comments. For backwards compatibility, byte
strings will be decoded as 'latin1'. The default is '#'.
delimiter : str, optional
The string used to separate values. For backwards compatibility, byte
strings will be decoded as 'latin1'. The default is whitespace.
converters : dict, optional
A dictionary mapping column number to a function that will parse the
column string into the desired value. E.g., if column 0 is a date
string: ``converters = {0: datestr2num}``. Converters can also be
used to provide a default value for missing data (but see also
`genfromtxt`): ``converters = {3: lambda s: float(s.strip() or 0)}``.
Default: None.
skiprows : int, optional
Skip the first `skiprows` lines, including comments; default: 0.
usecols : int or sequence, optional
Which columns to read, with 0 being the first. For example,
``usecols = (1,4,5)`` will extract the 2nd, 5th and 6th columns.
The default, None, results in all columns being read.
.. versionchanged:: 1.11.0
When a single column has to be read it is possible to use
an integer instead of a tuple. E.g ``usecols = 3`` reads the
fourth column the same way as ``usecols = (3,)`` would.
unpack : bool, optional
If True, the returned array is transposed, so that arguments may be
unpacked using ``x, y, z = loadtxt(...)``. When used with a
structured data-type, arrays are returned for each field.
Default is False.
ndmin : int, optional
The returned array will have at least `ndmin` dimensions.
Otherwise mono-dimensional axes will be squeezed.
Legal values: 0 (default), 1 or 2.
.. versionadded:: 1.6.0
encoding : str, optional
Encoding used to decode the inputfile. Does not apply to input streams.
The special value 'bytes' enables backward compatibility workarounds
that ensures you receive byte arrays as results if possible and passes
'latin1' encoded strings to converters. Override this value to receive
unicode arrays and pass strings as input to converters. If set to None
the system default is used. The default value is 'bytes'.
.. versionadded:: 1.14.0
max_rows : int, optional
Read `max_rows` lines of content after `skiprows` lines. The default
is to read all the lines.
.. versionadded:: 1.16.0
like : array_like
Reference object to allow the creation of arrays which are not
NumPy arrays. If an array-like passed in as ``like`` supports
the ``__array_function__`` protocol, the result will be defined
by it. In this case, it ensures the creation of an array object
compatible with that passed in via this argument.
.. note::
The ``like`` keyword is an experimental feature pending on
acceptance of :ref:`NEP 35 <NEP35>`.
.. versionadded:: 1.20.0
Returns
-------
out : ndarray
Data read from the text file.
See Also
--------
load, fromstring, fromregex
genfromtxt : Load data with missing values handled as specified.
scipy.io.loadmat : reads MATLAB data files
Notes
-----
This function aims to be a fast reader for simply formatted files. The
`genfromtxt` function provides more sophisticated handling of, e.g.,
lines with missing values.
.. versionadded:: 1.10.0
The strings produced by the Python float.hex method can be used as
input for floats.
Examples
--------
>>> from io import StringIO # StringIO behaves like a file object
>>> c = StringIO("0 1\n2 3")
>>> np.loadtxt(c)
array([[0., 1.],
[2., 3.]])
>>> d = StringIO("M 21 72\nF 35 58")
>>> np.loadtxt(d, dtype={'names': ('gender', 'age', 'weight'),
... 'formats': ('S1', 'i4', 'f4')})
array([(b'M', 21, 72.), (b'F', 35, 58.)],
dtype=[('gender', 'S1'), ('age', '<i4'), ('weight', '<f4')])
>>> c = StringIO("1,0,2\n3,0,4")
>>> x, y = np.loadtxt(c, delimiter=',', usecols=(0, 2), unpack=True)
>>> x
array([1., 3.])
>>> y
array([2., 4.])
This example shows how `converters` can be used to convert a field
with a trailing minus sign into a negative number.
>>> s = StringIO('10.01 31.25-\n19.22 64.31\n17.57- 63.94')
>>> def conv(fld):
... return -float(fld[:-1]) if fld.endswith(b'-') else float(fld)
...
>>> np.loadtxt(s, converters={0: conv, 1: conv})
array([[ 10.01, -31.25],
[ 19.22, 64.31],
[-17.57, 63.94]])
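Tying np.savetxt and np.loadtxt together: a small sketch (assuming the score.csv written above is present) that uses usecols and unpack to read the two columns into separate 1-D arrays:
import numpy as np
# unpack=True transposes the result, so each column can be unpacked into its own array
english, math = np.loadtxt("score.csv", dtype=np.int32, delimiter=",",
                           skiprows=1, usecols=(0, 1), unpack=True)
print(english.shape, math.shape) # (20,) (20,)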
3. NumPy's Own Storage Format
# NumPy also has its own storage format, with file names ending in .npy or .npz.
# 1. Saving: np.save(fname, array) or np.savez(fname, array). The former writes a .npy file;
#    the latter bundles one or more arrays into an (uncompressed) .npz archive.
#    Use np.savez_compressed if you want compression.
# 2. Loading: np.load(fname).
# This format handles many kinds of objects, including arrays with more than two dimensions.
import numpy as np
c = np.random.randint(0,10,size=(2,3))
np.save("c",c)
np.load("c.npy")
# savetxt can only save 1-D or 2-D arrays
d = np.random.randint(0,10,size=(2,3,4)) # 2 blocks, 3 rows, 4 columns
# np.savetxt("d.csv",d) # would raise: ValueError: Expected 1D or 2D array, got 3D array instead
np.save("d",d)
Output:
Summary:
- np.savetxt and np.loadtxt are generally used for CSV-style text files; they support a header but cannot store arrays with three or more dimensions
- np.save and np.load are generally used for binary (non-text) files; they have no header option but can store arrays of three or more dimensions
- For working specifically with CSV files there is also the csv module, which is built into Python and needs no installation
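A minimal sketch of np.savez (the file name data.npz and the keyword names are just illustrative): several arrays can be stored under names in one archive and loaded back by key:
import numpy as np
c = np.random.randint(0,10,size=(2,3))
d = np.random.randint(0,10,size=(2,3,4))
# store both arrays in one .npz archive under the names "c" and "d"
np.savez("data", c=c, d=d)
# np.load returns a dict-like NpzFile; arrays are looked up by the names used when saving
archive = np.load("data.npz")
print(archive.files)      # ['c', 'd']
print(archive["d"].shape) # (2, 3, 4)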
4. Two Ways to Read a CSV File
# -*- coding:utf-8 -*-
import csv
# read the file data as lists
def read_csv_demo01():
    # use a raw string so the backslashes in the Windows path are not treated as escape sequences
    with open(r'D:\MyDocuments\Code\DataAnalysis\代码资料\数据分析代码\数据分析代码\02numpy\02numpy\stock.csv','r') as fp:
        # reader is an iterator
        reader = csv.reader(fp)
        next(reader) # skip the header row so reading starts from the second line
        # each row is returned as a list
        for x in reader:
            name = x[3]
            volumn = x[-1]
            print({'name':name,'volumn':volumn})

# read the file as dictionaries (key-value pairs)
def read_csv_demo02():
    with open(r'D:\MyDocuments\Code\DataAnalysis\代码资料\数据分析代码\数据分析代码\02numpy\02numpy\stock.csv','r') as fp:
        # DictReader yields one dict per row and does not include the header row in the data
        # reader is an iterator; iterating over it returns dictionaries
        reader = csv.DictReader(fp)
        for x in reader:
            value = {"name":x['secShortName'],'volumn':x['turnoverVol']}
            print(value)

if __name__ == '__main__':
    read_csv_demo02()
Output:
5. Two Ways to Write a CSV File
import csv
def write_csv_demo01():
    headers = ['username','age','height']
    # rows as lists/tuples
    values = [
        ('张三',18,180),
        ('李四',19,190),
        ('王五',20,160)
    ]
    # newline='' prevents blank lines from appearing between rows on Windows
    with open(r'D:\MyDocuments\Code\DataAnalysis\代码资料\数据分析代码\数据分析代码\02numpy\02numpy\classroom2.csv','w',encoding='utf-8',newline='') as fp:
        writer = csv.writer(fp)
        # write a single row
        writer.writerow(headers)
        # write several rows at once
        writer.writerows(values)

def write_csv_demo02():
    headers = ['username', 'age', 'height']
    # data can also be written as dictionaries; that requires DictWriter
    # rows as dictionaries
    values = [
        {'username': '张三', 'age': 18, 'height': 180},
        {'username': '李四', 'age': 19, 'height': 190},
        {'username': '王五', 'age': 20, 'height': 170}
    ]
    with open(r'D:\MyDocuments\Code\DataAnalysis\代码资料\数据分析代码\数据分析代码\02numpy\02numpy\classroom3.csv','w',encoding='utf-8',newline='') as fp:
        # the field names are passed to DictWriter, but the header row still has to be written explicitly
        writer = csv.DictWriter(fp,headers)
        # call writeheader() to write the header row
        writer.writeheader()
        writer.writerows(values)

if __name__ == '__main__':
    write_csv_demo02()
Output:
6. NaN and INF
import numpy as np
data = np.random.randint(0,10,size=(3,5))
print(data)
# INF and NaN
# NaN: Not A Number
# INF: Infinity
# convert to float (NaN only exists for floating-point dtypes)
data = data.astype(np.float16)
data[0,1] = np.nan
print(data)
print(data/0)
# NaN is not equal to NaN: the condition np.nan != np.nan is True.
print(np.nan == np.nan)
# any arithmetic involving NaN yields NaN
print(data[0,1]*2)
# set another value to NaN
data[1,2] = np.nan
print(data)
# Deleting NaN values
# test element-wise whether values are NaN
print(np.isnan(data))
# extract all NaN values
print(data[np.isnan(data)])
# extract all non-NaN values
print(data[~np.isnan(data)])
# delete rows that contain NaN; np.where returns a tuple of index arrays
# np.delete removes the given rows: axis=0 means rows, and lines holds the row indices to drop
print('-'*40)
lines = np.where(np.isnan(data))[0] # take the first array in the tuple (the row indices)
print(lines)
print("删除NAN行后:")
np.delete(data,lines,axis=0)
# Replacing NaN values
# first read the score file
# the file was saved with the system default encoding (gbk)
# the NaN cells are blank strings in the spreadsheet, so load everything as strings first
print('-'*40)
scores = np.loadtxt(r'D:\MyDocuments\Code\DataAnalysis\代码资料\数据分析代码\数据分析代码\02numpy\02numpy\score.csv',delimiter=",",skiprows=1,dtype=str)
print(scores)
# first turn the blank cells into NaN, then convert to float;
# otherwise the empty strings cannot be converted to float directly
# replace the empty-string cells with NaN
print('-'*40)
scores[scores==''] = np.nan
print(scores)
print('-'*40)
scores1 = scores.astype(np.float16)
print(scores1)
# replace every NaN with 0
print('-'*40)
scores1[np.isnan(scores1)] = 0
print(scores1)
# sum each row (axis=1)
# note: for reductions like sum, axis=1 collapses the columns within each row;
# for np.delete above, axis=0 removes whole rows (see the axis section below)
print('-'*40)
print("row sums:")
row_sums = scores1.sum(axis=1) # avoid shadowing the built-in sum
print(row_sums)
# compute each column's mean; the NaN values have to be handled first
# to get a meaningful average, replace each NaN with the mean of the other values in its column
print('-'*20 + "column means" + '-'*20)
scores2 = scores.astype(np.float16)
print(scores2)
for x in range(scores2.shape[1]):
    col = scores2[:,x]
    non_nan_col = col[~np.isnan(col)]
    mean = non_nan_col.mean() # mean of the non-NaN entries
    col[np.isnan(col)] = mean # fill the NaN positions in this column with that mean
print('-'*40)
print(scores2)
Output:
Summary
- NaN is a float value that is not equal to anything, including itself; any arithmetic with NaN yields NaN
- np.isnan gives a boolean mask that can be used to extract, delete, or replace NaN values
- Rows containing NaN can be dropped with np.delete, or NaN entries can be filled with 0 or with a column mean
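As a simpler alternative to the column loop above, np.nanmean computes means while ignoring NaN. A minimal sketch with a small made-up score array (the values are illustrative only):
import numpy as np
scores = np.array([[90., np.nan, 78.],
                   [60., 82., np.nan],
                   [70., 88., 94.]])
# column means computed while ignoring the NaN entries
col_means = np.nanmean(scores, axis=0)
# fill every NaN with the mean of its own column
rows, cols = np.where(np.isnan(scores))
scores[rows, cols] = col_means[cols]
print(scores)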
7. The random Module
import numpy as np
# np.random.seed: sets the integer seed that initializes the random number generator
# with the same seed value, the same sequence of random numbers is produced on every run
np.random.seed(1)
np.random.rand() # a single random float in [0, 1); its value is discarded here
print(np.random.rand(2,3))
# np.random.randn: samples from the standard normal distribution (mean 0, standard deviation 1)
data = np.random.randn(2,3)
print(data)
# np.random.randint: random integers in a given range; size sets the shape
data1 = np.random.randint(10,size=(3,5)) # values in [0, 10), shape 3x5
data2 = np.random.randint(1,20,size=(3,6)) # values in [1, 20), shape 3x6
# np.random.choice: randomly sample from a list or an array
data = [4,65,6,3,5,73,23,5,6]
result1 = np.random.choice(data,size=(2,3)) # sample from data into a 2x3 array
result2 = np.random.choice(data,3) # sample 3 elements from data into a 1-D array
result3 = np.random.choice(10,3) # sample 3 values from the integers 0-9
# np.random.shuffle: shuffle the elements of an array in place
a = np.arange(10)
np.random.shuffle(a) # the positions of a's elements are randomly permuted in place
Output:
[[7.20324493e-01 1.14374817e-04 3.02332573e-01]
[1.46755891e-01 9.23385948e-02 1.86260211e-01]]
[[-1.10593508 -1.65451545 -2.3634686 ]
[ 1.13534535 -1.01701414 0.63736181]]
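A small sketch showing what the seed comment above means: resetting to the same seed reproduces exactly the same numbers.
import numpy as np
np.random.seed(1)
first = np.random.rand(2,3)
np.random.seed(1) # reset to the same seed
second = np.random.rand(2,3)
print(np.array_equal(first, second)) # True - identical sequences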
8. Understanding axis
import numpy as np
a = np.arange(0,4).reshape(2,2)
print(a)
print(a.sum(axis=0))
print(a.sum(axis=1))
x = np.arange(12).reshape(2,6)
print('x = ')
print(x)
print(x.max(axis=0))
print(x.max(axis=1))
# np.delete behaves differently from the reductions above: with axis=0 it looks at the
# direct children under the outermost brackets, removes the 0th one (the first row),
# and keeps the remaining row.
print(np.delete(x,0,axis=0))
y = np.arange(24).reshape(2,2,6)
print('y = ')
print(y)
print('y.max(axis=0) = ')
print(y.max(axis=0))
Output:
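One way to keep axis straight: the axis you pass to a reduction is the dimension that gets collapsed. A minimal sketch with keepdims, which keeps the collapsed axis around as length 1:
import numpy as np
x = np.arange(12).reshape(2,6)
print(x.sum(axis=0).shape)                # (6,)   - the 2 rows were collapsed
print(x.sum(axis=1).shape)                # (2,)   - the 6 columns were collapsed
print(x.sum(axis=1, keepdims=True).shape) # (2, 1) - collapsed axis kept as length 1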
This round of study took 12 days; a series of things around graduation ate into a lot of the time. Starting today I will be at home for two months, and I hope to keep up a good study rhythm.