陌上花 answered

Calling plt.scatter as few times as possible greatly improves performance.

Details
Suppose WX_b is an M×N matrix and mx is an M×1 matrix. The following code

for i in range(WX_b.shape[0]):
    for j in range(WX_b.shape[1]):
        plt.scatter(mx[i], WX_b[i][j])

can be optimized to a single call:

plt.scatter(mx.repeat(WX_b.shape[1], axis=1), WX_b)

Jupyter example code

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

WX_b = np.random.randn(30, 5)
mx = np.random.randn(WX_b.shape[0], 1)

def func1():
    for i in range(WX_b.shape[0]):
        for j in range(WX_b.shape[1]):
            plt.scatter(mx[i], WX_b[i][j])
            
def func2():
    plt.scatter(mx.repeat(WX_b.shape[1], axis=1), WX_b)
    
%time func1()
%time func2()

Reference result: func2 takes roughly 5% of func1's running time.
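
To see why the single call plots the same points, it may help to compare the shapes: mx.repeat(N, axis=1) copies each x value across the N columns, so it lines up element by element with WX_b. The sketch below uses only NumPy and the same array names; the fixed seed and the flattening step are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, only for reproducibility
WX_b = rng.standard_normal((30, 5))     # M x N values (y coordinates)
mx = rng.standard_normal((30, 1))       # M x 1 values (x coordinates)

# Repeat each x value once per column so x and y have the same shape.
xs = mx.repeat(WX_b.shape[1], axis=1)

# Collect the (x, y) pairs the nested loop would have plotted one by one.
loop_pairs = [(mx[i, 0], WX_b[i, j])
              for i in range(WX_b.shape[0])
              for j in range(WX_b.shape[1])]

# Flattening row by row visits the points in the same order as the loop.
vector_pairs = list(zip(xs.ravel(), WX_b.ravel()))

print(xs.shape, WX_b.shape)
print(loop_pairs == vector_pairs)
```

One visible difference remains: each separate plt.scatter call picks the next color in the cycle, while the single call draws all points in one color.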

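別瞎鬧 answered

Your Python basics need work: the date argument must be the result of the strftime call, not a quoted string that contains the call.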

#df = ts.get_tick_data('601688',date='begin.strftime("%Y-%m-%d")') 
df = ts.get_tick_data('601688',date=begin.strftime("%Y-%m-%d")) 
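
The difference between the two lines can be checked without tushare. In this sketch, begin is a hypothetical datetime.date chosen for illustration; the quoted version passes the source text of the call, while the unquoted version passes the formatted date.

```python
import datetime

begin = datetime.date(2018, 1, 2)  # hypothetical start date for illustration

# Wrong: the quotes turn the whole expression into a literal string.
wrong = 'begin.strftime("%Y-%m-%d")'

# Right: the method is actually called and returns the formatted date.
right = begin.strftime("%Y-%m-%d")

print(wrong)  # begin.strftime("%Y-%m-%d")
print(right)  # 2018-01-02
```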

吃藕丑 answered

<?php

public function b($arr = array()) {
    if (empty($arr)) {
        return "";
    } else {
        foreach ($arr as &$v) {
            if (is_array($v)) {
                $v = $this->b($v);
            } else {
                $v = $v + 1;
            }
        }
        return $arr;
    }
}

?>

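心沉 answered

I think you haven't worked out the difference between a "number" and "bytes".
The number 0xfffe7b89 is simply 4294867849, and its negative is -0xfffe7b89.
In fact it is not a negative number at all; you only "feel" that it is one.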
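
A short sketch of the distinction: the same four bytes give 4294867849 when read as an unsigned integer, and a negative value appears only when the reader chooses a signed (two's complement) interpretation. The 32-bit width and big-endian byte order used here are assumptions for illustration.

```python
value = 0xfffe7b89
print(value)  # 4294867849

# The same number stored as four bytes.
raw = value.to_bytes(4, byteorder="big")

# Interpretation is a choice made when reading the bytes back.
unsigned = int.from_bytes(raw, byteorder="big", signed=False)
signed = int.from_bytes(raw, byteorder="big", signed=True)

print(unsigned)  # 4294867849
print(signed)    # -99447
```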

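不討囍 answered

Copied straight over from CSDN:

I needed to build a "viewpoints" feature whose rooms work much like Zhihu topics, so I had to find a way to crawl the topics. After a long struggle I finally got it working. The code is written in Python. If you don't know Python, please learn it yourself; if you do, just read the code. It definitely works.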


#coding:utf-8
"""
@author:haoning
@create time:2015.8.5
"""
from __future__ import division  # true division
from Queue import Queue
from __builtin__ import False
import json
import os
import re
import platform
import uuid
import urllib
import urllib2
import sys
import time
import MySQLdb as mdb
from bs4 import BeautifulSoup


reload(sys)
sys.setdefaultencoding( "utf-8" )


headers = {
   'User-Agent' : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0',
   'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
   'X-Requested-With':'XMLHttpRequest',
   'Referer':'https://www.zhihu.com/topics',
   'Cookie':'__utma=51854390.517069884.1416212035.1416212035.1416212035.1; q_c1=c02bf44d00d240798bfabcfc95baeb56|1455778173000|1416205243000; _za=b1c8ae35-f986-46a2-b24a-cb9359dc6b2a; aliyungf_tc=AQAAAJ1m71jL1woArKqF22VFnL/wRy6C; _xsrf=9d494558f9271340ab24598d85b2a3c8; cap_id="MDNiMjcwM2U0MTRhNDVmYjgxZWVhOWI0NTA2OGU5OTg=|1455864276|2a4ce8247ebd3c0df5393bb5661713ad9eec01dd"; n_c=1; _alicdn_sec=56c6ba4d556557d27a0f8c876f563d12a285f33a'
}


DB_HOST = '127.0.0.1'
DB_USER = 'root'
DB_PASS = 'root'


queue= Queue() # receive queue
nodeSet=set()
keywordSet=set()
stop=0
offset=-20
level=0
maxLevel=7
counter=0
base=""


conn = mdb.connect(DB_HOST, DB_USER, DB_PASS, 'zhihu', charset='utf8')
conn.autocommit(False)
curr = conn.cursor()


def get_html(url):
    try:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req,None,3) # a proxy should be added here
        html = response.read()
        return html
    except:
        pass
    return None


def getTopics():
    url = 'https://www.zhihu.com/topics'
    print url
    try:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req) # a proxy should be added here
        html = response.read().decode('utf-8')
        print html
        soup = BeautifulSoup(html)
        lis = soup.find_all('li', {'class' : 'zm-topic-cat-item'})
        
        for li in lis:
            data_id=li.get('data-id')
            name=li.text
            curr.execute('select id from classify_new where name=%s',(name))
            y= curr.fetchone()
            if not y:
                curr.execute('INSERT INTO classify_new(data_id,name)VALUES(%s,%s)',(data_id,name))
        conn.commit()
    except Exception as e:
        print "get topic error",e
        


def get_extension(name):  
    where=name.rfind('.')
    if where!=-1:
        return name[where:len(name)]
    return None




def which_platform():
    sys_str = platform.system()
    return sys_str


def GetDateString():
    when=time.strftime('%Y-%m-%d',time.localtime(time.time()))
    foldername = str(when)
    return foldername 


def makeDateFolder(par,classify):
    try:
        if os.path.isdir(par):
            newFolderName=par + '//' + GetDateString() + '//'  +str(classify)
            if which_platform()=="Linux":
                newFolderName=par + '/' + GetDateString() + "/" +str(classify)
            if not os.path.isdir( newFolderName ):
                os.makedirs( newFolderName )
            return newFolderName
        else:
            return None 
    except Exception,e:
        print "kk",e
    return None 


def download_img(url,classify):
    try:
        extention=get_extension(url)
        if(extention is None):
            return None
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req,None,3)
        dataimg=resp.read()
        name=str(uuid.uuid1()).replace("-","")+"_www.guandn.com"+extention
        top="E://topic_pic"
        folder=makeDateFolder(top, classify)
        filename=None
        if folder is not None:
            filename  =folder+"//"+name
        try:
            if "e82bab09c_m" in str(url):
                return True
            if not os.path.exists(filename):
                file_object = open(filename,'w+b')
                file_object.write(dataimg)
                file_object.close()
                return '/room/default/'+GetDateString()+'/'+str(classify)+"/"+name
            else:
                print "file exist"
                return None
        except IOError,e1:
            print "e1=",e1
            pass
    except Exception as e:
        print "eee",e
        pass
    return None # if the download failed, fall back to the original site's link


def getChildren(node,name):
    global queue,nodeSet
    try:
        url="https://www.zhihu.com/topic/"+str(node)+"/hot"
        html=get_html(url)
        if html is None:
            return
        soup = BeautifulSoup(html)
        p_ch='父话题' # "parent topic"
        node_name=soup.find('div', {'id' : 'zh-topic-title'}).find('h1').text
        topic_cla=soup.find('div', {'class' : 'child-topic'})
        if topic_cla is not None:
            try:
                p_ch=str(topic_cla.text)
                aList = soup.find_all('a', {'class' : 'zm-item-tag'}) # get all child nodes
                if u'子话题' in p_ch: # "child topics"
                    for a in aList:
                        token=a.get('data-token')
                        a=str(a).replace('\n','').replace('\t','').replace('\r','')
                        start=str(a).find('>')
                        end=str(a).rfind('</a>')
                        new_node=str(str(a)[start+1:end])
                        curr.execute('select id from rooms where name=%s',(new_node)) # first make sure the name does not already exist
                        y= curr.fetchone()
                        if not y:
                            print "y=",y,"new_node=",new_node,"token=",token
                            queue.put((token,new_node,node_name))
            except Exception as e:
                print "add queue error",e
    except Exception as e:
        print "get html error",e
        
    


def getContent(n,name,p,top_id):
    try:
        global counter
        curr.execute('select id from rooms where name=%s',(name)) # first make sure the name does not already exist
        y= curr.fetchone()
        print "exist?? ",y,"n=",n
        if not y:
            url="https://www.zhihu.com/topic/"+str(n)+"/hot"
            html=get_html(url)
            if html is None:
                return
            soup = BeautifulSoup(html)
            title=soup.find('div', {'id' : 'zh-topic-title'}).find('h1').text
            pic_path=soup.find('a',{'id':'zh-avartar-edit-form'}).find('img').get('src')
            description=soup.find('div',{'class':'zm-editable-content'})
            if description is not None:
                description=description.text
                
            if (u"未归类" in title or u"根话题" in title): # "uncategorized" / "root topic": allow storing, avoid an infinite loop
                description=None
                
            tag_path=download_img(pic_path,top_id)
            print "tag_path=",tag_path
            if (tag_path is not None) or tag_path==True:
                if tag_path==True:
                    tag_path=None
                father_id=2 # defaults to the "miscellaneous chat" room
                curr.execute('select id from rooms where name=%s',(p))
                results = curr.fetchall()
                for r in results:
                    father_id=r[0]
                name=title
                curr.execute('select id from rooms where name=%s',(name)) # first make sure the name does not already exist
                y= curr.fetchone()
                print "store see..",y
                if not y:
                    friends_num=0
                    temp = time.time()
                    x = time.localtime(float(temp))
                    create_time = time.strftime("%Y-%m-%d %H:%M:%S",x) # get time now
                    create_time
                    creater_id=None
                    room_avatar=tag_path
                    is_pass=1
                    has_index=0
                    reason_id=None  
                    #print father_id,name,friends_num,create_time,creater_id,room_avatar,is_pass,has_index,reason_id
                    ###################### content that qualifies for storage
                    counter=counter+1
                    curr.execute("INSERT INTO rooms(father_id,name,friends_num,description,create_time,creater_id,room_avatar,is_pass,has_index,reason_id)VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(father_id,name,friends_num,description,create_time,creater_id,room_avatar,is_pass,has_index,reason_id))
                    conn.commit() # must commit right away, otherwise the parent node cannot be found
                    if counter % 200==0:
                        print "current node",name,"num",counter
    except Exception as e:
        print "get content error",e       


def work():
    global queue
    curr.execute('select id,node,parent,name from classify where status=1')
    results = curr.fetchall()
    for r in results:
        top_id=r[0]
        node=r[1]
        parent=r[2]
        name=r[3]
        try:
            queue.put((node,name,parent)) # put the seed into the queue first
            while queue.qsize() >0:
                n,name,p=queue.get() # dequeue the top node
                getContent(n,name,p,top_id)
                getChildren(n,name) # children of the dequeued node
            conn.commit()
        except Exception as e:
            print "what's wrong",e  
            
def new_work():
    global queue
    curr.execute('select id,data_id,name from classify_new_copy where status=1')
    results = curr.fetchall()
    for r in results:
        top_id=r[0]
        data_id=r[1]
        name=r[2]
        try:
            get_topis(data_id,name,top_id)
        except:
            pass




def get_topis(data_id,name,top_id):
    global queue
    url = 'https://www.zhihu.com/node/TopicsPlazzaListV2'
    isGet = True;
    offset = -20;
    data_id=str(data_id)
    while isGet:
        offset = offset + 20
        values = {'method': 'next', 'params': '{"topic_id":'+data_id+',"offset":'+str(offset)+',"hash_id":""}'}
        try:
            msg=None
            try:
                data = urllib.urlencode(values)
                request = urllib2.Request(url,data,headers)
                response = urllib2.urlopen(request,None,5)
                html=response.read().decode('utf-8')
                json_str = json.loads(html)
                ms=json_str['msg']
                if len(ms) <5:
                    break
                msg=ms[0]
            except Exception as e:
                print "eeeee",e
            #print msg
            if msg is not None:
                soup = BeautifulSoup(str(msg))
                blks = soup.find_all('div', {'class' : 'blk'})
                for blk in blks:
                    page=blk.find('a').get('href')
                    if page is not None:
                        node=page.replace("/topic/","") # store more seeds
                        parent=name
                        ne=blk.find('strong').text
                        try:
                            queue.put((node,ne,parent)) # put the seed into the queue first
                            while queue.qsize() >0:
                                n,name,p=queue.get() # dequeue the top node
                                size=queue.qsize()
                                if size > 0:
                                    print size
                                getContent(n,name,p,top_id)
                                getChildren(n,name) # children of the dequeued node
                            conn.commit()
                        except Exception as e:
                            print "what's wrong",e  
        except urllib2.URLError, e:
            print "error is",e
            pass 
            
        
if __name__ == '__main__':
    i=0
    while i<400:
        new_work()
        i=i+1
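
The script above targets Python 2 (urllib2, print statements). For readers on Python 3, the two network-related pieces could be sketched roughly as follows. This is an assumption-laden rewrite rather than the author's code: build_topic_payload is a hypothetical helper name, the topic id is made up, and json.dumps replaces the manual string concatenation used in get_topis.

```python
import json
import urllib.parse
import urllib.request


def get_html(url, timeout=3):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; malformed URLs raise ValueError.
        return None


def build_topic_payload(topic_id, offset):
    """Build a form-encoded POST body shaped like the one in get_topis."""
    params = json.dumps({"topic_id": int(topic_id), "offset": offset, "hash_id": ""})
    return urllib.parse.urlencode({"method": "next", "params": params}).encode("utf-8")


payload = build_topic_payload("253", 20)  # "253" is a made-up topic id
print(urllib.parse.parse_qs(payload.decode("utf-8")))
```

Letting json.dumps build the params string avoids quoting mistakes when the values change.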

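A note on the database: I won't upload an attachment here. Create the tables yourself from the fields used in the code, since that really is simple. I used MySQL; build whatever fits your own needs.

If there is anything you don't understand, you can find me on 轉盤網, which I also developed; the QQ group number is kept up to date there. I'm not leaving a QQ number here, to avoid getting banned by the system.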

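喜歡你 answered

When I scraped IT桔子 last year I also got stuck at the login step. In the end I solved it by putting the cookies from a logged-in session directly into the program.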
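
A minimal sketch of that approach with the standard library, assuming the cookie string has been copied from the browser after logging in. The cookie names, values, and URL below are placeholders, and build_request is a hypothetical helper.

```python
import urllib.request

# Hypothetical cookie string copied from the browser's developer tools
# after logging in; the names and values here are placeholders.
COOKIE = "sessionid=abc123; token=xyz789"


def build_request(url, cookie=COOKIE):
    """Create a request that carries the logged-in cookies."""
    return urllib.request.Request(url, headers={
        "Cookie": cookie,
        "User-Agent": "Mozilla/5.0",
    })


req = build_request("https://example.com/protected")  # placeholder URL
print(req.get_header("Cookie"))
# To actually fetch the page: urllib.request.urlopen(req, timeout=5).read()
```

Cookies copied this way expire, so the string usually has to be refreshed from time to time.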

陌顏 answered

Can't you just filter out that index?

var arr = [{index: 1, a: "1", b: "2", c: "3", d: "4"},{index: 2, a: "4", b: "5", c: "6", d: "7"}];
var result = arr.filter(o=>o.index != 1);
console.log(result);

下墜 answered

It's the reference problem again: b and a refer to the same object.

var a = {};
var b = a;
b.id = 1;
console.log(a)//{ id: 1 }

鹿惑 answered

The page appears to have anti-scraping measures in place: the relevant data is commented out in the HTML source. You can strip the comment markers first and then parse it.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.basketball-reference.com/teams/MIN/2018.html#all_per_game')
# strip the HTML comment markers, then parse
soup = BeautifulSoup(r.text.replace('<!--','').replace('-->',''),'lxml')
trs = soup.select('#per_game > tbody > tr')
print(trs[0])
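
The effect of stripping the markers can be reproduced offline. The sketch below uses the standard library's HTMLParser instead of BeautifulSoup, and the page fragment is made up for illustration: while the table sits inside a comment the parser sees no rows, and once the markers are removed the rows become visible.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count the <tr> start tags that the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up fragment: the table is wrapped in an HTML comment.
page = ('<div id="all_per_game"><!-- <table id="per_game"><tbody>'
        '<tr><td>1</td></tr><tr><td>2</td></tr>'
        '</tbody></table> --></div>')

print(count_rows(page))                                         # 0
print(count_rows(page.replace("<!--", "").replace("-->", "")))  # 2
```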

祉小皓 answered

Because the first parameter of read_csv is:

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

it accepts the IO object returned by open, and since open supports Chinese file names, the file opens without error.
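
A small sketch of that workaround. The file name, its contents, and the UTF-8 encoding are assumptions made up for the example.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV under a non-ASCII file name (made up for this example).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "数据.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("a,b\n1,2\n3,4\n")

# Open the file ourselves and pass the handle instead of the path.
with open(path, encoding="utf-8") as f:
    df = pd.read_csv(f)

print(df)
```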

失魂人 answered

Because you called makeRow only once, every "row" of the matrix refers to the same array, so changing a value in the matrix changes that one shared row.
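
The same behavior can be reproduced in Python. In this sketch make_row is a stand-in for the makeRow function mentioned above; its body and the 3×3 size are assumptions.

```python
def make_row(width):
    return [0] * width


shared_row = make_row(3)                   # called once
shared = [shared_row for _ in range(3)]    # one row object, referenced three times
shared[0][0] = 1
print(shared)    # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]

separate = [make_row(3) for _ in range(3)]  # a fresh row for each position
separate[0][0] = 1
print(separate)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Calling the row constructor once per row is what keeps the rows independent.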

氕氘氚 answered

You can simply add a check in the selectNav method so that index 2 is not processed.

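艷骨 answered

MySQL 8 is not compatible with earlier versions here. You can reinstall MySQL (or use the Reconfigure option) and set the authentication option to "Use Legacy Authentication Method". Alternatively, if you don't strictly need MySQL 8, you can downgrade to an earlier version.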

入她眼 answered

Emmm... where does 2^n come from??...

毀憶 answered

import pandas as pd
df = pd.DataFrame([['?', 1], ['?', 3], ['?', 2], [3, '?']])
print(df)
print(df.replace('?', 0))
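
# Separate snippet: assumes a DataFrame that has 'key1' and 'data2' columns
def f(df, col=1):
    # keep the rows of each group whose data2 equals the group maximum
    return df[df['data2'] == max(df['data2'])]

df1 = df.groupby(['key1']).apply(f)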

別傷我 answered

df.drop(df.index[1:][df.B[1:]<df.B[:-1]])


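The rows need to be filtered on column B, keeping those where each value is greater than the one before it, so the row with index=4 should be removed.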

@zoujj

>>> import pandas as pd
>>> df = pd.DataFrame(list(range(1,8)),columns=['B'])
>>> df.B[4]=3
>>> df['A']=1
>>> df
   B  A
0  1  1
1  2  1
2  3  1
3  4  1
4  3  1
5  6  1
6  7  1
>>> df[(df.B[1:]<df.B[:-1])] # @zoujj's approach is wrong
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    df[(df.B[1:]<df.B[:-1])]
……
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
>>> df.drop(df.index[1:][df.B[1:]<df.B[:-1]])
   B  A
0  1  1
1  2  1
2  3  1
3  4  1
5  6  1
6  7  1
>>>
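
Newer pandas releases may refuse to compare two Series whose labels differ (here df.B[1:] and df.B[:-1]), so the one-liner above can fail there. A hedged alternative, not part of the original answer, is to compare each value with the previous one via shift(), which keeps the index aligned:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 2, 3, 4, 3, 6, 7], "A": 1})

# shift() moves B down by one row but keeps the index, so both sides of the
# comparison carry identical labels. The first row is compared against NaN,
# which is never "smaller", so it is kept.
drops = df["B"] < df["B"].shift()
result = df[~drops]

print(result)
```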