
$("input[type='checkbox']").is(':checked')

鹿惑 answered

The page most likely has anti-scraping measures: the relevant data is commented out in the html source. Strip the comment markers first, then parse.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.basketball-reference.com/teams/MIN/2018.html#all_per_game')
# Strip the html comment markers, then parse
soup = BeautifulSoup(r.text.replace('<!--','').replace('-->',''),'lxml')
trs = soup.select('#per_game > tbody > tr')
print(trs[0])
喵小咪 answered

Web-page parsing in Python generally comes down to one of the following:
1. String methods
2. Regular expressions
3. An html/xml parsing library (e.g. the well-known BeautifulSoup)
For the example you gave, suppose:

>>> s = '<tr><td><b><a href=".././statistics/power" title="Exponent of the power-law degree distibution">Power law exponent (estimated) with d<sub>min</sub></a></b></td><td>2.1610(d<sub>min</sub> = 2) </td></tr>'

由于文本特征非常明顯, 可以這樣處理:
1.字符串處理方法:

>>> s.split('<td>')[-1].split('(d')[0]
'2.1610'

2. re:

>>> import re
>>> pattern = re.compile(r'</b></td><td>(.*)\(d<sub>')
>>> pattern.findall(s)
['2.1610']

3. BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.find_all('td')[1].contents[0][:-2]
'2.1610'

All of the above are ad-hoc solutions designed around the given example.

情已空 answered
var arr = ['0.1.1', '2.3.3', '0.3002.1', '4.2', '4.3.5', '4.3.4.5']
arr.sort((a, b) => {
    var items1 = a.split('.')
    var items2 = b.split('.')
    var len = Math.max(items1.length, items2.length)
    for (let i = 0; i < len; i++) {
      // Note: typeof x === undefined compares a string against the value
      // undefined and is always false; compare against undefined directly.
      // A missing segment sorts first, so '4.2' comes before '4.2.1'.
      if (items1[i] === undefined) return -1
      if (items2[i] === undefined) return 1
      if (items1[i] === items2[i]) continue
      return Number(items1[i]) - Number(items2[i])
    }
    return 0
})
console.log(arr)
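
With the comparison fixed, this logs ['0.1.1', '0.3002.1', '2.3.3', '4.2', '4.3.4.5', '4.3.5'].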
貓館 answered

Try console.log(JSON.stringify(this)); what you were looking at before is a snapshot of the object taken at the moment you expanded it.

青檸 answered

Mind the * in your regex: use the non-greedy (.*?) form, as below.

import requests
import re

def text():
    for a in range(1, 13):
        url = 'https://sf.taobao.com/list/50025969__1___%BA%BC%D6%DD.htm?spm=a213w.7398504.pagination.3.W9af3L&auction_start_seg=-1&page=' + str(a)
        html = requests.get(url).text
        # Pull the fields out of the embedded JSON with non-greedy captures
        ids = re.findall('"id":(.*?),"itemUrl"', html)
        names = re.findall('"title":"(.*?)"', html)
        prices = re.findall('"initialPrice": (.*?) ,"currentPrice"', html)
        find = zip(ids, names, prices)
        for txt in find:
            print(txt)

if __name__ == '__main__':
    print('\t\t\tNo.\t\t\t', '\t\t\t\t\tLocation\t\t\t', '\t\t\t\t\t\tPrice')
    text()


練命 answered

See here.
I found I didn't fully understand it myself, so in the spirit of learning I translated the linked material (the translation is rough, go easy on me; read the English original if you can).

Posted it on CSDN; the link is here.

log_df[['id','device']].groupby(['id'])['device'].apply(lambda x:len(set(x)))
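
For context, a minimal sketch of what this one-liner computes, on made-up data (column names taken from the snippet): the number of distinct device values per id. groupby('id')['device'].nunique() would be equivalent.

import pandas as pd

log_df = pd.DataFrame({
    'id':     [1, 1, 1, 2, 2, 3],
    'device': ['ios', 'ios', 'android', 'web', 'web', 'ios'],
})

# Distinct devices seen per id
print(log_df[['id', 'device']].groupby(['id'])['device'].apply(lambda x: len(set(x))))
# id
# 1    2
# 2    1
# 3    1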

陪我終 answered

Quite simple:

>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"

>>> data = json.loads('{\"count\":8,\"sub_images\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470700000c7084773fb2\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7084773fb2\"}],\"uri\":\"origin\\/470700000c7084773fb2\",\"height\":1590},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/47050001b69355a0bf1b\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47050001b69355a0bf1b\"}],\"uri\":\"origin\\/47050001b69355a0bf1b\",\"height\":1557},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470300020761150d671a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470300020761150d671a\"}],\"uri\":\"origin\\/470300020761150d671a\",\"height\":1552},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/47000002200f2a0a9020\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/47000002200f2a0a9020\"}],\"uri\":\"origin\\/47000002200f2a0a9020\",\"height\":1575},{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470000022011d5569ccb\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/470000022011d5569ccb\"}],\"uri\":\"origin\\/470000022011d5569ccb\",\"height\":1588},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/4700000220127db96444\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/4700000220127db96444\"}],\"uri\":\"origin\\/4700000220127db96444\",\"height\":1561},{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb9.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/46ff000532e33a9fa35a\"}],\"uri\":\"origin\\/46ff000532e33a9fa35a\",\"height\":1563},{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\",\"width\":1178,\"url_list\":[{\"url\":\"http:\\/\\/p9.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb1.pstatp.com\\/origin\\/470700000c7b871a5fae\"},{\"url\":\"http:\\/\\/pb3.pstatp.com\\/origin\\/470700000c7b871a5fae\"}],\"uri\":\"origin\\/470700000c7b871a5fae\",\"height\":1575}],\"max_img_width\":1178,\"labels\":[],\"sub_abstracts\":[\" \",\" \",\" \",\" \",\" \",\" \",\" \",\" 
\"],\"sub_titles\":[\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\",\"\\u6e05\\u65b0\\u81ea\\u7136\\uff0c\\u7f8e\\u4e3d\\u65e0\\u53cc\"]}'.decode('string_escape'))
>>> 
>>> data["count"]
8
>>> 
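
Note that the snippets above are Python 2 (print statement, str.decode). On Python 3 the 'string_escape' codec no longer exists; a rough equivalent, sketched here on a shortened sample of the data above, is the 'unicode_escape' codec (safe here because the data only uses ASCII and \uXXXX escapes):

import codecs
import json

# Shortened sample with literal backslashes, as it would arrive over the wire
raw = r'{\"count\":8,\"uri\":\"origin\\/470700000c7084773fb2\"}'
data = json.loads(codecs.decode(raw, 'unicode_escape'))
print(data["count"])  # 8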
心上人 answered
$arr = $arr['data'];

// Keep only the entries whose symbol is BTC
$arr1 = array_filter($arr, function ($item) {
    return $item['symbol'] == 'BTC';
});
var_dump($arr1);
孤酒 answered

Problem solved; I got tripped up by js again (I'll remember next time).
The tags I was selecting on that page are added dynamically by js, so it makes sense that the scraper got nothing. I then analyzed the page the crawler fetched, and the login had in fact succeeded.

空痕 answered

How many elements does the inputVals array have?
I don't see how this assignment is supposed to work.

雨蝶 answered

A timeout error is a server-side problem, not a client-side one. Network failures are perfectly normal; just retry.
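
If the retries should be automatic, a minimal sketch using requests with urllib3's Retry (the URL and the retry counts here are made up):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Up to 3 retries with exponential backoff, also retrying on 5xx responses
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retry))

resp = session.get('https://example.com', timeout=10)  # hypothetical target
print(resp.status_code)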

淺時光 answered

You first need to parse $result into an array with $result = json_decode($result, true); and then do the following:

foreach($result['list'] as $mydata)
{
    echo $mydata['name'];
}
尋仙 answered

The search suggestions are generated dynamically with js.
You can observe directly which api they are requested from.
For example, when searching for hello, you can request directly:
https://finance.yahoo.com/_finance_doubledown/api/resource/searchassist;searchTerm=hello
The code can then be written like this:

import json
import requests

kw = 'hello'
url_base = 'https://finance.yahoo.com/_finance_doubledown/api/resource/searchassist;searchTerm='
url = url_base + kw
resp = requests.get(url)
print(json.dumps(json.loads(resp.text), indent=4, sort_keys=True))

You get a result like this:

{
    "hiConf": false,
    "items": [
        {
            "exch": "FRA",
            "exchDisp": "Frankfurt",
            "name": "HelloFresh SE",
            "symbol": "HFG.F",
            "type": "S",
            "typeDisp": "Equity"
        },
        ...

In my tests a direct request seems to be enough; I don't yet know whether yahoo has any rate-limiting measures.

離人歸 answered

Actually the compiler does the conversion for you; it improves fault tolerance and spares you unnecessary thinking.

赱丅呿 answered

Probably not. You can use fiddler to capture the traffic and take a look.

我以為 answered

Thanks for the invite; it looks like your problem is already solved.
One suggestion: for crawler/scraper programs I usually store data in mongodb rather than a relational database like mysql. There are several advantages:

  1. A big headache with crawlers is that you often cannot predict the format of the scraped data until you run into it. Mongo is nosql with flexible fields: every document you insert into a collection can have a different set of keys, and querying the mongo way still works fine, whereas adding a field in a sql-style db may mean modifying the entire table (see the sketch after this list).
  2. mysql's strength is transactions, which suits mature, stable business workloads. First-hand scraped data is usually temporary: you typically build second- and third-stage programs to query, filter, and clean it, and at that point you can pull what you need out of mongo into whatever other db the business requires, or dump it straight to excel.
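
A minimal sketch of point 1, assuming a local MongoDB and the pymongo driver (the database, collection, and field names are made up):

from pymongo import MongoClient

coll = MongoClient('mongodb://localhost:27017')['crawler']['pages']

# Documents in the same collection may carry different keys
coll.insert_one({'url': 'https://example.com/a', 'title': 'A'})
coll.insert_one({'url': 'https://example.com/b', 'price': 9.9, 'tags': ['x']})

# Queries still work; documents lacking the field simply do not match
for doc in coll.find({'price': {'$exists': True}}):
    print(doc['url'], doc['price'])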