真難過 answered

Note: the yield from syntax is only supported in Python 3 (3.3 and later).

from collections.abc import Iterable


def flatten(d, prefix="", sep="_"):
    """Yield (flattened_key, value) pairs from a nested dict."""

    def _take_prefix(k, v, p):
        # Recurse into v, extending the key prefix with k.
        if p:
            yield from flatten(v, "{}{}{}".format(p, sep, k), sep)
        else:
            yield from flatten(v, str(k), sep)

    if isinstance(d, dict):
        for k, v in d.items():
            # Strings and non-iterable values are leaves.
            if isinstance(v, str) or not isinstance(v, Iterable):
                if prefix:
                    yield "{}{}{}".format(prefix, sep, k), v
                else:
                    yield k, v
            elif isinstance(v, dict):
                yield from _take_prefix(k, v, prefix)
            elif isinstance(v, list):
                # Every list item is flattened under the same key.
                for i in v:
                    yield from _take_prefix(k, i, prefix)

dic = {}  # placeholder: replace with your nested dataset
for key, value in flatten(dic):
    print("{}: {}".format(key, value))

The output looks like this; the data should now be flattened:

status: changed
dataset_id: 5a4b463c855d783af4f5f695
dataset_name: AE_E
dataset_label: 1- ADVERSE EVENTS - Not Analyzed
details_variables_variable_id: 5a4b4647855d783b494f9d3f
details_variables_variable_name: CPEVENT
details_variables_variable_label: CPEVENT
details_variables_status: changed
details_variables_details_r_type_new_value: unary
details_variables_details_r_type_old_value: factor
details_variables_message: Variable with different R Type
details_variables_variable_id: 5a4b4647855d783b494f9d25
details_variables_variable_name: CPEVENT2
details_variables_variable_label: CPEVENT2
details_variables_status: changed
details_variables_details_r_type_new_value: unary
details_variables_details_r_type_old_value: binary
details_variables_message: Variable with different R Type
details_variables_variable_id: 5a4b4647855d783b494f9d26
details_variables_variable_name: CP_UNSCHEDULED
details_variables_variable_label: CP_UNSCHEDULED
details_variables_status: changed
details_variables_details_r_type_new_value: undefined
details_variables_details_r_type_old_value: unary
details_variables_message: Variable with different R Type
details_variables_variable_id: 5a4b4647855d783b494f9d02
details_variables_variable_name: VISIT_NUMBER
details_variables_variable_label: VISIT_NUMBER
details_variables_status: changed
details_variables_details_r_type_new_value: unary
details_variables_details_r_type_old_value: integer
details_variables_message: Variable with different R Type
details_variables_variable_id: 5a4b4647855d783b494f9ccf
details_variables_variable_name: VISIT_NUMBER2
details_variables_variable_label: VISIT_NUMBER2
details_variables_status: changed
details_variables_details_r_type_new_value: unary
details_variables_details_r_type_old_value: binary
details_variables_message: Variable with different R Type
details_many_visits: None
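
To see how the keys are built, here is a minimal sketch run on a made-up miniature of the dataset. The flatten function is restated compactly (same idea as above) so that the block runs on its own; the sample dict is hypothetical and only mirrors the shape of the real data.

```python
from collections.abc import Iterable


def flatten(d, prefix="", sep="_"):
    # Compact restatement of the flatten generator, so this block is self-contained.
    if not isinstance(d, dict):
        return
    for k, v in d.items():
        key = "{}{}{}".format(prefix, sep, k) if prefix else str(k)
        if isinstance(v, str) or not isinstance(v, Iterable):
            yield key, v  # leaf value
        elif isinstance(v, dict):
            yield from flatten(v, key, sep)
        elif isinstance(v, list):
            for item in v:
                yield from flatten(item, key, sep)


# Hypothetical miniature: two variable records under details.variables.
sample = {
    "status": "changed",
    "details": {
        "variables": [
            {"variable_name": "CPEVENT", "status": "changed"},
            {"variable_name": "CPEVENT2", "status": "changed"},
        ],
        "many_visits": None,
    },
}

pairs = list(flatten(sample))
for key, value in pairs:
    print("{}: {}".format(key, value))
```

Each list item repeats the same flattened keys, which is why details_variables_variable_name shows up once per variable in the output above.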

For your updated question, one more function does the job:

# fuck_all is a special case written purely for this need: one row per variable
# record under a dataset, with the dataset-level fields repeated on every row.
def fuck_all(dic, prefix="details_variables"):
    lst = list(flatten(dic))  # flatten is generic: it can flatten any nested dataset
    lines = []
    top = {k: v for k, v in lst if not k.startswith(prefix)}
    index = 0
    for key, value in lst:
        if not key.startswith(prefix):
            continue
        if not lines:
            lines.append(top.copy())
        # A key that is already present means a new variable record has started.
        if key in lines[index]:
            index += 1
            lines.append(top.copy())
        lines[index][key] = value
    return lines

d = {}  # placeholder: replace with your nested dataset
for i in fuck_all(d):
    print(i)

The result looks like this, which should meet your needs:

{'status': 'changed', 'dataset_id': '5a4b463c855d783af4f5f695', 'dataset_name': 'AE_E', 'dataset_label': '1- ADVERSE EVENTS - Not Analyzed', 'details_many_visits': None, 'details_variables_variable_id': '5a4b4647855d783b494f9d3f', 'details_variables_variable_name': 'CPEVENT', 'details_variables_variable_label': 'CPEVENT', 'details_variables_status': 'changed', 'details_variables_details_r_type_new_value': 'unary', 'details_variables_details_r_type_old_value': 'factor', 'details_variables_message': 'Variable with different R Type'}
{'status': 'changed', 'dataset_id': '5a4b463c855d783af4f5f695', 'dataset_name': 'AE_E', 'dataset_label': '1- ADVERSE EVENTS - Not Analyzed', 'details_many_visits': None, 'details_variables_variable_id': '5a4b4647855d783b494f9d25', 'details_variables_variable_name': 'CPEVENT2', 'details_variables_variable_label': 'CPEVENT2', 'details_variables_status': 'changed', 'details_variables_details_r_type_new_value': 'unary', 'details_variables_details_r_type_old_value': 'binary', 'details_variables_message': 'Variable with different R Type'}
{'status': 'changed', 'dataset_id': '5a4b463c855d783af4f5f695', 'dataset_name': 'AE_E', 'dataset_label': '1- ADVERSE EVENTS - Not Analyzed', 'details_many_visits': None, 'details_variables_variable_id': '5a4b4647855d783b494f9d26', 'details_variables_variable_name': 'CP_UNSCHEDULED', 'details_variables_variable_label': 'CP_UNSCHEDULED', 'details_variables_status': 'changed', 'details_variables_details_r_type_new_value': 'undefined', 'details_variables_details_r_type_old_value': 'unary', 'details_variables_message': 'Variable with different R Type'}
{'status': 'changed', 'dataset_id': '5a4b463c855d783af4f5f695', 'dataset_name': 'AE_E', 'dataset_label': '1- ADVERSE EVENTS - Not Analyzed', 'details_many_visits': None, 'details_variables_variable_id': '5a4b4647855d783b494f9d02', 'details_variables_variable_name': 'VISIT_NUMBER', 'details_variables_variable_label': 'VISIT_NUMBER', 'details_variables_status': 'changed', 'details_variables_details_r_type_new_value': 'unary', 'details_variables_details_r_type_old_value': 'integer', 'details_variables_message': 'Variable with different R Type'}
{'status': 'changed', 'dataset_id': '5a4b463c855d783af4f5f695', 'dataset_name': 'AE_E', 'dataset_label': '1- ADVERSE EVENTS - Not Analyzed', 'details_many_visits': None, 'details_variables_variable_id': '5a4b4647855d783b494f9ccf', 'details_variables_variable_name': 'VISIT_NUMBER2', 'details_variables_variable_label': 'VISIT_NUMBER2', 'details_variables_status': 'changed', 'details_variables_details_r_type_new_value': 'unary', 'details_variables_details_r_type_old_value': 'binary', 'details_variables_message': 'Variable with different R Type'}

Might as well see it through to the end:

from functools import reduce
import json

import pandas as pd


with open("your dataset file", "r") as fh:
    dic = json.load(fh)

df = pd.DataFrame(reduce(lambda x, y: x + y, (fuck_all(i) for i in dic)))
df.to_csv("out.csv", index=False)

The finished product:

[screenshot: clipboard.png]

解夏 answered

Why use 1,000 servers? Is it to spread the requests across many IPs? If that is the goal, I suggest using a proxy pool instead.
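
As a rough illustration of the proxy-pool idea, here is a minimal sketch that rotates through a list of proxies and drops ones that fail. The ProxyPool class and the proxy addresses are hypothetical; a real pool would also need health checks and a source of fresh proxies.

```python
import itertools


class ProxyPool:
    """Round-robin pool of proxy URLs; failed proxies can be removed."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def get(self):
        # Hand out the next proxy in rotation.
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        return next(self._cycle)

    def remove(self, proxy):
        # Drop a proxy (e.g. after a connection error) and restart the rotation.
        if proxy in self._proxies:
            self._proxies.remove(proxy)
            self._cycle = itertools.cycle(self._proxies)


# Hypothetical proxy addresses, for illustration only.
pool = ProxyPool([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])
picked = [pool.get() for _ in range(4)]
pool.remove("http://10.0.0.2:8080")
after_removal = [pool.get() for _ in range(4)]
```

With the requests library, each picked URL would typically be passed as proxies={"http": url, "https": url} on the individual request.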

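心夠野 answered

Wait a moment first; the page may not have finished rendering yet. Alternatively, use until to check whether the page has fully loaded, and only then fetch the page's HTML.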
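
The "check with until" idea boils down to polling a condition with a timeout. In Selenium this is usually written as WebDriverWait(driver, timeout).until(condition); the sketch below only mimics that pattern with the standard library so that it runs without a browser. The names wait_until and page_ready are hypothetical stand-ins.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)


# Stand-in for a page that finishes rendering on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return "<html>rendered</html>" if checks["count"] >= 3 else None


html = wait_until(page_ready, timeout=2.0, poll=0.01)
print(html)
```

A fixed sleep works too, but polling returns as soon as the page is ready and fails loudly when it never is.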

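笑浮塵 answered

The problem is simply how the numbers get filled in when switching. Why use a timer at all, and one that fires repeatedly at an interval?
If you keep clicking, more and more timers end up running, and after a while there is no telling which one fires first and which one later.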

伐木累 answered
  1. The loop condition in for (var i=1; i<=ss.length; i++) is wrong: i should run from 0 to ss.length - 1. This alone does not cause the error, though.

  2. arr[i]['id']=i; arr[i]['title']=a; Here arr is an empty array, so arr[i] is undefined, and accessing undefined['id'] or undefined['title'] naturally throws an error.

Fixed version:

var ss = s.split(",");
console.log(ss);
var arr = [];
for (var i = 0; i < ss.length; i++) {
    // Create the object first, then store it at index i
    arr[i] = {
        id: i,
        title: ss[i]
    };
}
console.log(arr);
愛是癌 answered

I checked with help():
sort_values(self, return_indexer=False, ascending=True)
You should be able to control the sort order through the ascending parameter: True sorts in ascending order, False in descending order.

import pandas as pd
import numpy as np

dates = pd.date_range('1/1/2012', periods=5, freq='M')
help(dates)

dates.sort_values(ascending=False)

DatetimeIndex(['2012-05-31', '2012-04-30', '2012-03-31', '2012-02-29',
               '2012-01-31'],
              dtype='datetime64[ns]', freq='-1M')
              
dates.sort_values(ascending=True)

DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31'],
              dtype='datetime64[ns]', freq='M')
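孤島 answered
Caused by: org.apache.http.ProtocolException: Content-Length header already present

Did you set the Content-Length header yourself? HttpClient computes it from the request entity, so setting it manually triggers this exception.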

我甘愿 answered

Solved:

from matplotlib import colors
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
cmap = colors.ListedColormap(['white','gray','blue','yellow'])
bounds=[0, 2, 4, 6, 8]
norm = colors.BoundaryNorm(bounds, cmap.N)
data = np.array([[1,1,1,1,7,7,7,7], [1,1,1,1,1,1,1,5], [1,1,1,1,1,1,1,5], [1,1,1,3,1,1,1,5], [1,1,1,1,1,1,3,5]])
ax = sns.heatmap(data, cmap=cmap, norm=norm, linewidths=.5, linecolor='black', square=True, cbar=False)
# sns.plt is not exposed by current seaborn releases; call matplotlib directly
plt.annotate('S', (1.4, 3.4))
plt.show()


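大濕胸 answered

Compute the step dynamically, so that the expected value of the random step matches your requirement in theory.
Of course, the final result cannot be exactly 10 minutes. (It can be made exact if you want, but then the later changes may simply be 0.)

This is a simple recurrence. Let T be the remaining time (time can be expressed as "number of iterations", "interval", and so on) and S the remaining difference. The current change is then a function of T and S, i.e. d(n) = f(T, S); after that change, the next one is d(n+1) = f(T - t, S - d). Furthermore, when S <= 0, d = 0.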
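
The recurrence can be sketched as follows, under a few assumptions of mine: T is counted in remaining steps, f draws from a normal distribution centred on S / T, each step is clamped so it never goes negative or overshoots, and the last step absorbs whatever is left. The function name random_steps and the numbers are illustrative only.

```python
import random


def random_steps(total, n_steps, std_dev, rng):
    """Split `total` into at most n_steps random increments.

    Each increment is drawn around remaining / steps_left, i.e. d(n) = f(T, S).
    """
    remaining = total      # S: the amount still to cover
    steps_left = n_steps   # T: the number of steps still available
    steps = []
    while steps_left > 0 and remaining > 0:
        if steps_left == 1:
            d = remaining  # the final step absorbs the rest
        else:
            d = rng.gauss(remaining / steps_left, std_dev)
            d = min(max(d, 0.0), remaining)  # no backward moves, no overshoot
        steps.append(d)
        remaining -= d     # S - d
        steps_left -= 1    # T - t
    return steps


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
steps = random_steps(800, 50, 10, rng)
print(len(steps), round(sum(steps), 6))
```

Because the mean is recomputed from what is left at every step, an unusually large step is compensated by smaller expected steps afterwards.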


Here is a progress-bar example; how much each random step varies is governed by the stdDev of the normal distribution:

<!DOCTYPE html>
<html lang="zh-cmn-Hans">
<head>
<meta charset="utf-8" />
<title>Change</title>
<link rel="stylesheet" type="text/css" href="" />
<script type="text/javascript" src="https://s.zys.me/js/jq/jquery.min.js"></script>
</head>
<body style="margin: 100px;">
  <div id="bg" style="width: 800px; height: 30px; background-color: gray;">
    <div id="bar" style="width: 50%; height: 30px; background-color: red;"></div>
  </div>

  <script type="text/javascript">
    // http://www.cnblogs.com/zztt/p/4025207.html
    // Normal distribution generator borrowed from the link above

    function getNumberInNormalDistribution(mean, stdDev){
      return mean + (randomNormalDistribution() * stdDev);
    }

    function randomNormalDistribution(){
      var u=0.0, v=0.0, w=0.0, c=0.0;
      do {
        // Draw two independent random variables in (-1, 1)
        u = Math.random() * 2 - 1.0;
        v = Math.random() * 2 - 1.0;
        w = u * u + v * v;
      } while( w == 0.0 || w >= 1.0 );
      // This is the Box-Muller transform (polar form)
      c = Math.sqrt( (-2 * Math.log(w)) / w );
      // The method yields two standard normal random numbers, which could be returned as an array.
      // Since this function runs quickly, one of them can simply be discarded.
      //return [u*c,v*c];
      return u * c;
    }

  </script>

  <script type="text/javascript">



    // Assume the whole animation takes 5000 ms and the total length is 800px
    var T = 5000;
    var D = 800;

    // We also update every 100 ms, so the whole run consists of 5000 / 100 = 50 updates.
    // If every update were equal, the expected step would be 800 / 50 px per update.
    var PER = 100;
    var N = T / PER;

    var $n = $('#bar');
    $n.width(0);

    var width = 0;
    function action(){
      var n = getNumberInNormalDistribution(D / N, 10);
      D -= n;
      if(D <= 0){ $n.width('800px'); over(); return }
      $n.width(width + n + 'px');
      width += n;
      N -= 1;
      if(N <= 0){ $n.width('800px'); over(); return }
      setTimeout(action, PER);
    }

    function over(){
      console.log('over');
      setTimeout(reset, 3000);
    }

    function reset() {
      T = 5000; D = 800; PER = 100; N = T / PER;
      $n.width(0);
      width = 0;
      action();
    }

    action();


  </script>
</body>
</html>
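孤客 answered

Read up on multi-table queries, or just learn how to use join. This is not a hard problem; it is a common business requirement.

The proxy connection failed authentication; check your proxy settings again and keep testing.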

Caused by ProxyError('Cannot connect to proxy.', error('Tunnel connection failed: 407 Proxy Authentication Required'
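
A 407 response means the proxy expects credentials. One common way to supply them, assuming the proxy uses basic username/password authentication, is to embed them in the proxy URL. The sketch below only builds that URL; the host, port, user name, and password are made up, and build_proxy_url is a hypothetical helper.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username, password, scheme="http"):
    """Return a proxy URL with percent-encoded credentials embedded."""
    return "{}://{}:{}@{}:{}".format(
        scheme,
        quote(username, safe=""),  # encode characters such as '@' and ':'
        quote(password, safe=""),
        host,
        port,
    )


# Hypothetical proxy and credentials, for illustration only.
proxy_url = build_proxy_url("proxy.example.com", 3128, "user", "p@ss:word")
proxies = {"http": proxy_url, "https": proxy_url}
print(proxies)
```

With the requests library, this dict would be passed as the proxies argument of the request. Unencoded special characters in the password are an easy-to-miss cause of a 407.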
老梗 answered

The real port is substituted in by JavaScript after the page has loaded. Inspecting the page elements reveals an obfuscated mian.js:

eval(function (p, a, c, k, e, d) { e = function (c) { return (c < a ? "" : e(parseInt(c / a))) + ((c = c % a) > 35 ? String.fromCharCode(c + 29) : c.toString(36)) }; if (!''.replace(/^/, String)) { while (c--) d[e(c)] = k[c] || e(c); k = [function (e) { return d[e] }]; e = function () { return '\\w+' }; c = 1; }; while (c--) if (k[c]) p = p.replace(new RegExp('\\b' + e(c) + '\\b', 'g'), k[c]); return p; }('$(e(){$(\'\\f\\3\\g\\8\\1\\r\\p\\g\\k\')["\\4\\2\\q\\o"](e(u,h){5 7=$(h);5 j=7["\\i\\2\\1\\2"](\'\\a\\3\');5 9=l["\\3\\2\\8\\d\\4\\m\\b\\1"](7["\\i\\2\\1\\2"](\'\\a\'));5 c=j["\\d\\3\\n\\a\\1"](\'\\f\');t(5 6=0;6<c["\\n\\4\\b\\s\\1\\o"];6++){9-=l["\\3\\2\\8\\d\\4\\m\\b\\1"](c[6])}7["\\1\\4\\k\\1"](9)})})', 31, 31, '|x74|x61|x70|x65|var|d7|ClpoEy3|x72|TO5|x69|x6e|tVF6|x73|function|x2e|x6f|fnDKXroKU2|x64|jgemfCG4|x78|window|x49|x6c|x68|x62|x63|x2d|x67|for|wssP1'.split('|'), 0, {}))

Running it through an online unpacker gives:

$(function()
    {
    $('\x2e\x70\x6f\x72\x74\x2d\x62\x6f\x78')["\x65\x61\x63\x68"](function(wssP1,fnDKXroKU2)
        {
        var ClpoEy3=$(fnDKXroKU2);
        var jgemfCG4=ClpoEy3["\x64\x61\x74\x61"]('\x69\x70');
        var TO5=window["\x70\x61\x72\x73\x65\x49\x6e\x74"](ClpoEy3["\x64\x61\x74\x61"]('\x69'));
        var tVF6=jgemfCG4["\x73\x70\x6c\x69\x74"]('\x2e');
        for(var d7=0;
        d7<tVF6["\x6c\x65\x6e\x67\x74\x68"];
        d7++)
            {
            TO5-=window["\x70\x61\x72\x73\x65\x49\x6e\x74"](tVF6[d7])
        }
        ClpoEy3["\x74\x65\x78\x74"](TO5)
    }
    )
}
)

After converting the hex escapes to strings:

$(function() {
    $('.port-box')["each"](function(wssP1, fnDKXroKU2) {
        var ClpoEy3 = $(fnDKXroKU2);
        var jgemfCG4 = ClpoEy3["data"]('ip');
        var TO5 = window["parseInt"](ClpoEy3["data"]('i'));
        var tVF6 = jgemfCG4["split"]('.');
        for (var d7 = 0; d7 < tVF6["length"]; d7++) {
            TO5 -= window["parseInt"](tVF6[d7])
        }
        ClpoEy3["text"](TO5)
    })
})

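As the code shows, the real port is the data-i attribute value of each .port-box element minus the sum of the four octets of the IP address in its data-ip attribute.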
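
When scraping without a browser, the same arithmetic can be reproduced directly. This is a minimal sketch assuming both attributes have already been extracted as plain decimal strings; decode_port and the sample values are hypothetical.

```python
def decode_port(data_i, data_ip):
    """Real port = int(data-i) minus the sum of the IP's dot-separated parts."""
    port = int(data_i)
    for part in data_ip.split("."):
        port -= int(part)
    return port


# Hypothetical attribute values, for illustration only.
port = decode_port("8560", "10.20.30.40")
print(port)
```

Note that JavaScript's parseInt tolerates trailing non-digits while Python's int does not, so malformed attribute values would need extra handling.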

幼梔 answered
$data = array('2018/04/16','2018/04/17','2018/04/18','2018/04/19','2018/04/20','2018/04/21','2018/04/28');
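心沉 answered

Using phantomjs in a Node environment works. It applies to any site that is rendered on the front end.

Pages used to be static: what the user saw was already fully loaded, so they were easy to crawl. Now pages are rendered dynamically, so you need a simulated browser environment to execute the page before crawling it.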

蝶戀花 answered

Changing the IP is outside the scope of selenium. Common approaches include redialing the connection, switching proxy servers, and so on.

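心癌 answered

There may be anti-crawler measures in place. selenium does have some telltale traits, such as special properties on the global object (navigator.webdriver, for example).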

殘淚 answered

I don't quite understand what you are trying to do. You could wrap it in a function, as in the second reply.

淚染裳 answered

Judging from the screenshot, this cookie belongs to the request headers. Use the browser's developer tools to trace which response set this cookie; only then can you obtain it.