
Word2Vec: how do I split English training text on commas?

import logging
import os
import sys
import multiprocessing
 
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
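    if len(sys.argv) < 4:
        # the script has no module docstring, so print an explicit usage line
        print("usage: %s <input_text> <output_model> <output_vectors>" % program)
        sys.exit(1)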
    inp, outp1, outp2 = sys.argv[1:4]
 
    # note: gensim 4.0+ renamed the `size` parameter to `vector_size`
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
 
    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

The phrases in my training set are already separated by commas, but after training the vocabulary contains only individual words.

Answers
兔囡囡

The LineSentence class expects the following input:

Simple format: one sentence = one line; words already preprocessed and separated by whitespace

Your file is comma-separated, so you need to do a little preprocessing yourself first.
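One way to do that preprocessing, as a minimal sketch: rewrite each comma-separated line so that every phrase becomes a single whitespace-free token, after which LineSentence can read the file unchanged. The helper names `comma_line_to_tokens` and `convert_file` and the underscore joining are my own assumptions, not something gensim provides.

```python
def comma_line_to_tokens(line):
    """Split one comma-separated line into whitespace-free tokens."""
    tokens = []
    for phrase in line.split(','):
        phrase = phrase.strip()
        if phrase:
            # Join the words of a multi-word phrase with underscores so that
            # a whitespace-based reader keeps the phrase as one token.
            tokens.append('_'.join(phrase.split()))
    return tokens


def convert_file(src_path, dst_path):
    """Write a whitespace-separated copy of a comma-separated corpus."""
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            tokens = comma_line_to_tokens(line)
            if tokens:  # skip empty lines
                dst.write(' '.join(tokens) + '\n')
```

After running `convert_file` on the training file, the converted copy can be passed to `LineSentence(...)` exactly as in the question's script; the vocabulary would then contain entries such as `machine_learning` rather than `machine` and `learning`.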

doc2vec is also quite popular these days; if you are interested, take a look at: https://segmentfault.com/a/11...

September 23, 2017, 05:33
念初

Locate the LineSentence class in gensim.models.word2vec
and change line = utils.to_unicode(line).split(' ') to line = utils.to_unicode(line).split(',')
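If you would rather not edit the installed gensim source, the same comma split can probably be done in your own code. Word2Vec accepts any restartable iterable that yields lists of tokens, so a small class like the one below could stand in for LineSentence. `CommaSentence` is a hypothetical name of mine, not part of gensim, and this is only a sketch.

```python
class CommaSentence:
    """Hypothetical stand-in for LineSentence that splits each line on commas."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, because Word2Vec iterates over
        # the corpus more than once (vocabulary scan, then training epochs).
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                tokens = [t.strip() for t in line.split(',') if t.strip()]
                if tokens:  # skip empty lines
                    yield tokens
```

In the question's script this would replace `LineSentence(inp)` with `CommaSentence(inp)`. Unlike an edit to the library, it survives a gensim upgrade, and multi-word phrases stay intact as single tokens (including their internal spaces).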

August 8, 2018, 15:35