Big Data ETL in Practice (4) ---- Elasticsearch, the search powerhouse
3. Importing local files into AWS Elasticsearch
Modify the domain's access policy to allow your local machine's public IP. This IP changes frequently, so it has to be updated each time before use; a sample policy is sketched below.
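For reference, an IP-restricted access policy looks roughly like this. This is a minimal sketch: the region, account ID, domain name, and IP address are all placeholders to replace with your own.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "es:*",
      "Resource": "arn:aws-cn:es:cn-north-1:111122223333:domain/your-domain/*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "203.0.113.25" }
      }
    }
  ]
}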
Install Anaconda:
https://www.anaconda.com/download/
Initialize the environment. On Windows 10, open the Anaconda Prompt command line and run:
conda create -n elasticsearch python=3.6
conda activate elasticsearch
pip install elasticsearch
pip install pandas
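Before running the full import, it is worth a quick sanity check that the client can actually reach the domain. A minimal sketch, assuming the placeholder endpoint URL below:

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://yoururl.amazonaws.com.cn"])  # placeholder endpoint
print(es.ping())  # True if the cluster answered
print(es.info())  # cluster name and version details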
The script below (for Windows) picks up every CSV in the current directory and creates one ES index per file:
from elasticsearch import helpers, Elasticsearch
import pandas as pd
from time import time
import win_unicode_console
win_unicode_console.enable()
import os

def file_name(file_dir):
    for root, dirs, files in os.walk(file_dir):
        print(root)   # current directory path
        print(dirs)   # all subdirectories under the current path
        print(files)  # all non-directory files under the current path
        return [item for item in files if '.csv' in item]  # top level only

root_path = os.getcwd() + '\\'
fileslist = file_name(root_path)

# size of the bulk
chunksize = 50000

for file in fileslist:
    t0 = time()
    f = open(root_path + file, 'r', encoding='UTF-8')  # read csv
    # parse the csv with pandas, in chunks
    csvfile = pd.read_csv(f, iterator=True, chunksize=chunksize, low_memory=False)
    # init Elasticsearch
    es = Elasticsearch(["https://yoururl.amazonaws.com.cn"])
    # file.strip('.csv') would mangle names like 'stats.csv'; use splitext instead
    index_name = os.path.splitext(file)[0].lower()
    # init index: drop it if it exists, then recreate
    try:
        es.indices.delete(index_name)
    except:
        pass
    es.indices.create(index_name)
    # start bulk indexing
    print("now indexing %s..." % file)
    for i, df in enumerate(csvfile):
        print(i)
        records = df.where(pd.notnull(df), None).T.to_dict()
        list_records = [records[it] for it in records]
        try:
            # BUG (explained below): parallel_bulk returns a lazy generator,
            # so nothing is actually sent until it is consumed
            helpers.parallel_bulk(es, list_records, index=index_name,
                                  doc_type=index_name, thread_count=8)
        except:
            print("error!, skip records...")
            pass
    print("done in %.3fs" % (time() - t0))
Running the code above revealed a problem: no data was actually written into ES. helpers.parallel_bulk is lazy: it returns a generator, and no request is sent until that generator is consumed. Following the approach in the links below, the code needs to be modified as follows:
Code example:
https://www.programcreek.com/python/example/104891/elasticsearch.helpers.parallel_bulk
Reference thread:
https://discuss.elastic.co/t/helpers-parallel-bulk-in-python-not-working/39498
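The crux in isolation: calling parallel_bulk by itself sends nothing; you have to iterate over its results (or drain the generator) before any documents go out. A minimal sketch, with a placeholder endpoint and throwaway actions:

from collections import deque
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://yoururl.amazonaws.com.cn"])  # placeholder endpoint
actions = [{"_index": "demo", "_type": "demo", "n": i} for i in range(10)]

helpers.parallel_bulk(es, actions)  # indexes NOTHING: the generator is never consumed

for success, info in helpers.parallel_bulk(es, actions):  # consuming it sends the data
    if not success:
        print("A document failed:", info)

deque(helpers.parallel_bulk(es, actions), maxlen=0)  # or drain it, discarding results

The full corrected script: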
from elasticsearch import helpers, Elasticsearch
import pandas as pd
from time import time
from elasticsearch.helpers import BulkIndexError
from elasticsearch.exceptions import TransportError, ConnectionTimeout, ConnectionError
import traceback
import logging
logging.basicConfig(filename='log-for_.log',
                    format='%(asctime)s -%(name)s-%(levelname)s-%(module)s:%(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S %p',
                    level=logging.ERROR)
import win_unicode_console
win_unicode_console.enable()
import os

def file_name(file_dir):
    for root, dirs, files in os.walk(file_dir):
        print(root)   # current directory path
        print(dirs)   # all subdirectories under the current path
        print(files)  # all non-directory files under the current path
        return [item for item in files if '.csv' in item]  # top level only

#NAME = "PV_PROV_LOG"
root_path = os.getcwd() + '\\'
#csv_filename = "%s.csv" % NAME
fileslist = file_name(root_path)

# size of the bulk
chunksize = 1000

for file in fileslist:
    t0 = time()
    # open csv file
    f = open(root_path + file, 'r', encoding='UTF-8')  # read csv
    # parse csv with pandas
    csvfile = pd.read_csv(f, iterator=True, chunksize=chunksize, low_memory=False)
    # init Elasticsearch
    es = Elasticsearch(["..."])
    # file.strip('.csv') would mangle names like 'stats.csv'; use splitext instead
    index_name = os.path.splitext(file)[0].lower()
    # init index: drop it if it exists, then recreate
    try:
        es.indices.delete(index_name)
    except:
        pass
    es.indices.create(index_name)
    # start bulk indexing
    print("now indexing %s..." % file)
    for i, df in enumerate(csvfile):
        print(i)
        records = df.where(pd.notnull(df), None).T.to_dict()
        list_records = [records[it] for it in records]
        try:
            # single-threaded alternative:
            # helpers.bulk(es, list_records, index=index_name, doc_type=index_name)
            # iterating over parallel_bulk consumes the generator,
            # which is what actually sends the documents
            for success, info in helpers.parallel_bulk(es, list_records,
                                                       index=index_name,
                                                       doc_type=index_name,
                                                       thread_count=8):
                if not success:
                    print('A document failed:', info)
        # ConnectionTimeout is the most specific of these exceptions and
        # TransportError the most general, so catch them in that order
        except ConnectionTimeout:
            logging.error("this is ES ConnectionTimeout ERROR \n %s" % str(traceback.format_exc()))
            logging.info('retry bulk es')
        except ConnectionError:
            logging.error("this is ES ConnectionError ERROR \n %s" % str(traceback.format_exc()))
            logging.info('retry bulk es')
        except TransportError:
            logging.error("this is ES TransportError \n %s" % str(traceback.format_exc()))
            logging.info('retry bulk es')
        except BulkIndexError:
            logging.error("this is ES BulkIndexError ERROR \n %s" % str(traceback.format_exc()))
            logging.info('retry bulk es')
        except Exception:
            # catch-all for anything unexpected
            logging.error("exception not match \n %s" % str(traceback.format_exc()))
            logging.error('retry bulk es')
            print("error!, skipping some records")
    print("done in %.3fs" % (time() - t0))