Backing Up Jianshu Images with MLSQL
Preface
I happened to want to do two things today. The first: back up my Jianshu articles. The official backup feature only exports markdown, and I found that the images don't get backed up, so I needed to download every image referenced in my articles myself.
The second: download some jar files. I only had the jar names, not the download links, so I wanted to generate the links and then fetch them with wget.
As a programmer, of course I could write a script: shell, Python, or any other language. But I didn't want to spend much time on this, so I wondered whether a few SQL statements could do the job. I tried it, and they really can.
Backing up the Jianshu images
First I used Jianshu's export feature to back up all my articles as markdown. After extraction it looks roughly like this:

(screenshot)
Then I uploaded the files to the MLSQL Console:

(screenshot)
Then download them:

(screenshot)
Once configured, run it, and the files are downloaded into MLSQL's home directory.
Step 1: load all the markdown text:
-- with text we can get all lines in all markdown files.
load text.`oww/*.md` as articleLines;
Step 2: extract the image URLs:
-- extract markdown link
set imageUrl='''REGEXP_EXTRACT(value, "(?:!\\[(.*?)\\]\\((.*?)\\))",2)''';
set mdImage='''REGEXP_EXTRACT(value, "(?:!\\[(.*?)\\]\\((.*?)\\))",0)''';

-- use variables to keep the sql simple
select ${imageUrl} as image_url,${mdImage} as mdImage
from articleLines
where length(${imageUrl})>0
as imageUrls;
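For reference, here is a small standalone Python sketch (not part of the MLSQL script) of what the two REGEXP_EXTRACT calls capture: group 0 is the whole `![alt](url)` markdown image, and group 2 is the URL inside the parentheses.

```python
import re

# Same pattern as the REGEXP_EXTRACT calls above: group 0 is the whole
# ![alt](url) markdown image, group 2 is the URL inside the parentheses.
MD_IMAGE = re.compile(r"!\[(.*?)\]\((.*?)\)")

def extract_image(line):
    """Return (whole_match, url) for the first markdown image in a line, or None."""
    m = MD_IMAGE.search(line)
    if m is None:
        return None
    return m.group(0), m.group(2)
```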
Step 3: download all the images:
-- download all these images
select crawler_request_image(image_url) as imageBin,mdImage
from imageUrls
as imageBins;
Downloading images is slow, so to avoid downloading them repeatedly we save the results:
-- save them as a parquet file
save overwrite imageBins as parquet.`jianshu.images`;
Next, we want to write these images out as real image files. Load the saved intermediate result:
-- now save each image as an image file.
load parquet.`jianshu.images` as newImageBins;
Now we need to extract the image file name. The mdImage field in newImageBins holds the full markdown image link, and from it I want just the file name, e.g. 1063603-f9546ced3af8a9cb.png. Normally we could do this with a SQL regex, but to save effort I'll write a bit of Scala instead:
register ScriptUDF.`` as getFileName where lang="scala"
and code='''
def apply(rawFileName:String):String={
  rawFileName.split("upload_images/").last.split("\\?").head
}
'''
and udfType="udf";
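The UDF body is easy to check on its own; here is a Python re-implementation of the same logic, for illustration only: keep everything after "upload_images/", then drop any query string after "?".

```python
# Python re-implementation of the getFileName UDF above (illustration only):
# keep everything after "upload_images/", then drop any query string.
def get_file_name(raw):
    return raw.split("upload_images/")[-1].split("?")[0]
```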
This creates a UDF named getFileName, which I can now use:
select getFileName(mdImage) as fileName,imageBin
from newImageBins
as newImageBinsWithNames;

save overwrite newImageBinsWithNames as image.`/tmp/images`
where fileName="fileName" and imageColumn="imageBin";
getFileName is the function we just created. The saved result looks like this:

(screenshot)
The backup is finally done.
Generating the jar download links
First, these are the jars I need to handle:
set abc='''
hadoop-annotations-2.7.3.jar
hadoop-auth-2.7.3.jar
hadoop-client-2.7.3.jar
hadoop-common-2.7.3.jar
hadoop-hdfs-2.7.3.jar
hadoop-mapreduce-client-app-2.7.3.jar
hadoop-mapreduce-client-common-2.7.3.jar
hadoop-mapreduce-client-core-2.7.3.jar
hadoop-mapreduce-client-jobclient-2.7.3.jar
hadoop-mapreduce-client-shuffle-2.7.3.jar
hadoop-yarn-api-2.7.3.jar
hadoop-yarn-client-2.7.3.jar
hadoop-yarn-common-2.7.3.jar
hadoop-yarn-server-common-2.7.3.jar
hadoop-yarn-server-web-proxy-2.7.3.jar
''';
I load this text as CSV:
load csvStr.`abc` as jack;
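csvStr turns the inline abc text into one row per jar name (available in column _c0). A rough plain-Python equivalent of that parsing, for illustration only:

```python
# Rough plain-Python equivalent of how csvStr splits the abc text above
# into rows (illustration only): one non-empty trimmed line per row.
def jar_rows(block):
    return [line.strip() for line in block.splitlines() if line.strip()]
```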
As before, I prefer writing a bit of Scala, so I wrote another UDF:
register ScriptUDF.`` as link where code='''
def apply(s:String)={
  val fileName = s.split("-").dropRight(1).mkString("-")
  s"""http://central.maven.org/maven2/org/apache/hadoop/${fileName}/3.2.0/${s.replaceAll("2.7.3","3.2.0")}"""
}
''';
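Again, the UDF's logic can be sketched outside MLSQL. This Python re-implementation (illustration only) assumes, like the UDF does, that every jar name ends in "2.7.3.jar" and that every artifact lives under org/apache/hadoop on Maven Central:

```python
# Python re-implementation of the link UDF above (illustration only).
# Assumes each name ends in "2.7.3.jar" and the artifact lives under
# org/apache/hadoop on Maven Central.
def link(s):
    # drop the trailing "2.7.3.jar" segment to recover the artifact id
    file_name = "-".join(s.split("-")[:-1])
    return ("http://central.maven.org/maven2/org/apache/hadoop/"
            f"{file_name}/3.2.0/{s.replace('2.7.3', '3.2.0')}")
```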
Now we can generate the links:
select link(_c0) from jack as output;

(screenshot)
I saved the result to a file and downloaded everything with wget. We could also have saved the jars directly using the image-style save shown earlier.
Appendix
Full script for backing up the images:
-- with text we can get all lines in all markdown files.
load text.`oww/*.md` as articleLines;

-- extract markdown link
set imageUrl='''REGEXP_EXTRACT(value, "(?:!\\[(.*?)\\]\\((.*?)\\))",2)''';
set mdImage='''REGEXP_EXTRACT(value, "(?:!\\[(.*?)\\]\\((.*?)\\))",0)''';

-- use variables to keep the sql simple
select ${imageUrl} as image_url,${mdImage} as mdImage
from articleLines
where length(${imageUrl})>0
as imageUrls;

-- download all these images
select crawler_request_image(image_url) as imageBin,mdImage
from imageUrls
as imageBins;

-- save them as a parquet file
save overwrite imageBins as parquet.`jianshu.images`;

-- now save each image as an image file.
load parquet.`jianshu.images` as newImageBins;

register ScriptUDF.`` as getFileName where lang="scala"
and code='''
def apply(rawFileName:String):String={
  rawFileName.split("upload_images/").last.split("\\?").head
}
'''
and udfType="udf";

select getFileName(mdImage) as fileName,imageBin
from newImageBins
as newImageBinsWithNames;

save overwrite newImageBinsWithNames as image.`/tmp/images`
where fileName="fileName" and imageColumn="imageBin";
Full script for downloading the jars:
set abc='''
hadoop-annotations-2.7.3.jar
hadoop-auth-2.7.3.jar
hadoop-client-2.7.3.jar
hadoop-common-2.7.3.jar
hadoop-hdfs-2.7.3.jar
hadoop-mapreduce-client-app-2.7.3.jar
hadoop-mapreduce-client-common-2.7.3.jar
hadoop-mapreduce-client-core-2.7.3.jar
hadoop-mapreduce-client-jobclient-2.7.3.jar
hadoop-mapreduce-client-shuffle-2.7.3.jar
hadoop-yarn-api-2.7.3.jar
hadoop-yarn-client-2.7.3.jar
hadoop-yarn-common-2.7.3.jar
hadoop-yarn-server-common-2.7.3.jar
hadoop-yarn-server-web-proxy-2.7.3.jar
''';

load csvStr.`abc` as jack;

register ScriptUDF.`` as link where code='''
def apply(s:String)={
  //http://central.maven.org/maven2/org/apache/hadoop/hadoop-aliyun/3.2.0/hadoop-aliyun-3.2.0.jar
  //http://central.maven.org/maven2/org/apache/hadoop/hadoop-client/3.2.0/hadoop-client-3.2.0.jar
  val fileName = s.split("-").dropRight(1).mkString("-")
  s"""http://central.maven.org/maven2/org/apache/hadoop/${fileName}/3.2.0/${s.replaceAll("2.7.3","3.2.0")}"""
}
''';

select link(_c0) from jack as output;