BAT interview coding problem: finding the most-visited IP among 300 million IPs, explained

阿新 • Published: 2019-01-16

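We face two problems:
1) The data volume is too large to process in a short time;
2) Memory is insufficient to hold all the data at once.
The remedies correspond one to one: 1) for time, pick a suitable algorithm and a suitable data structure to speed up processing; 2) for space, divide and conquer: split the large data set into many smaller pieces, process each piece separately, then combine the per-piece results.
The original article also poses a problem, "find the most-visited IP among 300 million IPs", so let's try to solve it.
1) First, generate 300 million records. To produce plenty of duplicate IPs, the first two octets stay fixed and only the last two are generated at random.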

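	// Needs java.io.*; FILE_NAME and MAX_NUM are class-level constants
	// that are not shown in this article.
	// Keep "192.168." fixed and randomize only the last two octets.
	private static String generateIp() {
		// Multiply by 256 so each octet covers 0..255
		// (the original factor of 255 could never produce 255).
		return "192.168." + (int) (Math.random() * 256) + "."
				+ (int) (Math.random() * 256) + "\n";
	}
	// Write MAX_NUM random IPs to FILE_NAME, one per line.
	private static void generateIpsFile() {
		File file = new File(FILE_NAME);
		// try-with-resources closes the writer even if a write fails;
		// BufferedWriter avoids issuing one tiny write per line.
		try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
			for (int i = 0; i < MAX_NUM; i++) {
				writer.write(generateIp());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}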

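The IPs are plain ASCII, so each character takes one byte on disk. The original estimate is about 11 bytes per IP (counting the fixed prefix, two octets, a dot and a newline, a line is actually 12 to 16 bytes). The generated file came out at roughly 3,500,000 KB.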
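The size arithmetic above can be checked with a few lines of code. The following is only a back-of-the-envelope sketch: the class `IpFileSizeEstimate` and its methods are hypothetical names, and it assumes both random octets are uniform over 0..255 and that every line has the form `192.168.a.b\n`.

```java
// Rough size estimate for the generated file; a sketch, not the author's code.
class IpFileSizeEstimate {

    // Number of decimal digits needed to print a value in 0..255.
    static int digits(int octet) {
        return octet < 10 ? 1 : (octet < 100 ? 2 : 3);
    }

    // Average bytes per line "192.168.a.b\n" when a and b are uniform in 0..255.
    static double averageLineBytes() {
        long totalDigits = 0;
        for (int octet = 0; octet < 256; octet++) {
            totalDigits += digits(octet);
        }
        double avgDigits = totalDigits / 256.0;
        // "192.168." is 8 bytes, then octet, ".", octet, "\n".
        return 8 + avgDigits + 1 + avgDigits + 1;
    }

    // Estimated file size in KB for the given number of lines.
    static long estimateKb(long lines) {
        return Math.round(lines * averageLineBytes() / 1024.0);
    }

    public static void main(String[] args) {
        System.out.printf("average line: %.2f bytes%n", averageLineBytes());
        System.out.println("estimated KB for 300,000,000 lines: "
                + estimateKb(300_000_000L));
    }
}
```

The estimate is only as good as its assumptions; a measured file size will differ if the run wrote a different number of lines or used a different generator.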

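2) With the file generated, assume memory is too limited to load all the data at once, so the file must first be split into many small files.
The approach here is hash-and-modulo: convert the dotted IP string to a long integer, take it modulo 3000, and put IPs with the same remainder into the same file. This yields 3000 small files of just over 1 MB each, which is small enough for our purposes.
First, the hash and modulo functions: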

	// Map an IP to its bucket (file) name: the numeric IP modulo HASH_NUM.
	private static String hash(String ip) {
		long numIp = ipToLong(ip);
		return String.valueOf(numIp % HASH_NUM);
	}
 
	// Convert a dotted-quad IPv4 string to its 32-bit value, held in a long.
	private static long ipToLong(String strIp) {
		long[] ip = new long[4];
		// Locate the three dots that separate the four octets.
		int position1 = strIp.indexOf(".");
		int position2 = strIp.indexOf(".", position1 + 1);
		int position3 = strIp.indexOf(".", position2 + 1);
 
		ip[0] = Long.parseLong(strIp.substring(0, position1));
		ip[1] = Long.parseLong(strIp.substring(position1 + 1, position2));
		ip[2] = Long.parseLong(strIp.substring(position2 + 1, position3));
		ip[3] = Long.parseLong(strIp.substring(position3 + 1));
		// Pack the octets big-endian: the first octet is the most significant byte.
		return (ip[0] << 24) + (ip[1] << 16) + (ip[2] << 8) + ip[3];
	}

2.1) Convert the IP string to a long integer.
2.2) Take that value modulo HASH_NUM; here HASH_NUM = 3000.
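To make the two steps concrete, here is a minimal, self-contained sketch of the same hash-and-modulo bucketing. The class `IpBucketDemo`, the `bucket` method, and the input validation are illustrative additions, not part of the original program.

```java
// Sketch of hash-and-modulo bucketing with basic validation added.
class IpBucketDemo {

    // Convert a dotted-quad IPv4 string to its 32-bit value, held in a long.
    static long ipToLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not a dotted-quad IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range in: " + ip);
            }
            // Shift the previous octets up one byte and append this one.
            value = (value << 8) | octet;
        }
        return value;
    }

    // Bucket index in 0..buckets-1; equal IPs always get the same bucket.
    static int bucket(String ip, int buckets) {
        return (int) (ipToLong(ip) % buckets);
    }

    public static void main(String[] args) {
        System.out.println(ipToLong("192.168.0.1"));     // 3232235521
        System.out.println(bucket("192.168.0.1", 3000)); // 2521
        System.out.println(bucket("192.168.0.2", 3000)); // 2522
    }
}
```

Because the test data spans only 65,536 consecutive addresses, the modulo spreads them almost evenly: each of the 3000 buckets receives 21 or 22 distinct IPs.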
Below is the function that splits the file:

	private static void divideIpsFile() {
		File file = new File(FILE_NAME);
		new File(FOLDER).mkdirs(); // make sure the output folder exists
		// bucket name -> lines buffered for that bucket
		Map<String, StringBuilder> map = new HashMap<String, StringBuilder>();
		int count = 0;
		try (BufferedReader br = new BufferedReader(new FileReader(file))) {
			String ip;
			while ((ip = br.readLine()) != null) {
				String hashIp = hash(ip);
				StringBuilder sb = map.get(hashIp);
				if (sb == null) {
					sb = new StringBuilder();
					map.put(hashIp, sb);
				}
				sb.append(ip).append("\n");
				count++;
				// Flush every 4,000,000 lines to bound memory use.
				if (count == 4000000) {
					flushBuckets(map);
					count = 0;
				}
			}
			// Flush the last partial batch; the original skipped this, which
			// loses lines whenever the total is not a multiple of 4,000,000.
			flushBuckets(map);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	// Helper added in this rewrite: append each buffered bucket to
	// FOLDER/<bucket>.txt, then empty the buffer.
	private static void flushBuckets(Map<String, StringBuilder> map) throws IOException {
		for (Map.Entry<String, StringBuilder> entry : map.entrySet()) {
			File ipFile = new File(FOLDER + "/" + entry.getKey() + ".txt");
			try (FileWriter fileWriter = new FileWriter(ipFile, true)) { // true = append
				fileWriter.write(entry.getValue().toString());
			}
		}
		map.clear();
	}

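2.3) If we opened the target file and appended to it for every single IP we read and hashed, 300 million IPs would mean 300 million file open-and-write operations, and the I/O cost would be enormous. Instead we buffer the lines in a Map, write them out in bulk once the buffer reaches a certain size, clear the map, and continue, which speeds up the split considerably. The threshold should be chosen according to the memory available: larger when memory is plentiful, smaller when it is tight. There is always a balance to strike between I/O cost and memory use.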
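The buffering trade-off described above can be sketched as a small standalone class. This is an illustration under stated assumptions, not the author's code: `BufferedPartitioner`, `demo()` and the threshold values are hypothetical, and `String.hashCode()` stands in for the article's `ipToLong` to keep the sketch short.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Buffers lines per bucket in memory and appends them to disk in batches.
class BufferedPartitioner {
    private final Path folder;
    private final int buckets;
    private final int flushThreshold; // lines held in memory before a flush
    private final Map<Integer, StringBuilder> buffer = new HashMap<>();
    private int buffered = 0;

    BufferedPartitioner(Path folder, int buckets, int flushThreshold) {
        this.folder = folder;
        this.buckets = buckets;
        this.flushThreshold = flushThreshold;
    }

    // Buffer one line; flush everything once the threshold is reached.
    void add(String line) throws IOException {
        int bucket = Math.floorMod(line.hashCode(), buckets); // stand-in hash
        buffer.computeIfAbsent(bucket, b -> new StringBuilder())
              .append(line).append('\n');
        if (++buffered >= flushThreshold) {
            flush();
        }
    }

    // Append every buffered bucket to its file, then empty the buffer.
    void flush() throws IOException {
        for (Map.Entry<Integer, StringBuilder> e : buffer.entrySet()) {
            Path target = folder.resolve(e.getKey() + ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(target,
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                w.write(e.getValue().toString());
            }
        }
        buffer.clear();
        buffered = 0;
    }

    // Partition a tiny sample into a temp folder; return the lines found on disk.
    static int demo() {
        try {
            Path folder = Files.createTempDirectory("ip-buckets");
            BufferedPartitioner p = new BufferedPartitioner(folder, 4, 2);
            for (String ip : Arrays.asList("192.168.1.1", "192.168.1.2",
                    "192.168.1.1", "192.168.3.4", "192.168.1.1")) {
                p.add(ip);
            }
            p.flush(); // do not forget the final partial batch
            int lines = 0;
            try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
                for (Path f : files) {
                    lines += Files.readAllLines(f).size();
                }
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lines written: " + demo());
    }
}
```

A larger `flushThreshold` means fewer file opens but more memory held in the buffer. The explicit `flush()` after the loop matters because the last batch rarely fills the threshold exactly.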
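This splits the input into 3000 small files of about 1100 KB each. The elapsed time was reported as roughly 17 to 18 minutes (note that the timestamps logged below span about seven and a half minutes):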

Start Divide Ips File: 06:18:11.103
End:                   06:25:44.134

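This mapping guarantees that identical IPs always land in the same file, so each IP's full count lives in exactly one file. As a result, an IP that is not the most frequent in file a (even if it is the runner-up there) cannot gain further occurrences from any other file, and the overall winner must be the winner of its own file.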
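The guarantee can be checked on a small in-memory example: the maximum found by partitioning and taking per-bucket winners should equal the maximum found by counting everything at once. This is a minimal sketch with hypothetical names (`PartitionedMaxDemo` and its methods), and it uses `String.hashCode()` as the partition hash for brevity.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Shows that per-bucket maxima are enough to recover the global maximum.
class PartitionedMaxDemo {

    // Count occurrences of each IP.
    static Map<String, Integer> countAll(List<String> ips) {
        Map<String, Integer> counts = new HashMap<>();
        for (String ip : ips) {
            counts.merge(ip, 1, Integer::sum);
        }
        return counts;
    }

    // Entry with the largest count, or null for an empty map.
    static Map.Entry<String, Integer> maxEntry(Map<String, Integer> counts) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = new SimpleImmutableEntry<>(e.getKey(), e.getValue());
            }
        }
        return best;
    }

    // Split by hash, find each bucket's winner, then pick the best winner.
    static Map.Entry<String, Integer> partitionedMax(List<String> ips, int buckets) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            parts.add(new ArrayList<>());
        }
        for (String ip : ips) {
            // Equal strings have equal hash codes, so they share a bucket.
            parts.get(Math.floorMod(ip.hashCode(), buckets)).add(ip);
        }
        Map.Entry<String, Integer> best = null;
        for (List<String> part : parts) {
            Map.Entry<String, Integer> local = maxEntry(countAll(part));
            if (local != null && (best == null || local.getValue() > best.getValue())) {
                best = local;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList("192.168.1.1", "192.168.2.2",
                "192.168.1.1", "192.168.3.3", "192.168.1.1", "192.168.2.2");
        System.out.println("global:      " + maxEntry(countAll(ips)));
        System.out.println("partitioned: " + partitionedMax(ips, 3));
    }
}
```

If the split were done by line position instead of by hash, the occurrences of one IP could be scattered across pieces and the per-piece winners would no longer be sufficient.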
3) With the file split, we next read each of the 3000 small files and count how many times each IP appears in it.

	// map and finalMap are class-level Map<String, Integer> fields not shown here.
	private static void calculate() {
		File folder = new File(FOLDER);
		File[] files = folder.listFiles();
		if (files == null) {
			return; // FOLDER does not exist or is not a directory
		}
		for (File file : files) {
			// try-with-resources closes the reader and the underlying FileReader
			try (BufferedReader br = new BufferedReader(new FileReader(file))) {
				String ip;
				// ip -> occurrences within this one small file
				Map<String, Integer> tmpMap = new HashMap<String, Integer>();
				while ((ip = br.readLine()) != null) {
					if (tmpMap.containsKey(ip)) {
						int count = tmpMap.get(ip);
						tmpMap.put(ip, count + 1);
					} else {
						// The first occurrence counts as 1 (the original stored 0,
						// which undercounts every IP by one).
						tmpMap.put(ip, 1);
					}
				}
				// Record this file's most frequent IP in the class-level map.
				count(tmpMap, map);
				tmpMap.clear();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}

		// Pick the overall winner among the per-file winners.
		count(map, finalMap);
		for (Entry<String, Integer> entry : finalMap.entrySet()) {
			System.out.println("result IP : " + entry.getKey() + " | count = " + entry.getValue());
		}
	}
 
	// Find the entry with the largest count in pMap and store it in resultMap.
	private static void count(Map<String, Integer> pMap, Map<String, Integer> resultMap) {
		int max = 0;
		String resultIp = null;
		for (Entry<String, Integer> entry : pMap.entrySet()) {
			if (entry.getValue() > max) {
				max = entry.getValue();
				resultIp = entry.getKey();
			}
		}
		// Skip empty inputs instead of recording an empty-string key.
		if (resultIp != null) {
			resultMap.put(resultIp, max);
		}
	}

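3.1) First read each file and put its IPs into a Map, then call count() to find the IP with the highest visit count in that map, and store that IP and its count in a second map.
3.2) Once all 3000 files have been read, we have a map of up to 3000 entries holding the most-visited IP of each file. We call count() again to find the entry with the highest count in this map, i.e. which file's top IP is the true overall winner, much like going from the group stage to the final.
3.3) No heap sort or quicksort is used here because only a single maximum is needed: it suffices to compare the current maximum with each subsequent value, which is effectively the same as comparing against the top of a one-element heap.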
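The one-element-heap remark in 3.3 generalizes naturally: if the K most-visited IPs were wanted instead of just one, a min-heap capped at K entries would play the role of the running maximum. The sketch below is an illustration of that idea only; `TopKDemo` and `topK` are hypothetical names and are not part of the original program.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Keeps the K largest counts using a min-heap of at most K entries.
class TopKDemo {

    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        // Min-heap: the smallest of the current top K sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        if (k <= 0) {
            return new ArrayList<>();
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.add(new SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            } else if (e.getValue() > heap.peek().getValue()) {
                // The new entry beats the weakest of the top K: replace it.
                heap.poll();
                heap.add(new SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed()); // largest count first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("192.168.0.1", 5);
        counts.put("192.168.0.2", 3);
        counts.put("192.168.0.3", 9);
        counts.put("192.168.0.4", 1);
        System.out.println(topK(counts, 2));
    }
}
```

With k = 1 the heap holds a single entry and each step reduces to the "compare with the current maximum" check that count() performs.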
Here is our result:

Start Calculate Ips: 06:37:51.088
result IP : 192.168.67.98 | count = 1707
End: 06:54:30.221

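With that, we have found the IP.
Once this approach is understood, other massive-data problems, each with its own quirks, can in my view be tackled along broadly similar lines.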
