Data Analysis Series Highlights (Part 2)

Now that we have the datasets provided by UCI, how do we work with them properly?

  • First, download a data file to your local machine, for example crx.data

  • We will process the crx.data file using pure Python

  • The steps are as follows

    • input: the crx.data file

    • output: a 2-D list

    • it should look like this

      >>> output
      [[data_0], [data_1], [data_2], ...]
    • a single row looks like this

      >>> output[0]
      ['b', 30.83, 0, 'u', 'g', 'w', 'v', 1.25, 't', 't', '01', 'f', 'g', '00202', 0, '+']
    • Mind the data types: don't turn every value into a string.

  • My code is as follows, for reference only

    def load_data(file_name):
        """Parse crx.data into a 2-D list, converting each field to int, float, or str."""
        output = []
        with open(file_name, 'r') as data_file:
            # never use built-in names unless you mean to replace them
            for line in data_file:
                # strip only the trailing newline, then split on commas
                # keep the last field (the class label)
                str_list = line.rstrip("\n").split(",")
                data = []
                for substr in str_list:
                    if substr.isdigit():
                        if len(substr) > 1 and substr.startswith('0'):
                            # zero-padded codes such as '00202' stay strings
                            data.append(substr)
                        else:
                            data.append(int(substr))
                    else:
                        try:
                            data.append(float(substr))
                        except ValueError:
                            if substr == '?':
                                # '?' marks a missing value in this dataset
                                substr = 'missing'
                            data.append(substr)
                output.append(data)
        return output

    # raw string, so the backslashes in the Windows path are taken literally
    output = load_data(r"E:\data\crx.data")
  • After working through the steps above, you can already feel that you are doing real data work, and you can see the importance of data types
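
To make the data-type point concrete, here is a minimal sketch that pulls the per-field conversion rules from the code above into one small function and applies them to a single raw line. The helper name `convert_field` is hypothetical (it does not appear in the original code), and the raw line is simply the example row shown earlier, written back out as comma-separated text.

```python
def convert_field(text):
    """Convert one raw field to int, float, or str (hypothetical helper)."""
    if text == '?':
        # '?' is the missing-value marker; relabel it as in the code above
        return 'missing'
    if text.isdigit():
        # Zero-padded codes such as '00202' would lose their zeros as ints
        if len(text) > 1 and text.startswith('0'):
            return text
        return int(text)
    try:
        return float(text)
    except ValueError:
        # Anything that is not numeric stays a string
        return text


raw = "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+"
row = [convert_field(field) for field in raw.split(",")]
print(row)
```

Note that '01' and '00202' stay strings while '0' becomes the integer 0, which is exactly the distinction the example row makes.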

OK, back to the point. Before you do anything else:

It is important for you to at least have a rough idea of what kind of data you are dealing with. For instance, if you have read through all the files in the data folder and the description on the website, you should at least know that:

  • This dataset consists of 690 credit card applicants' personal information and whether or not they are approved for the credit card.

  • Each data entry has 15 attributes, and the data type of each attribute is listed on the website

    • A2, A3, A8, A11, A14, and A15 are continuous (numeric)

    • All others are categorical (choices)

  • 37 cases (5%) have one or more missing values

  • This dataset has 2 classes, positive and negative, meaning approved and declined
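
One way to check these facts against your own parsed data is a small summary function. The sketch below assumes the 2-D list format produced earlier, with missing values relabelled as 'missing' and the class label in the last position of each row. The name `summarize` and the three-row sample are made up for illustration and are not the real dataset.

```python
from collections import Counter


def summarize(rows, missing_marker='missing'):
    """Return (row count, rows with a missing value, class counts)."""
    n_rows = len(rows)
    # A row counts once, however many of its fields are missing
    n_missing = sum(1 for row in rows if missing_marker in row)
    # Assumption: the class label is the last field of every row
    classes = Counter(row[-1] for row in rows)
    return n_rows, n_missing, classes


# Tiny made-up sample, only to show the shape of the result
sample = [
    ['b', 30.83, 0, '+'],
    ['a', 'missing', 4.46, '-'],
    ['b', 24.5, 0.5, '-'],
]
n_rows, n_missing, classes = summarize(sample)
print(n_rows, n_missing, dict(classes))
```

If the description on the website holds, running `summarize` on the full parsed crx.data should report 690 rows, 37 of them with at least one missing value, and exactly two class labels.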

If you haven't already read through all of this information, go back and try to understand your dataset first.

Here is the link:

https://archive.ics.uci.edu/ml/datasets/Credit+Approval
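  • By reading through the data files and the description on the website

  • we now understand what this data is actually used for

  • and we know which attribute and class each value parsed by Python corresponds to

Now that we know the attributes and classes of this data, look forward to working with it further...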

  • December 28, 2018