upc 8378: Floating-Point Numbers (simulating floating-point arithmetic)

8378: Floating-Point Numbers

Time limit: 1 Sec  Memory limit: 128 MB

Problem description

In this problem, we consider floating-point number formats, data representation formats to approximate real numbers on computers.
Scientific notation is a method to express a number, frequently used for numbers too large or too small to be written tersely in usual decimal form. In scientific notation, all numbers are written in the form m × 10^e. Here, m (called significand) is a number greater than or equal to 1 and less than 10, and e (called exponent) is an integer. For example, a number 13.5 is equal to 1.35 × 10^1, so we can express it in scientific notation with significand 1.35 and exponent 1.
As binary number representation is convenient on computers, let's consider binary scientific notation with base two, instead of ten. In binary scientific notation, all numbers are written in the form m × 2^e. Since the base is two, m is limited to be less than 2. For example, 13.5 is equal to 1.6875 × 2^3, so we can express it in binary scientific notation with significand 1.6875 and exponent 3. The significand 1.6875 is equal to 1 + 1/2 + 1/8 + 1/16, which is (1.1011)_2 in binary notation. Similarly, the exponent 3 can be expressed as (11)_2 in binary notation.
A floating-point number expresses a number in binary scientific notation in finite number of bits. Although the accuracy of the significand and the range of the exponent are limited by the number of bits, we can express numbers in a wide range with reasonably high accuracy.
In this problem, we consider a 64-bit floating-point number format, simplified from one actually used widely, in which only those numbers greater than or equal to 1 can be expressed. Here, the first 12 bits are used for the exponent and the remaining 52 bits for the significand. Let's denote the 64 bits of a floating-point number by b_64...b_1. With e an unsigned binary integer (b_64...b_53)_2, and with m a binary fraction represented by the remaining 52 bits plus one (1.b_52...b_1)_2, the floating-point number represents the number m × 2^e.
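We show below the bit string of the representation of 13.5 in the format described above.
0000000000111011000000000000000000000000000000000000000000000000
(The first 12 bits encode the exponent (11)_2 = 3; the remaining 52 bits are the fraction bits of the significand (1.1011)_2.)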
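
As a rough illustration of the format, the following C++ sketch encodes a value of at least 1 into the 64-character bit string. The helper names `to_bits` and `encode` are hypothetical (they do not come from the problem or the solution below), and the sketch assumes that `double` is IEEE 754 binary64, whose 52-bit fraction happens to match this format, so the conversion is exact for any finite `double`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>

// Hypothetical helper: 12 exponent bits followed by the 52 fraction bits.
std::string to_bits(uint64_t e, uint64_t frac) {
    std::string s;
    for (int i = 11; i >= 0; --i) s += ((e >> i) & 1) ? '1' : '0';
    for (int i = 51; i >= 0; --i) s += ((frac >> i) & 1) ? '1' : '0';
    return s;
}

// Hypothetical helper: encode x >= 1 in the problem's format.
// Assumption: double is IEEE 754 binary64 (53-bit significand).
std::string encode(double x) {
    assert(std::isfinite(x) && x >= 1.0);
    int exp2 = 0;
    double half = std::frexp(x, &exp2);  // x = half * 2^exp2 with 0.5 <= half < 1
    double m = 2.0 * half;               // significand, 1 <= m < 2
    uint64_t e = static_cast<uint64_t>(exp2 - 1);
    // (m - 1) * 2^52 is an integer below 2^52: the 52 fraction bits.
    uint64_t frac = static_cast<uint64_t>(std::ldexp(m - 1.0, 52));
    return to_bits(e, frac);
}
```

Calling `encode(13.5)` should yield exponent bits for 3 followed by the fraction bits 1011 and 48 zeros.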

In floating-point addition operations, the results have to be approximated by numbers representable in floating-point format. Here, we assume that the approximation is by truncation. When the sum of two floating-point numbers a and b is expressed in binary scientific notation as a + b = m × 2^e (1 ≤ m < 2, 0 ≤ e < 2^12), the result of addition operation on them will be a floating-point number with its first 12 bits representing e as an unsigned integer and the remaining 52 bits representing the first 52 bits of the binary fraction of m.
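The truncating addition can be sketched in C++ as follows. This is only an illustration under stated assumptions: the struct `Fp`, the constants `ONE` and `TWO`, and the function `add_trunc` are hypothetical names, and a number is held as an exponent plus a 53-bit integer significand with the leading 1 included. Because the operand with the larger exponent is already a whole number of units at that scale, truncating the exact sum is the same as shifting the smaller operand right before adding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical representation: value = sig * 2^(e - 52),
// where sig is the integer (1 b_52 ... b_1)_2 with the leading 1 included.
struct Fp {
    uint64_t e;
    uint64_t sig;
};

constexpr uint64_t ONE = 1ULL << 52;  // significand of 1.0
constexpr uint64_t TWO = 1ULL << 53;  // smallest value that needs a carry

// Hypothetical helper: add two numbers and truncate to 52 fraction bits.
Fp add_trunc(Fp x, Fp y) {
    if (x.e < y.e) { Fp t = x; x = y; y = t; }  // x has the larger exponent
    assert(ONE <= x.sig && x.sig < TWO && ONE <= y.sig && y.sig < TWO);
    uint64_t d = x.e - y.e;
    // Align y to x's exponent; the bits shifted out are the truncation loss.
    uint64_t aligned = d < 64 ? (y.sig >> d) : 0;
    uint64_t sum = x.sig + aligned;
    if (sum >= TWO) {  // significand reached 2: renormalize, dropping one more bit
        sum >>= 1;
        ++x.e;
    }
    return Fp{x.e, sum};
}
```

For instance, adding 2.0 and 1 + 2^-52 under this sketch gives exactly 3.0, because the lowest bit of the smaller operand is shifted out.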
A disadvantage of this approximation method is that the approximation error accumulates easily. To verify this, let's make an experiment of adding a floating-point number many times, as in the pseudocode shown below. Here, s and a are floating-point numbers, and the results of individual addition are approximated as described above.
s := a
for n times {
    s := s + a
}
For a given floating-point number a and a number of repetitions n, compute the bits of the floating-point number s when the above pseudocode finishes.
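For small n, the pseudocode can be simulated one addition at a time. The sketch below is a brute-force reference, not a solution for n up to 10^18; the function name `simulate` is hypothetical. It relies on the fact that a always has exponent 0, so aligning a to the current exponent e of s is a right shift by e.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: run the pseudocode literally. Only practical for small n.
// bits holds the 52 characters b_52...b_1 of a.
std::string simulate(uint64_t n, const std::string& bits) {
    assert(bits.size() == 52);
    uint64_t a = 1;  // implicit leading 1 of the significand
    for (char c : bits) a = (a << 1) | (c == '1' ? 1ULL : 0ULL);
    uint64_t e = 0, s = a;  // s := a
    for (uint64_t i = 0; i < n; ++i) {
        if (e >= 53) break;  // a >> e is 0 from here on, nothing changes
        s += a >> e;         // truncating addition: low bits of a are dropped
        if (s >> 53) {       // significand reached 2: carry into the exponent
            s >>= 1;
            ++e;
        }
    }
    std::string out;
    for (int i = 11; i >= 0; --i) out += ((e >> i) & 1) ? '1' : '0';
    for (int i = 51; i >= 0; --i) out += ((s >> i) & 1) ? '1' : '0';
    return out;
}
```

Such a reference is mainly useful for cross-checking a faster solution on small inputs.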

Input

The input consists of at most 1000 datasets, each in the following format.

n
b_52...b_1
n is the number of repetitions (1 ≤ n ≤ 10^18). For each i, b_i is either 0 or 1. As for the floating-point number a in the pseudocode, the exponent is 0 and the significand is b_52...b_1.

The end of the input is indicated by a line containing a zero.

Output

For each dataset, the 64 bits of the floating-point number s after finishing the pseudocode should be output as a sequence of 64 digits, each being 0 or 1 in one line.

Sample input

1
0000000000000000000000000000000000000000000000000000
2
0000000000000000000000000000000000000000000000000000
3
0000000000000000000000000000000000000000000000000000
4
0000000000000000000000000000000000000000000000000000
7
1101000000000000000000000000000000000000000000000000
100
1100011010100001100111100101000111001001111100101011
123456789
1010101010101010101010101010101010101010101010101010
1000000000000000000
1111111111111111111111111111111111111111111111111111
0

Sample output

0000000000010000000000000000000000000000000000000000000000000000
0000000000011000000000000000000000000000000000000000000000000000
0000000000100000000000000000000000000000000000000000000000000000
0000000000100100000000000000000000000000000000000000000000000000
0000000000111101000000000000000000000000000000000000000000000000
0000000001110110011010111011100001101110110010001001010101111111
0000000110111000100001110101011001000111100001010011110101011000
0000001101010000000000000000000000000000000000000000000000000000

Source/Category

ICPC Japan IISF 2018 


【Problem summary】

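The figure in the problem statement explains how a real number is stored in binary scientific notation: the first 12 bits hold the exponent, the last 52 bits hold the fractional part of the significand, and the integer part is implicitly 1.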

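Given n and the 52 fraction bits of a, compute the result of adding up n+1 copies of a (note that they are added one at a time, and the truncation error of every step counts), and output it in the storage format described in the statement.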

【Analysis】

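Wrong approach 1: On my first attempt I used Java big numbers to compute the exact value of (n+1)*a, kept the 52 bits starting from the second-highest bit as the fraction, and used the number of remaining bits as the exponent e. The first 7 samples came out right, but the last one was wrong. The mistake: the problem states that this storage format loses precision on each addition, whereas an exact big-number computation loses nothing, so the answers differ.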
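
To make the mismatch concrete, here is a small sketch that computes only the exponent of the exact product (n+1)*a. It uses `unsigned __int128`, which is a GCC/Clang extension and an assumption of this example (the original attempt used Java big numbers); the function name `exact_exponent` is hypothetical. For the last sample (n = 10^18, all fraction bits set) the exact product has exponent 60, while the expected output has exponent bits 000000110101, i.e. 53.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: exponent of the exact, untruncated value (n + 1) * a,
// where a = (2^52 + frac) * 2^-52.
// Assumption: the compiler supports unsigned __int128 (GCC/Clang extension).
int exact_exponent(uint64_t n, uint64_t frac) {
    assert(frac < (1ULL << 52));
    unsigned __int128 sig = (1ULL << 52) | frac;
    unsigned __int128 prod = sig * (static_cast<unsigned __int128>(n) + 1);
    int top = 0;  // index of the highest set bit of prod
    while (prod >> (top + 1)) ++top;
    return top - 52;  // prod * 2^-52 = m * 2^(top - 52) with 1 <= m < 2
}
```

The gap between 60 and 53 arises because, once the exponent of s reaches 53, a shifted right by 53 bits is zero and further additions change nothing.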

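Wrong approach 2: After giving up on big numbers, I tried to carry out the n additions with the fast-exponentiation (repeated doubling) technique, but the last sample was still wrong. The reason: doubling only ever adds a value to itself, so it cannot reproduce the loss incurred when a is added one copy at a time. Adding a one copy at a time can lose precision on every single addition, whereas doubling can lose precision only at each doubling step, so the accumulated error is different.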
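
A tiny case already shows the difference. The sketch below (hypothetical names `Fp`, `add_trunc`, `four_copies_sequential`, `four_copies_doubling`; the same exponent-plus-53-bit-integer representation is an assumption of this example) builds four copies of a = 1 + 2^-52 in both ways. Sequential addition drops the lowest bit of a twice and ends at exactly 4.0, while doubling never shifts anything out and keeps the last fraction bit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical representation: value = sig * 2^(e - 52), sig includes the leading 1.
struct Fp {
    uint64_t e;
    uint64_t sig;
};

// Hypothetical helper: truncating addition as defined in the problem statement.
Fp add_trunc(Fp x, Fp y) {
    if (x.e < y.e) { Fp t = x; x = y; y = t; }
    uint64_t d = x.e - y.e;
    uint64_t sum = x.sig + (d < 64 ? (y.sig >> d) : 0);
    if (sum >> 53) { sum >>= 1; ++x.e; }
    return Fp{x.e, sum};
}

// s := a, then s := s + a three times (the pseudocode with n = 3).
Fp four_copies_sequential(Fp a) {
    Fp s = a;
    for (int i = 0; i < 3; ++i) s = add_trunc(s, a);
    return s;
}

// The same four copies built by doubling: (a + a) + (a + a).
Fp four_copies_doubling(Fp a) {
    Fp two = add_trunc(a, a);
    return add_trunc(two, two);
}
```

With `a = Fp{0, (1ULL << 52) + 1}`, the two functions return the same exponent but different significands, which is the effect described above.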

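Correct approach: Consider what happens as we add a one copy at a time. Most of the time, adding one more a does not change the highest bit of the value, i.e. no carry occurs, so the exponent e stays the same.

So we look for the critical point: how many copies of a must be added to the current value for a carry to just occur. Ordinary division (rounded up) gives this count. We then add that many copies at once and perform the carry. As soon as a carry occurs, the exponent e increases by 1 and the significand is shifted right by one bit; the addend a, now aligned to a larger exponent, is shifted right by one bit as well. Repeat this process until n is used up or the shifted a is too small to have any effect.

The significand is simply stored in a long long variable.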
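
The carry-counting step can be sketched on its own as below. The function name `steps_to_carry` is hypothetical; `res` stands for the current 53-bit significand of s and `num` for a already shifted right by the current exponent, matching the variables used in the code that follows.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest t such that res + t * num >= 2^53,
// i.e. the number of additions until the significand carries.
uint64_t steps_to_carry(uint64_t res, uint64_t num) {
    assert(num >= 1 && res < (1ULL << 53));
    return ((1ULL << 53) - res + num - 1) / num;  // ceiling division
}
```

For a = 1.0 the counts double after every carry: with `res = 1ULL << 52`, the calls with `num` equal to `1ULL << 52`, `1ULL << 51` and `1ULL << 50` give 1, 2 and 4 additions respectively. Since each carry halves `num`, at most 53 such jumps are needed before `num` becomes 0.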

【Code】

/****
***author: winter2121
****/
#include<bits/stdc++.h>
#define rep(i,s,t) for(int i=s;i<=t;i++)
#define SI(i) scanf("%d",&i)
#define PI(i) printf("%d\n",i)
using namespace std;
typedef long long ll;
const int mod=1e9+7;
const int MAX=2e5+5;
const int INF=0x3f3f3f3f;
const double eps=1e-8;
int dir[9][2]={0,1,0,-1,1,0,-1,0, -1,-1,-1,1,1,-1,1,1};
template<class T>bool gmax(T &a,T b){return a<b?a=b,1:0;}
template<class T>bool gmin(T &a,T b){return a>b?a=b,1:0;}
template<class T>void gmod(T &a,T b){a=((a+b)%mod+mod)%mod;}
 
ll gcd(ll a,ll b){ while(b) b^=a^=b^=a%=b; return a;}
ll inv(ll b){return b==1?1:(mod-mod/b)*inv(mod%b)%mod;}
ll qpow(ll n,ll m)
{
    n%=mod;
    ll ans=1;
    for(;m;m>>=1)
    {
        if(m&1)ans=ans*n%mod;
        n=n*n%mod;
    }
    return ans;
}
 
int main()
{
    ll n;
    int bit;
    while(scanf("%lld",&n)==1&&n) // stop at the terminating 0 (or at end of input)
    {
        ll num=1ll<<52;
        for(int i=51;i>=0;i--)
        {
            scanf("%1d",&bit);
            num|=((ll)bit)<<i;
        }
        ll ans=0,res=num;
        while(n>0)
        {
            // Tim = ceil((2^53 - res) / num): after Tim additions res reaches 1<<53
            ll Tim=((1ll<<53)-res +num-1)/num;
            if(Tim>n) // fewer than Tim additions remain, so no further carry
            {
                res+=n*num;
                break;
            }
            else
            {
                res+=Tim*num;
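                ans++; // carry: the exponent increases by 1
                res>>=1; // after the carry the exponent is one larger, so res shifts right by 1
                num>>=1; // the exponent grew, so the addend a shifts right as well
                if(num<1)break; // once num drops below 1, later additions no longer change anything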
                n-=Tim;
            }
        }
        for(int i=11;i>=0;i--)
            printf("%d",((1<<i)&ans)?1:0);
        for(int i=51;i>=0;i--)
            printf("%d",((1ll<<i)&res)?1:0);
        printf("\n");
    }
    return 0;
}