Opened 14 years ago

Closed 12 years ago

#38 closed defect (fixed)

RasJ reports wrong query result

Reported by: (none) Owned by: Heinrich Stamerjohanns
Priority: major Milestone:
Component: java Version: 8.1
Keywords: Cc: Peter Baumann, Heinrich Stamerjohanns
Complexity: Medium

Description

When query
select add_cells(i[5:5]) from NN3_1 AS i
is executed via RasJ library, the following wrong result is returned:
result=-2.292137065441767E273

but the correct result is (obtained via rasql): 7835

The corresponding rasql invocation is:
./rasql -q "select add_cells(i[5:5]) from NN3_1 AS i" —out string

Attachments (2)

QueryExample.java (2.5 KB ) - added by Dimitar Misev 12 years ago.
proposed-patch.txt (1.4 KB ) - added by Heinrich Stamerjohanns 12 years ago.
Proposed patch to solve this problem

Download all attachments as: .zip

Change History (32)

comment:1 by cj, 14 years ago

Owner: set to cj
Priority: minorcritical
Status: newassigned

it's a problem with endianess. Still have to inverstigate if that's a rasdaman server problem.

comment:2 by Peter Baumann, 14 years ago

Owner: changed from cj to Dimitar Misev

comment:3 by Dimitar Misev, 13 years ago

Cc: Peter Baumann added
Status: assignedaccepted

I don't have the NN3_1 collection, is there any way to get it in order to test this?

comment:4 by Alex Dumitru, 12 years ago

Owner: changed from Dimitar Misev to Alex Dumitru
Status: acceptedassigned

comment:5 by Alex Dumitru, 12 years ago

Status: assignedaccepted

comment:6 by Dimitar Misev, 12 years ago

Priority: criticalmajor

comment:7 by Dimitar Misev, 12 years ago

Reporter: Andrei Aiordachioaie removed

Some stuff that may be helpful with this: http://darksleep.com/player/JavaAndUnsignedTypes.html
And I think we need some way to deal with #55, this workaround is not very pretty: wiki:FAQ#Overflowwhenunsignedbytesareretrievedwithrasj

comment:8 by j.oosthoek@…, 12 years ago

Just wondering, when will this be fixed you think?

by Dimitar Misev, 12 years ago

Attachment: QueryExample.java added

comment:9 by Dimitar Misev, 12 years ago

Cc: Heinrich Stamerjohanns added

I can't reproduce this bug, tried with attachment:QueryExample.java

To run the test, dump the attachment in source:java/src/examples and then

cd java
make examples
cd bin
java examples.QueryExample

The result seems fine on Debian Wheezy:

log> Executing query: select add_cells(<[0:2] 1.042323, 1.12, 1.8>) from mr2 as c
*** Result is: 3.962322950363159

Heinrich, can you run this example on Suse/CentOS?

Jelmer, can you please post the WCPS query and the corresponding rasql query (you can find it in the tomcat log).

comment:10 by Dimitar Misev, 12 years ago

Ok I reproduced it on Jelmer's planetserver, which is 64bit CentOS release 6.3 (Final):

log> Executing query: select add_cells(<[0:2] 1.042323, 1.12, 1.8>) from NIR as c
*** Result is: -2.9789158704612464E-217

And there I get the same result for rasql too:

$ rasql -q 'select add_cells(<[0:2] 1.042323, 1.12, 1.8>) from NIR as c' --out string

Result element 1: 2.42774e-317

So it seems like a server problem, not particular to Java.

comment:11 by Heinrich Stamerjohanns, 12 years ago

Actually the result from java and rasql were not the same, they seem to be rather arbitrary.
Running on CentOS 6.3 on my machine now yields these results:

Java:

log> Executing query: select add_cells(<[0:2] 1.042323, 1.12, 1.8>) from mr2 as c
*** Result is: 3.9741829939327397E-140

rasql:

$ rasql -q 'select add_cells(<[0:2] 1.042323, 1.12, 1.8>) from mr as c' --out string

Result element 1: 3.96232

So I do not have problems via rasql.

comment:12 by Heinrich Stamerjohanns, 12 years ago

Turning around the double in QueryExample.java actually fixes the problem.

log> Executing query: select add_cells(<[0:2] 1.042323, 1.12, 1.8>) from mr2 as c
*** Result is: 3.962322950363159

Still this does not explain the weird values in rasql. I guess there are some implicit conversions that take place?

                Double res = (Double) iter.next();
                long v = Double.doubleToLongBits(res);

                long v2 =   (((v>>56) & 0xffL) << 0)
                          | (((v>>48) & 0xffL) << 8)
                          | (((v>>40) & 0xffL) << 16)
                          | (((v>>32) & 0xffL) << 24)
                          | (((v>>24) & 0xffL) << 32)
                          | (((v>>16) & 0xffL) << 40)
                          | (((v>>8)  & 0xffL) << 48)
                          | (((v>>0)  & 0xffL) << 56);
                res = Double.longBitsToDouble(v2);
                System.out.println("*** Result is: " + res);

comment:13 by Dimitar Misev, 12 years ago

On planetserver I'm getting some pretty random results.. I'll try with the latest rasdaman, instead of the one from the RPMs.

$ rasql -q 'select add_cells(<[0:2] 1, 1, 1>) from NIR as c' --out string
rasql: rasdaman query tool v1.0, rasdaman v8 -- generated on 11.07.2012 18:29:26.
opening database RASBASE at localhost:7001...ok
Executing retrieval query...ok
Query result collection has 1 element(s):
  Result element 1: 50331648
rasql done.

in reply to:  13 ; comment:14 by Heinrich Stamerjohanns, 12 years ago

Query result collection has 1 element(s):

Result element 1: 50331648

That is actually 00000011 00000000 00000000 00000000.

So it as also just switched around. But what I do not understand is that on some machines
the rasql client seems to behave differently.

in reply to:  14 comment:15 by Heinrich Stamerjohanns, 12 years ago

the rasql client seems to behave differently.

…because I get 3 as a result, when I do the same query on CentOS 6.3 on my machine.

comment:16 by Dimitar Misev, 12 years ago

Seems like the VMs are a special case, if I'm not wrong on all the real machines we've tested it has worked fine?

comment:17 by Dimitar Misev, 12 years ago

Or maybe it's just a CentOS case..

in reply to:  17 comment:18 by Heinrich Stamerjohanns, 12 years ago

Replying to dmisev:

Or maybe it's just a CentOS case..

I am running CentOS also in a VM (VirtualBox), everything seems to be fine there.
Or have there been any changes within the last two months that could affect this?

comment:19 by Dimitar Misev, 12 years ago

Wow, I restarted rasdaman on planetserver and now it gives 3, so rasql is (somewhat) fine.

The java example still gives the same result though.

in reply to:  19 comment:20 by Heinrich Stamerjohanns, 12 years ago

Replying to dmisev:

Wow, I restarted rasdaman on planetserver and now it gives 3, so rasql is (somewhat) fine.

The java example still gives the same result though.

At least it seems to consistently not work. You can try the fix above. Then we
need to think how this can be solved in a cleaner way.

comment:21 by Dimitar Misev, 12 years ago

Yes but how can we determine when the bytes are actually in the wrong order?

comment:22 by Dimitar Misev, 12 years ago

I've hopefully narrowed down the problem to the getResponse method in source:java/src/rasj/rnp/RasRNPImplementation.java , something must be going wrong in there, probably in the case RESPONSE_SKALARS:

in reply to:  22 comment:23 by Heinrich Stamerjohanns, 12 years ago

Replying to dmisev:

I've hopefully narrowed down the problem to the getResponse method in source:java/src/rasj/rnp/RasRNPImplementation.java , something must be going wrong in there, probably in the case RESPONSE_SKALARS:

That was helpful…
but the byte array that arrives is actually swapped on CentOS 6.3 (64-bit) and openSUSE 12.2 (32-bit), so everything seems ok there. I might still be a problem of the server…

by Heinrich Stamerjohanns, 12 years ago

Attachment: proposed-patch.txt added

Proposed patch to solve this problem

comment:24 by Heinrich Stamerjohanns, 12 years ago

Owner: changed from Alex Dumitru to Heinrich Stamerjohanns

Actually the problem lies in servercomm/servercomm2.c, when the code tries to swap values
for requests with different endianness (typical java queries). I assume that older gcc-versions
choke how buffer is being reassigned.
To solve the problem, buffer is being assigned vie memcpy. Also gcc-lib directly supports hardware optimized byte-swapping, so use those functions instead.

The patch (see attachment) fixes the cases for floats and doubles, other types remain to be investigated.

comment:25 by Dimitar Misev, 12 years ago

Great, I'm glad you found the problem. I looked a bit around httpserver.cc and servercomm2.cc and it seems pretty messy, with many comments that indicate hacks in the code.. :)

Just one detail about the patch, maybe it's better to fix this in r_Endian (since this is the class that deals with endianness stuff) and then call an appropriate function?

comment:26 by Heinrich Stamerjohanns, 12 years ago

Actually I would rather want to remove the functions that byteswap the doubles and floats in r_Endian (or fix their return values) because they will create all subtle problems. Currently these functions try to store a byteswapped double (or float) into the same type again. The problem is that the values might not be doubles any more, because you may not store arbitrary bit patterns in doubles. If the byteswapped value is stored in a FPU register, this might create all kind of possible problems (normalizing, or creating a floating point exception or creating NaN values). So I do not think that using r_Endian as it is currently is the right way to go.

So I think it is ok to du the conversion directly there (it is just two lines). I do agree that
both files need cleanup though.

in reply to:  26 comment:27 by Dimitar Misev, 12 years ago

Replying to hstamerjohanns:

Actually I would rather want to remove the functions that byteswap the doubles and floats in r_Endian (or fix their return values) because they will create all subtle problems

yes this was my idea, either fix the current functions or put those two lines in new functions in r_Endian?

I do agree that both files need cleanup though.

I opened ticket #211 so eventually we'll take care about it.

I'll test your fix on planetserver, but I'm not convinced it will help because the wrong value doesn't seem to be connected to endianness there..

comment:28 by Dimitar Misev, 12 years ago

Actually it did fix it!! So we can close this ticket and #195 as well, once you submit the patch Heinrich.

in reply to:  28 comment:29 by Heinrich Stamerjohanns, 12 years ago

Replying to dmisev:

Actually it did fix it!! So we can close this ticket and #195 as well, once you submit the patch Heinrich.

Yes, the java queries ask for big-endian and want big-endian back (see RnpMessageHeader?.java:64), therefore the data needs to be converted. I have checked this in all detail that the codepath is taken and QueryExample?.java now returns the right result.

By the way I changed
Double res = (Double) iter.next();
to
Object res = iter.next();

then you can also test float and integer queries.

comment:30 by Dimitar Misev, 12 years ago

Resolution: fixed
Status: acceptedclosed
Note: See TracTickets for help on using tickets.