Why not SHA1?

August 4th, 2011 09:00

I'm curious. Why was SHA0 selected over SHA1 for Atmos checksums? I am far from an encryption expert, but a quick look online indicates that SHA0 was discarded nearly 20 years ago and replaced with SHA1 due to a significant flaw in the SHA0 algorithm.


August 4th, 2011 13:00

Mark,

Our checksum feature is designed to let us add other algorithms in the future, but we started with SHA0 based on a particular customer request.

Raj


August 18th, 2011 12:00

I'll second the request to add other schemes in the future.

Another neat thing would be to switch the way it's handled. Right now, I have to calculate the checksum before I even send any data. This requires me to pass the whole object through the calculation and then send the data, meaning I have to pass through the data twice. It would be nice if I could calculate the checksum as I'm sending the object, meaning I only have to pass through the data once.

The way to do that would be to tell Atmos to calculate the checksum using algorithm A and return the checksum as a header in the response. As I'm streaming the data out, I pass it through my checksum function inline. I then compare the checksum in the returned header with the one I calculated as the data went out over the line, and if they differ, then something went wrong and I should delete the object and try again.
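
Something like this, as a minimal Java sketch of the flow being requested. The endpoint URL is a placeholder, request signing is omitted, and the x-emc-checksum response header is hypothetical (it is the feature being asked for, not part of the current API):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.security.MessageDigest;

    public class InlineChecksumUpload {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(
                    "https://atmos.example.com/rest/objects").openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setChunkedStreamingMode(8192); // stream without buffering the whole object

            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[8192];
            try (InputStream in = new FileInputStream("object.bin");
                 OutputStream out = conn.getOutputStream()) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    sha1.update(buf, 0, n); // fold each chunk into the digest...
                    out.write(buf, 0, n);   // ...as it goes out over the wire
                }
            }

            // Hypothetical response header: the server would compute the same
            // digest over what it received and return it.
            String serverSum = conn.getHeaderField("x-emc-checksum");
            String localSum = toHex(sha1.digest());
            if (!localSum.equalsIgnoreCase(serverSum)) {
                System.err.println("Checksum mismatch; delete the object and retry");
            }
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }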

It's too bad HTTP doesn't have "footers" as well as headers so we could send trailing metadata!

August 19th, 2011 02:00

Could you not manage the checksum yourself by adding it as metadata to the object?

You send the object to Atmos, calculating the checksum as you go, and then add it as a piece of custom metadata to the object.

Then when you read the object back, you also read the custom metadata and recalculate the checksum as you read.

This way you can verify that the data written to disk and retrieved across the network is exactly the same. You could also store the metadata locally with the object ID.
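
A rough sketch of that approach, assuming the x-emc-meta header and the ?metadata/user request from the Atmos REST API. The required signing headers (x-emc-uid, x-emc-signature, and so on) are omitted to keep it short:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ChecksumAsMetadata {
        // Attach a locally computed digest to an existing object as user
        // metadata. Authentication/signing headers are omitted from this sketch.
        static void storeChecksum(String objectId, String hexDigest) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(
                    "https://atmos.example.com/rest/objects/" + objectId
                    + "?metadata/user").openConnection();
            conn.setRequestMethod("POST");
            // x-emc-meta carries name=value pairs of user metadata.
            conn.setRequestProperty("x-emc-meta", "sha1=" + hexDigest);
            if (conn.getResponseCode() != 200) {
                throw new IllegalStateException(
                        "metadata update failed: " + conn.getResponseCode());
            }
        }
    }

On read-back you would GET the same ?metadata/user resource, pull the stored value out of the x-emc-meta response header, and compare it against the digest recomputed over the downloaded bytes.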


August 19th, 2011 08:00

True. But I see 3 issues/inefficiencies:

  1. That would require a separate trailing call to ?metadata/user for each file, which is a bit inefficient.
  2. I have to read the whole object to recalculate and guarantee the checksum. (I suppose that is true regardless; there is a note in the API that says "Client applications are responsible for performing checksum verifications on object reads." This raises a question: when doing a range read, is the value of this header for the whole file or just the part read?)
  3. I imagine that the checksum is used internally by Atmos for the GeoProtect feature, in that they probably use it to make sure primary or replica objects are not corrupt.


August 19th, 2011 12:00

Adam Marcionek wrote:

True. But I see 3 issues/inefficiencies:

  • That would require a separate trailing call to ?metadata/user for each file, which is a bit inefficient.
  • I have to read the whole object to recalculate and guarantee the checksum. (I suppose that is true regardless; there is a note in the API that says "Client applications are responsible for performing checksum verifications on object reads." This raises a question: when doing a range read, is the value of this header for the whole file or just the part read?)
  • I imagine that the checksum is used internally by Atmos for the GeoProtect feature, in that they probably use it to make sure primary or replica objects are not corrupt.
1. True, but you'll either have to send the checksum before or after sending the content. If you don't want to compute it beforehand and send x-emc-wschecksum, you'll have to set your own (e.g., in user metadata) afterward. (Roughly what the create-time mechanism looks like is sketched below.)
2. The x-emc-wschecksum returned from partial reads is always the value for the entire file. So, no, it doesn't have much value for a range request.
3. The checksum is only validated by the web service layer on create/update. Once GeoProtect takes the data, it uses its own algorithms to checksum, verify, and rebuild the blocks, depending on the configuration your policy uses.

I have a few more alternatives for you too:

1. If you need random access to your data and want to checksum on read, you could create something like a BitTorrent manifest file that computes checksums at the block level, store it in metadata (or an adjacent object), and use that to verify the individual blocks on read.
2. If you communicate with Atmos using HTTPS, you should have some reasonable protection against payload corruption. Each block transmitted has a MAC computed that is validated on the other side.
3. Depending on which programming language you use, you can wrap input and output stream classes to compute checksums as the data is streaming. This should remove the need to rewind the stream after computing the checksum; see the sketch below.
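
In Java, for example, the standard library already provides such a wrapper. This minimal sketch digests the object as a side effect of reading it, so the data only passes through once; the read loop stands in for whatever code streams the bytes to Atmos:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;

    public class StreamingChecksum {
        public static void main(String[] args) throws Exception {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");

            // DigestInputStream updates the digest as a side effect of every
            // read, so no second pass over the data is needed.
            try (InputStream in = new DigestInputStream(
                    new FileInputStream("object.bin"), sha1)) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // ...hand each chunk to the HTTP client here...
                }
            }

            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) hex.append(String.format("%02x", b));
            System.out.println("sha1=" + hex);
        }
    }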