May 18th, 2010 03:00

getting md5 checksum on stored blobs?

Hi guys, is there any way to get the MD5 checksum of stored clips other than streaming the file to local disk and calculating it there?

I guess there must be some internal md5 generation to ensure the validity of blobs but I am not sure if this is accessible through the Java SDK?

  best regards, Chris

417 Posts

May 18th, 2010 04:00

Hi Christoph – no, there is no way of getting this via the SDK.

Regards, Graham L. Stuart

Centera SDK Architect

11 Posts

May 18th, 2010 06:00

???

Christoph, you want the MD5 of the blob(s), right?  What's wrong with reading the CDF?  The first 27 characters (minus #14) of the BlobID contain the MD5, in an EMC encoded format.  And if you want the MD5 of the CDF, same thing.

Much faster than reading the blob back, if all you want is the MD5.  You just have to translate it to a 'more standard' encoding.  Msg me off-board if interested.  And, of course, if Centera is segmenting a large (>100MB) blob, then you'd have that issue to resolve, as you'd have not one, but multiple MD5s for a single 'user' object.

..clark

clark@storageswitch.com

303.859.3321

May 19th, 2010 02:00

Hi Graham, as there is probably an MD5 somewhere internal in Centera, wouldn't it make sense to pass this back via the SDK, just as you provide the filesize?

417 Posts

May 19th, 2010 03:00

There is really no requirement to pass it back - it is used internally by the SDK, which already validates that the MD5 of the content read back matches the original internal MD5 of the blob ID. You get a BLOBID MISMATCH if it doesn’t, so this “validation” step that you want to do has already been done!

Regards, Graham L. Stuart

Centera SDK Architect

May 19th, 2010 04:00

Hi Graham, you are totally right and I don't ask for this feature because I don't trust Centera; it's just the following situation:

We have migrated a customer from Jukebox to Centera and our customer wants to take snapshots of the migrated documents to check if the migration was really successful. For this the customer has listings of the Jukebox files (with md5) and wants to compare them with file listings on Centera.

So we need to get the md5 of all documents migrated to Centera for a specific number of days (right now it's seven days). I have done a search and within the 7 days we have about 7.000.000 documents stored in Centera. Streaming all these documents is, in my opinion, not very practical.

So an SDK method to get the (already existing) md5 would be a great feature. Right now you provide the filesize in the SDK and I would like to know why you don't want to provide the md5 too.

What is the reason for not supplying this?

  c
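The comparison described in this post (Jukebox listing with MD5s against a listing taken from Centera) could be sketched as below. This is only an illustration: the class name `ListingCompare`, the method `mismatches`, and the idea that both listings are available as name-to-MD5-hex maps are assumptions, not anything the Centera SDK provides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares two listings of document name -> MD5 hex string.
public class ListingCompare {

    /** Returns one line per source document that is missing or has a different MD5 in the target. */
    public static List<String> mismatches(Map<String, String> source, Map<String, String> target) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> entry : source.entrySet()) {
            String other = target.get(entry.getKey());
            if (other == null) {
                problems.add(entry.getKey() + ": missing");
            } else if (!other.equalsIgnoreCase(entry.getValue())) {
                // Hex digests are compared case-insensitively.
                problems.add(entry.getKey() + ": md5 differs");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> jukebox = new LinkedHashMap<>();
        jukebox.put("doc1", "5d41402abc4b2a76b9719d911017c592");
        jukebox.put("doc2", "d41d8cd98f00b204e9800998ecf8427e");

        Map<String, String> centera = new LinkedHashMap<>();
        centera.put("doc1", "5D41402ABC4B2A76B9719D911017C592");

        for (String line : mismatches(jukebox, centera)) {
            System.out.println(line);
        }
    }
}
```

The sketch only checks documents listed on the source side; documents that exist only in the target listing would need a second pass in the other direction.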

417 Posts

May 19th, 2010 04:00

Hi Chris – you have an extreme use case and nobody ever really wanted it before!

The fact that we do not enable the blob ID to be retrieved via the SDK goes some way to explaining this. Also, it would break if you had multiple 100MB blobs – we’d need to do a massive MD5 over potentially GBs of data in order to return the MD5 accurately in each case.

Sorry! I’m afraid the only way to do this (without an EMC service engagement) is to work it the way that Clark suggested.

The fact that (even with the SDK) you would still have to open all the clips means that using RawOpen to get the XML is not significantly different anyway (in terms of Centera API overhead).

Regards, Graham L. Stuart

Centera SDK Architect

417 Posts

May 19th, 2010 05:00

Hi Chris – sorry, I meant RawRead. This will give you an XML representation of the CDF, and when you navigate to the Tag you will find the blob ID as one of the attributes. I do not have any code that will do this for you.

The MD5 is calculated as it streams and it forms part of the BlobID but is not exposed via the SDK, as you are not supposed to have any knowledge of the underlying blob ID. Also, if the cluster uses any naming scheme other than MD5 (and there are others – GM, MG, GM+D, etc.) then you would not get back a pure MD5.

This information is EMC internal so I cannot disclose details of how the blob ID is calculated. If you are able to reverse engineer it then that is down to you.

In terms of performance for huge objects, we do not need to do it on the cluster, so we would not accept the performance hit purely for a use case like yours. If you choose to do it, then it is down to your application to take the hit.

I am sorry, but we will not be adding this type of functionality to the SDK as there is no compelling business reason for us to do so. The Centera Content Address uses an MD5 but is not purely an MD5, so you will need to calculate your own if your customer requires you to do this type of verification.

Regards, Graham L. Stuart

Centera SDK Architect
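For the "calculate your own" route mentioned above, a minimal sketch using only the standard Java library might look like this. It assumes the blob content is available as a `java.io.InputStream`; how that stream is obtained from the SDK is not shown, and the class name `Md5Stream` and method `md5Hex` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes an MD5 over a stream without holding the whole content in memory.
public class Md5Stream {

    /** Reads the stream to the end and returns its MD5 as 32 lowercase hex characters. */
    public static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[64 * 1024];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest.digest()) {
            // Mask to 0..255 so negative bytes are formatted as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in content; in practice the stream would carry the blob data.
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(new ByteArrayInputStream(sample)));
    }
}
```

Because the digest is updated chunk by chunk, memory use stays constant regardless of blob size; the cost is the network transfer of every blob, which is the overhead discussed in this thread.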

May 19th, 2010 05:00

Hi Graham, the following things are not clear to me

(a) Is the MD5 checksum generated when streaming a blob initially to Centera? If so, then it is already present and no recalculation will be needed?

(b) Also, in the case of calculating the md5 of 100 MB blobs I don't see the performance problem; if Centera does not fetch the data then I would have to do it myself. It's just a question of who has to do all this, and I guess the client side is less performant because there are networks between client and Centera. Last point on this: isn't it the client's responsibility to make sure the system stays stable? You could also provide a method for < 100 MB blobs; these should be no problem?

(c) Regarding Clark's suggestion: I don't have a BlobID when I traverse the tags and their blobs using the SDK? Can you send me a snippet of Java code so that I know how to get the BlobID and the md5 out of it?

(d) RawOpen? There is a method

public void RawRead(java.io.OutputStream pStream)

on the FPClip object. Do you mean this? Is it documented somewhere what comes back: binary data, xml, ....?

  Chris
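One way to explore what comes back, assuming (as stated earlier in this thread) that `RawRead` writes an XML representation of the CDF to the supplied `OutputStream`, is to capture the bytes in a `ByteArrayOutputStream` and list every element attribute with the standard Java DOM parser. The sketch below is not SDK code: the class `CdfAttributeDump` is hypothetical, and the sample XML in `main` is invented purely so the example runs; it does not reflect the real CDF layout, which is undocumented here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists all attributes found in an XML document, without needing a schema.
public class CdfAttributeDump {

    /** Parses the XML bytes and returns "element@attribute=value" for every attribute. */
    public static List<String> listAttributes(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse() throws a SAXException if the input is not well-formed XML.
        Document doc = builder.parse(new ByteArrayInputStream(xml));
        List<String> found = new ArrayList<>();
        NodeList elements = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < elements.getLength(); i++) {
            Node element = elements.item(i);
            NamedNodeMap attributes = element.getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                Node attribute = attributes.item(j);
                found.add(element.getNodeName() + "@" + attribute.getNodeName()
                        + "=" + attribute.getNodeValue());
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample, NOT the real CDF format.
        String sample = "<root><item id=\"1\" kind=\"example\"/><item id=\"2\"/></root>";
        for (String line : listAttributes(sample.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(line);
        }
    }
}
```

In real use the byte array would come from something like `ByteArrayOutputStream.toByteArray()` after passing that stream to `RawRead`. Dumping every attribute avoids guessing element or attribute names, which this thread does not specify.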

417 Posts

May 19th, 2010 06:00

I will try and get hold of the DTD / schema for you.

Regarding the streaming, the SDK can calculate it before it streams the data, but this is inefficient so we generally recommend CALCID STREAMING. The cluster calculates it as the data comes in. When the data transfer is complete, they compare values to ensure that the transmitted content matches what was received. This then forms part of the Content Address, but it can vary in format.

May 19th, 2010 06:00

Hi Graham, the "The MD5 is calculated as it streams" is what I wanted to know; so there is no md5 stored in Centera and this answers a lot of questions.

Last question: regarding the "XML representation of the CDF": is there an XML Schema or DTD specifying the format of the documents which are returned? I hope so, and as it is passed out this should not be internal :)

  c

417 Posts

June 1st, 2010 06:00

Hi Christoph - unfortunately not. The only DTD / Schema I have seen relates to the Health Report.

June 1st, 2010 06:00

Hi Graham, did you get the XML Schema/DTD from engineering?

  c

June 7th, 2010 00:00

Hi, so there is no way to get this definition? Is there some engineer/developer at Centera I could ask for it? Is this method officially supported?

417 Posts

June 7th, 2010 06:00

Hi Christoph - I am not sure what you mean by "officially supported" in this context, or what you would gain by having the XML Schema.

Are you looking to create an XML document and use this to automatically ingest? This is most certainly not supported as there are no API calls to do that. Even if you were to try and create one and use RawOpen, there are key pieces of metadata that you would be unable to generate.

June 7th, 2010 06:00

Hi, the thing is just that RawOpen is a call in the official SDK and it allows me to query an XML document (as string) which I would like to read.

Reading XML normally involves checking it against a predefined XML Schema or DTD so that one can be sure it is well formed (and also there are nice tools that can generate Java wrappers out of the XML Schema).

So I guess there must be such an XML Schema somewhere out there, because what should one do with a call whose result is not documented?

  c
