
December 8th, 2011 21:00

Please help! Can anyone help with this?

Array data is popular in many database application domains. For example, a medical institute often maintains gene expression (e.g., microarray) data for many patients for comparative medical studies, and such data can be modeled as a 2-D array (or a matrix). If we were to map that to a relational schema, it would be something like:

Gene_data (patientID, geneNumber, expression_data)

We can imagine that each tuple in this table is an element in a matrix, with each row corresponding to a patient and each column to a gene. Such a database can hold expression_data for tens of thousands of genes from tens of thousands of patients, leading to many gigabytes of data in a single table. Popular queries against such databases include:

(1) row_retrieval: get the expression_data for all genes from one patient;

(2) column_retrieval: get the expression_data of one particular gene from all patients;

(3) sub_matrix: get the expression_data of some chosen genes from a subset of patients.

We can first assume that data are stored tuple-by-tuple following the above relational schema.

a. Based on your knowledge of (raw) data organization and indexing, answer the following questions:

1. If there is no index at all, and the size of the Gene_data table is M pages, what are the costs of processing the above three queries?

2. If there is a B+-tree index on the key of the Gene_data table, how can we use the index to improve query processing performance? Answer Question 1 again with the new assumption about the B+-tree index.

3. I am sure you will not be satisfied with the performance cost you derived in the previous questions. Design a new data processing engine that can significantly improve the performance of such a database system. (Hint: the data is mostly static, so why do we have to keep only one copy of it?) You can draw diagrams and put in as much detail as you can to explain your ideas.
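To make Question 1 concrete, here is a back-of-envelope I/O cost sketch in Python (the table and patient counts below are made-up numbers, not from the post): with no index and tuples stored one by one, all three queries must read every one of the M pages, while clustering the file by patientID would shrink row_retrieval to roughly M/P pages.

```python
# Back-of-envelope I/O cost model for the Gene_data heap file.
# Assumptions (mine, for illustration): M pages total, P patients,
# tuples stored tuple-by-tuple with no auxiliary structures.

def heap_scan_cost(M):
    """With no index, row_retrieval, column_retrieval and sub_matrix
    must all read the full table: M page I/Os each."""
    return M

def clustered_row_retrieval_cost(M, P):
    """If tuples were clustered by patientID, one patient's genes
    would occupy roughly M / P contiguous pages."""
    return max(1, M // P)

M, P = 100_000, 10_000                     # hypothetical sizes
print(heap_scan_cost(M))                   # 100000 pages for any query
print(clustered_row_retrieval_cost(M, P))  # 10 pages per patient
```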

161 Posts

December 8th, 2011 22:00

Queries always run slowly without an index in the DB. But for a better answer on an optimized data structure, we need to do some research and will give you feedback then.

161 Posts

December 9th, 2011 00:00

Continue...

Considering the low-cardinality columns (only three columns are mentioned), the B-tree index structure has no benefit for this selectivity requirement.

The bitmap index looks like a better fit for columns with few distinct values. Are you trying to build several bitmap indexes on the columns? But you didn't mention the modification frequency on the table, and frequent modification is not a bitmap index's strength.

There are some notes on performance tuning with indexes at http://www.dba-oracle.com/art_9i_indexing.htm.
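To illustrate the bitmap idea Louis mentions, here is a minimal sketch (my own toy example, not how any real engine implements it) where each distinct value of a column gets one bit vector, with Python integers standing in for the bit vectors:

```python
# Toy bitmap index: one bitmap per distinct value of a column.
# Python ints serve as arbitrary-length bit vectors.
from collections import defaultdict

def build_bitmap_index(rows, col):
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[col]] |= 1 << i    # set bit i: row i has this value
    return index

rows = [
    {"patientID": 1, "geneNumber": 7, "expression_data": 0.9},
    {"patientID": 1, "geneNumber": 8, "expression_data": 0.1},
    {"patientID": 2, "geneNumber": 7, "expression_data": 0.5},
]
idx = build_bitmap_index(rows, "geneNumber")
print(bin(idx[7]))   # 0b101 -> rows 0 and 2 contain gene 7
```

Bitwise AND/OR of two such bitmaps answers multi-predicate queries cheaply, which is why bitmaps shine on low-cardinality, rarely modified columns.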

Lu

11 Posts

December 9th, 2011 01:00

Thanks Louis.

As this is a scenario, some things are presumed :)

a. Based on your knowledge of (raw) data organization and indexing, answer the following questions:

1. If there is no index at all, and the size of the Gene_data table is M pages, what are the costs of processing the above three queries?

2. If there is a B+-tree index on the key of the Gene_data table, how can we use the index to improve query processing performance? Answer Question 1 again with the new assumption about the B+-tree index.

3. I am sure you will not be satisfied with the performance cost you derived in the previous questions. Design a new data processing engine that can significantly improve the performance of such a database system. (Hint: the data is mostly static, so why do we have to keep only one copy of it?) You can draw diagrams and put in as much detail as you can to explain your ideas.
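One common answer to Question 3's hint (a sketch under my own assumptions, not an official solution): since the data is mostly static, keep two sorted copies of the table, one clustered by patient and one clustered by gene, so that both row_retrieval and column_retrieval become sequential range reads instead of full scans.

```python
# Sketch: store the (patientID, geneNumber, expression_data) tuples
# twice, in two different sort orders, so each access pattern reads
# a contiguous range. Data and sizes here are illustrative.

def make_copies(tuples):
    by_patient = sorted(tuples, key=lambda t: (t[0], t[1]))  # row-major copy
    by_gene    = sorted(tuples, key=lambda t: (t[1], t[0]))  # column-major copy
    return by_patient, by_gene

def row_retrieval(by_patient, pid):
    return [t for t in by_patient if t[0] == pid]    # contiguous on disk

def column_retrieval(by_gene, gene):
    return [t for t in by_gene if t[1] == gene]      # contiguous on disk

tuples = [(1, 7, 0.9), (2, 7, 0.5), (1, 8, 0.1), (2, 8, 0.3)]
bp, bg = make_copies(tuples)
print(row_retrieval(bp, 1))     # [(1, 7, 0.9), (1, 8, 0.1)]
print(column_retrieval(bg, 7))  # [(1, 7, 0.9), (2, 7, 0.5)]
```

The extra copy doubles storage, but because the data is mostly static the duplication costs almost nothing in maintenance; this is the same idea behind column stores and materialized projections.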

161 Posts

December 9th, 2011 02:00

The costs really depend on the access pattern, table partitioning, the size of pages, and other factors.

Sometimes people think adding an index will certainly improve performance, but that is not true.
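Rough numbers (all assumed by me) show why an index is not always a win: an unclustered B+-tree lookup costs about one random page I/O per matching tuple, so a query that matches many tuples can cost more than simply scanning all M pages.

```python
# Assumed numbers: unclustered B+-tree access vs a full table scan.

def full_scan_cost(M):
    return M                        # read every page once, sequentially

def unclustered_index_cost(matching_tuples, tree_height=3):
    # Descend the tree once, then ~1 random page I/O per matching tuple.
    return tree_height + matching_tuples

M = 100_000
print(full_scan_cost(M))                 # 100000
print(unclustered_index_cost(10_000))    # 10003 -> index wins here
print(unclustered_index_cost(500_000))   # 500003 -> worse than scanning
```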

109 Posts

December 12th, 2011 11:00

Unfortunately, Madhusudanan, the best answer might be outside of Oracle. In this case you might consider EMC Greenplum, as its massively parallel, shared-nothing architecture enables you to perform queries like "get the expression_data for all genes from one patient" much faster than Oracle's row retrieval. Greenplum's ability to take a complex or extremely large query and split it up across computational resources and separate storage devices means much faster query results. In addition, from your description it sounds like massive data loads will be needed, and in this respect Greenplum's MPP architecture shines too. Some good links to look at include:

http://www.greenplum.com/community/

This link has some video presentations from the world's first data scientist summit, which may be of interest to you.

www.greenplum.com

Overall product page.
