2011-11-15

Tags: cassandra , 程式語言

這裡寫的Cassandra Secondary indexes功能是以1.0版為基準所驗証出來的結果。有些部份是以實際結果來反推它實作可能的方式,不知正確性為何。如果錯了,我之後再來修正吧...:P

針對Cassandra的Secondary indexes功能, 半官方文件是這麼說的

  1. column上可建立index
  2. 建立index的column可以用"="進行查詢(e.g. get employee where dept = 'rd' ,dept的column有建index,所以可以用"="查詢)
  3. 在查詢條件式中,最少要一個"="套用在有建立index的column,否則查詢會失敗
  4. column如果被建了index,它的column value不要有太多種可能性(e.g. country這個column name在我的系統中可能的值只'tw','us','jp';這就很適合拿來建index,因為可能產生的值不多)
看完上述官方的論述後,應該還是搞不懂建了index後有啥好處,覺的官方是在講 三小 外星語,會覺的index這功能很癈,只能作"="查詢,遜到爆了。

當然...真相一定不是這樣的,不然這功能早就被罵翻天了。

首先,先建立下述這個keyspace、column family,然後建立測試資料
create keyspace testks
with strategy_options=[{replication_factor:1}]
and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';

use testks;

create column family employee
with comparator = UTF8Type
and default_validation_class = UTF8Type
and column_metadata = [
{column_name : age,
validation_class : LongType},
{column_name : name,
validation_class : UTF8Type},
{column_name : dept,
validation_class : UTF8Type,
index_type : KEYS}
];
[default@testks] list employee;
Using default limit of 100
-------------------
RowKey: 3
=> (column=age, value=3, timestamp=1320997600912006)
=> (column=dept, value=rd, timestamp=1320997600913000)
=> (column=name, value=tester3, timestamp=1320997600912005)
-------------------
RowKey: 6
=> (column=age, value=6, timestamp=1320997600913008)
=> (column=dept, value=mis, timestamp=1320997600913009)
=> (column=name, value=tester6, timestamp=1320997600913007)
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)
-------------------
RowKey: 10
=> (column=age, value=10, timestamp=1320997600913020)
=> (column=dept, value=mis, timestamp=1320997600913021)
=> (column=name, value=tester10, timestamp=1320997600913019)
-------------------
RowKey: 8
=> (column=age, value=8, timestamp=1320997600913014)
=> (column=dept, value=mis, timestamp=1320997600913015)
=> (column=name, value=tester8, timestamp=1320997600913013)
-------------------
RowKey: 2
=> (column=age, value=2, timestamp=1320997600912003)
=> (column=dept, value=mis, timestamp=1320997600912004)
=> (column=name, value=tester2, timestamp=1320997600912002)
-------------------
RowKey: 1
=> (column=age, value=1, timestamp=1320997600912000)
=> (column=dept, value=rd, timestamp=1320997600912001)
=> (column=name, value=tester1, timestamp=1320997600897000)
-------------------
RowKey: 9
=> (column=age, value=9, timestamp=1320997600913017)
=> (column=dept, value=rd, timestamp=1320997600913018)
=> (column=name, value=tester9, timestamp=1320997600913016)
-------------------
RowKey: 4
=> (column=age, value=4, timestamp=1320997600913002)
=> (column=dept, value=mis, timestamp=1320997600913003)
=> (column=name, value=tester4, timestamp=1320997600913001)
-------------------
RowKey: 7
=> (column=age, value=7, timestamp=1320997600913011)
=> (column=dept, value=rd, timestamp=1320997600913012)
=> (column=name, value=tester7, timestamp=1320997600913010)

10 Rows Returned.

以下是結合官方跟自己實際試驗過後得到的結論
1.只以一個欄位當查詢條件式,但該欄位沒建立index的話,無法直接用"=、>、<..."查詢
[default@testks] get employee where age = 10;
No indexed columns present in index clause with operator EQ
[default@testks] get employee where age >= 10;
No indexed columns present in index clause with operator EQ
[default@testks] get employee where age <= 10;
No indexed columns present in index clause with operator EQ

2.只以一個欄位當查詢條件式,但該欄位有建立index的話,只能用"="查詢
[default@testks] get employee where dept = 'rd'; 
-------------------
RowKey: 3
=> (column=age, value=3, timestamp=1320997600912006)
=> (column=dept, value=rd, timestamp=1320997600913000)
=> (column=name, value=tester3, timestamp=1320997600912005)
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)
-------------------
RowKey: 1
=> (column=age, value=1, timestamp=1320997600912000)
=> (column=dept, value=rd, timestamp=1320997600912001)
=> (column=name, value=tester1, timestamp=1320997600897000)
-------------------
RowKey: 9
=> (column=age, value=9, timestamp=1320997600913017)
=> (column=dept, value=rd, timestamp=1320997600913018)
=> (column=name, value=tester9, timestamp=1320997600913016)
-------------------
RowKey: 7
=> (column=age, value=7, timestamp=1320997600913011)
=> (column=dept, value=rd, timestamp=1320997600913012)
=> (column=name, value=tester7, timestamp=1320997600913010)

5 Rows Returned.
[default@testks] get employee where dept >= 'rd';
No indexed columns present in index clause with operator EQ
[default@testks] get employee where dept <= 'rd';
No indexed columns present in index clause with operator EQ

3.有多個欄位當查詢條件式;只要其中一個欄位有建立index,而且該欄位在查詢時是以"="條件進行查詢;其它欄位的查詢條件不管怎麼下,都不受限制("=、>、<..."都可以成功執行)。也就是說可以進行大家超想要的RangeQuery(e.g. age > 1 and age < 10)功能
[default@testks] get employee where dept = 'rd' and age >=1 and age <=5 and name = 'tester5';
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)

1 Row Returned.
[default@testks] get employee where age >=1 and age <=5 and name = 'tester5' and dept = 'rd';
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)

1 Row Returned.
[default@testks] get employee where dept = 'rd' and age >=1 and age <=3;
-------------------
RowKey: 3
=> (column=age, value=3, timestamp=1320997600912006)
=> (column=dept, value=rd, timestamp=1320997600913000)
=> (column=name, value=tester3, timestamp=1320997600912005)
-------------------
RowKey: 1
=> (column=age, value=1, timestamp=1320997600912000)
=> (column=dept, value=rd, timestamp=1320997600912001)
=> (column=name, value=tester1, timestamp=1320997600897000)

2 Rows Returned.

4.由第3點得証, 查詢條件式如果出現多個欄位的查詢條件,只要其中一個有建立index,而且該欄位用"="查詢,其它欄位不建index也可以愛怎麼查就怎麼查。這部份的特性跟GoogleAppEnginge上的Datastore有很大的不同(Datastore裡,多個欄位的查詢條件要能成功,必需手動建立複合Key的索引,很麻煩)。

"據說"(官方並未有文件說明)這類查詢在系統內的運作方式是先找出符合index欄位查詢條件的所有資料,接著用其它欄位的查詢條件做再次的篩選,所以才有辦法進行RangeQuery。但是這種查詢有個問題,那就是index欄位查詢條件查出的資料量不能太巨量,不然每次的查詢對系統而言應該會很吃Memory。

5.column如果建了index,這個column的值不要有太多種,這有它的道理存在。因為值如果太多,會造成index資料檔變的非常巨大。如果將測試資料的age column也建了index,會發現它的index資料檔明顯比dept column的index資料檔要大很多(age的值有10種,1~10;dept的值只有2種,'mis','rd')
employee.employee_age_idx-h-1-Data.db 741bytes
employee.employee_dept_idx-h-1-Data.db 266bytes
我不知道一個查詢條件中同時有好儿個欄位有建index是否對查詢(e.g "get employee where dept = 'rd' and age = 1",dept與age都建立了index)有明顯的加速效果,但是可以很確定系統維護愈多的index,負擔會愈重。所以index適量就好,不是每個重要的欄位都一定要加index,有必要時再加就好。

6.Secondary indexes功能可以讓你輕鬆實現RangeQuery的功能,請多善用這功能,會降低很多人工手動維謢index的負擔。不過...除了RangeQuery的功能外,想有Sorting的功能,Cassandra目前是辦不到的。