Cassandra Secondary indexes心得筆記

2011-11-15

這裡寫的Cassandra Secondary indexes功能是以1.0版為基準所驗証出來的結果。有些部份是以實際結果來反推它實作可能的方式，不知正確性為何。如果錯了，我之後再來修正吧...:P

針對Cassandra的Secondary indexes功能，半官方文件是這麼說的

column上可建立index
建立index的column可以用"="進行查詢(e.g. get employee where dept = 'rd' ，dept的column有建index，所以可以用"="查詢)
在查詢條件式中，最少要一個"="套用在有建立index的column，否則查詢會失敗
column如果被建了index，它的column value不要有太多種可能性(e.g. country這個column name在我的系統中可能的值只'tw','us','jp'；這就很適合拿來建index，因為可能產生的值不多)

看完上述官方的論述後，應該還是搞不懂建了index後有啥好處，覺的官方是在講三小外星語，會覺的index這功能很癈，只能作"="查詢，遜到爆了。

當然...真相一定不是這樣的，不然這功能早就被罵翻天了。

首先，先建立下述這個keyspace、column family，然後建立測試資料

create keyspace testks
    with strategy_options=[{replication_factor:1}]
    and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';

use testks;
 
create column family employee
    with comparator = UTF8Type
    and default_validation_class = UTF8Type
    and column_metadata = [
 {column_name : age,
     validation_class : LongType}, 
 {column_name : name,
     validation_class : UTF8Type}, 
 {column_name : dept,
     validation_class : UTF8Type,
     index_type : KEYS}  
    ];

[default@testks] list employee;
Using default limit of 100
-------------------
RowKey: 3
=> (column=age, value=3, timestamp=1320997600912006)
=> (column=dept, value=rd, timestamp=1320997600913000)
=> (column=name, value=tester3, timestamp=1320997600912005)
-------------------
RowKey: 6
=> (column=age, value=6, timestamp=1320997600913008)
=> (column=dept, value=mis, timestamp=1320997600913009)
=> (column=name, value=tester6, timestamp=1320997600913007)
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)
-------------------
RowKey: 10
=> (column=age, value=10, timestamp=1320997600913020)
=> (column=dept, value=mis, timestamp=1320997600913021)
=> (column=name, value=tester10, timestamp=1320997600913019)
-------------------
RowKey: 8
=> (column=age, value=8, timestamp=1320997600913014)
=> (column=dept, value=mis, timestamp=1320997600913015)
=> (column=name, value=tester8, timestamp=1320997600913013)
-------------------
RowKey: 2
=> (column=age, value=2, timestamp=1320997600912003)
=> (column=dept, value=mis, timestamp=1320997600912004)
=> (column=name, value=tester2, timestamp=1320997600912002)
-------------------
RowKey: 1
=> (column=age, value=1, timestamp=1320997600912000)
=> (column=dept, value=rd, timestamp=1320997600912001)
=> (column=name, value=tester1, timestamp=1320997600897000)
-------------------
RowKey: 9
=> (column=age, value=9, timestamp=1320997600913017)
=> (column=dept, value=rd, timestamp=1320997600913018)
=> (column=name, value=tester9, timestamp=1320997600913016)
-------------------
RowKey: 4
=> (column=age, value=4, timestamp=1320997600913002)
=> (column=dept, value=mis, timestamp=1320997600913003)
=> (column=name, value=tester4, timestamp=1320997600913001)
-------------------
RowKey: 7
=> (column=age, value=7, timestamp=1320997600913011)
=> (column=dept, value=rd, timestamp=1320997600913012)
=> (column=name, value=tester7, timestamp=1320997600913010)

10 Rows Returned.

以下是結合官方跟自己實際試驗過後得到的結論
1.只以一個欄位當查詢條件式，但該欄位沒建立index的話，無法直接用"=、>、<..."查詢

[default@testks] get employee where age = 10;
No indexed columns present in index clause with operator EQ
[default@testks] get employee where age >= 10;
No indexed columns present in index clause with operator EQ
[default@testks] get employee where age <= 10;
No indexed columns present in index clause with operator EQ

2.只以一個欄位當查詢條件式，但該欄位有建立index的話，只能用"="查詢

[default@testks] get employee where dept = 'rd'; 
-------------------
RowKey: 3
=> (column=age, value=3, timestamp=1320997600912006)
=> (column=dept, value=rd, timestamp=1320997600913000)
=> (column=name, value=tester3, timestamp=1320997600912005)
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)
-------------------
RowKey: 1
=> (column=age, value=1, timestamp=1320997600912000)
=> (column=dept, value=rd, timestamp=1320997600912001)
=> (column=name, value=tester1, timestamp=1320997600897000)
-------------------
RowKey: 9
=> (column=age, value=9, timestamp=1320997600913017)
=> (column=dept, value=rd, timestamp=1320997600913018)
=> (column=name, value=tester9, timestamp=1320997600913016)
-------------------
RowKey: 7
=> (column=age, value=7, timestamp=1320997600913011)
=> (column=dept, value=rd, timestamp=1320997600913012)
=> (column=name, value=tester7, timestamp=1320997600913010)

5 Rows Returned.
[default@testks] get employee where dept >= 'rd';
No indexed columns present in index clause with operator EQ
[default@testks] get employee where dept <= 'rd';
No indexed columns present in index clause with operator EQ

3.有多個欄位當查詢條件式；只要其中一個欄位有建立index，而且該欄位在查詢時是以"="條件進行查詢；其它欄位的查詢條件不管怎麼下，都不受限制("=、>、<..."都可以成功執行)。也就是說可以進行大家超想要的RangeQuery(e.g. age > 1 and age < 10)功能

[default@testks] get employee where dept = 'rd' and age >=1 and age <=5 and name = 'tester5';
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)

1 Row Returned.
[default@testks] get employee where age >=1 and age <=5 and name = 'tester5' and dept = 'rd';
-------------------
RowKey: 5
=> (column=age, value=5, timestamp=1320997600913005)
=> (column=dept, value=rd, timestamp=1320997600913006)
=> (column=name, value=tester5, timestamp=1320997600913004)

1 Row Returned.
[default@testks] get employee where dept = 'rd' and age >=1 and age <=3;                     
-------------------
RowKey: 3
=> (column=age, value=3, timestamp=1320997600912006)
=> (column=dept, value=rd, timestamp=1320997600913000)
=> (column=name, value=tester3, timestamp=1320997600912005)
-------------------
RowKey: 1
=> (column=age, value=1, timestamp=1320997600912000)
=> (column=dept, value=rd, timestamp=1320997600912001)
=> (column=name, value=tester1, timestamp=1320997600897000)

2 Rows Returned.

4.由第3點得証，

查詢條件式如果出現多個欄位的查詢條件，只要其中一個有建立index，而且該欄位用"="查詢，其它欄位不建index也可以愛怎麼查就怎麼查

。這部份的特性跟GoogleAppEnginge上的Datastore有很大的不同(Datastore裡，多個欄位的查詢條件要能成功，必需手動建立複合Key的索引，很麻煩)。

"據說"(官方並未有文件說明)這類查詢在系統內的運作方式是先找出符合index欄位查詢條件的所有資料，接著用其它欄位的查詢條件做再次的篩選，所以才有辦法進行RangeQuery。但是這種查詢有個問題，那就是index欄位查詢條件查出的資料量不能太巨量，不然每次的查詢對系統而言應該會很吃Memory。

5.column如果建了index，這個column的值不要有太多種，這有它的道理存在。因為值如果太多，會造成index資料檔變的非常巨大。如果將測試資料的age column也建了index，會發現它的index資料檔明顯比dept column的index資料檔要大很多(age的值有10種，1~10；dept的值只有2種，'mis','rd')

employee.employee_age_idx-h-1-Data.db 741bytes
employee.employee_dept_idx-h-1-Data.db 266bytes

我不知道一個查詢條件中同時有好儿個欄位有建index是否對查詢(e.g "get employee where dept = 'rd' and age = 1"，dept與age都建立了index)有明顯的加速效果，但是可以很確定系統維護愈多的index，負擔會愈重。所以index適量就好，不是每個重要的欄位都一定要加index，有必要時再加就好。

6.Secondary indexes功能可以讓你輕鬆實現RangeQuery的功能，請多善用這功能，會降低很多人工手動維謢index的負擔。不過...除了RangeQuery的功能外，想有Sorting的功能，Cassandra目前是辦不到的。