yasserg · Jul 8, 2015
diff --git a/README.md b/README.md
@@ -0,0 +1,97 @@
+jforests is a Java library that implements many tree-based learning algorithms.
+
+jforests can be used for regression, classification, and ranking problems. The latest release can be downloaded from https://github.com/yasserg/jforests/releases
+
+The following tutorial shows how jforests can be used for learning a ranking model using the LambdaMART algorithm.
+
+#Learning to Rank with LambdaMART
+
+##Data Sets Format
+jforests uses the following format for its input data sets (same as the one used in SVMLight):
+
+```
+<line> .=. <relevance> qid:<qid> <feature>:<value> ... <feature>:<value> 
+<relevance> .=. <integer>
+<qid> .=. <positive integer>
+<feature> .=. <positive integer>
+<value> .=. <float>
+```
+
+For this tutorial, we will use the sample data set which is available <a href="https://github.com/yasserg/jforests/blob/master/jforests/src/main/resources/sample-ranking-data.zip">here</a>.
+
+
+##Converting Data Sets to Binary Format
+To speed up computation, jforests converts its input data sets to a binary format. The steps below assume that you have unzipped the sample data set into a folder and that this folder is your current working directory. They also assume that you have <a href="https://github.com/yasserg/jforests/releases">downloaded</a> the latest jforests jar file, renamed it to 'jforests.jar', and placed it in the same folder.
+
+The following command can be used for converting data sets to binary format:
+
+	java -jar jforests.jar --cmd=generate-bin --ranking --folder . --file train.txt --file valid.txt --file test.txt
+
+As this command shows, 'train.txt', 'valid.txt', and 'test.txt' are converted to binary format, producing 'train.bin', 'valid.bin', and 'test.bin'.
+
+##Learning the Ranking Model
+Once the input data sets are converted to the binary format, a ranking model can be trained on them.
+
+First you need to specify the parameters of your machine learning algorithm. The following is a sample set of parameters for the LambdaMART algorithm:
+
+```
+trees.num-leaves=7
+trees.min-instance-percentage-per-leaf=0.25
+boosting.learning-rate=0.05
+boosting.sub-sampling=0.3
+trees.feature-sampling=0.3
+
+boosting.num-trees=2000
+learning.algorithm=LambdaMART-RegressionTree
+learning.evaluation-metric=NDCG
+
+params.print-intermediate-valid-measurements=true
+```
+
+Create a 'ranking.properties' file in the current folder and save the above config in it.
+
+Then the following command can be used for training a LambdaMART ensemble and storing it in the 'ensemble.txt' file:
+
+	java -jar jforests.jar --cmd=train --ranking --config-file ranking.properties --train-file train.bin --validation-file valid.bin --output-model ensemble.txt
+
+##Predicting Scores of Documents
+Once you have the LambdaMART ensemble, you can use it to predict scores for test documents. The following command performs this step and stores the results in the 'predictions.txt' file:
+
+	java -jar jforests.jar --cmd=predict --ranking --model-file ensemble.txt --tree-type RegressionTree --test-file test.bin --output-file predictions.txt
+
+The predicted scores can then be used to compute NDCG or other information retrieval measures.
+
+## Advanced Ranking Options
+
+Jforests lets you change the measure used by LambdaMART through the `learning.evaluation-metric` entry in the `ranking.properties` file. Currently, NDCG is supported, as well as risk-sensitive evaluation measures such as URisk and TRisk; see <a href="RiskSensitiveLambdaMART.md">RiskSensitiveLambdaMART</a>.
+
+#Source Code
+Source code is available from the GitHub repository: https://github.com/yasserg/jforests
+
+#Citation Policy
+If you use jforests for research, please cite the following paper:
+
+Y. Ganjisaffar, R. Caruana, C. Lopes, *Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models*, in SIGIR 2011, Beijing, China.
+
+Bibtex:
+```
+@inproceedings{Ganji:2011:SIGIR,
+	author = {Yasser Ganjisaffar and Rich Caruana and Cristina Lopes},
+	title = {Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models},
+	booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
+	series = {SIGIR '11},
+	year = {2011},
+	isbn = {978-1-4503-0757-4},
+	location = {Beijing, China},
+	pages = {85--94},
+	numpages = {10},
+	doi = {http://doi.acm.org/10.1145/2009916.2009932},
+	acmid = {2009932},
+	publisher = {ACM},
+	address = {New York, NY, USA},
+}
+```
+
+If you use risk-sensitive learning to rank, please see <a href="RiskSensitiveLambdaMART.md">RiskSensitiveLambdaMART</a> for citation information.
+
+
diff --git a/RiskSensitiveLambdaMART.md b/RiskSensitiveLambdaMART.md
@@ -0,0 +1,37 @@
+#RiskSensitiveLambdaMART.md
+##Introduction
+
+The aim of risk-sensitive retrieval is to reduce the _risk_ that the system could perform worse than a baseline (e.g. BM25) for a given query. 
+
+Wang et al. (1) proposed that LambdaMART could be made more robust, or risk-sensitive, by adapting its loss function. In particular, their URisk measure, which weights decreases in effectiveness (e.g. in terms of NDCG) more strongly than gains, can be integrated into LambdaMART. The parameter alpha > 0 defines how much emphasis down-side risk (losses compared to the baseline) receives relative to gains.
+
+Dincer et al. (2) proposed new _adaptive_ risk-sensitive measures for LambdaMART based on the t-test statistic, namely SARO and FARO. Semi-Adaptive Risk-sensitive Optimisation (SARO) weights the alpha value more strongly for topics that exhibit significant down-side risk. In Fully-Adaptive Risk-sensitive Optimisation (FARO), by contrast, the alpha value for each topic is varied according to the amount of risk (down-side or up-side) observed for that topic.
+
+As of version 0.5, Jforests supports URisk, SARO & FARO.
+
+#Usage
+
+*NB:* If you use the risk-sensitive evaluation in Jforests, please cite (2): B T Dincer, C Macdonald, and I Ounis. Hypothesis Testing for Risk-Sensitive Evaluation of Retrieval Systems. In Proceedings of SIGIR 2014.
+
+To deploy URisk with alpha=1, based on the NDCG evaluation measure, the `ranking.properties` file should be altered as follows:
+```
+#learning.evaluation-metric=NDCG
+learning.evaluation-metric=URiskAwareEval:1:NDCG
+```
+
+To deploy SARO:
+```
+#learning.evaluation-metric=NDCG
+learning.evaluation-metric=TRiskAwareSAROEval:1:NDCG
+```
+
+To deploy FARO:
+```
+#learning.evaluation-metric=NDCG
+learning.evaluation-metric=TRiskAwareFAROEval:1:NDCG
+```
+
+#References
+(1) L Wang, P N Bennett, and K Collins-Thompson. Robust ranking models via risk-sensitive optimization. In Proceedings of SIGIR 2012. http://doi.acm.org/10.1145/2348283.2348385
+
+(2) B T Dincer, C Macdonald, and I Ounis. Hypothesis Testing for Risk-Sensitive Evaluation of Retrieval Systems. In Proceedings of SIGIR 2014.