Hi guys, first of all many thanks for the replies.

Background:
I am in a scientific competition to try to elucidate the cause or causes of a particular type of cancer. The causes of this type of cancer are already known, but an efficient algorithm that can find them in this smaller example could let us apply the same process to a range of cancers. The first part of the competition is to identify which people in a group have the cancer, based on 120 different genetic markers. Of course I also need to consider combinations of different markers, e.g. app1 with kat2. It is generally accepted that the cause of this cancer is multifactorial, and I expect that the model may identify several different biochemical mechanisms by which the cancer develops.

My previous model:
My previous model was working extremely well, but it was built in Excel with Solver, so it couldn't be scaled up. The model calculated the % of people who had cancer against the number of copies of a gene the patient had, e.g. a patient with 5 copies of the app1 gene could have a 4.7% chance of developing said cancer. In this first model I didn't manage to include every gene before the sheet died (I literally couldn't open it). My method was as follows: perform a 6th order polynomial regression on the data for one gene and calculate the patient's % chance based on that gene alone, then do the same for all of the other genes. I then simply added all of the % chances together, and this gave a fairly accurate model.

I improved this further using Excel's evolutionary algorithm (part of the Solver package). I used the form (x*n)+(y*n)+(z*n)+(xa*n)+(xb*n), where x to xb are the % chances of developing the cancer as given by the polynomial regressions and each n is a number between 0 and 1 that acts as the weighting for that term. This is the part that I would leave running for days in order to optimise the model, with the evolutionary algorithm adjusting each n and therefore the weighting given to each of the regressions.
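To make the method above concrete, here is roughly how I would expect it to translate to numpy. This is only a sketch under assumptions: the function names, the array layout (one row per patient, one column per gene) and the synthetic data are all made up for illustration, and the weights are fixed by hand rather than optimised.

```python
import numpy as np

def fit_gene_polynomials(copies, has_cancer, degree=6):
    """Fit one polynomial per gene: copy number -> observed cancer rate."""
    polys = []
    for j in range(copies.shape[1]):
        # Observed fraction of patients with cancer at each copy number of gene j
        levels = np.unique(copies[:, j])
        rates = np.array([has_cancer[copies[:, j] == v].mean() for v in levels])
        # A degree-d fit needs at least d+1 distinct points, so cap the degree
        deg = min(degree, len(levels) - 1)
        polys.append(np.polynomial.Polynomial.fit(levels, rates, deg))
    return polys

def weighted_score(copies, polys, weights):
    """Weighted sum of the per-gene chances, i.e. (x*n)+(y*n)+(z*n)+..."""
    chances = np.column_stack([p(copies[:, j]) for j, p in enumerate(polys)])
    return chances @ weights

# Synthetic stand-in data (hypothetical): 200 patients, 3 genes, 0-7 copies each
rng = np.random.default_rng(0)
copies = rng.integers(0, 8, size=(200, 3))
has_cancer = (rng.random(200) < 0.05 + 0.05 * copies[:, 0]).astype(float)

polys = fit_gene_polynomials(copies, has_cancer)
weights = np.array([1.0, 0.5, 0.5])  # the n values, hand-picked here
scores = weighted_score(copies, polys, weights)
```

The step that Solver's evolutionary algorithm did for me would be the search over `weights`. I have not tried it, but `scipy.optimize.differential_evolution` looks like one possible replacement for that part.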
With regard to the database querying, I am starting to think that it is much simpler to keep the data in CSV format. I have started looking at numpy and Scikit as two key tools in this process. Any guidance, comments or general advice would be greatly appreciated, since I really, really want this to work.
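For what it's worth, this is the kind of thing I have in mind for the CSV route: read the file with numpy and build the marker combinations as extra columns. It is a minimal sketch, not tested on the real data. The column names `patient`, `gene3` and `cancer` and all of the values are made up, and using the product of two markers to represent a combination is just my assumption about one way to encode it.

```python
import io
from itertools import combinations

import numpy as np

# Stand-in for the real CSV file; names and values are invented for illustration
csv_text = """patient,app1,kat2,gene3,cancer
1,5,0,2,1
2,1,3,0,0
3,0,2,4,0
4,2,2,1,1
"""

# names=True reads the header row and gives access to columns by name
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

marker_names = [n for n in data.dtype.names if n not in ("patient", "cancer")]
X = np.column_stack([data[n] for n in marker_names])  # one column per marker
y = data["cancer"]                                     # 1 = has the cancer

# One extra column per pair of markers (assumption: product of copy numbers)
pair_names = [f"{a}*{b}" for a, b in combinations(marker_names, 2)]
pairs = np.column_stack(
    [data[a] * data[b] for a, b in combinations(marker_names, 2)]
)
X_all = np.hstack([X, pairs])
```

With the real file, `io.StringIO(csv_text)` would be replaced by the file path. With 120 markers this gives 120 * 119 / 2 = 7140 pair columns, which should still be manageable as a plain array.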