Alignment of sequences consisting of arbitrary 64 bit integers #3271
Comments
I got it to the point of having a class
But I have so far been unable to get it to pass this one:
I am getting the following error messages (simplified for clarity) :
My
Please help! Perhaps add a complete example to the documentation showing an alignment of sequences consisting of arbitrary integers? |
After adding member function
|
Hey there, Did you find https://docs.seqan.de/seqan3/main_user/howto_write_an_alphabet.html ? From a quick glance, Unlike other unsigned integer types, The header names the first reason: There is no However, The (semi)alphabet problem with Regarding alignment, maybe @rrahn has some idea. |
For my purposes it would be sufficient if we can get this to work for 63-bit integers represented as 64-bit integers with the most significant bit set to zero. The size of this alphabet is 2^63 and so is representable as a 64-bit integer. I did peruse the portion of the documentation you linked to as well as the API reference documentation. |
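To make the 63-bit idea concrete, here is a minimal sketch in plain C++ of a value type whose size constant (2^63) still fits in a 64-bit integer. It does not include any SeqAn3 headers and is not claimed to satisfy the SeqAn3 alphabet concepts. The class name `marker63` is hypothetical, and the member names `alphabet_size`, `to_rank` and `assign_rank` are only assumed to mirror the naming used in the custom-alphabet how-to.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper (not a SeqAn3 type): stores a 63-bit value in a
// 64-bit integer whose most significant bit is always zero.
class marker63
{
public:
    // 2^63 distinct values. This count still fits in a uint64_t, whereas
    // the count for full 64-bit integers (2^64) would not.
    static constexpr std::uint64_t alphabet_size = std::uint64_t{1} << 63;

    constexpr marker63() noexcept = default;

    // The rank is simply the stored integer.
    constexpr std::uint64_t to_rank() const noexcept { return rank_; }

    // Assigning a rank outside [0, 2^63) is a precondition violation.
    marker63 & assign_rank(std::uint64_t const rank) noexcept
    {
        assert(rank < alphabet_size);
        rank_ = rank;
        return *this;
    }

    // Comparisons are delegated to the underlying integer.
    friend constexpr bool operator==(marker63 a, marker63 b) noexcept { return a.rank_ == b.rank_; }
    friend constexpr bool operator!=(marker63 a, marker63 b) noexcept { return a.rank_ != b.rank_; }
    friend constexpr bool operator<(marker63 a, marker63 b) noexcept { return a.rank_ < b.rank_; }

private:
    std::uint64_t rank_{0};
};
```

The point of the sketch is only the arithmetic: restricting values to 63 bits keeps the alphabet size representable in the same integer type as the ranks.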
Ok, so now I have an alphabet class that can represent all 63-bit integers and passes all these four static assertions:
However, when I try to compute an alignment with this, the compiler complains about the scoring scheme:
I am using |
Hi @paoloshasta, any class modelling the seqan3::scoring_scheme_for concept should do it. |
Thank you, that got me one step ahead. I have a scoring scheme that passes this assertion, and with that, the problem is gone.
But now I bump into a new problem:
Do I also need to define a new gap class? I looked at the documentation under |
I could not find anything in the documentation about |
I don't think it's currently possible to define an alphabet that would work. Putting the problems with defining
So it seems like the alignment only handles small alphabets (up to Regarding your use case, from what I understand, you take a random set of k-mers, each of which gets a rank. For 8000 k-mers that results in an alphabet of size 8000. Does that sound about right? Edit: I have one question though. |
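The rank idea mentioned above can be sketched as follows: collect the distinct symbols that actually occur, sort them, and use each symbol's index in the sorted table as its rank. This is only an illustration of the general technique in plain C++. The helper names `build_symbol_table` and `rank_of` are hypothetical and are not part of SeqAn3 or Shasta.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the sorted list of distinct symbols.
// The index of a symbol in this table serves as its dense rank.
std::vector<std::uint64_t> build_symbol_table(std::vector<std::uint64_t> symbols)
{
    std::sort(symbols.begin(), symbols.end());
    symbols.erase(std::unique(symbols.begin(), symbols.end()), symbols.end());
    return symbols;
}

// Hypothetical helper: looks up the rank of a symbol by binary search.
// The symbol must be present in the table.
std::size_t rank_of(std::vector<std::uint64_t> const & table, std::uint64_t const symbol)
{
    auto const it = std::lower_bound(table.begin(), table.end(), symbol);
    assert(it != table.end() && *it == symbol);
    return static_cast<std::size_t>(it - table.begin());
}
```

Converting a rank back to its symbol is then `table[rank]`. Both directions need the table in memory, so the approach only pays off when the number of distinct symbols is small.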
That description of a typical 8000-symbol alphabet is obsolete and refers to sequencing technology from 5 years ago. Today, the reads are much more accurate and therefore the markers are much longer. They are typically 30 bases in length and without RLE (Run-Length Encoding), which means that the total number of possible marker k-mers is 4^30 = 2^60. And they will probably need to be longer in the future, which means that I will probably have to switch to 128-bit integers soon.

But even if I consider only the distinct marker k-mers that appear in a given assembly, as you suggest, their total number is still very high, high enough that it is not feasible to store a lookup table. In addition, using a lookup table, even if it were possible, would incur a cache miss penalty every time a marker k-mer has to be converted to sequence. With the current approach, that operation is trivial and requires just a few bit operations, which are very fast.

Regarding your question: when a read is converted to a marker representation, it becomes a sequence of integers (60-bit integers if using marker length 30). Later, when I want to align two reads, instead of aligning their base sequences I align their marker k-mers. Those sequences are typically 20 times shorter, so the alignment is much faster. In addition, the alphabet is much larger than the original 4-symbol alphabet, which means that identical symbols rarely appear in the two reads outside of their actual alignment. This reduces alignment uncertainty, especially in repeat-rich regions.

It is unfortunate that Seqan3 is not able to handle such a large alphabet. I had no problem doing this with Seqan2: I just add 100 to the integers being aligned, so there is no possibility of collision with the value (45) used by Seqan2 to represent gaps. And there is no possibility of overflow if all of my integers are actually 60 bits long.
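The arithmetic above can be illustrated with a short sketch: 30 bases at 2 bits each fill exactly 60 bits, decoding needs only shifts and masks, and adding 100 keeps every value away from the gap value 45 without overflowing 64 bits. The 2-bit code (A=0, C=1, G=2, T=3) and all function names are assumptions made for illustration. They are not taken from Shasta or SeqAn2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed 2-bit code for illustration only: A=0, C=1, G=2, T=3.
inline std::uint64_t base_to_code(char const base)
{
    switch (base)
    {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    }
    assert(false && "unexpected base");
    return 0;
}

// Packs a k-mer (k <= 32) into an integer, first base in the highest bits.
std::uint64_t pack_kmer(std::string const & kmer)
{
    assert(kmer.size() <= 32);
    std::uint64_t value = 0;
    for (char const base : kmer)
        value = (value << 2) | base_to_code(base);
    return value;
}

// Recovers the k-mer using only shifts and masks, no lookup of stored k-mers.
std::string unpack_kmer(std::uint64_t value, std::size_t const k)
{
    static constexpr char bases[] = {'A', 'C', 'G', 'T'};
    std::string kmer(k, 'A');
    for (std::size_t i = 0; i < k; ++i)
    {
        kmer[k - 1 - i] = bases[value & 3u];
        value >>= 2;
    }
    return kmer;
}

// The offset workaround described above: shifting every marker by 100
// keeps it clear of the value 45 that represents a gap.
constexpr std::uint64_t gap_safe_offset = 100;

inline std::uint64_t to_gap_safe_value(std::uint64_t const marker)
{
    return marker + gap_safe_offset;
}
```

With k = 30 the largest packed value is 2^60 - 1, so adding 100 stays far below 2^64.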
However, I recently discovered bugs in Seqan2 banded alignments, which cannot be fixed because that code is no longer being maintained. Hence my decision to explore Seqan3. I may have to look for another solution or write some custom code. |
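For reference, the custom-code route can be quite small when only equality between symbols matters. The sketch below is a plain, unbanded Needleman-Wunsch global alignment score over sequences of 64-bit integers with a linear gap cost. It is not SeqAn code and does not reproduce the banded alignment mentioned above. The function name and the default scores are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: global alignment score of two integer sequences
// (unbanded Needleman-Wunsch, linear gap cost, two rolling rows).
int global_alignment_score(std::vector<std::uint64_t> const & a,
                           std::vector<std::uint64_t> const & b,
                           int const match = 1,
                           int const mismatch = -1,
                           int const gap = -1)
{
    assert(gap <= 0 && mismatch <= match);

    std::vector<int> prev(b.size() + 1);
    std::vector<int> curr(b.size() + 1);

    // First row: aligning an empty prefix of a against j symbols of b.
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<int>(j) * gap;

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = static_cast<int>(i) * gap;
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            // Symbols are compared by equality only; no alphabet size is needed.
            int const diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int const up = prev[j] + gap;
            int const left = curr[j - 1] + gap;
            curr[j] = std::max({diag, up, left});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

Because the recurrence only compares symbols for equality, the full 64-bit range can be used without reserving a value for gaps. Time is proportional to the product of the two sequence lengths, so long inputs would need banding or a similar restriction.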
Thanks for clarifying!
Well, it wouldn't really work (well) anyway for any alphabet with more than 256 letters.
Yes, that's another reason. It's only really designed for small alphabets.
There will be some modularization in the future, including, as far as I know, alphabet-free alignment.
We do maintain Seqan2. We do bug fixes and newer compiler support. However, we don't develop new things for Seqan2 (feature freeze). |
Great, I will do that. Please confirm that this is the correct repository: |
It is |
Great. I will soon file an issue on that repository, attaching a simple test program that reproduces the problem. Thank you for your prompt and competent help on this. |
I am trying to use Seqan3 to align two sequences consisting of arbitrary 64-bit integers. That is, the alphabet consists of all 2^64 possible 64-bit integers. I was able to do this without problems in Seqan2, but after digging through the Seqan3 documentation I could not find out how to do it. Can you point me to an example or give me some ideas on how to do this? Your seqan3::custom::alphabet does not seem to do the job, and it does not support 64-bit integers anyway. I know I somehow need to define a new alphabet, but the documentation on how to do this is a bit opaque and seems to be geared towards small alphabets.

This is needed for the Shasta assembler, in which each read is represented as a sequence of markers. See here for more information.